More on Search Engines and Crawlers
Bill Slawski, September 13, 2012
On Monday night, I had the chance to give a presentation for the Agile SEO Meetup at the Webimax headquarters in Mt. Laurel, New Jersey. There was a nice turnout, including many Maxers* who started their day early, and stayed late or returned for the presentation. It takes a team to make a meetup work, and I wanted to express my thanks to the Maxers who set up the presentation equipment and the live-streamed webinar, who set up tables and chairs and signs, who made sure we had something to drink, and who helped promote the event. I'm not going to mention names, because I'm sure I'd leave someone out, but you know who you are.
The topic is one dear to my heart, since many of the issues I often see on websites involve how pages of a site might be crawled by search engines. The title for my presentation was, "Everything you wanted to know about crawling, but didn't know where to ask (Including Importance Metrics and Link Merging)". OK, I got carried away with the title, but it seemed to fit what I wanted to talk about.
Here are the slides from the presentation:
I wanted to share some behind-the-scenes thoughts about the presentation with this post.
I mention the robots mailing list early on, and one of the things that amazes me is the role of the then very young Martijn Koster in spearheading something like the Robots Exclusion Standard and the robots.txt file that we all know and love. If you look at my slide that shows a Usenet message from him, you'll see he contacted an interesting group of people to work on some way of lessening the impact of crawling programs on Web pages. They include, among others, Jonathan Fletcher, who invented Jumpstation, one of the earliest modern search engines. Another name on that list is Guido van Rossum, who invented the programming language Python, and who presently works for Google.
I didn't include a link to the robots.txt pages in the presentation, nor to the specification that Google follows with robots.txt. Bing also includes a lot of information about their implementation of robots.txt for Web pages. Incidentally, here's Bing's robots.txt file, Google's robots.txt file, and Yahoo's robots.txt file if you're curious about how they are doing it. Why is Yahoo's so much simpler and shorter?
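If you'd like to see how a crawler might read one of those files, here is a minimal sketch using Python's standard-library robots.txt parser. The rules, the user-agent names ("somebot", "examplebot"), and the example.com URLs are all made up for illustration; they are not taken from any search engine's actual file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, invented for this example.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5

User-agent: examplebot
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt text that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic crawler falls under the "*" record.
print(parser.can_fetch("somebot", "http://example.com/index.html"))
print(parser.can_fetch("somebot", "http://example.com/private/a.html"))

# "examplebot" has its own record, which shuts it out entirely.
print(parser.can_fetch("examplebot", "http://example.com/index.html"))

# Crawl-delay is a non-standard directive that some crawlers honor.
print(parser.crawl_delay("somebot"))
```

In a real crawler you would point the parser at a live file with set_url() and read() instead of parsing a string, and you would check can_fetch() before every request.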
OK, I'm not sure how many of you knew that Google was granted a patent on a politeness protocol for robots visiting web pages so that they wouldn't overwhelm the servers hosting them (like a distributed denial of service attack), but it surprised me when I first came across it. Incidentally, the picture I included of sliced wires was one my webhost included on a page apologizing for server downtime when the connection to their data center had an incident with a backhoe. They've since added an alternative set of lines to reach the world with in case such a catastrophe happens again.
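To make the general idea of crawler politeness concrete, here is a small sketch of a per-host delay. This is not the method described in Google's patent; it is just the simplest version of the idea, and the class name, the two-second default, and the example.com URLs are my own assumptions.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks the earliest time each host may be fetched again."""

    def __init__(self, min_delay_seconds: float = 2.0, clock=time.monotonic):
        self.min_delay = min_delay_seconds
        self.clock = clock  # injectable so the logic can be tested without sleeping
        self._next_allowed = {}  # host -> earliest time of the next fetch

    def wait_time(self, url: str) -> float:
        """Seconds a crawler should still wait before fetching this URL."""
        host = urlsplit(url).netloc.lower()
        now = self.clock()
        return max(0.0, self._next_allowed.get(host, now) - now)

    def record_fetch(self, url: str) -> None:
        """Note that the URL's host was just fetched."""
        host = urlsplit(url).netloc.lower()
        self._next_allowed[host] = self.clock() + self.min_delay


gate = PolitenessGate(min_delay_seconds=2.0)
gate.record_fetch("http://example.com/a")
# A second page on the same host has to wait; a different host does not.
print(gate.wait_time("http://example.com/b") > 0)
print(gate.wait_time("http://other.example/"))
```

A crawler would call wait_time() before each request, sleep for that long (or work on another host in the meantime), and call record_fetch() afterwards.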
I have a slide showing Lawrence Page and a paper that he co-authored about web crawling and importance metrics (pdf) associated with it. At one point in time, there was a page on the Stanford.edu website that listed about 10 or so papers describing technology that Google was based upon, and this was one of the papers included. That page no longer seems to exist, and I don't remember the URL. I should have saved a screenshot of it when it was around, but it's too late to do that. If you're interested in some of the approaches that Google follows when crawling the Web, it's a good starting point. You can see from it that Google would rather index a million home pages than a million pages on one site.
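As a rough illustration of what ordering a crawl by importance can look like, here is a sketch that visits URLs with the most known backlinks first and breaks ties in favor of shallower URLs such as home pages. The scoring rule, the function names, and the sample counts are assumptions of mine for illustration; they are not the formulas from the paper.

```python
import heapq
from urllib.parse import urlsplit


def url_depth(url: str) -> int:
    """Number of non-empty path segments; a home page has depth 0."""
    path = urlsplit(url).path
    return len([segment for segment in path.split("/") if segment])


def crawl_order(backlinks: dict):
    """Yield URLs from most to least 'important'.

    backlinks maps each known URL to the number of inbound links seen so far.
    """
    heap = []
    for url, count in backlinks.items():
        # heapq is a min-heap, so negate the count to pop the largest first;
        # depth is the tie-breaker, so shallower URLs come out earlier.
        heapq.heappush(heap, (-count, url_depth(url), url))
    while heap:
        _neg_count, _depth, url = heapq.heappop(heap)
        yield url


# Hypothetical backlink counts.
known = {
    "http://a.example/": 3,
    "http://a.example/x/y/z.html": 3,
    "http://b.example/": 10,
    "http://a.example/deep/page.html": 1,
}
for url in crawl_order(known):
    print(url)
```

With these made-up numbers, both home pages are visited before either of the deeper pages, which is the flavor of the "million home pages" preference mentioned above.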
The patent I pointed to on Anchor tag indexing in a web crawler system adds some additional information about how Google crawls pages. That one was originally filed in 2003, so things have likely changed considerably.
My "subliminal advertising" slide didn't elicit any laughter, but I did see a couple of smiles.
Google's webmaster tools do stress that a web site owner should build pages that would work well in an early browser like Lynx, but there are a lot of hints that Google has the capacity to view pages with much more sophisticated browsers. The question, though, is whether they do that when it's potentially very expensive from a computational standpoint.
The link merging slide contains information from a Microsoft patent that describes how they might be performing a Web Site Structure Analysis. What really interested me was how they might merge links they find on pages, such as links at the top of a page in a main navigation, and links at the bottoms of pages in a footer navigation. What kinds of implications might this have for pages that are linked to multiple times on the same page?
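One simple way to picture link merging is to group every link on a page by the URL it resolves to, so that a header link and a footer link to the same page collapse into a single entry. The sketch below does only that; it is my own guess at the general idea rather than the method in the Microsoft patent, and the function name and sample links are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def merge_links(page_url: str, links):
    """Group links by resolved target URL.

    links is an iterable of (href, anchor_text) pairs in page order.
    Returns a dict mapping each target URL to its distinct anchor texts.
    """
    merged = {}
    for href, text in links:
        # Resolve relative hrefs against the page, then drop any #fragment
        # so that /about and /about#team count as the same target.
        target, _fragment = urldefrag(urljoin(page_url, href))
        texts = merged.setdefault(target, [])
        if text not in texts:
            texts.append(text)
    return merged


# Hypothetical links: main navigation, body copy, then footer.
page_links = [
    ("/about", "About Us"),
    ("widgets.html", "Widgets"),
    ("http://example.com/about#team", "About"),
    ("/about", "About Us"),
]
result = merge_links("http://example.com/products/", page_links)
for target, texts in result.items():
    print(target, texts)
```

Here the three links to the about page end up as one entry with two distinct anchor texts, which is one way a search engine could avoid counting the same link several times.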
The remainder of the presentation points to some features introduced by the search engines that webmasters can use to try to help the search engines better understand the structures of their sites, such as canonical link elements, hreflang annotations, "prev" and "next" link elements, and XML sitemaps. They can be helpful if used correctly.
My next-to-last slide includes a link to a Yahoo patent filing that tells us that they would consider looking at links in social media signals to find pages discussing hot topics, and provide answers to very recency-sensitive queries.
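For a quick check of which of these hints a page actually declares, here is a short sketch that pulls them out of a page's HTML with Python's built-in parser. The sample markup and example.com URLs are invented, and the sketch only reads link elements; it does not look at XML sitemaps or HTTP headers.

```python
from html.parser import HTMLParser

# Hypothetical head section for page 2 of a paginated series.
SAMPLE_HTML = """
<head>
<link rel="canonical" href="http://example.com/widgets?page=2">
<link rel="prev" href="http://example.com/widgets?page=1">
<link rel="next" href="http://example.com/widgets?page=3">
<link rel="alternate" hreflang="de" href="http://example.com/de/widgets">
<link rel="stylesheet" href="/style.css">
</head>
"""


class LinkHintParser(HTMLParser):
    """Collects canonical, prev, next, and hreflang link elements."""

    WANTED = {"canonical", "prev", "next", "alternate"}

    def __init__(self):
        super().__init__()
        self.hints = []  # list of (rel, href, hreflang) tuples

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = set((attr.get("rel") or "").lower().split())
        for rel in sorted(rels & self.WANTED):
            # rel="alternate" is also used for feeds and the like;
            # keep it only when it carries an hreflang attribute.
            if rel == "alternate" and "hreflang" not in attr:
                continue
            self.hints.append((rel, attr.get("href"), attr.get("hreflang")))


def extract_link_hints(html: str):
    parser = LinkHintParser()
    parser.feed(html)
    return parser.hints


for hint in extract_link_hints(SAMPLE_HTML):
    print(hint)
```

Running something like this across a site is an easy way to spot pages whose canonical, pagination, or hreflang hints are missing or point somewhere unexpected.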
Not included in the slides was a mention I made of a very recent paper that describes how a focused crawl of web pages that also looks at text around links might help identify some of the sentiment about those links. The paper is Sentimental Spidering: Leveraging Opinion Information in Focused Crawlers. One of the authors of that paper has since moved on to Google, and brought some of his expertise on that topic with him.
Chris Countey also gave a presentation on Monday about some of the trusted sources that he looks to in keeping up with SEO, and he's going to be posting about his presentation sometime soon, so keep an eye out for that.
* Added 2012-09-13 at 1:13 pm (eastern) - Maxers is my name for the team at Webimax, and one that I coined while writing this post. Any resemblance to the term mozzers to stand for the people at SEOmoz is purely intentional.