When I assess a website for technical issues, one of my first stopping points is robots.txt, a text file in the root directory of the site.

Some sites don’t have a robots.txt file, and some don’t necessarily need one. But a dynamic site with endless loops that a search engine spider may get lost within should have a robots.txt file, with a disallow statement that keeps the spidering programs from trying to index those pages.

A site that republishes the same content under different URLs, such as alternative print versions of pages, should also consider disallowing those pages.
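As a rough illustration of the disallow statements described above, here is a minimal sketch using Python's standard `urllib.robotparser` module. The `/calendar/` and `/print/` directories, the `www.example.com` host, and the `ExampleBot` user agent are all made up for the example; a real site would substitute its own looping and duplicate-content paths.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /calendar/ stands in for an endlessly looping
# dynamic section, /print/ for alternative print versions of pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /calendar/
Disallow: /print/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text that is already in memory (no network fetch)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Ask whether a well-behaved crawler (hypothetical name) may fetch each path.
for path in ("/articles/robots-study.html",
             "/print/robots-study.html",
             "/calendar/2006/12/"):
    allowed = parser.can_fetch("ExampleBot", "https://www.example.com" + path)
    print(path, "allowed" if allowed else "disallowed")
```

With these rules the article page is reported as allowed, while the print copy and the calendar page are reported as disallowed. Individual search engines may interpret some directives differently from this parser, so treat it as a quick check rather than a guarantee.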

An error in a robots.txt file can have some serious implications for the indexing of a web site.
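One well-known example of such an error, offered here as an illustration rather than something taken from the study, is the difference between an empty `Disallow:` line (nothing is blocked) and `Disallow: /` (the whole site is blocked). The sketch below compares the two with Python's standard `urllib.robotparser`; the paths, host, and `ExampleBot` user agent are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocked_paths(robots_txt, paths, agent="ExampleBot"):
    """Return the paths that the given robots.txt text disallows for agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [
        path for path in paths
        if not parser.can_fetch(agent, "https://www.example.com" + path)
    ]


# An empty Disallow value blocks nothing.
INTENDED = "User-agent: *\nDisallow:\n"
# A single added slash blocks every URL on the site.
MISTAKE = "User-agent: *\nDisallow: /\n"

# Hypothetical pages the site owner wants indexed.
PATHS = ["/", "/products/", "/about.html"]

print("intended file blocks:", blocked_paths(INTENDED, PATHS))
print("mistaken file blocks:", blocked_paths(MISTAKE, PATHS))
```

The intended file blocks none of the sample paths, while the mistaken one blocks all three, which is why a robots.txt file is worth rechecking after every edit.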

A failure to have a robots.txt file, when one could be helpful, may mean that a site has its internal link equity distributed poorly. Some of the important pages of a site also might not get indexed, while less meaningful pages are.

A new paper, A Large-Scale Study of Robots.txt, describes one of the first detailed reviews of the usage of robots.txt files. It’s an interesting look at one of the most important pages on a Web site.

Combine it with a recent post at Search Engine Land, Up Close & Personal With Robots.txt, and you’ll come to the conclusion that not enough people are using a robots.txt page on their web sites, and of those using it, many are using it incorrectly or have problems with the ways that they have it set up.

The paper involved five crawls of websites between Dec. 2005 and Oct. 2006 to view the robots.txt files for those sites. The sites were chosen from the Open Directory project, covering education, news, and government sites, and from a Fortune Top 1000 Company List for business sites. In total, 7,593 sites were reviewed. Here’s the breakdown:

  • 600 government websites,
  • 2,047 newspaper websites,
  • 1,487 USA university websites,
  • 1,420 European university websites,
  • 1,039 Asian university websites, and,
  • 1,000 company websites

The study looks at the growth of the use of robots.txt files over that period, which kinds of sites robots.txt files are more likely to be found on, which kinds of sites have the longest robots.txt files (government sites), mistakes in robots.txt files, and some other interesting stats.

The conclusion that the authors of the study came to was that the use of robots.txt should probably be replaced with a “better specified, official standard.”

Until that happens, it pays to know how to use a robots.txt file when you need to have one.

One of the best sources of information about robots.txt files is The Web Robots Pages, which contain robots.txt examples and a lot of information about the robots exclusion protocol.

The major search engines follow the rules set in the protocol, but each has made some additions. Here are links to pages where each search engine describes what it may look for in a robots.txt file:

Google: Webmaster Help Center – How Google crawls my site

Yahoo! Search > Yahoo! Search Help > Yahoo!’s Web Crawler > How do I prevent my site or certain subdirectories from being crawled?

Bing – Control which pages of your website are indexed

Ask.com – Web Search Help