Using Rare Words to Estimate Search Engine Index Sizes

Can looking at how many times rare words appear in a search engine index give us an idea of the size of the database for that search engine?

About a week ago, I wrote about some of the most common English words in the indexes for Google, Yahoo, Bing, Ask, and Google Caffeine. I took a look at 50 words that are amongst the most frequently appearing words in English, and estimates from those search engines about the number of times that those words showed up.

Comparing the number of results between the different search engines for those common words didn’t tell us anything about the relative sizes of the indexes for those search engines for many reasons.

One is that the numbers of results shown are rough estimates only. It’s also possible that the way that estimates are calculated differs considerably from one search engine to another. Some of the pages listed among those results are likely to be duplicate pages at different URLs, or may contain misspellings of the words. Some of the words may be abbreviations or acronyms as well (such as “it,” which is an abbreviation for information technology).

Some pages also show up as relevant for a particular search query without actually including that term on the page itself. For example, the Adobe Reader download page has ranked at the top of search results for the term “click here” on Google for years, without that phrase appearing on that page. Enough links pointing to the page use those words as anchor text for the page to show up in search results for the term.

As I noted in that post, it might be possible to get a more realistic look at the relative sizes of search engine indexes by looking at the number of search results for rare terms, rather than looking at the most frequently appearing words. Cuil’s CEO and founder, Tom Costello, recently described using that technique in a post about Bing on his blog (no longer available), to tell us that “Bing is now around 20% the size of Google.”

I don’t have access to an advanced web crawler like the CEO of Cuil might, to identify many “rare” terms. I’m also using a tiny sample size, but I wanted to take a look at a few “scarce” English words, to see how frequently they appeared on the search engines.

I identified many English words that appear in fewer than 1,000 search results at Google Caffeine, Google, Yahoo, Bing, Ask, and Cuil by looking at the Phrontistery’s Compendium of Lost Words, and doing searches for those terms. Since these search engines will only show the first 1,000 results for a query, staying under that limit makes it possible to see all the URLs for the terms, to use actual numbers rather than estimates, and to see if the words appear on the pages listed. If I had a much larger sample size, I would feel comfortable in saying that the following table gives us a much better idea of the relative sizes of the indexes for the search engines that I’ve included.

Here are some infrequent English words, and the number of times that they appear in search results at different search engines (not counting duplicate pages and “substantially similar” results).

Query	Google Caffeine	Google	Yahoo	Bing	Ask	Cuil
archiloquy	67	69	25	14	11	24
exipotic	54	56	22	10	8	16
historiaster	82	82	27	28	15	22
irredivivous	42	43	14	7	8	9
keleusmatically	59	60	20	13	3	10
melanochalcographer	13	15	6	6	7	10
phylactology	58	58	25	17	11	10
stibogram	14	15	8	6	4	9
tussicate	36	37	15	13	12	11
vicambulate	144	128	41	21	12	31
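
One rough way to read the table is to add up the counts for each search engine and compare each total to Google's. The short Python sketch below does that with the numbers copied from the table. The function names (totals_by_engine, relative_to) are only illustrative, and using Google as the baseline is my own assumption here, not a description of how Tom Costello made his comparison.

```python
# Counts copied from the table above; the column order matches the table header.
ENGINES = ["Google Caffeine", "Google", "Yahoo", "Bing", "Ask", "Cuil"]

COUNTS = {
    "archiloquy":          [67, 69, 25, 14, 11, 24],
    "exipotic":            [54, 56, 22, 10, 8, 16],
    "historiaster":        [82, 82, 27, 28, 15, 22],
    "irredivivous":        [42, 43, 14, 7, 8, 9],
    "keleusmatically":     [59, 60, 20, 13, 3, 10],
    "melanochalcographer": [13, 15, 6, 6, 7, 10],
    "phylactology":        [58, 58, 25, 17, 11, 10],
    "stibogram":           [14, 15, 8, 6, 4, 9],
    "tussicate":           [36, 37, 15, 13, 12, 11],
    "vicambulate":         [144, 128, 41, 21, 12, 31],
}


def totals_by_engine(counts, engines):
    """Sum the result counts for every query, per search engine."""
    totals = dict.fromkeys(engines, 0)
    for row in counts.values():
        for engine, n in zip(engines, row):
            totals[engine] += n
    return totals


def relative_to(totals, baseline):
    """Express each engine's total as a fraction of the baseline engine's total."""
    base = totals[baseline]
    return {engine: total / base for engine, total in totals.items()}


totals = totals_by_engine(COUNTS, ENGINES)
ratios = relative_to(totals, "Google")  # Google as baseline is an assumption

for engine in ENGINES:
    print(f"{engine:16} {totals[engine]:4d} {ratios[engine]:6.1%}")
```

With only ten words, the ratios that come out of this are a rough indication at best, and a single word with an unusually high count (such as vicambulate) can move the totals noticeably.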