Which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered about for a while.

Google recently released a beta version of its upcoming update, code-named Caffeine, for people to experiment with, so I thought I would run a few comparisons.

I found a few lists of the most common words in the English language and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.

I’m not sure how informative this is, even after looking at it. It’s not a very scientific test, either. There are a few reasons for that:

One of them is that when you search at one of the search engines, you’ll see a message that says something like:

Results 1 – 100 of about xxx,xxx,xxx for [query term]

From at least one previous Google patent filing, we can guess that the total number of results listed (xxx,xxx,xxx) is likely only an estimate and not an actual count. That patent application told us that the number shown might be estimated based upon a look at anywhere from 2 percent to 10 percent of Google’s index. Since the Caffeine update is a complete infrastructure/database update, we can’t even assume that the estimates shown by present-day Google are produced in the same way as the ones shown by Caffeine.

We also can’t be sure that the numbers for Yahoo, Bing, and Ask are calculated in the same manner.
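To make the sampling idea a little more concrete, here is a minimal sketch of how a total could be extrapolated from a look at only part of an index. This is not Google's method, which the patent filing doesn't spell out here. It simply assumes a uniform random sample, and the names `estimate_total_matches` and `toy_index` are made up for the illustration.

```python
import random

def estimate_total_matches(index, term, sample_fraction, seed=0):
    """Estimate how many documents contain `term` by checking only a sample.

    Assumption: the sample is drawn uniformly at random from the index.
    """
    rng = random.Random(seed)
    sample_size = max(1, int(len(index) * sample_fraction))
    sample = rng.sample(index, sample_size)
    hits = sum(1 for doc in sample if term in doc.split())
    # Scale the hit count in the sample up to the size of the whole index.
    return round(hits * len(index) / sample_size)

# A toy "index" of 1,000 tiny documents; 600 of them contain the word "the".
toy_index = ["the cat sat"] * 600 + ["a dog ran"] * 400

exact = sum(1 for doc in toy_index if "the" in doc.split())
estimate = estimate_total_matches(toy_index, "the", sample_fraction=0.05)

print("exact count:", exact)
print("estimate from a 5 percent sample:", estimate)
```

The estimate will usually land near the exact count without matching it, which is one reason the totals reported by a search engine shouldn't be read as precise figures.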

Another is that while I may see one total count at Google for each term, you might see different numbers if you looked up the same terms, because you may be searching at a different data center. The estimates may differ from one data center to another.

A third thing to keep in mind is that we aren’t searching the Web when we search at one of the search engines. Instead, we’re searching the indexes of the Web that the search engines have created. That means that some pages may be indexed more than once under different URLs, that many pages on the Web may not be included since they haven’t been indexed yet, and that words that appear on the Web as text in images, in Flash, or hidden behind JavaScript or log-in screens aren’t going to be counted.

The table below shows the estimated number of total results, in millions. I sorted the rows by how frequently the terms tested appeared in Google Caffeine.

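| Query | Google Caffeine | Google | Yahoo | Bing | Ask |
|-------|-----------------|--------|-------|------|-----|
| a | 19,320 | 17,570 | 31,200 | 7,800 | 1,280 |
| in | 15,850 | 13,980 | 30,200 | 7,850 | 900 |
| to | 15,220 | 13,500 | 27,500 | 8,920 | 1,740 |
| the | 14,850 | 13,900 | 28,800 | 8,170 | 747 |
| of | 14,760 | 12,990 | 28,000 | 7,310 | 794 |
| and | 13,980 | 12,950 | 28,000 | 7,490 | 789 |
| for | 12,110 | 10,720 | 26,800 | 7,740 | 769 |
| by | 12,080 | 10,420 | 27,000 | 6,120 | 956 |
| on | 11,260 | 9,940 | 25,100 | 5,610 | 598 |
| is | 9,580 | 8,870 | 22,600 | 4,250 | 699 |
| I | 9,220 | 8,250 | 18,600 | 3,860 | 686 |
| all | 9,110 | 7,580 | 27,200 | 6,990 | 1,020 |
| this | 8,890 | 7,870 | 21,500 | 5,790 | 585 |
| with | 8,490 | 6,300 | 20,900 | 2,440 | 636 |
| it | 7,700 | 6,860 | 19,300 | 4,190 | 542 |
| at | 7,410 | 6,600 | 20,800 | 3,930 | 552 |
| from | 7,340 | 6,920 | 18,400 | 4,160 | 521 |
| or | 7,030 | 6,210 | 19,500 | 3,940 | 567 |
| you | 6,760 | 5,930 | 19,900 | 5,080 | 543 |
| as | 6,460 | 5,750 | 15,400 | 3,550 | 884 |
| your | 6,360 | 5,470 | 19,500 | 3,790 | 495 |
| an | 6,260 | 5,520 | 16,500 | 3,780 | 489 |
| are | 6,260 | 5,760 | 18,100 | 163 | 578 |
| be | 6,120 | 5,460 | 17,100 | 3,990 | 473 |
| that | 5,780 | 5,260 | 15,200 | 5,650 | 405 |
| do | 5,500 | 5,020 | 13,000 | 2,090 | 410 |
| not | 5,500 | 4,870 | 15,600 | 4,550 | 418 |
| have | 4,870 | 4,390 | 14,500 | 4,130 | 468 |
| one | 4,330 | 3,870 | 12,300 | 2,750 | 375 |
| can | 4,150 | 3,690 | 13,300 | 3,030 | 367 |
| was | 3,930 | 3,610 | 10,400 | 2,960 | 361 |
| if | 3,810 | 3,500 | 11,200 | 2,660 | 345 |
| we | 3,780 | 3,370 | 12,400 | 3,430 | 358 |
| but | 3,610 | 3,340 | 10,100 | 1,680 | 327 |
| what | 3,290 | 2,850 | 11,600 | 3,080 | 322 |
| which | 3,020 | 2,810 | 7,750 | 1,810 | 300 |
| there | 2,970 | 2,770 | 8,340 | 1,450 | 262 |
| when | 2,850 | 2,600 | 8,360 | 1,580 | 306 |
| use | 2,730 | 2,250 | 12,300 | 1,830 | 327 |
| their | 2,690 | 2,680 | 8,210 | 1,650 | 254 |
| they | 2,650 | 2,440 | 8,260 | 1,670 | 293 |
| how | 2,470 | 2,170 | 9,050 | 1,730 | 289 |
| he | 2,200 | 2,040 | 6,060 | 1,420 | 190 |
| were | 2,130 | 2,100 | 5,320 | 2,770 | 203 |
| his | 2,030 | 1,880 | 5,310 | 858 | 182 |
| had | 1,860 | 2,240 | 5,090 | 966 | 191 |
| each | 1,370 | 1,290 | 4,150 | 1,090 | 164 |
| said | 1,210 | 1,350 | 4,060 | 857 | 128 |
| she | 953 | 882 | 3,030 | 1,200 | 95 |
| word | 780 | 685 | 2,280 | 469 | 80 |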
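One simple way to read numbers like these is to divide the Caffeine estimate for a word by the current Google estimate for the same word. The sketch below does that for a handful of rows copied from the table. The choice of rows and the names `ROWS` and `caffeine_to_google_ratio` are mine, picked only for illustration, and since the inputs are estimates, the ratios are rough at best.

```python
# A handful of rows copied from the table above (estimated results, in millions).
# Column order: Google Caffeine, Google, Yahoo, Bing, Ask.
ROWS = {
    "a":    (19320, 17570, 31200, 7800, 1280),
    "the":  (14850, 13900, 28800, 8170, 747),
    "with": (8490, 6300, 20900, 2440, 636),
    "had":  (1860, 2240, 5090, 966, 191),
    "said": (1210, 1350, 4060, 857, 128),
    "word": (780, 685, 2280, 469, 80),
}

def caffeine_to_google_ratio(rows):
    """Return {word: Caffeine estimate / current Google estimate}."""
    return {word: counts[0] / counts[1] for word, counts in rows.items()}

ratios = caffeine_to_google_ratio(ROWS)

# Words whose Caffeine estimate is lower than the current Google estimate.
lower_in_caffeine = [word for word, ratio in ratios.items() if ratio < 1]

for word, ratio in ratios.items():
    print(f"{word:>5}: {ratio:.2f}")
print("Lower in Caffeine:", lower_in_caffeine)
```

For these six rows, only "had" and "said" come out lower in Caffeine than in the current Google.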

I thought it would be helpful to present this information in a visually different manner as well. Therefore, the chart that follows is in reverse order of the table above.

[Chart comparing estimates of the number of results for common words in Google Caffeine, Google, Yahoo, Bing, and Ask.]

As I mentioned above, this is a completely unscientific view.

One thing it definitely won’t do is provide an idea of how large the databases might be for each of the search engines. According to a post about Bing at the Cuil blog (no longer available), there is a way to try to make that comparison, but it relies upon looking at the number of search results for rare terms rather than at the most frequently appearing words, as I have.