Which words show up most frequently on the Web? I’m not sure that question can be answered, but it’s something I’ve wondered for a while.
With a beta version of Google’s upcoming update, code-named Caffeine, recently released for people to experiment with, I thought I would run a few comparisons.
I found a few lists of the most common words in the English language and came up with a top 50 to see how frequently those were estimated to show up in Google, Yahoo, Bing, Ask, and Google Caffeine. Those are shown in a table and a chart below.
I’m not sure how informative this might be, even after looking at it. It’s not a very scientific test, either. There are a few reasons for that:
One of them is that when you search at one of the search engines, you’ll see a message that says something like:
Results 1 – 100 of about xxx,xxx,xxx for [query term]
From at least one previous Google patent filing, we can guess that the total number of results listed (xxx,xxx,xxx) is likely only an estimate and not an actual count. That patent application told us that the number shown might be estimated based upon a look at anywhere from 2 percent to 10 percent of Google’s index. Since the Caffeine update is a complete infrastructure/database update, we can’t even assume that the estimates shown by present-day Google are created in the same way as the ones shown by Caffeine.
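To make the idea of estimating from a slice of an index more concrete, here is a minimal sketch of sample-based counting. It is only an illustration of the general technique, not a description of how Google or any other search engine actually produces its numbers. The function name `estimate_total_matches`, the toy index, and the fixed random seed are all my own assumptions.

```python
import random

def estimate_total_matches(index, term, sample_fraction, seed=0):
    """Estimate how many documents contain `term` by checking only a sample."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the sketch is repeatable
    sample_size = max(1, int(len(index) * sample_fraction))
    sample = rng.sample(index, sample_size)
    hits = sum(1 for doc in sample if term in doc.split())
    # Scale the hit count from the sample up to the size of the full index.
    return round(hits * len(index) / sample_size)

# Toy "index" (hypothetical): 10,000 tiny documents, 3,000 of which contain "the".
toy_index = ["the quick fox"] * 3000 + ["quick brown fox"] * 7000

# Look at 2 percent, 10 percent, and all of the toy index.
for fraction in (0.02, 0.10, 1.0):
    print(fraction, estimate_total_matches(toy_index, "the", fraction))
```

Only the run over the full index is guaranteed to return the true count of 3,000. The smaller samples land somewhere near it, which is one reason why two estimates for the same word don't have to agree.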
We also can’t be sure that the numbers for Yahoo, Bing, and Ask are calculated in the same manner.
Another is that while I may see one total count at Google for a term, you might see a different number for the same term because your search may be handled by a different data center. There may be differences from one data center to another.
A third thing to keep in mind is that we aren’t searching the Web when we search at one of the search engines. Instead, we’re searching the indexes of the Web that the search engines have created. That means that some pages may be indexed more than once under different URLs, that many pages on the Web may not be included since they haven’t been indexed yet, and that words that appear on the Web as text in images, or which are presented in Flash or hidden behind JavaScript or log-in screens, aren’t going to be counted.
The table below shows the estimated total number of results for each term, in millions. I sorted the terms by how frequently they appeared in Google Caffeine.
| Query | Google Caffeine | Google | Yahoo | Bing | Ask |
| a | 19,320 | 17,570 | 31,200 | 7,800 | 1,280 |
| in | 15,850 | 13,980 | 30,200 | 7,850 | 900 |
| to | 15,220 | 13,500 | 27,500 | 8,920 | 1,740 |
| the | 14,850 | 13,900 | 28,800 | 8,170 | 747 |
| of | 14,760 | 12,990 | 28,000 | 7,310 | 794 |
| and | 13,980 | 12,950 | 28,000 | 7,490 | 789 |
| for | 12,110 | 10,720 | 26,800 | 7,740 | 769 |
| by | 12,080 | 10,420 | 27,000 | 6,120 | 956 |
| on | 11,260 | 9,940 | 25,100 | 5,610 | 598 |
| is | 9,580 | 8,870 | 22,600 | 4,250 | 699 |
| I | 9,220 | 8,250 | 18,600 | 3,860 | 686 |
| all | 9,110 | 7,580 | 27,200 | 6,990 | 1,020 |
| this | 8,890 | 7,870 | 21,500 | 5,790 | 585 |
| with | 8,490 | 6,300 | 20,900 | 2,440 | 636 |
| it | 7,700 | 6,860 | 19,300 | 4,190 | 542 |
| at | 7,410 | 6,600 | 20,800 | 3,930 | 552 |
| from | 7,340 | 6,920 | 18,400 | 4,160 | 521 |
| or | 7,030 | 6,210 | 19,500 | 3,940 | 567 |
| you | 6,760 | 5,930 | 19,900 | 5,080 | 543 |
| as | 6,460 | 5,750 | 15,400 | 3,550 | 884 |
| your | 6,360 | 5,470 | 19,500 | 3,790 | 495 |
| an | 6,260 | 5,520 | 16,500 | 3,780 | 489 |
| are | 6,260 | 5,760 | 18,100 | 163 | 578 |
| be | 6,120 | 5,460 | 17,100 | 3,990 | 473 |
| that | 5,780 | 5,260 | 15,200 | 5,650 | 405 |
| do | 5,500 | 5,020 | 13,000 | 2,090 | 410 |
| not | 5,500 | 4,870 | 15,600 | 4,550 | 418 |
| have | 4,870 | 4,390 | 14,500 | 4,130 | 468 |
| one | 4,330 | 3,870 | 12,300 | 2,750 | 375 |
| can | 4,150 | 3,690 | 13,300 | 3,030 | 367 |
| was | 3,930 | 3,610 | 10,400 | 2,960 | 361 |
| if | 3,810 | 3,500 | 11,200 | 2,660 | 345 |
| we | 3,780 | 3,370 | 12,400 | 3,430 | 358 |
| but | 3,610 | 3,340 | 10,100 | 1,680 | 327 |
| what | 3,290 | 2,850 | 11,600 | 3,080 | 322 |
| which | 3,020 | 2,810 | 7,750 | 1,810 | 300 |
| there | 2,970 | 2,770 | 8,340 | 1,450 | 262 |
| when | 2,850 | 2,600 | 8,360 | 1,580 | 306 |
| use | 2,730 | 2,250 | 12,300 | 1,830 | 327 |
| their | 2,690 | 2,680 | 8,210 | 1,650 | 254 |
| they | 2,650 | 2,440 | 8,260 | 1,670 | 293 |
| how | 2,470 | 2,170 | 9,050 | 1,730 | 289 |
| he | 2,200 | 2,040 | 6,060 | 1,420 | 190 |
| were | 2,130 | 2,100 | 5,320 | 2,770 | 203 |
| his | 2,030 | 1,880 | 5,310 | 858 | 182 |
| had | 1,860 | 2,240 | 5,090 | 966 | 191 |
| each | 1,370 | 1,290 | 4,150 | 1,090 | 164 |
| said | 1,210 | 1,350 | 4,060 | 857 | 128 |
| she | 953 | 882 | 3,030 | 1,200 | 95 |
| word | 780 | 685 | 2,280 | 469 | 80 |
I thought it would be helpful to present this information visually as well. The chart that follows shows the same terms in the reverse order of the table above.

As I mentioned above, this is a completely unscientific view.
One thing it definitely won’t do is provide an idea of how large the databases might be for each of the search engines. According to a post about Bing at the Cuil blog (no longer available), there is a way to try to make that comparison. However, it relies upon looking at the number of search results for rare terms rather than at the most frequently appearing words, as I have.
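Since that post is gone, I can’t say exactly how its comparison worked. As a rough sketch of what a rare-term comparison could look like, the snippet below takes the result counts two engines report for the same rare terms and uses the median ratio as a guess at their relative index sizes. The function name `relative_index_size`, the placeholder terms, and every count in it are hypothetical and made up purely for illustration.

```python
from statistics import median

def relative_index_size(counts_a, counts_b):
    """Median ratio of engine A's to engine B's result counts over shared terms."""
    shared = [t for t in counts_a if t in counts_b and counts_b[t] > 0]
    if not shared:
        raise ValueError("no shared terms with non-zero counts")
    return median(counts_a[t] / counts_b[t] for t in shared)

# Hypothetical result counts for three rare terms (not real measurements).
engine_a = {"rareterm1": 1200, "rareterm2": 300, "rareterm3": 90}
engine_b = {"rareterm1": 600, "rareterm2": 100, "rareterm3": 45}

print(relative_index_size(engine_a, engine_b))  # prints 2.0
```

The thinking behind using rare terms is that their counts are small enough that an engine might report something closer to an actual count than a rounded estimate. I chose the median here only so that a single odd term doesn’t skew the result.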