The first known appearance of the term “googlebomb” was in an article by Adam Mathes in the online magazine uber.nu. In it, he asked readers to help him play a joke on a friend by making the friend’s website rank highly for the phrase “talentless hack.”
You may have noticed that some pages rank well in Google search results for terms or phrases that don’t actually appear on those pages, because other pages link to them using those words as the link text. For example, search for “click here” and the top search result at Google is the Adobe Reader download page, which millions of links across the Web point to using “click here” as their link text. That is how Googlebombing works. It’s something that can affect more than just Google’s results.
I’ve used the phrase “Google bombing” in this post, but this is something that happens at Yahoo and Bing as well. Given enough links from enough pages using the same text to point to a specific page, there’s a chance that the page being linked to will rank very well in search results at any of the major search engines, even if the content of the page has nothing to do with the text in those links.
When people link to pages, the text used in those links is often descriptive of what people might find on the pages being linked to. This can help a search engine understand what the page being pointed to is about. Search engines have been associating the text in links with the pages that they refer to since the early days of the Web. As Google’s founders, Sergey Brin and Lawrence Page, note in one of the first white papers about Google, The Anatomy of a Large-Scale Hypertextual Web Search Engine, the idea is something that they incorporated in Google, but it didn’t start with them:
This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information and expands the search coverage with fewer downloaded documents.
We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data that must be processed. In our current crawl of 24 million pages, we had over 259 million anchors, which we indexed.
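To make anchor propagation concrete, here is a minimal sketch in Python. It is not how Google or any other search engine actually indexes links; it only credits each anchor phrase to the page the link points at, and then ranks pages by a raw count of matching anchors. The function names and the (source, target, anchor) tuple layout are my own assumptions for illustration.

```python
from collections import Counter, defaultdict


def build_anchor_index(links):
    """Map each target URL to a Counter of the anchor phrases pointing at it.

    `links` is an iterable of (source_url, target_url, anchor_text) tuples.
    This layout is an assumption made for the sketch.
    """
    index = defaultdict(Counter)
    for _source, target, anchor in links:
        # The anchor text is credited to the target page, not the source page.
        index[target][anchor.lower().strip()] += 1
    return index


def rank_by_anchor(index, query):
    """Return (url, count) pairs, most-linked first, for pages whose
    inbound links use `query` as anchor text."""
    q = query.lower().strip()
    hits = [(url, anchors[q]) for url, anchors in index.items() if anchors[q] > 0]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)
```

With a toy index like this, a page can come out on top for a phrase that never appears in its own content, which is the opening that a Googlebomb exploits. A real ranking system would combine anchor text with many other signals.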
Trying to understand the relevance of a page from the links pointing to it is often referred to as hypertext relevance. While search engines have relied on it for almost as long as there have been search engines on the Web, people have also been manipulating it for personal, political, and commercial purposes.
The “talentless hack” Googlebomb was intended as a joke. But one of the most famous googlebombs was inspired by political activism, with many people linking to George Bush’s biography page on the White House website using the phrase “miserable failure” in the anchor text of their links. A September 2005 statement on the Official Google Blog, Googlebombing ‘failure’, explained why that page was showing up for that query:
Google’s search results are generated by computer programs that rank web pages in large part by examining the number and relative popularity of the sites that link to them. By using a practice called googlebombing, however, determined pranksters can occasionally produce odd results. In this case, many webmasters use the phrases [failure] and [miserable failure] to describe and link to President Bush’s website, thus pushing it to the top of searches for those phrases.
In January 2007, a post on the Google Webmaster Central blog, A quick word about Google bombs, told us that Google had solved the “miserable failure” Google bombing:
We wanted to give a quick update about “Google bombs.” By improving our analysis of the web link structure, Google has begun minimizing the impact of many Googlebombs. Now we will typically return commentary, discussions, and articles about the Google bombs instead. The actual scale of this change is pretty small (there are under a hundred well-known Googlebombs), but if you’d like to get more details about this topic, read on.
The post doesn’t tell us how the problem was solved, other than mentioning an improvement to how they analyze links, and that the solution was an algorithmic one. So how did Google solve the problem?
The only patent or white paper reference to Google bombing that I’ve seen from Google appears in the Google patents on Phrase-Based Indexing. Until today, I hadn’t seen any references from the other search engines about how they may have attempted to solve the problem. A Yahoo patent granted today describes how they fight “search engine hijacking,” and uses the example of a query for “miserable failure” showing the Presidential biography page.
The Google phrase-based indexing approach that I mentioned may or may not be the method referred to in the Google Webmaster Central post above. But it may account for the President’s bio page starting to show up in search results again a few months later, when the word “failure” was added to that bio page. Here’s a snippet from the first phrase-based indexing patent:
[0156] This approach has the benefit of entirely preventing certain types of manipulations of web pages (a class of documents) from skewing the results of a search. Search engines that use a ranking algorithm that relies on the number of links that point to a given document to rank that document can be “bombed” by artificially creating many pages with a given anchor text that then points to the desired page.
As a result, when a search query using the anchor text is entered, the desired page is typically returned, even if, in fact, this page has little or nothing to do with the anchor text. Importing the related bit-vector from a target document URL1 into the phrase A related phrase bit vector for document URL0 eliminates the reliance of the search system on just the relationship of phrase A in URL0 pointing to URL1 as an indicator of significance or URL1 to the anchor text phrase.
Once the White House staff added “failure” to the bio page, the page suddenly became relevant, under a phrase-based indexing approach, for all of those links pointing to it that used “miserable failure” as anchor text.
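Here is a loose sketch of that idea in Python. It is not the scoring formula from the patent. It only illustrates the principle that an anchor phrase counts for a target page in proportion to the evidence, in the target page itself, that the page is about that phrase. The related-phrase list, the weighting, and the function names are all assumptions for illustration.

```python
def related_phrase_bits(text, related_phrases):
    """Return an int bit vector: bit i is set when related_phrases[i]
    occurs in `text` (simple case-insensitive substring match)."""
    lowered = text.lower()
    bits = 0
    for i, phrase in enumerate(related_phrases):
        if phrase.lower() in lowered:
            bits |= 1 << i
    return bits


def anchor_weight(anchor_phrase, target_text, related_phrases):
    """Weight an inbound anchor by how much the target page supports it.

    Returns 0.0 when the target contains neither the anchor phrase nor
    any of its related phrases, so a pure "bomb" contributes nothing.
    The weighting itself is a made-up illustration.
    """
    bits = related_phrase_bits(target_text, related_phrases)
    matched = bin(bits).count("1")  # number of related phrases present
    has_phrase = 1 if anchor_phrase.lower() in target_text.lower() else 0
    return (has_phrase + matched) / (1 + len(related_phrases))
```

Under this toy weighting, links with the anchor “failure” count for nothing while the word and its related phrases are absent from the target page, and start counting once the word is added to the page.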
Yahoo and Bing are also subject to Google bombs, and a search at either Yahoo or Bing for “miserable failure” shows the George Bush White House bio in the top four results. The Yahoo patent describes a way of defusing Google bombs using sentiment analysis, and if it works, it’s possible that Microsoft might want to license the approach from Yahoo.
The patent is:
Mitigation of search engine hijacking
Invented by Shanmugasundaram Ravikumar and Bo Pang
Assigned to Yahoo!
US Patent 7,870,131
Granted January 11, 2011
Filed: December 13, 2007
Abstract
The subject matter disclosed herein relates to the mitigation of search engine hijacking. A sentiment value associated with anchor text in a search engine result may be determined in one example implementation.
Similarly, a sentiment value of one or more web pages referenced by the anchor text may also be determined. A divergence between sentiment values associated with the anchor text and a web page may then [be] determined.
Here’s the technical language on how the Yahoo method works, straight from the patent:
More specifically, given an anchor text-page pair (q, p), a sentiment classifier may be applied to the anchor text and the web page separately, resulting in the sentiment of the anchor text (C(p)) and the sentiment of the web page (C(q)). In the case where C(p) U C(q)={acceptable, unacceptable}, a determination may be made to see whether the anchor text q is trying to hijack web page p. Where Pq is the set of all pages with anchor text q, and Qp is the set of all anchor texts for page p, hijacking may be indicated where C(p)={acceptable} and C(q)={unacceptable}. This may correspond to a case in which an invalid anchor text tries to hijack a valid web page. In this case, anchor text q may be declared as hijacking page p if the multi-set Pq, treated as a distribution, has low entropy and if most of the anchor text in the set Qp are “acceptable.” Such a result may indicate that the goal of the anchor text q is to slander web page p as web page p is also indicated as having a significant amount of other “labelings”(in the form of diverse, and mostly “acceptable” anchor texts).
Likewise, for example, hijacking of search engines may be indicated in cases where anchor text has an acceptable sentiment value, and the web page has an unacceptable sentiment value. A ranking component may determine that such hijacking occurs if a set of anchor text referencing the web page has a distribution with low entropy, and if a majority of web pages within a set of web pages containing the anchor text have an acceptable sentiment value. The acceptable anchor text sentiment value diverges from the unacceptable web page sentiment value in such a case. Such divergence may be shown not to be a normal occurrence due to the low entropy of the set of anchor texts referencing the web page.
In other words, if the anchor text used to point to a page has a negative sentiment value, and the text on the page being pointed to has a positive sentiment value, then the search engine may not use that anchor text to work out what the page is about. Likewise, if the anchor text has a positive sentiment value, and the text on the page it links to has a negative sentiment value, then the anchor text also may not be applied to that page.
A link using “miserable failure” as anchor text expresses a negative sentiment, and the bio page of the former president expresses positive sentiments. Under this system, presumably, the “miserable failure” text wouldn’t be applied to the bio page.
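The first case from the patent (negative anchor text, positive page) can be sketched in a few lines of Python. This is only my reading of the quoted passage, not Yahoo’s implementation. The word-list classifier, the threshold values, and the function names are assumptions for illustration, since the patent text quoted above doesn’t give concrete numbers.

```python
import math
from collections import Counter

# Toy lexicon standing in for a real sentiment classifier (an assumption).
NEGATIVE_WORDS = {"miserable", "failure", "liar", "worst"}


def classify(text):
    """Label text 'unacceptable' if it contains any negative word."""
    words = set(text.lower().split())
    return "unacceptable" if words & NEGATIVE_WORDS else "acceptable"


def entropy(items):
    """Shannon entropy, in bits, of the empirical distribution of `items`."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def is_hijack(anchor, page_text, pages_with_anchor, anchors_for_page,
              max_entropy=1.0, min_acceptable_share=0.5):
    """Flag anchor q as hijacking page p when the page reads as acceptable,
    the anchor reads as unacceptable, the multi-set of pages receiving that
    anchor (Pq) has low entropy, and most of the anchor texts for the page
    (Qp) are acceptable. Both thresholds are illustrative guesses."""
    if not anchors_for_page:
        return False
    if classify(page_text) != "acceptable" or classify(anchor) != "unacceptable":
        return False
    low_entropy = entropy(pages_with_anchor) <= max_entropy
    acceptable = sum(1 for a in anchors_for_page if classify(a) == "acceptable")
    mostly_acceptable = acceptable / len(anchors_for_page) > min_acceptable_share
    return low_entropy and mostly_acceptable
```

The entropy check is what separates a coordinated campaign from ordinary criticism in this sketch. If nearly every link that uses a negative phrase points at one page, the distribution is concentrated and the entropy is low. If the phrase is spread across many different pages, the entropy is high and nothing is flagged.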
I’m not sure if Yahoo ever tried this out, and with Bing now powering Yahoo’s search results, it’s impossible to test whether it would have been effective had Yahoo implemented it. So at this point, Yahoobombing and bingbombing still seem to work.
Conclusion
Is phrase-based indexing responsible for the disappearance of Googlebombing at Google?
There are two different sets of phrase-based indexing patents that Google published. The first set described many ways that the search engine could use phrase-based indexing. The second set described how the system could be incorporated into a large-scale search engine index like Google’s. A phrase-based indexing approach would stop the “miserable failure” query from showing George Bush’s bio. It would also explain why the bio started appearing again for a query using just “failure” once the White House added that word to the page after the “miserable failure” googlebombing was defused.
Is Google using some kind of sentiment analysis approach to solve Googlebombing, like the one described in the Yahoo patent?
It’s possible, but it’s hard to say whether or not the Yahoo approach even works, at this point.
Last updated June 9, 2019.