Searching Google for Big Panda and Finding Decision Trees

Around the beginning of May 2010, several site owners noticed a change in Google rankings for websites that caused many of them to lose traffic. Since the change took place around May 1st, that Google update was referred to by many as the MayDay update.

Google’s head of Web Spam, Matt Cutts, published a video in response, answering the question Is Google putting more weight on brands in rankings? In it, he referred to Google’s internal name for the update as the “Vince Update,” named after one of the Google search engineers who worked on the project.

A week ago, on February 24th, a post on the Official Google Blog titled Finding more high-quality sites in search, written by Matt Cutts and Google Fellow Amit Singhal, announced a significant change in how Google ranks Web pages in search results. In the post, we were told that the change would “noticeably impact 11.8% of Google’s queries.”

The purpose of the change was to:

…reduce rankings for low-quality sites — sites which are low-value add for users, copy content from other websites or sites that are just not very useful.

At the same time, it will provide better rankings for high-quality sites — sites with original content and information such as research, in-depth reports, thoughtful analysis and so on.

Earlier today, Wired published an interview with Matt Cutts and Amit Singhal titled The ‘Panda’ That Hates Farms: A Q&A With Google’s Top Search Engineers.

In that interview, we were told that the focus of the update was to rank higher-quality sites above lower-quality sites in Google’s search rankings.

While many writing about the update have been referring to it as the “Farmer Update,” since it seemed to target content farm websites, Matt Cutts shared the internal Google code name for the update, telling us that it was named “Big Panda,” after one of the key guys involved in the update, whose name is Panda.

It appears that the update involved classifying websites on the basis of a number of questions about the sites, such as:

Do you consider this site to be authoritative?

Would it be okay if this was in a magazine?

Does this site have excessive ads?

So, I went to Google and searched for Panda.

And I found Biswanath Panda.

One of the papers that Biswanath Panda and several other Googlers published in 2009 described an experiment that Google performed on its advertising system, to see whether it could learn about the quality of ads and landing pages from the bounce rates associated with clicks on those ads.

The focus of the paper wasn’t so much upon the effectiveness of the ads in the experiment, but rather about the ability of the machine learning system to work on a very large set of data.

The paper is:

“PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce” (pdf), by Biswanath Panda, Joshua S. Herbach, Sugato Basu, and Roberto J. Bayardo, which was originally published in the Proceedings of the 35th International Conference on Very Large Data Bases (VLDB-2009).

We’re told in the conclusion section of the paper that while the authors focused their experiments upon problems in sponsored search, they expect to be able to achieve similarly effective results on other large-scale learning problems.
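To make the general idea of learning a tree over data spread across many machines a little more concrete, here is a minimal sketch in Python. It is not the paper's code. It assumes a single numeric feature, a regression target such as a bounce rate, and a squared-error split criterion, and the function names (map_partition, combine, best_split) are my own. Each "mapper" summarizes its own slice of the data, a "reducer" adds those summaries together, and the best split is chosen from the combined totals without any one machine needing to see all of the rows.

```python
from functools import reduce

def map_partition(rows, thresholds):
    """Mapper (hypothetical): for one partition of (x, y) rows, collect
    [count, sum of y, sum of y squared] for the rows with x < t, for each
    candidate threshold t, plus the same totals for the whole partition."""
    stats = {t: [0, 0.0, 0.0] for t in thresholds}
    total = [0, 0.0, 0.0]
    for x, y in rows:
        total[0] += 1
        total[1] += y
        total[2] += y * y
        for t in thresholds:
            if x < t:
                stats[t][0] += 1
                stats[t][1] += y
                stats[t][2] += y * y
    return stats, total

def combine(a, b):
    """Reducer (hypothetical): add the statistics from two partitions."""
    stats_a, total_a = a
    stats_b, total_b = b
    merged = {t: [p + q for p, q in zip(stats_a[t], stats_b[t])] for t in stats_a}
    total = [p + q for p, q in zip(total_a, total_b)]
    return merged, total

def sse(n, s, ss):
    """Sum of squared errors around the mean, computed from the totals."""
    return ss - s * s / n if n else 0.0

def best_split(partitions, thresholds):
    """Pick the threshold that minimizes squared error across both branches."""
    stats, total = reduce(combine, (map_partition(p, thresholds) for p in partitions))

    def cost(t):
        n, s, ss = stats[t]
        return sse(n, s, ss) + sse(total[0] - n, total[1] - s, total[2] - ss)

    return min(thresholds, key=cost)
```

Called on two small partitions, for example best_split([[(1, 0.1), (2, 0.2)], [(8, 0.9), (9, 1.0)]], [1.5, 5, 8.5]), the sketch picks the threshold that separates the low bounce rates from the high ones. The point is only that the per-partition summaries are small and can be added together, which is what lets this kind of learning scale.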

The Farmer/Panda update does appear to be one where a large number of websites were classified based upon the quality of content on the pages within those sites. The process described in the Tree Ensemble paper is one potential candidate for the change in rankings, resulting in a reranking of search results based upon answers to the kinds of questions above that could be used to determine the quality of pages.
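As a purely illustrative sketch of what classifying a site with a decision tree might look like, the Python below walks a tiny hand-built tree over the three questions from the Wired interview. Everything in it is an assumption on my part: the field names, the idea of measuring ads as a ratio, the 0.5 cutoff, and the order of the questions. A learned tree ensemble would derive its own splits from training data and combine the votes of many trees rather than rely on one hand-written one.

```python
from dataclasses import dataclass

@dataclass
class SiteSignals:
    # Hypothetical stand-ins for the questions quoted from the interview.
    authoritative: bool    # "Do you consider this site to be authoritative?"
    magazine_worthy: bool  # "Would it be okay if this was in a magazine?"
    ad_ratio: float        # assumed measure: share of the page given to ads, 0.0 to 1.0

def classify(site: SiteSignals, max_ad_ratio: float = 0.5) -> str:
    """Walk a small hand-built decision tree and return a quality label."""
    # "Does this site have excessive ads?" (the cutoff is an assumption)
    if site.ad_ratio > max_ad_ratio:
        return "low"
    if site.authoritative:
        return "high"
    # Not authoritative: fall back on the magazine question.
    return "high" if site.magazine_worthy else "low"
```

With these made-up rules, a site that is authoritative and light on ads is labeled "high," while the same site with ads covering most of the page is labeled "low."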

In Document Level Classifiers and Google Spam Identification last month, I provided an example of a patent that described how Google might classify web pages based upon many characteristics of those pages.

While the patent I focused upon in that post gave us some hints about how Google might determine the language used in a web page, the main idea behind my post was that Google might pose some questions about a page to decide whether or not it should be classified as Web spam.

I mentioned several things that Google might look for in the Document Level Classifiers post, and there could be other factors involved as well, including the amount and placement of advertising found on pages, how much novel or duplicate content might be found on the pages, and more.

Is Biswanath Panda the “Panda” that Matt Cutts referred to in the Wired article? Did Google use an approach like the one described in the Tree Ensembles paper?

I’m not sure that it matters.

What does matter is that the update focuses upon boosting sites in search results that Google considers to be higher quality, and demoting those that it considers to be lower quality.

The takeaway from this update from Google seems pretty obvious – rankings in search results are now more closely tied to the quality of the pages and sites being ranked.

How does Google define “quality”?

That’s the challenge facing site owners after this update.