Google Malware Detection Using Document Classification

In the Google paper Predicting Bounce Rates in Sponsored Search Advertisements (pdf), we’re told about an experiment in which researchers used a document classification model on sponsored advertisements and landing pages to try to predict bounce rates: how often people who click an ad in Google’s search results leave the landing page very quickly. The experiment is also described in another Google paper, PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce (pdf), which tells us how Google might take an extremely large amount of observational data and use it to create classifications. Among other things, those classifications could potentially be used to help rank pages in organic search, as we’ve been told Google’s Panda updates do.

A patent about Google Malware Detection was granted today that appears to use a similar approach to determine whether sponsored advertisements in Google might lead to malware. The patent describes malware as malicious software that might be deceptively or automatically installed on a visitor’s computer when they arrive at a page. In addition to trojan horses and viruses, this can include monitoring software. In some instances, a landing page may be the first in a series of one or more redirections, and the malware may be on any of the pages being redirected to. The need for such a classification approach comes about because of the sheer volume of advertisements that Google shows.

We know that Google’s Panda updates look for features on websites that indicate “quality” in some manner. Under the document classification approach in this patent, “intrusion features” are tested and weighted on landing pages.

This process involves Google taking a large number of landing pages and using a machine learning approach to examine the features on those pages that might indicate the possibility of malware. The model trained on that set can then be used by the system to classify other landing pages. The malware detection approach may test pages and take appropriate actions when malware is detected, such as suspending an advertiser’s account, flagging ads associated with the landing pages, rechecking those landing pages to see if they have been cleaned up, and enabling advertisers to have their accounts unsuspended. Keep in mind that malware may be introduced to a site by someone who has hacked it rather than by the advertiser themselves.
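
The suspend, flag, recheck, and unsuspend cycle described above can be sketched in a few lines of Python. This is only an illustration of the workflow, not the patent's implementation: the Advertiser class, the AccountStatus values, and the has_malware callback are all hypothetical names invented for the example.

```python
from dataclasses import dataclass, field
from enum import Enum


class AccountStatus(Enum):
    # Hypothetical account states; the patent does not name them this way.
    ACTIVE = "active"
    SUSPENDED = "suspended"


@dataclass
class Advertiser:
    name: str
    landing_pages: list
    status: AccountStatus = AccountStatus.ACTIVE
    flagged_pages: set = field(default_factory=set)


def review_advertiser(advertiser, has_malware):
    """Flag infected landing pages, suspend the account if any are found,
    and reinstate it once a recheck finds every page clean.

    has_malware is an assumed callback that takes a URL and returns a bool.
    """
    advertiser.flagged_pages = {
        page for page in advertiser.landing_pages if has_malware(page)
    }
    if advertiser.flagged_pages:
        advertiser.status = AccountStatus.SUSPENDED
    else:
        advertiser.status = AccountStatus.ACTIVE
    return advertiser.status
```

Calling review_advertiser again after a site has been cleaned up plays the role of the recheck: the flagged set empties and the account returns to active, which covers the case where a hacked advertiser was never at fault.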

The patent provides a fair amount of detail on how a malware detection system might be implemented by Google, but my interest is in looking at the kinds of “intrusion features” that might be used to indicate that a landing page might contain malware. The classification approach described in this patent would be used as a first step in evaluating pages to predict when a landing page might contain malware or redirect to it. We know that Google purchased Green Border in 2007 to use that technology to protect browsers from malware when surfing the Web. If the classification approach from this patent predicts the presence of malware, the page could be further tested by approaches similar to those used in the Green Border technology.

What kind of intrusion features might Google look for on landing pages or any pages that might redirect from those landing pages?

Those could include specific iframe features, URL features, or script features that are known to be associated with landing pages that include malware. If a certain feature score is reached when a page is evaluated, it would be classified as a candidate for further evaluation.

Here are some examples of the kinds of features that might be evaluated to determine if a site should be further reviewed to see if it contains malware:

Small iFrames that may be indicative of an attempt to embed other HTML documents (e.g., malware-related) inside a main document.
A bad or suspicious URL that may match a URL on a known list of malware-infected domains.
Suspicious script language containing certain function calls or language elements that are known to be used in serving malware.
Multiple frames, scripts or iFrames appearing in unusual places, such as at the end of the HTML.
A domain name that has been shown to have malware installed on other pages.
Geographical features, such as a domain being from Russia or China or any other country statistically known to have higher rates of infected domains.
A page with a domain originating in one country with an iframe that contains content originating from a geographically remote location.
Relative age of a domain and URLs or links to that domain (malware is often distributed from new sites).

The amount of weight that each of these features carries may be different and may change over time as the system goes through and learns from training data.
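
To make the idea of weighted intrusion features concrete, here is a minimal Python sketch that extracts three of the features listed above (small iframes, URLs on a known-bad list, and suspicious script calls) and adds up a weighted score. Everything specific in it is an assumption made for illustration: the patent does not publish a blocklist, function names, weights, or a threshold, so KNOWN_BAD_DOMAINS, SUSPICIOUS_CALLS, WEIGHTS, and REVIEW_THRESHOLD are invented values.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

# Hypothetical values for illustration only; none of these come from the patent.
KNOWN_BAD_DOMAINS = {"malware.example"}
SUSPICIOUS_CALLS = ("eval(", "unescape(", "document.write(")
WEIGHTS = {"small_iframe": 2.0, "bad_url": 3.0, "suspicious_script": 1.5}
REVIEW_THRESHOLD = 3.0


def _is_small(value, limit=2):
    """True if a width/height attribute is a pixel count at or below limit."""
    try:
        return int(str(value).strip().lower().replace("px", "")) <= limit
    except ValueError:
        return False


class IntrusionFeatureParser(HTMLParser):
    """Counts a few of the intrusion features described in the list above."""

    def __init__(self):
        super().__init__()
        self.features = {name: 0 for name in WEIGHTS}
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "iframe":
            if _is_small(attrs.get("width")) and _is_small(attrs.get("height")):
                self.features["small_iframe"] += 1
            self._check_url(attrs.get("src"))
        elif tag == "script":
            self._in_script = True
            self._check_url(attrs.get("src"))

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Only inline script text is scanned for suspicious function calls.
        if self._in_script:
            self.features["suspicious_script"] += sum(
                data.count(call) for call in SUSPICIOUS_CALLS
            )

    def _check_url(self, url):
        if url and urlparse(url).hostname in KNOWN_BAD_DOMAINS:
            self.features["bad_url"] += 1


def score_page(html):
    """Return (weighted score, raw feature counts) for one landing page."""
    parser = IntrusionFeatureParser()
    parser.feed(html)
    parser.close()
    score = sum(WEIGHTS[name] * count for name, count in parser.features.items())
    return score, parser.features


def needs_review(html):
    """A page at or above the threshold becomes a candidate for further testing."""
    return score_page(html)[0] >= REVIEW_THRESHOLD
```

In a system like the one the patent describes, the weights would be learned from training data rather than typed in by hand, and a page that crosses the threshold would only be passed along for closer inspection, not declared infected outright.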

The Google Malware Detection patent is:

Intrusive feature classification model
Invented by Mark Palatucci, Panayiotis Mavrommatis, Niels Provos, Christopher K. Monson, Yunkai Zhou, Kamal P. Nigam, Clayton W. Bavor, Jr., Eric L. Davis, Rachel Nakauchi
Assigned to Google Inc.
US Patent 7,991,710
Granted August 2, 2011
Filed: March 3, 2008

Abstract

Landing pages associated with advertisements are partitioned into training landing pages and testing landing pages. Iterative training and testing of a classification model on intrusion features of the partitioned landing pages is conducted until the occurrence of a cessation event. Feature weights are derived from the iterative training and testing and are associated with the intrusion features. The associated feature weights and intrusion features can be used to classify other landing pages.
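
The loop the abstract describes (partition the pages, train and test repeatedly, stop at a cessation event, keep the resulting feature weights) might look something like the following Python sketch. The abstract does not name a learning algorithm, so logistic regression trained by stochastic gradient descent is swapped in here purely as a stand-in. The cessation event is assumed to be either reaching a target accuracy on the testing pages or hitting a round limit, and all function names and parameter values are hypothetical.

```python
import math
import random


def predict(weights, bias, features):
    """Estimated probability that a page with these features has malware."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def accuracy(weights, bias, pages):
    """Fraction of (features, label) pairs classified correctly at a 0.5 cutoff."""
    correct = sum((predict(weights, bias, x) >= 0.5) == bool(y) for x, y in pages)
    return correct / len(pages)


def train_until_cessation(pages, test_fraction=0.25, learning_rate=0.5,
                          target_accuracy=0.95, max_rounds=200, seed=0):
    """Partition pages, then alternate training and testing until a cessation event.

    pages is a list of (feature_vector, label) pairs, with label 1 for malware.
    Returns (weights, bias, rounds_run, final_test_accuracy).
    """
    rng = random.Random(seed)
    pages = pages[:]
    rng.shuffle(pages)
    split = int(len(pages) * (1 - test_fraction))
    training, testing = pages[:split], pages[split:]

    weights, bias = [0.0] * len(pages[0][0]), 0.0
    for rounds in range(1, max_rounds + 1):
        # Training pass: nudge each weight against the prediction error.
        for features, label in training:
            error = predict(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
        # Testing pass: stop once the held-out pages are classified well enough.
        test_accuracy = accuracy(weights, bias, testing)
        if test_accuracy >= target_accuracy:
            break
    return weights, bias, rounds, test_accuracy
```

Each feature vector here could be something as simple as [has_small_iframe, has_bad_url, is_new_domain] with 0 or 1 in each position. The weights that come out of the loop are what would then be applied to landing pages the system has not seen before.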

It’s possible that some of the features in this malware detection system may also be used on webpages that Google comes across that aren’t tied to advertisements and landing pages, as part of Google’s Safe Browsing Diagnostic Program, since Google wouldn’t want to deliver searchers to pages that contain malware.

Conclusion

It’s interesting to see how Google may use a document classification approach, as described in the Predicting Bounce Rates paper, to try to help advertisers build more effective ads and landing pages, and how Google may use a similar approach to evaluate landing pages for malware. The paper was originally published in 2009 and the patent was filed in 2008, and we’ve been told by Google’s Matt Cutts and Amit Singhal that this was roughly the same time period in which work on the document classification system behind Panda started.

There has been a lot of speculation about what types of features might be involved in the classification of pages for quality in Google’s Panda, including things like the numbers, sizes, and locations of advertisements, the ratio of advertisements to content, reading levels, originality of content, and many others. The number of actual features could be fairly large, and like the intrusion features in this Google malware detection classification approach, the weight given to each feature may change over time based upon the training data used.