Missing Content in Search Engines and Difficult Queries

Missing Content Can Lead to Bad Search Results

There are times when you perform a search in a search engine, and the results just aren’t very relevant.

When you don’t get the results that you expect from an internet or intranet search engine, is it because the search engine isn’t very good? Or is it because the web, or the intranet document repository, doesn’t hold much indexable information related to that search?

A new patent application discusses how the people who run search engines might identify difficult queries, where the search engine may not have collected much content on a topic. The process in the patent filing gives search engines the chance to suggest alternative queries that may lead searchers to the answers they are looking for, or to direct indexing efforts toward filling those gaps.

The best introduction to the patent filing is probably a couple of pages from IBM which discuss the efforts of the researchers who came up with this process:

Estimating the difficulty of queries submitted to search engines
Machine Learning for Information Retrieval

The missing content patent application:

Detection of missing content in a searchable repository
Invented by Andrei Z. Broder, David Carmel, Adam Darlow, Shai Fine, Elad Yom-Tov
Assigned to IBM
US Patent Application 20070016545
Published January 18, 2007
Filed July 14, 2005

Abstract

A method and system for the detection of missing content in a searchable repository are provided. A system includes: a missing content query identifier (401) for identifying queries to a search engine (102) for which no or little relevant content is returned; a missing content detector (110) which clusters missing content queries by topic; and an output provider for providing details of a missing content topic.

While the focus of this patent application is on enterprise search and IBM’s efforts in providing a robust search feature, it provides some insights into how and why a search engine might be fine-tuned to provide more relevant results to searchers. Testing search quality means developing ways to test the content and coverage of the searchable information within a search engine.

The process for detecting missing content involves:

Identifying queries to a search engine for which no or little relevant content is returned,
Clustering missing content queries by topic, and
Providing details of a missing content topic.
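
The second and third steps can be illustrated with a small Python sketch. This post doesn't describe how the clustering is done, so the example swaps in a simple greedy word-overlap (Jaccard) approach purely for illustration. The function names, the stopword list, the 0.3 threshold, and the sample queries are all hypothetical, and the input is assumed to be a list of queries already flagged as returning little or no relevant content.

```python
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"a", "an", "the", "of", "for", "to", "in", "on", "how", "and"}


def tokens(query):
    """Lowercase a query, split on whitespace, and drop stopwords."""
    return {w for w in query.lower().split() if w not in STOPWORDS}


def jaccard(a, b):
    """Word-overlap similarity between two token sets (0.0 to 1.0)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def cluster_queries(queries, threshold=0.3):
    """Greedy single-pass clustering.

    Each query joins the first cluster whose seed (first) query is
    similar enough; otherwise it starts a new cluster.
    """
    clusters = []
    for q in queries:
        q_tokens = tokens(q)
        for cluster in clusters:
            if jaccard(q_tokens, tokens(cluster[0])) >= threshold:
                cluster.append(q)
                break
        else:
            clusters.append([q])
    return clusters


def describe_topic(cluster, top_n=3):
    """Label a cluster with its most frequent words (ties broken alphabetically)."""
    counts = Counter(w for q in cluster for w in tokens(q))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]


# Hypothetical queries already identified as "missing content" queries.
missing = [
    "vacation carryover policy",
    "carryover vacation days",
    "printer driver download",
    "vacation policy carryover 2007",
    "download printer driver windows",
]

clusters = cluster_queries(missing)
# Report the largest topics first, as an output provider might.
for cluster in sorted(clusters, key=len, reverse=True):
    print(len(cluster), describe_topic(cluster), cluster)
```

With these sample queries, the vacation queries land in one cluster and the printer queries in another. A real system would likely need stemming, a better similarity measure, and a proper clustering algorithm, but the shape of the output (a topic label plus the queries behind it) is the kind of detail an administrator could act on.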

Identifying missing content in response to queries could be done by looking at explicit responses from searchers who provide user feedback. It’s more likely in an enterprise setting that people will provide feedback about not being able to find things, but user feedback about search results can be used by web-based search engines, too.

More implicit indicators of irrelevant searches can be found by looking at how people respond to searches. Do people click through the results of searches? Do they scroll down those pages and spend time on them? If in response to certain queries, people rarely click on the results they are shown or don’t spend much time on those pages, there may be an issue with those results.
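
As a rough illustration of those implicit signals, the Python sketch below flags queries whose results are rarely clicked or whose clicked pages are abandoned quickly. Everything in it is an assumption made for the example: the `QueryStats` record, the threshold values, and the sample log are hypothetical and do not come from the patent filing.

```python
from dataclasses import dataclass


@dataclass
class QueryStats:
    impressions: int            # times the results page was shown
    clicks: int                 # times any result was clicked
    total_dwell_seconds: float  # time spent on clicked pages, summed


def is_missing_content_candidate(stats, min_impressions=20,
                                 max_ctr=0.10, min_avg_dwell=15.0):
    """Return True if a query looks like it has no satisfying results.

    All thresholds are illustrative guesses, not values from the patent.
    """
    if stats.impressions < min_impressions:
        return False  # too little data to judge either way
    if stats.clicks == 0:
        return True   # shown often, never clicked
    click_through_rate = stats.clicks / stats.impressions
    if click_through_rate < max_ctr:
        return True   # searchers rarely click any result
    avg_dwell = stats.total_dwell_seconds / stats.clicks
    return avg_dwell < min_avg_dwell  # clicked, but quickly abandoned


# Hypothetical aggregated query log.
log = {
    "vacation carryover policy": QueryStats(120, 4, 30.0),
    "cafeteria menu": QueryStats(200, 150, 6000.0),
    "expense form": QueryStats(80, 40, 200.0),
    "rare query": QueryStats(3, 0, 0.0),
}

flagged = [q for q, s in log.items() if is_missing_content_candidate(s)]
print(flagged)
```

The minimum-impressions check is there so that a query seen only a handful of times isn't flagged on thin evidence. Scrolling behavior, which is also mentioned above, is left out of this sketch.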

A third approach relies upon machine learning that focuses on indications of low satisfaction with queries.

Missing Content Conclusion

The writers of this patent application note a few benefits from this method for enterprise search:

Query suggestions can be offered to searchers to help them find what they are looking for.
Intranet administrators may be able to identify information that may not be presented in a search engine friendly way.
Document creators may be able to locate topics that the intranet should have more information about, and can add that information.

That last benefit is something that creators of web pages should pay attention to as well. If the information in a certain field or market tends to be hidden behind user logins, or appears on pages that aren’t very search engine friendly, search results for queries about that information may not be very competitive.

There are areas where a Web search engine may be returning results that aren’t very relevant. The fault may not be with the search engine; rather, the information may be missing content that isn’t out on the web in a search engine friendly form.