Finding Sites in Other Languages by Searching Hypertext?

I’ve been enjoying visiting many sites written in languages other than English, such as Google.Dirson.com and Referencement, Design et Cie. I often rely on online translation services to read those sites, but I have trouble searching the web for information that isn’t written in English.

It would be nice to have a way to search non-English sites without having to try to translate queries into other languages first.

IBM has a patent filing, published as a patent application last week, that tries to help people find sites in other languages that are relevant to their searches and that might be authority sites on those subjects.

Finding More Searchers by Better Searching Hypertext

One of the fastest-growing groups of users on the web doesn’t speak English. While they may be searching for information in their native languages, they may also want to see results consisting of documents in other languages. The method described in this patent application involves helping us overcome that language barrier without resorting to a translation service to form queries.

Searching hypertext based multilingual web information

Inventors: Ling Zhang
Assignee Name and Address: International Business Machines Corporation
US Patent Application 20060059132
Published March 16, 2006
Filed: July 29, 2005

Abstract

The present invention provides methods, apparatus, and systems for searching hypertext-based multilingual Web information when searching on a network for keywords to be queried. A method includes: a receiving step for receiving keywords input by a user; a native language hypertext searching step for searching on the network, according to the keywords to be queried, for all hypertexts whose representing language is the same as a language representing the keywords and which matches the keywords to be queried; extracting hyperlinks related to an arbitrary language from all the searched hypertexts; a hyperlink ranking step for ranking the extracted hyperlinks according to the correlativity of the hyperlinks with the keywords to be queried; and returning to the user ranked search result. Thereby, an accurate cross-language searching can be provided without extra machine translation effort, being more accurate and objective than machine translation, even than human translation.

This patent application uses anchor text and hyperlinks to sidestep the problems of language translation and help people find authority pages in more than one language based upon a query in their own language.

Here’s one example offered by the patent application on how this could work:

supposing a Chinese Internet user tries to locate the homepage of “Reader’s Digest” magazine, he/she will input “(Reader’s Digest)” (keyword) expressed in Chinese, since many Chinese Web pages include hyperlinks to the Web site of the magazine of “Reader’s Digest” and most of the hypertexts corresponding to the hyperlinks include “Reader’s Digest” expressed in Chinese ( (Reader’s Digest)), by matching the hypertexts with the keyword and analyzing the hyperlink distribution, the URL www (followed by) rd.com of the magazine of “Reader’s Digest” can be retrieved.
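To make that example concrete, here is a minimal Python sketch of the idea as I read it, not the method as the patent application actually specifies it. Everything in it is an assumption for illustration: the `LINKS` crawl data, the `rank_targets` function, the example.org URL, and the Chinese string 读者文摘, which I am using as a stand-in for the keyword. A simple count of matching links stands in for the "correlativity" ranking, which the filing describes in more detail.

```python
from collections import Counter

# Hypothetical crawl data: (language of the linking page, anchor text, target URL).
# The Chinese strings and the example.org URL are illustrative assumptions.
LINKS = [
    ("zh", "读者文摘", "http://www.rd.com/"),
    ("zh", "读者文摘 官方网站", "http://www.rd.com/"),
    ("zh", "读者文摘", "http://www.rd.com/"),
    ("zh", "读者文摘 简介", "http://example.org/about-rd"),
    ("zh", "点击这里", "http://www.rd.com/"),  # "click here": never matches the query
    ("en", "Reader's Digest", "http://www.rd.com/"),
]


def rank_targets(query, query_lang, links):
    """Rank link targets by how many same-language anchors contain the query.

    The target page can be in any language; only the anchor text has to
    match the language and wording of the query.
    """
    counts = Counter(
        url
        for lang, anchor, url in links
        if lang == query_lang and query in anchor
    )
    # most_common() returns (url, count) pairs, highest count first.
    return counts.most_common()


results = rank_targets("读者文摘", "zh", LINKS)
for url, votes in results:
    print(votes, url)
```

In this toy data the English-language rd.com page comes out on top for a Chinese query, and no translation step is involved. Note that the "click here" style link contributes nothing to the ranking.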

This seems fairly simple, and the process of how this could be implemented is spelled out in much more detail in the document. Among other implications it may hold, it gives a good reason not to use “click here” as anchor text on your site, and to choose your anchor text carefully. A search approach like this one can only find pages through links whose anchor text actually describes them.
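If you want to check your own pages for that kind of anchor text, something like the following rough sketch could be a starting point. It uses Python's standard `html.parser` module. The `GENERIC` phrase list, the `AnchorCollector` class, the `generic_anchors` function, and the `SAMPLE` markup are all my own assumptions, not anything taken from the patent application.

```python
from html.parser import HTMLParser

# Assumed list of uninformative anchor phrases; adjust it for your own site.
GENERIC = {"click here", "here", "read more", "more", "link"}


class AnchorCollector(HTMLParser):
    """Collect (href, anchor text) pairs from <a href="..."> elements."""

    def __init__(self):
        super().__init__()
        self.anchors = []
        self._href = None   # href of the link currently open, if any
        self._text = []     # text pieces seen inside that link

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            # Collapse runs of whitespace inside the anchor text.
            text = " ".join("".join(self._text).split())
            self.anchors.append((self._href, text))
            self._href = None


def generic_anchors(html):
    """Return the links whose anchor text is one of the GENERIC phrases."""
    parser = AnchorCollector()
    parser.feed(html)
    parser.close()
    return [(href, text) for href, text in parser.anchors
            if text.lower() in GENERIC]


SAMPLE = (
    '<p>To subscribe, <a href="http://www.rd.com/">click here</a>, or visit '
    '<a href="http://www.rd.com/">Reader\'s Digest</a>.</p>'
)
flagged = generic_anchors(SAMPLE)
for href, text in flagged:
    print(f"Generic anchor text {text!r} points to {href}")
```

Only the first link in the sample is flagged. The second one already tells a search engine, in words, what the page it points to is about.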