The Future of Web Search
Part Six: Creating A Large-Scale, Dynamic Ontology

Tuesday, April 29th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

How do we create an automated, flexible ontology that will enable a more personalized and more flexible search experience?

First, the requirements:

1.) It has to be commercially viable; i.e., it can’t take 100 years to classify a billion web pages or require a thousand volunteers to label training examples.
2.) It has to be precise; if the system says a page is about golf, the page has to be about golf nearly all the time.
3.) It has to have sufficient recall; i.e., it has to identify nearly all of the pages that belong to a given category, recognizing nearly all golf pages as golf.
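To make the precision and recall requirements concrete, here is a minimal sketch of how the two numbers would be computed for a single category. The function name `precision_recall` and the page IDs are made up for illustration; they are not part of Searchme's system.

```python
def precision_recall(predicted, relevant):
    """Return (precision, recall) for one category.

    predicted: set of page IDs the classifier labeled with the category.
    relevant:  set of page IDs that truly belong to the category.
    """
    true_positives = len(predicted & relevant)
    # Precision: of the pages we called "golf", how many really are golf?
    precision = true_positives / len(predicted) if predicted else 0.0
    # Recall: of the pages that really are golf, how many did we find?
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Illustrative data only: five pages labeled "golf", four of them correctly,
# out of eight pages that are really about golf.
predicted_golf = {"p1", "p2", "p3", "p4", "p9"}
actual_golf = {"p1", "p2", "p3", "p4", "p5", "p6", "p7", "p8"}

precision, recall = precision_recall(predicted_golf, actual_golf)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

In this toy case the classifier is right four times out of five (precision 0.80) but finds only half of the golf pages (recall 0.50), so it would meet neither requirement.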

All automated text classification systems follow the same basic recipe: obtain “good” training data, learn a model from that data, and apply the model to unlabeled documents.

For classification, each document must be represented mathematically as a set of “features”. The simplest representation is called a “bag of words”, which is common in the Information Retrieval (IR) literature. In a “bag of words”, each unique word in a document corresponds to a single feature.
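As a rough illustration, a bag of words can be sketched in a few lines of Python. The tokenization rule (lowercase, split on anything that is not a letter, digit, or apostrophe) and the function name `bag_of_words` are assumptions made for this example; real systems make many more choices here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each unique lowercase word in `text` to the number of times it occurs."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


doc = "He hit the ball with a 7 iron, and the ball landed near the pin."
features = bag_of_words(doc)

print(features["ball"])  # how often the feature "ball" occurs in this document
print(len(features))     # how many distinct features this document has
```

Note that word order is thrown away: the representation only records which words occur and how often.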

A very simple classifier might try to find documents whose words are “similar” to those of the positive training examples for a given category. For example, if the training documents in the “golf” category often contain the words “ball”, “iron”, “wood” and “par”, then documents that contain all of those words are likely to be classified as “golf”. (If you are interested, most academic and commercial text classification systems use a Bayesian classifier or a linear SVM as their “mathematical model.”)
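The toy sketch below walks through the recipe of training data, model, and application for the golf example. It is a plain word-overlap score, not a Bayesian or SVM model, and everything in it (the tiny `STOPWORDS` set, the function names, the training sentences, the `min_fraction` parameter) is invented for illustration.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny list of words too common to be informative.
STOPWORDS = {"a", "an", "and", "in", "of", "the", "to", "with"}


def tokenize(text):
    """Return the set of unique lowercase words in `text`, minus stopwords."""
    return set(re.findall(r"[a-z0-9']+", text.lower())) - STOPWORDS


def learn_category_words(positive_docs, min_fraction=0.5):
    """'Learn a model': keep words found in at least `min_fraction` of the docs."""
    doc_counts = Counter()
    for doc in positive_docs:
        doc_counts.update(tokenize(doc))  # each doc counts a word at most once
    threshold = min_fraction * len(positive_docs)
    return {word for word, n in doc_counts.items() if n >= threshold}


def score(doc, category_words):
    """'Apply the model': fraction of the category's words present in `doc`."""
    if not category_words:
        return 0.0
    return len(tokenize(doc) & category_words) / len(category_words)


# Positive training examples for the "golf" category (made up).
golf_training = [
    "He hit the ball with a wood, then an iron, and made par.",
    "With the ball in the rough, she took an iron instead of a wood and saved par.",
    "A par five: wood off the tee, iron to the green, ball in the hole.",
]
golf_words = learn_category_words(golf_training)

golf_page = ("Choosing between a wood and an iron depends on where the ball "
             "lies and the par of the hole.")
other_page = "Iron supplements and wood stoves are sold in the same store."

print(sorted(golf_words))
print(score(golf_page, golf_words), score(other_page, golf_words))
```

With this data the learned words are “ball”, “iron”, “par” and “wood”; the golf page scores 1.0, while the unrelated page still scores 0.5 because “iron” and “wood” are ambiguous. Even this tiny case shows why word overlap alone is not enough.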

All this sounds fairly straightforward, but there are many daunting challenges when trying to apply this procedure to classifying the whole Web.

Next – Part Seven: Classifying the Whole Web