The Future of Web Search
Part Seven: Classifying the Whole Web
Wednesday, April 30th, 2008
By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.
Classifying the whole Web is challenging. First, a single category could require hundreds or thousands of cleanly labeled examples – for a Bayesian classifier, a common rule of thumb is that you need at least twice as many training examples as features (think bag of words – hundreds or thousands of features). That is manageable for one or even ten categories, but not for thousands.
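To get a feel for how quickly the labeling burden grows, here is a minimal sketch of that arithmetic in Python. The function name and the sample numbers (1,000 features, 5,000 categories) are illustrative assumptions, not figures from Searchme; only the "twice as many examples as features" ratio comes from the rule of thumb above.

```python
def min_training_examples(num_features, num_categories, examples_per_feature=2):
    """Rough lower bound on labeled examples needed across all categories.

    Assumes each category gets its own classifier and each classifier
    needs `examples_per_feature` labeled examples per feature.
    """
    per_category = examples_per_feature * num_features
    return per_category * num_categories


# Hypothetical numbers, chosen only to show the scale of the problem.
print(min_training_examples(num_features=1000, num_categories=1))
print(min_training_examples(num_features=1000, num_categories=10))
print(min_training_examples(num_features=1000, num_categories=5000))
```

Under these assumptions, one category needs a couple of thousand hand-labeled pages, while thousands of categories need millions.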
Second, it is one thing to classify clean and proper English news articles or academic pages, but it is another thing entirely to classify a random blog post or a site’s home page where the only “content” is “you need Flash to view this page.”
Third, and most important for a commercial application, even 99.9% overall accuracy is not good enough for very specific categories because, simply put, false positives can outnumber true positives when a category is rare. Since the Web is so large, the number of false positives (pages which are not about golf, but contain many “golf words”) could easily exceed the number of actual golf pages.
For example, pages about Amazon development might contain “eagle”, “wood” and “iron”, or pages about tigers might mention “tiger” and “woods”. Likewise, what about the pages where someone is talking about how they “hit a hole in one” with their presentation about economics? Even 99.9% overall (balanced) accuracy could mean that every other page that is predicted to be in the “golf” category is an error; therefore even such a high accuracy is too low to work, though it might win a best paper award at an academic conference.
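The arithmetic behind that claim can be sketched in a few lines of Python. The sketch assumes, purely for illustration, a Web of one billion pages of which 0.1% are really about golf, and it reads “99.9% balanced accuracy” as 99.9% correct on golf pages and 99.9% correct on non-golf pages. None of these figures come from Searchme, and the function names are hypothetical.

```python
def expected_counts(total_pages, prevalence, sensitivity, specificity):
    """Expected true positives and false positives for one category.

    prevalence  : fraction of pages that really belong to the category
    sensitivity : fraction of in-category pages the classifier catches
    specificity : fraction of out-of-category pages it correctly rejects
    """
    in_category = total_pages * prevalence
    out_of_category = total_pages - in_category
    true_positives = in_category * sensitivity
    false_positives = out_of_category * (1.0 - specificity)
    return true_positives, false_positives


def precision(prevalence, sensitivity, specificity):
    """Fraction of pages predicted to be in the category that really are."""
    tp, fp = expected_counts(1.0, prevalence, sensitivity, specificity)
    return tp / (tp + fp)


# Assumed scenario: 1 billion pages, 0.1% of them about golf.
tp, fp = expected_counts(1_000_000_000, 0.001, 0.999, 0.999)
print(f"true positives:  {tp:,.0f}")
print(f"false positives: {fp:,.0f}")
print(f"precision:       {precision(0.001, 0.999, 0.999):.3f}")
```

With these assumed numbers the false positives roughly equal the true positives, so about half of the pages labeled “golf” are wrong. A rarer category makes the ratio worse, since the pool of non-golf pages that can be mislabeled keeps growing relative to the real golf pages.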
Taken together, these challenges make large-scale classification with a dynamic ontology seem virtually impossible, or at least impractical. Are there any ways to get around them?
Next – Part Eight: We Really Can Get To Better Search





