Archive for the ‘Artificial Intelligence’ Category

The Future of Web Search
Part Nine: How We’re Making Search Better Today

Friday, May 2nd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

So, now that we actually have a mapping from all URLs to all categories for our dynamic ontology – considered by some to be the holy grail of web-scale machine learning – the question becomes: What do we do with this data? We’ve shown that it’s feasible (and we encourage you to test us), but what are we doing with it to make search better for our users?

So far, we’ve chosen one possible use for our ontology – presenting categories that help disambiguate queries and get users to the information they want more quickly. Currently, in our beta, we have more than 200 categories. Some are subjects like “golf”, while others are page types like “blogs” or “political news”. The categories were chosen to reflect how users actually use the Web, as opposed to how an academic might divide up a domain.

Each category is thoroughly tested to ensure sufficient accuracy, though nothing is perfect. While testing, it’s exciting to me when random or obscure queries come back with perfect category suggestions. I have also seen cases where the ontology was not ideal or a category definition just didn’t work. For example, sites listing current spreads for a football game previously came back as “gambling and casinos”. Using our system, it took very little time to change the definition in the ontology and apply the changes (thus reclassifying our entire index). In addition, this work was done by a data analyst/ontologist, with no engineering resources required. This is only possible with a powerful system that supports our dynamic ontology and the ability to rapidly train and classify.

Every day I wake up excited to go to the office. I get to work with experts in the area of web search and machine learning to further enhance what we can do. We are already planning to grow the number of exposed categories, improve our accuracy, and further reduce the training costs of new verticals. We’ve got a long way to go, but I think this is the start of something that will radically change web search and the expectations of large-scale search engines. It has already been an exciting ride, and we have barely begun.

This concludes our series on ‘The Future of Web Search’, by Dr. Eric Glover, Searchme’s Classification Architect. Please check back as other Searchme team members will be posting on the topic as well.

The Future of Web Search
Part Eight: We Really Can Get To Better Search

Thursday, May 1st, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

At Searchme, we’re very aware of the difficulties in building a web-scale automatic classification system that is fast and accurate and maps to a deep, dynamic ontology. In fact, in the previous post, we discussed how it was almost impossible.

However, as you may have guessed by now, at Searchme we are using a dynamic ontology to create our “categories” feature. Please feel free to give it a try – pick some query, choose a category, and decide for yourself how accurate our classifiers feel.

Here’s how we are able to do what many have long considered impossible:

First, we define our own ontology. This means we can easily adapt it to better match how users search the Web, as well as match what works best from a categorization standpoint. Simply put, if a category doesn’t work, we can change the rules of the game by picking a slightly different definition – one that would have fewer errors.

Second, we use complex models for classification (non-linear SVMs), as well as more complex features (not limited to bag of words). This richer set of features reduces the chance that a document with a few golf terms will be considered “golf”. A simple linear model assigns a fixed weight to the word “eagle”, independent of the context, which leads to more misclassifications than a non-linear model. Non-linear classifiers, by contrast, enable us to learn subtler concepts: “eagle” together with “flying” makes “eagle” negative with respect to golf, while “eagle” together with “birdie” makes “eagle” positive for golf.
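To make the “eagle” example concrete, here is a minimal sketch in Python. It is not Searchme’s classifier (the post only says that non-linear SVMs are used). The weights, the word lists, and the idea of attaching weights to word pairs are all assumptions picked for illustration.

```python
# Illustrative only: hand-picked weights, not a trained model.

def linear_score(words, weights):
    """Linear scorer: each word adds a fixed weight, whatever its context."""
    return sum(weights.get(w, 0.0) for w in set(words))

def pairwise_score(words, pair_weights):
    """A simple non-linear scorer: weights attach to co-occurring word pairs,
    so the contribution of "eagle" depends on the words around it."""
    present = set(words)
    return sum(
        weight
        for (first, second), weight in pair_weights.items()
        if first in present and second in present
    )

# Hypothetical weights for the "golf" category.
LINEAR_WEIGHTS = {"eagle": 1.0, "birdie": 1.0, "flying": -0.5}
PAIR_WEIGHTS = {("eagle", "birdie"): 2.0, ("eagle", "flying"): -2.0}

golf_page = ["eagle", "birdie", "par"]
bird_page = ["eagle", "flying", "nest"]

# In the linear scorer "eagle" always adds +1.0, so the bird page still
# ends up with a positive golf score.
print(linear_score(golf_page, LINEAR_WEIGHTS), linear_score(bird_page, LINEAR_WEIGHTS))

# In the pairwise scorer the sign of the "eagle" contribution flips
# depending on whether it appears with "birdie" or with "flying".
print(pairwise_score(golf_page, PAIR_WEIGHTS), pairwise_score(bird_page, PAIR_WEIGHTS))
```

A kernel SVM learns this kind of interaction from data rather than from hand-written pair weights. The sketch only shows why a context-dependent weight is out of reach for a model that gives each word a single number.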

Third, we’ve incorporated technologies for rapid training. These technologies reduce the amount of data and human effort required to train a classifier, keeping it at a manageable level without sacrificing final accuracy.

All of these factors are integrated into our core production system, which we designed from the ground up with the future of search in mind. Using the ideas of dynamic ontologies, we can be agile when new categories are needed or definitions change, and with our rapid training capabilities, we can adjust in weeks or months as opposed to years.

Conclusion – Part Nine: How We’re Making Search Better Today

The Future of Web Search
Part Seven: Classifying the Whole Web

Wednesday, April 30th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

Classifying the whole Web is challenging. First, a single category could require hundreds or thousands of cleanly labeled examples. For a Bayesian classifier, a common rule of thumb is that you need at least twice as many training examples as features (think bag of words, with hundreds or thousands of features). This is fine for one or even ten categories, but not for thousands.

Second, it is one thing to classify clean and proper English news articles or academic pages, but it is another thing entirely to classify a random blog post or a site’s home page where the only “content” is “you need Flash to view this page.”

Third, and most important for a commercial application, even 99.9% overall accuracy would not be good enough when it comes to very specific categories because, simply put, the number of false positives can outweigh the number of true positives for low-frequency categories. Since the Web is so large, the number of false positives (pages which are not about golf, but contain many “golf words”) could easily be more than the number of actual golf pages.

For example, pages about Amazon development might contain “eagle”, “wood” and “iron”, or pages about tigers might mention “tiger” and “woods”. Likewise, what about the pages where someone is talking about how they “hit a hole in one” with their presentation about economics? Even 99.9% overall (balanced) accuracy could mean that every other page that is predicted to be in the “golf” category is an error; therefore even such a high accuracy is too low to work, though it might win a best paper award at an academic conference.
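The “every other page is an error” claim can be checked with a little arithmetic. The numbers below are assumptions chosen for illustration, not measurements: a 10-billion-page index, golf making up 0.1% of it, and a classifier that is right on 99.9% of golf pages and on 99.9% of non-golf pages.

```python
def precision_at_scale(total_pages, category_fraction, sensitivity, specificity):
    """Return (true positives, false positives, precision) for one category.

    sensitivity: fraction of in-category pages correctly flagged.
    specificity: fraction of out-of-category pages correctly rejected.
    """
    positives = total_pages * category_fraction
    negatives = total_pages - positives
    true_pos = positives * sensitivity
    false_pos = negatives * (1.0 - specificity)
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

# Assumed inputs (hypothetical): 10 billion pages, 0.1% of them about golf.
true_pos, false_pos, precision = precision_at_scale(
    total_pages=10_000_000_000,
    category_fraction=0.001,
    sensitivity=0.999,
    specificity=0.999,
)
print(f"true positives:  {true_pos:,.0f}")
print(f"false positives: {false_pos:,.0f}")
print(f"precision:       {precision:.3f}")
```

Under these assumptions the two counts are nearly equal, so precision comes out at about one half. The non-golf pages outnumber golf pages a thousand to one, and a 0.1% error rate on that much larger pool produces as many mistakes as there are real golf pages.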

These reasons make it seem virtually impossible and/or impractical to do large-scale classification with a dynamic ontology. Are there any ways to get around these challenges?

Next – Part Eight: We Really Can Get To Better Search

The Future of Web Search
Part Six: Creating A Large-Scale, Dynamic Ontology

Tuesday, April 29th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

How do we create an automated, flexible ontology that will enable a more personalized and more flexible search experience?

First, the requirements: 1.) It has to be commercially viable; i.e. it can’t take 100 years to classify a billion web pages or require a thousand volunteers to label training examples. 2.) It has to be precise; if you say a page is about golf, it has to be about golf nearly all the time. 3.) It has to have sufficient recall; i.e. it has to identify nearly all of the relevant pages that belong to a given category – recognize all golf pages as golf.
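The second and third requirements correspond to the standard measures of precision and recall. As a small sketch, with invented page identifiers, they can be computed for one category like this:

```python
def precision_recall(predicted, actual):
    """predicted, actual: sets of page identifiers for a single category.

    precision = share of predicted pages that really belong to the category.
    recall    = share of the category's pages that were found.
    """
    true_pos = len(predicted & actual)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical data: four real golf pages, three pages predicted as golf.
actual_golf = {"page1", "page2", "page3", "page4"}
predicted_golf = {"page1", "page2", "page9"}

precision, recall = precision_recall(predicted_golf, actual_golf)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here two of the three predictions are correct (precision of about 0.67) and two of the four golf pages are found (recall of 0.5).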

All automated text classification systems require the same basic recipe: Obtain “good” training data, learn a model from the training data, and apply that model to the unlabeled documents.

When doing classification, all documents must be represented in a mathematical way that corresponds to a set of “features”. The simplest set of features is called “bag of words”, which is common in Information Retrieval (IR) literature. In a “bag of words”, each unique word in a document corresponds to a single feature.
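As a rough illustration, a bag of words can be built in a few lines of Python. The tokenization rule (lowercase, letters and apostrophes only) and the sample sentence are assumptions. This sketch stores a count per word, though a plain present/absent flag is another common choice.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Lowercase the text, pull out runs of letters, and count each unique word."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

features = bag_of_words("He hit the ball with a wood, then hit an iron to make par.")

# Each unique word is one feature; word order is discarded.
print(features["hit"], features["par"], features["golf"])
```

Because word order is thrown away, “tiger woods” and “woods tiger” produce the same features, which is part of why bag of words alone struggles with ambiguous terms.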

A very simple classifier might try to find documents that have words “similar” to positive training examples for a given category. For example, if the training documents in the “golf” category often contain the words “ball”, “iron”, “wood” and “par”, then documents that contain all of those words are likely to be considered as “golf”. (If you are interested, most academic – and commercial – text classification systems use a Bayesian classifier or a linear SVM as their “mathematical model.”)
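The recipe of training data, model, and application can be sketched with a tiny naive Bayes classifier, one kind of “Bayesian” model. This is a toy written for illustration, not the system described in these posts. The four training documents and the add-one smoothing are assumptions.

```python
import math
from collections import Counter

def train(labeled_docs):
    """Learn per-label word counts from (list_of_words, label) pairs."""
    word_counts, doc_counts = {}, Counter()
    for words, label in labeled_docs:
        doc_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(words)
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(words, model):
    """Return the label with the highest log-probability under naive Bayes."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in doc_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)  # log prior
        for w in words:
            if w in vocab:  # skip words never seen during training
                # Add-one smoothing avoids log(0) for unseen word/label pairs.
                score += math.log((word_counts[label][w] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Step 1: hypothetical labeled training data.
training = [
    ("ball iron wood par green".split(), "golf"),
    ("par birdie iron putt".split(), "golf"),
    ("forest wood tree river".split(), "other"),
    ("iron steel factory metal".split(), "other"),
]
# Step 2: learn a model. Step 3: apply it to unlabeled documents.
model = train(training)
print(classify("iron par putt".split(), model))
print(classify("tree river forest".split(), model))
```

With this toy data the first document is labeled “golf” and the second “other”. Note that “iron” and “wood” appear in both classes, which hints at the ambiguity problems a web-scale system has to deal with.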

All this sounds fairly straightforward, but there are many daunting challenges when trying to apply this procedure to classifying the whole Web.

Next – Part Seven: Classifying the Whole Web

The Future of Web Search
Part Five: Towards A Large, Dynamic Ontology

Monday, April 28th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

Because it’s very difficult to create a deep, useful ontology that actually works and stands the test of time, it is understandable that search engines have moved towards the simple approach of “do you want news, images or blogs?” served with a side of “search only in English.” Is there a way to create a better ontology that actually works and is flexible over time?

The answer is yes. We have the ability to create a responsive ontology, one that is rapidly adaptable. With a dynamic ontology, if someone new becomes famous next week, we could create and apply a category for that person, instantly. Likewise, if an existing category changes or becomes obsolete, it is easy to adjust.

Being both deep (many categories) and flexible, however, makes it difficult to effectively map web pages onto such an ontology, because things are changing all the time. In addition, the rapidly changing definitions mean the potential maintenance costs may be prohibitive – especially if each change requires thousands of labeled examples and days of training. So how do you have a classification system on such a grand scale – one that makes business sense?

There are hundreds of academic papers on how to do text classification, but few methods are viable when applied to billions of web pages and hundreds or thousands of categories. Typically, they are too slow, not accurate enough, or too expensive to train/maintain.

How do we create a dynamic ontology that really works for large-scale web classification?

Next – Part Six: Creating A Large-Scale, Dynamic Ontology

The Future of Web Search
Part Four: How Ontology Gives Us Better Search

Friday, April 25th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

What is an ontology, and what does it have to do with web search?

An ontology is “a systematic arrangement of all of the important categories of objects or concepts which exist in some field of discourse, showing the relations between them.” (Wordnet.) Or to simplify, an advanced “topic hierarchy”.
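In its simplest “topic hierarchy” form, an ontology can be pictured as a tree of categories. The sketch below uses child-to-parent links. The category names are made up for illustration and are not Searchme’s actual categories.

```python
# Hypothetical categories: each entry maps a category to its parent
# (None marks a top-level topic).
PARENT = {
    "sports": None,
    "golf": "sports",
    "golf equipment": "golf",
    "science": None,
    "astronomy": "science",
}

def ancestors(category, parent=PARENT):
    """Walk from a category up to its root, listing the broader topics."""
    path = []
    current = parent[category]
    while current is not None:
        path.append(current)
        current = parent[current]
    return path

def children(category, parent=PARENT):
    """List the categories directly beneath the given one."""
    return sorted(name for name, up in parent.items() if up == category)

print(ancestors("golf equipment"))
print(children("sports"))
```

A real ontology also records relations other than “is a subtopic of”, but the tree is enough to show how a page tagged “golf equipment” is implicitly also about “golf” and “sports”.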

Many web sites use an ontology. For example, dating sites let you select by gender, age and location. Shopping sites let you search by color, style, price or inventory. In each case, the site uses a “domain-specific” ontology – all the content on the site is described by and fits into its ontology.

An ontology needs two things to be effective: It needs to make sense for the site, and the content it references must meaningfully map onto it.

So, when it comes to large-scale, general-purpose web search, you can see the problem. First, because a search engine is general-purpose and users can query for anything, there doesn’t exist a small set of “topics” that will cover every query. Second, because the Web is a collection of tens of billions of pages of varying quality, all created by a variety of “users”, it’s difficult for a company to accurately map what’s out there onto any ontology.

It’s easy enough to make up a bunch of categories, but it’s hard to make ones that will stand the test of time. Furthermore, if you do make ones that last, odds are that you will have a shallow ontology; we don’t know who the president will be in 2020, or who’ll be the biggest movie star in 2015. In addition, what if the definition of a category changes – what if the European Union gets a new country? The previous “EU” category becomes obsolete.

So how do you create a meaningful ontology?

Next – Part Five: Towards A Large, Dynamic Ontology

The Future of Web Search
Part Three: How Search Works Right Now, Cont’d.

Thursday, April 24th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

Search companies aren’t clueless about the fact that users have different needs. To deal with an ambiguous query, most will use some type of “mixing”.

With “Saturn”, for example, they will present results from both the car company and the planet. Some even go a step further and offer “related searches”, which help a user by presenting queries that others might have asked, such as “Saturn cars” or “Saturn Vue”.

Unfortunately, commercial search engines don’t do “mixing” based on knowledge about the explicit meanings of a query or of web pages. But what would happen if they did? What would happen if search engines actually “understood” that “Saturn” had multiple meanings, rather than just presenting different results that were manually “mixed”? What if they knew that www.saturn.com was about “car companies” and that “en.wikipedia.org/wiki/Saturn” was about “astronomy”? Could they use this knowledge to help separate out results by meaning, thus reducing a user’s difficulty in locating what they want?
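If such a URL-to-category mapping existed, separating results by meaning would be a simple grouping step. The sketch below assumes a hypothetical lookup table. The two Saturn URLs come from the paragraph above, and the example.org URL is a made-up placeholder for a page with no known category.

```python
from collections import defaultdict

# Hypothetical URL -> category lookup.
URL_CATEGORY = {
    "www.saturn.com": "car companies",
    "en.wikipedia.org/wiki/Saturn": "astronomy",
}

def group_by_category(result_urls, url_category):
    """Bucket ranked results by category, keeping the ranking order in each bucket."""
    groups = defaultdict(list)
    for url in result_urls:
        groups[url_category.get(url, "uncategorized")].append(url)
    return dict(groups)

results = [
    "www.saturn.com",
    "en.wikipedia.org/wiki/Saturn",
    "example.org/saturn-page",  # placeholder URL, not in the lookup table
]
for category, urls in group_by_category(results, URL_CATEGORY).items():
    print(category, urls)
```

A user who wants the planet could then pick “astronomy” and skip the car results entirely, instead of rephrasing the query.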

The ability to “separate out by meaning” brings us to the subject of ontology – which is what we think is required to create the search of the future.

Next – Part Four: How Ontology Gives Us Better Search

The Future of Web Search
Part Two: How Search Works Right Now

Wednesday, April 23rd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

Henry Ford once said: “If I’d asked my customers what they wanted, they’d have said a faster horse.” In thinking about search right now, most of us wouldn’t compare our favorite search engine to a horse. But in reality, somebody like Henry Ford has already decided for us how search will be, what results we will see, and how they will be presented.

A lot of people say, “Who gives a crap? I’m happy with how search works. Why do we need to change it?”

Here’s why: Just like those horse owners, we are so used to what we know that we don’t think about the fact that things could be even better. And, just because we aren’t screaming for a better search engine doesn’t mean that we wouldn’t use one when it showed up.

Before we start thinking about what search could be, though, let’s look at why search is currently the way it is.

The main reason is that most search engines see users as numbers, not as individuals. Most search engines assume that: 1.) There’s a perfect set of results out there for every query; and 2.) What’s perfect for one user will be perfect for everyone.

To come up with this “perfect” set of results and corresponding ranking, search companies pick a bunch of pre-defined queries and then pay people to judge the relevance of their results, and perhaps how they compare with other engines’ results. The problem with this is that we all know there’s no “perfect” set of results. It varies with each user, as does the best ranking of results.

So how do search companies currently deal with the fact that there’s no one perfect set of results?

Next – Part Three: How Search Works Right Now, Cont’d.

The Future of Web Search
Part One: What Could The Future Look Like?

Tuesday, April 22nd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect.

Imagine the Internet five years from now: As you begin to type a word into a search engine, it seems to know you personally. It naturally gravitates towards your unique interests and preferences. Rarely do you need to type more than one or two words before it shows you exactly what you’re looking for. On the occasional instance when it doesn’t correctly guess your intention, it’s easy to correct and quickly get to what you want.

For example, a student doing research for a school science project sees only science web sites that are appropriate for someone his age. A few hours later, he searches for information on his favorite video game, and he’s able to easily re-focus the engine on reviews and downloadable expansion packs.

By no means am I the first to postulate this future vision where your search engine seems to know you personally. But despite nearly ten years of artificial intelligence (AI) research in this area, we’re still not there. Why?

How do we get from here – a world where most people view search engines as big bookmark replacements – to there – a world where search engines are even more useful for real research and seem to know us personally, demonstrating the flexibility we all dream about? Are we moving in the right direction? Is it possible? Does anyone even care?

At Searchme, we are working to move toward this future and prove that it is possible by demonstrating some of the initial steps to get there. It’s extremely challenging, rewarding and exciting. Over the next few posts, I will go into detail on the real challenges to creating better search, what has already been done, and how we are starting to move into this future.

Next – Part Two: How Search Works Right Now