Archive for April, 2008

The Future of Web Search
Part Seven: Classifying the Whole Web

Wednesday, April 30th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Classifying the whole Web is challenging. First, a single category could require hundreds or thousands of cleanly labeled examples – for a Bayesian classifier, a common rule of thumb is that you need at least twice as many training examples as features (think bag of words – hundreds or thousands of features). This is fine for one or even ten categories, but not for thousands.
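To make the scale concrete, here is a back-of-the-envelope sketch of the two-examples-per-feature rule above. The feature count (1,000) and the category counts are illustrative assumptions, not Searchme figures, and the function name is made up for this example.

```python
def labeled_examples_needed(num_features, num_categories, examples_per_feature=2):
    """Rough estimate: every category needs examples_per_feature
    labeled examples for each feature."""
    return examples_per_feature * num_features * num_categories

# Assumed bag-of-words size of 1,000 features (illustrative only).
for categories in (1, 10, 1000):
    total = labeled_examples_needed(1000, categories)
    print(f"{categories:>5} categories -> {total:,} labeled examples")
```

Under these assumptions, a thousand categories would call for millions of hand-labeled pages, which is where the approach stops being practical.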

Second, it is one thing to classify clean and proper English news articles or academic pages, but it is another thing entirely to classify a random blog post or a site’s home page where the only “content” is “you need Flash to view this page.”

Third, and most important for a commercial application, even 99.9% overall accuracy would not be good enough when it comes to very specific categories, because, simply put, the number of false positives can outweigh the number of true positives for low-frequency categories. Since the Web is so large, the number of false positives (pages which are not about golf, but contain many “golf words”) could easily be more than the number of actual golf pages.

For example, pages about Amazon development might contain “eagle”, “wood” and “iron”, or pages about tigers might mention “tiger” and “woods”. Likewise, what about the pages where someone is talking about how they “hit a hole in one” with their presentation about economics? Even 99.9% overall (balanced) accuracy could mean that every other page that is predicted to be in the “golf” category is an error; therefore even such a high accuracy is too low to work, though it might win a best paper award at an academic conference.
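The arithmetic behind that claim can be sketched in a few lines. The numbers are assumptions chosen for illustration: suppose 1 page in 1,000 is really about golf, and the classifier is right 99.9% of the time on golf pages and on non-golf pages alike.

```python
def precision(prevalence, true_positive_rate, true_negative_rate):
    """Fraction of pages predicted to be in a category that really belong to it."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * (1 - true_negative_rate)
    return true_positives / (true_positives + false_positives)

# Assumed values: golf pages are 0.1% of the Web; 99.9% accuracy on both classes.
golf_precision = precision(prevalence=0.001,
                           true_positive_rate=0.999,
                           true_negative_rate=0.999)
print(f"Share of predicted golf pages that are really about golf: {golf_precision:.1%}")
```

With these inputs the share works out to about one half: the non-golf pages are so numerous that even a 0.1% error rate on them produces as many mistakes as there are real golf pages.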

These reasons make it seem virtually impossible and/or impractical to do large-scale classification with a dynamic ontology. Are there any ways to get around these challenges?

Next – Part Eight: We Really Can Get To Better Search

The Future of Web Search
Part Six: Creating A Large-Scale, Dynamic Ontology

Tuesday, April 29th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

How do we create an automated, flexible ontology that will enable a more personalized and more flexible search experience?

First, the requirements: 1.) It has to be commercially viable; i.e. it can’t take 100 years to classify a billion web pages or require a thousand volunteers to label training examples. 2.) It has to be precise; if you say a page is about golf, it has to be about golf nearly all the time. 3.) It has to have sufficient recall; i.e. it has to identify nearly all of the relevant pages that belong to a given category – recognize all golf pages as golf.
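Requirements 2 and 3 are usually measured as precision and recall. Here is a minimal sketch of both for a single category; the page ids are hypothetical and exist only to show the calculation.

```python
def precision_and_recall(predicted, actual):
    """predicted and actual are sets of page ids for one category."""
    correct = len(predicted & actual)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(actual) if actual else 0.0
    return precision, recall

# Hypothetical page ids, for illustration only.
actual_golf = {"p1", "p2", "p3", "p4"}
predicted_golf = {"p1", "p2", "p3", "p9"}
p, r = precision_and_recall(predicted_golf, actual_golf)
print(f"precision={p:.2f} recall={r:.2f}")
```

Precision asks "of the pages we called golf, how many really are?", while recall asks "of the real golf pages, how many did we find?"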

All automated text classification systems require the same basic recipe: Obtain “good” training data, learn a model from the training data, and apply that model to the unlabeled documents.

When doing classification, all documents must be represented in a mathematical way that corresponds to a set of “features”. The simplest set of features is called “bag of words”, which is common in Information Retrieval (IR) literature. In a “bag of words”, each unique word in a document corresponds to a single feature.
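As a rough illustration, a bag of words can be built in a few lines. This sketch also keeps a count per word, which is a common variant and an assumption on my part; the simplest form only records whether a word is present. The tokenizing rule (lowercase letters and apostrophes) is likewise just one reasonable choice.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Map each unique lowercase word in the text to the number of times it occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

features = bag_of_words("He hit the ball with an iron, then hit the ball again.")
print(features["ball"], features["iron"], features["golf"])
```

Word order is thrown away entirely; only which words appear (and how often) survives as features.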

A very simple classifier might try to find documents that have words “similar” to those in the positive training examples for a given category. For example, if the training documents in the “golf” category often contain the words “ball”, “iron”, “wood” and “par”, then documents that contain all of those words are likely to be considered as “golf”. (If you are interested, most academic – and commercial – text classification systems use a Bayesian classifier or a linear SVM as their “mathematical model.”)
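For the curious, here is a minimal sketch of the train-then-apply recipe using a tiny naive Bayes classifier with add-one smoothing. It is a toy under stated assumptions, not a description of Searchme's system: the training documents, the “other” label and the function names are all made up for illustration.

```python
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs is a list of (label, list_of_words) pairs."""
    word_counts = {}        # label -> Counter of words seen in that label
    doc_counts = Counter()  # label -> number of training documents
    vocabulary = set()
    for label, words in labeled_docs:
        word_counts.setdefault(label, Counter()).update(words)
        doc_counts[label] += 1
        vocabulary.update(words)
    return word_counts, doc_counts, vocabulary

def classify(model, words):
    """Return the label with the highest log-probability for the given words."""
    word_counts, doc_counts, vocabulary = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior plus log likelihood of each word, with add-one smoothing.
        score = math.log(doc_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in words:
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Made-up training data, for illustration only.
training = [
    ("golf", "ball iron wood par green".split()),
    ("golf", "par iron ball club".split()),
    ("other", "tiger forest wood river".split()),
    ("other", "economics presentation market".split()),
]
model = train(training)
print(classify(model, "iron ball par".split()))
```

Note that “wood” appears in both classes here, so on its own it says little; it is the combination of words that tips the balance.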

All this sounds fairly straightforward, but there are many daunting challenges when trying to apply this procedure to classifying the whole Web.

Next – Part Seven: Classifying the Whole Web

The Future of Web Search
Part Five: Towards A Large, Dynamic Ontology

Monday, April 28th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Because it’s very difficult to create a deep, useful ontology that actually works and stands the test of time, it is understandable that search engines have moved towards the simple approach of “do you want news, images or blogs?” served with a side of “search only in English.” Is there a way to create a better ontology that actually works and is flexible over time?

The answer is yes. We have the ability to create a responsive ontology, one that is rapidly adaptable. With a dynamic ontology, if someone new becomes famous next week, we could create and apply a category for that person, instantly. Likewise, if an existing category changes or becomes obsolete, it is easy to adjust.

Being both deep (many categories) and flexible, however, makes it difficult to effectively map web pages onto such an ontology, because things are changing all the time. In addition, the rapidly changing definitions mean the potential maintenance costs may be prohibitive – especially if each change requires thousands of labeled examples and days of training. So how do you have a classification system on such a grand scale – one that makes business sense?

There are hundreds of academic papers on how to do text classification, but few methods are viable when applied to billions of web pages and hundreds or thousands of categories. Typically, they are too slow, not accurate enough, or too expensive to train/maintain.

How do we create a dynamic ontology that really works for large-scale web classification?

Next – Part Six: Creating A Large-Scale, Dynamic Ontology

The Future of Web Search
Part Four: How Ontology Gives Us Better Search

Friday, April 25th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

What is an ontology, and what does it have to do with web search?

An ontology is “a systematic arrangement of all of the important categories of objects or concepts which exist in some field of discourse, showing the relations between them.” (WordNet.) Or to simplify, an advanced “topic hierarchy”.

Many web sites use an ontology. For example, dating sites let you select by gender, age and location. Shopping sites let you search by color, style, price or inventory. In each case, the site uses a “domain-specific” ontology – all the content on the site is described by and fits into its ontology.

An ontology needs two things to be effective: It needs to make sense for the site, and the content it references must meaningfully map onto it.

So, when it comes to large-scale, general-purpose web search, you can see the problem. First, because a search engine is general-purpose and users can query for anything, there doesn’t exist a small set of “topics” that will cover every query. Second, because the Web is a collection of tens of billions of pages of varying quality, all created by a variety of “users”, it’s difficult for a company to accurately map what’s out there onto any ontology.

It’s easy enough to make up a bunch of categories, but it’s hard to make ones that will stand the test of time. Furthermore, if you do make ones that last, odds are that you will have a shallow ontology; we don’t know who the president will be in 2020, or who’ll be the biggest movie star in 2015. In addition, what if the definition of a category changes – what if the European Union gets a new country? The previous “EU” category becomes obsolete.

So how do you create a meaningful ontology?

Next – Part Five: Towards A Large, Dynamic Ontology

The Future of Web Search
Part Three: How Search Works Right Now, Cont’d.

Thursday, April 24th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Search companies aren’t clueless about the fact that users have different needs. To deal with an ambiguous query, most will use some type of “mixing”.

With “Saturn”, for example, they will present results from both the car company and the planet. Some even go a step further and offer “related searches”, which help a user by presenting queries that others might have asked, such as “Saturn cars” or “Saturn Vue”.

Unfortunately, commercial search engines don’t do “mixing” based on knowledge about the explicit meanings of a query or of web pages. But what would happen if they did? What would happen if search engines actually “understood” that “Saturn” had multiple meanings, rather than simply showing different results that had been manually “mixed”? What if they knew that www.saturn.com was about “car companies” and that “en.wikipedia.org/wiki/Saturn” was about “astronomy”? Could they use this knowledge to help separate out results by meaning, thus reducing a user’s difficulty in locating what they want?
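If every page carried a category label, separating results by meaning would be simple. Here is a minimal sketch, assuming such labels already exist. The first two URLs and their categories come from the paragraph above; the third URL, the “uncategorized” fallback and the function name are hypothetical.

```python
from collections import defaultdict

# Assumed page-to-category labels. The third entry is made up for illustration.
page_categories = {
    "www.saturn.com": "car companies",
    "en.wikipedia.org/wiki/Saturn": "astronomy",
    "example.org/saturn-rings": "astronomy",
}

def group_by_meaning(result_urls, categories):
    """Group ranked results by category, keeping rank order within each group."""
    groups = defaultdict(list)
    for url in result_urls:
        groups[categories.get(url, "uncategorized")].append(url)
    return dict(groups)

results = ["www.saturn.com", "en.wikipedia.org/wiki/Saturn", "example.org/saturn-rings"]
grouped = group_by_meaning(results, page_categories)
for category, urls in grouped.items():
    print(category, urls)
```

The grouping itself is trivial; the hard part, of course, is getting reliable labels for billions of pages in the first place.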

The ability to “separate out by meaning” brings us to the subject of ontology – which is what we think is required to create the search of the future.

Next – Part Four: How Ontology Gives Us Better Search

The Future of Web Search
Part Two: How Search Works Right Now

Wednesday, April 23rd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Henry Ford once said: “If I’d asked my customers what they wanted, they’d have said a faster horse.” In thinking about search right now, most of us wouldn’t compare our favorite search engine to a horse. But in reality, somebody like Henry Ford has already decided for us how search will be, what results we will see, and how they will be presented.

A lot of people say, “Who gives a crap? I’m happy with how search works. Why do we need to change it?”

Here’s why: Just like those horse owners, we are so used to what we know that we don’t think about the fact that things could be even better. And, just because we aren’t screaming for a better search engine doesn’t mean that we wouldn’t use one when it showed up.

Before we start thinking about what search could be, though, let’s look at why search is currently the way it is.

The main reason is that most search engines see users as numbers, not as individuals. Most search engines assume that: 1.) There’s a perfect set of results out there for every query; and 2.) What’s perfect for one user will be perfect for everyone.

To come up with this “perfect” set of results and corresponding ranking, search companies pick a bunch of pre-defined queries and then pay people to judge the relevance of their own results and, perhaps, of other engines’ results. The problem with this is that we all know there’s no “perfect” set of results. It varies with each user, as does the best ranking of results.

So how do search companies currently deal with the fact that there’s no one perfect set of results?

Next – Part Three: How Search Works Right Now, Cont’d.

The Future of Web Search
Part One: What Could The Future Look Like?

Tuesday, April 22nd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Imagine the Internet five years from now: As you begin to type a word into a search engine, it seems to know you personally. It naturally gravitates towards your unique interests and preferences. Rarely do you need to type more than one or two words before it shows you exactly what you’re looking for. On the occasional instance when it doesn’t correctly guess your intention, it’s easy to correct and quickly get to what you want.

For example, a student doing research for a school science project sees only science web sites that are appropriate for someone his age. A few hours later, he searches for information on his favorite video game, and he’s able to easily re-focus the engine on reviews and downloadable expansion packs.

By no means am I the first to postulate this future vision where your search engine seems to know you personally. But despite nearly ten years of artificial intelligence (AI) research in this area, we’re still not there. Why?

How do we get from here – a world where most people view search engines as big bookmark replacements – to there – a world where search engines are even more useful for real research and seem to know us personally, demonstrating the flexibility we all dream about? Are we moving in the right direction? Is it possible? Does anyone even care?

At Searchme, we are working to move toward this future and prove that it is possible by demonstrating some of the initial steps to get there. It’s extremely challenging, rewarding and exciting. Over the next few posts, I will go into detail on the real challenges to creating better search, what has already been done, and how we are starting to move into this future.

Next – Part Two: How Search Works Right Now

From the Blogosphere #2

Friday, April 18th, 2008

The fine folks at CogBox attended Ad:Tech SF and blogged about Drew Ianni’s keynote presentation, “This Is Not Your Father’s Kodak”. Drew referenced Searchme in his speech, which is very cool. We appreciate the shout-out from Drew, and we appreciate Chris’s comment about Searchme on the CogBlog:

“Pretty cool, and something I think I’ll actually use.”

That’s what we like to hear! Thanks, Chris.

New Feature: OpenSearch Plug-In for Firefox and IE7

Thursday, April 17th, 2008

Do you want quick and easy access to Searchme? Well now you can add Searchme to your list of search providers in Firefox 2 and Internet Explorer 7. Once added, you will have instant access to Searchme results right from your toolbar. Here’s how:

  1. While you’re on the Searchme site, click on the little glowing arrow next to the search box in the top right corner of your browser.
  2. Select “Add Searchme Beta” from the list.
  3. Start Searching!

[Image: How to add the Searchme plug-in]

Now you can use Searchme from any page!

New Feature: Scroll Arrows

Wednesday, April 16th, 2008

This feature was originally mentioned here, but we decided to re-post it separately in our ‘New Features’ category.

Our scroll bar is really fun: Slide it back and forth and watch the pages whiz by! But a lot of users asked for more precise control over the pages, so we decided to add an arrow to either end of the scroll bar:

[Image: New arrow on scroll bar]

You’ve always been able to go forward or backward by clicking on an individual page, but now these simple, clickable arrows make the results even easier to navigate.