Archive for the ‘The Future of Search’ Category

Visual Search and SEO

Friday, May 16th, 2008

By Aaron Curtiss. Aaron is an internal tools developer assisting the Search Quality department in Searchme’s efforts to improve relevance and classification.

According to Wikipedia, “Search engine optimization (SEO) is the process of improving the volume and quality of traffic to a web site from search engines via ‘natural’ (‘organic’ or ‘algorithmic’) search results for targeted keywords. Usually, the earlier a site is presented in the search results or the higher it ‘ranks’, the more searchers will visit that site.”

It’s that last sentence that’s interesting, because with a visual interface like Searchme, it’s not quite the same ‘punishment’ if your site is ranked 9th or 10th instead of 1st or 2nd.

With visual search, people are more likely to get to your site in a set of search results and more able to determine if it’s the one they need, even if it’s not in the top five results. This is because visual search lets people scroll through web pages extremely quickly and identify the relative strength of a site, often without reading a single word. They can see instantly whether or not a site is what they want, because the visual representation of a site has higher information yield than page titles and snippets. This means that users are more likely to get the result they want the first time they click through, instead of going back and forth on text-only links, so they can afford to scroll through more results.

What effect might this have on SEO? Much remains to be seen, but it’s safe to say that the authoritative and traditional definition given above will no longer suffice. When users can actually see whether a page will most likely answer their search query, the meaning of relevance changes.

One outcome that users could hope for is that designers and optimizers will start to aim not just for page rank, but for more effective visual communication of a site’s information. Visual search could promote a more holistic and multi-dimensional approach to SEO, one that brings users closer to the overall content and purpose (and quality) of a site, more quickly.

In sum, visual search is changing SEO, and we’re excited about the ways in which visual information is going to play a role in improving relevance and getting people to the information they want more quickly.

Building the Category Structure - Part Two

Tuesday, May 13th, 2008

By Alice Swanberg, of Searchme’s Search Quality department. Alice is a librarian who is responsible for creating, organizing and training the categories that you see on Searchme.

In the last post, I described how categorization has worked in the past. When we started to build the categories for Searchme, we knew that we didn’t want to make users climb up and down branches on a tree to find their sites. Instead, the categories needed to come to them. Furthermore, the categories needed to be useful to web searchers, not academics or search engine employees. Finally, we decided that our categories should not be based on what had gone before, nor on topics that we felt we should have, nor on topics of which we were very fond.

Instead, we went to the Web to look at sites where users were contributing content and tags. This information told us which words people used to describe things, what they were interested in, and where the most interest was. We aggregated all of the tags, common searches, and “folksonomy” data that we could find and spent some time organizing and studying it.

The next step was to decide which of these categories were mutually exclusive and which went together. Then we pulled them all into a category structure that could be used to help the classification system make sense of our search index.

The categories that you see while using Searchme represent just the tip of the iceberg. Each one is being supported by a batch of subcategories that work to clarify, expand, and constrain its meaning. And we can keep building these out as we note problems with a particular category’s breadth or precision.

How will we spot these problems, you ask? We’re watching your feedback! Thank you to everyone who has sent us praise, but especially to those who tell us what we could do better. We’re also watching the searches as they come in, so we can identify the searches in which the categories are not disambiguating well enough.

It’s exciting stuff, and it means that over time, our categories will get even better. We’ll improve the categories we already have, and we’ll add the categories that can help out searches. So please keep sending us feedback!

Building the Category Structure - Part One

Monday, May 12th, 2008

By Alice Swanberg, of Searchme’s Search Quality department. Alice is a librarian who is responsible for creating, organizing and training the categories that you see on Searchme.

My colleague Dr. Glover wrote in earlier posts about how the classification system works with our ontology. I’d like to write a bit about how and why the categories in the ontology were chosen.

In the early days of the Web, there weren’t many sites out there. That meant it was possible to build a category structure, hire people to find all those sites, and fit those sites into that structure. This model continued for a few years before the snowballing number of sites meant that no human team could fit all of them into a structure. At that point, most search companies gave up on human-generated categories as part of search and began to rely on sophisticated search algorithms to bring up the content users wanted.

As a librarian working for Internet companies, I’ve spent ten years helping design the biggest and most complex category structures on the Web. I joined Searchme because I was really excited to find a company that believed that neither an algorithm nor human intervention should be the dominant factor in building good search results. Human-generated meanings could be overlaid onto a smart search engine to bring about a new sort of search experience.

The category structures I worked with in the past have a strict hierarchy that doesn’t change very much over time. This makes them very easy to browse, as long as you understand where everything is - but that would take some study! These kinds of hierarchical taxonomies tend to offer users just one path through thousands of categories to their goal, and that path is not always easy to figure out.

That’s a little background on category-based search and its uses. Next, I will write a bit about how Searchme is approaching it.

Next: Building the Category Structure - Part Two

The Future of Search
Part Nine: How We’re Making Search Better Today

Friday, May 2nd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

So, now that we actually have a mapping from all URLs to all categories for our dynamic ontology – considered by some to be the holy grail of web-scale machine learning – the question becomes: What do we do with this data? We’ve shown that it’s feasible (and we encourage you to test us), but what are we doing with it to make search better for our users?

So far, we’ve chosen one possible use for our ontology – presenting categories that help disambiguate queries and get users to the information they want more quickly. Currently, in our beta, we have more than 200 categories. Some are subjects like “golf”, while others are page types like “blogs” or “political news” (which were chosen to represent how users actually use the Web as opposed to how an academic might divide up a domain).

Each category is thoroughly tested to ensure sufficient accuracy, although nothing is perfect. While testing, it’s exciting to me when random or obscure queries come back with perfect category suggestions. I have also seen cases where the ontology was not ideal or a category definition just didn’t work. For example, sites listing current spreads for a football game previously came back as “gambling and casinos”. Using our system, it took very little time to change the definition in the ontology and apply the changes (thus reclassifying our entire index). In addition, this work was done by a data analyst/ontologist, with no engineering resources required. This is only possible with a powerful system that supports our dynamic ontology and the ability to rapidly train and classify.

Every day I wake up excited to go to the office. I get to work with experts in the area of web search and machine learning to further enhance what we can do. We are already planning to grow the number of exposed categories, improve our accuracy, and further reduce the training costs of new verticals. We’ve got a long way to go, but I think this is the start of something that will radically change web search and the expectations of large-scale search engines. It has already been an exciting ride, and we have barely begun.

This concludes our series on ‘The Future of Web Search’ by Dr. Eric Glover, Searchme’s Classification Architect. Please check back, as other Searchme team members will be posting on the topic as well.

The Future of Search
Part Eight: We Really Can Get To Better Search

Thursday, May 1st, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

At Searchme, we’re very aware of the difficulties in building a web-scale automatic classification system that is fast and accurate and maps to a deep, dynamic ontology. In fact, in the previous post, we discussed how it was almost impossible.

However, as you may have guessed by now, at Searchme we are using a dynamic ontology to create our “categories” feature. Please feel free to give it a try – pick a query, choose a category, and decide for yourself how accurate our classifiers feel.

Here’s how we are able to do what many have long considered impossible:

First, we define our own ontology. This means we can easily adapt it to better match how users search the Web, as well as match what works best from a categorization standpoint. Simply put, if a category doesn’t work, we can change the rules of the game by picking a slightly different definition – one that would have fewer errors.

Second, we use complex models for classification (non-linear SVMs), as well as more complex features (not limited to bag of words). This richer set of features reduces the chance that a document with a few golf terms will be considered “golf”. A simple linear model assigns a fixed weight to the word “eagle”, independent of the context, which increases misclassifications compared with a non-linear model. Non-linear classifiers, by contrast, enable us to learn subtler concepts: “eagle” together with “flying” makes “eagle” negative with respect to golf, but “eagle” together with “birdie” makes “eagle” positive for golf.
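To make the “eagle” example concrete, here is a toy sketch in Python. It is not Searchme’s production classifier: the weights are invented, and the explicit word-pair features are only an assumption standing in for the kind of word interactions a non-linear model can capture.

```python
from itertools import combinations

# Hypothetical weights, invented for this illustration only.
LINEAR_WEIGHTS = {"eagle": 0.5, "birdie": 1.0, "flying": -1.0}
PAIR_WEIGHTS = {
    frozenset({"eagle", "birdie"}): 2.0,   # golf sense of "eagle"
    frozenset({"eagle", "flying"}): -2.0,  # bird sense of "eagle"
}


def linear_contributions(words):
    """Per-word contribution to a linear golf score: one fixed weight per word."""
    return {w: LINEAR_WEIGHTS.get(w, 0.0) for w in set(words)}


def pairwise_score(words):
    """Golf score from word pairs, so a word's effect depends on its neighbors."""
    unique = sorted(set(words))
    return sum(PAIR_WEIGHTS.get(frozenset(pair), 0.0)
               for pair in combinations(unique, 2))


golf_doc = ["eagle", "birdie"]
bird_doc = ["eagle", "flying"]

# The linear model gives "eagle" the same weight in both documents.
print(linear_contributions(golf_doc)["eagle"], linear_contributions(bird_doc)["eagle"])
# The pairwise model pushes the two documents in opposite directions.
print(pairwise_score(golf_doc), pairwise_score(bird_doc))
```

In the linear version, “eagle” contributes the same amount wherever it appears; in the pairwise version, its contribution changes sign depending on the word it appears with.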

Third, we’ve incorporated technologies for rapid training. These technologies reduce the amount of data and human effort required to train a classifier, keeping it at a manageable level without sacrificing final accuracy.

All of these factors are integrated into our core production system, which we designed from the ground up with the future of search in mind. Using the ideas of dynamic ontologies, we can be agile when new categories are needed or definitions change, and with our rapid training capabilities, we can adjust in weeks or months as opposed to years.

Conclusion – Part Nine: How We’re Making Search Better Today

The Future of Web Search
Part Seven: Classifying the Whole Web

Wednesday, April 30th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Classifying the whole Web is challenging. First, a single category could require hundreds or thousands of cleanly labeled examples – for a Bayesian classifier, a common rule of thumb is that you need at least twice as many training examples as features (think bag of words – hundreds or thousands of features). This is fine for one or even ten categories, but not for thousands.

Second, it is one thing to classify clean and proper English news articles or academic pages, but it is another thing entirely to classify a random blog post or a site’s home page where the only “content” is “you need Flash to view this page.”

Third, and most important for a commercial application, even 99.9% overall accuracy would not be good enough when it comes to very specific categories, because, simply put, the number of false positives can outweigh the number of true positives for low-frequency categories. Since the Web is so large, the number of false positives (pages which are not about golf, but contain many “golf words”) could easily be more than the number of actual golf pages.

For example, pages about Amazon development might contain “eagle”, “wood” and “iron”, or pages about tigers might mention “tiger” and “woods”. Likewise, what about the pages where someone is talking about how they “hit a hole in one” with their presentation about economics? Even 99.9% overall (balanced) accuracy could mean that every other page that is predicted to be in the “golf” category is an error; therefore even such a high accuracy is too low to work, though it might win a best paper award at an academic conference.
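The “every other page” claim can be checked with a little arithmetic. The sketch below assumes, purely for illustration, that one page in a thousand is really about golf and that the classifier is 99.9% accurate on golf pages and on non-golf pages alike. The function name and the numbers are hypothetical, not Searchme measurements.

```python
def precision(prevalence, sensitivity, specificity):
    """Fraction of pages predicted to be in a category that truly belong to it."""
    true_positives = prevalence * sensitivity
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Assumed: 0.1% of pages are golf pages; 99.9% accuracy on each class.
golf_precision = precision(prevalence=0.001, sensitivity=0.999, specificity=0.999)
print(f"{golf_precision:.1%}")
```

Under these assumptions the precision comes out at about one half: the few golf pages that are found are matched by roughly as many non-golf pages that slip through, simply because non-golf pages are a thousand times more common.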

These reasons make it seem virtually impossible and/or impractical to do large-scale classification with a dynamic ontology. Are there any ways to get around these challenges?

Next – Part Eight: We Really Can Get To Better Search

The Future of Web Search
Part Six: Creating A Large-Scale, Dynamic Ontology

Tuesday, April 29th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

How do we create an automated, flexible ontology that will enable a more personalized and more flexible search experience?

First, the requirements: 1.) It has to be commercially viable; i.e. it can’t take 100 years to classify a billion web pages or require a thousand volunteers to label training examples. 2.) It has to be precise; if you say a page is about golf, it has to be about golf nearly all the time. 3.) It has to have sufficient recall; i.e. it has to identify nearly all of the relevant pages that belong to a given category – recognize all golf pages as golf.
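Requirements 2 and 3 correspond to the standard measures of precision and recall. As a minimal sketch, assuming we have a set of pages the classifier labeled “golf” and a set of pages that really are about golf (the page IDs below are made up), the two measures can be computed like this:

```python
def precision_and_recall(predicted, actual):
    """Return (precision, recall) for one category, given two sets of page IDs."""
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall


# Hypothetical page IDs: the classifier labels four pages "golf"; five really are.
predicted_golf = {"page1", "page2", "page3", "page4"}
actual_golf = {"page1", "page2", "page3", "page5", "page6"}
print(precision_and_recall(predicted_golf, actual_golf))
```

Here three of the four predicted pages are correct (precision), and three of the five real golf pages were found (recall).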

All automated text classification systems require the same basic recipe: Obtain “good” training data, learn a model from the training data, and apply that model to the unlabeled documents.

When doing classification, all documents must be represented in a mathematical way that corresponds to a set of “features”. The simplest set of features is called “bag of words”, which is common in Information Retrieval (IR) literature. In a “bag of words”, each unique word in a document corresponds to a single feature.
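A bag of words is easy to sketch in Python. The tokenization rule below (lowercase, letters and apostrophes only) is an assumption made for the example; real systems make many more choices here.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Map each unique lowercase word in the text to how often it occurs."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


features = bag_of_words("The ball rolled past the iron.")
print(sorted(features))
```

Word order is thrown away: the document is reduced to its unique words and their counts, and each unique word becomes one feature.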

A very simple classifier might try to find documents that have words “similar” to positive training examples for a given category. For example, if the training documents in the “golf” category often contain the words “ball”, “iron”, “wood” and “par”, then documents that contain all of those words are likely to be considered “golf”. (If you are interested, most academic – and commercial – text classification systems use a Bayesian classifier or a linear SVM as their “mathematical model”.)
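The recipe of training data, model, and application can be sketched with the very simple classifier just described. This is a toy under stated assumptions, not a Bayesian or SVM model: the “model” is just the set of words that appear in every positive example and in no negative example, and the training sentences are invented.

```python
from collections import Counter


def learn_keywords(positive_docs, negative_docs, min_fraction=1.0):
    """Learn words common to the positive examples and absent from the negatives."""
    pos_freq = Counter()
    for doc in positive_docs:
        pos_freq.update(set(doc.lower().split()))
    negative_vocab = set()
    for doc in negative_docs:
        negative_vocab.update(doc.lower().split())
    cutoff = min_fraction * len(positive_docs)
    return {w for w, n in pos_freq.items() if n >= cutoff and w not in negative_vocab}


def matches_category(doc, keywords, threshold=1.0):
    """Apply the model: does the document contain enough of the keywords?"""
    if not keywords:
        return False
    words = set(doc.lower().split())
    return len(words & keywords) / len(keywords) >= threshold


# Invented training data for the "golf" category.
golf_docs = [
    "he hit the ball with an iron and a wood to make par",
    "a wood off the tee then an iron left the ball close for par",
]
other_docs = [
    "the recipe calls for an egg and a cup of flour",
    "he left the office then drove to the store for milk",
]

golf_keywords = learn_keywords(golf_docs, other_docs)
print(sorted(golf_keywords))
print(matches_category("she reached the green with a wood and an iron and the ball dropped for par", golf_keywords))
print(matches_category("an eagle was flying over the iron bridge", golf_keywords))
```

The first test document contains all of the learned keywords and is accepted; the second contains only “iron” and is rejected. A classifier this naive would accept any page that happens to use the same handful of words.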

All this sounds fairly straightforward, but there are many daunting challenges when trying to apply this procedure to classifying the whole Web.

Next – Part Seven: Classifying the Whole Web

The Future of Web Search
Part Five: Towards A Large, Dynamic Ontology

Monday, April 28th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Because it’s very difficult to create a deep, useful ontology that actually works and stands the test of time, it is understandable that search engines have moved towards the simple approach of “do you want news, images or blogs?” served with a side of “search only in English.” Is there a way to create a better ontology that actually works and is flexible over time?

The answer is yes. We have the ability to create a responsive ontology, one that is rapidly adaptable. With a dynamic ontology, if someone new becomes famous next week, we could create and apply a category for that person, instantly. Likewise, if an existing category changes or becomes obsolete, it is easy to adjust.

Being both deep (many categories) and flexible, however, makes it difficult to effectively map web pages onto such an ontology, because things are changing all the time. In addition, the rapidly changing definitions mean the potential maintenance costs may be prohibitive – especially if each change requires thousands of labeled examples and days of training. So how do you have a classification system on such a grand scale – one that makes business sense?

There are hundreds of academic papers on how to do text classification, but few methods are viable when applied to billions of web pages and hundreds or thousands of categories. Typically, they are too slow, not accurate enough, or too expensive to train/maintain.

How do we create a dynamic ontology that really works for large-scale web classification?

Next – Part Six: Creating A Large-Scale, Dynamic Ontology

The Future of Web Search
Part Four: How Ontology Gives Us Better Search

Friday, April 25th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

What is an ontology, and what does it have to do with web search?

An ontology is “a systematic arrangement of all of the important categories of objects or concepts which exist in some field of discourse, showing the relations between them” (WordNet). Or, to simplify, it is an advanced “topic hierarchy”.

Many web sites use an ontology. For example, dating sites let you select by gender, age and location. Shopping sites let you search by color, style, price or inventory. In each case, the site uses a “domain-specific” ontology – all the content on the site is described by and fits into its ontology.

An ontology needs two things to be effective: It needs to make sense for the site, and the content it references must meaningfully map onto it.
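As a rough sketch of a domain-specific ontology, consider a small shopping site. The facets, values, and catalog items below are hypothetical; the point is only that every item must map onto the ontology before users can select by it.

```python
# Hypothetical facets and values for a small shopping site.
SHOP_ONTOLOGY = {
    "color": {"red", "blue", "black"},
    "style": {"casual", "formal"},
}


def maps_onto(item, ontology):
    """True if the item has an allowed value for every facet in the ontology."""
    return all(item.get(facet) in allowed for facet, allowed in ontology.items())


def select(items, **wanted):
    """Return the items whose facets match every requested value."""
    return [item for item in items
            if all(item.get(facet) == value for facet, value in wanted.items())]


catalog = [
    {"name": "oxford shirt", "color": "blue", "style": "formal"},
    {"name": "t-shirt", "color": "red", "style": "casual"},
    {"name": "mystery box", "color": "plaid"},  # does not fit the ontology
]
fits = [item for item in catalog if maps_onto(item, SHOP_ONTOLOGY)]
print([item["name"] for item in select(fits, color="blue")])
```

On a single site, the owner controls both the ontology and the content, so items like the “mystery box” are rare and easy to fix.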

So, when it comes to large-scale, general-purpose web search, you can see the problem. First, because a search engine is general-purpose and users can query for anything, there doesn’t exist a small set of “topics” that will cover every query. Second, because the Web is a collection of tens of billions of pages of varying quality, all created by a variety of “users”, it’s difficult for a company to accurately map what’s out there onto any ontology.

It’s easy enough to make up a bunch of categories, but it’s hard to make ones that will stand the test of time. Furthermore, if you do make ones that last, odds are that you will have a shallow ontology; we don’t know who the president will be in 2020, or who’ll be the biggest movie star in 2015. In addition, what if the definition of a category changes – what if the European Union gets a new country? The previous “EU” category becomes obsolete.

So how do you create a meaningful ontology?

Next – Part Five: Towards A Large, Dynamic Ontology

The Future of Web Search
Part Three: How Search Works Right Now, Cont’d.

Thursday, April 24th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Search companies aren’t clueless about the fact that users have different needs. To deal with an ambiguous query, most will use some type of “mixing”.

With “Saturn”, for example, they will present results from both the car company and the planet. Some even go a step further and offer “related searches”, which help a user by presenting queries that others might have asked, such as “Saturn cars” or “Saturn Vue”.

Unfortunately, commercial search engines don’t do “mixing” based on knowledge about the explicit meanings of a query or of web pages. But what would happen if they did? What would happen if search engines actually “understood” that “Saturn” had multiple meanings, rather than just presenting different results that were manually “mixed”? What if they knew that www.saturn.com was about “car companies” and that en.wikipedia.org/wiki/Saturn was about “astronomy”? Could they use this knowledge to help separate out results by meaning, thus reducing a user’s difficulty in locating what they want?
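Here is a minimal sketch of what “separating out results by meaning” could look like once each page has a category. The two category labels come from the example above; the third URL, the function name, and the “uncategorized” fallback are assumptions made for illustration.

```python
from collections import defaultdict

# Category labels for the two pages named in the example above.
URL_CATEGORIES = {
    "www.saturn.com": "car companies",
    "en.wikipedia.org/wiki/Saturn": "astronomy",
}


def group_by_meaning(result_urls, url_categories):
    """Group search results by category, keeping the ranking order within each group."""
    groups = defaultdict(list)
    for url in result_urls:
        groups[url_categories.get(url, "uncategorized")].append(url)
    return dict(groups)


# Hypothetical ranked results for the query "Saturn".
results = ["www.saturn.com", "en.wikipedia.org/wiki/Saturn", "example.com/saturn-fan-page"]
for category, urls in group_by_meaning(results, URL_CATEGORIES).items():
    print(category, urls)
```

The grouping itself is trivial; the hard part is obtaining a reliable category for every page on the Web.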

The ability to “separate out by meaning” brings us to the subject of ontology – which is what we think is required to create the search of the future.

Next – Part Four: How Ontology Gives Us Better Search