Building the Category Structure - Part Two

May 13th, 2008

By Alice Swanberg, of Searchme’s Search Quality department. Alice is a librarian who is responsible for creating, organizing and training the categories that you see on Searchme.

In the last post, I described how categorization has worked in the past. When we started to build the categories for Searchme, we knew that we didn’t want to make users climb up and down branches on a tree to find their sites. Instead, the categories needed to come to them. Furthermore, the categories needed to be useful to web searchers, not academics or search engine employees. Finally, we decided that our categories should not be based on what had gone before, nor on topics that we felt we should have, nor on topics of which we were very fond.

Instead, we went to the Web to look at sites where users were contributing content and tags. This information told us which words people used to describe things, what they were interested in, and where the most interest was. We aggregated all of the tags, common searches, and “folksonomy” data that we could find and spent some time organizing and studying it.

The next step was to decide which of these categories were mutually exclusive and which went together. Then we pulled them all into a category structure that could be used to help the classification system make sense of our search index.

The categories that you see while using Searchme represent just the tip of the iceberg. Each one is being supported by a batch of subcategories that work to clarify, expand, and constrain its meaning. And we can keep building these out as we note problems with a particular category’s breadth or precision.

How will we spot these problems, you ask? We’re watching your feedback! Thank you to everyone who has sent us praise, but especially to those who tell us what we could do better. We’re also watching the searches as they come in, so we can identify the searches in which the categories are not disambiguating well enough.

It’s exciting stuff, and it means that over time, our categories will get even better. We’ll improve the categories we already have, and we’ll add the categories that can help out searches. So please keep sending us feedback!

Building the Category Structure - Part One

May 12th, 2008

By Alice Swanberg, of Searchme’s Search Quality department. Alice is a librarian who is responsible for creating, organizing and training the categories that you see on Searchme.

My colleague Dr. Glover wrote in earlier posts about how the classification system works with our ontology. I’d like to write a bit about how and why the categories in the ontology were chosen.

In the early days of the Web, there weren’t many sites out there. That meant it was possible to build a category structure, hire people to find all those sites, and fit those sites into that structure. This model continued for a few years before the snowballing number of sites meant that no human team could fit all of them into a structure. At that point, most search companies gave up on human-generated categories as part of search and began to rely on sophisticated search algorithms to bring up the content users wanted.

As a librarian working for Internet companies, I’ve spent ten years helping design the biggest and most complex category structures on the Web. I joined Searchme because I was really excited to find a company that believed that neither an algorithm nor human intervention should be the dominant factor in building good search results. Human-generated meanings could be overlaid onto a smart search engine to bring about a new sort of search experience.

The category structures I worked with in the past have a strict hierarchy that doesn’t change very much over time. This makes them very easy to browse, as long as you understand where everything is - but that would take some study! These kinds of hierarchical taxonomies tend to offer users just one path through thousands of categories to their goal, and that path is not always easy to figure out.

That’s a very little background on category-based search and its uses. Next I will write a bit about how Searchme is approaching it.

Next: Building the Category Structure - Part Two

From The Blogosphere

May 8th, 2008

One of our favorite recent posts is over at Search Engine News, entitled “Searchme: Most Visual Search Engine Yet?” Terri Wells does a thorough review of the site and walks people through our features step by step (while acknowledging that we’re still in beta - thank you).

“Visual search engines try to show us that there’s a better way to search than by looking at 10 links with snippets on a page and making an educated guess. Most try to say that they’re more intuitive. Searchme really is.”

Check it out, and thank you, Search Engine News!

New Feature
Use Your Mouse Wheel

May 7th, 2008

This was one of the most requested features by far, so we added it as quickly as we could. Now you can flip through Searchme’s visual search results using your mouse wheel.

Keep those suggestions coming!

Favorite Searchme Searches
Erin Goes Etsy

May 6th, 2008

A new series spotlighting some of our favorite ways to use Searchme. This one’s by Erin Pipkin, one of our Search Analysts. As a member of the Search Quality team, Erin helps plan and build categories. She used Searchme’s visual search and category suggest while shopping at a favorite site.

I’m a big fan of online craft emporium Etsy. The site houses over 150,000 artists and craftspeople and their small online shops, and offers about 1.5 million items for sale. Just as importantly, the company’s innovative engineers and developers have concocted a series of clever ways to connect users and sellers.

There are many ways to browse Etsy’s wares. Their default mode is to display sample works from a mixed set of artists in a grid of thumbnail images. If an object interests me, I can click on the thumbnail to visit the maker’s shop and view the item up close. From there, I can see the seller’s other items and view his or her favorite sellers.

I’ve spent many happy hours rambling over the site this way, but sometimes I’m not able to find the things I want very quickly. I find that the front pages of artists’ Etsy shops are more informative than the initial thumbnails. Most have unique mastheads and display multiple items for sale. Recently, I wondered: What would it be like to browse Etsy by viewing the sellers’ front pages? I thought Searchme might be able to help.

I’d been thinking I’d like a new necklace, so I began by typing in the simple query “Etsy”. The top result was what I expected: www.etsy.com. The results spanning out to the right, however, showed pages from individual sellers. The array of Etsy products shown was still a bit broad, so I clicked on Searchme’s “jewelry” category, and soon I was scrolling through a stream of artists’ pages, glancing at each briefly and clicking to open the ones that looked the most interesting. With little effort, I found a new set of favorite independent jewelers.

For me, this Etsy experiment raises so many questions about how visual search and category suggest can help broaden and enrich online shoppers’ experiences. This is something my team is actively working on at Searchme – we hope you have a chance to check it out for yourself!

From The Blogosphere

May 5th, 2008

Check out this KCBS-LA news segment that aired about Searchme!

From The Blogosphere

May 3rd, 2008

Thanks to the fine folks over at Daily Candy for posting about us on The Daily Candy Weekend Guide. Daily Candy is a free daily e-mail newsletter and web site that’s the insider’s guide to what’s hot, new, and undiscovered — from fashion and style to gadgets and travel. Several of us here in the Searchme offices subscribe and we love it. Yum.

GO
Search Me
What: Efficient new search engine enables you to scan resulting pages visually.
Why: Watch out, Google.
Where: Online at beta.searchme.com.

The Future of Search
Part Nine: How We’re Making Search Better Today

May 2nd, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

So, now that we actually have a mapping from all URLs to all categories for our dynamic ontology – considered by some to be the holy grail of web-scale machine learning – the question becomes: What do we do with this data? We’ve shown that it’s feasible (and we encourage you to test us), but what are we doing with it to make search better for our users?

So far, we’ve chosen one possible use for our ontology – presenting categories that help disambiguate queries and get users to the information they want more quickly. Currently, in our beta, we have more than 200 categories. Some are subjects like “golf”, while others are page types like “blogs” or “political news” (which were chosen to represent how users actually use the Web as opposed to how an academic might divide up a domain).

Each category is thoroughly tested to ensure sufficient accuracy (though nothing is perfect). While testing, it’s exciting to me when random or obscure queries come back with perfect category suggestions. I have also seen cases where the ontology was not ideal or a category definition just didn’t work. For example, sites listing current spreads for a football game previously came back as “gambling and casinos”. Using our system, it took very little time to change the definition in the ontology and apply the changes (thus reclassifying our entire index). In addition, this work was done by a data analyst/ontologist, with no engineering resources required. This is only possible with a powerful system that supports our dynamic ontology and the ability to rapidly train and classify.

Every day I wake up excited to go to the office. I get to work with experts in the area of web search and machine learning to further enhance what we can do. We are already planning to grow the number of exposed categories, improve our accuracy, and further reduce the training costs of new verticals. We’ve got a long way to go, but I think this is the start of something that will radically change web search and the expectations of large-scale search engines. It has already been an exciting ride, and we have barely begun.

This concludes our series on ‘The Future of Web Search’, by Dr. Eric Glover, Searchme’s Classification Architect. Please check back, as other Searchme team members will be posting on the topic as well.

The Future of Search
Part Eight: We Really Can Get To Better Search

May 1st, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

At Searchme, we’re very aware of the difficulties in building a web-scale automatic classification system that is fast and accurate and maps to a deep, dynamic ontology. In fact, in the previous post, we discussed how it was almost impossible.

However, as you may have guessed by now, at Searchme we are using a dynamic ontology to create our “categories” feature. Please feel free to give it a try – pick some query, choose a category, and decide for yourself how accurate our classifiers feel.

Here’s how we are able to do what many have long considered impossible:

First, we define our own ontology. This means we can easily adapt it to better match how users search the Web, as well as match what works best from a categorization standpoint. Simply put, if a category doesn’t work, we can change the rules of the game by picking a slightly different definition – one that would have fewer errors.

Second, we use complex models for classification (non-linear SVMs), as well as more complex features (not limited to bag of words). This richer set of features reduces the chance that a document with a few golf terms will be considered “golf”. A simple linear model assigns a fixed weight to the word “eagle”, independent of context, which leads to more misclassifications than a non-linear model. Non-linear classifiers, by contrast, enable us to learn subtler concepts: “eagle” together with “flying” makes “eagle” negative with respect to golf, while “eagle” together with “birdie” makes “eagle” positive for golf.

Third, we’ve incorporated technologies for rapid training. These technologies reduce the amount of data and human effort required to train a classifier, keeping it at a manageable level without sacrificing final accuracy.
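
The second point can be made concrete with a minimal sketch. It is not Searchme's classifier: the weights, the three-word vocabulary, and the function names linear_score and interaction_score are all hypothetical, and the explicit pairwise terms only stand in, loosely, for the kind of word interaction a non-linear SVM can learn.

```python
from itertools import combinations

# Hypothetical per-word weights, made up for illustration only.
LINEAR_WEIGHTS = {"eagle": 0.8, "birdie": 0.6, "flying": -0.3}

# Hypothetical pairwise weights: the value of "eagle" depends on its neighbor.
PAIR_WEIGHTS = {
    frozenset({"eagle", "birdie"}): 1.5,
    frozenset({"eagle", "flying"}): -1.5,
}


def linear_score(words):
    """Sum fixed per-word weights; each word counts the same in any context."""
    return sum(LINEAR_WEIGHTS.get(word, 0.0) for word in set(words))


def interaction_score(words):
    """Linear score plus a term for every co-occurring pair of words."""
    present = sorted(set(words))
    pair_total = sum(
        PAIR_WEIGHTS.get(frozenset(pair), 0.0)
        for pair in combinations(present, 2)
    )
    return linear_score(words) + pair_total


golf_page = ["eagle", "birdie"]
bird_page = ["eagle", "flying"]

for name, page in (("golf page", golf_page), ("bird page", bird_page)):
    print(name, "linear:", linear_score(page), "with pairs:", interaction_score(page))
```

With these made-up weights, the linear score for the bird page stays above zero because “eagle” contributes the same amount wherever it appears, while the pairwise term pushes that same page below zero and leaves the golf page positive.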

All of these factors are integrated into our core production system, which we designed from the ground up with the future of search in mind. Using the ideas of dynamic ontologies, we can be agile when new categories are needed or definitions change, and with our rapid training capabilities, we can adjust in weeks or months as opposed to years.

Conclusion – Part Nine: How We’re Making Search Better Today

The Future of Web Search
Part Seven: Classifying the Whole Web

April 30th, 2008

By Dr. Eric Glover, Searchme’s Classification Architect. Eric is responsible for the design and implementation of Searchme’s categories feature, a seemingly simple tool that springs from an exciting area of artificial intelligence (AI) research and development.

Classifying the whole Web is challenging. First, a single category could require hundreds or thousands of cleanly labeled examples – for a Bayesian classifier, the rule is that you need at least twice as many training examples as features (think bag of words – hundreds or thousands of features). This is fine for one or even ten categories, but not for thousands.

Second, it is one thing to classify clean and proper English news articles or academic pages, but it is another thing entirely to classify a random blog post or a site’s home page where the only “content” is “you need Flash to view this page.”

Third, and most important for a commercial application, even 99.9% overall accuracy would not be good enough when it comes to very specific categories, because, simply put, the number of false positives can outweigh the number of true positives for low-frequency categories. Since the Web is so large, the number of false positives (pages which are not about golf, but contain many “golf words”) could easily be more than the number of actual golf pages.

For example, pages about Amazon development might contain “eagle”, “wood” and “iron”, or pages about tigers might mention “tiger” and “woods”. Likewise, what about the pages where someone is talking about how they “hit a hole in one” with their presentation about economics? Even 99.9% overall (balanced) accuracy could mean that every other page that is predicted to be in the “golf” category is an error; therefore even such a high accuracy is too low to work, though it might win a best paper award at an academic conference.
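
The arithmetic behind that claim can be sketched in a few lines. The assumptions here are ours, not figures from Searchme: golf pages are taken to be 0.1% of the Web, and the classifier is taken to be 99.9% accurate on golf pages and on non-golf pages alike. The function name precision is just for illustration.

```python
def precision(prevalence, recall, specificity):
    """Fraction of pages predicted to be in the category that truly belong there.

    prevalence:  share of all pages that really are in the category
    recall:      share of in-category pages the classifier catches
    specificity: share of out-of-category pages it correctly rejects
    """
    true_positives = prevalence * recall
    false_positives = (1.0 - prevalence) * (1.0 - specificity)
    return true_positives / (true_positives + false_positives)


# Hypothetical prevalence values; 99.9% accuracy on both classes throughout.
for prevalence in (0.1, 0.01, 0.001, 0.0001):
    share_correct = precision(prevalence, recall=0.999, specificity=0.999)
    print(f"category is {prevalence:.2%} of pages -> "
          f"{share_correct:.1%} of predictions are right")
```

Under those assumptions, a category covering 0.1% of pages gets a precision of 50%: for every true golf page there is one false positive, which is the "every other page" case described above. Rarer categories fare worse.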

These reasons make it seem virtually impossible and/or impractical to do large-scale classification with a dynamic ontology. Are there any ways to get around these challenges?

Next – Part Eight: We Really Can Get To Better Search