Archive for the ‘Dealing With Spam’ Category

Spam Part Three - Babies with the Bathwater

Thursday, April 10th, 2008

This post was written by our Chief Spam Fighter and delves into the subject of why spam is such a tricky little beast. It was prompted by a question that a Searchme user posted at getsatisfaction.com.

Now that we know how spam fights, cheats and multiplies, let’s talk about why it’s a particularly tricky problem from the search engine side of things. Namely, we have to be really careful that we only remove spam and not the good sites.

Let’s drag out the stadium analogy one last time. We know the end zone’s full of spammers, but what if Baby Kylie and Aunt Millie happen to be sitting there as well? We obviously don’t want to get rid of them, so we can’t just blast the area.

At Searchme, we strive for a zero error rate - not one site mistakenly removed from our search results for being spam; not a single baby with the bathwater. While almost all other aspects of search are relatively mistake-tolerant and work well ‘on average’, spam identification is not. ‘On average’ could mean that a search engine is wrong up to half of the time, and nobody can afford to be 50% wrong when it comes to identifying spam. So we have to look at every site very closely - no big sweeps.

Also, what if some of the hooligans look an awful lot like Aunt Millie? We have to be so cautious about not getting rid of what may be a good site that sometimes a spam site won’t be identified. (The good news is that since this is a two-way adversarial street, a site missed today will be found tomorrow.)
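
For the technically curious, here is a minimal sketch of that trade-off in Python. It is not how Searchme actually does it; the site names, the scores and the idea of a single ‘spam score’ per site are all assumptions made up for illustration. It sets the removal bar just high enough that no good site is removed, then shows which spam sites slip through as a result.

```python
# Hypothetical data: (site, spam score, is it really spam?).
# Scores run from 0 to 1; higher means "looks more like spam".
# Every name and number here is invented for illustration.
sites = [
    ("aunt-millie.example", 0.62, False),
    ("baby-kylie.example", 0.10, False),
    ("hooligan-1.example", 0.97, True),
    ("hooligan-2.example", 0.55, True),  # looks a lot like Aunt Millie
    ("hooligan-3.example", 0.88, True),
]


def safest_threshold(labeled_sites):
    """Return the highest score seen on any good site.

    Only sites scoring strictly above this value get removed,
    so no good site in the sample is ever removed.
    """
    return max(score for _, score, is_spam in labeled_sites if not is_spam)


def classify(labeled_sites, threshold):
    """Split the sample into removed sites, missed spam and wrongly removed sites."""
    removed = [name for name, score, _ in labeled_sites if score > threshold]
    missed = [
        name for name, score, is_spam in labeled_sites
        if is_spam and score <= threshold
    ]
    wrongly_removed = [
        name for name, score, is_spam in labeled_sites
        if not is_spam and score > threshold
    ]
    return removed, missed, wrongly_removed


threshold = safest_threshold(sites)
removed, missed_spam, wrongly_removed = classify(sites, threshold)

print("Removed:", removed)
print("Spam we missed:", missed_spam)
print("Good sites wrongly removed:", wrongly_removed)
```

With this made-up data, no good site is removed, but the one hooligan who scores lower than Aunt Millie stays in the results.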

So, this is why you still find spam pages in search engines, despite our best efforts. Spammers use every dirty trick in the book, knowing that we, the search engines, have to be very careful in how we get rid of them.

The good news is, we won’t give up the fight.

Spam Part Two - Attack of the Clones

Tuesday, April 8th, 2008

This post was written by our Chief Spam Fighter and delves into the subject of why spam is such a tricky little beast. It was prompted by a question that a Searchme user posted at getsatisfaction.com.

In Spam Part One, I touched on the adversarial nature of spammers and how they cheat by yelling and shape-shifting. Now let’s discuss the second reason why spam is a particularly tricky problem: the numbers.

First of all, there is just so much darn spam out there. Billions and billions of pages. Dealing with the sheer mass of it is a never-ending, soul-wearying battle.

Second of all, spammers multiply like the devil. Say each person in our stadium represents one good site. Well, the spammers in the crowd have found a way to clone themselves, so what looks like a whole end zone full of people could in fact be one bad spammer. This cloning process is so fast and so cheap that even if we cleared out the area at halftime, it would be full again by the third quarter.

Here’s an example to illustrate this point: We once found a spam site that led to 381 billion pages. One domain created a flood of spam pages that was more than ten times the size of Google’s index.

That’s the kind of enemy we’re dealing with.
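
For readers who like to see it in code, here is a rough Python sketch of the cloning problem. The URLs are invented and this is only an illustration, not our actual pipeline: it counts crawled pages per host, so a crowd of pages can be seen for what it is, namely a handful of real sites plus one spammer repeated many times.

```python
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical crawl results; every URL here is made up.
crawled_urls = [
    "http://good-recipes.example/pie",
    "http://spam-farm.example/page?id=1",
    "http://spam-farm.example/page?id=2",
    "http://spam-farm.example/page?id=3",
    "http://local-news.example/today",
    "http://spam-farm.example/page?id=4",
]


def pages_per_host(urls):
    """Count how many crawled pages belong to each host name."""
    return Counter(urlsplit(url).hostname for url in urls)


counts = pages_per_host(crawled_urls)
distinct_hosts = len(counts)
largest_host, largest_count = counts.most_common(1)[0]

print(f"{len(crawled_urls)} pages, but only {distinct_hosts} hosts")
print(f"{largest_host} accounts for {largest_count} of them")
```

Counting by host is only a first step, since a spammer can register many domains just as cheaply as they generate pages.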

Next time, I’ll post about how hard it is to distinguish what is and is not spam (even though it’s everywhere).

Tomorrow: Spam Part Three - Babies with the Bathwater

Spam Part One - This Is War

Monday, April 7th, 2008

This post was written by our Chief Spam Fighter and delves into the subject of why spam is such a tricky little beast. It was prompted by a question that a Searchme user posted at getsatisfaction.com.

Academics refer to web-spam as ‘adversarial information retrieval’. Academic-speak is always amusing, but ‘adversarial’ does point to why web-spam is so hard to identify and so very different from all the other problems that search engines must address.

Why is web-spam such a tough one? Well, first of all, spammers are actively trying to game the search engine. Or in common parlance, cheat.

Here’s an analogy: Imagine the Web as a stadium full of people. Each person represents a web site. You do a search by asking your query over the PA system, and the whole stadium responds. The search engine’s job is to distinguish the correct answer from the false ones in the din of voices.

Now imagine that the spammers have smuggled in bullhorns.

Obviously, they’re much louder than everyone else, so they’re going to rise above the noise.

Second of all, spammers learn as they go. As search engines get better and better at distinguishing them, spammers adapt. So at the very beginning, they just needed to shout. Then they smuggled in bullhorns. Now they have Bose™ directional sound cannons.

That’s why it’s so rare to find spam results that rely on techniques from just a few years ago.

As you can see, not only is spam identification a difficult problem, but even when a search engine solves it, there’s a new version clamoring from the end zone by the next morning.

This is why, in the game of spammer v. search engine, it’s a never-ending, constantly-evolving war.

Next time, I’ll post about how spam is a mighty foe just by the sheer weight of its numbers.

Tomorrow: Spam Part Two - Attack of the Clones