Spam Part One - This Is War
This post was written by our Chief Spam Fighter and delves into the subject of why spam is such a tricky little beast. It was prompted by a question that a Searchme user posted at getsatisfaction.com.
Academics refer to web-spam as ‘adversarial information retrieval’. Academic-speak is always amusing, but ‘adversarial’ does point to why web-spam is so hard to identify and so very different from all the other problems that search engines must address.
Why is web-spam such a tough one? Well, first of all, spammers are actively trying to game the search engine. Or in common parlance, cheat.
Here’s an analogy: Imagine the Web as a stadium full of people. Each person represents a web site. You do a search by asking your query over the PA system, and the whole stadium responds. The search engine’s job is to distinguish the correct answer from the false ones in the din of voices.
Now imagine that the spammers have smuggled in bullhorns.
Obviously, they’re much louder than everyone else, so they’re going to rise above the noise.
Second of all, spammers learn as they go. As search engines get better and better at distinguishing them, spammers adapt. So at the very beginning, they just needed to shout. Then they smuggled in bullhorns. Now they have Bose™ directional sound cannons.
That’s why it’s so rare to find spam results using techniques from just a few years ago.
As you can see, not only is spam identification a difficult problem, but even when a search engine solves it, there’s a new version clamoring from the end zone by the next morning.
This is why the game of spammer v. search engine is a never-ending, constantly evolving war.
Next time, I’ll post about how spam is a mighty foe just by the sheer weight of its numbers.
Tomorrow: Spam Part Two - Attack of the Clones

June 3rd, 2008 at 12:58 pm
Interesting analogy. Here is some food for thought. They say dogs with the loudest barks have no bite. I am not sure how this can be practically implemented, but maybe you should build it into your system that if something perfectly fits your search engine criteria, it should raise a flag. So if you have 1000 criteria and a site scores very high on almost all of them, there is a good chance it is a spammer. Good sites don’t spend that much time trying to beat the system. They focus on their content.
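
As a rough illustration of the idea in this comment, here is a minimal Python sketch. It is not how Searchme or any other search engine works. The function name `looks_over_optimized`, the 0-to-1 score scale, and both cutoff values are hypothetical, chosen only to show the shape of the check: flag a site when it scores suspiciously high on nearly every criterion.

```python
def looks_over_optimized(scores, per_criterion_cutoff=0.9, fraction_cutoff=0.95):
    """Return True when a site scores high on nearly every ranking criterion.

    scores: one number per criterion, assumed here to lie between 0 and 1.
    Both cutoffs are made-up illustrative values, not tuned thresholds.
    """
    if not scores:
        return False
    # Count the criteria on which the site looks close to "perfect".
    high = sum(1 for s in scores if s >= per_criterion_cutoff)
    # Flag the site if almost all criteria are near-perfect.
    return high / len(scores) >= fraction_cutoff


# A site that is near-perfect on all 1000 criteria gets flagged for review.
too_good = [0.97] * 1000
# An ordinary site is strong on some criteria and weak on others.
ordinary = [0.97] * 400 + [0.40] * 600

print(looks_over_optimized(too_good))
print(looks_over_optimized(ordinary))
```

A flag like this would only mark a site for a closer look. Some perfectly honest sites are simply well built, so the check could not decide on its own that a site is spam.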