On spam and the filtering thereof
Posted: Thu Jun 02, 2005 11:59 pm
Jonathan Pearce wrote: The text of this article was taken from archive.org's copy of the original.
Code:
43 N Dec 07 Angie's Genealo (1303) *****SPAM***** 422 Genealogy Databases
44 N + Dec 07 Gamers.com ( 257) Thinking Out Loud: Holiday Gaming and the
45 N + Dec 07 Gamers.com ( 159) Gamers.com's Week In Review - New Sega Ga
46 N + Dec 09 TechOnLine Webc ( 113) *****SPAM***** Upcoming Webcasts
47 N + Dec 09 TechOnLine Webc ( 113) *****SPAM***** Upcoming Webcasts
48 N + Dec 08 eBay ( 428) And now, the choice is clear...
49 N + Dec 09 BIGWORDS ( 232) *****SPAM***** It's that time again...
50 N + Dec 09 qhjjpearce6@ema ( 122) *****SPAM***** vyleceny iuvenum jpearce I
51 N Dec 09 GameDAILY_Newsl (1625) Game industry news for December 9, 2002
Spamassassin is the king of pattern-matching spam filtering software. Bogofilter is one of many implementations of the Bayesian spam filtering that Paul Graham detailed in "A Plan for Spam". I chose bogofilter because someone has already conveniently packaged it and put it in Debian testing.
The reason I installed both filters is simple. Spamassassin alone was not blocking enough of my incoming spam, but I did not have a large corpus of spam (more than a thousand messages) to train bogofilter on. I believe the combined results will be better than if I used either by itself.
In the snippet of my Mail/Junk folder above, messages whose subjects begin with *****SPAM***** were tagged by spamassassin. The remaining messages were caught by bogofilter. To understand the difference between the two, let's examine the first message, 43, and the last message, 51, in more detail.
The first message is generic unsolicited bulk email. It's from someone I don't know about a service I don't care about. This is classic spam. Spamassassin is really very good about catching this sort of thing. It's from a known abusive relay and has red HTML. I'll probably never hear from this spammer again.
The last message is from a spammer who has been sending me email for months. I don't want the damn "Game industry news". Somewhere along the line they got my email address. I can't even remember if I gave it to them or not. All I know is I don't want to look at these messages anymore. To spamassassin, though, these messages look like solicited newsletters and mailing lists, so it lets them through. I have accumulated enough of these particular messages to train bogofilter to spot them. As a result, bogofilter catches all of them, even though spamassassin thinks they're legit.
Conversely, when I receive email from a new spammer about a new service or product, bogofilter simply won't know what to do with it. If I had a really large training set, maybe I could get bogofilter to spot such messages based on their characteristic language. I don't, at the moment. Hence, the two-pronged approach.
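To make the training-set point concrete, here is a minimal Python sketch of the scoring rule from Graham's "A Plan for Spam". It is an illustration only: bogofilter's actual calculation may differ, the function names and the toy counts are made up for the example, and Graham's refinements (using only the most interesting tokens, skipping rarely seen ones) are left out.

```python
from math import prod

# Graham assigns 0.4 to a token that has never been seen in training.
UNKNOWN = 0.4


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate how strongly a single token suggests spam."""
    b = spam_counts.get(token, 0)
    # Graham doubles the ham count to bias against false positives.
    g = 2 * ham_counts.get(token, 0)
    if b + g == 0:
        return UNKNOWN
    bad = min(1.0, b / n_spam)
    good = min(1.0, g / n_ham)
    # Clamp so that no single token is treated as absolute proof.
    return max(0.01, min(0.99, bad / (good + bad)))


def spam_score(tokens, spam_counts, ham_counts, n_spam, n_ham):
    """Combine the per-token probabilities into one score for the message."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in set(tokens)
    ]
    p_spam = prod(probs)
    p_ham = prod(1.0 - p for p in probs)
    return p_spam / (p_spam + p_ham)


# Hypothetical toy corpus: 50 spam and 50 legitimate messages.
spam_counts = {"game": 40, "industry": 35, "news": 30}
ham_counts = {"news": 2, "meeting": 20}

familiar = spam_score(["game", "industry", "news"], spam_counts, ham_counts, 50, 50)
novel = spam_score(["vyleceny", "iuvenum"], spam_counts, ham_counts, 50, 50)
```

With these made-up counts, the familiar newsletter vocabulary scores very close to 1, while a message made entirely of unseen tokens stays below 0.5. That is the situation described above: a new spammer looks unremarkable to a filter trained on a small corpus.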
Currently, I let spamassassin process all my email, then bogofilter. I then decide whether to filter each message based on the bogofilter score. I am still experimenting with the spamassassin score: I have been using it to filter email as well, but I am going to start relying completely on the bogofilter score. Theoretically, bogofilter should start to recognize spamassassin's markup as a clear indication of spam and send those messages on their merry way to Mail/Junk.
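The two filing policies described here can be sketched in Python as follows. This is not part of the actual setup, which uses procmail. The function name `route`, the `trust_spamassassin` switch, and the "inbox" destination are hypothetical, and the sketch assumes spamassassin marks spam with an `X-Spam-Flag: YES` header while bogofilter adds `X-Bogosity`.

```python
from email import message_from_string


def route(raw_message, trust_spamassassin=False):
    """Return the folder a message that both filters have tagged should go to."""
    msg = message_from_string(raw_message)
    bogosity = msg.get("X-Bogosity", "")
    sa_flag = msg.get("X-Spam-Flag", "")

    # Policy 1: bogofilter's verdict alone decides.
    if bogosity.startswith("Yes"):
        return "Mail/Junk"
    # Policy 2 (optional): also honour spamassassin's verdict.
    if trust_spamassassin and sa_flag.strip().upper() == "YES":
        return "Mail/Junk"
    return "inbox"


# A message spamassassin flagged but bogofilter considered legitimate.
sample = (
    "From: someone@example.com\n"
    "Subject: *****SPAM***** Upcoming Webcasts\n"
    "X-Spam-Flag: YES\n"
    "X-Bogosity: No, tests=bogofilter\n"
    "\n"
    "Body text.\n"
)

bogofilter_only = route(sample)
both_filters = route(sample, trust_spamassassin=True)
```

With `trust_spamassassin=False`, bogofilter gets the final say, so a message that spamassassin tagged can still reach the inbox if bogofilter judges it legitimate.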
I haven't had a test case yet, but I am hoping this approach will result in fewer false positives. That is, bogofilter may one day overrule spamassassin if it determines that a tagged message really is legitimate email. If I do get a false positive or a false positive save, I'll make that known.
I am using procmail to do my mail filtering. Eventually, I hope to eliminate the middleman and use bogofilter and spamassassin in my MTA (exim) at SMTP time, so that I never even accept spam and my email address gets taken off lists. I'm waiting on a Debian exim4 package for this. In the interim, I use the following procmail recipe, gleaned from the spamassassin and bogofilter documentation:
Code:
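# Filter messages smaller than 256000 bytes through spamassassin, which tags them.
:0fw
* < 256000
| spamassassin
# Filter every message through bogofilter: -p passes the message through with an
# X-Bogosity header, -u updates the wordlists, -e exits 0 for both spam and ham.
:0fw
| bogofilter -u -e -p
# If bogofilter failed, exit with EX_TEMPFAIL (75) so the mail is requeued, not lost.
:0e
{ EXITCODE=75 HOST }
# Deliver anything bogofilter marked as spam to Mail/Junk, using a lockfile.
:0:
* ^X-Bogosity: Yes, tests=bogofilter
Mail/Junk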
Sometime later I'll update this with further results.
***
http://www.mozilla.org/mailnews/spam.html
http://www.upserve.com
New versions of Mozilla Mail incorporate Bayesian filtering. There's also a Bayesian filter for Outlook called Spammunition.
http://spamassassin.org/
http://spamassassin.org/where.html
Spamassassin is a perl module and can be run on any system that supports perl. Even better, the spamassassin website lists products that have integrated spamassassin support. The list includes a POP3 filter, IMAP filter, Outlook filter, and Eudora filter.
***
Update: I use Comcast's built-in spam filter as my main source of filtering. Bayesian filtering obviously caught on in a big way since this article was first posted. I use Evolution's built-in Junk Filter, but I get most of the junk mail on my Hiptop anyway so it's not terribly useful.