Tuesday, February 13, 2007

DSPAM Installed


At work, we have a Debian Linux box that serves as our mail and web server. Recently, I upgraded that box from Debian Woody to Sarge. Once the upgrade was complete, I decided to try to improve the spam filtering.

The mail server runs qmail as the SMTP server, and I have the mail delivered to local users via procmail. Using procmail allows me to do all kinds of neat filtering on the server side. As an example, I automatically place mailing list traffic into separate IMAP folders. Similarly, mail from customer domains is placed in company-specific IMAP folders, so my incoming email is organized before I ever look at it.

One of the filters that incoming mail passed through was SpamAssassin. When I first installed SpamAssassin more than two years ago, it did a very good job of detecting and flagging spam. However, spammers have become rather sophisticated since then, and many of them now test their spam against SpamAssassin to try to thwart it. While SpamAssassin still detected and flagged a good deal of spam, more and more slipped through as the months went by.

I decided to try out DSPAM after hearing good reports on its performance from a friend, and reading about it on its website. It wasn't available in the Debian repository, so I downloaded the source tarball and then built and installed it.

DSPAM works best if you have a corpus of spam and non-spam (or ham) to train it. If you do not have a good selection (in the thousands), then I would recommend not using it until you do. Once you train DSPAM with emails that are good and bad, then you can put it to work effectively.
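As a rough illustration of the training step, here is a minimal Python sketch that checks whether a corpus is big enough and then lists the commands that would feed it to DSPAM. It is not taken from this setup. The function name `corpus_commands`, the one-message-per-file directory layout, the 1,000-message threshold (one reading of "in the thousands"), and the choice to alternate ham and spam are all assumptions. The `--user`, `--class` and `--source=corpus` flags are the ones DSPAM's README describes for corpus training, but check them against your installed version.

```python
from itertools import zip_longest
from pathlib import Path

# Assumption: "in the thousands" is read here as at least 1,000 messages per class.
MIN_MESSAGES = 1000


def corpus_commands(user, spam_dir, ham_dir, minimum=MIN_MESSAGES):
    """Return (argv, message_path) pairs for training DSPAM from a corpus.

    Assumes each directory holds one message per file. Nothing is executed;
    the caller decides whether to run the commands.
    """
    spam = sorted(p for p in Path(spam_dir).iterdir() if p.is_file())
    ham = sorted(p for p in Path(ham_dir).iterdir() if p.is_file())
    if len(spam) < minimum or len(ham) < minimum:
        raise ValueError(
            f"corpus too small: {len(spam)} spam, {len(ham)} ham, "
            f"need at least {minimum} of each"
        )

    commands = []
    # Alternate ham and spam so neither class dominates the early training.
    for ham_msg, spam_msg in zip_longest(ham, spam):
        for dspam_class, message in (("innocent", ham_msg), ("spam", spam_msg)):
            if message is not None:
                argv = ["dspam", "--user", user,
                        f"--class={dspam_class}", "--source=corpus"]
                commands.append((argv, message))
    return commands
```

Each command expects the message on standard input, so running one means opening the file and passing it as `stdin` to `subprocess.run`.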

DSPAM is not a fire-and-forget spam fighting solution. It requires a certain amount of vigilance on the part of the user to correct messages that were wrongly classified as spam or ham. I set up three folders in each user's IMAP folder hierarchy for this purpose:

  1. Spam/ - This is the folder where DSPAM sends all emails it detects as spam.
  2. Spam/Missed/ - This folder is where users place emails which DSPAM did not detect as spam, but should have.
  3. Spam/NotSpam/ - This folder is where emails are placed which were detected as spam, but are not.

Every night, cron executes a bash script I wrote which crawls through the Missed/ and NotSpam/ directories, correcting DSPAM's mistakes.
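For readers who want a starting point, the nightly correction pass can be sketched roughly as follows. This is not the bash script mentioned above, only a Python sketch of the same idea. It assumes the IMAP folders are stored as Maildir++ directories named `.Spam.Missed` and `.Spam.NotSpam` (your IMAP server may lay them out differently), and the names `CORRECTIONS`, `plan_corrections` and `apply_corrections` are made up for illustration. The `--class` and `--source=error` flags are the ones DSPAM's README describes for retraining on a misclassified message; verify them against your installed version.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping: correction folder -> class DSPAM should have assigned.
# Folder names assume a Maildir++ layout; adjust them for your IMAP server.
CORRECTIONS = {
    ".Spam.Missed": "spam",        # spam that DSPAM let through
    ".Spam.NotSpam": "innocent",   # ham that DSPAM flagged as spam
}


def plan_corrections(maildir, user):
    """Return (argv, message_path) pairs for every message awaiting correction."""
    plan = []
    for folder, dspam_class in CORRECTIONS.items():
        for sub in ("cur", "new"):
            directory = Path(maildir) / folder / sub
            if not directory.is_dir():
                continue
            for message in sorted(directory.iterdir()):
                if message.is_file():
                    argv = ["dspam", "--user", user,
                            f"--class={dspam_class}", "--source=error"]
                    plan.append((argv, message))
    return plan


def apply_corrections(plan, dry_run=True):
    """Print each retraining command, or run it with the message on stdin."""
    for argv, message in plan:
        if dry_run:
            print(" ".join(argv), "<", message)
            continue
        with open(message, "rb") as handle:
            subprocess.run(argv, stdin=handle, check=True)
```

Calling `apply_corrections(plan_corrections("/home/alice/Maildir", "alice"))` from a nightly cron job would print the commands without running them (the path and user name are placeholders). A real script would also need to move each message out of its correction folder after retraining, so the same message is not fed to DSPAM again the next night.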

So, how is it working? After about 3 weeks of use, my accuracy rate is over 95%, and that is without as much initial training as I recommended above. I expect that over the next month or two I will be able to get that accuracy up into the triple nines: 99.9%.

Overall, I am extremely happy with DSPAM's performance. It is a bit tricky to install and get going, but once it is humming along nicely, it purrs.

2 comments:

  1. Awesome! Long as the users keep sorting, it'll keep getting better.

    I've been finding that the spam that comes around SpamAssassin definitely comes in waves. We're in a high bypass section of the wave right now, but I just keep flagging the spam and it gets better pretty quickly on my webmail account. DSpam's currently running on that account as a sort of backup and it catches the majority of the stuff that SpamAssassin misses, but occasionally (maybe 1 in 500, very good, actually) nabs some ham. So it goes...

    HATE spam hate hate.

  2. That's a great setup. I'm looking for exactly the same setup; with DSPAM scanning the MISSED and NotSpam folders to correct itself.

    Could you please post your script somewhere here? Thanks!
