bogofilter and RMAIL
bogofilter is a facility for filtering email, declaring each message to be either spam or ham. But the docs expect a lot of the user. They don’t really hold your hand through a failsafe way to install and use bogofilter; instead they just tell you what it’s all about and set you free to chart your own course. If you use RMAIL to read your mail on a machine where the mail is delivered to you by sendmail (or any program that can incorporate procmail in your delivery chain) then here’s a recipe for getting bogofilter going. It documents the choices I made, though of course other choices are possible.
I read my email on an internet server that I administer; it lives in a commercial data center and is always on and always connected to the internet with a permanent IP address. It runs the Fedora Core 5 (FC5) Linux distribution with Linux kernel 2.6.18. Email addressed to me on this machine is accepted by sendmail-8.13.8. If sendmail sees a .procmailrc file in my home directory, it will pass all my mail to the procmail program (I have procmail 3.22) instead of dropping it directly into my mail box.
When I am logged on with emacs running, I can run the RMAIL program which transfers any messages it finds in my mail box into my RMAIL file.
So my basic plan to get going with bogofilter is to create a .procmailrc file that causes procmail to run bogofilter. If bogofilter decides the message is spam, I just output it to a special mailbox that I will almost never look at. If bogofilter is not sure the message is spam (if it classifies it either as ham or as unsure) then I will have procmail drop the mail in my normal system mail box. This is going on all the time, whether I am logged in or not. When I do log in and run emacs/RMAIL, only messages that are in my normal system mail box get read. Spam just accumulates in the special mail box, and I can glance at it every day or two to see if I am losing legitimate mail to my spam box.
When you first download bogofilter, either from source or from an RPM, it has logic but it has no data. Before it can begin classifying your email as spam or ham, it needs data on your email. So I started sorting my email a week before I installed bogofilter. I’ve been getting around 400 spams a day, and about 30 emails that I want to read. So during this week, I continued to let everything get delivered into my RMAIL file. Then, I copied every single message out of my RMAIL file into one of two files — one file for spam and one file for ham. To do this, I used the Ctrl-O command in RMAIL, which copies the current message to a unix-mbox-format file. So if I was in RMAIL and I typed “Ctrl-O~/spam.mbox
Finally I was ready to install bogofilter. Don’t be afraid to install it; installing it does nothing except make the program available. You have to take additional steps before it will actually start being used. I installed it from an FC5 RPM. Now the ‘bogofilter’ executable exists at /usr/bin/bogofilter. I created my personal database by passing in my already-existing spam.mbox and ham.mbox and telling bogofilter how to classify them. I ran the following while logged in to my normal user account:
$ bogofilter -s -M < ~/spam.mbox
$ bogofilter -n -M < ~/ham.mbox
The -M option means I am passing in a whole mailbox, not just a single message; -s tells bogofilter that all these messages should be considered spam; -n tells bogofilter that all these messages should be considered ham. At the end of this, bogofilter had created my personal word count database in ~/.bogofilter/wordlist.db. Now bogofilter knows what it needs to know to start classifying my email. However, I still haven't done anything dangerous: though bogofilter now knows how to filter my mail, I haven't actually done anything to make it start doing so.
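Before wiring anything into mail delivery, it may be worth checking that the training took. The following is only a sketch under the assumptions above (mbox files in my home directory, database in ~/.bogofilter/wordlist.db). The helper name count_mbox_messages is made up for this example, and the message count is approximate because it simply counts "From " separator lines. The bogoutil step is skipped if the tool or the database is not there.

```shell
#!/bin/sh
# Count messages in a Unix mbox file by counting its "From " separator lines.
# count_mbox_messages is a hypothetical helper, not part of bogofilter.
count_mbox_messages() {
    grep -c '^From ' "$1"
}

# Report how many messages each training file holds (paths as used above).
for box in "$HOME/spam.mbox" "$HOME/ham.mbox"; do
    if [ -f "$box" ]; then
        echo "$box: $(count_mbox_messages "$box") messages"
    fi
done

# .MSG_COUNT is the special token under which bogofilter keeps its totals;
# bogoutil -w prints the spam and ham counts recorded for a given token.
if command -v bogoutil >/dev/null 2>&1 && [ -f "$HOME/.bogofilter/wordlist.db" ]; then
    bogoutil -w "$HOME/.bogofilter/wordlist.db" .MSG_COUNT
fi
```

If the two counts that bogoutil reports are close to the counts for the mbox files, the database has seen roughly what I meant to feed it.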
Now I was ready to create my .procmailrc script to tell procmail how I want to use bogofilter. First, I decided I don't like the -u option in bogofilter. When you specify -u, bogofilter decides whether something is spam or ham, assumes it is right, and then updates its database with all the words it found in that email. If you later decide bogofilter made a mistake, you can correct the database by calling '$ bogofilter -nS' or '$ bogofilter -sN', but I didn't want to mess with that. Maybe it works great in practice but it just rubs me the wrong way. My plan is to rely mostly on the training I already did, and to use only mistakes to train bogofilter in the future. If I find ham in my spam box, I will tell bogofilter it is ham and have it update its DB; likewise, if I find spam in my regular mailbox, I will tell bogofilter it is spam and have it update its DB. I also left alone the config file that the RPM installed in /etc/bogofilter.cf; this config file has everything commented out, so bogofilter will use all defaults.
So with those considerations, I developed the following .procmailrc, based partially on the examples in the bogofilter docs.
--------------------------------------------------------------------------
# Set to yes when debugging
VERBOSE=no
# Remove ## when debugging; set to no if you want minimal logging
## LOGABSTRACT=all
# Replace $HOME/Mail with your message directory
# rules that direct mail to a folder will put the folder in this dir
MAILDIR=$HOME/Mail # Make sure this directory exists!
# Directory for storing procmail-related files
PMDIR=$HOME/.procmail
# Put ## before LOGFILE if you want no logging (not recommended)
LOGFILE=$PMDIR/log
## INCLUDERC=$PMDIR/testing.rc
## INCLUDERC=$PMDIR/lists.rc
## INCLUDERC=$PMDIR/perl.rc
#
# ':0' is just a magic literal introducing a new rule
# lines beginning with * define a condition for selecting mail
# the last line of a rule is the action to take
# flags: f = filter, w = wait and check filter's exit code
:0fw
| bogofilter -e -p
# if bogofilter choked, try again later
# flags: e = error - only executes if previous rule returned an error
# the MTA will retry to deliver it later
# 75 is the value for EX_TEMPFAIL in /usr/include/sysexits.h
:0e
{ EXITCODE=75 HOST }
# OK, anything that gets to here has been classified, and Bogosity header added
:0
* ^X-Bogosity: Spam, tests=bogofilter
spam.mbox
# Most mail will fall through to end, and get delivered to my mail spool
-------------------------------------------------------------------
Saving the .procmailrc is the first moment at which you actually turn on filtering and start diverting putative spam away from RMAIL. I only had to wait a couple of minutes before I saw that ~/Mail/spam.mbox had been created, and when I looked in it, it contained spam! Yay!
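To see what bogofilter makes of one particular message without going through procmail at all, you can feed it a saved message by hand. This is a minimal sketch, not something taken from the bogofilter docs: classify_message is a hypothetical helper name, and it relies on bogofilter's exit status when run without -e (0 for spam, 1 for ham, 2 for unsure, anything else an error). It only reads the database; it does not train.

```shell
#!/bin/sh
# classify_message is a hypothetical helper used only for this sketch.
# It feeds one saved message to bogofilter and names the exit status.
classify_message() {
    status=0
    bogofilter < "$1" || status=$?
    case "$status" in
        0) echo spam ;;
        1) echo ham ;;
        2) echo unsure ;;
        *) echo error ;;
    esac
}

# Example (the file name is an assumption):
#   classify_message "$HOME/test-message.txt"
```

Note that the .procmailrc uses -e, which makes bogofilter exit with 0 for ham as well as spam so that procmail does not treat ordinary mail as a filter failure; the helper leaves -e off on purpose so the three outcomes can be told apart.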
So that's basically the setup I have been running with for about a week now.
As for the results, I am both pleasantly surprised at how well it works and bitterly disappointed that, as well as it works, it doesn't work nearly well enough. With the default config file, bogofilter classifies everything as spam (spamicity > 0.99), ham (spamicity < 0.45 I think) or unsure. So far, about 800 messages have gone to my spam.mbox, and I believe they are all spam. I’ve received a few true personal messages and noticed that they get classified as ham with very low spamicity values. However, I’m still getting about 200 spams a day into my RMAIL. So with the default cutoffs and the little bit of training I’ve done so far, only about half of my spam is being diverted. Some messages with very obvious spam words (like penis and \/iagra) failed to get classified as spam.
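Because the -p option stamps an X-Bogosity header on every message, it is possible to put rough numbers on how mail is being classified by tallying that header in an mbox file. The sketch below is one way to do it; tally_bogosity is a made-up helper name, and it counts every line that starts with "X-Bogosity:", so a header quoted inside a message body would inflate the totals.

```shell
#!/bin/sh
# tally_bogosity is a hypothetical helper: count messages per classification
# (Spam, Ham, Unsure) by reading the X-Bogosity header lines in an mbox file.
tally_bogosity() {
    awk '/^X-Bogosity:/ { sub(/,.*/, "", $2); count[$2]++ }
         END { for (k in count) print k, count[k] }' "$1" | sort
}

# Example, assuming the spam box created by the .procmailrc above:
if [ -f "$HOME/Mail/spam.mbox" ]; then
    tally_bogosity "$HOME/Mail/spam.mbox"
fi
```

Run against the system mail spool before RMAIL empties it, the same helper would show how much of what reaches RMAIL was marked Unsure rather than Ham.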
I’ve defined a keyboard macro that I hit every time I see spam in my RMAIL. It pipes the content of the RMAIL buffer to ‘bogofilter -s’ and then deletes the message from RMAIL. I hope with this ongoing training bogofilter will get better and better at recognizing my spam.
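The same correction can be made from the shell for a message that has been saved to a file, in either direction. This is only a sketch: retrain is a hypothetical helper name. It uses plain -s and -n, which is enough here because without -u the message was never registered in the database in the first place.

```shell
#!/bin/sh
# retrain is a hypothetical helper: register one saved message as spam or ham.
# Usage: retrain spam FILE   or   retrain ham FILE
retrain() {
    case "$1" in
        spam) flag=-s ;;
        ham)  flag=-n ;;
        *)    echo "usage: retrain spam|ham FILE" >&2; return 64 ;;
    esac
    # bogofilter reads the message on standard input and updates the database.
    bogofilter "$flag" < "$2"
}

# Examples (the file names are assumptions):
#   retrain spam "$HOME/missed-spam.txt"
#   retrain ham  "$HOME/rescued-ham.txt"
```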