
Putting the Ease Back into Email

Martin Streicher

Sending, receiving, and reading email are an absolutely essential part of my job. With colleagues, acquaintances, and friends scattered all over the planet, email has replaced the telephone as my preeminent medium for communication and collaboration. Indeed, so much information is exchanged and captured in my email that I'm quite afraid of losing (or misplacing) even a single message.

In this article, I present a script that helps me keep up with my email. The script pulls email from any number of POP servers, creates an easily browsed, unified archive of both incoming and outgoing email, and filters out spam. The script is written in Perl, which makes the script largely independent of platform and much more flexible than procmail. Of course, writing the code in Perl was also fun, and allowed me to leverage an extensive array of existing, email-related CPAN modules. I've been using this script in production for some time, and it's invaluable to me.

Email at My Fingertips

Here are the requirements I had in mind at the outset of developing the code:

Wherever possible, the system must be based on open source software. Open source not only gives me access to the internals of a utility, but open source projects commonly complement existing systems that already work well.
Email must be stored in a widely supported, application-independent format.
The email repository should be stored centrally and served globally. The email should be accessible from anywhere on the planet, and ideally, can be read from a variety of "interfaces", including the browser, the command line, or some graphical, windowed application.
The email system should archive all incoming and outgoing mail, deliver incoming email to the appropriate Inbox, and identify and segregate spam (unsolicited, unwanted email) from ham (legitimate correspondence).

All of these requirements are fairly easy to satisfy with Linux (or Unix) and a handful of software packages. To begin, the ubiquitous mbox file format satisfies the first two requirements. Mbox is a simple, "flat" text file that stores a series of mail messages. It's readily parsed and scanned with any number of tools from cat to grepmail, and mbox is widely supported by almost all open source mail applications. Even some Windows applications export to the format.
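To illustrate how simple the format is, here is a minimal sketch that splits mbox-formatted text into messages by looking for the "From " separator lines. The name split_mbox and the sample data are made up for illustration. Real mbox variants differ in how they quote body lines that begin with "From ", so treat this as a sketch, not a robust parser.

```perl
use strict;
use warnings;

# Split mbox-formatted text into individual messages.
# A new message starts at each line beginning with "From " (the separator line).
sub split_mbox {
    my ($text) = @_;
    my @messages;
    for my $line (split /^/m, $text) {
        push @messages, '' if $line =~ /^From /;   # start a new message
        next unless @messages;                     # skip text before the first separator
        $messages[-1] .= $line;
    }
    return @messages;
}

# Hypothetical sample data: two short messages.
my $mbox = <<'END';
From alice@example.com Wed Oct  1 09:00:00 2003
From: alice@example.com
Subject: First

Hello.

From bob@example.com Wed Oct  1 09:05:00 2003
From: bob@example.com
Subject: Second

Hi there.
END

my @messages = split_mbox($mbox);
print scalar(@messages), " messages\n";
```

Note that the "From:" header lines do not start a new message, because the separator is "From" followed by a space, not a colon.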

The third requirement -- global access and separation between how email is stored and how it is presented -- is also easily satisfied with the Internet Message Access Protocol, or IMAP. IMAP stores email on a central machine, serves email via its own application protocol, and allows "client" email applications to access messages as if they were stored locally on the client machine. (If you've ever used a news reader, think of IMAP like an NNTP for email. More information about the IMAP protocol and IMAP software can be found at http://www.imap.org.)

With the mbox format and IMAP, the only work left was to realize the fourth requirement: process all incoming and outgoing email, remove spam, and file each message into some sort of easy-to-navigate email archive.

At first, I built my system on top of fetchmail and procmail, which worked well, but also presented some problems. Fetchmail was more than adequate for injecting mail from remote servers into the local mail queue -- in fact, fetchmail offers many nice options. Procmail, however, was a problem. Procmail syntax is (insanely) cryptic, and I dislike using it. Additionally, procmail can be expensive: each incoming mail message creates a new procmail process. If a system has many users, all using procmail, things get a bit busy when email arrives. Finally, procmail was useless for processing outgoing mail -- I had to process that mail separately using an altogether different technique.

While my first go-round worked well, it was basically a prototype for the system I use now. Using Perl and many inventive modules from the CPAN, I created something more flexible and functional.

A Brief Schematic

Here's a quick overview of how my new script works. A detailed description of how to set up and use the script is presented in the next section.

1. To use the script, you define one or more email accounts. You can define one email account using just command-line options, or you can edit the code and create a list of accounts. The command-line approach is very useful, because you can mix the use of shell scripts and cron or at to run the script at regular intervals.

In any case, each account must include the usual POP server information -- user name, POP server name, and a password. You must also supply a regular expression that describes your email address. The regular expression defines "you", even if "you" receive email using a variety of aliases. For example, you can send me email at [email protected], [email protected], or [email protected]. All are "me" at Apress, and I want all of those aliases to be recognized as my incoming email.

So, to define a new account on the command line, you'd type:

% perl audit.pl -h pop.somedomain.com -p olives -u martini \
    -e 'martin.*@somedomain\.com$'

Here, martin.* would be sufficient to match all of my Apress aliases. As usual, the single quotes are needed around the argument to -e (the email address) to prevent the shell from interpreting the asterisk and the dollar sign. Although not required, you can also set other account parameters, such as where to store the Inbox, where to keep the archive, and whether you want to filter out spam.
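As a rough illustration of how such a pattern identifies "you", the following sketch tests a few addresses against the regular expression from the command line above. The helper name is_me and the sample addresses are hypothetical; the script itself may apply the pattern differently.

```perl
use strict;
use warnings;

# The pattern from the command line. Single quotes keep Perl from
# interpolating "@somedomain" and from touching the backslash.
my $pattern = 'martin.*@somedomain\.com$';

# Return true if any address in the list matches the "me" pattern.
sub is_me {
    my ($pattern, @addresses) = @_;
    my $re = qr/$pattern/;
    return scalar grep { $_ =~ $re } @addresses;
}

# Hypothetical addresses for illustration.
for my $addr ('martin@somedomain.com',
              'martin.streicher@somedomain.com',
              'amber@somedomain.com') {
    printf "%-35s %s\n", $addr, is_me($pattern, $addr) ? 'me' : 'not me';
}
```

Because the pattern is anchored only at the end, it also matches any address that merely contains "martin" before the domain. Add a leading ^ if you need a stricter match.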

2. Each time the script runs, it checks each email account for new messages. The script reads each mail message and runs the filter associated with that account. If you do not specify your own filter, a default filter is used instead. (More on the default filter in a moment.)

3. The filter makes decisions about what to do with each message. For example, if an email message is addressed to "you" (the To line includes an address that matches your regular expression), it's filed into your inbox. Filters can file into other folders, and can determine whether the message is spam and handle it accordingly.

The default filter should work for a large majority of users, but if it's not suitable, you can write your own.

Setting up the Script

Listings 1 through 3 contain the components of my script. Listing 1 shows audit, the end-user script. Listing 2 shows Mail::Archive::Account, which encapsulates all of the features of an email account. Listing 3 shows Mail::Archive::Manager, a helper object that assists with the creation of accounts. You can download (an extensively annotated version of) the source from the Sys Admin Web site or from:

http://www.flywheelagency.com/download/sysadmin

You should install Mail::Archive::Account and Mail::Archive::Manager in $HOME/lib/Mail/Archive and then go to the CPAN and download and install all of the following modules: Mail::Address, Mail::Audit, Mail::Box::Manager 2.00, Mail::Send, and Mail::SpamAssassin. The standard Perl installation probably has the four other modules you'll need: File::Path, File::Spec, File::Basename, and Getopt::Std.

Once you have installed all of the modules, you're ready to run the script. Here's a command line I use to get my Linux Magazine email:

% audit -h mail.via.net -p XXXXXXX -u mss \
        -e '(mstreicher|editors)@linux-mag\.com$' -i Linux -s Spam

This command reads email from my "Linux Magazine" POP mailbox. The -h is the POP server name; -p is my password; and -u provides my POP "login" name.

The -e pattern matches either of my two linux-mag.com addresses, and -i specifies that all incoming mail with a matching address in the To or Cc line should be filed in a mailbox called "Linux". The -s switch enables spam filtering and places all spam in a mailbox named "Spam".

By default, the inbox and spam mailboxes are created in $HOME/mail. You can change that default directory with -m. For instance, adding -m ~/media/email to the previous command line would create "Linux" and "Spam" in ~/media/email.

After running the command above (and assuming you had some spam), you'd get the following set of directories and mailboxes:

% ls -F $HOME/mail
archive/
Linux
Spam
Junk
log

You already know what "Linux" and "Spam" are for. "Junk" is a catch-all mailbox: if email sent to your account isn't addressed directly to "you" and isn't spam, it's filed away in "Junk". The "log" file is just a record of what happened to each piece of email as it was processed.

Finally, "archive" is a directory that contains your email library. Let's look at what it contains.

% ls -FR ~/mail/archive
archive:
a/
b/
c/
d/
e/
f/
...

archive/a:
aahz
amber_ankerholz
...

Here, the archive contains an individual mailbox for each person who corresponds with me. So, if I receive a new email message from Sys Admin Editor Amber Ankerholz (whose email address is Amber Ankerholz <[email protected]>), her email is automatically filed into Linux (remember, that's this account's incoming mailbox) and her archive mailbox, archive/a/amber_ankerholz. Because I tend to remember people by their first name, the "a" in Amber is used to file her under subfolder a/.

With my archive organized in this fashion, I can find almost any piece of mail in about two clicks of the mouse. I just have to remember the person's first name.
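The mapping from a correspondent's name to an archive mailbox can be sketched as follows. This is only an approximation based on the directory listing above: the function name archive_mailbox and the exact normalization rules (lowercasing, replacing other characters with underscores) are assumptions, not the code in the listings.

```perl
use strict;
use warnings;
use File::Spec;

# Map a correspondent's display name to an archive mailbox path.
# Assumption: lowercase the name, turn runs of other characters into "_",
# and use the first character of the result as the subfolder.
sub archive_mailbox {
    my ($archive, $name) = @_;
    (my $box = lc $name) =~ s/[^a-z0-9]+/_/g;
    $box =~ s/^_+|_+$//g;                  # trim stray underscores
    return unless length $box;             # nothing usable in the name
    return File::Spec->catfile($archive, substr($box, 0, 1), $box);
}

print archive_mailbox('archive', 'Amber Ankerholz'), "\n";
print archive_mailbox('archive', 'aahz'), "\n";
```

On a Unix-like system, the two calls produce archive/a/amber_ankerholz and archive/a/aahz, matching the layout shown in the listing.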

More Examples and Script Fun

Continuing with another example, I can use the command line to read email from another account.

% audit -h pop.mail.yahoo.com -p XXXXXX -u gadget \
        -e 'martin.*@apress\.com$' -i Apress -s Spam

This command line reuses the "Spam" mailbox (and the "Junk" mailbox since I did not override the default with -j), but places incoming email into a mailbox called "Apress". And because I did not specify a different mail directory, all incoming email is stored in the same mail archive as before, $HOME/mail/archive.

So, my email archive, just by choice, contains all mail from all correspondents, independent of how they sent me email. Yes, it mixes my email from both (or more) accounts, but everything is in one place. If you want to keep your email accounts separate, simply specify different directories for each account with the -m switch.

You can also apply my script to existing mailboxes. Just run the script and provide a list of mailboxes as additional arguments. Each message is processed using the exact same set of rules that are applied to incoming mail.

This feature lets you migrate existing mail into a new structure, according to the rules of the filter you create. Better yet, you can also use this system to process your outgoing mail. Simply run this script from time to time over your sent messages.

Here's one more example. Every so often, I run the same script over my outgoing email folder to store my email in the archive, too. For example, if I sent email to Amber, the folder archive/a/amber_ankerholz contains Amber's email to me and my email to her.

Here's a command line to process outgoing mail:

% audit -h mail.via.net -p XXXXXXX -u mss \
        -e '(mstreicher|editors)@linux-mag\.com$' -i Linux -s Spam \
        -d ~/mail/Linux.folder/Sent

The file ~/mail/Linux.folder/Sent contains copies of my outgoing mail. The -d flag says to delete the messages from "Sent" as the script runs. Using the same account information, the script will process each outgoing message to see whether it's from me. If so, the message is filed into the mailbox of each recipient. So, if I send an email message to five people, each person's mailbox in the archive receives a copy of that message.
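For readers curious how command lines like these might be parsed, here is a minimal sketch using Getopt::Std, one of the standard modules the script requires. The option string, the helper name parse_args, and the handling of leftover arguments as mailbox names are assumptions drawn from the examples in this article, not a copy of Listing 1.

```perl
use strict;
use warnings;
use Getopt::Std;

# Parse audit-style arguments. Flags followed by ":" take a value; -d is a
# plain switch. Whatever is left over is treated as a list of mailboxes.
sub parse_args {
    local @ARGV = @_;
    my %opt;
    getopts('h:p:u:e:i:s:m:j:d', \%opt) or die "bad options\n";

    # Host, password, user, and address pattern are all required.
    for my $flag (qw(h p u e)) {
        die "missing required option -$flag\n" unless defined $opt{$flag};
    }

    # Defaults described in the article.
    $opt{m} //= ($ENV{HOME} // '.') . '/mail';
    $opt{i} //= 'Inbox';
    $opt{j} //= 'Junk';

    return (\%opt, [@ARGV]);
}

# Hypothetical host and password; the rest mirrors the command above.
my ($opt, $mailboxes) = parse_args(
    qw(-h mail.example.com -p secret -u mss -e),
    '(mstreicher|editors)@linux-mag\.com$',
    qw(-i Linux -s Spam -d), 'Sent',
);
print "inbox=$opt->{i} spam=$opt->{s} delete=", ($opt->{d} ? 'yes' : 'no'),
      " mailboxes=@$mailboxes\n";
```

Leaving -s undefined when it is not given mirrors the article's behavior, where spam filtering is enabled only when a spam mailbox is named.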

Again, knowing the person's name, and with just a few clicks in a good IMAP-compatible mail reader (I use Mac OS X Mail), I can see all of the email that I ever exchanged with that person.

Rather than run these commands from the shell prompt (which is time-consuming to repeat and dangerous, since command-line options can be seen from ps), I've placed all of the commands into a shell script and simply run that script from cron every few minutes.

I've been using this script since January 2003 without problems. My current email archive contains 517 MB of email (and associated attachments) organized into more than 1200 mailboxes. I still receive hundreds of spam messages per day, but I never see them. And, how much time do I spend manually filing and sorting email messages? None.

The Default Filter

As mentioned above, the script provides a useful, default filter for all accounts. If you don't want to write any code, here's what the filter does.

1. If the email message is from you, the message is filed into each of your recipients' archive folders. If you sent yourself a copy of the message, it's also delivered to the account's inbox. And finally, if you have enabled spam filtering for the account, the default filter teaches SpamAssassin that your email is not spam.

2. If the email message is to you, and the message is not spam, the message is filed into the sender's archive folder and delivered to your inbox. If the message is spam, it's placed in the spam folder. You can then sort through the spam folder to look for false positives and teach SpamAssassin not to make the same mistake again.

3. If the email message is destined for an address in your domain (say, "apress.com") and originated in your domain, the message is delivered to your inbox. This reflects the common case where email is sent to aliases like "all". Otherwise, the message is checked to see whether it's spam. If it's not mail from within your domain and it isn't spam, it's filed away as junk email, and you can read the "Junk" folder to sort things further.

4. If the previous three conditions don't apply, the email is checked to see whether it's spam. If so, it's filed appropriately.

Otherwise, the mail gets dumped into "Junk." If you begin to see a lot of email accumulate in the Junk box, you can refine the regular expression to include more "aliases" or edit the rules that the filter applies. The next section describes how to crank open the code.
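The four rules above can be summarized as a small decision function. This is a sketch of my reading of the rules, not the filter() code from Listing 2: the function name plan_delivery and the hash keys are hypothetical, the facts about each message are assumed to be computed elsewhere, and the SpamAssassin learning step is left out.

```perl
use strict;
use warnings;

# Decide what to do with a message, given a few precomputed facts about it.
# Keys (all hypothetical names): from_me, to_me, to_domain, from_domain, is_spam.
sub plan_delivery {
    my (%m) = @_;

    if ($m{from_me}) {                          # rule 1: outgoing mail
        my @plan = ('archive under each recipient');
        push @plan, 'inbox' if $m{to_me};       # I copied myself
        return @plan;
    }
    if ($m{to_me}) {                            # rule 2: addressed to me
        return $m{is_spam} ? ('spam') : ('archive under sender', 'inbox');
    }
    if ($m{to_domain} && $m{from_domain}) {     # rule 3: internal aliases such as "all"
        return ('inbox');
    }
    return $m{is_spam} ? ('spam') : ('junk');   # rule 4: everything else
}

print join(', ', plan_delivery(to_me => 1)), "\n";
print join(', ', plan_delivery(to_me => 1, is_spam => 1)), "\n";
print join(', ', plan_delivery(from_me => 1, to_me => 1)), "\n";
print join(', ', plan_delivery(to_domain => 1)), "\n";
```

Returning a list of destinations, instead of delivering directly, keeps the decision logic easy to test before it touches any real mailbox.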

Under the Hood

The most important part of the script is the Mail::Archive::Account module (or just Account for short), so let's start there.

"Account" is a Perl class that encapsulates an email account. It has several public methods, including a constructor, new(), getter/setter methods like inbox(), and fetch(), the method that reads email from a POP server and calls a filtering callback. Account's filter() method is the default filtering routine for the email account. Finally, the methods deliver() and file() provide two different ways to store your email. deliver() places a message directly into a specified mailbox. file() stores a message into the archive by finding the correspondent's proper name and filing the message into that person's archive mailbox.

As shown above, when you create a new email account, you must provide four parameters: your POP server user name, your POP server password, the host name of your POP server, and a regular expression that describes your email address. The script dies if these basic pieces of information are omitted.

Here's a simple account created in code:

$account = Mail::Archive::Account->new(
  {
        # Single quotes keep Perl from interpolating "@somedomain"
        address      => 'martin.*@somedomain\.com$',
        user         => "martini",
        password     => "olives",
        host         => "pop.somedomain.com",
  }
);

Because no other parameters are provided, defaults are used for the mail directory, the name of the spam mailbox, etc. If you want to customize the Account, you can provide several other parameters to the constructor.

maildir specifies the name of your mail directory. It is essentially the root of all email for the account: the account's "Inbox," email archive, and other mailboxes are created in this directory. By default, maildir is set to $ENV{HOME}/mail.
inbox is the name of the mailbox where new mail is stored. By default, it's simply "Inbox."
junk is the name of the "junk mail" mailbox. All email that cannot be delivered to any other mailbox ends up there. The default is "Junk."
archive is the name of the root folder for your email archive. The default is "archive."
spam is the name of the mailbox where spam is filed. SpamAssassin spam filtering is turned off by default; specifying a name for this mailbox enables it.
mode controls the permissions of new directories and mailboxes that are created within the account.
log points to a log file that records what email has been received or processed. By default, each account gets its own log file named "log".

Also, maildir, inbox, spam, junk, archive, and log can either be file names or fully qualified path names. If you don't provide fully qualified path names, everything is created relative to maildir.
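The rule for relative versus fully qualified names can be sketched with File::Spec, one of the standard modules the script requires. The helper name resolve_path and the sample directory are hypothetical; the module's own implementation may differ.

```perl
use strict;
use warnings;
use File::Spec;

# Resolve a mailbox or log name against maildir unless it is already
# a fully qualified path.
sub resolve_path {
    my ($maildir, $name) = @_;
    return $name if File::Spec->file_name_is_absolute($name);
    return File::Spec->catfile($maildir, $name);
}

my $maildir = '/home/dragon/dragon';           # hypothetical directory
print resolve_path($maildir, 'Spam'), "\n";
print resolve_path($maildir, '/home/dragon/log/dragonmail'), "\n";
```

The first call lands inside maildir, while the second is returned unchanged because it is already absolute.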

Here's another example of creating a new account.

$account = Mail::Archive::Account->new(
  {
        user         => "dragon",
        # The address pattern; single quotes prevent interpolation of "@yahoo"
        address      => 'puffmagicdragon@yahoo\.com$',
        host         => "pop.mail.yahoo.com",
        password     => "fire12345",
        spam         => "Spam",
        inbox        => "Dragon",
        junk         => "Junkmail",
        archive      => "tomb",
        maildir      => "$ENV{HOME}/dragon",
        log          => "$ENV{HOME}/log/dragonmail",
  }
);

In operation, this account would create the following files and directories:

% ls -F ~/dragon
Dragon
Spam
Junkmail
tomb/

% ls -F ~/log
dragonmail

This command line does exactly the same thing as the code above:

% perl audit.pl -u dragon -e 'puffmagicdragon@yahoo\.com$' \
    -h pop.mail.yahoo.com -p fire12345 -m ~/dragon \
    -l ~/log/dragonmail -i Dragon -j Junkmail -s Spam -a tomb

Once an Account is defined, you call its fetch() method -- as in $account->fetch() -- to read email from that account. During fetch(), each incoming mail message is read from the POP server and filtered.
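To give a feel for what a fetch-and-filter loop involves, here is a minimal sketch written against the interface of Net::POP3, which ships with Perl. It is an assumption that a loop of this shape resembles what fetch() does; the function name process_pop and the safe option are illustrative, and the listings may use a different POP module.

```perl
use strict;
use warnings;

# Run a filter over every message in a POP mailbox.
# $pop is expected to behave like a logged-in Net::POP3 object:
#   list()      returns a hash reference of message number => size,
#   get($n)     returns an array reference of the message's lines,
#   delete($n)  marks a message for deletion when the session ends.
sub process_pop {
    my ($pop, $filter, %opt) = @_;
    my $sizes = $pop->list or return 0;        # error talking to the server
    my $handled = 0;

    for my $n (sort { $a <=> $b } keys %$sizes) {
        my $lines = $pop->get($n) or next;     # skip messages that fail to download
        my $ok = $filter->(join '', @$lines);
        next unless $ok;                       # filter failed: leave it on the server
        $handled++;
        $pop->delete($n) unless $opt{safe};    # safe mode keeps the server copy
    }
    return $handled;
}

# Typical use (not run here, since it needs a real server):
#   use Net::POP3;
#   my $pop = Net::POP3->new('pop.somedomain.com') or die "cannot connect\n";
#   defined $pop->login('martini', 'olives') or die "login failed\n";
#   process_pop($pop, sub { print $_[0]; 1 }, safe => 1);
#   $pop->quit;
```

A message is deleted from the server only after the filter reports success, so a failed delivery never costs you the original.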

Writing a New Filter

If the default filter doesn't do what you'd like it to do, you can simply write your own and call fetch() with a code reference.

$account->fetch(\&myfilter);

Then, for each mail message found, myfilter() is called with two parameters: the Account being used and a Mail::Audit object. You can use the getter methods of each type of object to access individual fields.

Let's look at a very simple filter that prints some diagnostics, checks whether the email message is spam, and files the message into the correct mailbox.

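# Custom filter: called once per message with the Account and a Mail::Audit object.
sub myfilter($$) {
    my $account = shift;
    my $item = shift;

    print "In myfilter...\n";
    print "From: ", $item->from, "\n";
    print "To: ", $item->to, "\n";

    # eval traps any exception; if one occurs, the filter returns undef.
    return eval {
        if ($account->isspam($item)) {
            # Spam goes to the account's spam mailbox.
            $account->deliver($item, $account->spam());
        } else {
            print "Dropping in ", $account->inbox(), "\n";
            # With no mailbox argument, deliver() uses the account inbox.
            $account->deliver($item);
        }
    };
}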

$account->isspam() yields a boolean that indicates whether the message in $item is spam. $account->spam() returns the name of the mailbox for spam, and $account->inbox() gives the name of the inbox. The deliver() method requires at least one argument, the mail message. If you don't specify a second argument to deliver() -- the name of a mailbox -- the message is placed in the account inbox.

You'll notice that both the default filter() code and this code return a value to the caller. This allows the caller to do something with the email in case the filter fails.
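The following sketch shows why wrapping a filter's work in eval is useful to the caller. The helper name run_filter and the two toy filters are hypothetical. The point is only that a block which dies inside eval yields undef, which the caller can detect and act on.

```perl
use strict;
use warnings;

# Call a filter and report failure instead of dying.
# A filter that wraps its work in eval returns undef if anything inside died.
sub run_filter {
    my ($filter, @args) = @_;
    my $ok = $filter->(@args);
    warn "filter failed: $@" unless $ok;
    return $ok ? 1 : 0;
}

# Two toy filters: one succeeds, one dies partway through.
my $good = sub { return eval { 1 } };
my $bad  = sub { return eval { die "disk full\n"; 1 } };

print "good: ", run_filter($good), "\n";
print "bad:  ", run_filter($bad), "\n";
```

A caller built this way could, for example, leave a message on the server when the filter reports failure.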

Quick Tips for Hacking the Script

If you just want to jump in and hack on the code, here are some quick tips to help you refine your script.

Since the ultimate goal of this script is to safely read and archive your email, I've provided a "safe" mode that won't delete any of your original email. Define your accounts and set the option "safe => 1" to prevent the script from deleting your mail from the POP server.
To build a complete archive, you should define all of your email accounts. However, if you don't want the script to read mail from a certain account, just set "skip => 1" for each account that you want to skip. (Of course, the previous tip also helps maintain the status quo until you're certain the script is working.)
If you're processing existing mailboxes, avoid using the -d flag, which deletes a message from the mailbox after it's been audited. Without the -d flag, the script leaves the original mailbox intact and simply makes a copy of the message for the new archive. Once you're happy with the new archive, you can back up and then delete your old mailbox files.
By default, the script builds one archive for all accounts. I've found this beneficial because it unifies my mail archive independent of what account was used to send or receive the email. If you prefer to keep one archive per account, simply set a new maildir or a new archive for each account.
If you don't like the way Mail::Archive::Account::file() works, just write your own and call it from your own filter() routine. At one point, I had a file() routine that kept the archive organized by domain names -- but switched to the new scheme of storing by proper names because it was easier to use. I'm sure you'll think of other schemes.
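As one example of an alternate scheme, the sketch below picks an archive mailbox by domain name. The layout (archive/domain/user) and the function name mailbox_by_domain are purely hypothetical. The earlier domain-based routine is not shown in this article, so this is only one way such a scheme might look.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical alternative to file(): choose an archive mailbox by domain.
sub mailbox_by_domain {
    my ($archive, $address) = @_;
    my ($user, $domain) = $address =~ /^([^\@\s]+)\@([^\@\s]+)$/
        or return;                             # not a plain user@domain address
    return File::Spec->catfile($archive, lc($domain), lc($user));
}

print mailbox_by_domain('archive', 'Amber@Example.com'), "\n";
```

On a Unix-like system this prints archive/example.com/amber. You would call a routine like this from your own filter in place of the account's file() method.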

I hope you find this script useful. It certainly saves me time, energy, and sanity. I'll never worry about losing or misplacing email again.

Martin Streicher graduated from Purdue University with a Masters Degree in Computer Science. He's been a programmer, producer, and executive producer, and is currently the Editor in Chief of Linux Magazine and the Editorial Director for Open Source books at Apress Books. You can reach Martin at [email protected] or [email protected].