Putting the Ease Back into Email
Martin Streicher
Sending, receiving, and reading email are an absolutely essential
part of my job. With colleagues, acquaintances, and friends scattered
all over the planet, email has replaced the telephone as my preeminent
medium for communication and collaboration. Indeed, so much information
is exchanged and captured in my email that I'm quite afraid of losing
(or misplacing) even a single message.
In this article, I present a script that helps me keep up with
my email. The script pulls email from any number of POP servers,
creates an easily browsed, unified archive of both incoming and
outgoing email, and filters out spam. The script is written in Perl,
which makes the script largely independent of platform and much
more flexible than procmail. Of course, writing the code in Perl
was also fun, and allowed me to leverage an extensive array of existing,
email-related CPAN modules. I've been using this script in production
for some time, and it's invaluable to me.
Email at My Fingertips
Here are the requirements I had in mind at the outset of developing
the code:
- Wherever possible, the system must be based on open source
software. Open source not only gives me access to the internals
of a utility, but open source projects commonly complement existing
systems that already work well.
- Email must be stored in a widely supported, application-independent
format.
- The email repository should be stored centrally and served
globally. The email should be accessible from anywhere on the
planet, and ideally, can be read from a variety of "interfaces",
including the browser, the command line, or some graphical, windowed
application.
- The email system should archive all incoming and outgoing mail,
deliver incoming email to the appropriate Inbox, and identify
and segregate spam (unsolicited, unwanted email) from ham (legitimate
correspondence).
All of these requirements are fairly easy to satisfy with Linux
(or Unix) and a handful of software packages. To begin, the ubiquitous
mbox file format satisfies the first two requirements. Mbox is a
simple, "flat" text file that stores a series of mail messages.
It's readily parsed and scanned with any number of tools from cat
to grepmail, and mbox is widely supported by almost all open source
mail applications. Even some Windows applications export to the
format.
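To illustrate how little machinery the format needs, here is a minimal Perl sketch that splits an mbox into messages by looking for the "From " separator line that opens each one. The sample data and the split_mbox helper are hypothetical, and the sketch ignores refinements such as ">From" quoting inside message bodies.

```perl
use strict;
use warnings;

# Hypothetical sample: two messages in mbox format. Each message
# starts with a "From " separator line (note: no colon).
my $mbox = <<'END';
From amber@example.com Mon Jan  6 10:00:00 2003
From: Amber <amber@example.com>
Subject: Deadline

Copy is due Friday.

From aahz@example.com Tue Jan  7 11:30:00 2003
From: Aahz <aahz@example.com>
Subject: Re: Deadline

Noted.
END

# Split the text into messages at each "From " separator line.
sub split_mbox {
    my ($text) = @_;
    my @messages;
    for my $line (split /^/m, $text) {
        # A separator line opens a new message.
        push @messages, '' if $line =~ /^From /;
        # Skip anything that precedes the first separator.
        next unless @messages;
        $messages[-1] .= $line;
    }
    return @messages;
}

my @messages = split_mbox($mbox);
printf "%d messages\n", scalar @messages;
for my $message (@messages) {
    my ($subject) = $message =~ /^Subject: (.*)$/m;
    print "Subject: $subject\n";
}
```

The "From:" header is left alone because the separator test requires a space after "From", which is the same distinction that tools such as grepmail rely on.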
The third requirement -- global access and separation between
how email is stored and how it is presented -- is also easily satisfied
with the Internet Message Access Protocol, or IMAP. IMAP stores
email on a central machine, serves email via its own application
protocol, and allows "client" email applications to access messages
as if they were stored locally on the client machine. (If you've
ever used a news reader, think of IMAP like an NNTP for email. More
information about the IMAP protocol and IMAP software can be found
at: http://www.imap.org).
With the mbox format and IMAP, the only work left was to realize
the fourth requirement: process all incoming and outgoing email,
remove spam, and file each message into some sort of easy-to-navigate
email archive.
At first, I built my system on top of fetchmail and procmail,
which worked well, but also presented some problems. Fetchmail was
more than adequate for injecting mail from remote servers into the
local mail queue -- in fact, fetchmail offers many nice options.
Procmail, however, was a problem. Procmail syntax is (insanely)
cryptic, and I dislike using it. Additionally, procmail can be expensive:
each incoming mail message creates a new procmail process. If a
system has many users, all using procmail, things get a bit busy
when email arrives. Finally, procmail was useless for processing
outgoing mail -- I had to process that mail separately using an
altogether different technique.
While my first go-round worked well, it was basically a prototype
for the system I use now. Using Perl and many inventive modules
from the CPAN, I created something more flexible and functional.
A Brief Schematic
Here's a quick overview of how my new script works. A detailed
description of how to set up and use the script is presented in
the next section.
1. To use the script, you define one or more email accounts. You
can define one email account using just command-line options, or
you can edit the code and create a list of accounts. The command-line
approach is very useful, because you can mix the use of shell scripts
and cron or at to run the script at regular intervals.
In any case, each account must include the usual POP server information
-- user name, POP server name, and a password. You must also supply
a regular expression that describes your email address. The regular
expression defines "you", even if "you" receive email using a variety
of aliases. For example, you can send me email at martin@apress.com,
martin_streicher@apress.com, or martin.streicher@apress.com. All
are "me" at Apress, and I want all of those aliases to be recognized
as my incoming email.
So, to define a new account on the command line, you'd type:
% perl audit.pl -h pop.somedomain.com -p olives -u martini \
-e 'martin.*@somedomain\.com$'
Here, martin.* would be sufficient to match all of my Apress
aliases. As usual, the single quotes are needed around the argument
to -e (the email address) to prevent the shell from interpreting
the asterisk and the dollar sign. Although not required, you can also
set other account parameters, such as where to store the Inbox, where
to keep the archive, and whether you want to filter out spam.
2. Each time the script runs, it checks each email account for
new messages. The script reads each mail message and runs the filter
associated with that account. If you do not specify your own filter,
a default filter is used instead. (More on the default filter in
a moment.)
3. The filter makes decisions about what to do with each message.
For example, if an email message is addressed to "you" (the To line
includes an address that matches your regular expression), it's
filed into your inbox. Filters can file into other folders, and
can determine whether the message is spam and handle it accordingly.
The default filter should work for a large majority of users,
but if it's not suitable, you can write your own.
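The role of the address pattern in step 1 can be sketched in a few lines of Perl. This is only an illustration of the matching idea, not the code in the listings; the is_me helper and the test addresses are hypothetical.

```perl
use strict;
use warnings;

# The pattern as it would be passed to -e. Single quotes keep Perl
# from treating "@apress" as an array.
my $pattern = 'martin.*@apress\.com$';
my $me      = qr/$pattern/;

# Hypothetical helper: does this address count as "me"?
sub is_me {
    my ($address) = @_;
    return $address =~ $me ? 1 : 0;
}

for my $address (qw(
    martin@apress.com
    martin_streicher@apress.com
    martin.streicher@apress.com
    editors@apress.com
    martin@apress.com.example.org
)) {
    my $verdict = is_me($address) ? 'me' : 'someone else';
    print "$address: $verdict\n";
}
```

The first three addresses match. The last one does not, because the trailing dollar sign anchors the pattern to the end of the address.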
Setting up the Script
Listings 1 through 3 contain the components of my script. Listing
1 shows audit, the end-user script. Listing 2 shows Mail::Archive::Account,
which encapsulates all of the features of an email account. Listing
3 shows Mail::Archive::Manager, a helper object that assists with
the creation of accounts. You can download (an extensively annotated
version of) the source from the Sys Admin Web site or from:
http://www.flywheelagency.com/download/sysadmin
You should install Mail::Archive::Account and Mail::Archive::Manager
in $HOME/lib/Mail/Archive and then go to the CPAN and download and
install all of the following modules: Mail::Address, Mail::Audit,
Mail::Box::Manager 2.00, Mail::Send, and Mail::SpamAssassin. The standard
Perl installation probably has the four other modules you'll need:
File::Path, File::Spec, File::Basename, and Getopt::Std.
Once you have installed all of the modules, you're ready to run
the script. Here's a command line I use to get my Linux Magazine
email:
% audit -h mail.via.net -p XXXXXXX -u mss \
-e '(mstreicher|editors)@linux-mag\.com$' -i Linux -s Spam
This command reads email from my "Linux Magazine" POP mailbox. The
-h is the POP server name; -p is my password; and -u
provides my POP "login" name.
The -i specifies that all incoming mail that has "mstreicher@linux-mag.com"
or "editors@linux-mag.com" in the To or Cc line should
be filed in a mailbox called "Linux". The -s switch enables
spam filtering and places all spam in a mailbox named "Spam".
By default, the "Inbox" and "Spam" mailboxes are created in $HOME/mail.
You can change that default directory with -m. For instance,
running the previous command line with the additional arguments -m
~/media/email would create "Linux" and "Spam" in ~/media/email.
After running the command above (and assuming you had some spam),
you'd get the following set of directories and mailboxes:
% ls -F $HOME/mail
archive/
Linux
Spam
Junk
log
You already know what "Linux" and "Spam" are for. "Junk" is a catch-all
mailbox: if email sent to your account isn't directly to "you" and
isn't spam, it's filed away in "Junk". The "log" file is just a record
of what happened to each piece of email as it was processed.
Finally, "archive" is a directory that contains your email library.
Let's look at what it contains.
% ls -FR ~/mail/archive
archive:
a/
b/
c/
d/
e/
f/
...
archive/a:
aahz
amber_ankerholz
...
Here, the archive contains an individual mailbox for each person who
corresponds with me. So, if I receive a new email message from Sys
Admin Editor Amber Ankerholz (whose email address is Amber
Ankerholz <aankerholz@cmp.com>), her email is automatically
filed into Linux (remember, that's this account's incoming mailbox)
and into her archive mailbox, archive/a/amber_ankerholz. Because
I tend to remember people by their first name, the "a" in Amber is
used to file her under subfolder a/.
With my archive organized in this fashion, I can find almost any
piece of mail in about two clicks of the mouse. I just have to remember
the person's first name.
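The mapping from a correspondent to an archive mailbox can be sketched as follows. This is a guess at the idea, not the file() method from Listing 2: the archive_path helper is hypothetical, and it uses plain regular expressions rather than a full address parser to stay self-contained.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: map a correspondent to an archive mailbox.
# Assumes a "Proper Name <user@host>" style address and falls back
# to the local part of the address when no proper name is present.
sub archive_path {
    my ($archive, $from) = @_;
    my $name = '';
    if ($from =~ /^\s*"?([^"<]+?)"?\s*<[^>]+>/) {
        $name = $1;                  # e.g. "Amber Ankerholz"
    } elsif ($from =~ /([^\s<>\@]+)\@/) {
        $name = $1;                  # local part of a bare address
    }
    $name = lc $name;
    $name =~ s/[^a-z0-9]+/_/g;       # spaces, dots, etc. become "_"
    $name =~ s/^_+|_+$//g;           # trim stray underscores
    return unless length $name;
    # The first letter of the name selects the subfolder.
    return File::Spec->catfile($archive, substr($name, 0, 1), $name);
}

print archive_path('archive', 'Amber Ankerholz <aankerholz@cmp.com>'), "\n";
print archive_path('archive', 'aahz@example.com'), "\n";
```

On a Unix system, the two calls yield archive/a/amber_ankerholz and archive/a/aahz, matching the layout shown above.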
More Examples and Script Fun
Continuing with another example, I can use the command line to
read email from another account.
% audit -h pop.mail.yahoo.com -p XXXXXX -u gadget \
-e 'martin.*@apress\.com$' -i Apress -s Spam
This command line reuses the "Spam" mailbox (and the "Junk" mailbox
since I did not override the default with -j), but places incoming
email into a mailbox called "Apress". And because I did not specify
a different mail directory, all incoming email is stored in the same
mail archive as before, $HOME/mail/archive.
So, my email archive, just by choice, contains all mail from all
correspondents, independent of how they sent me email. Yes, it mixes
my email from both (or more) accounts, but everything is in one
place. If you want to keep your email accounts separate, simply
specify different directories for each account with the -m
switch.
You can also apply my script to existing mailboxes. Just
run the script and provide a list of mailboxes as additional arguments.
Each message is processed using the exact same set of rules that
are applied to incoming mail.
This feature lets you migrate existing mail into a new structure,
according to the rules of the filter you create. Better yet, you
can also use this system to process your outgoing mail. Simply run
this script from time to time over your sent messages.
Here's one more example. Every so often, I run the same script
over my outgoing email folder to store my email in the archive,
too. For example, if I sent email to Amber, the folder archive/a/amber_ankerholz
contains Amber's email to me and my email to her.
Here's a command line to process outgoing mail:
% audit -h mail.via.net -p XXXXXXX -u mss \
-e '(mstreicher|editors)@linux-mag\.com$' -i Linux -s Spam \
-d ~/mail/Linux.folder/Sent
The file ~/mail/Linux.folder/Sent contains copies of my outgoing
mail. The -d flag says to delete the messages from "Sent" as
the script runs. Using the same account information, the script will
process each outgoing message to see whether it's from me.
If so, the message is filed into the mailbox of each recipient. So,
if I send an email message to five people, each person's mailbox in
the archive receives a copy of that message.
Again, knowing the person's name, and with just a few clicks in
a good IMAP-compatible mail reader (I use Mac OS X Mail), I can
see all of the email that I ever exchanged with that person.
Rather than run these commands from the shell prompt (which is
time-consuming to repeat and dangerous, since command-line options
can be seen from ps), I've placed all of the commands into
a shell script and simply run that script from cron every
few minutes.
I've been using this script since January 2003 without problems.
My current email archive contains 517 MB of email (and associated
attachments) organized into more than 1200 mailboxes. I still receive
hundreds of spam messages per day, but I never see them. And, how
much time do I spend manually filing and sorting email messages?
None.
The Default Filter
As mentioned above, the script provides a useful, default filter
for all accounts. If you don't want to write any code, here's what
the filter does.
1. If the email message is from you, the message is filed
into each of your recipients' archive folders. If you sent yourself
a copy of the message, it's also delivered to the account's inbox.
And finally, if you have enabled spam filtering for the account,
the default filter teaches SpamAssassin that your email is not
spam.
2. If the email message is to you, and the message is not
spam, the message is filed into the sender's archive folder and
delivered to your inbox. If the message was spam, then it's placed
in the spam folder. You can then sort through the spam folder to
see if any false positives were made and teach SpamAssassin not
to make the same mistake again.
3. If the email message is destined for an address in your domain
(say, "apress.com") and originated in your domain, the message
is delivered to your inbox. This reflects the common case where
email is sent to aliases like "all". Otherwise, the message is checked
to see whether it's spam. If it's not mail from within your domain
and it isn't spam, it's filed away as junk email, and you can read
the "Junk" folder to sort things further.
4. If the previous three conditions don't apply, the email is
checked to see whether it's spam. If so, it's filed appropriately.
Otherwise, the mail gets dumped into "Junk." If you begin to see
a lot of email accumulate in the Junk box, you can refine the regular
expression to include more "aliases" or edit the rules that the
filter applies. The next section describes how to crank open the
code.
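The order of the four rules above can be sketched as a small routing function. This is a simplified stand-in, not the filter() method from Listing 2: the route function, the plain-hash message, and the destination labels are all hypothetical, the spam verdict is passed in rather than computed, and the SpamAssassin training step from rule 1 is left out.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the default filter's decision order.
# $msg is a plain hash: from => address, to => [addresses], spam => 0|1.
# Returns a list of destinations instead of touching any mailbox.
sub route {
    my ($msg, $me, $domain) = @_;
    my $in_domain = qr/\@\Q$domain\E$/i;
    my $from_me   = $msg->{from} =~ $me;
    my $to_me     = grep { $_ =~ $me } @{ $msg->{to} };
    my $internal  = $msg->{from} =~ $in_domain
                    && grep { $_ =~ $in_domain } @{ $msg->{to} };

    # Rule 1: mail from me goes to each recipient's archive folder.
    # Assumption: my own address is not archived as a recipient.
    if ($from_me) {
        my @dest = map { "archive:$_" } grep { $_ !~ $me } @{ $msg->{to} };
        push @dest, 'inbox' if $to_me;    # I sent myself a copy
        return @dest;
    }
    # Rule 2: mail to me is either spam, or archived and delivered.
    if ($to_me) {
        return $msg->{spam} ? ('spam') : ("archive:$msg->{from}", 'inbox');
    }
    # Rule 3: mail within my domain (aliases such as "all").
    return ('inbox') if $internal;
    # Rule 4: everything else is spam or junk.
    return $msg->{spam} ? ('spam') : ('junk');
}

my $me    = qr/martin.*\@apress\.com$/;
my @cases = (
    { from => 'martin@apress.com',  to => ['amber@example.com'], spam => 0 },
    { from => 'amber@example.com',  to => ['martin@apress.com'], spam => 0 },
    { from => 'boss@apress.com',    to => ['all@apress.com'],    spam => 0 },
    { from => 'offers@example.net', to => ['list@example.org'],  spam => 1 },
);
for my $msg (@cases) {
    print "$msg->{from} -> ", join(', ', route($msg, $me, 'apress.com')), "\n";
}
```

Returning destinations rather than delivering directly keeps the rule order easy to test before any mailbox is modified.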
Under the Hood
The most important part of the script is the Mail::Archive::Account
module (or just Account for short), so let's start there.
"Account" is a Perl class that encapsulates an email account.
It has several public methods, including a constructor, new(), getter/setter
methods like inbox(), and fetch(), the method that reads email from
a POP server and calls a filtering callback. Account's filter()
method is the default filtering routine for the email account. Finally,
the methods deliver() and file() provide two different ways to store
your email. deliver() places a message directly into a specified
mailbox. file() stores a message into the archive by finding the
correspondent's proper name and filing the message into that person's
archive mailbox.
As shown above, when you create a new email account, you must
provide four parameters: your POP server user name, your POP server
password, the host name of your POP server, and a regular expression
that describes your email address. The script dies if these basic
pieces of information are omitted.
Here's a simple account created in code:
$account = Mail::Archive::Account->new(
    {
        # Single quotes: "@somedomain" must not be interpolated.
        address  => 'martin.*@somedomain\.com$',
        user     => "martini",
        password => "olives",
        host     => "pop.somedomain.com",
    }
);
Because no other parameters are provided, defaults are used for the
mail directory, the name of the spam mailbox, etc. If you want to
customize the Account, you can provide several other parameters to
the constructor.
- maildir specifies the name of your mail directory. It is
essentially the root of all email for the account: the account's
"Inbox," email archive, and other mailboxes are created in this
directory. By default, maildir is set to $ENV{HOME}/mail.
- inbox is the name of the mailbox where new mail is stored.
By default, it's simply "Inbox."
- junk is the name of the "junk mail" mailbox. All email that
cannot be delivered to any other mailbox ends up there. The
default is "Junk."
- archive is the name of the root folder for your email archive.
The default is "archive."
- spam is the name of the mailbox where spam is filed. Spam
filtering is off by default; specifying a name for this mailbox
enables SpamAssassin filtering for the account.
- mode controls the permissions of new directories and mailboxes
that are created within the account.
- log points to a log file that records what email has been received
or processed. By default, each account gets its own log file named
"log".
Also, maildir, inbox, spam, junk, archive, and log can either
be file names or fully qualified path names. If you don't provide
fully qualified path names, everything is created relative to maildir.
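The relative-versus-absolute rule can be sketched with the standard File::Spec module. The resolve helper and the sample paths are hypothetical; the actual module may implement the rule differently.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: resolve a mailbox or log name against maildir
# unless it is already a fully qualified path.
sub resolve {
    my ($maildir, $name) = @_;
    return $name if File::Spec->file_name_is_absolute($name);
    return File::Spec->catfile($maildir, $name);
}

my $maildir = '/home/puff/dragon';    # hypothetical maildir
print resolve($maildir, 'Dragon'), "\n";                     # relative name
print resolve($maildir, '/home/puff/log/dragonmail'), "\n";  # absolute path
```

Note that a leading "~" is expanded by the shell, not by Perl, so a path written in code should use $ENV{HOME} instead.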
Here's another example of creating a new account.
$account = Mail::Archive::Account->new(
    {
        user     => "dragon",
        # Single quotes: "@yahoo" must not be interpolated.
        address  => 'puffmagicdragon@yahoo\.com$',
        host     => "pop.mail.yahoo.com",
        password => "fire12345",
        spam     => "Spam",
        inbox    => "Dragon",
        junk     => "Junkmail",
        archive  => "tomb",
        maildir  => "$ENV{HOME}/dragon",
        log      => "$ENV{HOME}/log/dragonmail",
    }
);
In operation, this account would create the following files and directories:
% ls -F ~/dragon
Dragon
Spam
Junkmail
tomb/
% ls -F ~/log
dragonmail
This command line does exactly the same thing as the code above:
% perl audit.pl -e 'puffmagicdragon@yahoo\.com$' -u dragon \
-h pop.mail.yahoo.com -p fire12345 -m ~/dragon \
-l ~/log/dragonmail -i Dragon -j Junkmail -s Spam -a tomb
Once an Account is defined, you call its fetch() method -- as in $account->fetch()
-- to read email from that account. During fetch(), each incoming
mail message is read from the POP server and filtered.
Writing a New Filter
If the default filter doesn't do what you'd like it to do, you
can simply write your own and call fetch() with a code reference.
$account->fetch(\&myfilter);
Then, for each mail message found, myfilter() is called with two parameters:
the Account being used and a Mail::Audit object. You can use the getter
methods of each type of object to access individual fields.
Let's look at a very simple filter that prints some diagnostics,
checks whether the email message is spam, and files the message
into the correct mailbox.
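sub myfilter($$) {
    my $account = shift;    # the Mail::Archive::Account in use
    my $item    = shift;    # the Mail::Audit object for this message

    # Print some diagnostics.
    print "In myfilter...\n";
    print "From: ", $item->from, "\n";
    print "To: ", $item->to, "\n";

    # Return the result of the eval so the caller can detect failure.
    return eval {
        if ($account->isspam($item)) {
            $account->deliver($item, $account->spam());
        } else {
            print "Dropping in ", $account->inbox(), "\n";
            $account->deliver($item);
        }
    };
}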
$account->isspam() yields a boolean that indicates whether
the message in $item is spam. $account->spam() returns the
name of the mailbox for spam, and $account->inbox() gives
the name of the inbox. The deliver() method always expects one argument,
the mail message. If you don't specify a second argument to deliver
-- the name of a mailbox -- the message is placed in the account inbox.
You'll notice that the default filter() code and this code both return
something to the caller. This allows the caller to do something
with the email in case the filter fails.
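How a caller might use that return value can be sketched as follows. The run_filter wrapper and the two sample filters are hypothetical, and the sketch assumes that a successful filter returns a true value, which depends on what deliver() returns in the real module.

```perl
use strict;
use warnings;

# Hypothetical caller: run a filter and fall back when it fails.
# A filter built around "return eval { ... }" yields undef if anything
# inside the block dies, and $@ then holds the error message.
sub run_filter {
    my ($filter, @args) = @_;
    my $result = $filter->(@args);
    return 'handled' if $result;
    my $error = $@ || "filter returned a false value\n";
    warn "filter failed: $error";
    return 'fallback';    # e.g. leave the message on the server
}

# Two stand-in filters: one succeeds, one dies partway through.
my $good = sub { return eval { 1 } };
my $bad  = sub { return eval { die "mailbox is locked\n"; 1 } };

print run_filter($good), "\n";
print run_filter($bad),  "\n";
```

The point of the pattern is that a failure inside the filter does not kill the whole run; the caller decides what happens to the message.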
Quick Tips for Hacking the Script
If you just want to jump in and hack on the code, here are some
quick tips to help you refine your script.
- Since the ultimate goal of this script is to safely read and
archive your email, I've provided a "safe" mode that won't delete
any of your original email. Define your accounts and set the option
"safe => 1" to prevent the script from deleting your mail from
the POP server.
- To build a complete archive, you should define all of your
email accounts. However, if you don't want the script to read
mail from a certain account, just set "skip => 1" for each
account that you want to skip. (Of course, the previous tip also
helps maintain the status quo until you're certain the script
is working.)
- If you're processing existing mailboxes, avoid using the -d
flag, which deletes a message from the mailbox after it's been
audited. Without the -d flag, the script leaves the original
mailbox intact and simply makes a copy of the message for the
new archive. Once you're happy with the new archive, you can back
up and then delete your old mailbox files.
- By default, the script builds one archive for all accounts.
I've found this beneficial because it unifies my mail archive
independent of what account was used to send or receive the email.
If you prefer to keep one archive per account, simply set a new
maildir or a new archive for each account.
- If you don't like the way Mail::Archive::Account::file() works,
just write your own and call it from your own filter() routine.
At one point, I had a file() routine that kept the archive organized
by domain names, but I switched to the new scheme of storing by
proper names because it was easier to use. I'm sure you'll think
of other schemes.
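As one illustration of an alternative filing scheme, the domain-based layout mentioned in the last tip could be sketched like this. The archive_path_by_domain helper is hypothetical and is not the routine I once used; it only shows how a different path could be derived from the same address.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical alternative to file(): organize the archive by the
# correspondent's domain instead of by proper name.
sub archive_path_by_domain {
    my ($archive, $address) = @_;
    # Pull "user" and "domain" out of a bare or bracketed address.
    my ($user, $domain) = $address =~ /([^\s<>\@]+)\@([^\s<>\@]+)/
        or return;
    return File::Spec->catfile($archive, lc($domain), lc($user));
}

print archive_path_by_domain('archive',
    'Amber Ankerholz <aankerholz@cmp.com>'), "\n";
```

On a Unix system this files Amber's mail under archive/cmp.com/aankerholz. A custom filter could compute such a path and pass it to deliver() as the mailbox name.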
I hope you find this script useful. It certainly saves me time,
energy, and sanity. I'll never worry about losing or misplacing
email again.
Martin Streicher graduated from Purdue University with a master's
degree in Computer Science. He's been a programmer, producer, and
executive producer, and is currently the Editor in Chief of Linux
Magazine and the Editorial Director for Open Source books at
Apress Books. You can reach Martin at mstreicher@linux-mag.com
or martin@apress.com.