apr2003.tar

An SQL Archive for Email and MIME Attachments

Robert Bond

Email has exploded in volume, importance, and complexity. For business or legal reasons, many organizations are archiving email and attached files. The traditional method of archiving and compressing multiple flat-text files gets the job done, but retrieval is time-consuming, very limited, and generally must be performed by the administrator.

In this article, I will describe how to build an email archive using Perl and MySQL, which can properly decode and extract attachments from even the most complex MIME messages. Assuming the access is warranted and appropriate, the archive can also be queried by users without intervention by the systems administrator. In short, the email archive becomes just another SQL data source for users, who may access the database in a number of ways: from a command-line utility, script, custom front-end, or desktop productivity application. Note that privacy concerns related to email archiving are beyond the scope of this article.

Email and MIME

In this section, I will review the structure of an email message to provide enough information for you to customize the solution to your particular requirements (if necessary). Despite its text-based format, email is constructed with an astonishing amount of engineering. The format of a message and all its parts is governed by dozens of RFCs (see the Internet Mail Consortium's site: http://www.imc.org for more information).

Email comes in two flavors: plain text and MIME. MIME stands for Multipurpose Internet Mail Extensions and is simply a way to encapsulate text and binary files so they can be sent via email. While plain-text files are always human-readable, MIME messages typically are not, because they often contain binary data encoded into fixed-length lines of ASCII text.

Whether plain text or MIME, each message consists of a header and a body, separated by a blank line. The header is a series of lines, consisting of a field name, a colon, and a value (From: "Robert Bond" <[email protected]>) terminated by a carriage return/linefeed. Certain header lines may look very simple but may in fact follow their own carefully crafted RFCs.

The header contains just one occurrence of some fields (Message-Id, To, From, Subject, and Date) while other fields may occur multiple times (e.g., the Received field, which is added at each hop during the delivery process). Many other headers, including a multitude of vendor-specific headers, can be seen in a typical message.

The body of a message may come in one of two types: plain text and encoded; MIME supports multiple encoding schemes. (We're ignoring the relatively unpopular uuencoding of message bodies, which for practical purposes has been superseded by MIME.)

MIME messages may be single-part or multi-part. A single-part MIME message is a single entity (e.g., an HTML-formatted text message (content-type: text/html), or a single JPG file (content-type: image/jpeg)). A multi-part message contains two or more entities (e.g., a text message and several attached files). MIME entities may be deeply nested and must be extracted recursively.

Multi-part messages come in a variety of flavors. One common flavor is multi-part/alternative, which indicates that the message is present in two different formats, typically plain text and HTML. No attachments per se are included, however. Multi-part/mixed messages contain two or more entities (e.g., a text message and one or more attachments).

For each part in a message, there is a content-type, also known as a MIME type, which allows the user's client software to handle the content properly. (Check whether your system has the Apache file mime.types, perhaps in /etc/, which contains a partial list.)

Most email client software allows you to specify a default action based on the content type. Although generally a Windows problem, you should be aware that malformed messages might attempt to trick email clients into doing something bad (e.g., a file with a content-type for images might actually be a malicious executable file).

Another pitfall is that some HTML-formatted spam messages (typically single-part with a content-type of text/html) contain so-called email bugs, which are links to images with specially crafted URLs that can be used to track you. If the special URL turns up in the Web server logs, it indicates that you viewed the message (and thereby validated your address for the spammer). It's best never to view such messages in a browser. (I will show how to extract and store a sanitized text-only version of all such messages; see below.)

The Solution

To build the solution, carry out the following steps:

1. Set up the project directory.

2. Set up the SQL database.

3. Set up the Perl script, fetch.pl, which can be downloaded from the Sys Admin Web site.

fetch.pl (Listing 1) fetches email from a POP3 server, processes each message, and stores the result in the SQL database. After running fetch, it's easy to test the result. Then you may want to consider how you (or others) will access the SQL database in the future; I will use the mysql command-line utility.

1. Project Directory

Create a directory and two subdirectories for the project:

[user]$ mkdir ~/email
[user]$ mkdir ~/email/extract
[user]$ mkdir ~/email/attach

The extract directory is simply a scratch directory for extracting files from a MIME message. When parsing a MIME message, it's best to store the items in the filesystem to avoid running out of RAM. The attach directory is for longer-term storage of the attachments.

2. Database

I will use the MySQL database (other databases should work with minor changes). The SQL commands (see Table 1) create a new database called "email," assign restrictive permissions, and create two tables (one for messages and one for attachments). The two tables are related in a one-to-many fashion -- one message may have zero or more items in the attachments table.

Save the SQL out to a file (e.g., email.sql) and redirect it to the mysql command-line client:

[user]$ mysql -u root -p < email.sql
3. Perl Script

The fetch.pl Perl script (Listing 1) should go in the project directory (e.g., /home/robert/email). Make this file executable. Edit the variables at the top of the file to point to your MySQL server (with the appropriate username and password). Because this script utilizes a remote POP3 server, set the appropriate variables for the POP3 account's username and password as well.

How It Works

The script connects to a POP3 server and fetches all messages. (You could easily change this to be an IMAP server or simply parse messages on standard input.) The Net::POP3 module handles the chores of connecting and querying for the number of waiting messages and the total size of the remote mailbox.

For each message, the script creates a Mail::Internet object, which makes processing the header fields a snap. The script extracts the message id from the header to see whether the message is already stored in the database.

The message is parsed by the MIME::Parser module, which is part of the MIME-tools bundle. Various items are extracted from the header (Message-Id, From, To, Subject, Date, etc.) to populate the appropriate Perl variables. If the message is MIME, the script looks at the message's MIME type, determines whether the message is single-part or multi-part, and takes appropriate action. If the message is not MIME, it is assumed to be plain text. The script does not attempt to determine whether the plain-text message might in fact be uuencoded, since such messages are increasingly rare.

MIME messages are parsed to extract all attachments into the extract directory. For each attachment, an entry is made in the attach table and the auto-incremented id is retrieved. Then the file is moved to the attach directory, taking the id as its name (to avoid filename collisions). For each file in this directory, it's possible to look up the MIME type and original filename, as well as the id of the attachment's original message.

In regard to the "From" email address, the address is processed through the Mail::Address Perl module. Addresses may take a variety of formats comprising the email address itself, along with an optional "phrase" in a variety of combinations.

Single-part messages of type "text/html" do not provide a plain text version, so the script uses the strip_html subroutine to extract a plain-text version, sanitized of possible email bugs (described above).

A message also contains one of several standardized date string formats. Unfortunately, some mail clients output improperly formatted date strings. We use the date1 column to store the literal date field as contained in the message, while date2 uses a massaged format. Another date field, recd, records when the message was fetched. You may want to use the date2 and recd columns for sorting, since date1 will not sort well.

Accessing the Archive

Run fetch.pl and check for any unexpected errors. Make sure you've set the appropriate variables as described above. Send yourself some test messages and run the script again. Then use the mysql command-line utility to ensure the messages have been properly fetched and stored. Start the client:

[user]$ mysql -u mailuser -p

Once the client starts, issue the following commands:

mysql>USE email;
mysql>SELECT count(*) FROM msg;

To scan all messages, use (or modify) the following:

SELECT id, to_addr, fr_addr, subject, date2 FROM msg ORDER BY date2;

To view a particular message, specify the id:

SELECT * FROM msg WHERE id=2897\G

The \G on the end of the statement instructs the mysql utility to present the results in a friendly vertical, rather than wide, format.

To retrieve messages along with the appropriate attachments, you need to perform an SQL join. Use (or modify) the following statement:

SELECT to_addr, fr_addr, subject, date2, attach.id,attach.filename,attach.mime_type \
 FROM msg LEFT OUTER JOIN attach ON msg.id = attach.msg_id\G

To retrieve only messages with attachments, change LEFT OUTER JOIN above to INNER JOIN.

To actually view an attached file, obtain the item's id number in the attach table and view the corresponding file in the ~/email/attach directory.

Conclusion

The solution outlined here is far more retrieval-friendly than traditional flat-file archiving, and far more scalable than a GUI-based email client. Additionally, a single, unified storage solution is often easier to back up and maintain in the long run (as compared to multiple, archived text files).

The ability to access the archive remotely, from any SQL-capable application, is another benefit that may save busy admins a great deal of time. One final benefit -- many organizations want to discard email after a certain date, and the solution outlined here reduces that task to a simple matter of crafting the appropriate SQL statement.

Robert J. Bond 3rd is the Web manager and a computer science instructor at College of Santa Fe in New Mexico. The author wishes to thank David Bond of K12 Webworks.