An
SQL Archive for Email and MIME Attachments
Robert Bond
Email has exploded in volume, importance, and complexity. For
business or legal reasons, many organizations are archiving email
and attached files. The traditional method of archiving and compressing
multiple flat-text files gets the job done, but retrieval is time-consuming,
very limited, and generally must be performed by the administrator.
In this article, I will describe how to build an email archive
using Perl and MySQL, which can properly decode and extract attachments
from even the most complex MIME messages. Assuming the access is
warranted and appropriate, the archive can also be queried by users
without intervention by the systems administrator. In short, the
email archive becomes just another SQL data source for users, who
may access the database in a number of ways: from a command-line
utility, script, custom front-end, or desktop productivity application.
Note that privacy concerns related to email archiving are beyond
the scope of this article.
Email and MIME
In this section, I will review the structure of an email message
to provide enough information for you to customize the solution
to your particular requirements (if necessary). Despite its text-based
format, email is constructed with an astonishing amount of engineering.
The format of a message and all its parts is governed by dozens
of RFCs (see the Internet Mail Consortium's site: http://www.imc.org
for more information).
Email comes in two flavors: plain text and MIME. MIME stands for
Multipurpose Internet Mail Extensions and is simply a way to encapsulate
text and binary files so they can be sent via email. While plain-text
files are always human-readable, MIME messages typically are not,
because they often contain binary data encoded into fixed-length
lines of ASCII text.
Whether plain text or MIME, each message consists of a header
and a body, separated by a blank line. The header is a series of
lines, consisting of a field name, a colon, and a value (From: "Robert
Bond" <rbond@csf.edu>) terminated by a carriage return/linefeed.
Certain header lines may look very simple but may in fact follow
their own carefully crafted RFCs.
The header contains just one occurrence of some fields (Message-Id,
To, From, Subject, and Date) while other fields may occur multiple
times (e.g., the Received field, which is added at each hop during
the delivery process). Many other headers, including a multitude
of vendor-specific headers, can be seen in a typical message.
The body of a message may come in one of two types: plain text
and encoded; MIME supports multiple encoding schemes. (We're ignoring
the relatively unpopular uuencoding of message bodies, which for
practical purposes has been superseded by MIME.)
MIME messages may be single-part or multi-part. A single-part
MIME message is a single entity (e.g., an HTML-formatted text message
(content-type: text/html), or a single JPG file (content-type: image/jpeg)).
A multi-part message contains two or more entities (e.g., a text
message and several attached files). MIME entities may be deeply
nested and must be extracted recursively.
Multi-part messages come in a variety of flavors. One common flavor
is multi-part/alternative, which indicates that the message is present
in two different formats, typically plain text and HTML. No attachments
per se are included, however. Multi-part/mixed messages contain
two or more entities (e.g., a text message and one or more attachments).
For each part in a message, there is a content-type, also known
as a MIME type, which allows the user's client software to handle
the content properly. (Check whether your system has the Apache
file mime.types, perhaps in /etc/, which contains a partial list.)
Most email client software allows you to specify a default action
based on the content type. Although generally a Windows problem,
you should be aware that malformed messages might attempt to trick
email clients into doing something bad (e.g., a file with a content-type
for images might actually be a malicious executable file).
Another pitfall is that some HTML-formatted spam messages (typically
single-part with a content-type of text/html) contain so-called
email bugs, which are links to images with specially crafted URLs
that can be used to track you. If the special URL turns up in the
Web server logs, it indicates that you viewed the message (and thereby
validated your address for the spammer). It's best never to view
such messages in a browser. (I will show how to extract and store
a sanitized text-only version of all such messages; see below.)
The Solution
To build the solution, carry out the following steps:
1. Set up the project directory.
2. Set up the SQL database.
3. Set up the Perl script, fetch.pl, which can be downloaded from
the Sys Admin Web site.
fetch.pl (Listing 1) fetches email from a POP3 server, processes
each message, and stores the result in the SQL database. After running
fetch, it's easy to test the result. Then you may want to consider
how you (or others) will access the SQL database in the future;
I will use the mysql command-line utility.
1. Project Directory
Create a directory and two subdirectories for the project:
[user]$ mkdir ~/email
[user]$ mkdir ~/email/extract
[user]$ mkdir ~/email/attach
The extract directory is simply a scratch directory for extracting
files from a MIME message. When parsing a MIME message, it's best
to store the items in the filesystem to avoid running out of RAM.
The attach directory is for longer-term storage of the attachments.
2. Database
I will use the MySQL database (other databases should work with
minor changes). The SQL commands (see Table 1) create a new database
called "email," assign restrictive permissions, and create two tables
(one for messages and one for attachments). The two tables are related
in a one-to-many fashion -- one message may have zero or more items
in the attachments table.
Save the SQL out to a file (e.g., email.sql) and redirect it to
the mysql command-line client:
[user]$ mysql -u root -p < email.sql
3. Perl Script
The fetch.pl Perl script (Listing 1) should go in the project directory
(e.g., /home/robert/email). Make this file executable. Edit the variables
at the top of the file to point to your MySQL server (with the appropriate
username and password). Because this script utilizes a remote POP3
server, set the appropriate variables for the POP3 account's username
and password as well.
How It Works
The script connects to a POP3 server and fetches all messages.
(You could easily change this to be an IMAP server or simply parse
messages on standard input.) The Net::POP3 module handles the chores
of connecting and querying for the number of waiting messages and
the total size of the remote mailbox.
For each message, the script creates a Mail::Internet object,
which makes processing the header fields a snap. The script extracts
the message id from the header to see whether the message is already
stored in the database.
The message is parsed by the MIME::Parser module, which is part
of the MIME-tools bundle. Various items are extracted from the header
(Message-Id, From, To, Subject, Date, etc.) to populate the appropriate
Perl variables. If the message is MIME, the script looks at the
message's MIME type, determines whether the message is single-part
or multi-part, and takes appropriate action. If the message is not
MIME, it is assumed to be plain text. The script does not attempt
to determine whether the plain-text message might in fact be uuencoded,
since such messages are increasingly rare.
MIME messages are parsed to extract all attachments into the extract
directory. For each attachment, an entry is made in the attach table
and the auto-incremented id is retrieved. Then the file is moved
to the attach directory, taking the id as its name (to avoid filename
collisions). For each file in this directory, it's possible to look
up the MIME type and original filename, as well as the id of the
attachment's original message.
In regard to the "From" email address, the address is processed
through the Mail::Address Perl module. Addresses may take a variety
of formats comprising the email address itself, along with an optional
"phrase" in a variety of combinations.
Single-part messages of type "text/html" do not provide a plain
text version, so the script uses the strip_html subroutine to extract
a plain-text version, sanitized of possible email bugs (described
above).
A message also contains one of several standardized date string
formats. Unfortunately, some mail clients output improperly formatted
date strings. We use the date1 column to store the literal date
field as contained in the message, while date2 uses a massaged format.
Another date field, recd, records when the message was fetched.
You may want to use the date2 and recd columns for sorting, since
date1 will not sort well.
Accessing the Archive
Run fetch.pl and check for any unexpected errors. Make sure you've
set the appropriate variables as described above. Send yourself
some test messages and run the script again. Then use the mysql
command-line utility to ensure the messages have been properly fetched
and stored. Start the client:
[user]$ mysql -u mailuser -p
Once the client starts, issue the following commands:
mysql>USE email;
mysql>SELECT count(*) FROM msg;
To scan all messages, use (or modify) the following:
SELECT id, to_addr, fr_addr, subject, date2 FROM msg ORDER BY date2;
To view a particular message, specify the id:
SELECT * FROM msg WHERE id=2897\G
The \G on the end of the statement instructs the mysql utility to
present the results in a friendly vertical, rather than wide, format.
To retrieve messages along with the appropriate attachments, you
need to perform an SQL join. Use (or modify) the following statement:
SELECT to_addr, fr_addr, subject, date2, attach.id,attach.filename,attach.mime_type \
FROM msg LEFT OUTER JOIN attach ON msg.id = attach.msg_id\G
To retrieve only messages with attachments, change LEFT OUTER JOIN
above to INNER JOIN.
To actually view an attached file, obtain the item's id number
in the attach table and view the corresponding file in the ~/email/attach
directory.
Conclusion
The solution outlined here is far more retrieval-friendly
than traditional flat-file archiving, and far more scalable than
a GUI-based email client. Additionally, a single, unified storage
solution is often easier to back up and maintain in the long run
(as compared to multiple, archived text files).
The ability to access the archive remotely, from any SQL-capable
application, is another benefit that may save busy admins a great
deal of time. One final benefit -- many organizations want to discard
email after a certain date, and the solution outlined here reduces
that task to a simple matter of crafting the appropriate SQL statement.
Robert J. Bond 3rd is the Web manager and a computer science
instructor at College of Santa Fe in New Mexico. The author wishes
to thank David Bond of K12 Webworks.
|