MIME Is a Terrible Thing to Waste
Randal L. Schwartz
The Multipurpose Internet Mail Extensions (MIME) standard has
been around for nearly a decade, but has only recently become popular.
This is probably because of the higher bandwidth data connections
available for email, as well as the advent of the Web, and the desktop
horsepower required to make things that are fancier than plain text
(or should that be text/plain?).
MIME is both a blessing and a curse, in my opinion. It's
cool that I can send a PDF or a JPEG to a friend as an attachment,
and know that I don't need to figure out if they have a uu-decoder
or a shell to extract from a sharchive. It's bad, however,
because a lot of mail that is really just plain old text is being
sent as HTML mail or the very popular "multipart/alternative"
mail.
Why is this bad? Well, for one, I don't think Tim Berners-Lee
or any of the chaps involved with the creation of the Web envisioned
HTML as a medium for email. HTML is about hyperlinks and structured
text, readable in an interactive environment. Email is a simple
message, usually conversational, and generally with no need for
markup or links.
So most of the use of HTML mail these days is by the "push
advertisers", or as we more often call them, "spammers".
It's a great way to shove a flashy, sizzly, no-content ad for
fax paper or a trip to Central America into our email boxes, with
enough bouncy clicky things that we'll probably respond.
A more serious problem with HTML email is that it's a great
carrier of JavaScript viruses. Countless times I've read about
people getting nailed because of code embedded in HTML email. Thus,
it's a security threat to organizations.
That's why I think mail should always be plain text, unless
both parties agree otherwise. Go ahead, shoot me, but there's
my opinion.
Apparently, my opinion is not shared by the makers of some
of the so-called mail clients, like Outlook Express or Netscape
Communicator. Out of the box, every message is sent as the multipart/alternative
MIME type, with a text version and an HTML version. Theoretically,
if you have a MIME-savvy mail client, you receive such mail as a
nice HTML-formatted window. If not, you get gibberish for the second
half of your text screen. And they call that communicating.
Sure, you can turn it off. Perhaps. But read on, and you'll
see where this is going.
Now, here's the problem. I run a low-volume mailing list
for a management class I'm taking... nothing fancy... just
a rebroadcaster called from procmail. I started to see a
lot of these HTML forked messages, and got annoyed when some of
the replies also quoted part of the MIME wrapper markup, making
it hopeless to read in any normal sense.
So I put a filter in the mail forwarder to kick back anything
that included either boundary or html in the content-type
mail header... a sure sign that someone was sending something other
than plain text. Yes, right after inserting that filter, the worst
offenders were unable to use my mailing list until they figured
out how to turn that HTML fork off, and then all was good.
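The kick-back test described above can be sketched in plain Perl. This is only an illustration, not the filter actually used on the list: the helper name looks_non_plain and the sample messages are made up, and the sketch unfolds continued header lines itself so that a boundary parameter on a continuation line is still seen.

```perl
use strict;
use warnings;

# Hypothetical helper: true (1) if the header block has a Content-Type
# line mentioning "html" or "boundary", false (0) otherwise.
sub looks_non_plain {
    my ($message) = @_;
    # The header block ends at the first blank line.
    my ($header) = split /\r?\n\r?\n/, $message, 2;
    $header = '' unless defined $header;
    # Join folded (continued) header lines onto the line they continue.
    $header =~ s/\r?\n[ \t]+/ /g;
    return $header =~ /^Content-Type:.*(?:html|boundary)/im ? 1 : 0;
}

# Made-up sample messages for illustration.
my $plain  = "From: a\@example.com\nSubject: hi\n\nThe word html in a body is fine.\n";
my $forked = "From: b\@example.com\n"
           . "Content-Type: multipart/alternative;\n\tboundary=\"xyz\"\n\n--xyz\n";

print looks_non_plain($plain)  ? "bounce\n" : "accept\n";
print looks_non_plain($forked) ? "bounce\n" : "accept\n";
```

Only the header block is inspected, so a message that merely talks about HTML in its body is not kicked back.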
In this most recent group of users, we had a couple of people
who had installed Outlook with Windows 2000 (not Outlook Express).
Even after I had called in favors from my friends who understand
Redmond-ware better than I do, they still couldn't figure out
how to turn off the durn HTML.
So what to do? I wasn't about to relax my policy, having
been very happy with the result achieved with the previous group.
And one of them had started painfully copying all the addresses
directly into their address book, a mess for maintenance, and trouble
for the Web-based archive for the mailing list.
Then I thought, "Hey, all I need is a small Perl filter that
recognizes this so-called email and strips the HTML fork!"
And that's what I decided to build.
Luckily, we've got the very nice MIME::Tools package
in the CPAN to do most of the hard work, although I admit it took
me a few false starts to get the project done.
First, let's hack out some code to take a brain-damaged email
on standard input, writing out a clean piece of email on standard
output (untouched if it's not the right format). We'll
start with three lines that begin nearly every program I write:
#!/usr/bin/perl -w
use strict;
$|++;
This enables warnings, turns on the compiler restrictions (no symbolic
references, undeclared variables, or barewords), and unbuffers standard
output. Next, we grab the "envelope-from" from the input:
my $envelope = <STDIN>;
This "envelope-from" looks like:
From merlyn Wed Jan 24 11:37:17 2001
and tells the next mailer where this mail came from. It's actually
not in the shape of an RFC 822 header, because it's a "meta-header",
and therefore shouldn't be parsed along with the rest of the
MIME information. We'll grab it here, and print it back out when
we're done.
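For experimenting with a message already held in a string, the same idea can be sketched as a small helper. The name split_envelope is hypothetical, and the sketch assumes the envelope line, if present, is the very first line and starts with "From " (with a space, not a colon).

```perl
use strict;
use warnings;

# Hypothetical helper: split a raw message into its leading "From "
# envelope line (or '' if there is none) and the rest of the message.
sub split_envelope {
    my ($raw) = @_;
    if ($raw =~ s/\A(From [^\n]*\n)//) {
        return ($1, $raw);
    }
    return ('', $raw);
}

my ($envelope, $rest) = split_envelope(
    "From merlyn Wed Jan 24 11:37:17 2001\nSubject: x\n\nbody\n");
print "envelope: $envelope";
print "rest starts with: ", (split /\n/, $rest)[0], "\n";
```

An ordinary "From:" header is left alone, since it does not match the "From " prefix.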
Next, we'll pull in two of the modules from the MIME::Tools
distribution:
use MIME::Parser;
use MIME::Entity;
And then we'll create a MIME::Parser object to read the
input:
my $parser = MIME::Parser->new;
$parser->output_to_core(1);
$parser->tmp_to_core(1);
Here, I'm creating a MIME parser that keeps everything in core,
including any temporary files. Of course, this will break down if
someone sends me a 200-MB AVI file, but I can catch that at the step
before this anyway.
Now it's time to read standard input:
my $ent = $parser->parse(\*STDIN);
The $parser object reads the email message from standard input
into memory. If there's any failure here (bad input, bad format),
the parser will die. We'll call this program so that if it fails
in any way, the original message is kept, so the death is not an issue.
Now for the cool part. I can use the methods available on the
message (a MIME::Entity object) to probe into the structure.
One of the first things I did was simply to replace the rest of the
program with:
$ent->dump_skeleton(\*STDERR); exit 1;
This caused the program to show the structure of the message, so I could
figure out what an HTML-forked mail message looks like, compared to
everything else. After I ran that on a few sample messages, I removed
that line and replaced it with this:
if ($ent->effective_type eq "multipart/alternative"
and $ent->parts == 2
and $ent->parts(0)->effective_type eq "text/plain"
and $ent->parts(1)->effective_type eq "text/html") {
Whoa. Lots of stuff here. Let's go slow. First, I'm seeing
if the top-level structure is a multipart/alternative. A MIME
document is hierarchically structured (attachments can have attachments,
and so on), so we're looking at the root here. If that's
good, then we also make sure there are two alternatives, and that
the first one is a plain text entry, and the second one is HTML. If
so, it's likely to be the evilness that I'm trying to fix.
(There's a very small chance that the text and HTML parts are
radically different and unrelated, but if so, it's mistagged
as multipart/alternative rather than the more proper multipart/mixed
type.)
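To try the test away from live mail, it can be wrapped in a helper and run against entities built in memory. This is a sketch only: the name is_html_fork and the sample address are made up, and it assumes MIME::Tools is installed.

```perl
use strict;
use warnings;
use MIME::Entity;

# Hypothetical helper wrapping the four-part test from the text.
sub is_html_fork {
    my ($ent) = @_;
    return 0 unless $ent->effective_type eq 'multipart/alternative';
    my @parts = $ent->parts;
    return 0 unless @parts == 2;
    my $ok = $parts[0]->effective_type eq 'text/plain'
          && $parts[1]->effective_type eq 'text/html';
    return $ok ? 1 : 0;
}

# A made-up forked message: text first, then HTML.
my $forked = MIME::Entity->build(
    Type    => 'multipart/alternative',
    From    => 'someone@example.com',
    Subject => 'test',
);
$forked->attach(Type => 'text/plain', Data => "Hello\n");
$forked->attach(Type => 'text/html',  Data => "<p>Hello</p>\n");

# A made-up plain message for comparison.
my $plain = MIME::Entity->build(Type => 'text/plain', Data => "Hello\n");

print "forked: ", is_html_fork($forked), "\n";
print "plain:  ", is_html_fork($plain),  "\n";
```

Checking the part count before indexing means a multipart/alternative with one or three parts is simply passed through untouched.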
So the next step is to extract the text part as its own entity,
and then hoist that part to become the entire message. There may
be an easier way of doing this, but here's what I did. First,
make a new entity from the body of the old text one:
my $newent = MIME::Entity->build(Data =>
$ent->parts(0)->body_as_string .
"\n\n[[HTML alternate version deleted]]\n");
Notice that I added a little message on the end to let people know
magic has happened. I could have also inserted it into the mail header
instead, but I wanted it to be prominent.
Next, we toss all the parts except for this one:
$ent->parts([$newent]);
And then, we fold it from a multipart document to a single-part document
(where MIME is not even mentioned, and we have no boundary markers):
$ent->make_singlepart;
And finally, some of the headers are now out of sync, so it's
time to clean them up as best we can:
$ent->sync_headers(Length => 'COMPUTE', Nonstandard => 'ERASE');
}
And that's it. If it met the ugly-message criteria, we now have
a new message in $ent; otherwise, we have the original. Time
to dump it out. First the envelope:
print $envelope;
And now the message itself:
$ent->print;
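Put together, the steps above can be sketched as one function that works on a string rather than on standard input, which makes it easier to try on saved messages. This is a rearrangement for illustration, not the program as it was deployed: the name strip_html_fork is hypothetical, parse_data stands in for parse(\*STDIN), and as_string stands in for print.

```perl
use strict;
use warnings;
use MIME::Parser;
use MIME::Entity;

# Hypothetical wrapper: takes a raw message (optionally starting with a
# "From " envelope line) and returns the possibly cleaned-up message.
sub strip_html_fork {
    my ($raw) = @_;

    # Peel off the envelope line so the MIME parser never sees it.
    my $envelope = $raw =~ s/\A(From [^\n]*\n)// ? $1 : '';

    # Parse entirely in memory; the parser dies on unusable input.
    my $parser = MIME::Parser->new;
    $parser->output_to_core(1);
    my $ent = $parser->parse_data($raw);

    if (    $ent->effective_type eq 'multipart/alternative'
        and $ent->parts == 2
        and $ent->parts(0)->effective_type eq 'text/plain'
        and $ent->parts(1)->effective_type eq 'text/html') {
        # Rebuild the text part with a notice, then hoist it to the top.
        my $newent = MIME::Entity->build(Data =>
            $ent->parts(0)->body_as_string .
            "\n\n[[HTML alternate version deleted]]\n");
        $ent->parts([$newent]);
        $ent->make_singlepart;
        $ent->sync_headers(Length => 'COMPUTE', Nonstandard => 'ERASE');
    }

    # Envelope first, then the (possibly rewritten) message.
    return $envelope . $ent->as_string;
}
```

In a filter, the function could be fed everything on standard input with something like print strip_html_fork(do { local $/; <STDIN> }); and a message that does not match the pattern comes back with its structure unchanged.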
The next step was to hook it into procmail delivery for the
mailing list. Ahead of the step that does the actual sending, I added
one additional rule:
:0 fw
* ^Content-type:.*boundary
| $HOME/lib/Strip-HTML-fork
where $HOME/lib/Strip-HTML-fork contains the program above.
If the filter is able to do its magic, then the next procmail
rule starting with:
:0
* ^Content-type:.*(html|boundary)
{
.. bouncing logic not shown ..
}
no longer triggers, and the mail goes through! Success.
Well, I hope I've convinced you that a MIME is a terrible
thing to waste, but once wasted, we can fight back properly. Until
next time, enjoy!
Randal L. Schwartz is a two-decade veteran of the software
industry -- skilled in software design, system administration,
security, technical writing, and training. He has coauthored the
"must-have" standards: Programming Perl, Learning
Perl, Learning Perl for Win32 Systems, and Effective
Perl Programming, as well as writing regular columns for WebTechniques
and Unix Review magazines. He's also a frequent contributor
to the Perl newsgroups, and has moderated comp.lang.perl.announce
since its inception. Since 1985, Randal has owned and operated Stonehenge
Consulting Services, Inc.