Identifying Spam Events
Jason A. Richards
The number of spam events is increasing every month, and because
any single spam event targets so many recipients, even a modest
percentage reduction yields a noticeable benefit. The tricky part
for administrators is determining which events are unsolicited and
which are requested and permissible. As an administrator, you must
understand what is normal for your mail servers and what is not.
If you currently have no idea, then your first step may well be to
set up some MRTG graphs. If you are lucky, your mail server has an
SNMP agent; if not, you just need to write a few scripts that do a
lot of counting.
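For example, a first cut at that counting can be a short Perl sketch
that tallies enqueued messages per hour, so normal traffic levels
become visible over time. The log path and the "E" enqueue field
below are assumptions based on the SunONE-style format shown later
in this article; adjust both for your MTA:

#!/usr/bin/perl -w
# Baseline sketch: tally enqueued messages per hour. The path and the
# "E" (enqueue) action field are assumptions; adjust for your MTA.
use strict;
my %per_hour;
open(LOG, "</apps/log/imta/mail.log_current") || die "Couldn't open log: $!";
while (<LOG>) {
    # Lines look like "29-May-2003 10:17:08.70 ... E ..."; key on date+hour.
    $per_hour{"$1 $2:00"}++ if m/^(\S+) (\d{2}):\d{2}:\d{2}\S* .* E /;
}
close(LOG);
print "$_ $per_hour{$_}\n" foreach sort keys %per_hour;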
The most rudimentary way to identify spam events (once you've
established a baseline for your system) is to parse the mail logs.
Every mail server, whether it's sendmail, postfix, qmail, or SunONE
Messaging Server, has the ability (and should always be configured)
to write logs for all SMTP traffic. The goal, then, is to parse these
logs in search of unacceptable commonalities. Whether you
administer a single system used only for your company's employees,
or you manage multiple servers handling millions of messages for
thousands of Internet subscribers, the most sensible way to undertake
this task is to write a script to reveal the pertinent information.
Differentiating the Good from the Bad
The function of this script, then, should be to extract the repetitive
information from our mail logs and display it in a fashion that
makes sense. The three details that we care about from the logs
are the source email address, the source domain, and the source
IP address. By classifying and counting this information, we can
see where the email is coming from and then determine whether that
email is good or bad. Good email is defined as email that the recipients
signed up for intentionally. Take BugTRAQ, for instance. Lots of
people want to receive BugTRAQ, so if we deem it bad email and attempt
to block or thwart it, we will be upsetting a great number of people.
Bad email, then, is defined as email that is clearly unsolicited
and detrimental to system performance. Note the requirement that it
be detrimental to the system: plenty of the email floating around
the Internet is unsolicited, but if we target individual emails
that do not affect the performance of the system, we are wasting
valuable cycles.
The Script
There are two prerequisites for the script below. The first is
Perl. Since most environments compress and rotate their log files,
the second prerequisite is the Compress::Zlib Perl module. Perl
can be obtained from any number of locations, depending on your
system architecture, and the module can be installed via CPAN.
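For instance, the module can typically be installed from the CPAN
shell (the exact invocation may vary with your local setup):

perl -MCPAN -e 'install Compress::Zlib'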
With the tools in hand, we can begin. As with any script, we first need
to define some variables to make it portable between systems, and
just for general troubleshooting purposes:
#!/usr/bin/perl -w
use strict;
use Compress::Zlib;
my $base_dir = "/apps/log/imta"; # Log directory
my @log_files; # Log file names
my %file_mod_time; # Stores log file mtime
my $rep_dir = "/tmp"; # Report directory
my $rep_file = "spam_events.txt"; # Report filename
my %senders_by_ip; # IP address hash
my %senders_by_em; # email address hash
my %senders_by_dm; # email domain hash
my $start_time = 86400; # The default start time is 24 hours ago
my $end_time = 0; # The default end time is 0 seconds ago
my $min_rcpts = 30; # Minimum number of matches to include
# in the report
If we truly are being proactive, then we need a time frame in which
to search through the logs. If we believe the event began within the
past hour, there's no need to search a day's worth of logs, so we
should provide this ability via the command line and account for
those options within the script. By default, the script searches over
the past 24 hours (86,400 seconds), so that we can see any events
that may have occurred during that time:
if ($#ARGV == -1) {
    print "Searching the last " . ($start_time-$end_time)/3600 . " hours.\n";
} elsif ($#ARGV == 1) {
    $start_time = $ARGV[0]; # Allow the user to specify the start time
    $end_time = $ARGV[1];   # Allow the user to specify the end time
    print "Searching the last " . ($start_time-$end_time)/3600 . " hours.\n";
} else { # Otherwise, print the usage
    print "Usage: $0";
    print "\n    Runs over previous 24 hours.\n";
    print "Usage: $0 <begin time> <end time>";
    print "\n    Runs over time, specified in seconds.\n";
    exit 0;
}
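For example, assuming the script is saved as find_spam.pl (a name
chosen here purely for illustration), a search of just the past hour
begins like this:

$ ./find_spam.pl 3600 0
Searching the last 1 hours.

With no arguments, the script falls back to the full 24-hour window.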
Next, we read the log directory and determine which log files are
available. At this point, it's necessary to know the naming convention
for your logs. For instance, our logs are of the format mail.log_current
and mail.log_current.[0-9].gz. So by matching on mail.log_current.*,
I can be sure to get all of them. We also want to exclude files that
fall outside of our time criteria. Once we've identified the appropriate
files to search, we store the mtime in the %file_mod_time hash in
order to search the files in the correct order later on:
opendir(BASEDIR, $base_dir) ||
    die "Couldn't open $base_dir for reading: $!";
@log_files = readdir(BASEDIR);
closedir(BASEDIR);
foreach my $file (@log_files) {
    my $mtime = (stat("$base_dir/$file"))[9];
    # Note the escaped dots, so that only the literal
    # mail.log_current* names match.
    if (( $file =~ m/^mail\.log_current.*$/ ) &&
        ( $mtime >= time - $start_time ) &&
        ( $mtime <= time - $end_time )
    ) {
        $file_mod_time{$file} = $mtime;
    }
}
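If your MTA logs through syslog instead, only the filename test needs
to change. A sketch for typical maillog rotation names (maillog,
maillog.0, maillog.0.gz; these names are an assumption and vary by
system):

foreach my $file (@log_files) {
    my $mtime = (stat("$base_dir/$file"))[9];
    # Hypothetical syslog-style rotation names; adjust to taste.
    if (( $file =~ m/^maillog(?:\.\d+)?(?:\.gz)?$/ ) &&
        ( $mtime >= time - $start_time ) &&
        ( $mtime <= time - $end_time )
    ) {
        $file_mod_time{$file} = $mtime;
    }
}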
Once you have the list of files, you can open each file in order from
oldest to newest, parse each line, and begin the tallies. We care
about the order of the files because the report includes a field detailing
the last file in which the entry appears. Next, you will need to formulate
a regular expression to match your particular log format. An example
log entry from my mail environment (SunONE Messaging Server) and its
field definitions are as follows:
29-May-2003 10:17:08.70 15e1.221e.23858 tcp_local tcp_intranet E 2
jrichards@gci.com rfc822;jasonr@gci.net @ems-1.gci.net:jasonr@gci.net
/apps/spool/queue1/tcp_intranet/013/ZZ0HFN0049JUSKBL.00
<48ACA9FE7C30D411AF6D009027BA4E170935A3DC@rolla.gci.com> gci.com
(daytona.gci.com [205.140.80.57])
Date: 29-May-2003
Time: 10:17:08.70
MessageID: 15e1.221e.23858
SrcChannel: tcp_local
DestChannel: tcp_intranet
Action: E for Enqueue
Bytes: 2
SrcEmail: jrichards@gci.com
DestEmail: rfc822;jasonr@gci.net
RouteToDest: @ems-1.gci.net:jasonr@gci.net
PathToMsg: /apps/spool/queue1/tcp_intranet/013/ZZ0HFN0049JUSKBL.00
SrcMessageID: 48ACA9FE7C30D411AF6D009027BA4E170935A3DC@rolla.gci.com
SourceHost: gci.com
SourceHostDNS: daytona.gci.com
SourceHostIP: 205.140.80.57
As mentioned before, of this information we only want to report the
SrcEmail, the domain part of the SrcEmail, and the SourceHostIP. Now
that we understand the format of the log messages, we can write a
regular expression to match the fields needed for the report. You'll
see in the regular expression below that I substituted as many fields
as possible with \S+ (one or more non-spaces) to simplify the expression.
Some fields must be static, though, such as the SrcChannel and DestChannel,
to pull out the exact type of lines we're looking for. (Note that the
sample entry above happens to show an inbound message, tcp_local to
tcp_intranet, while the expression below pins the channels in the
opposite order to select outbound mail, tcp_intranet to tcp_local,
which is where spam injected from our own network appears.) After
incrementing the associated counters, we can close the current file
and continue with the next:
foreach my $file (sort bymtime keys %file_mod_time) {
    my $gz = gzopen("$base_dir/$file", "rb") ||
        die "Couldn't open $base_dir/$file: $gzerrno\n";
    while ($gz->gzreadline($_) > 0) {
        # The pattern match in the "if" clause below is wrapped across
        # three lines to fit this publication. It must be coded on a
        # single line for the pattern match to work properly.
        if (m!^\S+ \S+ \S+ tcp_intranet\s+tcp_local\s+E \S+ (.*?\@(.*?))
            rfc822\;\S+ \S+ \S+ \S+ \S+ \S+
            \[(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\].*!) {
            my $log_sender = $1;
            my $log_domain = $2;
            my $log_ipaddr = $3;
            $senders_by_ip{$log_ipaddr}{numrcpts}++;
            $senders_by_ip{$log_ipaddr}{lastfile} = $file;
            $senders_by_em{$log_sender}{numrcpts}++;
            $senders_by_em{$log_sender}{lastfile} = $file;
            $senders_by_dm{$log_domain}{numrcpts}++;
            $senders_by_dm{$log_domain}{lastfile} = $file;
        }
    }
    $gz->gzclose();
}
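Before pointing the expression at real logs, it's worth sanity-checking
it and its capture groups against a fabricated outbound entry. A
minimal, self-contained test (the log line below is made up, using
reserved example hostnames and addresses):

#!/usr/bin/perl -w
use strict;
# A fabricated tcp_intranet -> tcp_local entry in the same format.
my $line = '29-May-2003 10:17:08.70 15e1.221e.23858 tcp_intranet ' .
    'tcp_local E 2 spammer@example.org rfc822;victim@example.net ' .
    '@mx.example.net:victim@example.net ' .
    '/apps/spool/queue1/tcp_local/013/ZZ0HFN0049JUSKBM.00 ' .
    '<fake@relay.example.org> example.org ' .
    '(cable-modem.example.org [192.0.2.1])';
# As in the script itself, the pattern must be on a single line.
if ($line =~ m!^\S+ \S+ \S+ tcp_intranet\s+tcp_local\s+E \S+ (.*?\@(.*?)) rfc822\;\S+ \S+ \S+ \S+ \S+ \S+ \[(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\].*!) {
    print "sender=$1 domain=$2 ip=$3\n";
}

Running it should print sender=spammer@example.org domain=example.org
ip=192.0.2.1; if it prints nothing, the expression doesn't fit your
log format.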
To gain an accurate picture of the mail traffic, we need to save this
information to a report that we can archive so that we can start developing
trends. Therefore, we must open up a report file and output the information.
We have three different hashes to sort through, so we'll proceed through
them one at a time. As defined above, we'll only print entries
that were matched at least 30 times. This limits the report and
prevents an overwhelming number of lines. However, you may want
to lower this threshold if you believe that the event is distributed
over hundreds or thousands of computers, each of which may only be
sending 10 or 20 messages at a time:
open(REPORT, ">$rep_dir/$rep_file") ||
    die "Couldn't open $rep_dir/$rep_file for writing: $!";
foreach my $key (keys %senders_by_ip) {
    if ($senders_by_ip{$key}{numrcpts} >= $min_rcpts) {
        print REPORT "$senders_by_ip{$key}{numrcpts}:";
        print REPORT "$key:$senders_by_ip{$key}{lastfile}\n";
    }
}
foreach my $key (keys %senders_by_em) {
    if ($senders_by_em{$key}{numrcpts} >= $min_rcpts) {
        print REPORT "$senders_by_em{$key}{numrcpts}:";
        print REPORT "$key:$senders_by_em{$key}{lastfile}\n";
    }
}
foreach my $key (keys %senders_by_dm) {
    if ($senders_by_dm{$key}{numrcpts} >= $min_rcpts) {
        print REPORT "$senders_by_dm{$key}{numrcpts}:";
        print REPORT "$key:$senders_by_dm{$key}{lastfile}\n";
    }
}
close(REPORT);
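Incidentally, since the three loops are identical except for the hash
they walk, they could be folded into a single loop over hash
references; a sketch of that variation:

foreach my $href (\%senders_by_ip, \%senders_by_em, \%senders_by_dm) {
    foreach my $key (keys %$href) {
        if ($href->{$key}{numrcpts} >= $min_rcpts) {
            print REPORT "$href->{$key}{numrcpts}:";
            print REPORT "$key:$href->{$key}{lastfile}\n";
        }
    }
}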
The final step in this process is to sort the report in reverse
numeric order so that the most frequent entries appear at the top.
And, as is standard practice, our function for sorting the files by
modification time is included at the end of the script:
system('/usr/bin/sort', '-rno', "$rep_dir/$rep_file", "$rep_dir/$rep_file");

sub bymtime {
    return $file_mod_time{$a} <=> $file_mod_time{$b};
}
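If /usr/bin/sort lives elsewhere on your platform, or you'd rather
avoid the external dependency, the same reverse numeric sort can be
done in Perl itself. A minimal sketch, assuming the report fits
comfortably in memory:

# Replaces the system() call above: slurp the report back in, sort
# descending on the leading count field, and rewrite the file.
open(REPORT, "<$rep_dir/$rep_file") || die "Couldn't reread report: $!";
my @lines = <REPORT>;
close(REPORT);
open(REPORT, ">$rep_dir/$rep_file") || die "Couldn't rewrite report: $!";
print REPORT sort { ($b =~ /^(\d+)/)[0] <=> ($a =~ /^(\d+)/)[0] } @lines;
close(REPORT);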
Understanding the Report
Once the script is ready, check its syntax by running Perl with
the -c option. Note that you'll probably have to tweak the
line-matching regular expression to perfect it. Once it truly is
ready, run it and view the report. Below is a sanitized excerpt
from a report we used to identify an actual spammer, one we caught
early enough to prevent an outage:
41999:gci.net:mail.log_current
16081:a.b.c.d:mail.log_current.1.gz
15925:yahoo.com:mail.log_current
4687:alaska.com:mail.log_current
4282:ak.net:mail.log_current
3491:e.f.g.h:mail.log_current
3092:legit_customer@gci.net:mail.log_current.1.gz
3088:i.j.k.l:mail.log_current.1.gz
933:alaskalife.net:mail.log_current
735:m.n.o.p:mail.log_current
In this report, we eliminated our primary domains where we expect
to see a lot of mail (gci.net, alaska.com, ak.net, alaskalife.net).
This immediately indicated to us that the spam was originating from
the IP address a.b.c.d. Under no circumstances should a single customer
be sending 20% of our mail traffic (16,081 messages when our largest
domain only sends 41,999). You can also see that almost the same number
of messages claimed to originate from yahoo.com (15,925), so it's likely
that the spammer was using sender addresses from that domain.
is that the last mail log that matched the IP address was mail.log_current.1.gz,
which indicates that the event has paused since the last log rotation.
It turns out that the culprit was a cable modem subscriber who had
configured their mail server to relay all mail through our platform
but had also left it configured as an open relay.
After shutting down the cable modem and contacting the customer
to close that hole, we were back in service and haven't seen the
spammer since. As for the rest of the report, the remaining top
ten entries are known and customary so they can be ignored. This
goes to show that it will take a few days, if not a few weeks, to
become familiar with the standard flows of traffic through your
platform before you can easily identify spam events.
Conclusion
There you have it. With a little work and some uninterrupted scripting
time, you can join the growing crowd of proactive mail administrators
and maybe save yourself some headaches down the line.
Jason A. Richards is a systems engineer for General Communication,
Inc., the largest ISP in Alaska. He has been employed in systems
administration for five years. During his off hours he pursues a
bachelor's degree in business administration. He can be reached
at: fusconed@slackershaven.org.