
Identifying Spam Events

Jason A. Richards

The number of spam events is increasing every month. Because any single spam event reaches so many recipients, even a small percentage reduction yields a noticeable benefit. The tricky part for administrators is determining which events are unsolicited and which are requested and permissible. As an administrator, you must understand what is normal for your mail servers and what is not. If you currently have no idea, then your first step may well be to set up some MRTG graphs. If you are lucky, your mail server has an SNMP agent; if not, you just need to write a few scripts that do a lot of counting.
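
If you would rather start with a script than a full MRTG setup, even a crude counter run against a recent log can establish a rough baseline. The sketch below is only an illustration, not part of the script developed in this article: it assumes the SunONE Messaging Server log format and directory shown later (a timestamp, a message ID, the channel fields, and an "E" enqueue action), so both the path and the pattern will need adjusting for other mail servers:

#!/usr/bin/perl -w
# Rough baseline sketch: count enqueue ("E") entries per hour in one
# mail log.  Assumes the SunONE-style log format and path described
# later in this article; adjust both for your environment.

use strict;

my $log = "/apps/log/imta/mail.log_current";
my %per_hour;

open(LOG, $log) || die "Couldn't open $log for reading: $!";
while (<LOG>) {
   # e.g. "29-May-2003 10:17:08.70 <id> tcp_local tcp_intranet E 2 ..."
   if (m/^(\S+)\s+(\d{2}):\d{2}:\S+\s+\S+\s+\S+\s+\S+\s+E\s/) {
      $per_hour{"$1 $2:00"}++;   # count enqueues per date and hour
   }
}
close(LOG);

# A lexical sort is close enough within a single log file.
foreach my $hour (sort keys %per_hour) {
   print "$hour  $per_hour{$hour}\n";
}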

The most rudimentary way to identify spam events (once you've established a baseline for your system) is to parse the mail logs. Every mail server, whether it's sendmail, postfix, qmail, or SunONE Messaging Server, has the ability (and should always be configured) to write logs for all SMTP traffic. The goal, then, is to parse these logs in search of commonalities that are unacceptable. Whether you administer a single system used only for your company's employees or you manage multiple servers handling millions of messages for thousands of Internet subscribers, the most sensible way to undertake this task is to write a script to reveal the pertinent information.

Differentiating the Good from the Bad

The function of this script, then, should be to extract the repetitive information from our mail logs and display it in a fashion that makes sense. The three details that we care about from the logs are the source email address, the source domain, and the source IP address. By classifying and counting this information, we can see where the email is coming from and then determine whether that email is good or bad. Good email is defined as email that the recipients signed up for intentionally. Take BugTRAQ, for instance. Lots of people want to receive BugTRAQ, so if we deem it bad email and attempt to block or thwart it, we will be upsetting a great number of people. Bad email, then, is defined as email that is clearly unsolicited and detrimental to system performance. Notice the requirement that it be detrimental to the system: plenty of the email floating around the Internet is unsolicited, but if we target individual messages that do not affect the performance of the system, we are wasting valuable cycles.

The Script

There are two prerequisites for the script below. The first is Perl, which can be obtained from any number of locations, depending on your system architecture. Since most environments compress and rotate their log files, the second prerequisite is the Compress::Zlib Perl module, which can be installed from CPAN (one common invocation is shown below; the exact form depends on your local setup):
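
perl -MCPAN -e 'install("Compress::Zlib")'

With the tools in hand, we can begin. As with any script, we first need to define some variables to make it portable between systems, and just for general troubleshooting purposes: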

#!/usr/bin/perl -w

use strict;
use Compress::Zlib;

my $base_dir = "/apps/log/imta";    # Log directory
my @log_files;                      # Log file names
my %file_mod_time;                  # Stores log file mtime
my $rep_dir  = "/tmp";              # Report directory
my $rep_file = "spam_events.txt";   # Report filename
my %senders_by_ip;                  # IP address hash
my %senders_by_em;                  # email address hash
my %senders_by_dm;                  # email domain hash
my $start_time = 86400;        # The default start time is 24 hours ago
my $end_time = 0;              # The default end time is 0 seconds ago
my $min_rcpts = 30;            # Minimum number of matches to include 
                               # in the report

Continuing with this proactive theme, we need a time frame in which to search the logs. If we believe the event began within the past hour, there's no need to search a full day's worth of logs, so we should provide this ability via the command line and account for the options within the script. By default, the script searches the past 24 hours (86400 seconds), so that we can see any events that may have occurred during that time:

if ($#ARGV == -1) {          # No arguments: use the defaults
   print "Searching the last " . ($start_time-$end_time)/3600 . " hours.\n";
} elsif ($#ARGV == 1) {      # Two arguments: begin and end times
   $start_time = $ARGV[0];   # Allow the user to specify the start time
   $end_time = $ARGV[1];     # Allow the user to specify the end time
   print "Searching the last " . ($start_time-$end_time)/3600 . " hours.\n";
} else {            # Otherwise, print the usage
   print "Usage: $0";
   print "\n       Runs over previous 24 hours.\n";
   print "Usage: $0 <begin time> <end time>";
   print "\n       Runs over time, specified in seconds.\n";
   exit 0;
}
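
For example, assuming the script has been saved as spam_report.pl (the name is arbitrary), a run covering only the past hour would look like this:

./spam_report.pl 3600 0

With no arguments at all, the script falls back to the 24-hour default described above.
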
Next, we read the log directory and determine which log files are available. At this point, it's necessary to know the naming convention for your logs. For instance, our logs are of the format mail.log_current and mail.log_current.[0-9].gz. So by matching on mail.log_current.*, I can be sure to get all of them. We also want to exclude files that fall outside of our time criteria. Once we've identified the appropriate files to search, we store the mtime in the %file_mod_time hash in order to search the files in the correct order later on:

opendir(BASEDIR, $base_dir) ||
   die "Couldn't open $base_dir for reading: $!";

@log_files = readdir(BASEDIR);

closedir(BASEDIR);

foreach my $file (@log_files) {
   my $mtime = (stat("$base_dir/$file"))[9];
   if (( $file  =~ m/^mail.log_current.*$/ ) &&
       ( $mtime >= time - $start_time )      &&
       ( $mtime <= time - $end_time )
      ) {
      $file_mod_time{$file} = $mtime;
   }
}

Once you have the list of files, you can open each file in order from oldest to newest, parse each line, and begin the tallies. We care about the order of the files because the report includes a field detailing the last file in which the entry appears. Next, you will need to formulate a regular expression to match your particular log format. An example log entry from my mail environment (SunONE Messaging Server) and its field definitions are as follows:

29-May-2003 10:17:08.70 15e1.221e.23858 tcp_local    tcp_intranet E 2 
jrichards@gci.com rfc822;jasonr@gci.net @ems-1.gci.net:jasonr@gci.net 
/apps/spool/queue1/tcp_intranet/013/ZZ0HFN0049JUSKBL.00 
<48ACA9FE7C30D411AF6D009027BA4E170935A3DC@rolla.gci.com> gci.com 
(daytona.gci.com [205.140.80.57])

Date:          29-May-2003
Time:          10:17:08.70
MessageID:     15e1.221e.23858
SrcChannel:    tcp_local
DestChannel:   tcp_intranet
Action:        E for Enqueue
Bytes:         2
SrcEmail:      jrichards@gci.com
DestEmail:     rfc822;jasonr@gci.net
RouteToDest:   @ems-1.gci.net:jasonr@gci.net
PathToMsg:     /apps/spool/queue1/tcp_intranet/013/ZZ0HFN0049JUSKBL.00
SrcMessageID:  48ACA9FE7C30D411AF6D009027BA4E170935A3DC@rolla.gci.com
SourceHost:    gci.com
SourceHostDNS: daytona.gci.com
SourceHostIP:  205.140.80.57

As mentioned before, we only want to report three of these fields: the SrcEmail, the domain part of the SrcEmail, and the SourceHostIP. Now that we understand the format of the log messages, we can write a regular expression to match the fields needed for the report. You'll see in the regular expression below that I substituted as many fields as possible with \S+ (one or more non-space characters) to simplify the expression. Some fields must remain literal, though, such as the SrcChannel and DestChannel, to pull out the exact type of lines we're looking for; in this case, the pattern targets mail flowing from the tcp_intranet channel out through tcp_local. After incrementing the associated counters, we can close the current file and continue with the next:

foreach my $file (sort bymtime keys %file_mod_time) {
   my $gz = gzopen("$base_dir/$file", "rb") ||
      die "Couldn't open $base_dir/$file: $gzerrno\n";

   while ($gz->gzreadline($_) > 0) {

      # The following pattern match in the "if" clause should not
      # span 3 lines as it does in this publication.  It must be
      # coded on a single line in order for the pattern match
      # to work properly.

      if (m!^\S+ \S+ \S+ tcp_intranet\s+tcp_local\s+E \S+ (.*?\@(.*?))
           rfc822\;\S+ \S+ \S+ \S+ \S+ \S+
           \[(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\].*!) {
         my $log_sender = $1;
         my $log_domain = $2;
         my $log_ipaddr = $3;

         $senders_by_ip{$log_ipaddr}{numrcpts}++;
         $senders_by_ip{$log_ipaddr}{lastfile} = $file;
         $senders_by_em{$log_sender}{numrcpts}++;
         $senders_by_em{$log_sender}{lastfile} = $file;
         $senders_by_dm{$log_domain}{numrcpts}++;
         $senders_by_dm{$log_domain}{lastfile} = $file;
      }
   }
   $gz->gzclose();
}
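
Because spacing and field order vary from site to site, it can save time to verify the expression against a known line before turning the script loose on a full day of logs. The following stand-alone sketch is just such an aid; the test entry is fabricated (example.com addresses and a documentation IP), and it uses the tcp_intranet-to-tcp_local channel order that the pattern expects:

#!/usr/bin/perl -w
# Stand-alone test harness for the log-matching pattern used above.
# The sample entry is made up; substitute a real line from your own
# logs when tuning the expression.

use strict;

my $sample = '29-May-2003 10:17:08.70 15e1.221e.23858 tcp_intranet '
           . 'tcp_local E 2 sender@example.com rfc822;rcpt@gci.net '
           . '@ems-1.gci.net:rcpt@gci.net /apps/spool/queue1/file.00 '
           . '<msgid@example.com> example.com (host.example.com [192.0.2.1])';

# As noted earlier, the pattern must be kept on a single line.
if ($sample =~ m!^\S+ \S+ \S+ tcp_intranet\s+tcp_local\s+E \S+ (.*?\@(.*?)) rfc822\;\S+ \S+ \S+ \S+ \S+ \S+ \[(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\].*!) {
   print "sender=$1  domain=$2  ip=$3\n";
} else {
   print "No match; adjust the pattern to fit your log format.\n";
}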

To gain an accurate picture of the mail traffic, we need to save this information to a report that we can archive, so that we can start developing trends. Therefore, we open a report file and output the information. We have three different hashes to sort through, so we'll proceed through them one at a time. As defined above, we'll only print entries that were matched at least 30 times ($min_rcpts). This keeps the report from growing to an overwhelming number of lines. However, you may want to lower this threshold if you believe that the event is distributed over hundreds or thousands of computers, each of which may be sending only 10 or 20 messages at a time:

open(REPORT, ">$rep_dir/$rep_file") ||
   die "Couldn't open $rep_dir/$rep_file for writing: $!";

foreach my $key (keys %senders_by_ip) {
   if ($senders_by_ip{$key}{numrcpts} >= $min_rcpts) {
      print REPORT "$senders_by_ip{$key}{numrcpts}:";
      print REPORT "$key:$senders_by_ip{$key}{lastfile}\n";
   }
}

foreach my $key (keys %senders_by_em) {
   if ($senders_by_em{$key}{numrcpts} >= $min_rcpts) {
      print REPORT "$senders_by_em{$key}{numrcpts}:";
      print REPORT "$key:$senders_by_em{$key}{lastfile}\n";
   }
}

foreach my $key (keys %senders_by_dm) {
   if ($senders_by_dm{$key}{numrcpts} >= $min_rcpts) {
      print REPORT "$senders_by_dm{$key}{numrcpts}:";
      print REPORT "$key:$senders_by_dm{$key}{lastfile}\n";
   }
}

close(REPORT);

The final step in this process is to sort the report in reverse numeric order so that the most frequent entries appear at the top. And, as is standard practice, our function for sorting the log files by modification time is included at the end of the script:

system('/usr/bin/sort','-rno',"$rep_dir/$rep_file", "$rep_dir/$rep_file");

sub bymtime {
   return $file_mod_time{$a} <=> $file_mod_time{$b};
}
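
One refinement worth considering: as written, each run overwrites the same report file, so nothing accumulates for the trending mentioned earlier. A date-stamped report name is an easy change. For example, the static $rep_file assignment near the top of the script could become something like the following (an illustrative variation, not part of the original script):

# Date-stamp the report so old runs are kept for trend analysis
# instead of being overwritten.
my ($mday, $mon, $year) = (localtime)[3, 4, 5];
my $rep_file = sprintf("spam_events.%04d%02d%02d.txt",
                       $year + 1900, $mon + 1, $mday);
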
Understanding the Report

Once the script is ready, check its syntax by running Perl with the -c option. Note that you'll probably have to tweak the line-matching regular expression to perfect it. Once it truly is ready, run it and view the report. Below is a sanitized excerpt from a report we used to identify an actual spammer, caught early enough to prevent an outage:

41999:gci.net:mail.log_current
16081:a.b.c.d:mail.log_current.1.gz
15925:yahoo.com:mail.log_current
4687:alaska.com:mail.log_current
4282:ak.net:mail.log_current
3491:e.f.g.h:mail.log_current
3092:legit_customer@gci.net:mail.log_current.1.gz
3088:i.j.k.l:mail.log_current.1.gz
933:alaskalife.net:mail.log_current
735:m.n.o.p:mail.log_current

In this report, we eliminated our primary domains, where we expect to see a lot of mail (gci.net, alaska.com, ak.net, alaskalife.net). This immediately indicated to us that the spam was originating from the IP address a.b.c.d. Under no circumstances should a single customer be sending roughly 20% of our mail traffic (16,081 messages, when our largest domain only accounts for 41,999). You can also see that almost the same number of messages originated from yahoo.com (15,925), so it's likely that the spammer was sending from that domain. The last detail to note is that the last mail log that matched the IP address was mail.log_current.1.gz, which indicates that the event has paused since the last log rotation. It turns out that the culprit was a cable modem subscriber who had configured their mail server to relay all mail through our platform, but had also left it configured as an open relay.

After shutting down the cable modem and contacting the customer to close that hole, we were back in service and haven't seen the spammer since. As for the rest of the report, the remaining top-ten entries are known and customary, so they can be ignored. This goes to show that it will take a few days, if not a few weeks, to become familiar with the standard flows of traffic through your platform before you can easily identify spam events.

Conclusion

There you have it. With a little work and some uninterrupted scripting time, you can join the growing crowd of proactive mail administrators and maybe save yourself some headaches down the line.

Jason A. Richards is a systems engineer for General Communication, Inc., the largest ISP in Alaska. He has been employed in systems administration for five years. During his off hours he pursues a bachelor's degree in business administration. He can be reached at: fusconed@slackershaven.org.