 Troubleshooting Solaris™ Network Performance
 Alex Golomshtok
              Networks are the bloodstreams of modern computer systems. Today, 
              nearly all computers are connected to some kind of public or private 
              network, and it is difficult to imagine a system without at least 
              some sort of networking capabilities. As computer technology continues 
              to evolve, the distributed computing model gains more ground, thus 
              increasing the importance of networks. In fact, today most organizations 
              rely on their own complex networking structures so much that even 
              a short period of downtime may easily translate into millions of 
              dollars of lost revenues.
              Modern-day networks are often monstrously complex, convoluted, 
              and rely on a wide spectrum of technologies. A typical corporate 
              network, for instance, may bring together thousands of computer 
              systems from different hardware vendors, running various operating 
              systems. Monitoring the health of such a network is quite a challenge 
              and may be impossible without the proper tools. To satisfy growing 
              demands for reliable management of heterogeneous networks, the Simple 
              Network Management Protocol (SNMP) was developed and adopted as 
              a management standard for TCP/IP-based networking systems. SNMP 
              quickly gained popularity and remains the primary mechanism for 
              carrying out a multitude of network management tasks, such as network 
              performance monitoring, fault management, configuration management, 
              and more.
              SNMP
              The foundation of SNMP is the database containing the management 
              data, on which the network management system operates. This database 
              is commonly referred to as the Management Information Base (MIB). 
              The SNMP MIB is essentially a tree-like collection of objects, each 
              representing a managed resource on a network. A network management 
              system can monitor the state of these objects by reading their properties 
              and alter the state by modifying these properties. The organization 
              of an MIB is governed by a standard, called Structure of Management 
              Information (SMI) [1] -- it outlines the rules for constructing 
              and defining MIB management objects. Over the years, a few different 
              MIBs have been developed to address various aspects of network and 
              system management, such as Relational Database Monitoring MIB and 
              Mail Management MIB. However, MIB-II [2], which defines the second 
              version of the management information base for TCP/IP-based internets, 
              remains perhaps the most important and most commonly used MIB 
              specification. MIB-II defines the following broad groups of management 
              information:
              
              
             
               System -- General information about the networked system, 
                such as its identification information, location, uptime, etc. 
               Interfaces -- Information describing each of the system's 
                network interfaces. 
               AT -- Information pertinent to the operations of an address 
                translation (AT) protocol; essentially the contents of the address 
                translation table. 
               IP -- Information pertinent to the operations of the IP protocol 
                on a given system. 
               ICMP -- Information pertinent to the operations of the ICMP 
                protocol. 
               TCP -- Information pertinent to the operations of the TCP protocol. 
               UDP -- Information pertinent to the operations of the UDP protocol. 
               EGP -- Information pertinent to the operations of the EGP protocol. 
               DOT3 -- Information pertinent to the transmission schemes 
                and access protocols at each system interface. 
               SNMP -- Information pertinent to the operations of the SNMP 
                protocol on a given system.
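              As a rough illustration of the tree-like layout, each MIB-II group 
              sits under the mib-2 subtree (1.3.6.1.2.1), followed by a group number. 
              The sketch below is plain Perl; the names %MIB2_GROUP and group_oid 
              are made up for this example and do not belong to any SNMP library. 
              DOT3 objects are registered beneath the transmission group.

```perl
use strict;
use warnings;

# MIB-II group numbers under the mib-2 subtree (1.3.6.1.2.1), per RFC 1213.
# %MIB2_GROUP and group_oid() are illustrative names, not a library API.
my $MIB2_PREFIX = '1.3.6.1.2.1';
my %MIB2_GROUP  = (
    system       => 1,
    interfaces   => 2,
    at           => 3,
    ip           => 4,
    icmp         => 5,
    tcp          => 6,
    udp          => 7,
    egp          => 8,
    transmission => 10,   # DOT3 objects are registered beneath this group
    snmp         => 11,
);

# Return the dotted OID of a MIB-II group, or undef for an unknown name.
sub group_oid {
    my ($name) = @_;
    return exists $MIB2_GROUP{$name}
        ? "$MIB2_PREFIX.$MIB2_GROUP{$name}"
        : undef;
}

# Print the groups in numeric order.
printf "%-12s %s\n", $_, group_oid($_)
    for sort { $MIB2_GROUP{$a} <=> $MIB2_GROUP{$b} } keys %MIB2_GROUP;
```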
              Clearly, MIB-II covers many aspects of TCP/IP-based network 
              management and allows for building comprehensive management systems. 
              However, there is one question that MIB specifications do not quite 
              answer -- where does the management data come from? SNMP, as 
              powerful and flexible as it is, is just a mechanism for disseminating 
              and sometimes altering the management information; at no time 
              is it responsible for actually collecting and maintaining the data.
              Streams
              The answer lies with the TCP/IP stack. Since System V Release 3 (SVR3), 
              UNIX has been equipped with the streams [3] mechanism -- an elegant 
              and flexible framework for UNIX System communication services. In 
              the true spirit of UNIX, the streams model encourages the development 
              of compact modules, representing functional components such that 
              these modules can then be dynamically loaded and interconnected 
              to form a fully functional data communication path or stream. Streams 
              closely resemble the layered structure of typical networking protocols, 
              and therefore are perfect for implementing protocol stacks.
              A stream is a communication link between the user space (or application 
              program) and the kernel. Typically, an application will create a 
              stream by opening a streams device, such as /dev/ip. 
              When a stream is opened, it consists of a stream head (the 
              interface between the stream and the user process) and a stream 
              driver. An application process may then "push" various 
              modules onto the stream, thus enabling certain services. Each stream 
              module is typically responsible for carrying out a set of closely 
              related functional tasks, such as adding network routing information 
              to user packets. Once a stream is assembled, an application may 
              initiate a bi-directional data exchange or stream I/O.
              The data is passed through the stream in the form of messages. 
              When a user process passes a message to the stream head, this message 
              is sent from module to module until it reaches the bottom of the 
              stack -- the stream driver. In this case, the message is said 
              to be traveling downstream. Whenever the kernel replies, the data 
              travels upstream -- each stream module passes the message to 
              the module above it until it reaches the stream head. In the case of 
              the TCP/IP stack, not only the data but also control messages can 
              be sent downstream to either alter the behavior of the stream or 
              retrieve some sort of management information maintained by the stream 
              modules.
              One control message that can be sent downstream is the option 
              management request. Typically, the option management request is 
              delivered to a specific module on the stream, but when retrieving 
              MIB data, all stream modules receive the request at once, and the 
              entire universe of operational data, pertinent to all stream modules, 
              is returned to the application program.
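              The downstream and upstream flow can be pictured with a toy model. 
              This is ordinary Perl, not the STREAMS API: the component names and 
              the round_trip helper are invented for illustration, and each 
              component merely records that the message passed through it.

```perl
use strict;
use warnings;

# Toy model of a stream: an ordered list of components between the
# stream head (top) and the driver (bottom). Names and behavior are
# invented for illustration only.
my @trace;
my @stream = map {
    my $name = $_;
    +{
        name => $name,
        down => sub { my ($msg) = @_; push @trace, "down:$name"; return $msg },
        up   => sub { my ($msg) = @_; push @trace, "up:$name";   return $msg },
    };
} qw(stream-head tcp ip driver);

# Send a message from the stream head down to the driver, then let the
# reply travel back up through the same components in reverse order.
sub round_trip {
    my ($msg) = @_;
    $msg = $_->{down}->($msg) for @stream;
    $msg = $_->{up}->($msg)   for reverse @stream;
    return $msg;
}

round_trip('option management request');
print join(' ', @trace), "\n";
```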
              Under Solaris, the majority of the management information described 
              by MIB-II can be retrieved directly from the stream using a sequence 
              of ioctl(2) calls. Many network monitoring utilities, such as netstat(1M), 
              as well as most SNMP agents, employ streams option management requests 
              to gather network statistics. These programs typically construct 
              a brand new stream for the purpose of obtaining management data, 
              configure it by pushing the appropriate modules onto the stream 
              head, send an option management request downstream, and then extract 
              the statistical data from the returned message.
              Frankly, SNMP and a slew of Solaris network monitoring utilities 
              solve most of the network monitoring problems while successfully 
              hiding the complexities of streams programming from typical systems 
              administrators. There are, however, some situations where more control 
              or more flexibility is desired. Netstat(1M), for instance, although 
              convenient and easy to use, remains just another program that produces 
              textual output. This makes it difficult to use netstat as a basis 
              for custom monitoring solutions.
              Although it is certainly possible to use some shell magic to extract 
              the values of network counters from netstat output, this approach 
              is unreliable, inefficient, and just plain ugly. Another 
              problem is that every invocation of netstat creates a new OS 
              process, thus consuming precious system resources. 
              Most monitoring tools take periodic snapshots of the statistical 
              data and calculate deltas over a predefined time interval, so a 
              shell script that launches netstat every time it needs a sample 
              of network statistics makes for a very inefficient and expensive 
              monitoring tool.
              SNMP solves most of these problems -- there are numerous APIs 
              and tools, such as the excellent Net::SNMP Perl module [4], which 
              could be used to read and even modify the network management information 
              in a fairly painless fashion. But even SNMP is not perfect. First, 
              every SNMP-based tool relies on an SNMP agent, which has to be running 
              on every managed computer system. Second, although "simple" 
              is part of its name, SNMP is not quite that simple -- programming 
              SNMP client applications can get fairly involved. Third, there are 
              certain security implications when running SNMP agents on computing 
              nodes connected to public networks. If not configured correctly, 
              SNMP can provide a wealth of information about your network to a 
              potential intruder.
              Solaris::MIB2
              In an attempt to solve some of these problems, we created a simple 
              yet powerful Perl extension called Solaris::MIB2 [5]. This module 
              allows easy access to most of the statistical and operational data 
              maintained by Solaris stream modules, while imposing only a minimal 
              load on a monitored system. The following few lines of code demonstrate 
              how easy it is to obtain a value of an arbitrary network counter 
              -- for example, tcpCurrEstab or current number of TCP connections 
              in an established state:
              
             
use Solaris::MIB2;
$mib = new Solaris::MIB2("/dev/tcp");
print $mib->{tcp}->{tcpCurrEstab}, "\n";
As you can see, all we have to do is create an instance of the Solaris::MIB2 
            object, passing "/dev/tcp" as a parameter (so that the module 
            builds the stream over the /dev/tcp device), and use the returned hash 
            reference to read the value of interest.
              As this article will show, many compact and powerful network monitors 
              can be developed with Solaris::MIB2, although the module has a few 
              limitations, which users should be aware of. The first, and perhaps 
              most severe, shortcoming of Solaris::MIB2 is that it, unlike SNMP, 
              cannot read the management data from a remote computer over the 
              network. Although it is possible to create a custom network-based 
              data distribution mechanism, this was not our intention. Those who 
              look for this kind of functionality should turn to SNMP.
              Yet another limitation, which is a result of a conscious design 
              decision, is read-only access to the management data exposed by 
              Solaris::MIB2. Unlike SNMP, which is a general-purpose network management 
              facility, Solaris::MIB2 is intended solely for the purposes of network 
              monitoring and, as such, does not allow for any modification of 
              the data on which it operates. While Solaris::MIB2 provides access 
              to most of the management information described in the MIB-II RFC [2], 
              it is not fully compliant with the specification and does not implement 
              some of the groups, such as System, SNMP, or DOT3. In fact, the 
              module exposes most of the structures defined in /usr/include/inet/mib2.h, 
              with the exception of some IPv6 tables.
              Finally, the interface to the MIB2 data is not implemented as 
              a tied hash -- in other words, reading a value from the MIB2 
              object will not trigger the option management request to be sent 
              downstream. Instead, when the object is first created, the stream 
              module statistics are read and loaded into the regular hierarchical 
              hash. Every subsequent refresh operation must be initiated explicitly 
              using the update function, which is part of the Solaris::MIB2 interface:
              
             
use Solaris::MIB2;
$mib = new Solaris::MIB2("/dev/tcp");
while( 1 ) {
sleep(5);
$mib->update();
print $mib->{tcp}->{tcpCurrEstab}, "\n";
}
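              The snapshot behavior can be mimicked without a Solaris host. The 
              sketch below uses a hypothetical stand-in class, FakeMIB, whose 
              counter source is a closure; it is not part of Solaris::MIB2 and 
              only shows why a value stays frozen until update is called.

```perl
use strict;
use warnings;

# FakeMIB is a hypothetical stand-in for a snapshot-style statistics
# object; it is NOT Solaris::MIB2. The "kernel" is simulated by a closure.
package FakeMIB;

sub new {
    my ($class, $source) = @_;
    my $self = bless { source => $source }, $class;
    $self->update();                 # first snapshot is taken at construction
    return $self;
}

# Re-read every counter at once and replace the cached snapshot.
sub update {
    my ($self) = @_;
    $self->{tcp} = { %{ $self->{source}->() } };
    return $self;
}

package main;

my $established = 3;                 # simulated live kernel counter
my $mib = FakeMIB->new(sub { return { tcpCurrEstab => $established } });

$established = 7;                    # the simulated kernel value changes
print "before update: $mib->{tcp}{tcpCurrEstab}\n";   # still the old snapshot
$mib->update();
print "after update:  $mib->{tcp}{tcpCurrEstab}\n";   # refreshed value
```

              Reading the hash between calls to update never touches the data 
              source, which is the trade-off the explicit-refresh design makes.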
As mentioned earlier, reading MIB2 statistics is an all-or-nothing proposition 
            -- it is impossible to retrieve the values of individual variables 
            and, whenever an option management request is sent downstream, the 
            operational data for all stream modules is returned. This 
            particular feature of the MIB2 interface makes a tied-hash implementation 
            prohibitively expensive.
              To demonstrate the power and flexibility of Solaris::MIB2, I've 
              provided a few simple examples, designed to illustrate how the functionality 
              afforded by this module can be applied to real-world network monitoring 
              problems. The first sample program, called pif, attempts to mimic 
              some of the functionality of the popular UNIX utility arp(1M). The 
              arp(1M) program displays and modifies the contents of the Internet-to-Ethernet 
              address resolution tables, used by the address resolution protocol 
              [6]. To save space, the capabilities of this pif 
              program will be limited to printing the contents of the address 
              translation or Net-to-Media table, which is equivalent to running 
              the arp(1M) utility with the -a command-line switch.
              The following is a complete source code listing of pif:
              
             
1  #!/usr/local/bin/perl
2
3  use Socket;
4  use Solaris::MIB2;
5
6  $mib = new Solaris::MIB2( "/dev/ip" );
7
8  print "Device IP Address      Mask            Flags  Phys Address\n";
9  print "------ --------------- --------------- ------ ------------------\n";
10
11  foreach my $entry ( @{$mib->{ipNetToMediaEntry}} ) {
12     my $device = $entry->{ipNetToMediaIfIndex};
13     my $host   = gethostbyaddr( inet_aton($entry->{ipNetToMediaNetAddress}),AF_INET) ||
14                     $entry->{ipNetToMediaNetAddress};
15     my $flags   =  ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_PERMANENT) ? "S" : "";
16        $flags  .=  ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_PUBLISH)   ? "P" : "";
17        $flags  .= !($entry->{ntm_flags} & Solaris::MIB2::ACE_F_RESOLVED)  ? "U" : "";
18        $flags  .=  ($entry->{ntm_flags} & Solaris::MIB2::ACE_F_MAPPING)   ? "M" : "";
19
20     my $mask   = sprintf("%u.%u.%u.%u",
21        map( hex("0x$_"), unpack("A2A2A2A2", $entry->{ntm_mask})));
22     my $phys   = $entry->{ipNetToMediaPhysAddress};
23
24     printf("%-6s %-15s %-15s %-6s %-20s\n", $device, $host, $mask, $flags, $phys);
25  };
The script uses two extension modules -- Solaris::MIB2, loaded 
            at line 4, and Socket, loaded at line 3. The Socket module exposes 
            the inet_aton function, which converts the character 
            representation of a host IP address into the packed struct in_addr form 
            that the gethostbyaddr(3NSL) function expects.
              Line 6 constructs a brand new MIB2 object over /dev/ip by passing 
              "/dev/ip" to the constructor function of Solaris::MIB2. 
              Note that, under Solaris, read/write access to /dev/ip is limited 
              to root and members of the sys group; therefore, if not run by 
              a privileged user, our script will fail. Always using root or another 
              special user id to run the script is not very convenient, so making 
              the script set-group-id sys seems like the best solution. 
              Many UNIX programs, such as passwd(1) or netstat(1M), are set-user-id 
              or set-group-id, which allows regular users to perform operations 
              that are typically permitted only to root or other privileged users. 
              Set-user-id and set-group-id programs are controversial, as many 
              people consider them inherently unsafe. However, if configured correctly, 
              these programs provide convenient solutions for many otherwise unsolvable 
              problems.
              Our program, however, is a script, interpreted at run time by 
              Perl, as opposed to a binary executable such as netstat(1M), which 
              makes it more of a security concern. First, the script's source 
              code is more readily accessible and thus can easily be examined 
              for security vulnerabilities by a potential intruder. But most importantly, 
              some UNIX kernels, especially the older ones, have a security problem 
              with set-user-id and set-group-id scripts. When a user executes 
              a file whose first line starts with #!path_to_interp, 
              the kernel translates this into an exec(2) call that invokes the 
              interpreter identified by path_to_interp, passing the original 
              script file name and other arguments as parameters. For example, 
              if the script "/usr/local/bin/foo" starts with #!/bin/ksh 
              and is invoked as follows:
              
             
/usr/local/bin/foo  arg1  arg2  arg3
the kernel will actually execute the following command:  
             
/bin/ksh  /usr/local/bin/foo  arg1  arg2  arg3
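              The translation can be sketched in a few lines of Perl. This is 
              only an approximation of what a kernel does (real kernels differ 
              in how they treat interpreter arguments and length limits), and 
              build_exec_argv is a name made up for this example.

```perl
use strict;
use warnings;

# Approximate how exec(2) rewrites the command line for a "#!" script.
# build_exec_argv() is an illustrative helper, not a system interface.
sub build_exec_argv {
    my ($first_line, $script_path, @args) = @_;

    # Capture the interpreter path and one optional interpreter argument.
    my ($interp, $interp_arg) =
        $first_line =~ /^#!\s*(\S+)(?:\s+(\S.*?))?\s*$/
        or return;

    return (
        $interp,
        (defined $interp_arg ? $interp_arg : ()),
        $script_path,        # the path exactly as the user typed it
        @args,
    );
}

my @argv = build_exec_argv('#!/bin/ksh', '/usr/local/bin/foo',
                           qw(arg1 arg2 arg3));
print "@argv\n";
```

              Note that the script path is passed along exactly as it was given, 
              so the interpreter has to open the file a second time by name.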
Now let's consider the following scenario: a user makes a symbolic 
            link /tmp/foo_link pointing to /usr/local/bin/foo and invokes the 
            script through that link. In this case, the kernel executes the following command:  
             
/bin/ksh  /tmp/foo_link  arg1  arg2  arg3
There is a window between the time the kernel opens the script file 
            to determine what must be executed and the time when the interpreter 
            (/bin/ksh) reopens the file to actually execute it. As small as this 
            window might be, there is a chance that a malicious user could modify 
            the symbolic link to point to a different file. Thus, if the script 
            is run set-user-id root, some untrustworthy code will execute with 
            superuser privileges.
              Recent releases of Solaris close this security hole by passing 
              "/dev/fd/3", a special file that is already opened over 
              the original script file, to the interpreter instead of the actual 
              path to the script file, thus eliminating the race condition 
              and reducing the security risk. The Perl configuration script checks 
              whether your system supports secure set-user-id scripts using 
              the following clever trick:
              
             
echo "#!/bin/ls" >reflect
chmod +x,u+s reflect
./reflect >flect 2>&1
if /bin/grep "/dev/fd" flect >/dev/null; then
    echo "Congratulations, your kernel has secure setuid scripts!"
else
    echo "setuid scripts are not secure!"
fi
If the Perl installation script detects that your system does not 
            support secure set-user-id and set-group-id scripts, it will attempt 
            to build a special set-user-id version of the interpreter, called 
            suidperl. This special executable allows Perl to emulate the set-user-id 
            mechanism, because it is invoked every time Perl detects the set-user-id 
            or set-group-id bit set on a script file. With this in mind, we can 
            assume that set-user-id and set-group-id Perl scripts are reasonably 
            secure. Thus, in order for our pif script to run correctly, it should 
            be made set-group-id 'sys' as follows:  
             
chgrp sys pif
chmod g+s pif
Once the MIB2 object is successfully constructed over /dev/ip, the 
            script prints out the column headings at lines 8 and 9 and then starts 
            iterating over the contents of the Net-to-Media table, using a foreach 
            loop at line 11. Line 12 simply reads the device name, stored under 
            the ipNetToMediaIfIndex hash key. Lines 13 and 14 obtain 
            the IP address associated with a particular address translation table 
            entry and attempt to look up the host name for it, using the gethostbyaddr(3NSL) 
            function. The next four lines of code -- 15 through 18 -- 
            check the address translation flags using a set of predefined constants 
            exposed by the Solaris::MIB2 module.
              As with the arp(1M) command, our script prints the following four 
              flags:
              
              
             
               'S' or static as opposed to dynamic address translation 
                entry, learned through the ARP protocol. 
               'P' or published. This means that ARP should respond 
                to requests for the indicated host coming from other machines. 
                Published entries include those explicitly added with the arp(1M) 
                '-s' command-line switch as well as the entry for the 
                local machine. 
               'U' or unresolved. Unresolved entries are those for which 
                an ARP response has not yet been received. 
               'M' or mapping. This is a special type, used for 
                the multicast entry 224.0.0.0.
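              The flag tests can be tried out in isolation. In the sketch below 
              the bit values are placeholders chosen for the example (the real 
              values come from the ACE_F_* constants exported by Solaris::MIB2), 
              and flag_string is a hypothetical helper that follows the same 
              logic as the pif script.

```perl
use strict;
use warnings;

# Placeholder bit values chosen for this example only; the real values
# come from the ACE_F_* constants exported by Solaris::MIB2.
use constant {
    F_PERMANENT => 0x01,
    F_PUBLISH   => 0x02,
    F_RESOLVED  => 0x08,
    F_MAPPING   => 0x10,
};

# Build the flag string the way pif does: S, P, U (when NOT resolved), M.
sub flag_string {
    my ($flags) = @_;
    my $out = '';
    $out .= 'S' if   $flags & F_PERMANENT;
    $out .= 'P' if   $flags & F_PUBLISH;
    $out .= 'U' if !($flags & F_RESOLVED);
    $out .= 'M' if   $flags & F_MAPPING;
    return $out;
}

print flag_string(F_PERMANENT | F_PUBLISH | F_RESOLVED), "\n";  # static, published
print flag_string(0), "\n";                                     # unresolved entry
```

              Note that 'U' is the only flag printed when a bit is absent, which 
              is why its test is negated.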
              Lines 20 and 21 read the value of the netmask for a particular 
              address translation table entry. Solaris::MIB2 returns the netmask 
              as a hex string -- ffffff00, for instance. Our script translates the 
              hexadecimal number into conventional dotted notation by first 
              breaking the string apart with the unpack function, prepending "0x" 
              to each of the four resulting elements to turn them into hex strings, 
              converting those to integers with the hex function, and then putting 
              everything back together with the sprintf function.
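              The same conversion works as a standalone function, which makes 
              it easy to test away from a Solaris host. The name hex_to_dotted 
              and the input validation are additions for this sketch, not part 
              of the pif script.

```perl
use strict;
use warnings;

# Convert an 8-digit hex netmask such as "ffffff00" to dotted notation.
# hex_to_dotted() is a name invented for this sketch.
sub hex_to_dotted {
    my ($hexmask) = @_;
    return unless defined $hexmask && $hexmask =~ /^[0-9a-fA-F]{8}$/;

    # Split into four 2-character fields and convert each one from hex.
    return join '.', map { hex($_) } unpack 'A2 A2 A2 A2', $hexmask;
}

print hex_to_dotted('ffffff00'), "\n";   # prints 255.255.255.0
print hex_to_dotted('f0000000'), "\n";   # prints 240.0.0.0
```

              Perl's hex function accepts its argument with or without a leading 
              "0x", so the prefix is omitted here.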
              Line 22 simply reads the physical or MAC address, and, finally, 
              line 24 outputs a formatted address translation entry to the screen.
              When run on one of our Solaris systems, the pif script produces the 
              following output:
              
             
Device IP Address         Mask                Flags     Phys Address
------ ---------------    ---------------     ------    ------------------
hme0   sun2               255.255.255.255               08:00:20:90:c5:b6
hme0   sun5               255.255.255.255               08:00:20:81:69:c4
...
hme0   198.162.31.170     255.255.255.255               00:02:55:f4:1c:79
...
hme0   sun3               255.255.255.255     SP        08:00:20:90:cf:1c
...
hme0   224.0.0.0          240.0.0.0           SM        01:00:5e:00:00:00
which is pretty much identical to the output produced by arp -a.
              The next example is a bit more useful. Instead of mimicking the functionality 
              of an existing program, it demonstrates how Solaris::MIB2 can be used 
              to build lightweight custom network monitoring solutions. The following 
              is the complete source code for the program, called "tcpmon":  
             
1  #!/usr/local/bin/perl
2
3  use Solaris::MIB2 ":all";
4  use Time::HR;
5  use Getopt::Std;
6   
7  # sample thresholds
8  use constant active            => 2.0;
9  use constant retrans_problem   => 25.0;
10  use constant listen_problem   => 0.5;
11  use constant halfopen_problem => 2.0;
12  use constant outrsts_problem  => 2.0;
13  use constant attempt_fails    => 2.0;
14  use constant indup_problem    => 25.0;
15
16  getopts( "i:h" );
17  die "usage: tcpmon -i<interval> -h\n"
18     if $opt_h;
19
20  $mib = new Solaris::MIB2 q(/dev/tcp);
21  die "failed to create instance of MIB2 object\n"
22     unless $mib;
23
24  $now        = undef;
25  $then       = gethrtime();
26  %stats_now  = ();
27  %stats_then = %{$mib->{tcp}}; # copy the counters, not the reference
28
29  while(1) {
30     sleep($opt_i||5);
31     $mib->update();
32     $now       = gethrtime();
33     %stats_now = %{$mib->{tcp}};
34
35     $interval = ($now - $then) * 0.000000001;
36     next unless $interval;
37
38     $tcpInDataBytes  = $stats_now{tcpInDataInorderBytes} - $stats_then{tcpInDataInorderBytes};
39     $tcpInDataBytes += $stats_now{tcpInDataUnorderBytes} - $stats_then{tcpInDataUnorderBytes};
40     $tcpInDataBytes /= $interval;
41
42     $tcpOutDataBytes = ($stats_now{tcpOutDataBytes} - $stats_then{tcpOutDataBytes})/$interval;
43     $tcpRetransBytes = ($stats_now{tcpRetransBytes} - $stats_then{tcpRetransBytes})/$interval;
44     $tcpRetransPercent = $tcpOutDataBytes ?
45        100.0 * $tcpRetransBytes / $tcpOutDataBytes : 0.0;
46
47     $tcpOutRsts      = ($stats_now{tcpOutRsts} - $stats_then{tcpOutRsts})/$interval;
48     $tcpAttemptFails = ($stats_now{tcpAttemptFails} - $stats_then{tcpAttemptFails})/$interval;
49
50     $tcpInDataSegs  = $stats_now{tcpInDataInorderSegs} - $stats_then{tcpInDataInorderSegs};
51     $tcpInDataSegs += $stats_now{tcpInDataUnorderSegs} - $stats_then{tcpInDataUnorderSegs};
52     $tcpInDataSegs /= $interval;
53     $tcpOutDataSegs = ($stats_now{tcpOutDataSegs} - $stats_then{tcpOutDataSegs})/$interval;
54
55     $tcpActiveOpens  = ($stats_now{tcpActiveOpens} - $stats_then{tcpActiveOpens})/$interval;
56     $tcpPassiveOpens = ($stats_now{tcpPassiveOpens} - $stats_then{tcpPassiveOpens})/$interval;
57
58     $tcpListenDrop   = ($stats_now{tcpListenDrop} - $stats_then{tcpListenDrop})/$interval;
59     $tcpListenDropQ0 = ($stats_now{tcpListenDropQ0} - $stats_then{tcpListenDropQ0})/$interval;
60     $tcpHalfOpenDrop = ($stats_now{tcpHalfOpenDrop} - $stats_then{tcpHalfOpenDrop})/$interval;
61
62     $tcpInDupBytes  = $stats_now{tcpInDataDupBytes} - $stats_then{tcpInDataDupBytes};
63     $tcpInDupBytes += $stats_now{tcpInDataPartDupBytes} - $stats_then{tcpInDataPartDupBytes};
64     $tcpInDupBytes /= $interval;
65     $tcpInDupPercent = $tcpInDataBytes ?
66        100.0 * $tcpInDupBytes / $tcpInDataBytes : 0.0;
67
68     %stats_then = %stats_now; $then = $now;
69
70     print "high retransmissions, fix network.\n"
71        if $tcpRetransPercent >= retrans_problem;
72     if ($tcpListenDrop + $tcpListenDropQ0 >= listen_problem) {
73        print "Listen queue dropouts, speedup accept processing.\n";
74        print "Listen HalfOpenDrops, possible SYN denial attack.\n"
75           if $tcpHalfOpenDrop >= halfopen_problem;
76     }
77     print "Incoming connections refused: port scanner attack.\n"
78        if $tcpOutRsts >= outrsts_problem;
79     print "Attempt failures: can't connect to remote application.\n"
80        if $tcpAttemptFails >= attempt_fails;
81     print "High duplicate input, fix net and remote server retrans.\n"
82        if $tcpInDupPercent >= indup_problem;
83  };
The first few lines of the program (lines 3 through 5) load the necessary 
            Perl extensions -- Solaris::MIB2, Time::HR, and Getopt::Std. Time::HR 
            [7] is a very simple module that allows for measuring elapsed time 
            intervals with nanosecond precision. The public interface of Time::HR 
            consists of a single function, gethrtime, which under Solaris simply 
            calls the gethrtime(3C) function. Getopt::Std is the standard 
            Perl extension used to process command-line arguments. The tcpmon 
            program takes two command-line options: -h, which simply prints 
            out the usage, and -i, which allows the user to override the 
            default setting of 5 seconds for the sampling interval.
              Lines 8 through 14 declare some thresholds, which will subsequently 
              be used for diagnosing various network problems. Lines 16 through 
              18 parse command-line arguments and, if the help flag -h 
              is supplied, print the usage information and abort the program. 
              Line 20 constructs a MIB2 object over /dev/tcp. Because 
              our script is intended for TCP monitoring, we are no longer required 
              to construct the stream over /dev/ip; hence, there is no need to 
              run this program set-group-id sys. Once the MIB2 object is 
              constructed, the program records the value of the high-resolution 
              timer at line 25 and saves the initial MIB2 statistics into a hash 
              at line 27.
              Once the initialization is completed, the program jumps into an 
              endless loop at line 29 and suspends itself for the duration of 
              the sampling interval -- either the value of -i command-line 
              argument or the default 5 seconds. Upon the expiration of the interval, 
              the MIB2 hierarchical hash is refreshed using the update function 
              at line 31. Then, the current value of the high-resolution timer 
              and current MIB2 statistics are recorded again at lines 32 and 33. 
              We then calculate the elapsed time interval in seconds and restart 
              the while loop if the elapsed time is zero.
              Lines 38 through 66 perform most of the work. This is where we 
              calculate the deltas for the TCP counters over the elapsed time 
              interval. The algorithm for calculating these deltas is borrowed 
              from the tcp_class.se module, distributed as part of the SE Performance 
              Monitoring Toolkit [8].
              The following measures are calculated:
              
              
             
               tcpRetransPercent -- Percentage of retransmitted 
                bytes relative to the total number of bytes transmitted over the 
                time interval. 
               tcpListenDrop and tcpListenDropQ0 -- Number 
                of connections dropped from the completed connection queue and 
                incomplete connection queue, respectively. 
               tcpHalfOpenDrop -- Number of connections dropped 
                after the initial SYN packet was received, over the time interval. 
               tcpOutRsts -- Number of TCP segments sent out that 
                contained the RST flag, over the time interval. 
               tcpAttemptFails -- Number of connections that made 
                a direct transition to the CLOSED state from either SYN-SENT state 
                or SYN-RCVD state, plus the number of connections that made a 
                direct transition from SYN-RCVD state to LISTEN state over the 
                time interval. 
               tcpInDupPercent -- Percentage of complete duplicate 
                data segments received relative to the total number of segments 
                received over the time interval.
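              As a rough illustration of the delta arithmetic, the following minimal sketch computes per-interval differences from two counter snapshots and derives a retransmission percentage. It is not the tcpmon listing itself: the helper names counter_deltas and retrans_percent are hypothetical, the snapshots are plain hashes with made-up values rather than Solaris::MIB2 objects, and the choice of tcpRetransBytes and tcpOutDataBytes as inputs is an assumption.

```perl
use strict;
use warnings;

# Per-interval change of every counter present in both snapshots.
# (Hypothetical helper; counter wrap-around is not handled here.)
sub counter_deltas {
    my ($prev, $curr) = @_;
    my %delta;
    for my $name (keys %$curr) {
        next unless exists $prev->{$name};
        $delta{$name} = $curr->{$name} - $prev->{$name};
    }
    return \%delta;
}

# Retransmitted bytes as a percentage of data bytes sent in the interval.
# The two counter names used here are assumptions.
sub retrans_percent {
    my ($delta) = @_;
    my $sent    = $delta->{tcpOutDataBytes} || 0;
    my $retrans = $delta->{tcpRetransBytes} || 0;
    return 0 if $sent <= 0;          # nothing sent, nothing to report
    return 100 * $retrans / $sent;
}

# Made-up snapshots standing in for two successive MIB2 reads.
my %prev = (tcpOutDataBytes => 1_000_000, tcpRetransBytes => 10_000, tcpOutRsts => 4);
my %curr = (tcpOutDataBytes => 1_200_000, tcpRetransBytes => 20_000, tcpOutRsts => 9);

my $delta = counter_deltas(\%prev, \%curr);
printf "retrans %.1f%%, RSTs sent %d\n",
    retrans_percent($delta), $delta->{tcpOutRsts};
# prints "retrans 5.0%, RSTs sent 5"
```

              Guarding against a zero denominator matters on an idle system, where no data bytes may be sent during a short sampling interval.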
              Once all measures are calculated, the program saves the current 
              TCP statistics and the value of the high-resolution timer for subsequent 
              iterations of the while loop (line 68) and goes on to carry 
              out a series of checks (lines 70 through 72).
              The program compares the retransmission percentage against the 
              predefined threshold value. Older releases of Solaris (prior to 
              Solaris 2.6) had problems with TCP retransmission algorithms, thus 
              high retransmission percentages seen on these systems may go away 
              when all necessary TCP patches are applied. On newer systems, however, 
              high retransmission percentage usually implies that some network 
              hardware is faulty and dropping packets.
              The next two checks are, perhaps, the most interesting and have 
              more to do with intrusion detection than with performance monitoring. 
              To fully understand what is going on here, one must understand how 
              TCP establishes connections. The 3-way handshake connection establishment 
              process [6] assumes that in order to initiate a connection, a client 
              application will send a SYN (synchronize sequence numbers) segment, 
              which specifies the server port number to which this client wants 
              to connect, and the client's initial sequence number (ISN). 
              The server then replies with a SYN/ACK packet -- the segment 
              that contains the server's initial sequence number and the 
              acknowledgement of the client's SYN. Next, the client acknowledges 
              the server's SYN with another ACK segment. However, if a client 
              attempts to connect to a port on which no service is listening, 
              the server will reply with an RST (reset) packet.
              Port Scanning
              There are a few different techniques that port scanners use to 
              produce a list of services running on a target machine. The simplest 
              and most basic form of TCP scanning is the vanilla connect scan. This 
              technique relies on the connect(3SOCKET) system call to open 
              a connection to each port of interest on a target machine. If the 
              connection succeeds, there's a service listening; otherwise, 
              the port is unreachable. Clearly, a TCP connect scan is very "loud," 
              as most systems will log the failed connection attempts, and very 
              inefficient, especially over slow connections.
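              The connect-based probe described above can be sketched with the core IO::Socket::INET module. This is only an illustration of the idea, not code from any particular scanner: the function name port_is_open and the 2-second default timeout are assumptions, and the demonstration probes only a listener that the script itself creates on the loopback address.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Returns 1 if a full TCP connection to $host:$port succeeds, 0 otherwise.
# (Hypothetical helper name; the default timeout is an arbitrary choice.)
sub port_is_open {
    my ($host, $port, $timeout) = @_;
    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
        Timeout  => $timeout // 2,
    );
    return 0 unless $sock;   # refused, timed out, or unreachable
    close $sock;
    return 1;
}

# Create our own listener so the probe has a known-open port to find.
my $listener = IO::Socket::INET->new(
    LocalAddr => '127.0.0.1',
    LocalPort => 0,          # let the kernel pick a free port
    Proto     => 'tcp',
    Listen    => 5,
) or die "cannot create listener: $!";
my $port = $listener->sockport;

print "port $port open: ", port_is_open('127.0.0.1', $port), "\n";
```

              Because each probe completes the full handshake, the target sees an ordinary connection followed by a close, which is why this style of scan is so easy to notice.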
              A much better scanning technique is SYN or half-open scanning. 
              When using this form of scan, a client will send a SYN packet just 
              like it would do while initiating a normal connection. If the server 
              replies with SYN/ACK, the port is in service; if RST is received, 
              the port is unreachable. Upon receiving a reply from the server, 
              the client immediately sends back an RST packet, thus tearing down 
              a connection, which never goes into the established state. SYN scanning 
              is fairly efficient and significantly less visible, as half-open 
              connection attempts are normally not logged by the target system.
              Yet another scanning technique, even more clandestine than SYN 
              scanning, is FIN scanning. When FIN scanning, a client sends a FIN 
              (finish sending data) packet to a server. If the RST reply is received, 
              the port of interest is closed; however, if the FIN packet is ignored 
              altogether, the port is listening. As we can see, regardless of 
              the scanning technique used, the server will most likely send RST 
              replies out if packets arrive on a closed port. Therefore, to detect 
              a port scan in progress, all our program has to do is to check the 
              number of RST packets sent out (tcpOutRsts) against a pre-defined 
              threshold and report a possible port scan if this threshold is exceeded.
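              A minimal sketch of such a check follows. It assumes the RST count has already been reduced to a per-interval delta. The function name port_scan_suspected, the limit of 2 RSTs per second, and the decision to normalize by elapsed time are all illustrative assumptions, not values taken from tcpmon.

```perl
use strict;
use warnings;

# Hypothetical threshold: RST segments sent per second.
my $RST_RATE_LIMIT = 2;

# Returns 1 when the RSTs sent during the interval exceed the limit.
sub port_scan_suspected {
    my ($rst_delta, $elapsed_sec) = @_;
    return 0 if $elapsed_sec <= 0;            # no measurable interval
    my $rate = $rst_delta / $elapsed_sec;
    return $rate > $RST_RATE_LIMIT ? 1 : 0;
}

# 3 RSTs in 5 seconds is ordinary background noise.
print port_scan_suspected(3, 5)    ? "possible port scan\n" : "ok\n";
# 1500 RSTs in 5 seconds suggests many closed ports being probed.
print port_scan_suspected(1500, 5) ? "possible port scan\n" : "ok\n";
```

              Normalizing by the elapsed time keeps the threshold meaningful if the sampling interval is changed with the -i argument.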
              SYN Flooding
              The next check is a bit more complex, as it attempts to detect 
              a possible denial of service (DoS) attack -- SYN flooding. Normally, 
              while handling incoming connection requests, TCP queues incomplete 
              connections as well as completed connections, which have not been 
              accepted (via the accept(3SOCKET) system call) by an application 
              process. The maximum length of the queue is usually limited to prevent 
              excessive consumption of system memory. Once the limit is reached, 
              TCP will silently discard all new incoming connection requests until 
              all pending connections are processed.
              When launching a SYN-flooding attack, a client will first issue 
              a connection request to the server by sending a packet with SYN 
              flag set. As opposed to a normal SYN packet, however, this one will 
              have a client IP address spoofed to be that of an unreachable host. 
              In an attempt to complete the 3-way handshake, a server will keep 
              trying to send a SYN/ACK packet to this unreachable host for the 
              duration of an arbitrary timeout interval. Clearly, if the attacking 
              host sends enough of these SYN requests to a particular port on a 
              target host (for instance, the telnet port 23), the backlog queue 
              will fill up with pending connections to the point where the server 
              starts dropping all new incoming connection requests. Thus, the 
              server remains practically unusable until it finishes handling all 
              outstanding connections on its backlog queue -- it is in effect 
              flooded.
              The tcpmon program, therefore, monitors the total number of connections 
              dropped from the backlog queue (tcpListenDrop + tcpListenDropQ0) 
              over a period of time, trying to determine whether the backlog limit 
              has been reached. Backlog queue drops alone may just mean that the 
              server's accept processing is inefficient. However, when paired with 
              an excessive number of half-open connection drops (tcpHalfOpenDrop), 
              they may be indicative of a SYN-flooding attack in progress.
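              The decision logic just described might look roughly like the sketch below. The function name classify_drops, the threshold values, and the wording of the returned strings are hypothetical; the input is assumed to be a hash of per-interval deltas keyed by the counter names used in this article.

```perl
use strict;
use warnings;

# Hypothetical per-interval thresholds.
my %LIMIT = (listen_drops => 10, half_open_drops => 10);

# Returns a short diagnosis for one interval's counter deltas.
sub classify_drops {
    my ($d) = @_;
    my $listen_drops = ($d->{tcpListenDrop} || 0) + ($d->{tcpListenDropQ0} || 0);
    my $half_open    = $d->{tcpHalfOpenDrop} || 0;

    # Few backlog drops: nothing to report.
    return 'ok' if $listen_drops <= $LIMIT{listen_drops};

    # Backlog drops together with many half-open drops point to a flood;
    # backlog drops alone suggest the application is slow to accept.
    return $half_open > $LIMIT{half_open_drops}
        ? 'possible SYN flood'
        : 'backlog overflow';
}

print classify_drops({ tcpListenDrop => 2,  tcpListenDropQ0 => 1,   tcpHalfOpenDrop => 0 }),   "\n";
print classify_drops({ tcpListenDrop => 40, tcpListenDropQ0 => 0,   tcpHalfOpenDrop => 1 }),   "\n";
print classify_drops({ tcpListenDrop => 5,  tcpListenDropQ0 => 300, tcpHalfOpenDrop => 250 }), "\n";
```

              Keeping the two conditions separate lets the monitor distinguish a tuning problem from a likely attack instead of raising the same alarm for both.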
              Recent releases of Solaris are quite resilient to SYN flooding. 
              Instead of just one backlog queue, Solaris systems feature two. 
              The first one is the complete connections queue, which holds those 
              connections for which the 3-way handshake has been completed but 
              the accept(3SOCKET) call has not yet been issued. Second 
              is the incomplete connections queue (or Queue 0), which holds one 
              entry for every SYN packet that arrived. Once the server receives 
              an ACK from the client, a connection is moved from an incomplete 
              queue to a complete queue. The size limit value for the incomplete 
              connection queue is typically quite large, which makes a server 
              more resistant to SYN-flooding.
              In fact, size limit values for both queues, as well as another 
              parameter -- connection timeout (which affects the duration 
              of time the server attempts to contact an unreachable host in our 
              SYN-flooding scenario) -- can be further tuned to maximize the 
              server's resistance to SYN floods. Perhaps the easiest way 
              to view or modify the values of these parameters is via the ndd(1M) 
              command. The following are the variable names that ndd(1M) 
              uses to retrieve or set the values of these tunables:
              
              
             
               tcp_conn_req_max_q -- Maximum number of completed 
                connections waiting to be accepted via an accept(3SOCKET) call. 
               tcp_conn_req_max_q0 -- Maximum number of connections 
                for which the 3-way handshake has not been completed. 
               tcp_time_wait_interval -- Maximum amount of time 
                a TCP socket will remain in TIME_WAIT state.
              Thus to read the value of, for example, the size limit of the 
              completed connection queue, the following command should be executed:
              
             
ndd /dev/tcp tcp_conn_req_max_q
For the adventurous types, however, who want complete programmatic 
            control over the TCP/IP tunable parameters, we created another Perl 
            module, called Solaris::NDDI [9]. This module does essentially the 
            same thing as ndd(1M) (although it doesn't call ndd(1M) 
            internally but rather utilizes some convoluted C code), and can easily 
            be used by a regular Perl script. For instance, to read the value 
            of the same tcp_conn_req_max_q variable, the following code 
            should be used:  
             
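use strict;
use warnings;
use Solaris::NDDI;
# Open the TCP driver; tunables are read through the hash-style interface.
my $ndd = Solaris::NDDI->new("/dev/tcp");
print $ndd->{tcp_conn_req_max_q}, "\n";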
Having finished with intrusion detection checking, our tcpmon program 
            looks at two other very simple conditions -- duplicate input percentage 
            (which is indicative of excessive retransmissions done by remote servers) 
            and the number of failed attempts to connect to remote applications. 
            Obviously, this simple monitor packs a lot of useful functionality 
            into fewer than a hundred lines of code. To ensure that the program 
            actually does its job, we launched our favorite port scanner from 
            a remote host as follows: 
             
nmap -sS sun3
Immediately, tcpmon starts outputting the following message:  
             
"Incoming connections refused: port scanner attack."
Although the example programs described in this article are fairly 
            rudimentary and lack the strength expected in a robust production 
            application, I hope enough background information has been presented 
            to demonstrate the simple yet powerful functionality afforded by the 
            Solaris::MIB2 module. I also hope this article achieves its goal of 
            stimulating the reader's appetite for building lightweight flexible 
            custom network monitors, and that the techniques outlined here can 
            be used to solve some challenging network-related problems.
              References
              1. RFC 1155. Structure and Identification of Management Information 
              for TCP/IP-based networks.
              2. RFC 1213. Management Information Base for Network Management 
              of TCP/IP-based Internets: MIB-II.
              3. Sun Microsystems, Inc. STREAMS Programming Guide. Part Number 
              805-7478-10.
              4. Net::SNMP by David M. Town. www.perl.com/CPAN-local, CPAN directory 
              DTOWN, Net-SNMP-4.0.1-tar.gz
              5. Solaris::MIB2 by Alexander Golomshtok. www.perl.com/CPAN-local, 
              CPAN directory AGOLOMSH, Solaris-MIB2-0.01.tar.gz
              6. TCP/IP Illustrated, Volume 1. W. Richard Stevens. Addison-Wesley 
              Publishing Company, 1994. ISBN 0-201-63346-9.
              7. Time::HR by Alexander Golomshtok. www.perl.com/CPAN-local, 
              CPAN directory AGOLOMSH, Time-HR-0.01.tar.gz
              8. SE Performance Monitoring Toolkit. Adrian Cockcroft, Richard 
              Pettit. www.setoolkit.com.
              9. Solaris::NDDI by Alexander Golomshtok. www.perl.com/CPAN-local, 
              CPAN directory AGOLOMSH, Solaris-NDDI-0.01.tar.gz
              Alexander Golomshtok is a project manager and technology specialist 
              at JP Morgan Chase. He can be reached at: [email protected].
           |