Searching in Unusual Ways and Places

Æleen Frisch

A few weeks ago, I was reading an article that cited some statistics about how many times various actions were performed in the course of a lifetime: how many hours a person sleeps, how many miles are driven to work, how much food is consumed -- you get the idea. I started to think about how many times I've done various things, including how many times I'd run various UNIX commands. For me, the top two most frequently used commands are ls and grep. In the course of my career so far, I've run each of them more than 100,000 times.

Clearly, grep is a command I can't live without. I constantly use it on its own and in pipes with other commands. For example:

% ps -aux | egrep 'chavez|PID'
USER      PID  %CPU  %MEM    VSZ  RSS   TTY    STAT  START   TIME  COMMAND
chavez  14355   0.0   1.6   2556  1792  pts/2  S     10:23   0:00  -tcsh
chavez  18684  89.5   9.6  27680  5280  ?      R N   Sep25  85:26  /home/j03/l988

I use this command combination often enough with different usernames that I've defined an alias for it.
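
For Bourne-style shells, an alias of this kind might be written as a small function instead. The following is only a sketch: the name psfilter is hypothetical, and the function reads ps-style output on standard input so that it can be tried on canned data as well as on the live process table.

```shell
# psfilter USER: keep the ps header line plus any line mentioning USER.
# (Hypothetical helper name; it is not the alias from the text.)
psfilter() {
    grep -E "$1|PID"
}

# Typical use against the live process table:
#   ps aux | psfilter chavez
```

Because the pattern is an unanchored alternation, any line that happens to contain the username elsewhere (in a command path, say) is also kept, just as with the original pipeline.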

There are times, however, when I want to perform grep-like search operations but grep itself is cumbersome or impossible to use: finding data within network traffic, looking for a software package, locating a specific email message. In these contexts where grep can't be applied easily, I have to turn to other tools (some are open source, others are vendor provided). This article will look at some of them.

Searching Network Packets

Searching network traffic for patterns in real time is a useful technique for debugging a variety of network problems. It's not easy to apply grep to this task. It is possible to run a packet-capturing utility like tcpdump and then search the resulting output with grep, but this can be awkward and ineffective. What you often want is to examine the entire packet when some part of its data matches a pattern. Unfortunately, using grep with a packet dump will return only the lines containing the pattern. There are also times when this approach is simply too slow.

Fortunately, there is an open source utility that does exactly this job: ngrep. It is developed and maintained by Jordan Ritter, and the project's home page is http://ngrep.sourceforge.net. ngrep has the following general syntax:

ngrep [options] pattern [filter]

where pattern is a regular expression to search for in network packets, and the optional filter is an expression indicating the sorts of network packets to which to pay attention. This filter, technically known as a Berkeley packet filter (BPF), consists of a series of keywords specifying rules for selecting packets (BPF filters are also used by packet-dumping utilities). These keywords specify the source and/or destination host, network, protocol, and/or port. See the manual page for information about constructing BPF expressions. If the filter is omitted, then all packets are included.

Using a filter in combination with a search string usually makes searching more efficient and ngrep's output more readable. As an example, consider this ngrep command that looks for packets containing output from the finger command among all network traffic:

# ngrep "No Plan\."

The command searches for a recognizable string from the finger output. If you run this on most systems and run a finger command within earshot of that host, you'll see the relevant packet intermixed with lots of number signs and dot series, and it will often be repeated several times. A number sign is printed for each packet examined, and dots indicate output interruptions. If you use the following command instead, then you won't get the ugly output, and ngrep will do less work:

# ngrep -q "No Plan\." port finger

Now, only packets to or from the finger port (79) are examined. The -q option suppresses the number signs.

Here is a more complex ngrep command that locates some specific FTP connection operations. It examines FTP-related packets sent from host hamlet to host ophelia and searches their data for the string USER:

# ngrep -q -t "USER" src host hamlet and dst host ophelia and tcp port 21
T 2002/09/24 14:11:15.413069 192.168.9.212:32813 -> 192.168.9.84:21 [AP]
  USER chavez..

T 2002/09/24 14:32:07.776476 192.168.9.212:32814 -> 192.168.9.84:21 [AP]
  USER amber..

For each packet, the output begins with a letter indicating the protocol (TCP here), followed by a time stamp (requested with the -t option), and then the source and destination host and port. The second line displays the packet's data. A command like this is one way of capturing all FTP connection attempts. As this example indicates, complex filters can be created by joining clauses with "and". You can also use the "or" and "not" logical operators as well as parentheses for grouping (the latter will need to be escaped to protect them from the shell).
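
As an alternative to backslash-escaping each parenthesis, the whole filter can be kept in one quoted string. The sketch below assumes that ngrep accepts the filter as a single argument, as tcpdump does; the function name and the second host name (horatio) are hypothetical. It only prints the command line rather than running it, since ngrep needs root privileges.

```shell
# bpf_two_sources HOST1 HOST2 PORT: build a BPF filter matching traffic
# from either host to the given TCP port. All three values are placeholders.
bpf_two_sources() {
    printf '( src host %s or src host %s ) and tcp port %s\n' "$1" "$2" "$3"
}

# Dry run: show the command instead of executing it.
filter=$(bpf_two_sources hamlet horatio 21)
echo ngrep -q -t USER "$filter"
```

Quoting "$filter" when the command is actually run keeps the shell from interpreting the parentheses.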

ngrep can be very useful when testing and debugging network services. For example, I found it very useful when I enabled the secure versions of LDAP. Everything seemed to work fine after I was finished, but I wanted to verify that the secure version of the protocol was being used. If it was, then I would not be able to detect any clear text passwords in LDAP traffic. An ngrep command like this one enables me to test this functionality:

# ngrep 'badpassword' src host ldapserver and \( port 636 or port 389 \)

This command watches both LDAP ports: 389, the standard port (which also carries TLS-secured sessions), and 636, the port for LDAP over SSL (ldaps). While this command was running, I ran three queries: one using the normal ldap protocol, one using the ldaps protocol, and a third using a GUI LDAP client in TLS mode. All three queries succeeded and displayed the appropriate records. The ngrep command returned output only for the first one, so it was clear that the password had been encrypted in the latter two cases. ngrep was a better choice than a general packet dumper for this job because it limited its scope to exactly the packets in which I was interested.

This example illustrates that ngrep can be useful for not finding things as well as finding them. In this case, the lack of output was what I hoped for. Those of you who had qualms about the finger example earlier can apply this principle as well: because running a finger daemon is not a good idea in most environments, such an ngrep command can function as a security trap. If finger traffic appears on the network, ngrep will detect it and let you know there is a problem.

ngrep has several other useful options:

-i -- Perform a case-insensitive search.
-A n -- Display the n packets following each matched packet.
-d dev -- Use the specified network device.
-O file -- Save matching packets in file in addition to displaying them.
-X -- Interpret the search pattern as hexadecimal.

ngrep is useful for a wide variety of tasks ranging from testing network applications to monitoring network traffic. It is also quite useful for debugging specific operations or programs on busy systems because of its ability to extract very narrow ranges of packets for examination.

Searching Mailboxes

At first thought, grep ought to be able to perform a task like searching mailboxes for specific text. You can search mail files for text, but using grep has at least two disadvantages. First, you may want to retrieve the entire message(s) that matches the pattern, and grep only returns matching lines by default. Second, if any of the mailboxes contain lengthy MIME attachments, searching with grep can produce voluminous output arising from an unlucky false positive within the binary attachment.
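
To make the first limitation concrete, here is a rough sketch of what returning whole messages involves when done by hand with awk. It assumes the traditional mbox convention that every message begins with a line starting with "From " (including the trailing space); the function name mboxgrep is hypothetical, and the sketch makes no attempt to handle MIME attachments.

```shell
# mboxgrep PATTERN FILE: print every message in an mbox file that has at
# least one line matching PATTERN (an awk extended regular expression).
mboxgrep() {
    awk -v pat="$1" '
        # Print the message collected so far if it matched, then reset.
        function emit() { if (hit) printf "%s", msg; msg = ""; hit = 0 }
        /^From / { emit() }                       # a new message starts here
        { msg = msg $0 "\n"; if ($0 ~ pat) hit = 1 }
        END { emit() }                            # do not forget the last one
    ' "$2"
}
```

The search is case-sensitive and treats headers and body alike, which is exactly the sort of detail a purpose-built tool handles better.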

A better tool for this job is grepmail, an open source utility written by David Coppit (see http://grepmail.sourceforge.net for more information). grepmail is designed specifically for searching mail folders. Here is a simple example of its use:

% grepmail -R -i -l hilton ~/Mail
Mail/conf/acs_w01

I was looking for the phone number of a specific Hilton hotel, which was in a mail message somewhere, but I couldn't remember where I'd filed it. This command searches for the string "hilton" (-i says to perform a case-insensitive search) in all mail folders under the specified starting directory (-R means recursive), and lists the names of files containing messages that match (-l option). The advantage of this approach is that I can search for the string I remember and find the telephone number even though the two items may be lines apart in the actual message. This command yields the phone number:

% grepmail -i hilton `!!` | grep -i telephone
Telephone:  619-231-4040

This grepmail command searches for the same string in the mail folder returned by the previous command. This time, grepmail will return the entire message as its output (since -l is omitted). The result is then piped to grep to isolate the phone number.

Here is a somewhat more complicated command that uses grepmail twice. Its goal is to find messages from user nadia that mention something related to Naples, Italy:

% grepmail -R -h "^From: .*nadia" ~/Mail | grepmail -b -i "naples|napoli|neapolit"

The first command searches mail headers (-h) for "From" lines including "nadia" somewhere in their text. The second command searches only the body (-b) of the matching messages for the specified strings.

grepmail has several other useful options:

-d date -- Limit search to messages on the specified date or within the specified date range. The date format is very flexible; see the manual page for details.
-v -- Display only non-matching messages.
-u -- Display only unique messages.
-M -- Don't search non-text MIME attachments.
-r -- Display a report listing each folder searched and the total number of matching messages within it.
-m -- Add an X-Mailfolder header to displayed messages; the header's text will be the path to the message's mail folder.
-H -- Display only the headers of matching messages.

It is also very easy to forward a mail message located in this manner. Here is a simple method:

% grepmail -m -u ... | mail -s subject someone@somewhere

Finally, some people prefer to view the search results from a mail client. This is usually easy to accomplish via a simple script that redirects grepmail's output to the mailer's default folder. Several have been created for this purpose:

pine: grepine by Cristin Pietsch -- http://www.dfki.de/~pietsch/software
mutt: grepm by Moritz Barsnick -- http://www.barsnick.net/sw/grepm.html
VM: vm-grepmail.el by Robert Fenk -- http://www.robf.de/Hacking/elisp/vm-grepmail.el

Search Operations for Software Packages

Software packages are another item whose contents are hard to search with grep. More specifically, I often want to answer questions like these:

Is a specific package installed?
What package does a specific file belong to?
What packages are available on an individual CD (or other media)?
What is included within a package (installed and not)?

On many systems, one or more of these questions can be answered using the package management tools supplied with the operating system. For example, the following commands can be used to list all currently installed packages on various UNIX systems:

Linux: rpm -q -a
FreeBSD: pkg_info -a -I
Solaris: pkginfo
HP-UX: swlist
AIX: lslpp -l all

You can pipe any of these commands to grep to determine whether a specific package is present or to find its actual package name. For example, the following command lists all packages related to LDAP installed on a Linux system:

% rpm -q -a | grep -i ldap
nss_ldap-184-1
openldap-2.0.23-4
openldap-clients-2.0.23-4
openldap-servers-2.0.23-4

This system has the OpenLDAP servers and client utilities installed, as well as the modules that interface LDAP to PAM and to the name service switch file, /etc/nsswitch.conf.
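
On a mixed network, the per-platform listing commands given earlier can be wrapped in one helper. This is a minimal sketch, assuming that uname -s reports Linux, FreeBSD, SunOS, HP-UX, or AIX and that "Linux" means an RPM-based distribution; the function name pkg_list_cmd is hypothetical.

```shell
# pkg_list_cmd OSNAME: print the package-listing command for the given
# operating system name (as reported by uname -s).
pkg_list_cmd() {
    case "$1" in
        Linux)   echo "rpm -q -a" ;;
        FreeBSD) echo "pkg_info -a -I" ;;
        SunOS)   echo "pkginfo" ;;
        HP-UX)   echo "swlist" ;;
        AIX)     echo "lslpp -l all" ;;
        *)       echo "unsupported system: $1" >&2; return 1 ;;
    esac
}

# Example (commented out because it needs the native package tools):
#   $(pkg_list_cmd "$(uname -s)") | grep -i ldap
```

Leaving the command substitution unquoted in the example is deliberate: the shell has to split the printed string into the command and its options.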

It's often useful to find out which package a particular file is part of (e.g., when you delete it accidentally and need to restore it). These command forms will indicate which package installed the specified file:

Linux: rpm -q --whatprovides path
Solaris: pkgchk -l -p path
AIX: lslpp -w path

Here is an example from a Solaris system:

% pkgchk -l -p /etc/init.d/sendmail
Pathname: /etc/init.d/sendmail
Type: editted file
Expected mode: 0744
Expected owner: root
Expected group: sys
Referenced by the following packages: SUNWsndmr
Current status: installed

When you want to know what is contained in an installed package, use these commands:

Linux: rpm -q -l name
FreeBSD: pkg_info -L name
Solaris: pkgchk -l name | grep "^Pathname:"
HP-UX: swlist -l file name
AIX: lslpp -f name

Here is an example from a FreeBSD system:

% pkg_info -L grub-0.91_1
Information for grub-0.91_1:

Files:
/usr/local/bin/mbchk
/usr/local/info/grub.info
/usr/local/info/multiboot.info
/usr/local/sbin/grub
...

In general, if you want to list the contents of an uninstalled package, you can replace the package name with the path to the package file in the preceding commands. On Linux systems, however, you must also precede the package file's path with the -p option (e.g., rpm -q -l -p file.rpm).

Only HP-UX and AIX have easy-to-use commands for listing the packages available on CDs or other media:

HP-UX: swlist -s path-or-device
AIX: installp -l -d device

On Linux, FreeBSD, and Solaris systems, you must rely on GUI package management tools to handle this function. On Linux systems, you can use gnorpm and similar packages (as well as yast2 on SuSE Linux systems). Under FreeBSD systems, you can use the sysinstall utility and select the Configure=>Packages menu path. On Solaris systems, the Supplementary Software CD includes a GUI installation tool that starts automatically when the CD is inserted, and it can be used to view the contents of the CD as well. On all three systems, you can also examine the directory containing the package files with ls for a quick listing of what is available.

Searching Net-SNMP MIBs

The Simple Network Management Protocol (SNMP) can be used to monitor and reconfigure a wide variety of computer systems and other network devices. The items that can be queried or set are defined in Management Information Bases (MIBs). A MIB is a collection of value and property definitions, and the various items are organized as a tree structure. This hierarchical organizational scheme serves to group related data together. MIB definitions are stored in files and are implemented in the software on the actual computers and devices. The MIB does not hold any data -- it is a schema, not a database.

Here is an example MIB item:

iso.org.dod.internet.mgmt.mib-2.system.sysLocation = "Machine Room"

The long string on the left is the setting's name, and its value is the string to the right of the equals sign. The name is separated into components by periods, and each corresponds to successive levels of the MIB tree. Thus, we can see that the sysLocation node is eight levels from the top of the tree.
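
The level count can be checked mechanically by counting the dot-separated components of the name. A small sketch follows (the function name mib_depth is hypothetical); it accepts names with or without the leading dot that snmptranslate uses.

```shell
# mib_depth NAME: print the number of levels in a dotted MIB name.
mib_depth() {
    # Strip one optional leading dot, then count dot-separated fields.
    printf '%s\n' "${1#.}" | awk -F. '{ print NF }'
}

mib_depth iso.org.dod.internet.mgmt.mib-2.system.sysLocation   # prints 8
```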

Although the MIB is organized as a tree, it is not uniformly populated. The top four levels of the standardized MIB tree exist mainly for historical reasons. Given this rather ad hoc structure, searching the MIB tree for specific items is often essential. However, it is not a job for grep.

Most SNMP implementations provide utilities for examining MIBs. The open source SNMP implementation Net-SNMP is used on Linux and FreeBSD systems (and other UNIX systems, if desired). The tool the package provides to examine the MIB structure is snmptranslate. This command provides information about the MIB structure and its items. For example, you can use it to display a MIB subtree, as in this example:

% snmptranslate -Tp .iso.org.dod.internet.mgmt.mib-2.system
+--system(1)
   |
   +-- -R-- String    sysDescr(1)
   |        Textual Convention: DisplayString
   |        Size: 0..255
   +-- -R-- ObjID     sysObjectID(2)
   +-- -R-- TimeTicks sysUpTime(3)
   +-- -RW- String    sysContact(4)
   |        Textual Convention: DisplayString
   |        Size: 0..255
   ...

I've truncated the output after four entries.

snmptranslate can also provide detailed information about a specific MIB item, as in this example using the sysLocation leaf:

% snmptranslate -Td .iso.org.dod.internet.mgmt.mib-2.system.sysLocation
1.3.6.1.2.1.1.6
sysLocation OBJECT-TYPE
  -- FROM       SNMPv2-MIB, RFC1213-MIB
  -- TEXTUAL CONVENTION DisplayString
  SYNTAX        OCTET STRING (0..255)
  DISPLAY-HINT  "255a"
  MAX-ACCESS    read-write
  STATUS        current
  DESCRIPTION   "The physical location of this node (e.g., 'telephone closet, 3rd
           floor'). If the location is unknown, the value is the zero-length string."
::= { iso(1) org(3) dod(6) internet(1) mgmt(2) mib-2(1) system(1) 6 }

However, the most important searching feature -- finding the location within the tree of a specific leaf -- is not provided automatically by snmptranslate. This command will provide that information for the memTotalReal item:

% snmptranslate -Ts | grep memTotalReal\$
.iso.org.dod.internet.private.enterprises.ucdavis.memory.memTotalReal

This item, the total real memory present on a system, is located at the specified point within the hierarchy. A slightly more complex command can provide both the full location and a description for a MIB leaf:

% snmptranslate -Td `snmptranslate -Ts | grep memTotalReal\$`

I use it often enough that I've defined an alias for this command:

% alias snmpwhat 'snmptranslate -Td `snmptranslate -Ts | grep \!:1\$`'
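
The alias above uses C shell syntax. For Bourne-style shells, a function can do the same job; the version below is only a sketch. It assumes that snmptranslate is on the search path, and it tightens the pattern slightly by requiring a dot before the leaf name, which is my addition rather than part of the original alias.

```shell
# snmpwhat LEAF: show the full location and description of a MIB leaf.
snmpwhat() {
    # Find the full dotted name whose last component is exactly LEAF.
    oid=$(snmptranslate -Ts | grep "\\.$1\$") || return 1
    # Describe it. If several names match, only the first line is useful,
    # so a caller may want to refine the leaf name in that case.
    snmptranslate -Td "$oid"
}

# Typical use:
#   snmpwhat memTotalReal
```

Returning early when grep finds nothing avoids calling snmptranslate -Td with an empty argument.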

Unusual Pattern Matching Requirements

I'll conclude this article with a quick look at two searching/pattern matching topics that can be a bit tricky.

Filtering Foreign Language Email

Like many people, I use procmail to preprocess mail messages, including attempting to remove spam. My current recipes work reasonably well for mail messages in Western languages, but they fail for ones in many other languages (e.g., Japanese, Chinese, Russian). Currently, I get 15-20 such spam messages each day.

Some people deal with this situation by discarding all email from the corresponding countries, but this approach does not work for me as I get legitimate mail from these countries on a regular basis (from non-predictable senders). What I needed was a procmail recipe to identify the foreign characters, which are above the normal ASCII range. The trick here is to get all of these characters into the .procmailrc file. This is easiest to do by entering them on a system/application that supports two-byte characters. The next step is to copy that file in binary mode to the system where procmail is run where its contents can be pasted into the initialization file.

A quick and dirty procmail recipe will look something like this when viewed with most text editors:

:0BH:
* [\200\201\202...\377][\200\201\202...\377][\200\201\202...\377]
$MAILDIR/foreign_spam

For me, three such characters in a row was a good enough first attempt at solving this problem. There are many more elegant solutions available on the Web. One of the best is by Walter Dnes, and it is available at:

http://www.waltdnes.org/email/chinese/index.html

It takes advantage of procmail's weighting capabilities to detect messages containing more than 5% non-ASCII characters.
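
The threshold idea is easy to try outside procmail. The sketch below is not Walter Dnes's recipe and does not use procmail's weighting; it simply computes the percentage of bytes above the ASCII range in a file holding one message, using tr and wc. The function name nonascii_percent is hypothetical.

```shell
# nonascii_percent FILE: print the integer percentage of bytes in FILE
# whose value is above 127 (that is, outside the ASCII range).
nonascii_percent() {
    total=$(wc -c < "$1" | tr -d ' ')
    if [ "$total" -eq 0 ]; then
        echo 0
        return
    fi
    # Delete every ASCII byte (octal 000-177) and count what is left.
    high=$(LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' ')
    echo $(( high * 100 / total ))
}

# Typical use (the file name is a placeholder):
#   [ "$(nonascii_percent message.txt)" -gt 5 ] && echo "probably foreign-language mail"
```

Setting LC_ALL=C makes tr treat the input as raw bytes rather than as multibyte characters.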

Less Well-Known Regular Expression Constructs

Most people are familiar with the asterisk, plus sign, and question mark modifiers to regular expression items (match zero or more, one or more, or zero or one of the item, respectively). However, you can specify how many of each item should be matched even more precisely using some extended regular expression constructs (use egrep or grep -E):

Form     Meaning

{n}      Match exactly n of the preceding item.
{n,}     Match n or more of the preceding item.
{n,m}    Match at least n and no more than m of the preceding item.

Here are some simple examples:

% grep -E "t{2}" bio
She has written eight books, including Essential Cultural Studies 
from Pitt. When she's not writing

% grep -E "[0-9]{3,}" bio
network of Unix and Windows NT/2000/XP systems. She

% grep -E "(the ){2,}|(and ){2,}" bio
and and creating murder mystery games. She
you'd like to receive the the free newsletter

The first command searches for double t's; the second command looks for numbers of three or more digits; and the third command searches for two or more consecutive instances of the word "the" or of the word "and" (it's a primitive copy editor). You might be tempted to formulate the final item as:

(the |and ){2,}

However, this won't work, as it will match "and the," which is not generally an error.
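
The difference between the two formulations is easy to see on a two-line sample. In this sketch the function names are hypothetical, and each function counts the input lines that its pattern matches.

```shell
# Count lines matched by the tempting (but too permissive) pattern.
naive_repeats() {
    grep -E -c '(the |and ){2,}'
}

# Count lines matched by the pattern used in the text.
strict_repeats() {
    grep -E -c '(the ){2,}|(and ){2,}'
}

printf 'and the end\nthe the end\n' | naive_repeats    # prints 2
printf 'and the end\nthe the end\n' | strict_repeats   # prints 1
```

The naive pattern flags the perfectly good phrase "and the" because each repetition of the group may choose a different alternative.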

Finally, be aware that the construct {,m}, which might mean "match m or fewer of the preceding item," is not defined by the standard, although some implementations (GNU grep among them) accept it as an extension.

Æleen Frisch is a systems administrator currently looking after a pathologically heterogeneous collection of computers. She is also the author of Essential System Administration, just released in an expanded third edition, the new System Administration Pocket Reference, as well as several other books. She can be reached by email at: [email protected].