Unix Administration with Self-Parsing Shell Scripts

John Kozubik
              Unix systems administration often consists of repeating common, 
              similar operations on different sets of targets. The notion of using, 
              and stringing together, building-block primitives such as cat, 
              sort, strings, wc, sed, and awk 
              is a core tenet of Unix philosophy. Sometimes jobs are too complicated 
              to solve with a single command-line of pipelined or redirected primitives, 
              or perhaps the job will be repeated often enough that it makes sense 
              to create a more permanent script to produce the desired outcome. 
              The problem that arises is how to allow the script to handle a variety 
              of related inputs, so that it will be useful in the future when 
              different data needs to be processed using the same procedures.
              The obvious solution is to write a script that operates on data 
              files so that the script can exist on its own, and be used to process 
              any number of new data sets as they are introduced in the future. 
              A good example of this is the rc system used for system initialization 
              in FreeBSD. In this arrangement, a number of rc scripts (such as 
              rc.network, rc.sendmail, and rc.diskless[1,2]) are run, each of 
              which in turn parse through a single user-edited configuration file, 
              /etc/rc.conf. In this configuration, changes to /etc/rc.conf alter 
              the way that rc scripts run and what they perform.
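For example, a few lines in /etc/rc.conf might read (these are standard rc.conf knobs; the hostname is invented):

hostname="www7.example.com"
sshd_enable="YES"
sendmail_enable="NO"

Each rc script consults the variables relevant to it and acts accordingly.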
              This solution is a useful and elegant one and lends itself to 
              modular organization and an increase in overall ease of use -- 
              especially when implemented as well as it is in the FreeBSD rc system. 
However, there are more discrete tasks (tasks that do not fit into
              some larger framework, such as system initialization) that can be 
              simplified by removing even the data files on which the script operates. 
              By incorporating the data into the script itself, you can 
              create powerful systems administration tools in the form of simple 
              shell scripts that consist merely of a single file.
              The basic organization of such a script is a set of one or more 
              data items that are commented out, as they are not actual commands, 
              but commented in such a way that they can be distinguished from 
              normal comments:
              
             
01: #!/bin/sh
02: #
03: # Here is the data set, and perhaps we will add some other comments here
04: #
05: ##DATA var1 var2 var3
06: ##DATA var1 var2 var3
07: ##DATA var1 var2 var3
 
As you can see, normal comments are set off with a single # character, while data items are marked with ##. Not only does this allow us to parse through the script and easily identify which lines are data lines (as opposed to normal comments), but it also allows us to quickly disable a data line that we temporarily do not want to use. Simply remove one of the # characters from that data line; it will not be parsed because it no longer begins with ##, but it still starts with #, and thus does not affect the script because it is still commented out.
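For example, in the following fragment (with hypothetical file names filled in), the second data line has been disabled by removing one of its leading # characters; the first and third lines will still be processed:

##DATA /etc/hosts /etc/hosts.bak /tmp/hosts.diff
#DATA /etc/fstab /etc/fstab.bak /tmp/fstab.diff
##DATA /etc/passwd /etc/passwd.bak /tmp/passwd.diff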
The body of the script consists of a variable defining the path of the script itself (obviously, the script needs to know the path to itself if it is to parse itself), and then a while loop that reads every line of the script but keeps (using grep) only the data lines, which begin with ##DATA:
              
             
08: myself=/usr/local/bin/script.sh
09:
10: while read line
11: do
12:
13:    if echo "$line" | grep "^##DATA" > /dev/null
14:    then
15:
16:      var1=`echo "$line" | awk '{print $2}'`
17:      var2=`echo "$line" | awk '{print $3}'`
18:      var3=`echo "$line" | awk '{print $4}'`
19:
20:      diff $var1 $var2 >> $var3
21:
22:    fi
23:
24: done < $myself
Here the script (in line 08) defines the path to itself and uses a "while read line" construct, the end of which (in line 24) takes the script itself as input. As the script reads each line of itself, it selects, through the use of grep, only the lines that begin with ##DATA (grep's output is sent to /dev/null; only its exit status matters to the if statement). As noted above, disabling a particular line of data can be done by simply removing one of the # characters, as the line will still be commented out but will no longer be selected as an input line by grep.
What the script above actually does is trivial -- as you can see, in lines 16, 17, and 18, we simply echo each line and pipe the output through awk to grab the second, third, or fourth word in that line (since the first word is ##DATA) and assign it to a variable. Then we perform a simple operation on the variables from that line.
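For instance, given the hypothetical data line:

##DATA /etc/hosts /etc/hosts.bak /tmp/hosts.diff

lines 16-18 assign var1=/etc/hosts, var2=/etc/hosts.bak, and var3=/tmp/hosts.diff, so line 20 runs:

diff /etc/hosts /etc/hosts.bak >> /tmp/hosts.diff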
              Because we have three total lines of data in this script (lines 
              05, 06, and 07), the diff action in line 20 will be run a total 
              of three times, each time with a different group of data. There 
              is no limit to the number of data lines we could put in the script. 
              It should be noted, however, that every line in the entire script 
              will be parsed and tested by the grep conditional in line 13, regardless 
              of whether it is a data line.
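If that per-line overhead ever matters, one variant (a departure from the script above, not something the original uses) is to let grep do the filtering once, up front, and feed only the data lines into the loop:

grep "^##DATA" $myself | while read line
do
     var1=`echo "$line" | awk '{print $2}'`
     var2=`echo "$line" | awk '{print $3}'`
     var3=`echo "$line" | awk '{print $4}'`
     diff $var1 $var2 >> $var3
done

Note that the while loop now runs in a subshell, which is harmless here because nothing after the loop depends on its variables.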
              What are some practical uses for self-parsing shell scripts such 
              as this? In June 2000, I wrote a simple, but powerful backup utility 
              that, three years later, I now use to back up every system in my 
              organization. The script, available here:
              
             
http://www.kozubik.com/software/kozubik_backup-1.0.tar.gz
 
            contains data lines like this:
             
             
##ENTRY /usr/local/etc www7.kozubik.com-usr-local-etc /mnt/backup1
##ENTRY /var/mail www7.kozubik.com-var-mail /mnt/backup4
 
The first piece of data is a local path in the filesystem. The second piece of data is a label for that backup, and the final piece of data is the destination where the backup will be placed.
             The body of this backup script is somewhat complicated, but can 
              be simplified as follows:
              
             
00: myself=/usr/local/etc/backup.sh
01: while read line
02: do
03:
04:   if echo "$line" | grep "^##ENTRY" > /dev/null
05:   then
06:
07:     directory=`echo "$line" | awk '{print $2}'`
08:     name=`echo "$line" | awk '{print $3}'`
09:     backupdir=`echo "$line" | awk '{print $4}'`
10:     date=`date '+%y-%m-%d'`
11:
12:     if test -d $backupdir
13:     then
14:       if test -d $directory
15:       then
16:         tar cvzf $backupdir/$date-$name.tar $directory
17:       else
18:         echo "Directory to back up does not exist"
19:       fi
20:     else
21:       echo "Target directory does not exist"
22:     fi
23:   fi
24: done < $myself
            The end result is, given the two data lines above, the creation of 
            two gzipped tar files -- one named 03-07-20-www7.kozubik.com-usr-local-etc.tar 
            that is placed in /mnt/backup1, and one named 03-07-20-www7.kozubik.com-var-mail.tar 
            that is placed in /mnt/backup4. Note that different source directories 
            for the backup can be placed in different target directories, as that 
            is simply a variable defined in each data line.
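Because the whole tool is one self-contained file, scheduling it is equally simple. A hypothetical crontab entry to run it nightly at 2:00 a.m. might look like:

0 2 * * * /usr/local/etc/backup.sh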
             Another, more complicated (but very useful) script can be found 
              here:
              
             
http://www.kozubik.com/software/kozubik_pulse-1.0.tar.gz
 
            This script contains data lines like this:
             
             
##QUERY www.example.com 80 "GET /index.html HTTP/1.0" "HTTP/1.1 \
  200 OK" "responds to HTTP requests" "DOES NOT respond to HTTP \
  requests" 2
##QUERY www.example.com 25 "QUIT" "221" "responds to SMTP \
  requests" "DOES NOT respond to SMTP requests" 2
 
The first piece of data is an address. The second is a port number. The third is a string to send to that socket. The fourth is a string we expect to be returned to us. The fifth and sixth are messages to report on success or failure, respectively. Finally, the seventh is the number of tries the script should make to query the network service before returning a negative report.
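Taken together, the first ##QUERY line above boils down to a single netcat exchange; conceptually (with the parameters taken from that data line), it is equivalent to:

printf "GET /index.html HTTP/1.0\r\n\r\n" | nc -v -w 3 www.example.com 80

with the output then checked for the string "HTTP/1.1 200 OK".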
             The body of the script is more complicated, as it calls netcat 
              (nc) to query the services in the data lines of the script, but 
              again, the basic functionality can be simplified as follows:
              
             
00: myself=/usr/local/etc/pulse.sh
01: while read line
02: do
03:
04:   if echo "$line" | grep "^##QUERY" > /dev/null
05:   then
06:
07:     # split the data line; quoted fields use " as the field separator
08:     address=`echo "$line" | awk '{print $2}'`
09:     port=`echo "$line" | awk '{print $3}'`
10:     send=`echo "$line" | awk -F\" '{print $2}'`
11:     receive=`echo "$line" | awk -F\" '{print $4}'`
12:     upmsg=`echo "$line" | awk -F\" '{print $6}'`
13:     downmsg=`echo "$line" | awk -F\" '{print $8}'`
14:     retries=`echo "$line" | awk -F\" '{print $9}'`
15:     count=0
16:     successes=0
17:
18:     while test $count -lt $retries
19:     do
20:
21:       output=`printf "$send\r\n\r\n" | nc -v -w 3 $address $port`
22:       count=`expr $count + 1`
23:       if echo "$output" | grep "$receive" > /dev/null
24:       then
25:         successes=`expr $successes + 1`
26:       fi
27:
28:     done
29:
30:     if test $successes -eq 0
31:     then
32:       echo "$address $downmsg on port $port" >> /tmp/kozubik.pulse.fail.$$
33:     else
34:       echo "$address $upmsg on port $port" >> /tmp/kozubik.pulse.succeed.$$
35:     fi
36:
37:   fi
38: done < $myself
            In this script, the end result is that each line in the data portion 
            of the script causes netcat to run with the corresponding destination 
            and port number, and is given a number of retries within which to 
            receive a positive response from the server daemon that is presumed 
            to be running at that remote location. Although we do not show it 
            here, the file that this script outputs (either failure or success) 
            can then be emailed to one or more email addresses defined earlier 
            in the script.
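Although the full script handles this, a minimal sketch of that final step (the address here is hypothetical) might look like:

if test -f /tmp/kozubik.pulse.fail.$$
then
  mail -s "pulse: failures detected" admin@example.com < /tmp/kozubik.pulse.fail.$$
fi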
Note one significant difference between the backup script shown above and this pulse.sh script. Because the backup script has data lines whose elements are always single words (such as /usr/local/etc), we can simply parse the lines using awk and grab words in the data line by their number. In the pulse.sh script, however, some of the data elements contain multiple words, such as "responds to HTTP requests", which means that when we parse the data lines and pull out individual fields with awk, we need to specify a field separator and offset, as shown in lines 10-14:
              
             
upmsg=`echo "$line" | awk -F\" '{print $6}'`
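Splitting on the " character places the quoted strings in the even-numbered awk fields (the odd-numbered fields are the stretches of text between the quoted strings). A quick illustration, using a shortened, hypothetical data line:

echo '##QUERY www.example.com 80 "GET / HTTP/1.0" "200"' | awk -F\" '{print $2}'
GET / HTTP/1.0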
The original basic template, shown as the first code example above, combined with the specific examples of a backup script and a remote monitoring script, shows how a self-parsing shell script can dramatically simplify common Unix systems administration tasks. Compare the simplicity of the pulse.sh script above, which monitors remote server daemons (and which I use to monitor more than 800 servers across the world), to the complexity of other remote monitoring applications that require complicated software installations and hundreds of program and data files. The ability to quickly alter one or more data lines, to disable lines (while leaving them in place) by removing one of the two leading # characters, and to carry each tool around as a single, self-contained file represents a significant gain in simplicity and ease of use.
             I encourage you to download and begin using the three fully functional 
              examples at http://www.kozubik.com, and to begin grouping 
              common tasks into new self-parsing shell scripts to extend this 
              simplicity and ease to other tasks I have yet to write examples 
              for.
              John Kozubik has been designing, administering, and programming 
              Unix-based systems for more than 10 years. John owns and operates 
              JohnCompanies, one of the largest Unix colocation providers in the 
              United States.