Cover V11, I12

TCP/IP Networking in Gawk 3.1.0

Mike Warner

In 1997, Jurgen Kahrs and Arnold Robbins added TCP/IP networking capability to "gawk", the Free Software Foundation's implementation of the awk programming language. The networking subsystem began as a set of patches that eventually migrated into the main source tree in time for Gawk 3.1.0. Just after Arnold Robbins announced the availability of Gawk 3.1.0 on comp.lang.awk, I downloaded the source archive and began building Gawk 3.1.0 on various flavors of Linux and BSD. It has always built flawlessly, and the networking capability has worked just as advertised.

In my opinion, gawk's networking syntax is self-explanatory, so I won't spend time describing it here. Gawk's nicely abstracted networking subsystem makes gawk networking scripts extremely compact. In this article, I will present two sets of client/server utilities. For additional information, see the resources listed at the end of this article.
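For readers who have not yet seen the syntax, here is a minimal sketch of it, wrapped in shell functions. It is not taken from the listings: the function names (server_socket, client_socket, serve_once, ask) and the "echo:" reply are made up for illustration. What it does rely on is gawk's documented special file name, /inet/tcp/LOCALPORT/REMOTEHOST/REMOTEPORT, and the two-way I/O operator "|&".

```bash
# Build gawk's TCP special file names: /inet/tcp/LOCALPORT/REMOTEHOST/REMOTEPORT.
# A 0 in a field means "any".
server_socket() { printf '/inet/tcp/%s/0/0\n' "$1"; }         # listen on port $1
client_socket() { printf '/inet/tcp/0/%s/%s\n' "$1" "$2"; }   # connect to host $1, port $2

# Server side: accept one connection, read one line, send it back.
serve_once() {
    gawk -v svc="$(server_socket "$1")" 'BEGIN {
        if ((svc |& getline line) > 0)   # first I/O opens the socket and waits
            print "echo: " line |& svc   # reply over the same two-way channel
        close(svc)
    }'
}

# Client side: send one line and print whatever comes back.
ask() {
    gawk -v svc="$(client_socket "$1" "$2")" -v msg="$3" 'BEGIN {
        print msg |& svc                 # |& is the two-way I/O operator
        if ((svc |& getline reply) > 0)
            print reply
        close(svc)
    }'
}
```

To try it, start "serve_once 50000 &" in one shell and then run "ask localhost 50000 hello"; the client prints whatever line the server sends back. The port number is arbitrary.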

Client/Server Utilities

The first client/server set implements a file transfer capability (FTC). The second set implements a remote execution capability (RXC). The second capability rests on the shoulders of some straightforward bash scripting as well as on gawk. But first, an important note on network security: there is none in these scripts. You should consider these utilities as research and carefully weigh the advisability of deploying scripts like these in a production environment.

Also, note the naming convention: server scripts begin with an "s", client scripts with a "c". For the file transfer pair, the server "gets" and the client "puts". Thus, the file transfer server is "sg.awk", and the file transfer client is "cp.awk". I have another pair of scripts in which the server puts and the client gets (sp.awk and cg.awk), but this pair is not presented here. In the case of the remote execution server/client pair, the names are "sx.awk" and "cx.awk", respectively.

File Transfer Capability

The file transfer capability (FTC) illustrated here implements the following core architecture. Early on, you might ask yourself, "How do remote clients and servers know on which port to open a socket?" "Well-known ports" and /etc/services provide one answer to that question. The FTC illustrated here is a standalone solution: it doesn't use /etc/services or any other port-synchronization technology. It proceeds on the assumption that any system that will execute the client cp.awk is already running the server sg.awk. The selection of a port is up to you. You pass the selected port to the server on the command line when you invoke it, and the server writes the port to /var/run/sg.port. The client cp.awk expects that file to exist at the time it is invoked.
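The port-file hand-off can be sketched in a few lines. This is only an illustration of the idea, not the code from Listing 1: the helper names write_port and read_port and the PORT_FILE variable are hypothetical, and only the path /var/run/sg.port comes from the description above.

```bash
# Hypothetical helpers; PORT_FILE defaults to the path named in the article.
PORT_FILE=${PORT_FILE:-/var/run/sg.port}

# Server side: record the control port chosen on the command line.
write_port() { printf '%s\n' "$1" > "$PORT_FILE"; }

# Client side: read the port back; fail if the file is missing
# or its first line is not a number.
read_port() {
    awk -v file="$PORT_FILE" 'BEGIN {
        if ((getline port < file) > 0 && port ~ /^[0-9]+$/) {
            print port
            exit 0
        }
        exit 1
    }'
}
```

A client script could then call something like port=$(read_port) || exit 1 before opening its socket, which makes the "file must already exist" requirement explicit.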

The client reads the file and opens a socket on that port. The design is a superserver/subserver architecture analogous to "inetd". sg.awk is the superserver and cp.awk is the superclient; sg2.awk and cp2.awk are the subserver and subclient. When sg.awk receives a request to store a file on its own box (the remote host), it spawns a copy of sg2.awk to actually service the request, and it transmits the port on which the transfer will be serviced back to cp.awk. cp.awk then spawns a copy of cp2.awk to transfer the file on that port. This process, or something like it, is necessary in order to service multiple simultaneous transfer requests.

If the superserver transfers the file itself, rather than delegating the transfer, then it must implement a facility to queue requests. Although queueing is certainly a feasible approach, it's not the one I used. Where do the port numbers on which the subserver and subclient perform the transfer come from? The superserver sg.awk uses a slice of ports named on the command line that invokes it. Here is a sample invocation string:

sg.awk 50000    50001    50001     50019 &
       ARGV[1]  ARGV[2]  ARGV[3]   ARGV[4]

This line says to communicate with the superclient cp.awk on port ARGV[1], 50000. Use ARGV[2], 50001, as the first port to pass to sg2.awk as the transfer port. As overlapping requests come in, keep incrementing that value and using it as the transfer port until it exceeds ARGV[4], 50019; then reset it to ARGV[3], 50001, and start over. This works because each time sg.awk finishes servicing a request, it spawns a new copy of itself before exiting, passing the next transfer port as ARGV[2]. As sg.awk services requests, you will see this in a series of ps listings:

sg.awk 50000 50001 50001 50019
sg.awk 50000 50002 50001 50019
sg.awk 50000 50003 50001 50019
...
sg.awk 50000 50019 50001 50019
sg.awk 50000 50001 50001 50019
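The wrap-around rule behind that ps output is small enough to sketch. The function name next_port is made up for this example, and the code is an illustration of the rule as described, not an excerpt from sg.awk.

```bash
# next_port CURRENT LOW HIGH
# Print the transfer port the next copy of the server should receive:
# CURRENT + 1, wrapping back to LOW once HIGH has been used.
next_port() {
    awk -v cur="$1" -v lo="$2" -v hi="$3" 'BEGIN {
        nxt = cur + 1
        if (nxt > hi)
            nxt = lo
        print nxt
    }'
}
```

With the sample invocation, "next_port 50001 50001 50019" yields 50002 and "next_port 50019 50001 50019" wraps to 50001, matching the first and last transitions in the listing.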

The FTC Client

Here is the usage synopsis for the FTC client:

cp.awk remotehost permissions ftype local-file [remote-file]

Here is an example invocation of the FTC client:

cp.awk corsair u+rw t /root/doc/linux.doc

This line says that /etc/hosts has an entry for a box named "corsair" and that the local host has a file named /root/doc/linux.doc. Because no remote-file argument is given, the file will retain its full pathname on the remote host. The file is text (ftype "t"). When copied, it should have the permissions "u+rw" applied to it.

It is necessary to indicate the file type, because a single "common" technique, such as the one illustrated in gawkinet.info, does not perform correctly for both text and binary files. Remember that awk began life as a text-processing language. It expects files to have records. For my FTC to operate correctly, I had to implement a different algorithm for binary and text files. I decided to flag the file type on the command line. If you get into networking with Gawk, you may skin the cat differently. Perhaps you will determine the file type transparently.
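The record-oriented behavior is easy to see with a local copy, no network required. The sketch below is not the algorithm used in Listing 1; copy_text and copy_exact are hypothetical names, and copy_exact shows just one gawk-specific way to preserve bytes, using the RT variable (the text that actually terminated each record).

```bash
# Text-style copy: one record per line. print appends a newline after
# every record, so a file whose last byte is not a newline grows by one byte.
copy_text() { awk '{ print }' "$1" > "$2"; }

# Byte-preserving copy (gawk only). RT holds the exact text that matched RS
# for the current record and is empty at end of file if no terminator was
# present, so printing $0 followed by RT reproduces the input.
copy_exact() {
    LC_ALL=C gawk 'BEGIN { RS = "\n"; ORS = "" } { print $0 RT }' "$1" > "$2"
}
```

Note that a binary file containing no newline bytes becomes one very large record in copy_exact, so this approach trades memory for simplicity.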

This FTC works reliably in its present form. I routinely use it to transfer binary ISO images that are about 700 MB in size, and I've transferred files close to 2 GB in size without problem. Of course, anything smaller is cake. Listing 1 shows the fully commented superserver/subserver, superclient/subclient quartet.

Remote Execution Capability

Next, I will illustrate a remote execution capability (RXC) in gawk. The RXC shown here transfers an executable file (script or binary) and zero or more support files from a local host to a remote host, where the executable is invoked by the remote RXC server, sx.awk. In addition to the underlying FTC quartet (sg/sg2/cp/cp2), this RXC requires the remote server sx.awk to be in place. It expects the following local scripts:

  • cx.awk
  • c2x.awk
  • netgz (bash)

c2x.awk invokes cx.awk. The naming convention is likely backwards here. netgz is a bash script that sits over the top of everything. It triggers the RXC.

While cx.awk implements a rather trivial RXC, c2x.awk recognizes that a non-trivial RXC may require a suite of files to be transferred. In the technique illustrated here, c2x.awk requires an initialization file. Here is one that I routinely use to transfer a tar'd directory to a remote host, where sx.awk invokes a script transferred earlier by an sg.awk executing on the same host:

WAIT 2
LOG /var/log/netgz.log
EXE s /cmn/scr/netgz.sh /var/run/netgz.sh
AUX b u+rw /cmn/tar/scr.tgz /cmn/scr.tgz

This file supports the following architecture: c2x.awk assumes that the binary or script that is transferred to the remote host may operate on one or more files. These are named by the AUX keyword, one to a line. On both the EXE and AUX lines in the example, the last two arguments are the local file to be transferred and the file to create on the remote host.
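As a rough sketch of how such a file might be read, the function below walks the four keywords in the sample. It is not c2x.awk itself: parse_ini is a hypothetical name, the output format is invented for the example, and the only assumption about EXE and AUX lines is the one visible in the sample, namely that the last two fields are the local and remote paths.

```bash
# parse_ini FILE
# Print one summary line per recognized keyword in a c2x-style ini file.
parse_ini() {
    awk '
    $1 == "WAIT" { print "wait " $2 }
    $1 == "LOG"  { print "log " $2 }
    # For EXE and AUX, take the last two fields as local and remote paths.
    $1 == "EXE" || $1 == "AUX" { print tolower($1) " " $(NF-1) " -> " $NF }
    ' "$1"
}
```

Run against the sample file, this reports the wait value, the log path, and one "local -> remote" pair each for the EXE and AUX lines. Lines with unrecognized keywords are silently skipped, which a real parser would probably want to flag.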

I use the bash script "netgz" (see Listing 2) to transfer tar'd directories to remote hosts, where they are untar'd. It creates on the fly both the script that executes on the remote host and the "ini" file used by c2x.awk. "bash.i" is a function database (not dumped here). It contains these functions used by netgz: argc, assert, gethost, and x. "argc" ensures that the minimum number of command-line arguments is present. "assert" ensures that a file or directory exists. "gethost" allows me to further abbreviate my local /etc/hosts declarations. "x" executes a command line, also dumping the command line to stdout.
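Since bash.i is not reproduced, here is a guess at what three of those helpers could look like, written only from the one-line descriptions above. The argument conventions, messages, and the choice to return a status rather than exit are all assumptions; gethost is left out because its behavior depends on a private /etc/hosts scheme.

```bash
# argc MIN "$@": succeed only if at least MIN arguments follow.
argc() {
    min=$1
    shift
    [ "$#" -ge "$min" ] || { echo "need at least $min argument(s)" >&2; return 1; }
}

# assert PATH: succeed only if PATH exists as a file or directory.
assert() {
    [ -e "$1" ] || { echo "missing: $1" >&2; return 1; }
}

# x CMD ARGS...: dump the command line to stdout, then execute it.
x() {
    echo "$*"
    "$@"
}
```

A script in the style of netgz might then begin with something like argc 2 "$@" || exit 1, followed by assert "$1" || exit 1, before building its ini file.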

I hope this series of scripts triggers your imagination. There are many different ways to skin the networking cat. GNU awk gives us yet another one. Gawk rawks.

Resources

The AWK Programming Language by Alfred V. Aho, Brian W. Kernighan, and Peter J. Weinberger. Addison-Wesley, 1988.

The man pages for Gawk 3.1.x

Gawkinet.info -- This has some very nice examples that inspired the utilities presented here. Some of the latest Linux distributions have begun shipping Gawk 3.1.x. If, for some reason, your distribution contains Gawk 3.1.x but not gawkinet.info, it's always available in the "doc" subdirectory of the source archive. (At the time of writing, gawk was at version 3.1.1.)

Effective awk Programming, 3rd Edition by Arnold Robbins. O'Reilly & Associates, 2001.

The Usenet newsgroup comp.lang.awk.

Mike Warner has been a software engineer since the Z80 and is currently an independent consultant.