TCP/IP
Networking in Gawk 3.1.0
Mike Warner
In 1997, Jurgen Kahrs and Arnold Robbins added TCP/IP networking
capability to "gawk", the Free Software Foundation's
implementation of the awk programming language. The networking subsystem
that Kahrs and Robbins added to gawk began as a set of patches that
eventually migrated into the main source tree in time for Gawk 3.1.0.
Just after Arnold Robbins announced the availability of Gawk 3.1.0
on comp.lang.awk, I began downloading the source archive and building
Gawk 3.1.0 on various flavors of Linux and BSD. It has always built
flawlessly, and the networking capability has worked just as advertised.
In my opinion, gawk's networking syntax is self-explanatory,
so I won't spend time describing it here. Gawk's nicely
abstracted networking subsystem makes gawk networking scripts extremely
compact. In this article, I will present two sets of client/server
utilities. For additional information, see the resources listed
at the end of this article.
Client/Server Utilities
The first client/server set implements a file transfer capability
(FTC). The second set implements a remote execution capability (RXC).
The second capability rests on the shoulders of some straightforward
bash scripting as well as on gawk. But first, an important note
on network security: there is none in these scripts. You should
consider these utilities as research and carefully weigh the advisability
of deploying scripts like these in a production environment.
Also, note the naming convention: server scripts begin with an
"s," client scripts with a "c". For the file
transfer server, the server "gets" and the client "puts".
Thus, the file transfer server is "sg.awk", and the file
transfer client is "cp.awk". I have another pair of servers
in which the server puts and the client gets (sp.awk and cg.awk),
but this pair is not presented here. In the case of the remote execution
server/client pair, the names are "sx.awk" and "cx.awk,"
respectively.
File Transfer Capability
The file transfer capability (FTC) illustrated here implements
the following core architecture. Early on, you might ask yourself,
"How do remote clients and servers know on which port to erect
a socket?" "Well-known ports" and /etc/services provide
one answer to that question. The FTC illustrated here is a standalone
solution. It doesn't use /etc/services or any other port-synchronization
technology. It proceeds on the assumption that any system that will
execute the client cg.awk is already running the server sg.awk.
The selection of a port is up to you. You pass the selected port
to the server on the command line when you invoke it. The server
writes the port to /var/run/sg.port. The client cp.awk expects that
file to exist at the time it is invoked.
The client reads the file and erects a socket on that port. The
architecture illustrated here implements a superserver/subserver
architecture analogous to "inetd". sg.awk and cp.awk are
the superservers. sg2.awk and cp2.awk are the subservers. When sg.awk
receives a request to store a file on sg.awk's box (the remote
host), it spawns a copy of sg2.awk to actually service the request.
sg.awk transmits the port on which to service the transfer back
to cp.awk. cp.awk then spawns a copy of cp2.awk to transfer the
file on the port passed back to it by sg.awk. This process, or something
like it, is necessary in order to service multiple simultaneous
transfer requests.
If the superserver transfers the file itself, rather than delegating
the transfer, then it must implement a facility to queue requests.
Although queueing is certainly a feasible approach, it's not
the one I used. Where do the port numbers on which the subserver
and subclient perform the transfer come from? The superserver sg.awk
uses a slice of ports named on the command line that invokes it.
Here is a sample invocation string:
sg.awk 50000 50001 50001 50019 &
ARGV[1] ARGV[2] ARGV[3] ARGV[4]
This line says to communicate with the superclient cp.awk on port
ARGV[1], 50000. Use ARGV[2], 50001, as the first port to pass to sg2.awk
as the transfer port. As overlapping requests come in, continue to
increment ARGV[2] and use it as the transfer port until the incremented
value is greater than ARGV[4]. When the incremented value initialized
at ARGV[2] is greater than ARGV[4], set it to ARGV[3] and start over.
This works because each time sg.awk finishes servicing the request,
it spawns a new copy of itself before exiting, passing the last transfer
port as ARGV[2]. As sg.awk services requests, you will see this in
a series of ps's:
sg.awk 50000 50001 50001 50019
sg.awk 50000 50002 50001 50019
sg.awk 50000 50003 50001 50019
...
sg.awk 50000 500019 50001 50019
sg.awk 50000 50001 50001 50019
The FTC Client
Here is an invocation "man" for the FTC client:
cp.awk remotehost permissions ftype local-file [remote-file]
Here is an example invocation of the FTC client:
cp.awk corsair u+rw t /root/doc/linux.doc
This line says there is an IP abbreviation in /etc/hosts for a box
named "corsair". There is a file on the local host /root/doc/linux.doc.
It will retain its full pathname on the remote host. The file is text.
When copied, it should have the permissions "u+rw" applied
to it.
It is necessary to indicate the file type, because a "common"
technique like the technique illustrated in gawkinet.info does not
perform correctly for both text and binary files. Remember that
awk began life as a text-processing language. It expects files to
have records. For my FTC to operate correctly, I had to implement
a different algorithm for binary and text files. I decided to flag
the file type on the command line. If you get into networking with
Gawk, you may skin the cat differently. Perhaps you will determine
the file type transparently.
This FTC works durably in its present form. I routinely use this
FTC to transfer binary ISO images that are about 700 MB in size.
I've transferred files close to 2GB in size without problem.
Of course, anything smaller is cake. Listing 1 shows the fully internally
documented superserver/subserver, superclient/subclient quartet.
Remote Execution Capability
Next, I will illustrate a remote execution capability (RXC) in
gawk. The RXC shown here transfers an executable file (script or
binary) and zero or more support files from a local host to a remote
host where the executable is invoked by the remote RXC server, sx.awk.
In addition to the underlying FTC quartet (sg/sg2/cp/cp2), this
RXC requiries the remote server sx.awk to be in place. It expects
the following local scripts:
- cx.awk
- c2x.awk
- netgz (bash)
c2x.awk invokes cx.awk. The naming convention is likely backwards
here. netgz is a bash script that sits over the top of everything.
It triggers the RXC.
While cx.awk implements a rather trivial RXC, c2x realizes that
a non-trivial RXC may require a suite of files to be transferred.
In the technique illustrated here, c2x requires an initialization
file. Here is one that I routinely use to transfer a tar'd
directory to a remote host where sx.awk invokes a script transferred
earlier by an sg.awk executing on the same host:
WAIT 2
LOG /var/log/netgz.log
EXE s /cmn/scr/netgz.sh /var/run/netgz.sh
AUX b u+rw /cmn/tar/scr.tgz /cmn/scr.tgz
This file supports the following architecture: C2x assumes that the
binary or script that is transferred to the remote host may operate
on one or more files. These are named by the AUX keyword, one to a
line. The first argument to a keyword is the local file to be transferred.
The second argument is the file to create on the remote host.
I use the bash script "netgz" (see Listing 2) to transfer
tar'd directories to remote hosts where they are untar'd.
It creates on the fly both the script that executes on the remote
host and the "ini" file used by c2x.awk. "bash.i"
is a function database (not dumped here). bash.i contains these
functions used by netgz: [argc,assert,gethost,x]. "argc"
ensures the minimum command-line arguments. "assert" ensures
a file/directory exists. "gethost" allows me to further
abbreviate my local /etc/hosts declarations. "x" executes
a command line, also dumping the command line to stdout.
I hope this series of scripts triggers your imagination. There
are many different ways to skin the networking cat. GNU awk gives
us yet another one. Gawk rawks.
Resources
The AWK Programming Language by Alfred V. Aho, Brian W.
Kernighan, and Peter J. Weinberger. Addison-Wesley, 1988.
The man pages for Gawk 3.1.x
Gawkinet.info -- This has some very nice examples that triggered
the utilities presented here. Gawk 3.1.x has begun shipping with
some of the latest Linux distributions. If, for some reason, your
distribution contains Gawk 3.1.x, but not gawkinet.info, it's
always available in the "doc" subdirectory of the source
archive. (At the time of writing, gawk was at version 3.1.1.)
Effective awk Programming, 3rd Edition by Arnold Robbins
and Michael Brennan. O'Reilly & Associates, 2001.
The Usenet newsgroup comp.lang.awk.
Mike Warner has been a software engineer since the Z80 and
is currently an independent consultant.
|