Optimizing Time to Completion with runmany
Bill Davidsen
If you are using a computer with multiple CPUs, lots of memory,
and other resources (like bandwidth), and you need to do many small,
similar tasks, it makes sense to run several of them at once. However,
trying to do too much at once can really bog down your system. runmany
is a tool I developed to help fit the load to the available resources
and complete jobs faster.
runmany is a Perl program that accepts lines of input from stdin,
substitutes each line read into a command-line template, and runs
some limited number of the resulting commands in parallel. Typically,
there are far more commands than the process limit, so runmany keeps
the load constant by starting another process as soon as one finishes,
rather than starting a fixed number of processes and waiting for all
of them to finish before starting more. See Listing 1. (All source
code for this article can be downloaded from:
http://www.sysadminmag.com/code/.)
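The heart of the program is a simple fork/wait loop. Here is a
minimal sketch of the idea, assuming the same argument order as the
examples below (command template, then process count); it is not
Listing 1 itself, and error handling and options are omitted:

    #!/usr/bin/perl
    # Minimal sketch of the runmany idea (not Listing 1):
    # read lines from stdin, substitute each into the command
    # template, and keep at most $limit commands running at once.
    use strict;

    my ($template, $limit) = @ARGV;
    my $running = 0;

    while (my $line = <STDIN>) {
        chomp $line;
        (my $cmd = $template) =~ s/%s/$line/g;  # fill in the template
        if ($running >= $limit) {
            wait;                 # at the limit: reap one child first
            $running--;
        }
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {          # child: hand the command to a shell
            exec '/bin/sh', '-c', $cmd;
            exit 1;               # only reached if exec fails
        }
        $running++;
    }
    while ($running-- > 0) {      # reap the stragglers
        wait;
    }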
Example
I originally wrote runmany to help me do some odd feeding of Usenet
news articles that didn't fit the normal peer-to-peer model.
For this example, assume I have identified a large number of articles
that need to be fed (or fed again) to a remote site. If I feed them
one at a time, the total time will be the sum of all the transfer
times. If I have 100,000 articles, I don't want to start a process and
open a socket for each one, so I run a limited number of sockets
(experience tells me about six) and process a reasonable number of
articles with each socket.
Assuming that I want batches of 500 articles, that I will run six
streams at a time, and that I have all the article information in one
big file, the commands look like this:
split -500 bigfile sfq.
ls sfq.* | runmany "innxmit -a -v server2 $PWD/%s" 6
The "%s" in the runmany argument is replaced with
the value of a single line read from stdin -- in this case the
filename of a small list of articles. I first break the input into
files of 500 items each (using split), then use ls to
generate a list of the small files, which I pipe into runmany.
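For example, if the first filename read is sfq.aa and the batches
live in /news/out (a made-up path), the first command started is:

innxmit -a -v server2 /news/out/sfq.aa

Note that $PWD is expanded by the shell before runmany ever sees the
template, so each small file is passed to innxmit by absolute path.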
The time to complete each of the individual small files depends
on article size, bandwidth, and the server. The first processes
are all started at once, but their behavior quickly becomes
unsynchronized. On some connections, large articles will be slow to
transfer due to limited bandwidth, and some servers will be slow to
accept many small articles due to the database overhead of adding the
data. This way, I keep a reasonable load on the machine regardless of
the performance of the individual connections.
Using runmany with Graphics
I sometimes have a large number of graphic images and want to
perform a common operation on each one (e.g., making a scaled copy).
If I were doing this on a system with several CPUs, I would want
multiple copies running to finish the job in minimum time. I use the
netpbm package for image manipulation; I convert each image to a
standard format, scale it to fit in 640x480, and then save it as an
optimized JPEG file with "_STD" appended to the name.
Here is one way to do this:
find images -name '*.jpg' |
runmany "djpeg %s | pnmscale -xys 640 480 | cjpeg -o >%s_STD" 2
Depending on your hardware and operating system, the best number of
processes is usually within one of the number of CPUs you want to
use. By contrast, this shows what someone actually did to address a
problem like this:
find . -name '*.jpg' |
while read filename; do
    djpeg $filename | pnmscale -xys 640 480 | cjpeg -o >${filename}_STD &
done; wait
Other than showing that AIX will stay up with a load average of 400+,
it was definitely not the way to get the job done in minimum
time. In practice, the number of jobs should be close to the number
of CPUs you want to (or are allowed to) use, limited to the number
of jobs that will fit comfortably in physical memory, or to the number
of processes that will use the available bandwidth of the network or
disk. Don't forget disk! The actual clock time using runmany with two
processes (shown previously) was 61% of the time required running one
at a time. For 171 images, this was 36:31 with runmany vs. 59:51
serially, a savings of more than 23 minutes.
Other Uses
While I wrote runmany to process fixed lists of things to do,
I have since used it at the end of a pipe to process lists
generated under program control. Because the command strings are
arbitrary, it is more flexible than xargs, and many versions
of xargs will only run one process at a time.
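As a hypothetical example of that kind of use (the path and age are
invented), find can generate the names of old log files while runmany
compresses them four at a time:

find /var/log/news -name '*.log' -mtime +7 | runmany "gzip -9 %s" 4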
Final Thoughts
runmany is primarily a tool for "one-time" problems.
It is easy to use and understand and can greatly improve performance
relative to the effort required to use it. It is not intended to
be an optimal solution to any one problem, but rather an option
to solve many problems. Perhaps the next time you have a bunch of
small, identical tasks to perform, it will be as useful to you as
it is to me.
Bill Davidsen has been doing systems programming and administration
since 1968, and was one of the founders of TMR Associates in 1979.
In addition to being a "part-time CTO" at TMR, he works
as a project leader with a national ISP and writes an Internet column.