Optimizing Time to Completion with runmany
Bill Davidsen
If you are using a computer with multiple CPUs, lots of memory,
and other resources (like bandwidth), and you need to do many small,
similar tasks, it makes sense to run several of them at once. However,
trying to do too much at once can really bog down your system. runmany
is a tool I developed to help fit the load to the available resources
and complete jobs faster.
runmany is a Perl program that accepts lines of input from stdin,
substitutes each line read into a command-line template, and runs
some limited number of the resulting commands in parallel. Typically,
there are far more commands than the process limit, so runmany keeps
the load constant by starting another process as soon as one finishes,
rather than starting a fixed number of processes and waiting for all
of them to finish before starting more. See Listing 1. (All source
code for this article can be downloaded from:
http://www.sysadminmag.com/code/.)
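The heart of the program is a simple fork/wait loop. Here is a
minimal sketch of the idea, assuming the same argument order as the
examples below (command template, then process count); it is not
Listing 1 itself, and error handling and options are omitted:

    #!/usr/bin/perl
    # Minimal sketch of the runmany idea (not Listing 1):
    # read lines from stdin, substitute each into the command
    # template, and keep at most $limit commands running at once.
    use strict;

    my ($template, $limit) = @ARGV;
    my $running = 0;

    while (my $line = <STDIN>) {
        chomp $line;
        (my $cmd = $template) =~ s/%s/$line/g;  # fill in the template
        if ($running >= $limit) {
            wait;                 # at the limit: reap one child first
            $running--;
        }
        my $pid = fork;
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {          # child: hand the command to a shell
            exec '/bin/sh', '-c', $cmd;
            exit 1;               # only reached if exec fails
        }
        $running++;
    }
    while ($running-- > 0) {      # reap the stragglers
        wait;
    }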
Example
I originally wrote runmany to help me do some odd feeding of Usenet
news articles that didn't fit the normal peer-to-peer model.
For this example, assume I have identified a large number of articles
that need to be fed (or fed again) to a remote site. If I feed them
one at a time, the total time will be the sum of all the transfer
times. If I have 100,000 articles, I don't want to start a process and
open a socket for each one, so I run a limited number of sockets
(experience tells me about six) and process a reasonable number of
articles with each socket.
Assuming that I want batches of 500 articles, that I will run six
streams at a time, and that I have all the article information in one
big file, the commands look like this:
split -500 bigfile sfq.
ls sfq.* | runmany "innxmit -a -v server2 $PWD/%s" 6
The "%s" in the runmany argument is replaced with
the value of a single line read from stdin -- in this case the
filename of a small list of articles. I first break the input into
files of 500 items each (using split), then use ls to
generate a list of the small files, which I pipe into runmany.
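For example, if the first filename read is sfq.aa and the batches
live in /news/out (a made-up path), the first command started is:

innxmit -a -v server2 /news/out/sfq.aa

Note that $PWD is expanded by the shell before runmany ever sees the
template, so each small file is passed to innxmit by absolute path.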
The time to complete each of the individual small files depends
on article size, bandwidth, and the server. The first processes
are all started at once, but their behavior quickly becomes
unsynchronized. On some connections, large articles will be slow to
transfer due to limited bandwidth, and some servers will be slow to
accept many small articles due to the database overhead of adding the
data. This way, I keep a reasonable load on the machine regardless of
the performance of the individual connections.
Using runmany with Graphics
I sometimes have a large number of graphic images and want to
perform a common operation on each one (e.g., making a scaled copy).
If I were doing this on a system with several CPUs, I would want
multiple copies running to finish the job in minimum time. I use the
netpbm package for image manipulation; I convert each image to a
standard format, scale it to fit in 640x480, and then save it as an
optimized JPEG file with "_STD" appended to the name.
Here is one way to do this:
find images -name '*.jpg' |
runmany "djpeg %s | pnmscale -xys 640 480 | cjpeg -o >%s_STD" 2
Depending on your hardware and operating system, the best number of
processes is usually within one of the number of CPUs you want to
use. By contrast, this shows what someone actually did to address a
problem like this:
find . -name '*.jpg' |
while read filename; do
    djpeg $filename | pnmscale -xys 640 480 | cjpeg -o >${filename}_STD &
done; wait
Other than showing that AIX will stay up with a load average of 400+,
it was definitely not the way to get the job done in minimum
time. In practice, the number of jobs should be close to the number
of CPUs you want to (or are allowed to) use, limited to the number
of jobs that will fit comfortably in physical memory, or to the number
of processes that will use the available bandwidth of the network or
disk. Don't forget disk! The actual clock time using runmany with two
processes (shown previously) was 61% of the time required running one
at a time. For 171 images, this was 36:31 with runmany vs. 59:51
serially, a savings of more than 23 minutes.
Other Uses
While I wrote runmany to process fixed lists of things to do,
I have since used it at the end of a pipe to process lists
generated under program control. Because the command strings are
arbitrary, it is more flexible than xargs, and many versions
of xargs will only run one process at a time.
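As a hypothetical example of that kind of use (the path and age are
invented), find can generate the names of old log files while runmany
compresses them four at a time:

find /var/log/news -name '*.log' -mtime +7 | runmany "gzip -9 %s" 4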
Final Thoughts
runmany is primarily a tool for "one-time" problems.
It is easy to use and understand and can greatly improve performance
relative to the effort required to use it. It is not intended to
be an optimal solution to any one problem, but rather an option
to solve many problems. Perhaps the next time you have a bunch of
small, identical tasks to perform, it will be as useful to you as
it is to me.
Bill Davidsen has been doing systems programming and administration
since 1968, and was one of the founders of TMR Associates in 1979.
In addition to being a "part-time CTO" at TMR, he works
as a project leader with a national ISP and writes an Internet column.