Article

A Community-Style Overnight Job Spooler

Leor Zolman

For a small business running a single multi-user UNIX system, processes typically fall into one of two categories: real-time, interactive programs or batch style/background jobs. Interactive programs such as the system shell, editors/word processors, spreadsheets, and data entry systems all vie concurrently for slices of the CPU pie. Such programs spend most of their real time blocked waiting for user input, so they tend not to have much impact on system performance (as long as there is enough main memory to keep the jobs from getting swapped out to disk).

Batch-style jobs such as reports, backup scripts, or any CPU- or disk-intensive processes, on the other hand, have a relatively large impact on system performance. Such jobs demand as much of the available CPU resources as they can possibly get. It doesn't take many such CPU- or disk-intensive background jobs running simultaneously to slow user terminal response time down to a crawl.

In many cases, those "expensive" batch-style jobs would be less of a pain-in-the-CPU if they could be scheduled so as not to compete head-to-head with the interactive processes for system resources. In a business environment, the natural solution would be to run those jobs overnight whenever feasible, and reserve the business hours for interactive processes and high-priority batch jobs only.

Users should routinely be given the option of running batch-style jobs overnight. For jobs that must run immediately, background execution should also be an option. However, with a little bit of encouragement (and after enough instances of the molasses-syndrome due to an overloaded system), users will understand that overnight queueing works out best for everyone.

Pros and Crons

The basic UNIX System V configuration includes only a rudimentary set of job scheduling tools. The primary facilities, cron and at, allow the scheduling of jobs for execution at particular dates and times, but have no provision for prioritizing or sequencing those jobs in order to maximize system performance. Three users might, unbeknownst to each other, all schedulejobs for execution at 8:30 P.M. cron will dutifully start them all up at 8:30 P.M., resulting in some serious context-switching overhead while the jobs vie for system resources.

Now consider the case of automated daily backups. You could just set up the cron table to run the backup software every morning at 5:00 A.M., but what happens if those three long batch jobs are still slugging it out at 5:00 A.M., and changing critical data in the process? cron doesn't care, it just runs the backup; if the backup utility cannot properly coordinate file-locking issues in the course of a backup, the result may be lost data.

A better solution would be for all jobs scheduled for overnight processing to be registered with a single overseeing system, and for that system to be responsible for running the jobs in an orderly, non-interfering manner. The simplest way to implement this "ordering" is to ensure that all jobs are scheduled sequentially, such that each job is run to completion with as little competition from other jobs as possible -- especially other resource-intensive jobs.

With the addition of a prioritizing scheme, critical job-sequencing issues can also be properly managed. Then, for example, the daily backup script can be configured at the lowest possible priority, so that it runs only after all other jobs have been completed.

In this article, I describe a set of Bourne Shell scripts that work together to provide a sequential overnight job-spooling facility. The package is geared towards a "community-style" computing environment -- that is, an environment that allows any user to invoke a particular overnight job and that prints out or places the output resulting from the job in a public destination area on-line, so that any other user may choose to view or print out the results, as required by the specific application.

Any stdout/stderr output not explicitly directed into an output file by an overnight job will be captured into a default location, generally accessible only by a system administrator. This feature may be used as a simple status- and error-logging mechanism.

Directory Structure

The onitesetup.sh script (Listing 1) may be used to set up the directory structure and appropriate permission settings for the basic onite system. I've chosen a master directory location of /usr/spool/onite for the example implementation; another location may be more appropriate for your site. In those scripts where applicable, the SPOOLDIR configuration variable identifies the master onite directory.

Several subdirectories exist immediately beneath the master directory. The subdirectory jobs itself contains another tier of subdirectories corresponding to the various job priority levels. The system may be configured for any number of priority levels; when there are n levels of priority, the subdirectories are named P1 through Pn.

In scripts where applicable, the NPRIORITIES variable defines the number of priority levels implemented.

The subdirectory stdout receives the intermixed, non-directed ("bit bucket") output of both the stdout and stderr streams for the last NTOLEAVE jobs that have been run through the spooler. The value of NTOLEAVE is configured in the master driver script, onitego.sh.

The subdirectory jobsdone receives the "used" job scripts for the last NTOLEAVE completed jobs. The contents of this directory, along with the contents of stdout, as previously noted, exist primarily to support post-mortem analysis by the system administrator.

The onitego.sh script emits a log of all overnight spooler activity on its standard output and error streams. I've arbitrarily configured the log file to record this output in /usr/spool/onite/onite.log. The log file is created with the proper permissions by the installation script, setuponite.sh, but no other scripts explicitly write to the log file. With the following line in the "root" cron table,

0 20,23 * * *
/usr/local/onitego.sh
>>/usr/spool/onite/onite.log 2&1

the output of the master driver script is appended onto the end of the log file every time the master driver script executes.

A brief description of each individual script and auxiliary tool in the onite package follows.

The Configuration Script

onitesetup.sh (Listing 1) initializes the directory structure for your custom implementation of the onite system. Configure lines 15-18 for your system; line 14, the debug flag, may be used to create a "dummy" hierarchy in the current directory for testing purposes. To test the onite system using this dummy directory, copy all the scripts into your testing directory and change the initialization of debug to Y in all scripts where debug appears. This is especially useful once the system has been officially installed and you wish to test some new modifications without corrupting the currently active code and job queue directories.

The Master Driver Script

onitego.sh (Listing 2) invoked from the cron table, as shown above, "wakes up" to execute all spooled overnight job scripts in sequence. It scans all the $SPOOLDIR/jobs/P* directories in order, beginning with P1, looking for job files and submits each job file encountered to the shell for processing.

The standard output and standard error from each job is written to a file in the $SPOOLDIR/stdout directory with the same name as the job file. All program output from the job script should take the form of explicit output files or physical output. Any output emitted through the stdout and stderr streams should be considered for the system administrator's eyes only.

After the job has finished executing, the job file itself is moved to the $SPOOLDIR/jobsdone directory.

The standard output of the onitego.sh script provides a running log of job activity. If no jobs at all were queued for overnight processing, then a message to that effect is emitted. Otherwise, the script creates a lock file that exists for the duration of all job processing, and, for each job, writes a message announcing the name of that job and the time it begins its run.

When all jobs have been processed, the fleave.sh utility script is called to delete all files in the jobsdone and stdout directories except for those corresponding to the most recent $NTOLEAVE jobs. This keeps those directories from filling up with too much junk.

There are some basic limitations to the design of the onite system. The primary hazard is the case where a user is permitted to queue a job after the driver script has already begun execution for the evening. If the job is queued at a priority level equal to or greater than the priority level currently being processed, then the job may not be run until the next night. I've partially addressed this issue by scheduling the driver script for two runs per night, so that a job missed during the "first round" is picked up for execution in the "second round." This approach, however, assumes that all jobs from the first round are completed before the scheduled time for the second round comes up; if the earlier instance of the driver script is still running when the later instance "wakes up," the later instance will see the lock file, immediately abort, and go back to sleep. Also, a high-priority job that ends up running in the second round will effectively have been bumped down to the lowest possible priority, since all jobs from the first round will by then have already completed. In other words, if the priorities are really critical, then don't schedule the master driver script for more than one run per night.

The best way to prevent these kinds of conflicts is to make sure no jobs are queued past the time when the first instance of onitego.sh wakes up (see the discussion of spoolonite.sh below for some built-in protective measures).

"Run Driver NOW" Script

From time to time, you might discover that onitego.sh has not executed as normally scheduled. For instance, someone may have inadvertently broken the root cron table entry while doing administrative maintenance, or perhaps the system had experienced a crash before spooler startup time and hadn't been brought back up until after the startup time, so cron never had a chance to start the process. onitenow.sh (Listing 3) is designed for one-shot invocation by the system administrator in just such an event. The script simply starts up the master driver immediately as a background task immune to hang-up, and sends the output into the appropriate log file.

The Job Queuing Script

The last of the major scripts in this package, spoolonite.sh (Listing 4), schedules an overnight job for execution. spoolonite.sh is typically run from within a shell script, accepting the text of the job to be spooled on its standard input stream. There is only one mandatory command line parameter, the job name, and one optional parameter, the job priority level. If no priority level is specified, then the job is assigned a priority of $DEFAULT_PRIORITY as defined in the script.

The two variables USE_CUTOFF and CUTOFF_TIME may be configured to reject job submissions past a particular time of day. If USE_CUTOFF is Y, then any attempt to queue a job after the clock time specified by CUTOFF_TIME will be rejected (lines 40-47).

The variable CHECK_LOCK may be configured to reject job submissions once the nightly queue has begun executing; this, in conjunction with the USE_CUTOFF mechanism, effectively eliminates the possibility of "orphaned" jobs in the queue after the master driver script has completed its run (lines 49-57).

Since the contents of the stdout and jobsone directories are not broken down by priority level, only one instance of any specific job name is allowed per night (lines 55-65). It is left up to the system administrator, using the tools provided in this package (such as oname.sh), to construct unique names for all job scripts.

Environmental Issues

Since the master driver script is invoked from root's cron table, all jobs are actually run under the root's user-ID and environment, not under the user-ID and environment of the invoking user. Thus, spoolonite.sh must see to it that the original user's environment is replicated as faithfully as possible at the time his/her overnight job script is run.

Line 79 begins to construct the job file by dumping the entire contents of the user's environment settings into it. Line 78 prevents a nasty problem in the case where the user's PS1 (primary prompt string) variable was exported and happens to contain a multiline string. If PS1 were not redefined in this case to isolate the embedded newline within a set of quote marks, then the shell would become confused by the multiline string when the time came to interpret the job script. If there are any other variables in your user's environments that could conceivably be set to multiline string values and then exported, those variables must be redefined in a similar manner before line 79 executes.

If any programs invoked from a user's job script need access to any variables in the user's environment, then those environment variables must be exported by the job script. The design of this package assumes that "unsophisticated" users will not be creating their own custom environment variables and spooling jobs for overnight execution that depend on those variables. Sophisticated users can include the commands to define and export such variables, if necessary, on their own when preparing their scripts.

When the list of common critical environment variables is known, however, then that list may be specified as the value of toexport (line 29). For our installation, this list includes the PATH, two variables relating to database configuration, and two that affect printer output routing. I know these variables are defined in every user's startup profile, because I maintain those profiles.

In line 83, spoolonite.sh generates a cd statement that sets the current directory for job execution to the user's actual current directory. Finally, the explicit job script text is copied from the standard input onto the end of the job file.

Displaying the List

showonite.sh (Listing 5) summarizes all jobs queued for overnight processing, showing the job name, name of the invoking user, and priority level. The contents of each priority directory are displayed by piping the output of the l command to awk for formatting.

Cancelling a Job

A user may change his/her mind about an overnight job, and need to cancel it. killonite.sh (Listing 6) performs that duty. It may be configured to restrict users to killing only their own jobs, or to allow users to kill anyone's queued jobs, depending upon the value of the OwnOnly variable (line 9).

This script uses the utility script lpick.sh, described below, to let the user pick a job "by number".

Looking for a Particular Job

It may not make any sense for certain kinds of jobs -- for example, a process that checks a mailing list for illegal addresses before a monthly mailing -- to be run more than once per night. If someone requests such a job for the second time in a single day, it can only be because they didn't realize someone else had already scheduled it. isonite.sh (Listing 7) helps the system administrator detect such duplications. Given a job name as the command-line parameter, it returns a true status if a job by that name has already been scheduled.

Generating a Unique Name

When it makes sense for a certain type of job to be scheduled for multiple runs in one evening, each instance of that job must still be given a unique job name. The oname.sh script (Listing 8) is a simple inline tool for generation of unique file names; it uses the tmpname.c program described below to generate a file name in the system /tmp directory, then chops off the /tmp/ prefix to return just the base file name on the standard output. For example, to generate a unique job name for an instance of a report identified as ren, I might use:

jobname=`oname.sh ren`

General Utility Programs and Scripts

All the scripts described above were written specifically for the Overnight Spooler system. The short scripts and C programs described in this section are general-purpose tools used by many of our shell scripts, including the onite system.

checknum.c (Listing 9)

This C program examines its first command-line parameter, converts the leading portion of it into a number value, and returns that ASCII number alone on the standard output. If the parameter contains no leading numeric component, the string ERROR is returned instead and the script terminates with an error status of 1. checknum is used by spoolonite.sh and onitego.sh.

tmpname.c (Listing 10)

tmpname.c simply extends the functionality of the tempnam() C library function to create a tool available for use directly in a shell script. For example, the following command creates a unique file name in /tmp that begins with the characters "abc":

filename=`tmpname abc`

pick.sh (Listing 11)

Given a text file containing a list of items to select from and a generic description of the flavor of item being chosen, this script describes, sequentially numbers, and displays the list, then waits for the user to select one of the items according to the displayed sequence numbers. The user may either enter a sequence number to make a selection, or press the return key alone to indicate "none."

If the user makes a selection, lpick.sh returns the text of the selected item on the standard output; else, the text ABORT is returned. killonite.sh uses lpick.sh for prompting the user to select a job to cancel.

fleave.sh (Listing 12)

onitego.sh calls this utility script to clean out old files in the jobsdone and stdout subdirectories.

ask.sh (Listing 13)

This little script prompts the user with a given text string, insists upon a y/n response, and returns Y or N accordingly on the standard output.

A Report Queuing Example

Listing 14 shows an example script that spools a user-requested report program as an overnight job. This script, invoked from a menu system in our case, prompts the user for a publication code (using the getmag shell tool) and proceeds to set up a job that runs a set of mailing address consistency checks for the specified publication. Some other internal shell tools, such as magname and nissue, appear in the script, but their use is related to the specific application and not to the spooler system in general.

The job text is first written to a temporary file, then the temporary file is fed to spoolonite.sh in line 49. After return from spoolonite.sh, the temporary file is deleted.

A Periodic Job Spooling Example

Earlier I mentioned the problem of backup scheduling conflicts. By spooling the backup routine as the lowest-priority overnight job, all potential concurrency issues can be avoided, and it is guaranteed that the backup program doesn't run until after all other processes have completed their tasks.

Say you have a backup driver script named dump.sh that performs the physical backup operations, and you're currently calling it directly from the cron table at some fixed hour of the night. To convert this task into a spooled overnight job, create a special driver to spool the dump.sh script as an overnight job. Such a driver, named spooldumps.sh, is shown in Listing 15.

Then, in your cron table, simply change the line that used to call dump.sh to call spooldumps.sh instead, some time before the nightly onitego.sh run is scheduled to begin. For example, here is the root cron table entry from our system:

30 18 * * 1-5 /usr/local/spooldumps.sh

This causes the spooldumps.sh script to execute every evening at 6:30 P.M. (our onitego.sh is scheduled to start up at 8:00 P.M.). spooldumps.sh schedules the dump.sh process (which resides in the /u3/Backup directory) at priority 7, the lowest priority. Thus, the dump.sh script is the last program to execute every night.

About the Author

Leor Zolman wrote BDS C, the first C compiler targeted exclusively for personal computers. He is currently a system administrator and software developer for R&D Publications, Inc., and columnist for both The C Users Journal and Windows/DOS Developer's Journal. Leor's first book, Illustrated C, has just been published by R&D. He may be reached in care of R&D Publications, Inc., or via net E-mail as [email protected] ("...!uunet!bdsoft!rdpub!leor").