Beowulf Batch Processors and Job Schedulers
Edward L. Haletky and Patrick Lampert
Beowulf and grid technology provide an attractive mechanism to build
powerful compute clusters with inexpensive off-the-shelf components.
Yet this technology also introduces new and complex scheduling problems.
How do you distribute your work over the cluster? Can you schedule
your work for times of lower activity? These issues are addressed
by a variety of job-scheduling and load-balancing software tools that
we will examine in this article. We review seven different queuing
engines that can manage your resources, schedule jobs, and even interlock
runs based on execution dependencies. These seven systems range from
the simplest cron-related tools to grid engines.
We will examine ease of installation, configuration, creation of a single
queue, the steps required to submit an extremely simple job, and the steps
to dispatch jobs based on time of day without using cron to enable and
disable queues. Furthermore, we will comment on the availability of
multi-node capability, support, and security, then present a simple chart
to assist you in picking your job-scheduling/queuing software.
The systems discussed are: at(1)/bbq(1), Clusterware/Load Sharing
Facility (LSF), Condor, Generic Network Queuing System (GNQS), GNU
Queue, Open Portable Batch System (OpenPBS), and the Sun Grid Engine
(SGE). Although two of these systems are primarily grid engines
(Condor and SGE), they all provide a way to queue up jobs for execution
as the resources allow.
Each system was installed on a Scyld Beowulf cluster with six slave nodes
built from off-the-shelf spare parts. The queuing agents were installed on
the master node, leaving the six slave nodes as computational nodes. To
test that the queues were running, we used the simple uptime script shown
below, which let us verify that everything ran as required by viewing the
various output and log files:
#!/bin/sh
date
bpsh -p -a /usr/bin/uptime | /usr/bin/sort -n
The date command in the script lets us verify when job submission
occurred. Because each queuing agent is different, this research began as
a learning exercise to determine how the agents differ and which is best
suited to each type of job. We discovered that, although these systems
differ, each provides a mechanism to submit a job, get the status of a
job, and kill a job. However, the means of configuring a queue, the
installation procedures, and the names of the commands are remarkably
inconsistent. We also found
that the default Scyld installation had to be modified to allow the
queuing agents to be installed on all nodes of the cluster. Any comments
on the ability to run queuing agents on each node of the cluster are
purely from the documentation provided; we were not able to readily
verify this functionality.
At(1)/bbq(1)
At(1) has a dual distinction: not only is it the oldest queuing agent, it
is also the one available on the most operating systems. It follows the
"keep it simple" philosophy of older Unix programs and, because of that,
spans hardware platforms and generations. At(1) manages time very easily:
you can specify when a job is to be run in myriad ways, and jobs can be
queued into named queues designated by a single letter. By default there
is a "do it now" queue and a batch queue. The batch command will process
jobs as system load permits. While in a cluster this means the load of the
master node, it is still very useful if you use your master node as the
only interactive node and mainly for job scheduling. At(1) cannot be used
across nodes to manage resources.
Installation
At(1) should be installed by default and atd should be running.
However, you can use /sbin/chkconfig and /sbin/service to verify
and start/stop atd; it is part of the at-3.1.8-24_Scyld RPM.
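A quick sketch of the verification steps, using standard Red Hat-era
service tools (adjust the service name to your distribution):
rpm -q at
/sbin/chkconfig --list atd
/sbin/service atd status
/sbin/service atd start    # only if it is not already running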
Use
Run testscript at or during a specific time range:
at -f ./testscript now (at a specific time)
No method to specify a time range
Run testscript as load allows (< 0.8):
batch -f ./testscript
Run testscript using a specific queue:
batch -q [a-zA-Z] -f ./testscript
To remove a job:
atrm jobId
To list queues:
atq or bbq (or use the Scyld Webmin interface for bbq)
Support
In general, you would go to man pages or the Web for support
of at(1)/bbq(1).
Notes
At(1) has basic queuing mechanisms for firing off jobs either at specific
times (the time-of-day test) or based on load average, although you cannot
combine the two. You can also specify queues with a single character from
a-z or A-Z. The niceness of the run goes up as the queue letter goes up,
so a job submitted to the "z" queue will receive less CPU than a run in
the "a" queue. At(1) by default uses the "a" queue, and batch by default
uses the "b" queue. There is no way to specify configurations for a queue.
Jobs in uppercase queues are treated as if they had been submitted to
batch at that time. If users are competing for resources, you may want to
limit them each to a single queue; however, there is no way to enforce
this behavior.
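For example, the queue letter is chosen on the command line (the letters
here are arbitrary); the "z" job runs at its scheduled time but heavily
niced, while the batch job waits until the load permits:
at -q z -f ./testscript now
batch -q c -f ./testscript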
Security
At(1) has a basic security model. You can specify whether a user may use
at/batch to submit jobs, but you cannot specify which queue they may use.
The files /etc/at.allow and /etc/at.deny specify which users have access
to at/batch. If the files are empty or non-existent, then all users can
access the commands; however, if you specify even one user in
/etc/at.allow, you must specify every user who should have access. If your
username is not in /etc/at.allow, access is automatically denied, and vice
versa for /etc/at.deny.
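A minimal /etc/at.allow is just a list of usernames, one per line (the
names below are examples); any user not listed is denied access to at and
batch:
elh
lampert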
Clusterware/LSF
Clusterware and LSF are for-fee products of Platform Computing
(http://www.platform.com). Clusterware is a subset of the full LSF
package. While Clusterware is specific to Beowulf, usage of the queuing
agent is so similar to LSF that we cover both their capabilities and their
differences in this section. Because these are fee-based products, you
will need to purchase a license. Once you have that, you are ready to
begin.
Installation
Download and unpack lsf5.0_lsfinstall.tar.Z, and download the media
package but do not unpack it. We recommend that you review the PDF file
that comes with the installation; when installing Clusterware, substitute
"Clusterware" wherever you see "LSF". You will need to write a startup
script for the license manager (lmgrd) if you do not already have one, and
edit the license file to point to the directory containing the lsf_ld
program. If you install LSF on the slave nodes, they can reference the
master server. A default queue is created automatically.
Use
Run testscript at or during a specific time range:
To submit a job at a specific time:
bsub [-b mm:dd:hh:mm] ./testscript
To run a job during a specific time range:
Edit lsb.queues and modify the queue by adding RUN_WINDOW = 9:00-11:00 to
the queue definition, then reconfigure the system using badmin reconfig
(a sample queue definition appears after this list). This queue will only
run jobs from 9 to 11 AM:
bsub -q timedq ./testscript
Run testscript using a specific queue:
bsub [-q queuename] ./testscript
To remove a job:
bkill jobId
To list queues:
bqueues (lists queues)
bjobs -a (lists jobs in queues)
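The following is a minimal sketch of the kind of stanza we added to
lsb.queues for the timed queue; the PRIORITY value and DESCRIPTION are
examples, and the exact window syntax should be confirmed against
lsb.queues(5) for your LSF release:
Begin Queue
QUEUE_NAME   = timedq
PRIORITY     = 30
RUN_WINDOW   = 9:00-11:00
DESCRIPTION  = runs jobs only between 9 and 11 AM
End Queue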
Notes
Instead of RUN_WINDOW, you may want to use DISPATCH_WINDOW for the timed
queue, because RUN_WINDOW will suspend jobs that do not finish within the
time specified. LSF also has configuration options to operate with NQS, as
well as documentation on interoperability with Condor and on
Beowulf-specific items (Clusterware). LSF will also run jobs dependent
upon the completion of other jobs.
Support
Platform Computing's support is wonderful. We had several
problems regarding the license file, which they answered in a timely
fashion. They also called us back and asked us to undo a change
that was not beneficial to the system and later on could cause problems.
This type of proactive support is highly appreciated.
Security
There are host-, user-, and group-level access controls, as well as
time-of-day restrictions. Additionally, some security options depend on
other options and can change the run behavior. Tie these standard options
to the timed-queue options and you have a fairly complete security
picture.
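As a sketch of queue-level access control (the user and host names are
examples), the USERS and HOSTS keywords in an lsb.queues stanza restrict
which users may submit to the queue and which hosts the queue dispatches
to:
Begin Queue
QUEUE_NAME = secureq
USERS      = elh lampert
HOSTS      = master
End Queue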
Condor
Condor is a grid engine available from:
http://www.cs.wisc.edu/condor/
All grid engines include a way to schedule and submit jobs to the many
systems connected to a grid. Although Condor is a grid tool, it is
included here for its job-submission abilities. Its multi-architecture
approach and job control language (JCL) are extremely powerful if you have
multiple types of CPUs within your cluster, which is our case. Condor has
limited queues, called universes, that cannot be modified much and are
mainly used to match machines with specific types of jobs. You can run
Condor daemons on every node of your cluster; Condor will then choose the
best hosts on which to run jobs based on a classified ad (ClassAd) used to
advertise machine capabilities. The Condor testscript JCL is:
Universe = vanilla
Executable = /home/elh/testscript
Output = /home/elh/out.$(Process)
Error = /home/elh/err.$(Process)
Log = /home/elh/log.$(Process)
Arguments =
Queue
Installation
We installed Condor as a single-machine entity by running condor_install
after creating a Condor user and adding an /etc/hosts entry for the master
node (required by Scyld Beowulf, not by Condor). Next, we copied the
default central-manager version of condor_config.local to the appropriate
directory and modified the file. We modified the condor_config file to NOT
"glidein" (join) a grid, to change the CONDOR_STARTD options to start all
the appropriate daemons for a local configuration, and to always allow
processes to run. You can install Condor on each node of the cluster.
Use
Run testscript at or during a specific time:
There is no method to start a job at a specific time. You can, however,
create a "START" expression in your configuration file that only allows
jobs to run during a specific time range -- in this case from 10 to 11 AM
only. The time range is expressed in minutes since midnight, and the
configuration file should be modified as follows:
RunHours = (ClockMin > 600 && ClockMin < 660)
START = $(RunHours)
To run the job:
condor_submit condor_test
Run testscript using a specific queue:
condor_submit condor_test (condor_test contains the universe
specification.)
To remove a job:
condor_rm jobId
To list queues:
condor_q
Notes
The condor_test JCL was not difficult to understand, but the language can
be very dense and, for complex jobs, a little daunting. Because Condor is
used for a grid, the JCL must include such things as architecture and
operating system. There was a hole in the cluster at test time (node 3 was
down), and the complete test output was never returned to us. Condor has
multiple universes but not multiple queues within a universe. Each
universe is strictly defined by the type of job submitted (shell scripts,
MPI, PVM, and other programs). Note that Condor will only allow one job to
run at a time, no matter which universe is in use. Jobs queued to run
during the allowed run window ran properly.
Support
There is quite a large online documentation repository for
Condor, but some of it is incomplete. Mail to "condor_admin"
was queued and responded to within 24 hours. The total time to answer
our security-related question was roughly a week because of delays
in email and initial understanding of the problem. The configuration
file is fairly well documented.
Security
Condor has host-level access control to the pool of machines available via
the Condor daemons and their universes. There is a way to allow only
certain users administrative access to the system, but no way to block a
universe by user except with an extremely complex START expression
involving the RemoteOwner macro. Condor will "call home" to cs.wisc.edu
and send information about your Condor installation unless you set
"CONDOR_DEVELOPERS_COLLECTOR" to "NONE", which is an option inside
condor_config.
Generic Network Queuing System (GNQS)
GNQS is the granddaddy of all queuing tools and shows its age,
as there have been no improvements in quite a while. Even so, GNQS
will compile and run on a surprising number of systems. It is available
from:
http://www.gnqs.org/
Installation
Installation is straightforward and self-contained in the SETUP script,
which you run to do everything. Additionally, the script contains all the
necessary commands for setting up a default batch queue, starting the
server, and the other tasks required to complete the install. We did have
to create our own daemon startup file, however. Running NQS on every node
requires NQS to be installed on every node with all daemons running. You
would then specify redirect queues so that scheduling takes place on the
master node. Setting up a redirect queue is rather complex, but there is a
small example from which to extrapolate a suitable solution.
Use
Run testscript at or during a specific time:
Not possible.
Run testscript using a specific queue:
qsub -q [queuename] ./testscript
To remove a job:
qdel jobId
To list queues:
qstat [-a]
Notes
The qmgr program can be used to manage the complete system,
and there are many options. Most of these options guard system resources.
Support
Support for GNQS is purely what is available in the package
and on the Web. The manual pages are very good.
Security
You can specify user- and group-level permissions for each
queue created, but not host-level permissions.
GNU Queue
GNU Queue is a relatively new item in the list of queuing agents; however,
this open source project has had no active development in more than a
year, as evidenced by the need to fix the source to allow it to run on Red
Hat 7.3-based Scyld Beowulf. After fixing two source issues (a missing
header include and code that did not build on Red Hat 7.3-based Scyld
Beowulf), GNU Queue finally executed. GNU Queue is primarily a
load-balancing tool.
Installation
We modified define.h to add #include <time.h> as the very first line, and
we made the setrlimit code in queued.c (which is already conditional for
Solaris) conditional for Red Hat 7.3-based Scyld Beowulf as well. We used
the configure option --enable-root=YES to allow a root install of the
program so more than one user could use it. We also created our own
startup file. You would install GNU Queue (queued) on each node of the
cluster and let it manage your MPI or other jobs based on load average.
Additionally, you will need to create a GNU Queue host file to tell queued
which hosts to contact.
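As we understand it, the host file is simply a list of node names, one per
line; the path and file name below are assumptions (use whatever location
your configure options selected):
# /usr/local/share/qhostsfile (hypothetical path)
master
node1
node2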
Use
Run testscript at or during a specific time:
There is no method to submit a job at a specific time.
To submit a job during a specific time range:
cd /var/queue; cp -r wait timedq
Edit:
timedq/profile
Modify timesched line and make it read:
timesched Any,1100-1200
Restart GNU Queue daemon.
Run testscript:
queue -q -d timedq -- /home/elh/testscript
Run testscript using a specific queue:
queue <-i|-q -d queuename> -- /home/elh/testscript
To remove a job:
task_control -k jobId (if compiled)
To list queues:
View the queuestat file in the queue directory
Notes
Using a time of Any or Tu (Tuesday) worked for the timed-queue test; yet
whenever we added a 24-hour clock time with a low and a high value, it
ceased to work. There is no documentation for a timed queue, but you can
read through the source code. Any, Wk, Evening, Night, Sa, Su, Mo, Tu, We,
Th, Fr, and low-high 24-hour times are supported as arguments to the
timesched and timestop keywords. Each element on the line is
comma-separated, so you can create very complex timings. You may want to
redirect the output per "man queue" in order to not stall the job if there
is output. Jobs submitted prior to the timesched window are not run, yet
jobs can still be submitted and run during the times specified.
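For example, redirecting both output streams when submitting to the timed
queue keeps the job from stalling on output:
queue -q -d timedq -- /home/elh/testscript > testscript.out 2>&1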
Support
Support is limited to what is on the Web site (http://www.gnuqueue.org);
to understand all the options, you must read the code.
Security
GNU Queue has no mechanism to limit or control the use of the
system. If you can issue the queue command, you can use the tool.
Open Portable Batch System (OpenPBS)
OpenPBS, found at http://www.openpbs.org, provides a very good cluster
batch-processing system. OpenPBS has a for-fee version named PBSpro that
is not reviewed here. OpenPBS uses the Tcl/Tk scripting language, allowing
other schedulers to be added to the program suite quite easily; the Maui
Scheduler for OpenPBS is one such replacement scheduler.
Installation
Installation of OpenPBS is quite simple and requires that two RPM files be
installed on the system. To get xpbs (the graphical interface) to work, we
had to run the program, exit, and then edit the resultant .xpbsrc file to
change all instances of the server to the name chosen for the machine. By
default, a queue named workq is created. We set the server name by
changing the files /usr/spool/PBS/server_name and
/usr/spool/PBS/mom_priv/config as appropriate. Other documents describe
how to run pbs_mom on each cluster node to get the most out of OpenPBS
resource management.
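As a sketch, the server name "master" below is an example; substitute the
name you chose. The server_name file contains only the server's name:
master
while mom_priv/config tells pbs_mom which server to trust:
$clienthost master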
Use
Run testscript at or during a specific time:
qsub -a datetime [-q queue] ./testscript (at specific time)
No way to specify a time range
Run testscript using a specific queue:
qsub [-q queue] ./testscript
To remove a job:
qdel jobID
To list queues:
qstat -a
xpbs
Notes
The command syntax for OpenPBS is similar to that of GNQS, its ancestor;
there is even a tool to convert job scripts from GNQS to PBS.
Additionally, OpenPBS has many controls governing job dependencies (e.g.,
my job will not run until these other jobs have run). These controls can
be intricate and are an area of strength for OpenPBS.
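A minimal sketch of a dependency submission (the job ID shown is
illustrative; use the ID printed by the first qsub):
qsub ./testscript
qsub -W depend=afterok:123.master ./testscript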
Support
Support is via the Web site unless you purchase PBSpro, in which case
support is available from Veridian Systems. There is also an extremely
active mailing list that will happily answer your OpenPBS questions.
Additionally, monitoring the list gives a good feel for the more powerful
OpenPBS configurations.
Security
Security is available in the form of host-, group-, and user-level access
to a queue, providing comprehensive access control for the OpenPBS system.
There are also mechanisms to force the use of routing queues, which
further protect the system and choose the appropriate queue for a job.
Sun Grid Engine (SGE)
SGE can be found at http://gridengine.sunsource.net and provides
another grid computing facility. Unlike Condor, it has no limitations
as a queuing agent.
Installation
Installation was straightforward: we created the sgeadmin account, updated
/etc/services to reference the SGE daemons, then unpacked all the files
into the installation directory. Note that we unpacked the documentation
tarball first to get the instructions. We then stepped through these
instructions and had to provide arguments to setfileperm.sh because we ran
it as root, which is undocumented. Once we ran install_qmaster, selected
all the defaults (including the suggested GID range), and ran
install_execd, the tool was ready to use. We also created an appropriate
startup file for SGE. For each slave node, you need to run install_execd
to install the execution environment.
Use
Run testscript at or during a specific time:
Submit job at specific time:
qsub -a datetime [-q queue] ./testscript (at specific
time)
Submit job during specific time range:
Define calendar file according to calendar_conf(5):
# cat domy
calendar_name domy
year 1.1.2000=on
week 1-24=off 12-13=on
Run qconf -mq courageous.q and modify the calendar line to read calendar
domy instead of calendar NONE. This allows a job to be run only between 12
and 1 PM:
qsub [-q queue] ./testscript
Run testscript using a specific queue:
qsub [-q queue] ./testscript
To remove a job:
qdel jobID
To list queues:
qstat -f
Notes
SGE provides a wide range of grid and cluster control facilities that
exceed the abilities of Condor. Its calendar functionality is well
documented yet still confusing, because any time entered is denied by
default; it took some trial and error to get this correct. A job queued
before the runtime window of the queue will run at the appropriate time.
Support
The documentation for SGE is quite complete and answered all
our questions. There is further documentation on the SGE Web site
where you can purchase the Enterprise Edition and get a few more
features and enterprise-level support.
Security
SGE supports user, group, and host access control lists for a queue.
Groups are set using the user-level access list by specifying the group
name preceded by the @ symbol. SGE has a complete set of queue controls.
You can also specify projects using the Enterprise version of SGE;
projects appear to be broader in scope than groups.
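A short sketch of the access-list commands (the list, user, and group
names are examples): create entries with qconf -au, where a leading @
denotes a UNIX group, then attach the list to the queue's user_lists
attribute:
qconf -au elh devusers
qconf -au @research devusers
qconf -mq courageous.q (set user_lists to devusers)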
Matrix of Programs
Grading
Each system was graded on a scale of 0 to 5, with 5 being the highest. A
score of 0 implies that the feature was not available at all.
Installation
The measurement is based on the complexity of the install: whether manual
modifications, code changes, new scripts, or other changes were required
to get a basic queue to work. An install accomplished entirely with an RPM
command was considered a 5; one requiring modification of code was
considered a 1.
SimpleQ
How difficult was it to set up and run the commands to submit a simple job
to a queue? We also looked at general job management within a queue. A 1
means that the job could never be submitted; a 5 means the tool offers
many job-management capabilities.
TimedQ
We have two options under TimedQ. The first was the ability to submit a
job at a given time; the other was the ability to schedule jobs to
dispatch and run during a specific time window. A 0 implies that there is
no way to do either, while a 5 implies that both mechanisms are available
and that jobs are suspended if they run outside the runtime window. Note
that we were looking for time-based queue facilities inside the tool, not
the use of cron to make this work.
MultiNode
Can you monitor the load and other resources on multiple nodes
for submittal of jobs via the system? A 0 implies there is no mechanism,
while a 5 implies that there is and that the steps to do so were
not exceedingly complex (e.g., no queue-specific setups to make
it work). We did not verify these steps and only comment on what
the documentation states.
Support
How good is the support for the tool? A 1 implies we had to
read the source to figure things out, and a 5 implies that questions
were answered immediately and the support was also proactive.
Security
What queue security was available as a part of the tool? A
1 implies no security at the queue level, while a 5 implies that
host, user, group, and resource access controls are available. No
queuing agent will run an arbitrary job as anything but the user
who submitted the job.
Score
We added up the numbers and divided by 6 to get an unweighted
score from 0-5. (See Table 1.)
Conclusion
We were not surprised by the results, except for GNU Queue. GNU Queue is a
very simple queuing agent that runs jobs based on system load. What
surprised us was how easy the code was to read and understand. Although
source code is not the best documentation for any system, in this case it
did assist us in getting our tests to run. GNU Queue would never be a good
choice for someone who needs a well-supported and vetted utility.
Furthermore, we realize that just about all the functionality for each of
these tests can be achieved with a little shell scripting, with the
possible exception of security. For that, you need to be a little more
creative, hiding programs and denying access based on users and groups.
However, that's not what we wanted to know about each tool. We wanted to
know whether each was a self-contained unit and whether it could do it
all.
Clusterware/LSF, OpenPBS, and SGE all have numerous features and, again,
we were not surprised by their scores. We did find that Clusterware/LSF
has a more easily understood dispatch/run timed-queue capability and much
better support. We also found support in LSF for working with NQS, and
documentation within Condor for using LSF to provide a wider range of
features. OpenPBS mentions a way to convert NQS scripts to OpenPBS, while
SGE does not mention any other package. Such interoperability features
will help in the larger world, where you may have multiple clusters using
slightly different tools.
In our tests, we learned much about these queuing agents and
we think the choice of one over another depends on your own capabilities
and needs. Each tool meets a need, and the specific needs for your
cluster should be your first consideration when choosing a tool.
Although at(1) may be sufficient for some systems, others may need
the resources, job interlock checking, or queue security of another
tool.
Edward L. Haletky graduated from Purdue University in 1988 with a
degree in Aeronautical and Astronautical Engineering. Since then,
he has worked programming graphics and other lower-level libraries
on various Unix platforms. He currently works for Hewlett-Packard
in the High Performance Technical Computing team and as a security
consultant for the virtual office community.
Patrick M. Lampert graduated from the University of Massachusetts
in 1981 with a degree in Applied Mathematics. Since then, he has
worked in software development and support with various platforms.
He currently works for Hewlett-Packard in the High Performance Computing
Expert Team, supporting development tools and Tru64 UNIX and Linux
clustering technologies.