Recursive Use of SystemImager for Cloning Entire Clusters
Scott Delinger
The SystemImager (SI) utility for automated Linux installations
is invaluable in the construction of clusters (Figure 1 shows a
typical cluster configuration). The SystemImager Web site discusses
many of the uses for SI, including server farms, clusters, and desktop
environment rollouts. SystemImager is based on a client-server model:
the boxes to be imaged are the clients, and the image repository
is the master. SI also allows administrators to perform safe upgrades
with revision control by saving different images; rolling back to
a "last known good" image can return production boxes
to service very quickly.
Background
I have had the good fortune to have received too much cluster-building
work in the past two years (Table 1). Our department has installed
five new clusters in that time, as well as renewing a Windows computing
lab that serves as a Linux cluster at night (named werewulf). Our
first 16-node cluster, beowulf, which dates from 1999, was constructed
the hard way following these steps: build up the master; build up
a compute node; take the side panels off the compute node; slave
a new node's hard disk to the build node; "dd" the
drive image to the new node's drive; change the static IP address
and hostname; repeat n times. When building the newer clusters,
we needed a better way -- the two clusters built in 2001 each
have 20 nodes, and the three built in 2002 have 24, 40, and 192
processors, respectively.
I originally looked for a package to simply clone the compute
nodes, and was pleased to discover SystemImager. SI hosts the node
(client) image on the master node and clones the bare nodes from
this image. Better yet, SI uses DHCP in the process, so I could
select from static IPs, dynamic IPs, or dynamically provided IPs
locked to MAC addresses. Before using SI to build clusters, I tested
it on a desktop Linux box -- changing some files, re-imaging,
and generally familiarizing myself with the different aspects of
the program.
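For the MAC-locked option, the setup ultimately boils down to ordinary ISC dhcpd host entries; the fragment below is illustrative (addresses and MAC are made up), not lifted from any of my configurations:

# dhcpd.conf fragment -- hand each node a fixed address keyed to its MAC
subnet 192.168.1.0 netmask 255.255.255.0 {
    option routers 192.168.1.1;
}

host node01 {
    hardware ethernet 00:50:DA:12:34:56;   # node01's NIC
    fixed-address 192.168.1.101;           # node01 always gets this IP
}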
Inside SystemImager
Briefly, this process involves creating a "golden client"
on one of the nodes, and loading the SystemImager client packages
onto the golden client. The golden client will serve as a prototype
for the nodes in the cluster. Then, you set up a master that all
of the clients can access, and load the SystemImager server packages
onto the master. The image created on the golden client is then
taken up to the master, and that image is available to the other
nodes, which can request the image by booting from a floppy, CD,
network, or hard disk. Figure 2 provides a schematic of the process.
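In command form, one pass through that loop looks roughly like the following; the option spellings are as I remember them from the SI releases of that era, so check the man pages shipped with your version:

# On the golden client: gather partition/filesystem details and start
# the daemon that will serve the image to the master.
prepareclient

# On the master (image server): pull the image off the golden client.
getimage -golden-client node01 -image compute_v1.0

# Still on the master: register the other clients' hostnames and IP
# assignment method, and link their autoinstall scripts to the image.
addclients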
You can update the image directly on the master (chrooting into
the image is suggested) or make a change to a golden client and
then take the revised client image up to the master. Updating the
other clients only requires an update to the changed files. You
do not have to install a complete new image.
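For example, patching an image in place and pushing only the changed files out to a running node goes roughly like this; the image path reflects the old /tftpboot layout used by the SI versions described here, so substitute wherever your images directory actually lives:

# On the master: work inside the image as if it were a running system.
chroot /tftpboot/systemimager/images/compute_v1.0 /bin/sh
# ... apply the change, then exit the chroot ...

# On each client to be updated: rsync only the files that differ.
updateclient -server master -image compute_v1.0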
Revision control is possible by making a new image of the golden
client and pulling this new image down to the other nodes. Although
this requires copying a greater amount of data across the network,
recovery from a mistake in the new image requires only that the
"last known good" image be downloaded to the clients.
I tend to install security patches directly to a copy of the last
known good image (copying images on the server is possible). I can
then test the image on a node. Once that node appears to have remained
stable, I download the patched image to the balance of the nodes.
When adding software packages or updating distributions, I create
a new image to allow for revision control.
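On the image server, that routine amounts to a copy, a chroot, and a staged rollout; the image names below are hypothetical:

# Clone the last known good image before patching it; -a preserves
# ownership, permissions, and links.
cd /tftpboot/systemimager/images      # or wherever your images live
cp -a compute_v1.0 compute_v1.0-sec

# chroot into the copy, apply the security patches, and then try the
# patched image on a single node:
updateclient -server master -image compute_v1.0-sec   # on one test node

# If that node stays stable, run the same command on the balance of the
# nodes; if not, they simply keep (or re-pull) compute_v1.0.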
Variations on a Theme by Finley
In 2001, the two clusters (pleiades and naegling) I needed to
build were quite similar with respect to hardware (see Table 1)
and software load. Since the two clusters were running similar computational
chemistry packages in two different research groups, many libraries
were common to both clusters. Having built up the master node of
pleiades and the first compute node, I quickly (6 minutes/box!)
imaged the other 19 nodes. Almost all of the time in construction
was consumed by planning and by the installation of all of the libraries
and packages needed in the cluster onto the master and that first
compute node.
When it came time to build the master node of naegling, sys admin
laziness[1] set in. Since the master node of pleiades contained the
image of its compute nodes and had nearly all of the software load
needed by naegling, I thought: why not make this cluster master an
SI client, make my office desktop a "super" master, and
use SI recursively? (See Figure 3.) In other words, why not use
SI to clone a complete cluster? The only factor requiring consideration
was that the hard disks of naegling's compute nodes were 50%
larger, as the master nodes had identical hardware but for the CPU.
I installed the SI packages intended for the SI image server onto
my desktop and the SI client packages onto pleiades's master,
ran "prepareclient" on pleiades's master, and then
ran "getimage" on my desktop box. The image uploaded without
issue.
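Stripped to the commands, the recursive step looked roughly like this; the hostname is illustrative, the option spellings are again from memory, and the image name matches the master.sh autoinstall script mentioned below:

# On pleiades's master node, now acting as an SI *client*:
prepareclient

# On my office desktop, now acting as the "super" image server:
getimage -golden-client pleiades-master -image master

# The "master" image now holds a complete cluster head node --
# including the compute-node image it was already serving.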
Now, naegling's master was to have a static IP on the public
NIC, which obviously needed to be different from pleiades's
public NIC IP. The SI team had figured that out: a local.cfg file
on my boot floppy would be used to push the eth0 networking info
down to naegling. The networking information for the compute nodes
could match that of pleiades because it used an RFC 1918 private
address space. I made a link in the images directory for naegling.sh
to point to master.sh and awaited the results. The image of pleiades's
master was smoothly downloaded to naegling's master, and after
many minutes, the familiar "I've been done for x seconds,
reboot me" beeps were heard. I rebooted, and naegling's
master node sat awaiting a login. Furthermore, naegling's master
node had the SI master node packages installed, as well as the images
for the compute nodes, so the next step in the recursion was nearly
ready.
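A local.cfg for this kind of hand-off is just a short file of shell-style assignments that the autoinstall reads early on; the variable names are the ones I recall from the SI documentation of that era, and every value below is illustrative:

# local.cfg on the floppy used to boot naegling's master
HOSTNAME=naegling
DOMAINNAME=dept.example.edu       # illustrative
DEVICE=eth0                       # the public NIC
IPADDR=192.0.2.20                 # naegling's public address (made up)
NETMASK=255.255.255.0
GATEWAY=192.0.2.1
IMAGESERVER=192.0.2.10            # my desktop, the "super" master

# And on the image server, reuse the existing autoinstall script:
cd /tftpboot/systemimager
ln -s master.sh naegling.sh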
To address the hard disk size difference on the compute nodes,
I noted that the reason 30-GB disks were purchased for naegling
was a desire for additional scratch space on the compute nodes.
Therefore, I could leave all partitions but /scr identical to those
on pleiades. The script compute_v1.1.sh in /tftpboot/systemimager
had a stanza concerning disk partitioning, and all that was required
was to increase the /scr size to reflect the balance of the disk
reserved for /scr on naegling's compute nodes. I used a boot
floppy created for imaging pleiades's compute nodes, and all
was well -- when rebooted, the node had a /scr partition of 22
GB rather than 13 GB as on pleiades. I then installed the few packages
needed on naegling that were not needed on pleiades, pulled a new
image up to the master, imaged three more nodes, asked the researchers
to test the new packages, and then the remaining nodes were imaged.
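The partitioning stanza in those early autoinstall scripts was an sfdisk here-document, and the edit amounted to changing one size field. The fragment below is a reconstruction with made-up sizes, not a copy of compute_v1.1.sh:

# Reconstructed partitioning fragment (sizes in MB via sfdisk -uM are
# illustrative).  Each input line is start,size,type; the empty size on
# the last line means "take the rest of the disk", which is how /scr
# grew from 13 GB on pleiades to 22 GB on naegling.
sfdisk -uM /dev/hda << EOF
,64,83
,1024,82
,6144,83
,,83
EOF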
Variations Dal Capo
In 2002, three grants written for the purchase of clusters were
funded at the same time. To save work (laziness again), as well
as to leverage volume purchase power, all of the nodes for the three
clusters (oh, hrothgar, and plethora) were purchased together with
nearly identical specifications: 126 dual AMD Athlon MP 1800+ (TM)
compute nodes with 40-GB hard disks and 3Com 3c920 NICs. Master
nodes were similar, except for disk size. A wrinkle in this hardware
config was the 3Com(TM) NICs. An updated kernel with the proper
version of the NIC driver on the SI boot floppy solved the "I
can't receive DHCP leases" problem. Another wrinkle was
my use of ext3 filesystems for these newer clusters. fsck'ing
22 GB of temp files when a box crashes mid-calculation is no one's
idea of a good time. The symptom was that the bare nodes being imaged
wouldn't image any partition other than the root. Editing "updateclient"
to allow ext3 as well as ext2 mounted partitions to be imaged solved
that.
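I no longer have the exact diff, but the change was essentially a one-liner: wherever updateclient tested a partition's filesystem type for ext2 before including it, the test had to accept ext3 as well. In spirit (this is a reconstruction, not the real updateclient code):

# Before: only ext2 partitions were considered, with a test along the
# lines of
#     awk '$3 == "ext2" { print $2 }' /etc/mtab
# After: treat ext3 exactly like ext2
awk '$3 ~ /^ext[23]$/ { print $2 }' /etc/mtab   # mount points SI may touch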
Now, past experience says to start with plethora's master,
and grow /home from 18 GB up to the 36-GB disks on oh and hrothgar.
However, because plethora's owner has two clusters already
and could wait, I needed to build the similarly equipped oh and
hrothgar before building plethora. So in this variation, I needed
to shrink a partition. Also, SystemImager had been updated to version
2.0.1 since my last use; the new version moved some directories and
their contents around (/tftpboot is no longer used in my application)
and included System Configurator, which handles newer distributions
more easily than the support that had been hard-coded into SI in the past. As
before, as I built up the client images, I "sprayed" the
current image out to four or so nodes, tested, got feedback from
the users, and then made changes and re-imaged. Very slick. Anticipating
larger and more numerous images, we purchased a proper image master
with sufficient disk space to save images for all of the clusters,
relieving my desktop box of this task.
I insisted on installing all of the software that belonged
on any one of these three clusters onto all of them. In some cases,
certain packages would never be touched on a given cluster. On the other
hand, building time would be reduced to cloning time. Users were
getting frustrated waiting for their cluster while I tested packages.
However, when I finished oh, I cloned hrothgar from oh in a single
day. Users admitted that doubling their available CPUs in a day
was worth the wait.
Plethora was (and is) a bit of a challenge. The difficulty isn't
in the cloning; that's straightforward enough. The shrinking
of the /home partition was trivial -- I edited the stanza regarding
partition sizes as before, and since the material in the /home/*
directories required much less than 18 GB, no problems arose. Keeping
almost 0.5 THz of computational resources cool has been the challenge.
The 20 or so nodes of plethora that are currently running work fine,
and I've taught one of the faculty how to add a package to
the "golden client" and take an incremental image and
pass that on to the other nodes. If something gets fouled, we just
image back to the previous working image and we're back running
again.
D.S. al Fine: Werewulf
Finally, also during the summer of 2002, we renewed the hardware
in a computing lab. In 1997, the lab was outfitted with 40 Pentium II/233
stations with 64 MB RAM, 3-GB HDs, and 17" monitors. During
the day, the lab ran Windows 95(TM) on all stations, and undergraduate
students used the stations for email, browsing, and report writing.
At 17:00, the doors locked and the lights went out. Fifteen minutes
later, the stations rebooted into Linux, and ran as a cluster until
08:00, at which point they reverted to Windows. This facility was
well received as an overnight cluster, and later the Linux image
was used on a few stations during the day to provide X-terminal
capabilities for NMR data processing software that runs on Solaris.
This summer, we purchased 40 AMD Duron(TM) 1.1-GHz boxes with
256 MB SDRAM and 20-GB IDE hard disks. We needed to keep the dual-use
nature of the laboratory. The Windows image (handled with Symantec's
Ghost(TM)) was little changed, once we made the proper alterations
for the new hardware. The Linux image created in 1999, however,
was quite dated. To quickly renew the Linux side of the lab, I imaged
a new master (werewulf) from the image I had for the oh/hrothgar/plethora
clusters, then attempted to image a compute node from the new master.
Since we'd used SMC 9432 NICs in the lab in 1997 and had DHCP-served
static IPs (locked to MAC addresses), we migrated those cards into
the new boxes. Under the time constraints we faced, we weren't
able to use SI for imaging the 40 new Durons, because the EPIC100
driver wasn't in the standard 2.2.18 kernel provided with SI.
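Checking for the driver ahead of time is a one-line grep against the configuration of the kernel used on the autoinstall media; the source path below is illustrative:

# Is the SMC EtherPower II (9432) driver in the kernel on the SI boot
# media?  Point the grep at that kernel's .config.
grep EPIC100 /usr/src/linux-2.2.18/.config
# "# CONFIG_EPIC100 is not set" was, in effect, the answer we got.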
Given another day or two before classes started, we could have
made it. The new version of SI being developed will use modular
NIC drivers, removing one of the last hindrances I've encountered
with SI. Overall, recursive use of SI has saved me weeks of time
in a large project, as well as providing software revision control.
[1] "Laziness is a very important system administrative virtue."
from Essential System Administration, 2nd Ed. by Æleen Frisch.
O'Reilly & Associates, p. 342.
Scott Delinger is a lapsed Ph.D. analytical chemist who now
serves as the IT Administrator for the Department of Chemistry at
the University of Alberta. He maintains seven supercomputational
clusters in one of the largest cluster facilities in Canada. He
spends time off-line with his wife and daughter. He can be emailed
at: scott.delinger@ualberta.ca.