Orchestrating a Kinder, Gentler Disaster
Dorian Deane
When a system crashes, the system administrator's goal
is to return
the system, as closely as possible, to its original
state. Presumably,
any system administrator worthy of the title will have
the backups
required to accomplish this, but how quickly? Magnetic
tape, the media
of necessity for most system administrators, is notoriously
slow,
and that means that a single misstep, such as having
to read a tape
twice because of poor planning, can cost hours of downtime.
Making
the situation even worse is the time dilation factor:
that well-known
phenomenon in which time slows in direct proportion
to the number
of people waiting for you to finish your task.
The corruption of a single file system is generally
not a disaster
-- merely a time-consuming annoyance. My definition
of a disaster
involves the loss of an entire shared disk or even loss
of the server
itself. Under these circumstances, the system administrator
had better
be ready to approach the problem creatively. After months
without
a major crash, getting the right answer to questions
you had thought
were easy all at once becomes critical. How many file
systems from
the newly deceased disk were absolutely necessary to
get users working
again? Did that disk provide any swap area? Which clients
imported
files from it? Can the most important partitions be
restored to another
disk so that you can worry about the repair process
later? Each one
of these questions can be answered in some way, but
few of them can
be answered quickly -- unless you have prepared in advance.
The information most difficult to recapture from tape
is the underlying
disk structure. If a disk loses its label, the system
administrator
may be left floundering as he or she tries to make reasonable
guesses
about what the users need. Recreating the original disk
label is sometimes
vital, but not always easy (and almost never quick)
given the information
stored by a typical backup system -- such as one based
on the standard
dump and restore programs. Getting the swap partition
right is particularly
important, and no longer simply a matter of using the
swap area that
came on your preconfigured system. Swap requirements
get bigger every
year: where it used to be just Lisp developers who sat
outside your
door whining for a bigger swap partition, lately even
applications
written entirely in C are requiring hundred-megabyte
swap areas.
Leaving disk problems alone for a moment, consider what
happens if
your main disk server starts spitting smoke and fire
(most likely,
just as you were getting ready to go home for the evening).
If you
have a spare machine ready and waiting, you are in an
unusual and
privileged position. The rest of us end up making an
eleventh-hour
decision to designate a client as the new server, and
suddenly it
becomes important to know which clients imported which
file systems.
Intelligent approximations will get you much of the
way home, particularly
in a homogeneous environment, but the occasional odd
situation can
cause real trouble. Again, it's trivial to mount tape
after tape and
read in every fstab and exports file, but this can take
massive amounts of time.
The Bourne Shell script in Listing 1, which I call dkmap,
answers
these problems quite simply. For a network of one or
two systems,
the script is a mere convenience, giving the system
administrator
a means to easily automate an already fairly trivial
task. The larger
the network, the more useful dkmap becomes. dkmap queries
each machine in its host list for various information,
the most important
being the labels of all attached disks. The other information
it gathers
can be used as a map of shared file systems in the network.
Answering
questions such as "Was /usr/local on partition
d or e?" becomes
easy. This information is on your backup tapes, but
if you ever have
to stream through a 2-Gigabyte tape for nothing more
than a file system
table, you'll know the meaning of frustration. Some
administrators
wing it -- they know the systems and can make reasonable
guesses
-- but many users, particularly application developers,
are picky
enough that a reasonable guess won't do.
Parts I and II of dkmap could be simplified, but I find
it
convenient to have it formatted this way. If your system
configuration
changes often, you may want to run it from crontab,
once a
week, sending the output to a printer. You will probably
want to run
dkmap after hours; it does not run quickly because it
is thorough
-- rather than assuming that you have all disk types
listed in
your fstab, it patiently tries each disk in DISK_LIST
for device identifiers from 0 to 9. One of the more
interesting techniques
in the script is the method of copying a script to a
remote machine
and then executing it; though this trick strays from
the goal of simplicity,
it saves a lot of extra rshs and satisfies a preference
of
my own by not requiring installation of a separate script
on each
remote machine.
The dkmap script was written to run in a Sun environment.
It
is a given that changes need to be made for it to run
on other systems.
I have tried to place comments at all points in the
script where I
suspect there will be portability issues, and in one
instance, I've
added a case statement to resolve one of the most obvious
problems
in a heterogeneous network. Error-checking is minimal;
I did not feel
justified in doubling the length of the script to make
it only a little
more robust. Use it and enjoy -- I find myself running
to look
at its printout at least once every couple of months.
About the Author
Dorian Deane has been a UNIX systems programmer/administrator
for the last five years. He currently works with the
Advanced Decisions
Systems division of Booz, Allen and Hamilton. You may
contact Dorian
at ddeane@ads.com.
|