Remote Installation of Heterogeneous Linux Clusters Using LUI
Richard Ferri
To begin, I will review some terminology. What do I mean by "remote
installation of heterogeneous Linux clusters"? By remote, I
mean "installation over a network". This is in contrast
to local installation methods, which might use media like CD-ROM,
diskettes, or tape for node installation. By installation, I mean
copying a version of the Linux operating system to the permanent
hard drive of the node. Heterogeneous refers to nodes that are inherently
different: they might require different Linux kernels, use different
file systems, have hard drives of different sizes and types, or need
different sets of packages installed. And, using Greg Pfister's
definition in his book In Search of Clusters, clusters are
"collections of interconnected whole computers ... used as
a single, unified computing resource". Expanding on Pfister's
definition, a true cluster should also require less work to administer
than an equal number of standalone workstations. There must be
some benefit to the administrator for organizing these nodes into
a cluster, as opposed to a random set of workstations.
So, what is all the fuss about yet another remote Linux installation
method? After all, there are already SystemImager from VA Linux,
Kickstart from Red Hat, and many informal methods developed at various
national labs and universities. This solution is different in its
attention to solving the heterogeneous cluster installation problem
-- it is designed to be far less work than installing the nodes
individually, and it pays special attention to the inherent "differentness"
of the nodes. Right now, we're still in the early stages of
custom-manufactured Linux nodes. Today, when a customer gets a Linux
cluster from a manufacturer, all the nodes tend to be the same or
nearly the same. But over time, newly developed nodes will differ
from present-day nodes. Customers will want to integrate their current
nodes with the nodes now in the pipeline, and they'll find
current installation methods do not fare well in heterogeneous environments
-- thus enters LUI.
What is LUI?
The Linux Utility for cluster Install, aka LUI, is an open source
project sponsored by IBM that was released in April of 2000 under
the GPL (GNU General Public License). LUI draws on technology developed
for the RS/6000 SP, and was written to address the issue of how
to install heterogeneous Linux nodes. Because SP nodes changed
dramatically over their eight years in the marketplace, the installation
technology developed for them also applies well to evolving Linux
nodes. At the beginning
of this project, the LUI developers asked the question "what
makes Linux nodes different from one another?". They came up
with answers like:
- The hard drive, or hard drives, might be SCSI, IDE, or RAID
- File system requirements, local or remote
- Different kernels
- Some require special instructions at bootup time
- Networking attributes
- Software packages to install
- Amounts of swap space
These things that make nodes different from one another are called
resources in LUI. LUI also distinguishes between the installation
server and the clients, the nodes that the server installs.
Collectively, these machines are known as machine objects in LUI;
the server is known as the server machine object, and the clients
are referred to as client machine objects. So, at the most basic level,
a LUI user defines a server machine object, defines a set of client
machine objects, defines a set of resources, and applies a custom
set of resources to clients during installation. Resources may be
reused on many nodes -- this takes advantage of areas in which
nodes are the same. Reuse of resources is one way in which LUI exploits
the cluster nature of the nodes.
Remote Installation in General
Before I begin a more detailed discussion of how LUI installs
heterogeneous Linux clusters, I'll look at current remote installation
techniques. Linux remote installation methods fall into one of two camps
-- those that require a diskette to "pre-boot" the
node, and those that can boot directly over the network without
a diskette pre-boot. For a node to boot directly over the network,
both the firmware of the Ethernet adapter (I will only discuss Ethernet
as a boot medium here) and the firmware of the node must support
direct booting. Most of the nodes that boot remotely use PXE (Preboot
eXecution Environment), a direct boot standard developed by Intel.
This article will discuss only the direct boot method, because it's
more current and more in keeping with modern high-end clusters.
If you're interested in pre-booting a node via diskette, please
refer to http://sourceforge.etherboot.com for information
on the Etherboot project maintained by Ken Yap. LUI works in either
the direct environment or the diskette pre-boot environment.
Given that the node and its Ethernet adapter both support the direct
boot method, here, roughly, are the steps that a network installation
follows:
1. The server is conditioned to listen to broadcasts from connected
clients.
2. The clients are forced to broadcast over the local LAN in search
of an installation server.
3. The server responds to the client with the client's IP
information.
4. The client configures its Ethernet adapter with its IP information
and requests a kernel.
5. The kernel is TFTP'ed from server to client.
6. The kernel is read into the memory of the client and gets control.
7. The kernel mounts its remote root file system from the server
via NFS.
8. The kernel mounts other file systems and brings up various
services.
9. The kernel hands off to specialized installation code.
10. The specialized installation code partitions the hard drive,
creates file systems, and installs the operating system files on
the client.
11. Various customizations are done on the client.
12. The client completes installation of its local hard drive,
and is ready to boot from its hard drive for the first time.
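To make steps 1 through 5 a bit more concrete, the server side of that
exchange is typically handled by a DHCP daemon that points PXE clients
at a network loader served over TFTP. The stanza below is only an
illustrative sketch of an ISC dhcpd configuration, not something LUI
generates for you; the subnet, MAC address, and IP addresses are
placeholders:
# /etc/dhcpd.conf -- illustrative only; all addresses are placeholders
subnet 192.168.1.0 netmask 255.255.255.0 {
}
host node1 {
    hardware ethernet 00:60:94:aa:bb:cc;  # client MAC address (placeholder)
    fixed-address 192.168.1.101;          # IP information returned in step 3
    next-server 192.168.1.1;              # the installation server
    filename "pxelinux.bin";              # network loader, fetched via TFTP
}
The loader in turn TFTPs the kernel of step 5, and steps 6 through 12
are driven by the installation code that kernel hands control to.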
An Introduction to LUI
The major paradigm shift in using LUI is that the user is not
installing an installation image -- rather, the user is installing
a set of resources that, taken together, comprise a complete, customized
image. The resources are modular and completely reusable. Once a
resource is defined, it may be allocated to one node or many nodes
to be applied during the installation process. In fact, one of the
major contributions of the LUI project is in answering the question
"what resources combine to comprise a complete Linux installation?"
Once the types of resources that LUI supports are understood, it
will become evident how they complement each other and collectively
form a complete node installation image. Table 1 lists the resources
that LUI now supports (in the latest 1.7 release of November 2000)
with a brief description of each.
Experienced Linux users will see that the list of resources, taken
collectively, almost defines a node -- but there is no mention
of networking individuality in the resource list. Clearly, not all
nodes in a cluster can have the same IP address. While some clusters
opt for their IP information to be assigned dynamically via DHCP,
most clusters still use static IP assignment. Some clusters use
a hybrid approach -- their IP information is assigned dynamically
once, and then kept by the node forever. LUI supports the static
IP information model and assigns IP information to individual nodes
using node attributes. Table 2 lists server and client node attributes.
Collectively, the node attributes and resources completely define
a client node. The major difference between attributes and resources
is that attributes are items that are unique to a single node, and
are assigned when the node is defined. Resources can be shared among
multiple nodes and are created during the resource definition process.
Using LUI
There are three methods of accessing LUI -- the bottom-line
commands, the graphical interface (hereafter referred to as the
GUI for LUI, or GLUI), and the programmable interface (the API).
Since few readers will be interested in programming in LUI using
the API, I'll leave that discussion for another time. Let's
concentrate on the bottom-line commands and the GLUI.
To set up a LUI cluster, you must roughly follow these steps:
1. Understand your network addresses and layout (all nodes to
be installed must be connected to the server).
2. Install your installation server with Linux.
3. Download and install LUI on the installation server.
4. Bring up your favorite browser and follow the step-by-step
LUI instructions for installing and starting required services.
5. Define your server to LUI.
6. Define your clients to LUI.
7. Define your resources to LUI.
8. Allocate customized sets of resources to each node.
9. Network boot the nodes to force network installation.
10. Check the LUI logs for successful completion.
11. Once the nodes are installed correctly, reboot them from their
local hard drives.
A Sample Install
I'll describe an installation of a sample node using LUI.
I'll pick up with the server installed with Linux, LUI downloaded
and installed, and your installation server services already started.
The number of services that LUI requires is relatively small:
- tftp-hpa (thanks, H. Peter Anvin) -- For boot kernel transmission
- DHCP -- For IP information assignment
- syslinux RPM -- For the network loader (pxelinux.bin)
- Perl-Tk -- Required only for the GLUI
- inetd -- To start tftp automatically
- NFS -- To provide the remote root client file systems
Installing and starting these services are discussed in some detail
in the LUI help that gets downloaded as part of the LUI package.
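As a very rough illustration of what two of those pieces involve, an
inetd entry for tftp-hpa and an NFS export of the /tftpboot area might
look like the lines below. These are generic examples rather than the
exact lines the LUI help prescribes, and the exported path and options
are assumptions on my part:
# /etc/inetd.conf -- serve /tftpboot with tftp-hpa (-s chroots into it)
tftp  dgram  udp  wait  root  /usr/sbin/in.tftpd  in.tftpd -s /tftpboot
# /etc/exports -- path, client list, and options are assumptions
/tftpboot  *(rw,no_root_squash)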
With the server up and running, and the network connected to the
client nodes, we can start issuing LUI commands. First, define the
server, using the mklimm command, as in:
mklimm -n hacker -t server -i 9.117.20.31 -m 255.255.255.0
This command defines the machine named "hacker", of type
server, with an IP address of 9.117.20.31, and a netmask of 255.255.255.0.
That wasn't too hard. With the GLUI, it's even easier. First,
bring up the GLUI using the command glui (see Figure 1). From
the initial GLUI screen, choose "Define a Server Machine",
and you'll see a window as shown in Figure 2.
Now that you've seen both the GLUI and bottom-line methods
of using LUI, I'll continue with the graphical examples, but
mention which bottom-line command each example corresponds to. With
your server defined, you're now ready to define a set of client
nodes. Again, use the mklimm command, only this time with
the "client" option instead of "server". See
Figure 3 for definition of a client using the GLUI. This example
defines a client machine named "node1", with an IP address,
MAC address, netmask, etc., as above.
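For reference, the equivalent bottom-line command follows the same
pattern as the server definition. The client IP address below is a
placeholder, and the client's MAC address is supplied with an additional
mklimm option whose exact flag I'll leave to the LUI help rather than
guess at here:
mklimm -n node1 -t client -i 9.117.20.32 -m 255.255.255.0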
One advantage of using the GLUI is that information entered into
the graphical interface remains persistent after you press the OK
button. If you wish to add additional clients, you need only
change the attributes that are different for that client.
Once all your clients are defined to LUI, you can start defining
resources. A minimal set of resources for a node would include two
file resources (/boot and /), a disk partition resource to tell LUI
how to partition the local hard drive and how much swap space to
use, and an RPM resource to describe what set of RPMs to install
on a node. LUI is also capable of installing a node from a set of
tarballs that are archives of file systems, but if you want to do
that, you'll have to download LUI and refer to the help documentation.
See Figure 4 for how to define a file resource to LUI. This defines
a file resource named myboot, which represents the /boot
filesystem. You would define the root file resource in exactly the
same way, and give it the LUI resource name of "myroot".
To define a disk partition resource, you must edit a file that contains
an entry for each filesystem to be created, and an entry for swap.
If you want to use logical partitions, you can define that to LUI
as well. A sample disk partition table resource for our example
would be:
/dev/sda1 ext2 3 c y /boot
/dev/sda2 extended 1000 c n
/dev/sda5 ext2 980 c n /
/dev/sda6 swap 20 c
If you created your disktable in the file /tmp/mydisk.table,
you could define it to LUI using the bottom-line command:
mklimr -n mydisk -t disk -d /tmp/mydisk.table
Clearly, a future enhancement for LUI will be to generate the disk
partition table resource from a graphical interface. LUI should also
provide defaults for those administrators who really don't know
how big the client hard drive is.
To create the last resource, the RPM resource, copy any RPMs you
want to install to /tftpboot/rpm (which is where LUI will
find them on the installing node) and create a list of the RPMs to
install. A sample RPM list might begin with an entry like:
ElectricFence-2.1-3 ...
and end with:
dhcp-2.0.5
and have dozens or hundreds of RPMs in between. Sample complete RPM
lists are shipped with LUI. Again, use the mklimr command or
the GLUI to define the resource:
mklimr -n myrpmlist -t rpm -d /tmp/myrpmlist
Now that your minimum set of resources is defined, you should allocate
them to a client node. If you had a client named "node1"
for example, you would simply allocate all the resources to node1
using the allimr command, as in:
allimr -n myrpmlist -m node1
allimr -n mydisk -m node1
allimr -n myroot -m node1
allimr -n myboot -m node1
You would do this for each node in the cluster. Admittedly, this is
a rather wordy way to allocate resources to a node, particularly if
there are lots of resources and lots of nodes. That's why the
LUI team introduced grouping in LUI 1.7. Grouping provides a handle
for a group of nodes and a handle for a group of resources. If you
had a group of nodes named "frame1" for example, and a group
of resources named "my_resource_list", you could allocate
the group of resources to a group of nodes, as in:
allimr -n my_resource_list -g frame1
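If you are on a release that predates grouping, or you simply prefer
scripting, much the same effect can be had by wrapping the individual
allimr calls shown earlier in a small shell loop (the extra node names
here are hypothetical):
for node in node1 node2 node3; do
    for res in myboot myroot mydisk myrpmlist; do
        allimr -n $res -m $node
    done
done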
With the server, clients, and resources all defined, and the resources
allocated to the nodes, it's almost time to start installation
of the nodes. However, if you're like most administrators, you
want to do that one last check to make sure everything is configured
properly. LUI provides a set of list commands that list information
about nodes and resources. lslimr lists information about resources,
and lslimm lists information about nodes. If you need to make
changes, you might use the unallimr command to deallocate resources
from a node. You might even need to delete a resource (dellimr)
or delete a node definition (dellimm). You can probably intuitively
figure out what these commands do, and they are documented in the
LUI help files.
After that final check, you're ready to start installing
the client nodes. To start the node installation, you must force
the nodes to boot over the network. This usually involves powering
up the nodes, perhaps pressing a special key, and modifying the
boot list to come up over Ethernet. With a little foresight, you
might request that your nodes be set to boot from the network first
when ordered from the manufacturer. Regardless, initiate network
boot on the nodes, and LUI should get control during the installation
process.
Client Node Installation Process
During the client installation process, the client node will boot
over the network and load the Linux boot kernel supplied with LUI.
This boot kernel is a specialized kernel that has support built
in for most Ethernet adapters and SCSI devices. It is not the same
as the installation kernel that the user defines as a kernel resource,
or installs via RPM. This boot kernel is used only by LUI during
the installation process. This kernel mounts, via NFS, the remote root
file system that was defined earlier during resource definition. It
brings up the various services required by the installation process
and eventually passes control to a LUI script named clone.
Clone actually operates on the allocated resources to customize
the client node. It first references the disk partition resource
to partition the node's hard drive and to create file systems and
swap space. The clone script then installs all the RPMs (or tarballs)
that the user requested in the RPM resource. If the user allocated
any source resources, those files are copied over to the permanent
file systems. Lastly, IP configuration takes place for the Ethernet
adapter, including IP address, netmask, hostname, and default route.
During the cloning process, LUI writes progress messages to the
log file. After install is complete, you should see a message that
says "Installation is complete -- it's time to reboot!".
Then you can reboot the node from the local hard drive for the first
time.
Changes
Invariably, during the installation process, you realize there are some
tweaks or changes you want to make to your cluster. Or, perhaps
your cluster grows, and you'd like to add new nodes. Or, the
next release of your favorite Linux distribution comes out, and
you need to reinstall. When you need to change something, the true
power of LUI becomes evident. Building on the previous example,
let's say you want to reinstall the nodes to repartition the
hard drive. While you're there, you decide that you want to
NFS mount /home read-write to all the nodes in your cluster.
This is a straightforward modification. You simply define a new
file resource using the mklimr command, define a new disktable
resource using mklimr, and then allocate the resources to
the node using allimr. Voilà! You reinstall the node,
the /home directory is NFS-mounted, and your disk is
repartitioned.
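In bottom-line terms, the repartitioning half of that change reuses
exactly the mklimr and allimr syntax shown earlier; the resource name
and disktable file below are hypothetical:
mklimr -n mynewdisk -t disk -d /tmp/mynewdisk.table
allimr -n mynewdisk -m node1
The new file resource describes /home as a remote, NFS-mounted file
system; on each client the end result boils down to something like this
/etc/fstab entry (using the server name from the earlier example):
hacker:/home  /home  nfs  rw  0  0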
Conclusion
To quote Eric Raymond, every piece of software is developed to
satisfy someone's "personal itch". Our "personal
itch" (and I wish he had used another phrase) was that we had
a pile of castoff machines in our lab positively screaming to be
installed with Linux. However, all the nodes were different in some
way, and it was clear that we'd be reinstalling them often.
To solve our problem, we had to come up with a way to install these
machines easily, often, and with different distributions and releases
of Linux. We found that by defining the machine once to LUI, and
then by defining customized sets of resources for each personality
of each node, we were able to easily install a pile of heterogeneous
PCs. As the scope of LUI grows, so should our ability to expand
our clusters to different platforms and distributions, all from
a single control point.
The Future of LUI
The goal of LUI is completely hands-off installation for cluster
nodes, across a variety of platforms, with any distribution of Linux.
To date, LUI has been tested on various releases of Red Hat Linux,
on IBM Netfinity servers, various IBM workstations, and SGI and
Dell servers. It has also been tested on RAID, IDE, and SCSI disk
drives. Clearly, for LUI to take the next step, it must be ported
to other platforms such as Alpha and PowerPC, and to all the major
distributions of Linux. There are also some inherent weaknesses
in the installation process itself. Some of the areas in which LUI
needs to expand are:
- IP assignment. Today, LUI uses static DHCP, which must be administered
by the user. Clearly, LUI should manage the dhcpd.conf file
for the cluster, and obviate the need for MAC address collection.
Mike Brim from Oak Ridge National Lab has already contributed
this code -- look for it in a future release of LUI.
- Bootlist. Typically, BIOSs are not set with "network"
as the first bootable device, which means the user must set the
bootlist by hand on each node. For true hands-off installation,
the bootlist should be set once for the node and never changed,
preferably by the manufacturer.
- Cluster database. Currently, LUI uses flat files for its cluster
database, NFS mounted to the clients during installation. NFS
is not a good long-term solution for a write-consistent cluster
database. We will be looking into an open source database solution
in the near future.
- A GUI to define the disk partition table.
- A GUI to view the client installation logs.
Acronyms and Abbreviations
- LUI -- The Linux Utility for cluster Install (pronounced
LOO-e)
- DHCP -- Dynamic Host Configuration Protocol, a method of assigning
IP information to nodes
- TFTP -- Trivial File Transfer Protocol, a lightweight method
of transferring files from one system to another
- NFS -- Network File System, a method of mounting file
systems from one node to another across a network
Further Reading
How to Build a Beowulf, Sterling et al., 1999, MIT Press,
which describes how to build a Beowulf supercomputer in your attic
or basement.
http://sourceforge.etherboot.com -- Read all about
network install from diskette boot, and how to build your own boot
EPROMs.
http://oss.software.ibm.com -- Pull down LUI from the
"projects" menu, to read all about the LUI project, and
to download the source.
Richard Ferri is a Senior Programmer in the IBM Linux Technology
Center in Poughkeepsie, NY. He works on open source projects like
LUI and OSCAR, an open source clustering tool for high-performance
computing. His previous projects have included network installation
and diagnostics for the RS/6000 SP, and systems management code
for AIX/ESA. He received a BA in English from Georgetown University
many years ago, and now lives in cramped quarters with his wife
Pat, three teenaged sons, and three dogs of various sizes and suspect
lineage.