Remote Installation of Heterogeneous Linux Clusters Using LUI
Richard Ferri
To begin, I will review some terminology. What do I mean by "remote
installation of heterogeneous Linux clusters"? By remote, I
mean "installation over a network". This is in contrast
to local installation methods, which might use media like CD-ROM,
diskettes, or tape for node installation. By installation, I mean
copying a version of the Linux operating system to the permanent
hard drive of the node. Heterogeneous refers to nodes that are inherently
different: they might require different Linux kernels, use different
file systems, have hard drives of different sizes and types, or need
different sets of packages installed. And, using Greg Pfister's
definition in his book In Search of Clusters, clusters are
"collections of interconnected whole computers ... used as
a single, unified computing resource". Expanding on Pfister's
definition, a true cluster should also require less work to administer
than an equal number of standalone workstations. There must be
some benefit to the administrator for organizing these nodes into
a cluster, as opposed to a random set of workstations.
So, what is all the fuss about yet another remote Linux installation
method? After all, there are already SystemImager from VA Linux,
Kickstart from Red Hat, and many informal methods developed at various
national labs and universities. This solution is different in its
attention to solving the heterogeneous cluster installation problem
-- it is designed to be far less work than installing the nodes
individually, and it pays special attention to the inherent "differentness"
of the nodes. Right now, we're still in the early stages of
custom-manufactured Linux nodes. Today, when a customer gets a Linux
cluster from a manufacturer, all the nodes tend to be the same or
nearly the same. But over time, newly developed nodes will differ
from present-day nodes. Customers will want to integrate their current
nodes with the nodes now in the pipeline, and they'll find
current installation methods do not fare well in heterogeneous environments
-- thus enters LUI.
What is LUI?
The Linux Utility for cluster Install, aka LUI, is an open source
project sponsored by IBM that was released in April of 2000 under
the GPL (GNU General Public License). LUI draws on technology developed
for the RS/6000 SP, and was written to address the issue of how
to install heterogeneous Linux nodes. Because SP nodes changed
dramatically over their eight years in the marketplace, the installation
technology developed for them also applies well to evolving Linux
nodes. At the beginning
of this project, the LUI developers asked the question "what
makes Linux nodes different from one another?". They came up
with answers like:
- The hard drive, or hard drives, might be SCSI, IDE, or RAID
- File system requirements, local or remote
- Different kernels
- Some require special instructions at bootup time
- Networking attributes
- Software packages to install
- Amounts of swap space
These things that make nodes different from one another are called
resources in LUI. LUI also distinguishes between the installation
server and the clients, the nodes that the server installs.
Collectively, these machines are known as machine objects in LUI;
the server is known as the server machine object, and the clients
are referred to as client machine objects. So, at the most basic level,
a LUI user defines a server machine object, defines a set of client
machine objects, defines a set of resources, and applies a custom
set of resources to clients during installation. Resources may be
reused on many nodes -- this takes advantage of areas in which
nodes are the same. Reuse of resources is one way in which LUI exploits
the cluster nature of the nodes.
Remote Installation in General
Before I begin a more detailed discussion of how LUI installs
heterogeneous Linux clusters, I'll look at current remote installation
techniques. Linux remote installation methods fall into one of two camps
-- those that require a diskette to "pre-boot" the
node, and those that can boot directly over the network without
a diskette pre-boot. For a node to boot directly over the network,
both the firmware of the Ethernet adapter (I will only discuss Ethernet
as a boot medium here) and the firmware of the node must support
direct booting. Most of the nodes that boot remotely use PXE (Preboot
eXecution Environment), a direct boot standard developed by Intel.
This article will discuss only the direct boot method, because it's
more current and more in keeping with modern high-end clusters.
If you're interested in pre-booting a node via diskette, please
refer to http://sourceforge.etherboot.com for information
on the Etherboot project maintained by Ken Yap. LUI works in either
the direct environment or the diskette pre-boot environment.
Given that the node and its Ethernet adapter both support the direct
boot method, here, roughly, are the steps that a network installation
follows:
1. The server is conditioned to listen to broadcasts from connected
clients.
2. The clients are forced to broadcast over the local LAN in search
of an installation server.
3. The server responds to the client with the client's IP
information.
4. The client configures its Ethernet adapter with its IP information
and requests a kernel.
5. The kernel is TFTP'ed from server to client.
6. The kernel is read into the memory of the client and gets control.
7. The kernel mounts its remote root file system from the server
via NFS.
8. The kernel mounts other file systems and brings up various
services.
9. The kernel hands off to specialized installation code.
10. The specialized installation code partitions the hard drive,
creates file systems, and installs the operating system files on
the client.
11. Various customizations are done on the client.
12. The client completes installation of its local hard drive,
and is ready to boot from its hard drive for the first time.
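To make steps 1 through 5 a bit more concrete, the server side of that
exchange is typically handled by a DHCP daemon that points PXE clients
at a network loader served over TFTP. The stanza below is only an
illustrative sketch of an ISC dhcpd configuration, not something LUI
generates for you; the subnet, MAC address, and IP addresses are
placeholders:
# /etc/dhcpd.conf -- illustrative only; all addresses are placeholders
subnet 192.168.1.0 netmask 255.255.255.0 {
}
host node1 {
    hardware ethernet 00:60:94:aa:bb:cc;  # client MAC address (placeholder)
    fixed-address 192.168.1.101;          # IP information returned in step 3
    next-server 192.168.1.1;              # the installation server
    filename "pxelinux.bin";              # network loader, fetched via TFTP
}
The loader in turn TFTPs the kernel of step 5, and steps 6 through 12
are driven by the installation code that kernel hands control to.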
An Introduction to LUI
The major paradigm shift in using LUI is that the user is not
installing an installation image -- rather, the user is installing
a set of resources that, taken together, comprise a complete, customized
image. The resources are modular and completely reusable. Once a
resource is defined, it may be allocated to one node or many nodes
to be applied during the installation process. In fact, one of the
major contributions of the LUI project is in answering the question
"what resources combine to comprise a complete Linux installation?"
Once the types of resources that LUI supports are understood, it
will become evident how they complement each other and collectively
form a complete node installation image. Table 1 lists the resources
that LUI now supports (in the latest 1.7 release of November 2000)
with a brief description of each.
Experienced Linux users will see that the list of resources, taken
collectively, almost defines a node -- but there is no mention
of networking individuality in the resource list. Clearly, not all
nodes in a cluster can have the same IP address. While some clusters
opt for their IP information to be assigned dynamically via DHCP,
most clusters still use static IP assignment. Some clusters use
a hybrid approach -- their IP information is assigned dynamically
once, and then kept by the node forever. LUI supports the static
IP information model and assigns IP information to individual nodes
using node attributes. Table 2 lists server and client node attributes.
Collectively, the node attributes and resources completely define
a client node. The major difference between attributes and resources
is that attributes are items that are unique to a single node, and
are assigned when the node is defined. Resources can be shared among
multiple nodes and are created during the resource definition process.
Using LUI
There are three methods of accessing LUI -- the bottom-line
commands, the graphical interface (hereafter referred to as the
GUI for LUI, or GLUI), and the programmable interface (the API).
Since few readers will be interested in programming in LUI using
the API, I'll leave that discussion for another time. Let's
concentrate on the bottom-line commands and the GLUI.
To set up a LUI cluster, you must roughly follow these steps:
1. Understand your network addresses and layout (all nodes to
be installed must be connected to the server).
2. Install your installation server with Linux.
3. Download and install LUI on the installation server.
4. Bring up your favorite browser and follow the step-by-step
LUI instructions for installing and starting required services.
5. Define your server to LUI.
6. Define your clients to LUI.
7. Define your resources to LUI.
8. Allocate customized sets of resources to each node.
9. Network boot the nodes to force network installation.
10. Check the LUI logs for successful completion.
11. Once the nodes are installed correctly, reboot them from their
local hard drives.
A Sample Install
I'll describe an installation of a sample node using LUI.
I'll pick up with the server installed with Linux, LUI downloaded
and installed, and your installation server services already started.
The number of services that LUI requires is relatively small:
- tftp-hpa (thanks, H. Peter Anvin) -- For boot kernel transmission
- DHCP -- For IP information assignment
- syslinux RPM -- For the network loader (pxelinux.bin)
- Perl-Tk -- Required only for the GLUI
- inetd -- To start tftp automatically
- NFS -- To provide the remote root client file systems
Installing and starting these services are discussed in some detail
in the LUI help that gets downloaded as part of the LUI package.
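As a very rough illustration of what two of those pieces involve, an
inetd entry for tftp-hpa and an NFS export of the /tftpboot area might
look like the lines below. These are generic examples rather than the
exact lines the LUI help prescribes, and the exported path and options
are assumptions on my part:
# /etc/inetd.conf -- serve /tftpboot with tftp-hpa (-s chroots into it)
tftp  dgram  udp  wait  root  /usr/sbin/in.tftpd  in.tftpd -s /tftpboot
# /etc/exports -- path, client list, and options are assumptions
/tftpboot  *(rw,no_root_squash)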
With the server up and running, and the network connected to the
client nodes, we can start issuing LUI commands. First, define the
server, using the mklimm command, as in:
mklimm -n hacker -t server -i 9.117.20.31 -m 255.255.255.0
This command defines the machine named "hacker", of type
server, with an IP address of 9.117.20.31, and a netmask of 255.255.255.0.
That wasn't too hard. With the GLUI, it's even easier. First,
bring up the GLUI using the command glui (see Figure 1). From
the initial GLUI screen, choose "Define a Server Machine",
and you'll see a window as shown in Figure 2.
Now that you've seen both the GLUI and bottom-line methods
of using LUI, I'll continue with the graphical examples, but
mention which bottom-line command each example corresponds to. With
your server defined, you're now ready to define a set of client
nodes. Again, use the mklimm command, only this time with
the "client" option instead of "server". See
Figure 3 for definition of a client using the GLUI. This example
defines a client machine named "node1", with an IP address,
MAC address, netmask, etc., as above.
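For reference, the equivalent bottom-line command follows the same
pattern as the server definition. The client IP address below is a
placeholder, and the client's MAC address is supplied with an additional
mklimm option whose exact flag I'll leave to the LUI help rather than
guess at here:
mklimm -n node1 -t client -i 9.117.20.32 -m 255.255.255.0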
One advantage of using the GLUI is that information entered into
the graphical interface remains persistent after you press the OK
button. If you wish to add additional clients, you need only
change the attributes that are different for that client.
Once all your clients are defined to LUI, you can start defining
resources. A minimal set of resources for a node would include two
file resources (/boot and /), a disk partition resource to tell LUI
how to partition the local hard drive and how much swap space to
use, and an RPM resource to describe what set of RPMs to install
on a node. LUI is also capable of installing a node from a set of
tarballs that are archives of file systems, but if you want to do
that, you'll have to download LUI and refer to the help documentation.
See Figure 4 for how to define a file resource to LUI. This defines
a file resource named myboot, which represents the /boot
filesystem. You would define the root file resource in exactly the
same way, and give it the LUI resource name of "myroot".
To define a disk partition resource, you must edit a file that contains
an entry for each filesystem to be created, and an entry for swap.
If you want to use logical partitions, you can define that to LUI
as well. A sample disk partition table resource for our example
would be:
/dev/sda1 ext2 3 c y /boot
/dev/sda2 extended 1000 c n
/dev/sda5 ext2 980 c n /
/dev/sda6 swap 20 c
If you created your disktable in the file /tmp/mydisk.table,
you could define it to LUI using the bottom-line command:
mklimr -n mydisk -t disk -d /tmp/mydisk.table
Clearly, a future enhancement for LUI will be to generate the disk
partition table resource from a graphical interface. LUI should also
provide defaults for those administrators who really don't know
how big the client hard drive is.
To create the last resource, the RPM resource, copy any RPMs you
want to install to /tftpboot/rpm (which is where LUI will
find them on the installing node) and create a list of the RPMs to
install. A sample RPM list might begin with an entry like:
ElectricFence-2.1-3 ...
and end with:
dhcp-2.0.5
and have dozens or hundreds of RPMs in between. Sample complete RPM
lists are shipped with LUI. Again, use the mklimr command or
the GLUI to define the resource:
mklimr -n myrpmlist -t rpm -d /tmp/myrpmlist
Now that your minimum set of resources is defined, you should allocate
them to a client node. If you had a client named "node1"
for example, you would simply allocate all the resources to node1
using the allimr command, as in:
allimr -n myrpmlist -m node1
allimr -n mydisk -m node1
allimr -n myroot -m node1
allimr -n myboot -m node1
You would do this for each node in the cluster. Admittedly, this is
a rather wordy way to allocate resources to a node, particularly if
there are lots of resources and lots of nodes. That's why the
LUI team introduced grouping in LUI 1.7. Grouping provides a handle
for a group of nodes and a handle for a group of resources. If you
had a group of nodes named "frame1" for example, and a group
of resources named "my_resource_list", you could allocate
the group of resources to a group of nodes, as in:
allimr -n my_resource_list -g frame1
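If you are on a release that predates grouping, or you simply prefer
scripting, much the same effect can be had by wrapping the individual
allimr calls shown earlier in a small shell loop (the extra node names
here are hypothetical):
for node in node1 node2 node3; do
    for res in myboot myroot mydisk myrpmlist; do
        allimr -n $res -m $node
    done
done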
With the server, clients, and resources all defined, and the resources
allocated to the nodes, it's almost time to start installation
of the nodes. However, if you're like most administrators, you
want to do that one last check to make sure everything is configured
properly. LUI provides a set of list commands that list information
about nodes and resources. lslimr lists information about resources,
and lslimm lists information about nodes. If you need to make
changes, you might use the unallimr command to deallocate resources
from a node. You might even need to delete a resource (dellimr)
or delete a node definition (dellimm). You can probably intuitively
figure out what these commands do, and they are documented in the
LUI help files.
After that final check, you're ready to start installing
the client nodes. To start the node installation, you must force
the nodes to boot over the network. This usually involves powering
up the nodes, perhaps pressing a special key, and modifying the
boot list to come up over Ethernet. With a little foresight, you
might request that your nodes be set to boot from the network first
when ordered from the manufacturer. Regardless, initiate network
boot on the nodes, and LUI should get control during the installation
process.
Client Node Installation Process
During the client installation process, the client node will boot
over the network and load the Linux boot kernel supplied with LUI.
This boot kernel is a specialized kernel that has support built
in for most Ethernet adapters and SCSI devices. It is not the same
as the installation kernel that the user defines as a kernel resource,
or installs via RPM. This boot kernel is used only by LUI during
the installation process. This kernel mounts, via NFS, the remote root
file system that was defined earlier during resource definition. It
brings up the various services required by the installation process
and eventually passes control to a LUI script named clone.
Clone actually operates on the allocated resources to customize
the client node. It first references the disk partition resource
to partition the node's hard drive and to create file systems and
swap space. The clone script then installs all the RPMs (or tarballs)
that the user requested in the RPM resource. If the user allocated
any source resources, those files are copied over to the permanent
file systems. Lastly, IP configuration takes place for the Ethernet
adapter, including IP address, netmask, hostname, and default route.
During the cloning process, LUI writes progress messages to the
log file. After install is complete, you should see a message that
says "Installation is complete -- it's time to reboot!".
Then you can reboot the node from the local hard drive for the first
time.
Changes
Invariably, during the installation process, you realize there are some
tweaks or changes you want to make to your cluster. Or, perhaps
your cluster grows, and you'd like to add new nodes. Or, the
next release of your favorite Linux distribution comes out, and
you need to reinstall. When you need to change something, the true
power of LUI becomes evident. Building on the previous example,
let's say you want to reinstall the nodes to repartition the
hard drive. While you're there, you decide that you want to
NFS mount /home read-write to all the nodes in your cluster.
This is a straightforward modification. You simply define a new
file resource using the mklimr command, define a new disktable
resource using mklimr, and then allocate the resources to
the node using allimr. Voilà! You reinstall the node,
the /home directory is NFS-mounted, and your disk is
repartitioned.
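In bottom-line terms, the repartitioning half of that change reuses
exactly the mklimr and allimr syntax shown earlier; the resource name
and disktable file below are hypothetical:
mklimr -n mynewdisk -t disk -d /tmp/mynewdisk.table
allimr -n mynewdisk -m node1
The new file resource describes /home as a remote, NFS-mounted file
system; on each client the end result boils down to something like this
/etc/fstab entry (using the server name from the earlier example):
hacker:/home  /home  nfs  rw  0  0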
Conclusion
To quote Eric Raymond, every piece of software is developed to
satisfy someone's "personal itch". Our "personal
itch" (and I wish he had used another phrase) was that we had
a pile of castoff machines in our lab positively screaming to be
installed with Linux. However, all the nodes were different in some
way, and it was clear that we'd be reinstalling them often.
To solve our problem, we had to come up with a way to install these
machines easily, often, and with different distributions and releases
of Linux. We found that by defining the machine once to LUI, and
then by defining customized sets of resources for each personality
of each node, we were able to easily install a pile of heterogeneous
PCs. As the scope of LUI grows, so should our ability to expand
our clusters to different platforms and distributions, all from
a single control point.
The Future of LUI
The goal of LUI is completely hands-off installation for cluster
nodes, across a variety of platforms, with any distribution of Linux.
To date, LUI has been tested on various releases of Red Hat Linux,
on IBM Netfinity servers, various IBM workstations, and SGI and
Dell servers. It has also been tested on RAID, IDE, and SCSI disk
drives. Clearly, for LUI to take the next step, it must be ported
to other platforms such as Alpha and PowerPC, and to all the major
distributions of Linux. There are also some inherent weaknesses
in the installation process itself. Some of the areas in which LUI
needs to expand are:
- IP assignment. Today, LUI uses static DHCP, which must be administered
by the user. Clearly, LUI should manage the dhcpd.conf file
for the cluster, and obviate the need for MAC address collection.
Mike Brim from Oak Ridge National Lab has already contributed
this code -- look for it in a future release of LUI.
- Bootlist. Typically, BIOSs are not set with "network"
as the first bootable device, which means the user must set the
bootlist by hand on each node. For true hands-off installation,
the bootlist should be set once for the node and never changed,
preferably by the manufacturer.
- Cluster database. Currently, LUI uses flat files for its cluster
database, NFS mounted to the clients during installation. NFS
is not a good long-term solution for a write-consistent cluster
database. We will be looking into an open source database solution
in the near future.
- A GUI to define the disk partition table.
- A GUI to view the client installation logs.
Acronyms and Abbreviations
- LUI -- The Linux Utility for cluster Install (pronounced
LOO-e)
- DHCP -- Dynamic Host Configuration Protocol, a method of assigning
IP information to nodes
- TFTP -- Trivial File Transfer Protocol, a lightweight method
of transferring files from one system to another
- NFS -- Network File System, a method of mounting file
systems from one node to another across a network
Further Reading
How to Build a Beowulf, Sterling et al., 1999, MIT Press,
which describes how to build a Beowulf supercomputer in your attic
or basement.
http://sourceforge.etherboot.com -- Read all about
network install from diskette boot, and how to build your own boot
EPROMs.
http://oss.software.ibm.com -- Pull down LUI from the
"projects" menu, to read all about the LUI project, and
to download the source.
Richard Ferri is a Senior Programmer in the IBM Linux Technology
Center in Poughkeepsie, NY. He works on open source projects like
LUI and OSCAR, an open source clustering tool for high-performance
computing. His previous projects have included network installation
and diagnostics for the RS/6000 SP, and systems management code
for AIX/ESA. He received a BA in English from Georgetown University
many years ago, and now lives in cramped quarters with his wife
Pat, three teenaged sons, and three dogs of various sizes and suspect
lineage.