Introducing the OpenSSI Project
Richard Ferri and Brian J. Watson
Although Linux clusters are often equated with High Performance
Computing (HPC) clusters, there are, in fact, many different types
of Linux clusters. In addition to the well-known HPC cluster type,
Bruce Walker, in the technical paper "Open Single System Image
(OpenSSI) Linux Cluster Project", lists five other cluster types:
1. Load leveling -- Clusters that move processes from highly
loaded nodes to more lightly loaded nodes, as in openMosix (see
related article "The Secrets of openMosix" in this issue)
2. Web serving -- Typically a cluster with one or more front-end
nodes sending HTTP requests to back-end Web serving nodes
3. Storage -- Clusters that provide coherent views of modern
cluster-wide file systems
4. Database -- Clusters that provide coherent access to databases
5. High Availability (HA) -- Clusters that provide resource
redundancy and failover in case of failure
Not that there is any definitive list of Linux cluster types,
mind you. In Linux Clustering, Charles Bookman mentions another
type of cluster, the distributed cluster, in which loosely coupled
workstations cooperate to solve grand problems, such as the search
for extraterrestrials. However, Bookman groups MOSIX (and by association,
openMosix) in with the distributed class, and puts Web-serving solutions
in the load-leveling class. Walker puts MOSIX in the load-leveling
class and Web serving in a class by itself. To muddy the waters
further, Greg Pfister, in In Search of Clusters, classifies
clusters as "a subparadigm of the parallel/distributed realm".
But wait, didn't Bookman classify distributed computing as a type
of cluster? So, are clusters subcategories of distributed and parallel
computing, or is distributed computing a type of cluster?
Clearly, fitting the different types of clusters into the
greater realm of computing is a slippery slope. However, regardless
of how many ways we might classify clusters, one element of clustering
remains constant -- there is not nearly enough consideration
given to sharing elements of solutions among the cluster segments.
One exception is the Open Cluster Framework (OCF, http://www.opencf.org),
a group that has ambitiously taken on the mission of defining APIs
that pertain to both HA and HPC clusters. For example, the OCF defines
a "node liveness" service, which tells members of the
cluster which nodes are currently up and which are non-responsive.
But, if you think about it, wouldn't a node liveness function
be useful in Web serving as well? And aren't Linux and open
source all about sharing code to minimize re-invention? Well, finally
there is a project that addresses the general needs of all types
of clusters -- the OpenSSI Linux Cluster project.
Overview
The OpenSSI project (http://www.OpenSSI.org) is an open
source clustering project that addresses three major issues that
cut across all the various cluster segments: high availability,
scalability, and manageability. The project recently released its 1.0
version, which is available under the GPLv2 license. What distinguishes
the OpenSSI project in the Linux space is its ambition and scope
-- it intends to become the definitive project that unites all
the Linux cluster factions. It's ambitious enough that, to
my knowledge, such an inclusive, sweeping approach to clustering
has never before been attempted.
When Pfister discusses Single System Image (SSI), he describes
layers and levels of SSI, depending on how many subsystems and resources
cooperate to create the illusion of a single system, comprising
perhaps hundreds or thousands of nodes, that provides the look and
feel of a single workstation (albeit one on steroids). In an attempt
to achieve true SSI that cooperates at many different levels, the
OpenSSI team has analyzed and dissected clusters into their component
parts to provide a set of SSI building blocks at the system call
level from which any type of cluster can be implemented. These SSI
building blocks, some developed by the OpenSSI team and some
borrowed from the open source community, give each process
a cluster-wide view of all the available cluster resources.
The building blocks consist of the following components:
1. CLuster Membership Service (CLMS) -- The role of CLMS is
to keep an accurate accounting of the state of each node (up,
down, going down, and so on), to make sure each node has the same
view of the state of all the other nodes in the cluster, and finally
to provide APIs that make this information available (a small shell
sketch of consuming this information appears after this list). Like all the
kernel subsystems, CLMS relies on the ICS (Inter-node Communication
Subsystem), which provides a method for the nodes to exchange information.
2. Clusterwide Filesystems -- Mounts are automatically distributed
to all nodes, which ensures a consistent view of the filesystem
tree across the cluster. Each mounted filesystem can be accessed
via one of several methods:
(a) Cluster FileSystem (CFS) -- Stacks over a physical filesystem,
such as ext3 or XFS, to provide transparent cluster-wide access
to that filesystem from any node in the cluster. CFS is similar
to NFS, except it is highly coherent, supports client-at-server
access, and can transparently failover in the event of server failure:
- Coherency -- Any node sees a change to a file or the directory
structure as soon as it happens.
- Client-at-server access -- The server node for a filesystem
accesses it via the same coherency mechanisms as any other node.
- Transparent failover -- Another node can automatically
become the server for a filesystem, if it is physically connected
to the storage device.
(b) Global FileSystem (GFS) -- A parallel physical filesystem
that allows direct access to a storage device from any node in the
cluster. The storage device must be physically connected to all nodes
with a storage area network (SAN), which limits its applicability
to smaller clusters. The open source version of GFS is maintained
by the OpenGFS project (http://OpenGFS.sf.net).
(c) Lustre -- A network attached storage (NAS) filesystem
that "stripes" across multiple servers. It is useful for
very large clusters, because it eliminates the performance bottleneck
of a single CFS server without requiring a storage device to be
physically connected to every node. This filesystem is maintained
by the Lustre project (http://www.Lustre.org).
(d) Special filesystems -- OpenSSI has built-in cluster-wide
support for special filesystems, such as devfs and procfs.
3. Cluster-wide Process Subsystem -- Provides a view of all
processes from all nodes in the cluster, through both system calls
and /proc/<PID>. This allows commands like ps
and gdb to work cluster-wide without modification. It also
provides tools to easily migrate processes in mid-execution, either
manually or via automatic load balancing.
4. Cluster-wide Inter-Process Communication (IPC) -- Provides
a cluster-wide address space for each Linux IPC mechanism (pipes,
Unix domain sockets, semaphores, etc.), and allows IPC to work transparently
between processes on different nodes.
5. Cluster-wide Devices Subsystem -- Provides a naming method
for all the devices in the cluster, and easy access to any device
from any node.
6. Cluster-wide Networking Subsystem -- Provides a single
IP address to the external world, and load balances TCP connections
among the cluster nodes. Much of this functionality is derived from
the Linux Virtual Server project (http://LinuxVirtualServer.org).
7. Distributed Lock Manager -- Runs on each node to enable
various forms of cache coherency.
8. System Management -- The goal of System Management is to
rely on familiar commands, slightly modified, rather than on a new
set of cluster commands. System Management also makes adding
new nodes to the cluster trivial.
9. Application/Resource Monitoring and Failover/Restart --
A subsystem that takes certain user-defined actions when a node
fails or an application stops.
Taken collectively, these components provide a single system image
look and feel across different resources and provide a set of building
blocks for all the different cluster types (much of this information
is taken from the Walker paper -- refer to it for a complete
discussion).
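As promised in the CLMS description above, here is a minimal shell sketch
of how a script might consume the membership information. It is our own
illustration, not something taken from the OpenSSI documentation; it assumes
only the cluster command demonstrated later in this article, which prints
one "node: state" line per node:

#!/bin/sh
# Hypothetical membership watcher: report whenever the cluster's view of
# node states changes. Assumes "cluster -v" prints one line per node.
prev=""
while true; do
    now=$(cluster -v)
    if [ "$now" != "$prev" ]; then
        echo "membership changed:"
        echo "$now"
        prev="$now"
    fi
    sleep 5      # poll every five seconds
done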
Installing an OpenSSI Cluster
The minimum hardware for a basic OpenSSI cluster is two or more
computers connected with an Ethernet network. For security reasons,
this network should be private. Any node that has a second NIC can
use it to connect to external networks or the Internet.
OpenSSI can run on several Linux distributions, including Red
Hat and Debian. The focus of this discussion is on Red Hat, so install
the latest version of it on your first node. You do not need to
install a distribution on any other machine. This is because OpenSSI
shares the first node's filesystems with the rest of the cluster
(via CFS) -- one reason why it is so easy to administer an OpenSSI
cluster. Nevertheless, there are a few small restrictions on how
you should configure your installation. You also need a few packages
that may not be installed by default, such as dhcp and tftp-server.
See OpenSSI.org for more details.
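For example, on a Red Hat system you can check for the two packages
mentioned above with an ordinary rpm query (nothing here is OpenSSI-specific);
if either one is reported as not installed, add it from your distribution
media before proceeding:

(node 1)
# rpm -q dhcp tftp-server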
Next, download the latest RPMs for OpenSSI. They can be found
together in a tarball on OpenSSI.org. Login as root, extract the
tarball, and run the provided installation script. It will ask you
a few questions about how you want to configure your first node,
then the script will begin installing packages.
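The file and script names change from release to release, so the following
is only a sketch; substitute the actual tarball name you downloaded and run
whatever installation script its README points you to:

(node 1)
# tar xzf openssi-<version>.tar.gz
# cd openssi-<version>
# ./<installation-script>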
The process starts by replacing a few Red Hat packages with OpenSSI-enhanced
versions. They have extra features when you are running on an OpenSSI
kernel, and behave normally when you are not. These extra features
include:
- init -- Is highly available and can start daemons and
gettys on nodes joining the cluster
- mount -- Understands which node has what filesystem
(or swap device)
- mkinitrd -- Can build the special ramdisk required
for booting a node into the cluster
The script also installs new packages that provide commands and
APIs for such things as:
- Examining the membership state of cluster nodes
- Migrating processes on the fly
- Remotely executing and forking processes
- Configuring automatic load balancing of processes and TCP connections
- Simplifying cluster installation and management
- Automatically restarting critical applications that die
Finally, the script installs the OpenSSI kernel package, which
contains most of the clustering code. It is based on the Red Hat
kernel shipped with your distribution, but it is modified so that
system calls can transparently access remote objects (processes,
filesystems, devices, sockets, semaphores, etc.) on other nodes
in the cluster. Because every object is guaranteed to have a unique
name and can be transparently accessed as if it were local, unmodified
programs running on an OpenSSI kernel are tricked into believing
that the cluster is a single, giant machine.
Once the install script is done running, you can simply reboot
and select OpenSSI at the bootloader prompt. Congratulations! You
now have a one-node OpenSSI cluster.
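At this point, a quick sanity check (using commands described in more
detail later in this article) should show an OpenSSI kernel and a one-node
membership list; the kernel version string below is only a placeholder:

(node 1)
# uname -r
<your OpenSSI kernel version>
# cluster -v
1: UP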
Adding a New Node
Adding a new node is easy. Most of the work is done by the openssi-config-node
command. It has an interactive interface to help you add nodes,
remove nodes, and change their configurations.
To add a node, bring it up on the private network with a network
booting protocol such as PXE or Etherboot (http://www.Etherboot.org).
The node will display its network hardware address (MAC address)
on its console. Run openssi-config-node on the first node,
tell it you want to add a new node, and select the appropriate MAC
address from a list. Assign the node a unique number (such as 2)
and IP address. After a few moments, the node will boot and join
the cluster. That's all there is to it. You can add more than
a hundred nodes following this procedure, although a more automated
system is under consideration for clusters that large.
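Once openssi-config-node finishes, you can verify that the newcomer has
joined by checking the membership state from the first node (the same
check used in the examples later in this article):

(node 1)
# cluster -v
1: UP
2: UP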
For each new node, you can configure special per-node features.
These features include:
1. Local swap device -- Each node with local storage can use
it for swap.
2. Local boot device -- Each node with local storage can boot
from it, rather than doing a network boot.
3. Additional CFS-stacked filesystem -- Each node with local
storage can share a filesystem on it with the rest of the cluster
via CFS.
4. External network interface -- Each node with a secondary
network interface can use it to connect to external networks.
See OpenSSI.org for more details.
Configuring Optional Features
The instructions I've provided will help you set up a basic
OpenSSI cluster, without any special HA or scalability features.
On the OpenSSI.org Web site, you can find supplemental documentation
for configuring optional features, such as:
1. High-availability CFS (HA-CFS) -- Enables transparent failover
of a CFS-stacked filesystem between nodes that are physically connected
to a shared storage device.
2. Direct-access GFS -- A parallel physical filesystem described
previously.
3. High-scaling Lustre -- A "striped" NAS filesystem
described previously.
4. Process load balancing -- Automatically migrates processes
from overloaded nodes to underloaded nodes. This feature is provided
by openMosix node selection algorithms combined with OpenSSI's
process migration architecture.
5. TCP connection load balancing -- A Cluster Virtual IP (CVIP)
address can be configured to automatically distribute connection
requests among all nodes that service a given TCP port.
6. HA CVIP address -- A CVIP address can be automatically
hosted on a new node if its original host node goes down.
7. HA application monitoring and restart -- A utility package
that can be used to automatically monitor applications and restart
them if they die. The monitoring and restart daemon stores its state
on disk so that it can be restarted by init without forgetting what
it was watching.
Playing Around with OpenSSI
The following examples demonstrate some of the features described
above. They assume you have a basic two-node cluster. To get started,
log in to both consoles as root. Confirm that both nodes have joined
the cluster:
(node 1)
# cluster -v
1: UP
2: UP
Also confirm the node number for each console:
(node 1)
# clusternode_num
1
(node 2)
# clusternode_num
2
These are some of the commands used for examining the membership state
of cluster nodes. Read the OpenSSI documentation for more information
about these and related commands.
Now it is time to do something really interesting.
Clusterwide Filesystems
Start a vi session on the first node (or emacs if
you prefer):
(node 1)
# vi /tmp/newfile
Type a few lines:
Hello World!!
Beautiful day, isn't it?
Save the file, but do not quit. On the second node's console,
log in as root and view the contents of your new file:
(node 2)
# cat /tmp/newfile
The output is the same text you entered on the first node. This demonstrates
that the filesystems are shared coherently across the cluster. Whether
you are creating, editing, moving, or deleting files and directories,
any change you make on one node is immediately visible to all other
nodes. Feel free to try other experiments with cluster-wide filesystems.
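For example, directory operations are just as coherent as file contents,
and nothing OpenSSI-specific is needed to see it:

(node 2)
# mkdir /tmp/shared-test
# touch /tmp/shared-test/hello

(node 1)
# ls /tmp/shared-test
hello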
Cluster-wide Processes
Remember that you did not quit vi in the previous section.
That is because we are going to play some games with cluster-wide
processes. Look for the vi session's PID from the second node's console:
(node 2)
# ps | grep vi
66123 tts/0 00:00:00 vi
The ps command is unmodified and unaware of the cluster, yet
it is able to see a process running on node 1. How is this possible?
It is because the OpenSSI kernel can transparently access remote objects,
such as a process on node 1, to create the illusion that the cluster
is a single, giant machine.
You can confirm the location of the process with a special OpenSSI
command:
(node 2)
# where_pid 66123
1
One more experiment before we move on:
(node 2)
# kill 66123
Like ps, the kill command is unmodified and unaware
of the cluster. Nevertheless, the vi session on node 1 is killed.
Not only are all processes visible throughout the cluster, but they
can be signaled from anywhere in the cluster. All processes appear
to be local.
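You can run the same experiment in the opposite direction with any
long-running process. Only where_pid is OpenSSI-specific; the PID printed
by the shell's job control will, of course, differ on your system:

(node 2)
# sleep 600 &
[1] <PID>

(node 1)
# where_pid <PID>
2
# kill <PID>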
Clusterwide IPC
Why stop there? IPC objects are visible cluster-wide, too:
(node 2)
# mkfifo /tmp/fifo
(node 1)
# echo 'Hello World!!' >/tmp/fifo
(node 2)
# cat /tmp/fifo
Hello World!!
At the risk of sounding like a broken record, I will say that the
mkfifo, echo, and cat commands are also unmodified
and unaware of the cluster. The OpenSSI kernel supports truly cluster-wide
IPC objects, including pipes, FIFOs, Unix domain sockets, semaphores,
message queues, SysV shared memory, and memory mapped files.
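There is no convenient shell one-liner for creating a System V object on
a distribution of this vintage, but once an application has made one (the
load-balancing demo described in the next section uses a message queue,
for instance), the stock ipcs utility gives you a way to look for it from
another node. Whether the cluster-wide IPC namespace shows up in ipcs
output is our assumption, not something taken from the OpenSSI documentation:

(any node, while such an application is running)
# ipcs -q

If the assumption holds, the same message queue is listed no matter which
node you run the command on.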
Process Load Balancing
Process load balancing is neatly demonstrated by a demo package
available on OpenSSI.org. It was first shown at Linux World San
Francisco in 2002, where OpenSSI won the Best Open Source Project
award for the conference. It consists of three Perl scripts and
a detailed README. The first two scripts (demo-proclb and
demo-proclb-child) are unaware of the cluster and can run
on any Linux system. The demo-proclb script forks off a number
of children and then reads from a message queue. The children execute
the demo-proclb-child script, which reads a record from disk,
spins in a busy loop to simulate a CPU-intensive calculation, then
writes to the message queue to inform the parent it has finished
that record. Each child processes an assigned number of records
before exiting.
The master uses the information it reads from the message queue
to track each child's progress and to display a bar graph showing
the total number of records processed per second. This gets particularly
interesting when process load balancing is turned on. Suddenly the
performance increases dramatically. The more processors that are
in the cluster, the higher the performance goes.
To visualize this process load balancing, run the third script
(demo-proclb-monitor). For each node, it displays node number,
load averages, and the PIDs of any demo-proclb-child processes
running on that node. If you start demo-proclb-monitor before
activating process load balancing, you will see all children on
the same node as their parent. After activating load balancing,
they are migrated to other nodes until all load averages are approximately
the same. Neither demo-proclb nor its children are aware
of this load balancing. The children continue to read their records
from disk and write to the message queue without any problems.
Another experiment to try is to power-cycle one of the nodes (do
not power-cycle the node on which demo-proclb is running).
The children on that node die with the power failure, but within
moments demo-proclb forks new children to take their place.
This is possible because demo-proclb is keeping track of
each child's progress. But wait a second! If demo-proclb
is unaware of the cluster, how does it know a node died? It does
not know. It only knows that some of its children died, through
the standard Linux mechanisms of receiving a SIGCHLD and
doing wait() system calls.
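The parent-side logic is easy to reproduce with nothing but the shell.
The sketch below is not taken from the demo scripts; it simply shows the
pattern demo-proclb relies on, which is to notice through wait() that a
child exited and to fork a replacement:

#!/bin/sh
# Minimal respawn loop: the parent never asks which node anything ran on;
# it only learns through wait that its child exited, then starts another.
while true; do
    sleep 60 &       # stand-in for a worker child
    wait $!          # returns when that child exits, for any reason
    echo "child exited; forking a replacement"
done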
This set of scripts offers a convincing example of the scalability
and availability provided by OpenSSI clustering. What about another
example of manageability? Try sending a kill signal to the demo-proclb
process group:
(any node)
# ps | grep demo-proclb
66543 tts/0 00:00:00 demo-proclb
# kill -- -66543
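To convince yourself that nothing survived anywhere in the cluster, repeat
the cluster-wide ps from any node; at most, you may catch the grep itself
in the listing:

(any node)
# ps | grep demo-proclb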
Despite the fact that demo-proclb and its many children are
scattered throughout the cluster, that one simple (and unmodified) kill
command takes them all down. If you desire manageability, availability, and scalability
in one clustering solution, OpenSSI may be just what you are seeking.
Conclusion
The OpenSSI project will continue to integrate with other open
source and proprietary software to increase the value of existing
solutions by improving their manageability, availability, and scalability.
By providing the basic building blocks for all cluster types, the
OpenSSI project also creates the potential for entirely new solutions
based on OpenSSI. In the future, it will simplify the development
of highly available and scalable applications, which will reduce
the cost of innovation in these critical markets.
OpenSSI is an open source project whose success depends on the
contributions of its community. If its vision for the future of
clustering is something you would like to support, please consider
helping by submitting detailed bug reports, writing documentation,
fixing bugs, or developing new features. You can send email to
ssic-linux-devel@lists.sourceforge.net to express your interest.
Acknowledgements
The authors thank Bruce J. Walker of HP's Office of Strategy
and Technology for his review and advice in preparing this article.
References
Pfister, Gregory F. In Search of Clusters. Prentice Hall,
1998, ISBN 0-13-899709-9.
Bookman, Charles. Linux Clustering. New Riders, 2003, ISBN
1-57870-274-7.
Walker, Bruce J. "Open Single System Image (OpenSSI) Linux
Cluster Project", a technical whitepaper available on the OpenSSI.org
Web site.
The OpenSSI home page -- http://OpenSSI.org
The Etherboot home page -- http://Etherboot.org
The OpenGFS home page -- http://OpenGFS.sf.net
The Lustre home page -- http://Lustre.org
The LVS home page -- http://LinuxVirtualServer.org
Richard Ferri is a Senior Programmer at IBM's Linux Technology
Center, where he contributes to and writes about open clustering solutions.
He studied at Georgetown University with a concentration in English.
He lives in upstate NY with his wife, Pat, three teenaged sons, and
an ever-changing cast of critters.
Brian J. Watson is an OS Developer for HP's Office of
Strategy and Technology, where he designs features and writes documentation
for the OpenSSI Clustering Project. He majored in Astronautical
Engineering at Penn State University. He enjoys hiking mountain
trails around Los Angeles and savoring fine beer at his local home-brewing
club.