Introducing the OpenSSI Project
Richard Ferri and Brian J. Watson
Although Linux clusters are often equated with High Performance
Computing (HPC) clusters, there are, in fact, many different types
of Linux clusters. In addition to the well-known HPC cluster type,
Bruce Walker, in the technical paper "Open Single System Image
(OpenSSI) Linux Cluster Project", lists five other cluster types:
1. Load leveling -- Clusters that move processes from highly
loaded nodes to more lightly loaded nodes, as in openMosix (see
related article "The Secrets of openMosix" in this issue)
2. Web serving -- Typically a cluster with one or more front-end
nodes sending HTTP requests to back-end Web serving nodes
3. Storage -- Clusters that provide coherent views of modern
cluster-wide file systems
4. Database -- Clusters that provide coherent access to databases
5. High Availability (HA) -- Clusters that provide resource
redundancy and failover in case of failure
Not that there is any definitive list of Linux cluster types,
mind you. In Linux Clustering, Charles Bookman mentions another
type of cluster, the distributed cluster, in which loosely coupled
workstations cooperate to solve grand problems, such as the search
for extraterrestrials. However, Bookman groups MOSIX (and by association,
openMosix) in with the distributed class, and puts Web-serving solutions
in the load-leveling class. Walker puts MOSIX in the load-leveling
class and Web serving in a class by itself. To muddy the waters
further, Greg Pfister, in In Search of Clusters, classifies
clusters as "a subparadigm of the parallel/distributed realm".
But wait, didn't Bookman classify distributed computing as a type
of cluster? So, are clusters subcategories of distributed and parallel
computing, or is distributed computing a type of cluster?
Clearly, fitting the different types of clusters into the
greater realm of computing is a slippery slope. However, regardless
of how many ways we might classify clusters, one element of clustering
remains constant -- there is not nearly enough consideration
given to sharing elements of solutions among the cluster segments.
One exception is the Open Cluster Framework (OCF, http://www.opencf.org),
a group that has ambitiously taken on the mission of defining APIs
that pertain to both HA and HPC clusters. For example, the OCF defines
a "node liveness" service, which tells members of the
cluster which nodes are currently up and which are non-responsive.
But, if you think about it, wouldn't a node liveness function
be useful in Web serving as well? And aren't Linux and open
source all about sharing code to minimize re-invention? Well, finally
there is a project that addresses the general needs of all types
of clusters -- the OpenSSI Linux Cluster project.
Overview
The OpenSSI project (http://www.OpenSSI.org) is an open
source clustering project that addresses three major issues that
cut across all the various cluster segments: high availability,
scalability, and manageability. The project recently released its 1.0
version, which is available under the GPLv2 license. What distinguishes
the OpenSSI project in the Linux space is its ambition and scope
-- it intends to become the definitive project that unites all
the Linux cluster factions. It's ambitious enough that, to
my knowledge, such an inclusive, sweeping approach to clustering
has never before been attempted.
When Pfister discusses Single System Image (SSI), he describes
layers and levels of SSI, depending on how many subsystems and resources
cooperate to create the illusion of a single system, comprising
perhaps hundreds or thousands of nodes, that provides the look and
feel of a single workstation (albeit one on steroids). In an attempt
to achieve true SSI that cooperates at many different levels, the
OpenSSI team has analyzed and dissected clusters into their component
parts to provide a set of SSI building blocks at the system call
level from which any type of cluster can be implemented. These SSI
building blocks, some developed by the OpenSSI team and some
borrowed from the open source community, give each process
a cluster-wide view of all the available cluster resources.
The building blocks consist of the following components:
1. CLuster Membership Service (CLMS) -- The role of CLMS is
to keep an accurate accounting of the state of each node (up,
down, going down, and so on), to make sure each node has the same
view of the state of all the other nodes in the cluster, and finally
to provide APIs that make this information available (a small shell
sketch of consuming this information appears after this list). Like all the
kernel subsystems, CLMS relies on the ICS (Inter-node Communication
Subsystem), which provides a method for the nodes to exchange information.
2. Clusterwide Filesystems -- Mounts are automatically distributed
to all nodes, which ensures a consistent view of the filesystem
tree across the cluster. Each mounted filesystem can be accessed
via one of several methods:
(a) Cluster FileSystem (CFS) -- Stacks over a physical filesystem,
such as ext3 or XFS, to provide transparent cluster-wide access
to that filesystem from any node in the cluster. CFS is similar
to NFS, except it is highly coherent, supports client-at-server
access, and can transparently failover in the event of server failure:
- Coherency -- Any node sees a change to a file or the directory
structure as soon as it happens.
- Client-at-server access -- The server node for a filesystem
accesses it via the same coherency mechanisms as any other node.
- Transparent failover -- Another node can automatically
become the server for a filesystem, if it is physically connected
to the storage device.
(b) Global FileSystem (GFS) -- A parallel physical filesystem
that allows direct access to a storage device from any node in the
cluster. The storage device must be physically connected to all nodes
with a storage area network (SAN), which limits its applicability
to smaller clusters. The open source version of GFS is maintained
by the OpenGFS project (http://OpenGFS.sf.net).
(c) Lustre -- A network attached storage (NAS) filesystem
that "stripes" across multiple servers. It is useful for
very large clusters, because it eliminates the performance bottleneck
of a single CFS server without requiring a storage device to be
physically connected to every node. This filesystem is maintained
by the Lustre project (http://www.Lustre.org).
(d) Special filesystems -- OpenSSI has built-in cluster-wide
support for special filesystems, such as devfs and procfs.
3. Cluster-wide Process Subsystem -- Provides a view of all
processes from all nodes in the cluster, through both system calls
and /proc/<PID>. This allows commands like ps
and gdb to work cluster-wide without modification. It also
provides tools to easily migrate processes in mid-execution, either
manually or via automatic load balancing.
4. Cluster-wide Inter-Process Communication (IPC) -- Provides
a cluster-wide address space for each Linux IPC mechanism (pipes,
Unix domain sockets, semaphores, etc.), and allows IPC to work transparently
between processes on different nodes.
5. Cluster-wide Devices Subsystem -- Provides a naming method
for all the devices in the cluster, and easy access to any device
from any node.
6. Cluster-wide Networking Subsystem -- Provides a single
IP address to the external world, and load balances TCP connections
among the cluster nodes. Much of this functionality is derived from
the Linux Virtual Server project (http://LinuxVirtualServer.org).
7. Distributed Lock Manager -- Runs on each node to enable
various forms of cache coherency.
8. System Management -- The goal of System Management is to
rely on familiar commands, slightly modified, rather than on a new
set of cluster commands. System Management also makes adding
new nodes to the cluster trivial.
9. Application/Resource Monitoring and Failover/Restart --
A subsystem that takes certain user-defined actions when a node
fails or an application stops.
Taken collectively, these components provide a single system image
look and feel across different resources and provide a set of building
blocks for all the different cluster types (much of this information
is taken from the Walker paper -- refer to it for a complete
discussion).
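As promised in the CLMS description above, here is a minimal shell sketch
of how a script might consume the membership information. It is our own
illustration, not something taken from the OpenSSI documentation; it assumes
only the cluster command demonstrated later in this article, which prints
one "node: state" line per node:

#!/bin/sh
# Hypothetical membership watcher: report whenever the cluster's view of
# node states changes. Assumes "cluster -v" prints one line per node.
prev=""
while true; do
    now=$(cluster -v)
    if [ "$now" != "$prev" ]; then
        echo "membership changed:"
        echo "$now"
        prev="$now"
    fi
    sleep 5      # poll every five seconds
done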
Installing an OpenSSI Cluster
The minimum hardware for a basic OpenSSI cluster is two or more
computers connected with an Ethernet network. For security reasons,
this network should be private. Any node that has a second NIC can
use it to connect to external networks or the Internet.
OpenSSI can run on several Linux distributions, including Red
Hat and Debian. The focus of this discussion is on Red Hat, so install
the latest version of it on your first node. You do not need to
install a distribution on any other machine. This is because OpenSSI
shares the first node's filesystems with the rest of the cluster
(via CFS) -- one reason why it is so easy to administer an OpenSSI
cluster. Nevertheless, there are a few small restrictions on how
you should configure your installation. You also need a few packages
that may not be installed by default, such as dhcp and tftp-server.
See OpenSSI.org for more details.
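For example, on a Red Hat system you can check for the two packages
mentioned above with an ordinary rpm query (nothing here is OpenSSI-specific);
if either one is reported as not installed, add it from your distribution
media before proceeding:

(node 1)
# rpm -q dhcp tftp-server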
Next, download the latest RPMs for OpenSSI. They can be found
together in a tarball on OpenSSI.org. Login as root, extract the
tarball, and run the provided installation script. It will ask you
a few questions about how you want to configure your first node,
then the script will begin installing packages.
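The file and script names change from release to release, so the following
is only a sketch; substitute the actual tarball name you downloaded and run
whatever installation script its README points you to:

(node 1)
# tar xzf openssi-<version>.tar.gz
# cd openssi-<version>
# ./<installation-script>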
The process starts by replacing a few Red Hat packages with OpenSSI-enhanced
versions. They have extra features when you are running on an OpenSSI
kernel, and behave normally when you are not. These extra features
include:
- init -- Is highly available and can start daemons and
gettys on nodes joining the cluster
- mount -- Understands which node has what filesystem
(or swap device)
- mkinitrd -- Can build the special ramdisk required
for booting a node into the cluster
The script also installs new packages that provide commands and
APIs for such things as:
- Examining the membership state of cluster nodes
- Migrating processes on the fly
- Remotely executing and forking processes
- Configuring automatic load balancing of processes and TCP connections
- Simplifying cluster installation and management
- Automatically restarting critical applications that die
Finally, the script installs the OpenSSI kernel package, which
contains most of the clustering code. It is based on the Red Hat
kernel shipped with your distribution, but it is modified so that
system calls can transparently access remote objects (processes,
filesystems, devices, sockets, semaphores, etc.) on other nodes
in the cluster. Because every object is guaranteed to have a unique
name and can be transparently accessed as if it were local, unmodified
programs running on an OpenSSI kernel are tricked into believing
that the cluster is a single, giant machine.
Once the install script is done running, you can simply reboot
and select OpenSSI at the bootloader prompt. Congratulations! You
now have a one-node OpenSSI cluster.
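At this point, a quick sanity check (using commands described in more
detail later in this article) should show an OpenSSI kernel and a one-node
membership list; the kernel version string below is only a placeholder:

(node 1)
# uname -r
<your OpenSSI kernel version>
# cluster -v
1: UP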
Adding a New Node
Adding a new node is easy. Most of the work is done by the openssi-config-node
command. It has an interactive interface to help you add nodes,
remove nodes, and change their configurations.
To add a node, bring it up on the private network with a network
booting protocol such as PXE or Etherboot (http://www.Etherboot.org).
The node will display its network hardware address (MAC address)
on its console. Run openssi-config-node on the first node,
tell it you want to add a new node, and select the appropriate MAC
address from a list. Assign the node a unique number (such as 2)
and IP address. After a few moments, the node will boot and join
the cluster. That's all there is to it. You can add more than
a hundred nodes following this procedure, although a more automated
system is under consideration for clusters that large.
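Once openssi-config-node finishes, you can verify that the newcomer has
joined by checking the membership state from the first node (the same
check used in the examples later in this article):

(node 1)
# cluster -v
1: UP
2: UP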
For each new node, you can configure special per-node features.
These features include:
1. Local swap device -- Each node with local storage can use
it for swap.
2. Local boot device -- Each node with local storage can boot
from it, rather than doing a network boot.
3. Additional CFS-stacked filesystem -- Each node with local
storage can share a filesystem on it with the rest of the cluster
via CFS.
4. External network interface -- Each node with a secondary
network interface can use it to connect to external networks.
See OpenSSI.org for more details.
Configuring Optional Features
The instructions I've provided will help you set up a basic
OpenSSI cluster, without any special HA or scalability features.
On the OpenSSI.org Web site, you can find supplemental documentation
for configuring optional features, such as:
1. High-availability CFS (HA-CFS) -- Enables transparent failover
of a CFS-stacked filesystem between nodes that are physically connected
to a shared storage device.
2. Direct-access GFS -- A parallel physical filesystem described
previously.
3. High-scaling Lustre -- A "striped" NAS filesystem
described previously.
4. Process load balancing -- Automatically migrates processes
from overloaded nodes to underloaded nodes. This feature is provided
by openMosix node selection algorithms combined with OpenSSI's
process migration architecture.
5. TCP connection load balancing -- A Cluster Virtual IP (CVIP)
address can be configured to automatically distribute connection
requests among all nodes that service a given TCP port.
6. HA CVIP address -- A CVIP address can be automatically
hosted on a new node if its original host node goes down.
7. HA application monitoring and restart -- A utility package
that can be used to automatically monitor applications and restart
them if they die. The monitoring and restart daemon stores its state
on disk so that it can be restarted by init without forgetting what
it was watching.
Playing Around with OpenSSI
The following examples demonstrate some of the features described
above. They assume you have a basic two-node cluster. To get started,
log in to both consoles as root. Confirm that both nodes have joined
the cluster:
(node 1)
# cluster -v
1: UP
2: UP
Also confirm the node number for each console:
(node 1)
# clusternode_num
1
(node 2)
# clusternode_num
2
These are some of the commands used for examining the membership state
of cluster nodes. Read the OpenSSI documentation for more information
about these and related commands.
Now it is time to do something really interesting.
Clusterwide Filesystems
Start a vi session on the first node (or emacs if
you prefer):
(node 1)
# vi /tmp/newfile
Type a few lines:
Hello World!!
Beautiful day, isn't it?
Save the file, but do not quit. On the second node's console,
log in as root and view the contents of your new file:
(node 2)
# cat /tmp/newfile
The output is the same text you entered on the first node. This demonstrates
that the filesystems are shared coherently across the cluster. Whether
you are creating, editing, moving, or deleting files and directories,
any change you make on one node is immediately visible to all other
nodes. Feel free to try other experiments with cluster-wide filesystems.
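For example, directory operations are just as coherent as file contents,
and nothing OpenSSI-specific is needed to see it:

(node 2)
# mkdir /tmp/shared-test
# touch /tmp/shared-test/hello

(node 1)
# ls /tmp/shared-test
hello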
Cluster-wide Processes
Remember that you did not quit vi in the previous section.
That is because we are going to play some games with cluster-wide
processes. Look for the vi session's PID from the second node's console:
(node 2)
# ps | grep vi
66123 tts/0 00:00:00 vi
The ps command is unmodified and unaware of the cluster, yet
it is able to see a process running on node 1. How is this possible?
It is because the OpenSSI kernel can transparently access remote objects,
such as a process on node 1, to create the illusion that the cluster
is a single, giant machine.
You can confirm the location of the process with a special OpenSSI
command:
(node 2)
# where_pid 66123
1
One more experiment before we move on:
(node 2)
# kill 66123
Like ps, the kill command is unmodified and unaware
of the cluster. Nevertheless, the vi session on node 1 is killed.
Not only are all processes visible throughout the cluster, but they
can be signaled from anywhere in the cluster. All processes appear
to be local.
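You can run the same experiment in the opposite direction with any
long-running process. Only where_pid is OpenSSI-specific; the PID printed
by the shell's job control will, of course, differ on your system:

(node 2)
# sleep 600 &
[1] <PID>

(node 1)
# where_pid <PID>
2
# kill <PID>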
Clusterwide IPC
Why stop there? IPC objects are visible cluster-wide, too:
(node 2)
# mkfifo /tmp/fifo
(node 1)
# echo 'Hello World!!' >/tmp/fifo
(node 2)
# cat /tmp/fifo
Hello World!!
At the risk of sounding like a broken record, I will say that the
mkfifo, echo, and cat commands are also unmodified
and unaware of the cluster. The OpenSSI kernel supports truly cluster-wide
IPC objects, including pipes, FIFOs, Unix domain sockets, semaphores,
message queues, SysV shared memory, and memory mapped files.
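There is no convenient shell one-liner for creating a System V object on
a distribution of this vintage, but once an application has made one (the
load-balancing demo described in the next section uses a message queue,
for instance), the stock ipcs utility gives you a way to look for it from
another node. Whether the cluster-wide IPC namespace shows up in ipcs
output is our assumption, not something taken from the OpenSSI documentation:

(any node, while such an application is running)
# ipcs -q

If the assumption holds, the same message queue is listed no matter which
node you run the command on.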
Process Load Balancing
Process load balancing is neatly demonstrated by a demo package
available on OpenSSI.org. It was first shown at Linux World San
Francisco in 2002, where OpenSSI won the Best Open Source Project
award for the conference. It consists of three Perl scripts and
a detailed README. The first two scripts (demo-proclb and
demo-proclb-child) are unaware of the cluster and can run
on any Linux system. The demo-proclb script forks off a number
of children and then reads from a message queue. The children execute
the demo-proclb-child script, which reads a record from disk,
spins in a busy loop to simulate a CPU-intensive calculation, then
writes to the message queue to inform the parent it has finished
that record. Each child processes an assigned number of records
before exiting.
The master uses the information it reads from the message queue
to track each child's progress and to display a bar graph showing
the total number of records processed per second. This gets particularly
interesting when process load balancing is turned on. Suddenly the
performance increases dramatically. The more processors that are
in the cluster, the higher the performance goes.
To visualize this process load balancing, run the third script
(demo-proclb-monitor). For each node, it displays node number,
load averages, and the PIDs of any demo-proclb-child processes
running on that node. If you start demo-proclb-monitor before
activating process load balancing, you will see all children on
the same node as their parent. After activating load balancing,
they are migrated to other nodes until all load averages are approximately
the same. Neither demo-proclb nor its children are aware
of this load balancing. The children continue to read their records
from disk and write to the message queue without any problems.
Another experiment to try is to power-cycle one of the nodes (do
not power-cycle the node on which demo-proclb is running).
The children on that node die with the power failure, but within
moments demo-proclb forks new children to take their place.
This is possible because demo-proclb is keeping track of
each child's progress. But wait a second! If demo-proclb
is unaware of the cluster, how does it know a node died? It does
not know. It only knows that some of its children died, through
the standard Linux mechanisms of receiving a SIGCHLD and
doing wait() system calls.
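The parent-side logic is easy to reproduce with nothing but the shell.
The sketch below is not taken from the demo scripts; it simply shows the
pattern demo-proclb relies on, which is to notice through wait() that a
child exited and to fork a replacement:

#!/bin/sh
# Minimal respawn loop: the parent never asks which node anything ran on;
# it only learns through wait that its child exited, then starts another.
while true; do
    sleep 60 &       # stand-in for a worker child
    wait $!          # returns when that child exits, for any reason
    echo "child exited; forking a replacement"
done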
This set of scripts offers a convincing example of the scalability
and availability provided by OpenSSI clustering. What about another
example of manageability? Try sending a kill signal to the demo-proclb
process group:
(any node)
# ps | grep demo-proclb
66543 tts/0 00:00:00 demo-proclb
# kill -- -66543
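To convince yourself that nothing survived anywhere in the cluster, repeat
the cluster-wide ps from any node; at most, you may catch the grep itself
in the listing:

(any node)
# ps | grep demo-proclb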
Despite the fact that demo-proclb and its many children are
scattered throughout the cluster, that one simple (and unmodified) kill
command takes them all down. If you desire manageability, availability, and scalability
in one clustering solution, OpenSSI may be just what you are seeking.
Conclusion
The OpenSSI project will continue to integrate with other open
source and proprietary software to increase the value of existing
solutions by improving their manageability, availability, and scalability.
By providing the basic building blocks for all cluster types, the
OpenSSI project also creates the potential for entirely new solutions
based on OpenSSI. In the future, it will simplify the development
of highly available and scalable applications, which will reduce
the cost of innovation in these critical markets.
OpenSSI is an open source project whose success depends on the
contributions of its community. If its vision for the future of
clustering is something you would like to support, please consider
helping by submitting detailed bug reports, writing documentation,
fixing bugs, or developing new features. You can send email to
ssic-linux-devel@lists.sourceforge.net to express your interest.
Acknowledgements
The authors thank Bruce J. Walker of HP's Office of Strategy
and Technology for his review and advice in preparing this article.
References
Pfister, Gregory F. In Search of Clusters. Prentice Hall,
1998, ISBN 0-13-899709-9.
Bookman, Charles. Linux Clustering. New Riders, 2003, ISBN
1-57870-274-7.
Walker, Bruce J. "Open Single System Image (OpenSSI) Linux
Cluster Project", a technical whitepaper available on the OpenSSI.org
Web site.
The OpenSSI home page -- http://OpenSSI.org
The Etherboot home page -- http://Etherboot.org
The OpenGFS home page -- http://OpenGFS.sf.net
The Lustre home page -- http://Lustre.org
The LVS home page -- http://LinuxVirtualServer.org
Richard Ferri is a Senior Programmer at IBM's Linux Technology
Center, where he contributes to and writes about open clustering solutions.
He studied at Georgetown University with a concentration in English.
He lives in upstate NY with his wife, Pat, three teenaged sons, and
an ever-changing cast of critters.
Brian J. Watson is an OS Developer for HP's Office of
Strategy and Technology, where he designs features and writes documentation
for the OpenSSI Clustering Project. He majored in Astronautical
Engineering at Penn State University. He enjoys hiking mountain
trails around Los Angeles and savoring fine beer at his local home-brewing
club.