The Secrets of openMosix
Richard Ferri
The ultimate promise of clusters is one of inexpensive,
scalable compute power fueled by commodity hardware. openMosix expands
that promise to include open source software. Luckily for programmers
and administrators, openMosix fulfills the promise of an open, scalable,
commodity cluster solution.
For those of you who have heard of Mosix, but are wondering what
openMosix is, I'll provide a bit of background. The Mosix project,
headed by Professor Amnon Barak, began in 1977 at the Hebrew University
of Jerusalem. The goal of Mosix is to transform an unwieldy and
perhaps heterogeneous pile of computers into a single efficient
cluster computing resource. This transformation is based on a set
of kernel changes, which provide a single system image cluster,
and a scheduler, which deftly moves processes to the various nodes
of the cluster based on a "least loaded" algorithm.
The Mosix project underwent a serious schism in January 2002,
which spawned a new generation of Mosix -- the openMosix project.
In a dispute over the "commercial future" of Mosix, Dr.
Moshe Bar, the Mosix co-project manager since 1999, started a new
company (Qlusters, Inc.) and forked the Mosix project into the new
offering, openMosix. openMosix has quickly caught on in the clustering
space, consistently at or near the top of the most popular clustering
projects list at SourceForge.
Although the long legacy of Mosix spans more than two decades,
with roots as far back as the PDP-11/45 running Bell Labs UNIX 6,
and offshoots into VAX machines and BSD UNIX, this article will
address only the Linux version of openMosix on x86 machines.
Single System Image
To fully appreciate the beauty and value of openMosix, we must
understand that big term in the opening paragraphs, the "single
system image" (SSI) cluster. In the book In Search of Clusters,
Gregory Pfister defines SSI as "the illusion, created by software
or hardware, that a collection of computing elements is a single
computing resource". Pfister introduces the idea that there
are levels of SSI in a cluster, that some subsystems may recognize
the cluster as SSI, and some may not. The user perspective comes
into play as well. For example, when a user goes online to order
a book from amazon.com, he sees amazon.com as a single computing
resource, a single image, without seeing the underlying infrastructure
that delegates his request to an individual server. So when referring
to openMosix as an SSI cluster, we must talk about how openMosix
has addressed the various shades of SSI, specifically from the perspective
of:
1. The process
2. The user
3. The filesystem
In Unix, a program is an executable file, and a process is an
instance of the program in execution (see Bach). We can consider
a process space as a collection of all the information about the
running processes on a system. In a traditional, non-SSI environment,
every machine would have its own individual process space, and every
space would have to be managed separately. In openMosix, there is
the concept of a Unique Home Node, or UHN. When a process is spawned
in an openMosix cluster, it must originate on a single node, and
that node is known as the UHN. From the UHN, the process can be
migrated to any other less heavily loaded node in the cluster. However,
from the perspective of the UHN, an openMosix cluster forms a single,
cluster-wide process space. All the processes that originate from
a UHN, and are later migrated to other nodes in the cluster, can
be administered directly from the UHN just as if the migrated processes
were local to the UHN. This administration of remote processes from
their UHN is transparent to the end user, and creates a single system
illusion with regard to the process space.
To distribute the workload across a cluster, the openMosix kernel
itself makes the decision to migrate the process from a node that's
more heavily loaded to one that's less loaded. In a homogeneous
cluster, where all nodes have the same compute power in terms of
CPU speed and memory, the decision to migrate a process can be fairly
straightforward. One of the features of openMosix is that the openMosix
kernel tries to assess the "horsepower" of each node in
a heterogeneous cluster, where nodes have different speeds and amounts
of memory. Admittedly, combining disparate factors like CPU speed
and amounts of memory to arrive at an overall horsepower rating
for the node is a bit of educated guesswork. However, this per-node
rating allows the openMosix kernel to make an informed decision about
when an individual process should be migrated to a less heavily loaded node.
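openMosix exposes this per-node bookkeeping through the /proc filesystem,
so you can peek at the numbers the kernel is using. This is a quick sketch,
assuming the /proc/hpc layout described in the openMosix documentation
(entry names can vary between releases, so verify against your kernel):
cat /proc/hpc/nodes/1/load     # the load openMosix currently sees for node 1
cat /proc/hpc/nodes/1/speed    # the relative "horsepower" rating of node 1
cat /proc/hpc/nodes/1/mem      # the memory openMosix credits to node 1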
From the user perspective, a single cluster-wide process space
greatly simplifies process administration. Sending a signal to a
process, to kill it for example, does not require the user to know
the node on which the process is actually running. Equally important
is that the user does not have to manually balance the workload
across the cluster, based on the talents and amount of work of each
individual node.
openMosix dynamically balances the cluster workload by moving
work from heavily loaded nodes to more lightly loaded nodes. From
the user's perspective, this migration is completely transparent.
With regard to process management, the end user administers an openMosix
cluster just as he would a single workstation.
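In practice, this means the ordinary process-management commands behave
exactly as they would on a single workstation. For example, from the UHN
(the PID 1234 below is just a placeholder for one of your own processes):
ps -ef           # migrated processes still appear in the UHN's process list
kill -TERM 1234  # placeholder PID; the signal reaches the process even if it has migrated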
Besides process management, another aspect of SSI that is evident
to the user is how the cluster adjusts to the addition of new nodes.
To be completely SSI, the addition of a node would be transparent
to the end user -- the new node would seamlessly add horsepower
to the running cluster. One of the new features of openMosix is
an auto-discovery daemon that allows new nodes to join the cluster
without intervention of the administrator. When a new node is booted,
the auto-discovery daemon (omdiscd) uses multicast packets to inform
the other nodes that the new node is joining the cluster (see Buytaert).
Once the new node is evaluated for its horsepower and load, it is
assigned work based on the needs of the cluster. Since the addition
of new nodes is trivial, we can say that openMosix is very close
to the ideal in maintaining SSI when changing the node composition
of the cluster.
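Bringing up auto-discovery on a node is essentially a one-liner. The sketch
below assumes omdiscd accepts an interface argument as described in the
HOWTO, so check the options shipped with your version:
omdiscd -i eth0        # announce this node to the cluster over multicast on eth0
If the announcements never arrive, a multicast route on that interface may
be missing, for example:
route add -net 224.0.0.0 netmask 240.0.0.0 dev eth0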
By now you're probably wondering, how does this cleverly
migrated process get access to its filesystems? After all, if a
process creates a temporary file on one node, the process would
have to maintain access to that file even if the process is migrated
to another node. For example, if a process running on node 1 creates
a file /tmp/foo, and then the process is migrated to node 2, how
does the process access the /tmp/foo file on node 1? openMosix has
several alternative solutions for remote file access.
The simplest (and least efficient) approach that openMosix
takes to file access is that the openMosix kernel running on the
remote node will intercept all I/O requests from the migrated process
running on the remote node and send them to the UHN. As you can
imagine, doing all file access over the network invariably slows
down a remote process. The second approach is for the user to create
a consistent cluster-wide view of the data using Network File System
(NFS). Not only is maintaining a consistent view of all the filesystems
(including permissions and owners) a painful process, it again forces
all I/O for a migrated process to be performed over the network.
NFS has several traditional problems, the greatest of which is a
lack of cache consistency -- if two processes are writing to
the same file at the same time, they basically slug it out to see
whose writes actually get written out to the file. In clustering,
this is known as bad karma.
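For completeness, the NFS approach looks roughly like this -- a sketch only,
with the export path, hostname, and network chosen purely for illustration,
and with all of the caveats above still applying:
echo "/home 192.168.0.0/255.255.255.0(rw,sync)" >> /etc/exports   # on the exporting node
exportfs -a                                                       # on the exporting node
mount -t nfs mini.pok.ibm.com:/home /home                         # on every other node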
openMosix provides a much more elegant solution than either the
openMosix kernel redirecting the I/O, or NFS. The current openMosix
solution for file access is oMFS, or the openMosix File System.
oMFS is a modern cluster-wide filesystem that is administered similarly
to NFS, but with SSI features that include cache, timestamp, and
link consistency. These features ensure that regardless of how many
remote processes are writing to a single file at the same time,
data integrity and file information are preserved. openMosix has
another feature called DFSA, or Direct File System Access, which
is based on a cluster filesystem such as oMFS. With the DFSA feature
installed, openMosix determines whether it would be more efficient to
keep executing a migrated process remotely, or to migrate it back
to the node that actually contains the data -- in other words,
move the process to the data, instead of the data to the process
(see Bar). With the advent of modern cluster filesystems like oMFS,
and the performance enhancement of DFSA, we can say that openMosix
is SSI with regard to filesystems, and provides superior performance
with DFSA.
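To use oMFS, the corresponding kernel options (oMFS and DFSA) must be turned
on when you build the kernel, and the filesystem must be mounted on each node.
The fstab entry below is a sketch based on my reading of the openMosix HOWTO --
treat the mfs_mnt/dfsa syntax and the /mfs convention as assumptions to verify
against your release:
mfs_mnt    /mfs    mfs    dfsa=1    0 0
Once mounted, the files of node 1 show up under /mfs/1, node 2 under /mfs/2,
and so on, so the /tmp/foo example above would be reachable from node 2 as
/mfs/1/tmp/foo.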
Now that I've examined openMosix as an SSI solution from
the perspective of the process, the user, and the filesystem, I'll
look at how to implement an openMosix cluster.
Installing an openMosix Cluster
In this section, I'll provide an overview of how to build
your own openMosix cluster. This article is not exhaustive in its
treatment of installing openMosix; refer to the excellent HOWTO
maintained by Kris Buytaert listed at the end of this article for
further details and the latest information.
There are actually two pieces to openMosix -- the openMosix
kernel and the userland tools. The openMosix kernel has been modified
to provide the Single System Image I talked about previously, and to
provide the openMosix File System (oMFS). The userland tools provide all the executables
you need to form the cluster and manage migrated processes. You
have two choices when building your openMosix cluster -- either
installing the openMosix kernel and user tools from RPM, or downloading
the source and building your own openMosix kernel and user tools.
The project documentation says openMosix is supported on Red Hat,
Suse, and Debian distributions. I tried openMosix under Mandrake
9.0, and it installed and ran without a problem.
It's certainly easy to install openMosix directly from RPM.
The RPMs are on the sourceforge openMosix home page (http://www.sf.net/projects/openmosix,
follow the Files link). From there you can download the appropriate
kernel RPM (e.g., openmosix.kernel-2.4.20-openmosix2.i386.rpm) and
the user tools (e.g., openmosix-tools-0.2.4-1.i386.rpm). Installing
the kernel RPM will not update /etc/lilo.conf (the lilo loader configuration
file), so you'll have to add a stanza for the new openMosix
kernel, and run lilo. Just run the rpm command against the
openmosix-tools RPM to install the commands in the correct directories,
and you're ready to start up your openMosix cluster.
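Concretely, using the example file names above, the RPM route is just two
commands (the exact file names will differ from release to release), plus
the lilo.conf update already mentioned:
rpm -ivh openmosix.kernel-2.4.20-openmosix2.i386.rpm    # the openMosix kernel
rpm -ivh openmosix-tools-0.2.4-1.i386.rpm               # the userland tools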
One of the drawbacks, however, of installing directly from RPMs
is that the user must accept the kernel options that the openMosix
RPM was originally built with -- for example, process migration
is on by default, but oMFS is off. If you want a different set of
kernel options, you'll have to build the kernel and the user
tools yourself. If you haven't built a Linux kernel before,
it's not all that scary. You basically download the source
for both the kernel and user tools, and follow a recipe something
like this:
1. Install your favorite Linux distribution. (I chose Mandrake
9.0, with network clients and development packages included.)
2. Download the kernel source from kernel.org and copy it to /usr/src.
(I chose the 2.4.20 version, linux-2.4.20.tar.bz2.)
3. Download the openMosix kernel patches from the Files link on
the sourceforge homepage (from http://www.sf.net/projects/openmosix)
and copy it to /usr/src.
4. cd to /usr/src, and untar the kernel source:
tar -jxvf linux-2.4.20.tar.bz2
5. cd to linux-2.4.20 and apply the patches to the
kernel source:
zcat .../openMosix-2.4.20-2.gz | patch -p1
6. make xconfig, modify any openMosix settings, and save
and exit to create a .config file (the list of all the kernel flags).
7. Build the kernel:
make dep bzImage modules modules_install
8. Take the bzImage you just created and copy it to /boot:
cp `find . -name bzImage` /boot/openmosix-2.4.20
9. Update /etc/lilo.conf with a new stanza and run lilo.
This installs the openMosix kernel.
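For reference, the new lilo.conf stanza might look something like this
(root= is whatever your root partition happens to be; /dev/hda1 is only
an example):
image=/boot/openmosix-2.4.20
    label=openmosix
    root=/dev/hda1
    read-only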
Once the kernel is built and installed and lilo'ed, it's
time to build the user tools.
1. Download the user tools source archive from the Files link
on the sourceforge homepage (e.g., openMosixUserland-0.2.4.tgz).
2. Untar the archive:
tar -zxvf openMosixUserland-0.2.4.tgz
3. cd to openMosixUserland-0.2.4 and modify the
configuration file as necessary. The only place I had to modify
was the OPENMOSIX variable:
OPENMOSIX = /usr/src/linux-2.4.20
4. make all
The make all copies all the openMosix binaries to their
respective directories, ready to be executed.
Remember, these recipes are provided as examples. The definitive
methods for building and installing the kernel and the user tools are
in the READMEs included with the kernel source archive and the
userland tools archive, respectively.
Starting openMosix
To perform a meaningful test of openMosix, you really need a second
workstation, to which you've added the openMosix kernel you
just created, and once again updated lilo.conf and run lilo.
Note that if you copy only the kernel to the second node, and do
not copy over or rebuild the user tools, you will not have the necessary
openMosix commands to control the processes from the second node.
Since this is just a simple test, I copied over only the openMosix
kernel and decided to control all my processes from the original
node where I built the user tools.
Now that your nodes are rebooted and running the openMosix kernel,
it's time to tell openMosix about which nodes are in the cluster.
The cluster is formed using the setpe command, supplied with
the openMosix user tools. You can define the cluster from the command
line, or a simple file that tells openMosix the range of IP addresses
that will be included in the cluster, as in:
setpe -w -f /etc/hpc.map
where hpc.map contains:
1 mini.pok.ibm.com 2
which indicates that the cluster begins at node number 1, whose address
is mini.pok.ibm.com, and extends for 2 nodes. Since the IP address of
mini is 192.168.0.1, the address of
the second node would be 192.168.0.2. Once you've run the setpe
-w command on both nodes to define the cluster, you can run setpe
-r to verify that both nodes are indeed recognized by openMosix.
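Putting it together, defining the cluster on each node boils down to the
map file and two commands:
setpe -w -f /etc/hpc.map    # write the cluster definition from the map file
setpe -r                    # read it back to verify the membership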
Fun with Processes
Once you've built your cluster, configured openMosix, and
informed openMosix that the cluster has two nodes, it's time
to create a long running process that can be used as an example
of process migration. I stole the example from the openMosix HOWTO,
creating a file called testscript:
awk 'BEGIN {for(i=0;i<10000;i++)for(j=0;j<10000;j++);}'
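Make the script executable and launch a dozen copies in the background;
a simple shell loop will do (assuming testscript lives in the current
directory):
chmod +x testscript
for i in $(seq 1 12); do ./testscript & done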
After starting testscript a dozen times or so in the background (e.g.,
testscript &), the fun begins. Depending on what is running
on the two nodes in your cluster, openMosix will immediately start
migrating the processes associated with the multiple testscripts running
to the second node of the cluster. Since the migration process is
transparent, it will not be obvious that the processes have migrated.
If you check the load on both nodes of the cluster, using the mosctl
command (you'll have to copy or NFS mount the userland binaries
to the second node), you should see that the load on both cluster
nodes is similar:
mosctl getload
Both load numbers were around 580 on my cluster. An even better indicator
would be to run the simple monitor on the first node, using the mosmon
command. It will show graphically that the work is divided between
the two nodes. At this point, I decided to grab all the work and migrate
all the processes back to the original node, again using the mosctl
command:
mosctl bring
Immediately all the processes that were migrated to the second node
were re-migrated to the first node. Using mosctl getload showed
that the second node's load was nearly zero, and the load on
the first node nearly doubled, indicating that the processes had indeed
migrated home. Executing mosctl nolstay undoes the effect of
mosctl bring. Once mosctl nolstay is executed on the
original node, openMosix will distribute some portion of the work
to other nodes in the cluster. Again, this can be verified with the
mosctl getload command, and by watching the mosmon monitor.
At this point, we can be pretty certain that processes are being
migrated around the cluster, based on the load indicators, however,
we want to see some increased performance in the application. After
all, the real advantage of openMosix is a transparent increase in
processing power. To show a measurable performance increase, I wrote
a script called "doit" that ran our testscript described
above. doit looked something like this:
#!/bin/sh
date            # record the start time
testscript &    # start two instances of testscript in the background
testscript &
wait            # wait for both background instances to finish
date            # record the completion time
This script checks the date/time, starts two instances of the little
counting testscript in the background, and then waits for completion
of the child processes before displaying the time of completion. The
"wait" is very important -- otherwise the script just
starts the two instances of testscript in the background and exits
immediately, not waiting for completion. On the original node, I ran
mosctl bring to limit all processes to the single node, and
ran doit. It took 2:07. Then I ran mosctl nolstay to bring
the second node back into the cluster, and re-ran doit. This time,
doit took 1:07. Allowing for overhead, the second node cut our processing
time nearly in half, demonstrating that the second node did in fact
speed up the overall throughput of the cluster. Your mileage may vary.
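Condensed, the experiment looks like this (assuming doit has been made
executable; the times are from my two-node cluster and yours will differ):
mosctl bring      # confine the work to the local node
./doit            # baseline run: about 2:07
mosctl nolstay    # let openMosix migrate work again
./doit            # with migration: about 1:07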
Conclusion
In this article, I've examined how openMosix provides the
single system illusion to the end user with regard to process management
and node addition. I've also talked about how a migrated process
has data consistency as it's moved from node to node, through
the cluster filesystem oMFS. Yet, this information barely scratches
the surface regarding openMosix. This project has a wealth of associated
monitors and management tools including a full screen process management
GUI, openMosixview (http://www.openmosixview.com), and various
daemons to collect and analyze process history data. As the popularity
of openMosix grows in the commercial space, more sophisticated tools
will emerge over time.
In deciding whether to install an openMosix cluster at your installation,
it's important to realize that the smallest manageable unit
in openMosix is the process. An application that runs as a single
process will not benefit from running under openMosix. If the same
application is broken into several processes that can run in parallel,
these processes can be distributed throughout an openMosix cluster
to enhance performance (see Robbins). Ultimately, if you're
looking for a transparent, dynamic method of balancing workload
across a Linux cluster, and your workload can run as parallel processes,
openMosix may be the solution for you.
Acknowledgements
I thank Moshe Bar for his review and recommendations regarding
this article.
Resources
Bach, Maurice J. The Design of the UNIX Operating System.
Prentice Hall, 1986, ISBN 0-13-201799-7.
Bar, Moshe. openMosix Internals: How openMosix works. See: http://www.openmosix.sourceforge.net
Bookman, Charles. Linux Clustering. New Riders, 2003, ISBN
1-57870-274-7.
Buytaert, Kris. The openMosix HOWTO: http://howto.ipng.be/openMosix-HOWTO
Historical information on Mosix -- http://www.mosix.org
Pfister, Gregory F. In Search of Clusters. Prentice Hall,
1998, ISBN 0-13-899709-9.
Robbins, Daniel. "Advantages of openMosix on IBM xSeries".
See: http://www.ibm.com/servers/esdd/articles/openmosix.html
Richard Ferri is a Senior Programmer in IBM's Linux Technology
Center where he contributes to and writes about open source clustering
projects. He attended Georgetown University many years ago with
a concentration in English. He lives in upstate New York with his
wife, Pat, three teenaged sons, and an ever-changing cast of critters.
Rich can be reached at: rcferri@us.ibm.com