Building the Perfect Cluster: Choices that Can Save Your Sanity
Håkon Bugge
Clustering is today's hottest hardware trend, poised to ride
the wave of the "mainstream adoption" portion of the technology
acceptance curve. Scientific and academic organizations were the
first to embrace high-performance clustering several years ago,
moving away from monolithic SMP hardware systems to less expensive,
more flexible clusters of high-volume, rack-mounted, commodity servers
typically running Linux. Besides saving money on hardware, these
early adopters also dramatically reduced acquisition costs and license
fees for operating system software.
Now the mainstream business world is moving to clusters for similar
reasons. Because x86-based commodity servers running Linux are so
inexpensive, many corporate IT organizations are committing to clusters
to provide mission-critical computing horsepower.
The Potential Pitfalls of Clustering
The transition from SMP machines to clustering is not without
a few challenges. The biggest issue is that clusters often require
more administration than traditional SMP systems, since each component
of the cluster is still an individual server that needs to be configured,
loaded with operating system and application software, administered,
and maintained.
Within the cluster, additional layers of technology and software
provide necessary services such as operating system, hardware interconnect,
message passing, and cluster management software. These elements
must be dealt with, as well as the fact that although clusters themselves
are fault-tolerant, their individual nodes are not. If a node goes
down, it must be replaced or repaired.
Meanwhile, a cluster should be designed to handle the highly likely
event of node failure. This is not a rare situation: disks, power
supplies, network interfaces, and other node components are estimated
to fail after between 30,000 and 60,000 hours of use. When this happens,
the cluster has one node fewer to utilize in carrying out its duties.
Clearly, in a mission-critical business IT environment, the impacts
of node failure and administrative overhead become an important
design factor. Thinking through the various "what if"
scenarios in the design phase can save significant cost over the
cluster's lifetime.
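The arithmetic behind this "what if" thinking is straightforward. Here is a quick sketch; the 100-node cluster size and 45,000-hour component MTBF are illustrative assumptions chosen from within the range quoted above:

```python
HOURS_PER_YEAR = 8760

def expected_failures(nodes, mtbf_hours):
    """Expected node failures per year, assuming one dominant
    failing component per node with the given MTBF."""
    return nodes * HOURS_PER_YEAR / mtbf_hours

# A 100-node cluster with a 45,000-hour per-node MTBF:
print(round(expected_failures(100, 45_000), 1))  # about 19.5 failures/year
```

Even at the optimistic end of the MTBF range, a moderately sized cluster should expect node failures to be a routine monthly event, not an exception.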
Doing the Math: How Many Nodes Are Required?
So, exactly how many nodes does your cluster need? It depends
primarily on the application(s) you want to run. In a cluster, an
application separates the processing task into smaller pieces, each
of which runs on the individual nodes. Some applications impose
restrictions on the flexibility of this decomposition. The application
might, for example, support only 2^n processes; the cluster
designed to support it would therefore need 2^n nodes. In this instance,
the failure of a single node would mean that the application could
utilize only half of the cluster.
Continuing the example, one solution would be to add one or two
spare nodes in advance, although this might impact the interconnect
cost per node. (See the sidebar "More Nodes or a Higher-Performance
Interconnect?") Additionally, there's a tradeoff between
the desired degree of fault tolerance and the cost of the overall
cluster. (See the sidebar "Moore's Law and Clusters.")
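To make the 2^n constraint concrete, here is a short sketch; the node counts are illustrative:

```python
def usable_nodes(available):
    """Largest power of two not exceeding the available node count,
    i.e., how many nodes a 2^n-constrained application can use."""
    p = 1
    while p * 2 <= available:
        p *= 2
    return p

# A 64-node cluster loses one node: only 32 remain usable.
print(usable_nodes(63))  # 32
# One or two spares bought up front keep the full 64 usable.
print(usable_nodes(65))  # 64
```

The jump from 32 back to 64 usable nodes is what a single inexpensive spare buys you, which is why the spare-node tradeoff is usually worth evaluating.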
Pre-processing the input data might also impose similar restrictions.
Pre-processing often requires a 64-bit machine, since the complete
problem must be parsed by a single process that breaks it into smaller
sub-problems, which are in turn processed by more cost-effective
32-bit processors. Pre-processing is often time consuming, and the
result is a data set that is tailored for processing by a specific
number of nodes. When a node fails here, the pre-processed
data set can be rendered obsolete, and a new one must be generated.
This is another example where the addition of nodes to the cluster
configuration can save valuable processing time.
When clusters are designed correctly, they offer tremendous flexibility,
scalability, and reliability over the long run. But to ensure that
your organization reaps these benefits, a cluster must be carefully
planned based on requirements such as reliability, applications,
cost, etc.
Cluster Management: Working Easier, Not Harder
As systems administrators (SAs) get squeezed to manage more machines,
they must have access to tools that help them work more efficiently.
If not, the incremental responsibilities administrators must assume
with clusters can quickly push even the best SA to the breaking
point. While there may be plenty of smart, low-cost labor in academic
environments, that's definitely not the case in today's
jam-packed corporate IT environment. There never seem to be enough
available resources to handle the workload, much less the increased
administrative overhead that clustering entails.
As a result, the widespread adoption of clusters is driving fundamental
changes in their management. The knowledge required to manage
a cluster should not reside with a single SA; rather, administration
capabilities must be embedded in the IT organization itself. Today
there is a new class of administration tools that remove the complexity
of clustering, enabling any SA generalist to be an effective cluster
administrator -- and in the process, helping to preserve a sys
admin's sanity.
Professionally developed tools can dramatically streamline cluster
management by reducing the time required for installation, ongoing
management of software, fault isolation, and fault prediction. These
are major issues in corporate environments where clusters can easily
grow to 1,000 or more nodes, dispersed across multiple geographic
locations.
It's true that open source has zero upfront costs compared
to professionally developed cluster solutions. However, there are
many associated costs -- in terms of time and resources --
that open source carries over the lifetime of the cluster. With
open source, sys admins must manually execute many steps (e.g.,
system installation, looking for and installing updates, compiling
different applications to run on various interconnects, etc.), which
can quickly add up to an extraordinary amount of administrative
overhead. With professionally developed solutions, the time required
for these tasks is a fraction of that -- an essential consideration
in today's resource-constrained IT environment.
Building a High-Performance Cluster: Issues and Considerations
To build a cluster that offers high performance and ease of manageability,
here are a few straightforward steps for you to follow. As detailed
below, the choices you make can have a dramatic effect on the ease
or difficulty you will later have in administering the cluster,
as well as its performance, flexibility, scalability, and return
on investment (ROI). Keep in mind that applications will predicate
many of your choices -- you'll need to consider which application(s)
you want to run and performance requirements, and even the specific
code.
Step 1: Physical Considerations
Before acquiring equipment and software, make sure that your environment
will support the cluster you configure, with room to grow as necessary.
Investigate cooling requirements, power sources and backup, and
make sure you have enough physical space to support the rack units
you'll want to install.
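A back-of-envelope check is useful before committing to a room. The 300 W per-server draw below is a hypothetical figure (check your vendor's specifications), but the watts-to-BTU/hr conversion factor is standard:

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # standard conversion factor

def rack_load(servers, watts_each=300):
    """Return (electrical load in watts, cooling load in BTU/hr)
    for a rack of 1U servers at an assumed per-server draw."""
    watts = servers * watts_each
    return watts, watts * WATTS_TO_BTU_PER_HOUR

watts, btu = rack_load(42)  # a fully populated 42U rack
print(watts)        # 12600 W of power per rack
print(round(btu))   # about 43,000 BTU/hr of cooling per rack
```

Multiply by the number of racks, add headroom for growth, and compare against what your machine room's power and HVAC can actually deliver.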
Step 2: Choose the Hardware Architecture
You can choose between a number of processor architectures for
your cluster, depending on performance requirements and budget.
Traditionally, clustering software has required that a single architecture
be used in a cluster. This is changing as communication middleware
and management systems better accommodate hardware heterogeneity,
giving admins the flexibility to design a cluster with a mix of
node architectures depending on requirements.
The result is greater design flexibility, scalability, and ability
to handle technology obsolescence with ease. For example, you can
design a cluster from the start to handle a variety of tasks with
optimized performance (such as pre- and post-processing), or you
can replace nodes that fail with the latest technology, giving you
essentially more power for the same or less cost. These nodes can
also come from different providers, so you're not locked into
a single vendor. The ability to combine two architectures
in a single cluster even lets you join two separate clusters into
one large cluster for increased processing power.
Another important issue to consider when ordering hardware is
that the vendor may preset the BIOS settings in each server at the
factory. Often, the boot order must be changed to enable network
boot, and PXE (Preboot eXecution Environment) or EFI (Extensible
Firmware Interface) must be enabled and configured. Furthermore,
if the CPUs used are Intel® Xeon™ or newer Pentium® 4s, the BIOS
option for enabling or disabling Hyper-Threading (HT) must be set
correctly based on application requirements; some applications take
advantage of HT, while for others it's counter-productive.
Economics might also dictate that you disable HT, which makes
one physical CPU appear to software as two separate CPUs. This
might require application licenses for twice the number of CPUs
-- not a cost-effective option if the application can't
take advantage of HT, which seldom delivers a performance boost
of more than about 30 percent.
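To see why, consider a back-of-envelope comparison. The $1,000 per-CPU license price is an illustrative assumption, as is the worst-case premise that the vendor counts logical rather than physical CPUs:

```python
def cost_per_performance(license_per_cpu, physical_cpus, ht_enabled,
                         ht_speedup=1.3):
    """License cost divided by relative throughput. Assumes licensing
    counts logical CPUs and HT yields at most ~30% extra throughput."""
    licensed_cpus = physical_cpus * (2 if ht_enabled else 1)
    performance = physical_cpus * (ht_speedup if ht_enabled else 1.0)
    return license_per_cpu * licensed_cpus / performance

# Hypothetical $1,000-per-CPU license on a 2-way node:
print(cost_per_performance(1000, 2, ht_enabled=False))           # 1000.0
print(round(cost_per_performance(1000, 2, ht_enabled=True), 2))  # 1538.46
```

Even in the best case, doubling the license count for a 30 percent throughput gain raises the cost per unit of work by roughly half -- and if the application gains nothing from HT, you simply pay double.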
Step 3: Choose the Interconnect
The interconnect determines how the nodes will be connected together
to function as a cluster. There are two main types of interconnects:
legacy interconnects such as Gigabit Ethernet, and more exotic ones
such as Myrinet. Gigabit Ethernet is less expensive, but Myrinet
offers higher performance, lower latency, and better scalability.
Certain middleware applications enable you to use a combination
of legacy and exotic interconnects in your cluster. Companies and
institutions often run multiple applications on a single cluster,
making your choice of interconnects a complex one, without a single
"right" answer. One of the newest ways to streamline this
decision point -- instead of making tactical decisions about
which interconnect to use on a cluster-by-cluster, application-by-application
basis -- is to strategically choose a message passing interface
that runs on all interconnects.
Step 4: Choose the Operating System
You'll need to decide which OS you want to run on the cluster.
A growing number of organizations are choosing variations of Linux
that are professionally supported for their high performance clusters.
Unsupported Linux is also an option, although there are administration
and support issues to carefully consider when choosing open source,
as previously noted.
Step 5: Choose the Cluster Communications Software
Here's where things start to get more complicated. The
message passing middleware is a layer that encapsulates the complexity
of the underlying communication mechanism and shields the application
from the different methods of basic communication. Today, the Message
Passing Interface (MPI) dominates the market and is the standard
for message passing. Although MPI is common to most parallel applications,
developers face a big challenge: virtually every brand
of interconnect requires its own implementation of the MPI
standard. Furthermore, most applications are statically linked to
the MPI library. This raises three issues. First, if you want to
run two or more applications on your cluster, and some of them are
linked with different versions of the MPI implementation, then a
conflict might occur. This inconsistency is solved by having one
of the application vendors re-link, test, and qualify their application
for the other MPI version, which may take a significant amount of
time.
Second, evolving demands from applications, or errors detected
and corrected in the MPI implementation, can force one of the applications
to use a newer version. In this case, you end up with the previously
mentioned inconsistency.
The third issue to watch for is upgrading the interconnect to
a different kind, or evaluating the possibilities of doing so. Let's
say you have decided to use Gigabit Ethernet as the interconnect,
but find out that the TCP/IP stack imposes overhead that restricts
the scalability of the application(s). To switch to an MPI able
to take advantage of more efficient and lean protocols, such as
Remote Direct Memory Access (RDMA), you again ask the application
vendors for help with your upgrade or evaluation. In practice, this
hurdle can prevent you from realizing the major improvements offered
by newer, more innovative communications software and by the general
evolution of interconnect hardware.
Another approach -- one that avoids the issue -- is dynamic
binding between the application and the MPI middleware, and between
the MPI middleware and device drivers for various types of interconnects.
This way, the MPI implementation and application can evolve independently:
the application can exploit the benefits of different
interconnects or protocols without being changed or re-linked.
(See Figure 1.)
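The dynamic-binding idea can be sketched in a few lines. The backend names and registry below are purely illustrative -- this is a conceptual model, not any real MPI product's API:

```python
# Each "backend" stands in for an interconnect-specific driver that
# the middleware selects at run time instead of at link time.
class TcpBackend:
    name = "tcp"
    def send(self, msg):
        return f"tcp:{msg}"

class RdmaBackend:
    name = "rdma"
    def send(self, msg):
        return f"rdma:{msg}"

BACKENDS = {b.name: b for b in (TcpBackend(), RdmaBackend())}

def send(backend_name, msg):
    # The application code is identical regardless of interconnect;
    # switching from TCP/IP to RDMA requires no re-link or recompile,
    # only a different run-time selection.
    return BACKENDS[backend_name].send(msg)

print(send("tcp", "hello"))   # tcp:hello
print(send("rdma", "hello"))  # rdma:hello
```

The application always calls the same interface; only the run-time selection decides which interconnect carries the message, which is exactly what insulates the application from interconnect upgrades.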
Step 6: Choose the System Management Software
Management software is the final layer in the cluster stack. It
can be open source or delivered from a professional provider, the
latter of which can be vendor-specific (i.e., software for managing
homogeneous clusters using a specific vendor's hardware) or
independent.
While vendor-specific management systems are often tuned for specific
platforms, they do not offer the flexibility to be used in heterogeneous
environments, as independent third-party offerings do. If you choose
vendor-specific management software for a cluster that is initially
homogeneous but later becomes heterogeneous, you will have to go
through the time and expense of switching to a new, independent
system management solution. Choosing an independent management system
from the start can provide the flexibility and functionality you
need, now and in the future.
Ideally, the management software should handle the OS, MPI, and
third-party software. A professionally developed management system
can handle this automatically, whereas with open source tools each
component must be installed manually when you initialize the cluster,
and reinstalled whenever a failed node must be re-initialized.
Conclusion
For systems administrators, building a cluster from the ground
up can be an exciting professional opportunity. But the impact of
the choices made -- particularly for software components like
message passing interfaces and management systems -- is deep
and long lasting. In the past, open source solutions have been sufficient
for organizations with a limited number of clusters and abundant
amounts of low-cost labor. But for today's busy SAs in corporate
environments, professionally developed solutions remove complexity,
and save time, effort and ultimately money -- making systems
administrators more productive while helping them stay sane.
Håkon Bugge is vice president of product development
at Scali, a provider of professionally developed clustering software
solutions. He can be reached at: hakon.bugge@scali.com.