Building the Perfect Cluster: Choices that Can Save Your Sanity
Håkon Bugge
Clustering is today's hottest hardware trend, poised to ride
the wave of the "mainstream adoption" portion of the technology
acceptance curve. Scientific and academic organizations were the
first to embrace high-performance clustering several years ago,
moving away from monolithic SMP hardware systems to less expensive,
more flexible clusters of high-volume, rack-mounted, commodity servers
typically running Linux. Besides saving money on hardware, these
early adopters also dramatically reduced acquisition costs and license
fees for operating system software.
Now the mainstream business world is moving to clusters for similar
reasons. Because x86-based commodity servers running Linux are so
inexpensive, many corporate IT organizations are committing to clusters
to provide mission-critical computing horsepower.
The Potential Pitfalls of Clustering
The transition from SMP machines to clustering is not without
a few challenges. The biggest issue is that clusters often require
more administration than traditional SMP systems, since each component
of the cluster is still an individual server that needs to be configured,
loaded with operating system and application software, administered,
and maintained.
Within the cluster, additional layers of technology and software
provide necessary services such as operating system, hardware interconnect,
message passing, and cluster management software. These elements
must be dealt with, as well as the fact that although clusters themselves
are fault-tolerant, their individual nodes are not. If a node goes
down, it must be replaced or repaired.
Meanwhile, a cluster should be designed to handle the highly likely
event of node failure. This is not a rare situation: disks, power
supplies, network interfaces, and other node components are estimated
to fail after between 30,000 and 60,000 hours of use. When this happens,
the cluster has one node fewer to utilize in carrying out its duties.
Clearly, in a mission-critical business IT environment, the impacts
of node failure and administrative overhead become an important
design factor. Thinking through the various "what if"
scenarios in the design phase can save significant cost over the
cluster's lifetime.
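The arithmetic behind this "what if" thinking is straightforward. Here is a quick sketch; the 100-node cluster size and 45,000-hour component MTBF are illustrative assumptions chosen from within the range quoted above:

```python
HOURS_PER_YEAR = 8760

def expected_failures(nodes, mtbf_hours):
    """Expected node failures per year, assuming one dominant
    failing component per node with the given MTBF."""
    return nodes * HOURS_PER_YEAR / mtbf_hours

# A 100-node cluster with a 45,000-hour per-node MTBF:
print(round(expected_failures(100, 45_000), 1))  # about 19.5 failures/year
```

Even at the optimistic end of the MTBF range, a moderately sized cluster should expect node failures to be a routine monthly event, not an exception.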
Doing the Math: How Many Nodes Are Required?
So, exactly how many nodes does your cluster need? It depends
primarily on the application(s) you want to run. In a cluster, an
application separates the processing task into smaller pieces, each
of which runs on the individual nodes. Some applications impose
restrictions on the flexibility of this decomposition. The application
might, for example, support only 2^n processes; the cluster
designed to support it would therefore need 2^n nodes. In this instance,
the failure of a single node would mean that the application could
utilize only half of the cluster.
Continuing the example, one solution would be to add one or two
spare nodes in advance, although this might impact the interconnect
cost per node. (See the sidebar "More Nodes or a Higher-Performance
Interconnect?") Additionally, there's a tradeoff between
the desired degree of fault tolerance and the cost of the overall
cluster. (See the sidebar "Moore's Law and Clusters.")
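To make the 2^n constraint concrete, here is a short sketch; the node counts are illustrative:

```python
def usable_nodes(available):
    """Largest power of two not exceeding the available node count,
    i.e., how many nodes a 2^n-constrained application can use."""
    p = 1
    while p * 2 <= available:
        p *= 2
    return p

# A 64-node cluster loses one node: only 32 remain usable.
print(usable_nodes(63))  # 32
# One or two spares bought up front keep the full 64 usable.
print(usable_nodes(65))  # 64
```

The jump from 32 back to 64 usable nodes is what a single inexpensive spare buys you, which is why the spare-node tradeoff is usually worth evaluating.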
Pre-processing the input data might also impose similar restrictions.
Pre-processing often requires a 64-bit machine, since the complete
problem must be parsed by a single process that breaks it into smaller
sub-problems, which are in turn processed by more cost-effective
32-bit processors. Pre-processing is often time consuming, and the
result is a data set that is tailored for processing by a specific
number of nodes. When a node fails here, the pre-processed
data set can be rendered obsolete, and a new one must be generated.
This is another example where the addition of nodes to the cluster
configuration can save valuable processing time.
When clusters are designed correctly, they offer tremendous flexibility,
scalability, and reliability over the long run. But to ensure that
your organization reaps these benefits, a cluster must be carefully
planned based on requirements such as reliability, applications,
cost, etc.
Cluster Management: Working Easier, Not Harder
As systems administrators (SAs) get squeezed to manage more machines,
they must have access to tools that help them work more efficiently.
If not, the incremental responsibilities administrators must assume
with clusters can quickly push even the best SA to the breaking
point. While there may be plenty of smart, low-cost labor in academic
environments, that's definitely not the case in today's
jam-packed corporate IT environment. There never seem to be enough
available resources to handle the workload, much less the increased
administrative overhead that clustering entails.
As a result, the widespread adoption of clusters is driving fundamental
changes in their management. The knowledge required to manage
a cluster should not reside with a single SA; rather, administration
capabilities must be embedded in the IT organization itself. Today
there is a new class of administration tools that remove the complexity
of clustering, enabling any SA generalist to be an effective cluster
administrator -- and in the process, helping to preserve a sys
admin's sanity.
Professionally developed tools can dramatically streamline cluster
management by reducing the time required for installation, ongoing
management of software, fault isolation, and fault prediction. These
are major issues in corporate environments where clusters can easily
grow to 1,000 or more nodes, dispersed across multiple geographic
locations.
It's true that open source has zero upfront costs compared
to professionally developed cluster solutions. However, there are
many associated costs -- in terms of time and resources --
that open source carries over the lifetime of the cluster. With
open source, sys admins must manually execute many steps (e.g.,
system installation, looking for and installing updates, compiling
different applications to run on various interconnects, etc.), which
can quickly add up to an extraordinary amount of administrative
overhead. With professionally developed solutions, the time required
for these tasks is a fraction of that -- an essential consideration
in today's resource-constrained IT environment.
Building a High-Performance Cluster: Issues and Considerations
To build a cluster that offers high performance and ease of manageability,
here are a few straightforward steps for you to follow. As detailed
below, the choices you make can have a dramatic effect on the ease
or difficulty you will later have in administering the cluster,
as well as its performance, flexibility, scalability, and return
on investment (ROI). Keep in mind that applications will predicate
many of your choices -- you'll need to consider which application(s)
you want to run and performance requirements, and even the specific
code.
Step 1: Physical Considerations
Before acquiring equipment and software, make sure that your environment
will support the cluster you configure, with room to grow as necessary.
Investigate cooling requirements, power sources and backup, and
make sure you have enough physical space to support the rack units
you'll want to install.
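A back-of-envelope check is useful before committing to a room. The 300 W per-server draw below is a hypothetical figure (check your vendor's specifications), but the watts-to-BTU/hr conversion factor is standard:

```python
WATTS_TO_BTU_PER_HOUR = 3.412  # standard conversion factor

def rack_load(servers, watts_each=300):
    """Return (electrical load in watts, cooling load in BTU/hr)
    for a rack of 1U servers at an assumed per-server draw."""
    watts = servers * watts_each
    return watts, watts * WATTS_TO_BTU_PER_HOUR

watts, btu = rack_load(42)  # a fully populated 42U rack
print(watts)        # 12600 W of power per rack
print(round(btu))   # about 43,000 BTU/hr of cooling per rack
```

Multiply by the number of racks, add headroom for growth, and compare against what your machine room's power and HVAC can actually deliver.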
Step 2: Choose the Hardware Architecture
You can choose between a number of processor architectures for
your cluster, depending on performance requirements and budget.
Traditionally, clustering software has required that a single architecture
be used in a cluster. This is changing as communication middleware
and management systems better accommodate hardware heterogeneity,
giving admins the flexibility to design a cluster with a mix of
node architectures depending on requirements.
The result is greater design flexibility, scalability, and ability
to handle technology obsolescence with ease. For example, you can
design a cluster from the start to handle a variety of tasks with
optimized performance (such as pre- and post-processing), or you
can replace nodes that fail with the latest technology, giving you
essentially more power for the same or less cost. These nodes can
also come from different providers, so you're not locked into
a single vendor. The ability to combine two architectures
in a single cluster even lets you join two separate clusters into
one large cluster for increased processing power.
Another important issue to consider when ordering hardware is
that the vendor may preset the BIOS settings in each server at the
factory. Often, the boot order must be changed to enable network
boot, and PXE (Preboot eXecution Environment) or EFI (Extensible
Firmware Interface) must be enabled and configured. Furthermore,
if the CPUs used are Intel® Xeon™ or newer Pentium® 4s, the BIOS
option for enabling or disabling Hyper-Threading (HT) must be set
correctly based on application requirements; some applications take
advantage of HT, while for others it's counter-productive.
Economics might also dictate that you disable HT, which makes
one physical CPU appear to software as two separate CPUs. This
might require application licenses for twice the number of CPUs
-- not a cost-effective option if the application can't
take advantage of HT, which seldom delivers a performance boost
of more than about 30 percent.
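To see why, consider a back-of-envelope comparison. The $1,000 per-CPU license price is an illustrative assumption, as is the worst-case premise that the vendor counts logical rather than physical CPUs:

```python
def cost_per_performance(license_per_cpu, physical_cpus, ht_enabled,
                         ht_speedup=1.3):
    """License cost divided by relative throughput. Assumes licensing
    counts logical CPUs and HT yields at most ~30% extra throughput."""
    licensed_cpus = physical_cpus * (2 if ht_enabled else 1)
    performance = physical_cpus * (ht_speedup if ht_enabled else 1.0)
    return license_per_cpu * licensed_cpus / performance

# Hypothetical $1,000-per-CPU license on a 2-way node:
print(cost_per_performance(1000, 2, ht_enabled=False))           # 1000.0
print(round(cost_per_performance(1000, 2, ht_enabled=True), 2))  # 1538.46
```

Even in the best case, doubling the license count for a 30 percent throughput gain raises the cost per unit of work by roughly half -- and if the application gains nothing from HT, you simply pay double.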
Step 3: Choose the Interconnect
The interconnect determines how the nodes will be connected together
to function as a cluster. There are two main types of interconnects:
legacy interconnects such as Gigabit Ethernet, and more exotic ones
such as Myrinet. Gigabit Ethernet is less expensive, but Myrinet
offers higher performance, lower latency, and better scalability.
Certain middleware applications enable you to use a combination
of legacy and exotic interconnects in your cluster. Companies and
institutions often run multiple applications on a single cluster,
making your choice of interconnects a complex one, without a single
"right" answer. One of the newest ways to streamline this
decision point -- instead of making tactical decisions about
which interconnect to use on a cluster-by-cluster, application-by-application
basis -- is to strategically choose a message passing interface
that runs on all interconnects.
Step 4: Choose the Operating System
You'll need to decide which OS you want to run on the cluster.
A growing number of organizations are choosing variations of Linux
that are professionally supported for their high performance clusters.
Unsupported Linux is also an option, although there are administration
and support issues to carefully consider when choosing open source,
as previously noted.
Step 5: Choose the Cluster Communications Software
Here's where things start to get more complicated. The
message passing middleware is a layer that encapsulates the complexity
of the underlying communication mechanism and shields the application
from the different methods of basic communication. Today, the Message
Passing Interface (MPI) dominates the market and is the standard
for message passing. Although MPI is common to most parallel applications,
developers face a big challenge: virtually every brand
of interconnect requires its own implementation of the MPI
standard. Furthermore, most applications are statically linked to
the MPI library. This raises three issues. First, if you want to
run two or more applications on your cluster, and some of them are
linked with different versions of the MPI implementation, then a
conflict might occur. This inconsistency is solved by having one
of the application vendors re-link, test, and qualify their application
for the other MPI version, which may take a significant amount of
time.
Second, evolving demands from applications, or errors detected
and corrected in the MPI implementation, can force one of the applications
to use a newer version. In this case, you end up with the previously
mentioned inconsistency.
The third issue to watch for is upgrading the interconnect to
a different kind, or evaluating the possibilities of doing so. Let's
say you have decided to use Gigabit Ethernet as the interconnect,
but find out that the TCP/IP stack imposes overhead that restricts
the scalability of the application(s). To switch to an MPI able
to take advantage of more efficient and lean protocols, such as
Remote Direct Memory Access (RDMA), you again ask the application
vendors for help with your upgrade or evaluation. In practice, this
hurdle can prevent you from realizing the major improvements offered
by newer, more innovative communications software and by the general
evolution of interconnect hardware.
Another approach -- one that avoids the issue -- is dynamic
binding between the application and the MPI middleware, and between
the MPI middleware and device drivers for various types of interconnects.
This way, the MPI implementation and application can evolve independently:
the application can exploit the benefits of different
interconnects or protocols without being changed or re-linked.
(See Figure 1.)
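The dynamic-binding idea can be sketched in a few lines. The backend names and registry below are purely illustrative -- this is a conceptual model, not any real MPI product's API:

```python
# Each "backend" stands in for an interconnect-specific driver that
# the middleware selects at run time instead of at link time.
class TcpBackend:
    name = "tcp"
    def send(self, msg):
        return f"tcp:{msg}"

class RdmaBackend:
    name = "rdma"
    def send(self, msg):
        return f"rdma:{msg}"

BACKENDS = {b.name: b for b in (TcpBackend(), RdmaBackend())}

def send(backend_name, msg):
    # The application code is identical regardless of interconnect;
    # switching from TCP/IP to RDMA requires no re-link or recompile,
    # only a different run-time selection.
    return BACKENDS[backend_name].send(msg)

print(send("tcp", "hello"))   # tcp:hello
print(send("rdma", "hello"))  # rdma:hello
```

The application always calls the same interface; only the run-time selection decides which interconnect carries the message, which is exactly what insulates the application from interconnect upgrades.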
Step 6: Choose the System Management Software
Management software is the final layer in the cluster stack. It
can be open source or delivered from a professional provider, the
latter of which can be vendor-specific (i.e., software for managing
homogeneous clusters using a specific vendor's hardware) or
independent.
While vendor-specific management systems are often tuned for specific
platforms, they do not offer the flexibility to be used in heterogeneous
environments, as independent third-party offerings do. If you choose
vendor-specific management software for a cluster that is initially
homogeneous but later becomes heterogeneous, you will have to go
through the time and expense of switching to a new, independent
system management solution. Choosing an independent management system
from the start can provide the flexibility and functionality you
need, now and in the future.
Ideally, the management software should handle the OS, MPI, and
third-party software. A professionally developed management system
can handle this automatically, whereas with open source tools each
component must be installed manually when you initialize the cluster,
and reinstalled whenever a failed node must be re-initialized.
Conclusion
For systems administrators, building a cluster from the ground
up can be an exciting professional opportunity. But the impact of
the choices made -- particularly for software components like
message passing interfaces and management systems -- is deep
and long lasting. In the past, open source solutions have been sufficient
for organizations with a limited number of clusters and abundant
amounts of low-cost labor. But for today's busy SAs in corporate
environments, professionally developed solutions remove complexity,
and save time, effort and ultimately money -- making systems
administrators more productive while helping them stay sane.
Håkon Bugge is vice president of product development
at Scali, a provider of professionally developed clustering software
solutions. He can be reached at: hakon.bugge@scali.com.