Integrating Linux Clusters into the Grid
Ian Lumb and Chris Smith
Linux clustering is pervasive. Next to the attractive price/performance
of COTS components, smart system software plays a key role in this
pervasiveness. In the context of clustering, it is smart system
software that allows a number of distinct systems to appear as one
-- even though each runs its own instance of the Linux operating
system. Figure 1 illustrates the possibilities. At one extreme,
the single-system environment (SSE) is smart system software that
runs in user space as a layered service. Often referred to as middleware,
SSE solutions are available in a number of open source and commercial
implementations. At the other end, the single-system image (SSI)
is smart system software that spreads operating-system functions
across systems and involves modification of the Linux kernel. Such
tightly coupled integrations permit global process spaces (i.e.,
PIDs that span separate instances of the Linux operating system,
such as Beowulf BPROC), use of algorithms for preemptive process
migration (e.g., the MOSIX management package), etc. Though presented
as extremes, examples of SSI-SSE integration do exist. To varying
degrees, these solutions enable computing for capacity (i.e., throughput
of serial, parametric, and embarrassingly parallel applications)
and/or capability (i.e., multithreaded and distributed-memory parallel
applications). However, our purpose in this article is not to discuss
Linux clustering in detail. Rather, it is to make a simple observation:
Use of smart system software allows distinct instances of the Linux
operating system to be virtualized as a cluster; a natural extension
allows clusters to be virtualized into grids.
In the next section, we define "grid computing" and
follow this definition with examples of enterprise and partner grids.
After providing an overview of the exciting convergence between
the grid and Web services in the Open Grid Services Architecture
(OGSA), we close with a summary and some recommendations plus resources
for further investigation.
Grid Computing
Much like the Web, grid computing originated in the research community
to facilitate collaboration for "Big Science", such as
sharing terascale volumes of data from high-energy physics (HEP)
experiments among hundreds of globally distributed scientists,
or aggregating hundreds of CPUs to perform "grand challenge"
computations. Because the private sector shares a common interest
in high performance computing (HPC), grid computing is seeing early
adoption in the commercial sector as well. (See Resources section
for additional background and examples of grid computing.)
With adoption in its earliest phases, awareness of grid computing
is widespread, but understanding of it is often hazy. To fix ideas,
we adopt a three-point grid checklist (see Resources): "...
a Grid is a system that (1) coordinates resources that are not subject
to centralized control using (2) standard, open, general-purpose
protocols and interfaces to (3) deliver nontrivial qualities of
service". Each of these technical points requires elaboration:
1. Coordinates resources that are not subject to centralized control.
Because geographic distribution of people and resources is common,
coordination between multiple departments within a single organization
or multiple organizations may be necessary. In some cases, cooperation
among organizations exists only for a finite period of time, so
the term "virtual organizations" is often used. Trust
relationships and connectivity are examples of concerns between
cooperating parties.
2. Uses standard, open, general-purpose protocols and interfaces.
An increasingly real vision at the present time, the emerging Open
Grid Services Architecture (OGSA) holds the promise of refactoring
existing technology, and developing new technology, around open standards.
3. Delivers nontrivial qualities of service (QoS). These qualities
of service are often combined into Service Level Agreements (SLAs)
or policies. In the grid computing context, QoS translates business
objectives into objectives for the IT infrastructure, thus enabling
effective utilization, resource aggregation, and remote access to
specialized resources.
There are two noteworthy consequences of this three-point grid
checklist. First, Linux clusters are not grids. Even though they
may have grid-like attributes, Linux clusters fail to satisfy the
checklist's first point (i.e., they are centrally controlled
and are not distributed geographically). Second, grid computing represents
the next phase in the evolution of distributed computing. In the
next two sections, we illustrate this evolution in terms of enterprise
and partner grids.
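The checklist can be read as three yes/no tests. The following minimal
Python sketch encodes them. The class, field, and function names are
hypothetical and chosen only for illustration, and the values assigned to
the example cluster are assumptions that mirror the reasoning above rather
than any measurement.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class System:
    """Hypothetical description of a computing system (illustrative only)."""
    name: str
    decentralized_control: bool  # point 1: resources not under centralized control
    open_protocols: bool         # point 2: standard, open, general-purpose protocols
    nontrivial_qos: bool         # point 3: delivers nontrivial qualities of service


def failed_points(system: System) -> List[int]:
    """Return the checklist points (1 to 3) that the system does not satisfy."""
    checks = [
        (1, system.decentralized_control),
        (2, system.open_protocols),
        (3, system.nontrivial_qos),
    ]
    return [point for point, ok in checks if not ok]


def is_grid(system: System) -> bool:
    """A system qualifies as a grid only if all three points hold."""
    return not failed_points(system)


# Assumed values: a single Linux cluster is centrally controlled, so it
# fails point 1 regardless of how it fares on points 2 and 3.
cluster = System("linux-cluster", decentralized_control=False,
                 open_protocols=True, nontrivial_qos=True)
```

Calling failed_points(cluster) reports which points rule the cluster out,
which makes the "grid-like but not a grid" distinction explicit.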
Enterprise Grids
In this section and the next, we assume that SSE smart system
software (e.g., Platform LSF) has virtualized a number of distinct
systems into an HPC cluster -- as before, each system is running
its own instance of the Linux operating system. Two or more clusters
can be transformed into an enterprise grid with Platform MultiCluster.
An overview of the transformation process follows:
- Identify the clusters involved.
- Agree upon the ports to be used by the service's daemons
for communication.
- Install the enabling components for Platform MultiCluster on
each cluster based on Platform LSF.
- Agree upon and implement consistent definitions for resources
(e.g., host types and models, shared resources, etc.).
- Agree upon and implement use models (i.e., job forwarding and/or
resource leasing) plus queue configurations.
- Agree upon and implement user account mapping as necessary.
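To make the job-forwarding use model concrete, here is a minimal Python
sketch. It is not Platform MultiCluster's scheduling algorithm; the
Cluster class, the place_job function, the cluster names, and the
"prefer local resources" rule are all assumptions introduced for
illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Cluster:
    """Hypothetical model of one cluster in an enterprise grid."""
    name: str
    free_slots: int
    # Names of remote clusters this cluster has agreed to forward jobs to.
    forwards_to: List[str] = field(default_factory=list)


def place_job(submit_cluster: str, slots: int,
              clusters: Dict[str, Cluster]) -> Optional[str]:
    """Return the name of the cluster that runs the job, or None if it must wait.

    Assumed policy: the submission cluster is tried first, and the job is
    forwarded only when local slots are insufficient.
    """
    local = clusters[submit_cluster]
    candidates = [local] + [clusters[name] for name in local.forwards_to]
    for cluster in candidates:
        if cluster.free_slots >= slots:
            cluster.free_slots -= slots  # reserve the slots for this job
            return cluster.name
    return None


# Two departmental clusters; only "design" forwards work elsewhere.
grid = {
    "design": Cluster("design", free_slots=2, forwards_to=["verification"]),
    "verification": Cluster("verification", free_slots=8),
}
```

Because forwarding is declared per cluster, each department keeps control
over where its jobs may go, which is the local autonomy the agreements
above are meant to preserve.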
Platform MultiCluster facilitates collaboration at the enterprise
level without loss of local autonomy; to implement such a solution,
inter-departmental discussions are required to complement the technical
efforts. Customers use this enabling technology in their production
deployments across a variety of industries (e.g., semiconductor
design, industrial manufacturing, government and education, bioinformatics,
computational chemistry, petroleum exploration, financial services,
etc.). In practice, this combination supports a variety of submission-execution
topologies (Figure 2). With the exception of the cluster case (Single
Submission, Single Execution), real-world implementations of these
topologies exist. From this topological consideration, it is clear
that an enterprise grid is a cluster of clusters.
Platform MultiCluster fits the three-point grid checklist with
the following qualifications on the first two points:
1. A single organization is involved -- or multiple organizations
operating as one. This simplification means that the organizational
firewall can serve as the primary means of security.
2. Because this technology predates the still emerging open standards,
proprietary protocols and interfaces remain in use today.
Non-trivial QoS, through enterprise-wide scheduling, is enabled
by Platform MultiCluster.
Partner Grids
Again, we assume that SSE smart system software has virtualized
a number of distinct systems into an HPC cluster -- as before,
each system is running its own instance of the Linux operating system.
Additionally, multiple (virtual) organizations cause the transition
from enterprise to partner grids. In the partner grid case, the
enterprise firewall is meaningless, and the need for resource discovery
arises. The Globus Toolkit addresses these extra-enterprise tensions
in security and discovery. Figure 3 provides a functional overview
of the toolkit:
- Cluster Level -- As in the case of enterprise grids, SSE
smart system software virtualizes distinct systems into an HPC
cluster. Besides Platform LSF, other choices include Altair OpenPBS
or PBS Pro, Sun Grid Engine, or the University of Wisconsin's
Condor. The cluster-level workload manager is separate from the
toolkit.
- Grid Level -- The interface to the cluster level is a Globus
component called GRAM (Grid Resource Allocation and Management).
GRAM permits an identified system to serve as the gatekeeper for
the grid, and through its job manager, acts as a universal adapter
for one or more cluster-level workload managers. Information discovery
(Grid Resource Information Service, GRIS) and information indexing
(Grid Index Information Service, GIIS) together comprise the Monitoring
and Discovery Service (MDS). Essential file transfer capability
is available via the toolkit's GridFTP component; a replica
management service based around GridFTP is also available. Shown
conceptually in this level, rich API (Application Programming
Interface) support is addressable directly from the cluster or
access layers.
- Access Level -- A client-side command-line interface is
included with the toolkit. This interface allows grid users to
describe (via a Resource Specification Language, RSL) jobs on
submission, plus monitor and control jobs. Although the toolkit
does not include a GUI, generic and community-centric portals
exist.
- Security -- All of the above is consistent with the Grid
Security Infrastructure (GSI). GSI uses a Public Key Infrastructure
(PKI) approach in which all grid resources (i.e., users, systems
and services) have their own private and public keys. X.509 certificates,
the Secure Sockets Layer (SSL, now referred to as Transport Layer
Security, TLS), a Certificate Authority (CA), and the Generic Security
Service API (GSS-API) form the core of GSI. Grid-motivated extensions
include single-sign-on and delegation capabilities. The GSI implementation
in the Globus Toolkit is standards-compliant. We will have more
to say about grid standards in the next section.
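As an illustration of the job description step at the access level, the
following Python sketch assembles an RSL string from a dictionary. The
helper names are hypothetical and not part of the toolkit. The sketch
assumes the conjunctive "&(attribute=value)" form of RSL used with version
2.x of the toolkit, and it deliberately rejects values containing double
quotes rather than guess at escaping rules.

```python
from typing import Dict, Sequence, Union

Value = Union[str, int, Sequence[str]]


def quote(value: str) -> str:
    """Wrap a string value in double quotes (embedded quotes are not handled)."""
    if '"' in value:
        raise ValueError("values containing double quotes are outside this sketch")
    return f'"{value}"'


def build_rsl(attributes: Dict[str, Value]) -> str:
    """Render attributes as a conjunction of (name=value) relations."""
    relations = []
    for name, value in attributes.items():
        if isinstance(value, int):
            rendered = str(value)          # integers are emitted unquoted
        elif isinstance(value, str):
            rendered = quote(value)
        else:
            # A sequence becomes a space-separated list of quoted strings.
            rendered = " ".join(quote(str(item)) for item in value)
        relations.append(f"({name}={rendered})")
    return "&" + "".join(relations)


# Hypothetical job: four instances of /bin/hostname with one argument.
rsl = build_rsl({
    "executable": "/bin/hostname",
    "arguments": ["-f"],
    "count": 4,
})
```

The resulting string is what a grid user would hand to the toolkit's
command-line client at submission time; generating it programmatically
is one way a portal might shield users from the syntax.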
Figure 4 illustrates how the components of the Globus Toolkit
might be deployed both with and without a cluster-level workload
manager. Again, the natural affinity for grid use is evident in
environments that have already been virtualized through SSE smart
system software.
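The "universal adapter" role of the job manager, with and without a
cluster-level workload manager, can be sketched as a simple dispatch. This
Python fragment is an assumption-laden illustration: JobRequest and
dispatch are hypothetical names, the returned strings are descriptive
only, and a real job manager generates scheduler-specific submissions
rather than the summaries shown here. The submission command names (bsub,
qsub, condor_submit) are those of the respective workload managers.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class JobRequest:
    """Scheduler-neutral job description received at the grid level."""
    executable: str
    count: int = 1


# Submission commands of the workload managers named in the text.
SUBMIT_COMMAND: Dict[str, str] = {
    "lsf": "bsub",
    "pbs": "qsub",
    "sge": "qsub",
    "condor": "condor_submit",
}


def dispatch(request: JobRequest, workload_manager: Optional[str] = None) -> str:
    """Describe where a request goes; None means no workload manager is present."""
    if workload_manager is None:
        # Without a cluster-level workload manager, the job is simply
        # started as a process on the gatekeeper host.
        return f"fork: run {request.count} x {request.executable} on the gatekeeper host"
    if workload_manager not in SUBMIT_COMMAND:
        raise ValueError(f"no adapter for workload manager {workload_manager!r}")
    command = SUBMIT_COMMAND[workload_manager]
    return f"{workload_manager}: submit {request.count} x {request.executable} via {command}"


summary = dispatch(JobRequest("/bin/hostname", count=4), "lsf")
```

The grid-level request is identical in every case; only the adapter
selected at the gatekeeper changes.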
The Globus Project (a research consortium led by the Argonne
National Laboratory (Chicago, IL) and the Information Sciences Institute
of the University of Southern California (Marina del Rey, CA)) makes
the toolkit available via a liberal open source license called the
Globus Toolkit Public License (GTPL). The GTPL permits the following
distributions of the toolkit:
- Globus Project -- Source code for the vanilla distribution
of the toolkit is available directly from the Globus Project.
Because the software engineers at the Globus Project use Linux
as their primary development platform, pre-built Linux distributions
are always available.
- System Vendors -- Many major system vendors offer a bundled
version of the Globus Toolkit. Because each of these versions
targets a specific platform, and this may involve source-code
modifications (e.g., due to porting, optimization, etc.), a temporary
degradation in overall interoperability between versions of the
toolkit is possible. This reduced interoperability is temporary
as the system vendors contribute their source code modifications
back to the Globus Project for inclusion in a subsequent release
of the toolkit. System vendors' Linux offerings tend not
to suffer from this complication.
- Independent Software Vendors (ISVs) -- Platform Globus
is a commercially supported version of the Globus Toolkit. Platform
adds value through enhancements (improved packaging and
installation, multi-platform support, improved interoperability
with Platform LSF, etc.), technical support, documentation,
and the availability of professional services for grid planning,
deployment, and ongoing management.
- Grid Starter Kits -- Through various initiatives, a number
of grid starter kits have become available. These kits tend to
target specific communities, projects, or grid competency in general.
Based around pre-built distributions of the toolkit, these starter
kits may include a portal and a cluster-level workload manager, along
with other utilities.
Although the specifics do vary from distribution to distribution,
a generic overview of the installation and configuration process
will include:
- Pre-installation planning, such as acquiring the distribution
and certificates, ensuring availability of various utilities (e.g.,
Perl is required by some of the packaging tools), time synchronization,
etc.
- Building (if needed) and installing various bundles of the
toolkit
- Enabling GSI -- managing certificates for users, the gatekeeper,
and the directory service (MDS)
- Setting up the job manager for the appropriate cluster-level
workload manager
- Establishing user access control
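User access control in version 2.x of the toolkit rests on mapping the
subject name of a user's certificate to a local account. The Python sketch
below parses such a mapping. It assumes the common one-entry-per-line form
(a quoted subject name followed by a single local account name); the
function names, the sample subjects, and the account names are
hypothetical.

```python
import shlex
from typing import Dict, Optional


def parse_mapfile(text: str) -> Dict[str, str]:
    """Map certificate subject names to local accounts.

    Blank lines and lines starting with '#' are ignored. Entries that do
    not consist of exactly a subject and one account are rejected.
    """
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = shlex.split(line)  # honors the quotes around the subject
        if len(parts) != 2:
            raise ValueError(f"unexpected entry: {line!r}")
        subject, account = parts
        mapping[subject] = account
    return mapping


def local_account(subject: str, mapping: Dict[str, str]) -> Optional[str]:
    """Return the local account for a subject, or None if access is not granted."""
    return mapping.get(subject)


# Hypothetical entries for illustration only.
SAMPLE = '''
# certificate subject                      local account
"/O=Grid/OU=example.org/CN=Jane Doe" jdoe
"/O=Grid/OU=example.org/CN=Ravi Patel" rpatel
'''

accounts = parse_mapfile(SAMPLE)
```

A subject that is absent from the mapping yields None, which corresponds
to a certificate holder who is authenticated by GSI but not authorized on
this resource.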
Once the toolkit is installed, correct operation of each component can be verified.
The Globus Project provides certificates through a publicly accessible
Certificate Authority (CA). This means that those who are keen to
experiment with the toolkit do not need to set up their own CA at
the outset.
To this point, we have referenced version 2.x of the Globus Toolkit;
as of this writing, version 2.4 is the current production release.
Version 2.x is used in a number of grid projects (see Resources).
Overall, the toolkit complies well with the three-point grid checklist.
However, against the final point, regarding non-trivial QoS, there
is much opportunity for improvement.
The Open Grid Services Architecture
Each of the components in version 2.x of the Globus Toolkit is
based directly on a protocol (Figure 5). Although this was a pragmatic
and understandable decision at the outset, these underpinnings
increasingly limited the ability to build on top of the toolkit.
Starting in late 2001, and based on these concerns of modularity
and extensibility, the Globus Project sought to refactor the toolkit.
Around the same time, IBM Research was investigating autonomic computing
-- computer systems that regulate themselves much in the same
way our autonomic nervous system regulates and protects our bodies.
The cross-fertilization between the Globus Project and IBM Research
led to the OGSA.
OGSA is the consequence of the convergence between grid computing
and Web services. Based on experience with the Globus Toolkit, a
number of functional components (e.g., resource and allocation management,
data management, directory services) and common services (e.g.,
security) are identifiable. From Web services, the ability to leverage
SOAP (Simple Object Access Protocol), WSDL (Web Services Description
Language), and other capabilities is clearly appealing.
This fusion has already resulted in a significant outcome --
an emerging standard and implementation of the Open Grid Services
Infrastructure (OGSI, Figure 6). Built on top of Web services (particularly
WSDL), the Grid Service Specification is an enhancement that takes
grid computing into account through additions like persistence,
lifetime management, etc. The current version of this specification
is passing through the standards-approval process of the Global
Grid Forum (GGF). The GGF is a community-initiated forum that serves
as the authoritative body in the standards process. The Grid Service
Specification is a deliverable of the OGSI Working Group within
the GGF.
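Lifetime management is one of the additions that distinguishes a grid
service from a plain Web service. The Python sketch below illustrates the
general idea of time-limited service instances that clients keep alive by
extending a termination time. It is a conceptual model under our own
assumptions, not the interface defined by the Grid Service Specification;
every class, method, and handle name is hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ServiceInstance:
    """Hypothetical transient service instance with a limited lifetime."""
    handle: str
    termination_time: float  # seconds on a clock supplied by the caller

    def extend_lifetime(self, requested_time: float) -> float:
        """Push the termination time later; a request never shortens it."""
        self.termination_time = max(self.termination_time, requested_time)
        return self.termination_time

    def is_expired(self, now: float) -> bool:
        """An instance expires once the clock reaches its termination time."""
        return now >= self.termination_time


def reap(instances: List[ServiceInstance], now: float) -> List[ServiceInstance]:
    """Return only the instances that are still alive at time `now`."""
    return [instance for instance in instances if not instance.is_expired(now)]


# Two instances created at time 0 with a lifetime of 60 seconds each.
job_a = ServiceInstance("job-a", termination_time=60.0)
job_b = ServiceInstance("job-b", termination_time=60.0)

# A client that still cares about job-a extends it; job-b is left alone.
job_a.extend_lifetime(120.0)
alive = reap([job_a, job_b], now=90.0)
```

Instances whose clients disappear simply time out, so the hosting
environment can reclaim resources without an explicit cleanup message.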
Originally released in mid-January of 2003, version 3 of the Globus
Toolkit (GT3) includes an implementation of the OGSI. As of this
writing, version 3 of the toolkit is in a beta release, and is expected
to enter production status later this year. The OGSI implementation
in GT3 is typically used in a Java 2 Enterprise Edition (J2EE) hosting
environment, though other hosting environments (e.g., Microsoft
.NET) are emerging. The hosting environment allows all resources
(the lowest layer in Figure 6), including Linux clusters, to be
virtualized for grid use.
From the bottom up, these first three layers (Figure 6) are tangible.
Current efforts in GGF Working Groups seek to co-evolve standards
and implementations for core OGSA services and policies. The final
three layers (at the top of Figure 6) place user interfaces, applications
and an application-enabling API in this OGSA context; these layers
are also evolving. Although the transition to OGSA will be evolutionary,
this is a revolutionary change to a service-oriented architecture.
Summary
Linux clusters are predisposed towards the grid. The established
practice is to virtualize these clusters through SSE smart system
software. Enterprise and partner grids provide examples of early
adoption based around established products (Platform MultiCluster)
and toolkits (Globus Toolkit, version 2), respectively. We have
also introduced the exciting convergence of grid computing and Web
services. This approach provides a modular and extensible foundation
upon which grid standards and implementations are starting to co-evolve.
Understandably, this is a time of significant change. Existing
technologies are being refactored under OGSA, and new technologies
are starting to emerge. Getting involved will depend on a number
of factors, such as exploratory investigation versus production
implementation, testbed versus enterprise versus cross-organizational
deployment, readiness for a service-oriented approach, etc. Again,
the experiences of those familiar with Linux clustering are transferable
to the grid. However, out-of-the-cluster thinking is needed to identify
and address the requirements on a broader scale of collaboration.
Acknowledgements
The authors acknowledge Carla Lotito of Platform for providing
Figure 4.
Resources
Autonomic Computing -- http://www.research.ibm.com/autonomic
Commodity Grid Kits -- http://www-unix.globus.org/cog
Foster, I., "What is The Grid? A Three Point Checklist",
GRIDtoday, 1(6), July 22, 2002. Available online at http://www.gridtoday.com/02/0722/100136.html.
Foster, I., "The Grid: Computing without Bounds", Scientific
American, April 2003.
The Globus Project -- http://www.globus.org
The Global Grid Forum (GGF) -- http://www.ggf.org
Grid Portal Toolkit -- https://gridport.npaci.edu/
MOSIX -- http://www.mosix.org
Platform Computing -- http://www.platform.com
The Open Grid Services Architecture (OGSA) -- http://www.globus.org/ogsa
Ian Lumb has been with Platform Computing Inc. for 5 years, starting
in training, and then business development, before taking his current
role as a Systems Engineer focused on Grid Computing solutions for
Government, Education and the Life Sciences. He has an M.Sc. in Earth
and Atmospheric Science from York University, and his interests include
High Performance Computing (HPC) for scientific insight.
Chris Smith has been with Platform Computing Inc. for 6 years,
starting in the development organization, moving to his current
role of Integration Architect focused on Grid Computing solutions
in Life Science and Government. He has a B.Sc. in Computer Science
from the University of British Columbia, and his interests include
distributed computing, parallel programming, operating systems and
communication protocols.