Solaris™
Failover -- Keeping Connected at All Times
Brian Gollsneider and Arthur Messenger
Success in today's high-tech world demands high-availability
systems. Five or six 9s availability requires high-end hardware,
stable operating systems, and a stable connection to the network.
In Solaris™ 8, Sun introduced IP Network Multipathing. This
capability allows administrators to create a hot standby for a network
interface card (NIC) or to configure several active NICs on a machine
in a multipath group to back up each other. The hot standby can
take over for a failed primary card in as little as 100 ms. In this
article, we show how to configure a system for failover, then
describe the network impact of multipath groups, how resilient normal
network applications are to timeouts, and the system logging and
notification of the appropriate events. We assume a working Solaris
8 system that is on a network and has a second network card available.
Ethereal was used to monitor and record the network activity.
Background
Network failover is the ability to recover from a network problem
on one network path and switch to another. The failure can be the
network card itself dying, the network cable being cut or disconnected,
or some other equivalent event. Note that we forced network failures
by physically disconnecting the network cable at the appropriate
time. Sun's IP Network Multipathing has three main parts: failure
detection, repair detection, and outbound load spreading.
Failure detection is sensing when a link is no longer good. On
the other hand, repair detection is determining when the link is
good again. Outbound load spreading is dividing the network traffic
leaving the system between the network interfaces. Earlier releases
of Solaris supported multiple network cards but did not have failover.
Previously, the common approach to simulating failover was to write
a script that continually pinged a host. If an answer was received,
then no action was taken because the network connection was good.
If the ping failed, then that interface was brought down and a backup
interface was configured and activated. Although this approach worked,
it had limitations and was not very elegant.
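
The ping-script approach described above can be sketched roughly as follows. This is not the authors' script, only an illustration under stated assumptions: the interface names are borrowed from later in the article, the ping target 10.1.1.254 and the helper names (run, probe, failover, watch_once) are hypothetical, and the backup card is assumed to take over the same service address. By default it only prints the ifconfig commands it would run.

```shell
#!/bin/sh
# Sketch of a pre-Solaris 8 style ping watchdog (not the authors' script).
# PRIMARY, BACKUP, TARGET, and SERVICE_IP are illustrative values.
PRIMARY=${PRIMARY:-iprb0}
BACKUP=${BACKUP:-iprb1}
TARGET=${TARGET:-10.1.1.254}      # hypothetical host to ping
SERVICE_IP=${SERVICE_IP:-10.1.1.1}
DRY_RUN=${DRY_RUN:-1}             # 1 = print commands instead of running them

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Solaris ping takes "host [timeout]" and exits 0 if the host answers.
probe() {
    ping "$TARGET" 5 >/dev/null 2>&1
}

# Take the primary down and bring the service address up on the backup.
failover() {
    run ifconfig "$PRIMARY" down
    run ifconfig "$BACKUP" plumb "$SERVICE_IP" netmask 255.255.255.0 up
}

# One pass of the watchdog: "$@" is the probe command to use.
# Returns 0 if the link answered, 1 if a failover was triggered.
watch_once() {
    if "$@"; then
        return 0
    fi
    failover
    return 1
}
```

A loop such as `while watch_once probe; do sleep 10; done` would keep checking until the first failed ping. Note that a single lost reply triggers the switch here, which is one of the limitations of this style of script.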
With Solaris 8, administrators get the ability to configure network
failover in several ways. The two primary ways are standby and active.
Standby is where the primary network card is used until a failure
and then the system switches to the standby card; active is where
both cards are active until one fails, at which time all traffic
is sent through the remaining card.
Details of IP Network Multipathing
IP Network Multipathing uses a daemon, /sbin/in.mpathd, to watch
over a group of NICs. A private address used only by in.mpathd is
established on each NIC. The in.mpathd daemon issues echo requests
(pings) to a node on the IP link [1]. Note that the node is the default
router if there is one. If there is no router, the node is determined
by sending a multicast packet to the "all hosts" multicast
address, 224.0.0.1. The first few hosts to reply become the target nodes.
In our small test network with nine other hosts, in.mpathd started
sending echo requests to five of the nine responding hosts in a random fashion.
If there are five consecutive echo request failures on a NIC in
the group that in.mpathd is watching, failure has been detected
and the link is declared not to be functioning.
in.mpathd then creates a logical interface for the failed NIC's
IP address on the NIC in the group with the fewest logical
interfaces. IP will then start using this new NIC.
The in.mpathd daemon continues to send echo requests on the failed
NIC while it has been declared non-functioning. When it has 10 consecutive
echo request successes on the failed NIC's private address
(i.e., it has detected the repair of the link), in.mpathd re-establishes
the IP address on the NIC and removes the logical interface on the
new NIC.
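
The two thresholds just described (five consecutive failed echo requests to declare a failure, ten consecutive successes to declare a repair) amount to a small state machine. The following is a minimal sketch of that counting logic only; it is not how in.mpathd is implemented, and the function name track_link and the "ok"/"miss" tokens are made up for illustration.

```shell
#!/bin/sh
# Thresholds taken from the article text.
FAIL_THRESHOLD=5      # consecutive missed replies before a failure is declared
REPAIR_THRESHOLD=10   # consecutive replies before a repair is declared

# track_link takes one probe result per argument ("ok" or "miss")
# and prints each state transition, then the final state.
track_link() {
    state=up
    misses=0
    hits=0
    for result in "$@"; do
        if [ "$result" = ok ]; then
            hits=$((hits + 1))
            misses=0
        else
            misses=$((misses + 1))
            hits=0
        fi
        if [ "$state" = up ] && [ "$misses" -ge "$FAIL_THRESHOLD" ]; then
            state=failed
            echo "failure detected"
        elif [ "$state" = failed ] && [ "$hits" -ge "$REPAIR_THRESHOLD" ]; then
            state=up
            echo "repair detected"
        fi
    done
    echo "final state: $state"
}
```

For example, `track_link miss miss miss miss ok miss` reports only "final state: up", because the single reply resets the miss counter before it reaches five.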
IP is now using the original NIC. What we have called a private
address is really a deprecated IP address -- an address that
IP will not use unless explicitly told to. These private addresses
must be visible to the IP link. This usually implies that the private
IP address has the same network address as the echo request responding
node. At the least, the echo request responder must have echo response
turned on in the IP stack or in.mpathd will have no way of seeing
whether the NIC is down.
If there is no router on the network, then the echo request responder
must at least have the "all hosts" multicast address, 224.0.0.1,
active. The file /etc/default/mpathd is created during installation
and controls several aspects of in.mpathd's behavior, the most
important of which are detection time for a failure and whether
failback is allowed. Listing 1 shows /etc/default/mpathd with the
default comments removed.
FAILURE_DETECTION_TIME is set to 10000 milliseconds or 10 seconds
by default. This can be dropped as low as 100 ms for applications
with time-critical connectivity needs. This is the time to determine a NIC failure,
which is defined as five consecutive echo request failures. The
system therefore divides FAILURE_DETECTION_TIME into five approximately
equal time segments to do the pings. Of course, smaller values for
FAILURE_DETECTION_TIME place a higher load on the network. FAILBACK=yes
tells in.mpathd to go back to the original NIC if it determines
that it has been repaired. We did not work with TRACK_INTERFACES_ONLY_WITH_GROUPS.
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes is the default. If this
is no, in.mpathd will report on all failed NICs on the node even
if they are not in the multipath group. The network events discussed
above get logged in /var/adm/messages so the system activity tool
of your choice (e.g., swatch) can be set up to notify you as necessary.
See the References for more details than provided in this article.
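
To make the relationship between FAILURE_DETECTION_TIME and the probe rate concrete, here is a small sketch that reads an mpathd-style file and divides the detection time by five, as described above. The sample file holds the three settings named in this section at their default values; the function name probe_interval_ms and the scratch file path are assumptions for illustration, and the real spacing is only approximately equal.

```shell
#!/bin/sh
# Print the approximate gap between probes, in ms, for an mpathd-style file.
probe_interval_ms() {
    fdt=$(sed -n 's/^FAILURE_DETECTION_TIME=\([0-9][0-9]*\).*$/\1/p' "$1" | tail -n 1)
    [ -n "$fdt" ] || return 1     # setting not found
    echo $((fdt / 5))             # five probes per detection period
}

# Scratch copy with the default values discussed in the text
# (deliberately not the live /etc/default/mpathd).
sample="${TMPDIR:-/tmp}/mpathd.sample.$$"
cat > "$sample" <<'EOF'
FAILURE_DETECTION_TIME=10000
FAILBACK=yes
TRACK_INTERFACES_ONLY_WITH_GROUPS=yes
EOF

probe_interval_ms "$sample"       # 10000 ms / 5 probes, prints 2000
rm -f "$sample"
```

At the 100-ms minimum the same arithmetic gives a probe roughly every 20 ms per interface, which is why small values load the network and the host more heavily.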
Configuring a Hot Standby NIC (Command Line)
These are the steps at the command line to set up a NIC as a hot
standby to take over network traffic if the primary card fails. Listing
2 shows the commands and standard output for the configuration.
Step 1: Check the state of the current network interface. The
first ifconfig -a (command [1]) in Listing 2 shows that interface
iprb0 is up using 10.1.1.1 as part of the 10.1.1.0 network, so we
have verified that the system is in a normal network state.
Step 2: Configure the primary card; this is shown by command [2]
in Listing 2. We chose the unique address on the network of 10.1.1.200
to use as the private address. Since this was our first command
with group SERVER1, this created the IP multipathing group, named
it SERVER1 and added the NIC associated with iprb0 to it. It also
started the in.mpathd daemon. The addif 10.1.1.200 netmask 255.255.255.0
broadcast 10.1.1.255 added a logical interface, iprb0:1, to the
NIC. The -failover option marks 10.1.1.200 as a non-failover address.
That is, in.mpathd will not make a logical interface for it on another
NIC if this NIC should fail. The deprecated option marks
10.1.1.200 as not being available as a source address for outbound
packets unless explicitly asked for (bound). Finally, up
enables the logical interface just created. The second ifconfig
-a (command [3]) shows the successful completion of the command
with the new interface iprb0:1. Notice that the logical interface
iprb0:1 is DEPRECATED and NOFAILOVER.
Step 3: Configure the hot standby card; this is shown by command
[4] in Listing 2. We chose 10.1.1.201 for this address. The plumb
option opens the device and sets up the streams that IP needs to use
the NIC; 10.1.1.201 netmask 255.255.255.0 broadcast 10.1.1.255 sets up
the IP address for this NIC. The address is marked deprecated, made a
member of the group SERVER1, and set not to fail over if the NIC fails.
The standby option marks this NIC as a hot backup for a failed
NIC in the group SERVER1. Finally, the up option at the end
enables the interface. The final ifconfig -a (command [5])
shows iprb1 successfully configured to be a hot standby. Later,
we will describe some testing to determine how failover performs.
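
For readers without Listing 2 at hand, the two configuration commands from Steps 2 and 3 can be pieced together from the options described above. This is a reconstruction from the prose, so the exact option order is an assumption and Listing 2 remains the authority. The run helper and DRY_RUN switch are illustrative additions; by default the commands are only printed, since they need root on a Solaris host.

```shell
#!/bin/sh
# DRY_RUN=1 (the default) prints the commands instead of running them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 2: put iprb0 in group SERVER1 and add the private (test) address
# as a logical interface that is non-failover and deprecated.
configure_primary() {
    run ifconfig iprb0 group SERVER1 \
        addif 10.1.1.200 netmask 255.255.255.0 broadcast 10.1.1.255 \
        -failover deprecated up
}

# Step 3: plumb iprb1 with its private address and mark it as the
# hot standby for the same group.
configure_standby() {
    run ifconfig iprb1 plumb 10.1.1.201 netmask 255.255.255.0 \
        broadcast 10.1.1.255 deprecated group SERVER1 -failover standby up
}

configure_primary
configure_standby
```

Running the script with DRY_RUN=0 as root would apply the settings; running ifconfig -a afterward is the check the article uses to confirm the DEPRECATED and NOFAILOVER flags.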
Configuring a Hot Standby NIC (Startup Scripts)
Next, we will quickly show the equivalent syntax to preserve failover
across reboots. The idea is the same, but the syntax
differs considerably. Initializing a NIC is a relatively complex
activity controlled by the startup script /etc/init.d/network. This
script uses the file /etc/hostname.NICdriver to determine whether
a NIC is to be initialized. Normally, this contains only the hostname
associated with the NIC. This file becomes greatly expanded if you
use multipathing failover. Listing 3 shows first the contents of
/etc/hostname.iprb0, the primary NIC, and then /etc/hostname.iprb1,
the hot standby. Again, we don't need to manually start
in.mpathd; it is started when group appears in the configuration files.
One caution on the importance of proper syntax: on the 10.1.1.11
address, we left out the "up" the first time. This caused
DNS to fail, with an error about not being
able to find the hostname of the DNS server.
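
As a rough guide to what such files look like, the sketch below generates one-line /etc/hostname.* bodies in the general form used in Sun's multipathing documentation. It is not a copy of Listing 3: the function names are hypothetical, "netmask + broadcast +" (look the values up instead of spelling them out) is an assumption about style, and the data address 10.1.1.1 is taken from Step 1 of the command-line section. The files are written to a scratch directory, not to /etc.

```shell
#!/bin/sh
# Emit the one-line body of an /etc/hostname.<interface> file for a
# primary NIC. Arguments: data address (or hostname), test address, group.
# Note the "up" after the group name: it brings up the data address
# before addif starts describing the test address.
primary_line() {
    echo "$1 netmask + broadcast + group $3 up addif $2 deprecated -failover netmask + broadcast + up"
}

# Same for a hot standby NIC. Arguments: test address, group.
standby_line() {
    echo "$1 netmask + broadcast + deprecated group $2 -failover standby up"
}

# Write into a scratch directory so nothing under /etc is touched.
outdir="${OUTDIR:-${TMPDIR:-/tmp}/ipmp-sketch.$$}"
mkdir -p "$outdir"
primary_line 10.1.1.1 10.1.1.200 SERVER1 > "$outdir/hostname.iprb0"
standby_line 10.1.1.201 SERVER1 > "$outdir/hostname.iprb1"
cat "$outdir/hostname.iprb0" "$outdir/hostname.iprb1"
```

Generating the lines from a function makes it harder to drop a keyword such as the first "up", which is the kind of slip the caution above describes.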
IP Network Multipathing Testing
By following the above steps, we have configured a primary network
card and a hot standby. For initial testing, we changed the FAILURE_DETECTION_TIME
to the minimum 100 ms in /etc/default/mpathd. Listing 4 shows an
extract of the system event logging in /var/adm/messages as we put
traffic on the network and forced network failures and repairs by
pulling network cables. Note that as the system becomes loaded,
it cannot keep a failure detection time of 100 ms and reports what
it actually can do. Also, note the various messages about NIC failures,
successful failover to iprb1, and repair detection and failback
to iprb0. Our conclusion is that failover works as advertised although
very short failure detection times may not be supportable under
load.
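
A quick way to pull the multipathing entries out of the system log during this kind of test is to filter on the daemon name. The sketch below assumes only that in.mpathd tags its syslog entries with its own name; it deliberately does not match on message wording, which is shown in Listing 4. The function names are illustrative.

```shell
#!/bin/sh
# List in.mpathd entries from a syslog file (default: /var/adm/messages).
mpathd_events() {
    grep 'in\.mpathd' "${1:-/var/adm/messages}"
}

# Count them; prints 0 when there are none.
mpathd_event_count() {
    grep -c 'in\.mpathd' "${1:-/var/adm/messages}"
}
```

Calling mpathd_events with no argument reads the live log; passing a saved copy of the log as the first argument makes it easy to compare test runs.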
The other part of our testing focused on the connection impact
of various FAILURE_DETECTION_TIME values. Small values of the parameter
add greater network traffic in the form of more heartbeat pings
and might not even be supportable because of system load. We therefore
put the parameter back to its default 10000-ms value and checked
the normal UNIX connection utilities (telnet, ftp, ssh) for the
impact of failover. None of the utilities timed out with
a value of 10 seconds. We were able to repeatedly pull cables, going
back and forth between the cards, and never encountered a failure.
We transferred a 100-MB file with ftp through four failovers back
and forth without any problems. We conclude that the default 10-second
value for FAILURE_DETECTION_TIME is adequate for many applications
but that each application will need to be tested.
Configuring Multiple NICs (Startup Scripts)
Configuring the system for two active cards is very similar to
the standby setup. Listing 5 shows the /etc/hostname.iprb1 setup.
It is now a clone of /etc/hostname.iprb0 with different IP values.
The /etc/hostname.iprb0 file has no changes. With this setup, the
system can fail over and recover in either direction, and it has
the advantage of greater throughput because of the second active
NIC.
Conclusions
We found that Sun's IP Network Multipathing facility provides
a hot standby for NIC and network hardware failures. It can
be configured as a primary NIC with a hot standby or as two primary
NICs failing over to each other as necessary. This capability should
be useful to many people, especially for systems with very
high-availability requirements. It is very easy to configure your
system to fail over as required. The default timeout value of 10
seconds is adequate for many applications, but the required timeout
value for each application will have to be determined. Very small
timeout values may not be supportable by a loaded system. System
events are logged in /var/adm/messages.
References
IP Network Multipathing Administration Guide, Part 806-7931-10,
Sun Microsystems, Inc., Palo Alto, CA 94303-4900, April 2001.
Man page -- man in.mpathd
1. An IP link is a communication facility or medium over which
nodes can communicate at the link layer. This is the subnetwork
access layer of the TCP/IP network model or layers 1 and 2 of the
OSI model. Think LAN, or switch, or 10Base5 (thicknet) cable.
Brian Gollsneider is working on a PhD in Electrical Engineering
from the University of Maryland. When not buried in research, he
is a UNIX instructor for Learning Tree International. He can be
reached at: gollsneb@glue.umd.edu.
Arthur M. Messenger is a retired UNIX systems administrator
who occasionally answers questions for friends and works part time
for Learning Tree International. When not teaching, he lives with
his wife in Haymarket, Virginia where they spend time with their
grandchildren. He can be reached at: Arthur.Messenger@att.net.