Quick Network Redundancy Schemes
Leo Liberti
Simple bash scripting and IP aliasing can be used to implement
quick and easy host redundancy schemes based either on host availability
or service availability. In this article, I describe a very simple
way to implement such a scheme. The idea sprang from the need to
supply a simple redundancy scheme to an HTTP service we have at
IrisTech, where I work. The programming involved is limited to bash
scripting and the availability of standard UNIX network utilities.
This approach makes use of a feature of the Linux kernel, IP aliasing,
which provides a neat way to assign more than one IP address to
one network card; however, Linux is by no means the only operating
system with this property. And even on operating systems without
IP aliasing features, there are ways to circumvent this problem
(e.g., installing two network cards).
Redundancy is useful when you have a computer running a crucial
service. To lessen the probability of the service becoming unavailable,
one (or possibly more than one) other computer monitors the activity
of the service on the main computer. If the service becomes unavailable,
there can be one of two reasons:
1. The computer has crashed (or has lost the network connections);
2. The computer is alive but the process running the service has
died.
The monitoring computer must take different actions according
to the nature of the problem.
Initial Scenario
To put this idea into practice, I will first describe the network
setup needed to implement it. Host A (IP address 192.168.1.1 on
device eth0) supplies the important service. Host B (IP address
192.168.1.2 on device eth0) monitors A and takes action when
the important service becomes unavailable. To make things more precise
and readily applicable, A runs the HTTP service with an Apache Web
server running on port 80, and both A and B run Linux. The version
is largely unimportant, as long as it is later than 2.0.1. I advise
using the latest stable kernel (as of this writing, 2.2.17). One
last requirement is that the two hosts must be on the same IP network.
With tremendous effort and deviant hacks, this requirement can be
circumvented, but for this discussion I assume that A and B are
on the same network.
Taking Over a Crashed Server
The first step is to make B capable of taking over the service
provided by A. This means, first, that Apache must be installed on
B (even though it may not be running while A is functioning properly)
and, second, that B must mirror all data files necessary to the HTTP
service. Mirroring is easily implemented with a cron job running
rsync, rdist, or other synchronization software. At
IrisTech, we use cron on host B to periodically run a bash script
which:
- Mounts the data directory from A to B via NFS,
- Uses rsync to copy the data from A to B,
- Unmounts the NFS export.
Details
Add a line like the following:
00 4 * * * root /usr/local/sbin/mirror
in your /etc/crontab. This line runs the mirror
script at 4 a.m. every day. This is the mirror script:
#!/bin/sh
# Abort if the mount fails; otherwise rsync --delete would empty the mirror.
mount -t nfs A:/data /mnt/A || exit 1
sleep 25
rsync -a --delete /mnt/A/ /home/data-from-A
umount /mnt/A
Of course, the details of this script will vary with your network
setup.
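One risk of the --delete option deserves a guard: if /mnt/A turns out to be empty (for example because the NFS mount did not succeed), rsync would remove every file from the mirror. The following is only a sketch of such a guard; the helper name dir_has_files is an assumption and is not part of the original script.

```shell
#!/bin/bash
# Hypothetical helper (not in the original mirror script): succeeds only
# if directory $1 exists and contains at least one entry, hidden files
# included.
dir_has_files() {
    [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Intended use inside the mirror script, after the mount:
#   dir_has_files /mnt/A && rsync -a --delete /mnt/A/ /home/data-from-A
```

With this check in place, an empty source directory makes the script skip the copy instead of deleting the mirror.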
IP Aliasing
Now, host B must be made capable of assigning more than one IP
address to a single network interface. Under Linux, this can be
obtained by rebuilding a kernel with the IP_alias option.
Put the following line:
CONFIG_IP_ALIAS=y
in the file /usr/src/linux/.config, then run:
make dep; make clean; make bzImage; make modules; make modules_install
cp /usr/src/linux/arch/i386/boot/bzImage /boot/vmlinuz-ipalias
Insert the relevant lines in /etc/lilo.conf, and run lilo;
if it produces no errors, reboot. On machines or operating systems
without this feature, a second network card can be installed on
host B. At this point, B is ready to monitor A and take action should
A fail. Recall that B should be able to tackle two problems: a)
A crashes or loses the network connection, and b) A loses the service.
I will first show how to monitor and take action for the first problem
(a).
Monitoring the Network Connection
Host A can be monitored with ping. We embed
this command in a bash function, pinghost:
function pinghost() {
    ping -c 1 "$1" > /dev/null 2>&1
}
The option -c 1 means "send just one
ICMP probe packet"; the default behavior would be to
keep sending packets indefinitely towards the target host (in this
case, the placeholder $1 is used as a target host for ping).
Since we are only interested in the return value of pinghost(),
we send standard output and standard error to /dev/null.
The main bash script runs a loop that calls pinghost() continuously
and exits the loop when the target host stops responding to pings:
while pinghost A
do
    sleep 10
done
The instruction sleep 10 in the main loop body ensures
that ping packets are not sent too frequently. In this case, they
are sent every 10 seconds. This parameter can be varied as needed.
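A single lost packet would end the loop above and trigger a takeover. One possible refinement, sketched here under the assumption that a few consecutive failures are a better signal than one, is to count failures before giving up. The name wait_until_down and its argument order are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: run a probe command repeatedly and return only
# after it has failed $1 times in a row, sleeping $2 seconds between
# tries. The remaining arguments are the probe command and its arguments.
wait_until_down() {
    local max_failures=$1 interval=$2 failures=0
    shift 2
    while [ "$failures" -lt "$max_failures" ]; do
        if "$@"; then
            failures=0                      # any success resets the counter
        else
            failures=$((failures + 1))
        fi
        if [ "$failures" -lt "$max_failures" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Intended use with the pinghost function from the article:
#   wait_until_down 3 10 pinghost A
```

The trade-off is detection time: with three failures and a 10-second interval, B reacts roughly 20 seconds later than with a single probe.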
Substituting the Host
In short, pinghost() is used as a functionality test inside
a loop. If the test keeps succeeding, we wait 10 seconds, then go
back to the start of the loop and repeat the test. If the test fails,
we exit the loop and take action: B takes over A.
apachectl graceful
ifconfig eth0:0 192.168.1.1 up
The first command makes sure Apache is running on host B. The second
configures B to respond to two IP addresses on the same network card
(thanks to the IP_alias kernel feature): its own (on eth0) and
the IP address previously assigned to A (on eth0:0).
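Because a takeover changes the network configuration of B, it can be useful to rehearse it without touching the system. The sketch below wraps the two commands in a function with a dry-run switch; the names run, take_over, and DRY_RUN are assumptions introduced for illustration, and the address is A's from the Initial Scenario.

```shell
#!/bin/bash
# Hypothetical wrapper: with DRY_RUN=1 the command is printed instead
# of executed.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1: interface alias (e.g. eth0:0), $2: IP address of the failed host
take_over() {
    run apachectl graceful &&
    run ifconfig "$1" "$2" up
}

# Rehearsal that changes nothing on the system:
#   DRY_RUN=1 take_over eth0:0 192.168.1.1
```

Without DRY_RUN set, the function runs the same two commands as shown in the text.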
The Complete Bash Script
The bash script shown in Listing 1 is an embellishment of the
principle expounded above with some user-definable parameters and
some logging facilities. This bash script must be installed as /usr/local/sbin/monitor;
it depends on directories /usr/local/lib/monitor/log and
/usr/local/lib/monitor/bin. The "take action"
part of the scheme is included in a separate executable bash script,
A-noping, to be placed in /usr/local/lib/monitor/bin/,
where A is specified as the first argument on the command line of
monitor. The second argument specifies whether monitor
must continue monitoring after taking action (yes) or not
(no). It is also possible to have monitor simply record
the event of the crash of A without actually taking any action by
making the "take action" script /usr/local/lib/monitor/bin/A-noping
not executable. (Listings for this article are available from the
Sys Admin Web site: http://www.sysadminmag.com.)
A typical monitor invocation would be:
backup.mydomain.com:~# /usr/local/sbin/monitor www.mydomain.com \
yes
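Listing 1 is not reproduced here, but the dispatch rule described above (run the action script only if it is executable, otherwise just record the event) can be sketched as follows. The function name handle_failure, its arguments, and the log message wording are assumptions, not taken from the listing.

```shell
#!/bin/bash
# Sketch of the dispatch step only; this is not the code of Listing 1.
# $1: directory holding action scripts, $2: monitored host, $3: log file
handle_failure() {
    local action="$1/$2-noping"
    echo "$(date): $2 stopped answering pings" >> "$3"
    if [ -x "$action" ]; then
        "$action" >> "$3" 2>&1
    else
        echo "$(date): $action not executable, no action taken" >> "$3"
    fi
}

# Possible call, using the directories named in the article and a
# hypothetical log file name:
#   handle_failure /usr/local/lib/monitor/bin www.mydomain.com \
#       /usr/local/lib/monitor/log/www.mydomain.com.log
```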
Taking Over a Crashed Service
I will now show how to tackle service problems when A does not
lose its network connections. There are two possible approaches:
1. B tells A to respawn the Apache process.
2. B tells A to bring down its network interface and then B takes
the place of A as in the previous case.
The best results are often obtained by using a combination of
these two approaches. The ideal situation would be as follows: B
checks whether the Apache process is present on A. If it is not,
it tries to respawn Apache on A. If it is present, the problem is
not easy to solve -- we tell A to lose its network connection
and take over its IP address with the IP_alias technique
described above.
Executing Commands Remotely
Security Issues -- All of these approaches make use
of the concept "B tells A to do something".
Furthermore, everything B tells A to do requires root privileges
on A. To keep this task reasonably simple, the root password of
host A must be stored on host B in plain text. This is a very bad
idea in regard to security. I recommend that you only implement
this scheme if:
- Your network is behind a very strict firewall;
- User access to hosts A, B and the network they are on is limited
and easily controllable;
- The data on host A is not terribly important;
- The root password on host A is different from all other root
passwords on the network.
If any of these requirements are not met, I suggest you abandon
this "poor man's redundancy scheme" and
go for something more robust.
Assuming all of the above requirements are met (or you find a
way to ensure root privileges to B on A without storing the root
password in plain text), we can go on to modify the monitor
script to take care of the new problem.
Using rexec -- I will explain how to make B give root
commands to A in a totally automated way. To this purpose, I will
make use of the rexec service.
1. Make sure the rexec daemon is installed in the inetd
services (check /etc/inetd.conf).
2. Make a file .netrc in the home directory of the root
user (usually /root) on host B with the following lines:
machine A
login root
password root_password_on_A
and give it permissions 400.
3. Test this setup by running an ls command remotely:
root@B:~# rexec A ls
The weak link in the security chain is the .netrc file
where the root password of host A is stored in plain text. Keep
it as safeguarded as possible.
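One small safeguard is to make sure the file is never readable by other users, not even briefly while it is being written. The helper below is a sketch under that assumption; the name write_netrc and its argument order are hypothetical, and the password still ends up in plain text on disk.

```shell
#!/bin/bash
# Hypothetical helper that writes a .netrc entry with restrictive
# permissions from the start.
# $1: target file, $2: remote host, $3: login, $4: password
write_netrc() {
    (
        umask 077       # new file is never group- or world-readable
        printf 'machine %s\nlogin %s\npassword %s\n' "$2" "$3" "$4" > "$1"
    ) && chmod 400 "$1"
}

# Possible call on host B, as root:
#   write_netrc /root/.netrc A root root_password_on_A
```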
Monitoring a Service
I will now explain how host B can monitor the HTTP service on
host A. To monitor any service, one simply connects to that service
and transfers some data. If all goes well, one assumes the service
is up and running. To connect to a service manually, the normal
telnet client may typically be used. Unfortunately, telnet
cannot be automated, because it cannot accept commands from standard
input. The alternative is netcat (http://www3.l0pht.com/~weld/netcat/).
Along with the freely available source code, you can find netcat
precompiled and packaged for most Linux distributions. After netcat
is compiled and installed, the binary executable nc should
be placed in a directory in your PATH (usually /usr/local/bin).
Thus, we have to substitute the monitoring function pinghost()
of the script above with the following:
function pingservice() {
    echo "GET /" | nc "$1" 80 | grep -qs "<HTML>"
}
This means connect to port 80 (HTTP) of the host specified in
the function argument and execute the command GET /; then
scan the output for the string "<HTML>".
If it is found, return 0 (success), otherwise return non-zero (failure).
The options -q and -s tell grep to suppress all visible
output, including error messages.
Service monitoring will take place as in the previous case via
a main loop:
while pingservice A ; do
    sleep 10
done
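Searching for a literal tag depends on how the page happens to be written. An alternative, sketched here on the assumption that the server answers HTTP/1.0 requests and that status 200 is the expected reply for "/", is to inspect the status line instead. The names status_is_ok and pingservice_status are hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read an HTTP response on standard input and
# succeed only if the first line reports status 200.
status_is_ok() {
    head -n 1 | grep -qs '^HTTP/1\.[01] 200'
}

# Variant of pingservice() built on it. The -w 5 option makes nc give up
# after 5 seconds, so a hung server cannot stall the monitor.
pingservice_status() {
    printf 'HEAD / HTTP/1.0\r\n\r\n' | nc -w 5 "$1" 80 | status_is_ok
}

# Drop-in use in the main loop:
#   while pingservice_status A ; do sleep 10 ; done
```

A server that redirects "/" would answer with a different status code, in which case the pattern would need adjusting.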
Taking Action
If the monitoring test fails, the main loop is exited and some
action must be taken. As mentioned previously, the strategy to be
implemented is the following:
- Host B checks the process table on host A and verifies whether
the Apache process is present.
- If Apache is not running, respawn the Apache process and try
the service monitoring test again. If it succeeds, go back to
the main loop; otherwise, continue with next step.
- The problem is not easy to solve -- Host B tells host A
to lose its network connection and then takes over its IP address,
effectively "becoming" host A.
Checking Remote Processes
Checking the process table for Apache is easy:
rexec A ps -e | grep -v grep | grep -qs httpd
This command means execute ps -e on host A, then scan the output
for the string "httpd". If it is found, return
0 (success), otherwise return non-zero (failure). The middle pipe,
grep -v grep, is a technicality; it removes all grep processes from
the process list. If you skip this, you risk finding
the string "httpd" in the process table even
when httpd is not running (which would defeat the whole purpose
of this check). As for the option "-e"
to the remote ps command, it means "list all processes".
If it does not work for you, check your local ps man page for
the correct option.
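A substring search also matches any process whose name merely contains "httpd". A stricter check, sketched here on the assumption that the command name is the last field of each ps -e line, compares that field exactly. The helper name has_process is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: read `ps -e` output on standard input and succeed
# only if the last field of some line is exactly the name given in $1.
has_process() {
    awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Intended use with the remote listing described above:
#   rexec A ps -e | has_process httpd
```

Since only an exact name matches, the separate grep -v grep stage is not needed with this variant.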
Decisions
Now we implement the logic to make the necessary choices. In bash,
we check return values of a command via the construct:
if command ; then commands ; fi
In this case, we use the following script excerpt:
FLAG=1    # 1 means B must take over A's address
if ! rexec A ps -e | grep -v grep | grep -qs httpd ; then
    # Apache is not running on A: restart it and test the service again
    rexec A apachectl graceful
    sleep 1
    if pingservice A ; then
        FLAG=0
    fi
fi
if [ "$FLAG" -eq 1 ] ; then
    rexec A ifconfig eth0 down
    apachectl graceful
    ifconfig eth0:0 192.168.1.1 up
fi
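The same decision can be packaged as a function whose return value tells the caller whether to keep monitoring. This is only a sketch: the four helpers it calls (apache_running, restart_apache, pingservice, take_over) are assumed to be defined elsewhere, for instance with the rexec and ifconfig commands shown in this article, and RECOVER_WAIT is a hypothetical tuning variable.

```shell
#!/bin/bash
# Sketch of the recovery decision as one function. The helpers
# apache_running, restart_apache, pingservice and take_over are assumed
# to exist; they are not defined here.
# Returns 0 if host $1 is serving again, 1 if B had to take over.
recover() {
    if ! apache_running "$1"; then
        restart_apache "$1"
        sleep "${RECOVER_WAIT:-1}"      # give Apache a moment to start
        if pingservice "$1"; then
            return 0
        fi
    fi
    take_over "$1"
    return 1
}

# Possible use after the main loop exits:
#   if recover A ; then echo "A is back, keep monitoring" ; fi
```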
The Complete Bash Script
Listing 2 is a modification of the monitor script above
called service-monitor, which completes the discussion. The
main difference with the monitor script is that the "take
action" script is now part of the main script. Consequently,
I did not include a device for continuing to monitor the service.
If service-monitor succeeds in resuming the service on host
A, it keeps monitoring it; otherwise it automatically takes its
IP address and then exits. (Otherwise it would just monitor itself,
and if the service test should fail for whatever reason, this script
would take host B off the network -- a most unpleasant situation.)
Further Development
In this article, I tried to show, rather than the scripts themselves,
the general approach to quick redundancy schemes and some of the
techniques necessary to implement them. The scripts can be modified
to accomplish lots of other things, such as better logging, mailing
systems administrators automatically when crashes occur, secure
communication between the hosts, or even many-hosts redundancy,
where lots of hosts monitor each other in order to supply the same
service.
Leo Liberti graduated in Mathematics from Imperial College,
London, in 1992 and then received an M.Sc. in Mathematics from Turin
University, Italy. He is now a research assistant and part-time
system administrator at Imperial College, and the Technical Director
at IrisTech, Como, Italy, an Italian firm that supplies customers
with Web-based and electronic services. Leo Liberti can be reached
at: liberti@iris-tech.net.