Quick Network Redundancy Schemes

Leo Liberti

Simple bash scripting and IP aliasing can be used to implement quick and easy host redundancy schemes based either on host availability or service availability. In this article, I describe a very simple way to implement such a scheme. The idea sprang from the need to supply a simple redundancy scheme to an HTTP service we have at IrisTech, where I work. The programming involved is limited to bash scripting and the availability of standard UNIX network utilities. This approach makes use of a feature of the Linux kernel, IP aliasing, which provides a neat way to assign more than one IP address to one network card; however, Linux is by no means the only operating system with this property. And even on operating systems without IP aliasing features, there are ways to circumvent this problem (e.g., installing two network cards).

Redundancy is useful when you have a computer running a crucial service. To lessen the probability of the service becoming unavailable, one (or possibly more than one) other computer monitors the activity of the service on the main computer. If the service becomes unavailable, there can be one of two reasons:

1. The computer has crashed (or has lost the network connections);

2. The computer is alive but the process running the service has died.

The monitoring computer must take different actions according to the nature of the problem.

Initial Scenario

To put this idea into practice, I will first describe the network setup needed to implement it. Host A (IP address 192.168.1.1 on device eth0) supplies the important service. Host B (IP address 192.168.1.2 on device eth0) monitors A and takes action when the important service becomes unavailable. To make things more precise and readily applicable, A runs the HTTP service with an Apache Web server listening on port 80, and both A and B run Linux. The kernel version is largely unimportant, as long as it is later than 2.0.1. I advise using the latest stable kernel (as of this writing, 2.2.17). One last requirement is that the two hosts must be on the same IP network. With tremendous effort and deviant hacks, this requirement can be circumvented, but for this discussion I assume that A and B are on the same network.

Taking Over a Crashed Server

The first step is to make B capable of taking over the service provided by A. This means, first, that Apache must be installed on B (even though it may not be running while A is functioning properly) and, second, that B must mirror all data files necessary to the HTTP service. Mirroring can be easily implemented with a cron job running rsync, rdist, or other synchronization software. At IrisTech, we use cron on host B to periodically run a bash script which:

Mounts the data directory from A to B via NFS,
Uses rsync to copy the data from A to B,
Unmounts the NFS export.

Details

Add a line like the following:

00 4 * * * root /usr/local/sbin/mirror

in your /etc/crontab. This means run the mirror script at 4 am each morning. This is the mirror script:

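#!/bin/sh
# Mount A's export; stop here if the mount fails, otherwise
# rsync --delete would empty the mirror.
mount -t nfs A:/data /mnt/A || exit 1
sleep 25
rsync -a --delete /mnt/A/ /home/data-from-A
umount /mnt/A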

Of course, the details of this script will vary with your network setup.
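Because rsync runs with --delete, an empty source directory (for instance, an export that mounted but contains nothing) would remove every file from the mirror. A minimal sketch of a guard against this, assuming the paths used above; the helper name dir_not_empty is hypothetical and not part of the original script:

```shell
#!/bin/sh
# Succeed only if the directory exists and holds at least one entry
# (hidden files count, thanks to ls -A).
dir_not_empty() {
  [ -d "$1" ] && [ -n "$(ls -A "$1" 2>/dev/null)" ]
}

# Hypothetical use inside the mirror script:
#   dir_not_empty /mnt/A && rsync -a --delete /mnt/A/ /home/data-from-A
```

The check is deliberately crude: it only tells an empty directory from a populated one, which is enough to stop the most destructive case.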

IP Aliasing

Now, host B must be made capable of assigning more than one IP address to a single network interface. Under Linux, this can be obtained by rebuilding a kernel with the IP_alias option. Put the following line:

CONFIG_IP_ALIAS=y

in the file /usr/src/linux/.config, then run:

make dep; make clean; make bzImage; make modules; make modules_install
cp /usr/src/linux/arch/i386/boot/bzImage /boot/vmlinuz-ipalias

Insert the relevant lines in /etc/lilo.conf, and run lilo; if it produces no errors, reboot. On machines or operating systems without this feature, a second network card can be installed on host B. At this point, B is ready to monitor A and take action should A fail. Recall that B should be able to tackle two problems: a) A crashes or loses the network connection, and b) A loses the service. I will first show how to monitor and take action for the first problem (a).
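Before rebuilding, it may be worth checking whether an existing kernel configuration already enables the option. The following is a small sketch, assuming the configuration file lives at the path used above; the function name has_ip_alias is hypothetical:

```shell
#!/bin/sh
# Succeed if the given kernel configuration file enables IP aliasing.
# With no argument, the path used in the article is checked.
has_ip_alias() {
  grep -q '^CONFIG_IP_ALIAS=y' "${1:-/usr/src/linux/.config}"
}

# Hypothetical use:
#   has_ip_alias || echo "IP aliasing is not enabled; rebuild the kernel"
```

A commented-out entry such as "# CONFIG_IP_ALIAS is not set" does not match, because the pattern is anchored to the start of the line.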

Monitoring the Network Connection

Monitoring of A can be obtained by using ping. We embed this command in a bash function pinghost:

function pinghost() {
  ping -c 1 $1 > /dev/null 2>&1
}

The option -c 1 means ''send just one ICMP probe packet''; the default behavior would be to keep sending packets indefinitely towards the target host (in this case, the placeholder $1 is used as a target host for ping). Since we are only interested in the return value of pinghost(), we send standard output and standard error to /dev/null. The main bash script runs a loop that calls pinghost() continuously and exits the loop when the target host stops responding to pings:

while pinghost A
  do
  sleep 10
done

The instruction sleep 10 in the main loop body ensures that ping packets are not sent too frequently. In this case, they are sent every 10 seconds. This parameter can be varied as needed.
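With the loop as written, a single lost ICMP packet is enough to end the monitoring. One possible refinement, sketched here under the assumption that a few consecutive failures are a better sign of a crash than one, is to count failed probes. The names wait_until_down, MAX_FAILS, and INTERVAL are hypothetical and do not appear in the listings:

```shell
#!/bin/sh
# Hypothetical tunables (not part of the original listings).
MAX_FAILS=${MAX_FAILS:-3}   # consecutive failed probes before giving up
INTERVAL=${INTERVAL:-10}    # seconds between probes

# Same probe as in the article: one ICMP packet, output discarded.
pinghost() {
  ping -c 1 "$1" > /dev/null 2>&1
}

# Block while the host answers; return once MAX_FAILS probes in a row fail.
wait_until_down() {
  fails=0
  while [ "$fails" -lt "$MAX_FAILS" ]; do
    if pinghost "$1"; then
      fails=0                 # any success resets the counter
    else
      fails=$((fails + 1))
    fi
    # Do not sleep after the final failed probe.
    [ "$fails" -lt "$MAX_FAILS" ] && sleep "$INTERVAL"
  done
  return 0
}
```

Calling wait_until_down A would then replace the while loop; the price is that a real crash is noticed a few probe intervals later.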

Substituting the Host

In short, pinghost() is used as a functionality test inside a loop. If the test keeps succeeding, we wait 10 seconds, then go back to the start of the loop and repeat the test. If the test fails, we exit the loop and take action: B takes over A.

apachectl graceful
ifconfig eth0:0 192.168.1.1 up

The first command makes sure Apache is running on host B. The second configures B to respond to two IP addresses (thanks to the IP_alias kernel feature) on the same network card: its own (on eth0) and the IP address previously assigned to A (on eth0:0).
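The takeover can be wrapped in a function, together with its inverse for the day host A comes back. This is only a sketch, assuming the addresses of the initial scenario; the names takeover, release, ALIAS_DEV, A_ADDR, and the RUN dry-run switch are hypothetical:

```shell
#!/bin/sh
# Hypothetical settings; adjust to the local network.
ALIAS_DEV=${ALIAS_DEV:-eth0:0}   # alias interface on host B
A_ADDR=${A_ADDR:-192.168.1.1}    # address normally held by host A
RUN=${RUN:-}                     # set RUN=echo to print instead of execute

# Become host A: make sure Apache runs locally, then claim A's address.
takeover() {
  $RUN apachectl graceful &&
  $RUN ifconfig "$ALIAS_DEV" "$A_ADDR" up
}

# Give the address back once host A has been repaired.
release() {
  $RUN ifconfig "$ALIAS_DEV" down
}
```

Setting RUN=echo prints the commands without running them, which is a cheap way to rehearse the procedure on a production machine.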

The Complete Bash Script

The bash script shown in Listing 1 is an embellishment of the principle expounded above with some user-definable parameters and some logging facilities. This bash script must be installed as /usr/local/sbin/monitor; it depends on directories /usr/local/lib/monitor/log and /usr/local/lib/monitor/bin. The ''take action'' part of the scheme is included in a separate executable bash script, A-noping, to be placed in /usr/local/lib/monitor/bin/, where A is specified as the first argument on the command line of monitor. The second argument specifies whether monitor must continue monitoring after taking action (yes) or not (no). It is also possible to have monitor simply record the event of the crash of A without actually taking any action by making the ''take action'' script /usr/local/lib/monitor/bin/A-noping not executable. (Listings for this article are available from the Sys Admin Web site: http://www.sysadminmag.com.)

A typical monitor invocation would be:

backup.mydomain.com:~# /usr/local/sbin/monitor www.mydomain.com \
    yes
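The rule that a non-executable take-action script means "log only" can be sketched as follows. This is not the code of Listing 1, merely an illustration of the rule; the directory names come from the text above, while the function names and the log file name are assumptions:

```shell
#!/bin/sh
# Directories named in the article; overridable here for testing.
BINDIR=${BINDIR:-/usr/local/lib/monitor/bin}
LOGDIR=${LOGDIR:-/usr/local/lib/monitor/log}

# Append a timestamped line to a per-host log file (hypothetical name).
log_event() {
  echo "$(date '+%Y-%m-%d %H:%M:%S') $1: $2" >> "$LOGDIR/$1.log"
}

# Run the take-action script for a host if it is executable;
# otherwise only record that the host stopped answering.
take_action() {
  script="$BINDIR/$1-noping"
  if [ -x "$script" ]; then
    log_event "$1" "no ping reply, running $script"
    "$script"
  else
    log_event "$1" "no ping reply, no executable action script"
  fi
}
```

With this arrangement, chmod -x on the take-action script switches the monitor to a passive, logging-only mode without editing anything.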

Taking Over a Crashed Service

I will now show how to tackle service problems when A does not lose its network connections. There are two possible approaches:

1. B tells A to respawn the Apache process.

2. B tells A to bring down its network interface and then B takes the place of A as in the previous case.

The best results are often obtained by using a combination of these two approaches. The ideal situation would be as follows: B checks whether the Apache process is present on A. If it is not, B tries to respawn Apache on A. If it is present, the problem is not easy to solve -- we tell A to drop its network connection and take over its IP address with the IP_alias technique described above.

Executing Commands Remotely

Security Issues -- All of these approaches make use of the concept ''B tells A to do something''. Furthermore, everything B tells A to do requires root privileges on A. To keep this task reasonably simple, the root password of host A must be stored on host B in plain text. This is a very bad idea in regard to security. I recommend that you only implement this scheme if:

Your network is behind a very strict firewall;
User access to hosts A, B and the network they are on is limited and easily controllable;
The data on host A is not terribly important;
The root password on host A is different from all other root passwords on the network.

If any of these requirements are not met, I suggest you abandon this ''poor man's redundancy scheme'' and go for something more robust.

Assuming all of the above requirements are met (or you find a way to ensure root privileges to B on A without storing the root password in plain text), we can go on to modify the monitor script to take care of the new problem.

Using rexec -- I will explain how to make B give root commands to A in a totally automated way. For this purpose, I will make use of the rexec service.

1. Make sure the rexec daemon is installed in the inetd services (check /etc/inetd.conf).

2. Make a file .netrc in the home directory of the root user (usually /root) on host B with the following lines:

machine A
login root
password root_password_on_A

and give it permissions 400.

3. Test this setup by running an ls command remotely:

root@B:~# rexec A ls

The weak link in the security chain is the .netrc file where the root password of host A is stored in plain text. Keep it as safeguarded as possible.
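One way to make sure the file is never readable by others, not even for an instant, is to create it under a restrictive umask. A minimal sketch; the function name write_netrc and its argument order are hypothetical:

```shell
#!/bin/sh
# Write a one-entry .netrc that only its owner can read.
# Arguments: target file, remote host, login, password.
# Any previous content of the target file is overwritten.
write_netrc() {
  (
    umask 077    # the file is created without group/other permissions
    printf 'machine %s\nlogin %s\npassword %s\n' "$2" "$3" "$4" > "$1"
  ) && chmod 400 "$1"
}

# Hypothetical use on host B:
#   write_netrc /root/.netrc A root root_password_on_A
```

The subshell keeps the umask change local, so the rest of the calling script is unaffected.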

Monitoring a Service

I will now explain how host B can monitor the HTTP service on host A. To monitor any service, one simply connects to that service and transfers some data. If all goes well, one assumes the service is up and running. To connect to a service manually, the normal telnet client may typically be used. Unfortunately, telnet cannot be automated, because it cannot accept commands from standard input. The alternative is netcat (http://www3.l0pht.com/~weld/netcat/). Along with the freely available source code, you can find netcat precompiled and packaged for most Linux distributions. After netcat is compiled and installed, the binary executable nc should be placed in a directory in your PATH (usually /usr/local/bin).

Thus, we have to substitute the monitoring function pinghost() of the script above with the following:

function pingservice() {
  echo "GET /" | nc "$1" 80 | grep -qsi "<HTML>"
}

This means connect to port 80 (HTTP) of the host specified in the function argument and execute the command GET /; then scan the output, ignoring case, for the string ''<HTML>''. If it is found, return 0 (success), otherwise return non-zero (failure). The grep options -q and -s mean ''suppress all visible output''; -i makes the match case-insensitive, because many pages write the tag in lower case.

Service monitoring will take place as in the previous case via a main loop:

while pingservice A ; do
  sleep 10
done
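Looking for a tag in the page body ties the test to the content of the home page. An alternative, sketched here as an assumption rather than as the method of Listing 2, is to send a complete HTTP/1.0 request and inspect the status line. The names http_ok and pingservice_status are hypothetical; -w is the netcat timeout option:

```shell
#!/bin/sh
# Succeed if the first line on standard input is an HTTP status line
# carrying a 2xx or 3xx code.
http_ok() {
  IFS= read -r status || [ -n "$status" ] || return 1
  case "$status" in
    HTTP/1.[01]\ [23][0-9][0-9]*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical variant of pingservice: full request, 5-second timeout.
pingservice_status() {
  printf 'GET / HTTP/1.0\r\n\r\n' | nc -w 5 "$1" 80 | http_ok
}
```

The timeout matters in a monitoring loop: without it, a server that accepts the connection but never answers would block the test indefinitely.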

Taking Action

If the monitoring test fails, the main loop is exited and some action must be taken. As mentioned previously, the strategy to be implemented is the following:

Host B checks the process table on host A and verifies whether the Apache process is present.
If Apache is not running, respawn the Apache process and try the service monitoring test again. If it succeeds, go back to the main loop; otherwise, continue with the next step.
The problem is not easy to solve -- Host B tells host A to lose its network connection and then takes over its IP address, effectively ''becoming'' host A.

Checking Remote Processes

Checking the process table for Apache is easy:

rexec A ps -e | grep -v grep | grep -qs httpd

This command means execute ps -e on host A, then scan the output for the string ''httpd''. If it is found, return 0 (success), otherwise return non-zero (failure). The middle pipe, grep -v grep, is a technicality; it removes from the process list all grep processes. If you skip this, you risk finding the string ''httpd'' on the process table even when httpd is not running (which would defeat the whole purpose of this check). As for the option ''-e'' to the remote ps command, it means ''list all processes''. If it does not work for you, check your local ps man page for the correct option.
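Matching a substring can also be fooled by unrelated processes whose name merely contains ''httpd''. A possible refinement, assuming the usual ps -e layout with the command name in the last column, compares that column exactly. The function name has_process is hypothetical:

```shell
#!/bin/sh
# Succeed if the ps listing on standard input contains a process whose
# command name (last column of "ps -e") is exactly the given name.
has_process() {
  awk -v name="$1" '$NF == name { found = 1 } END { exit !found }'
}

# Intended use on host B (requires the rexec setup described above):
#   rexec A ps -e | has_process httpd
```

Because the function only reads standard input, it can be tried out on a saved process listing before it is wired to rexec.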

Decisions

Now we implement the logic to make the necessary choices. In bash, we check return values of a command via the construct:

if command ; then commands ; fi

In this case, we use the following script excerpt:

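FLAG=1
# Apache process missing on A: respawn it and test the service again
if ! rexec A ps -e | grep -v grep | grep -qs httpd ; then
  rexec A apachectl graceful
  sleep 1
  if pingservice A ; then
    FLAG=0
  fi
fi
# Service still down: take A off the network and take over its address
if [ "$FLAG" -eq 1 ] ; then
  rexec A ifconfig eth0 down
  apachectl graceful
  ifconfig eth0:0 192.168.1.1 up
fi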
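The same decision can be isolated in a function that receives the three probes as arguments, which makes it possible to rehearse the logic without touching a real host. This is a sketch only; decide and the probe names passed to it are hypothetical:

```shell
#!/bin/sh
# Arguments:
#   $1 host
#   $2 name of a command answering "is Apache in the process table?"
#   $3 name of a command that respawns Apache
#   $4 name of a command answering "does the service respond?"
# Prints "resume" if monitoring can continue, "takeover" otherwise.
decide() {
  if ! "$2" "$1"; then      # Apache process missing on the host
    "$3" "$1"               # try to respawn it
    if "$4" "$1"; then      # service answers again
      echo resume
      return 0
    fi
  fi
  echo takeover             # process present but mute, or respawn failed
}
```

In production, the three arguments would be small wrappers around the rexec and pingservice commands shown above; in a test, they can be trivial functions that simply succeed or fail.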

The Complete Bash Script

Listing 2 is a modification of the monitor script above called service-monitor, which completes the discussion. The main difference from the monitor script is that the ''take action'' script is now part of the main script. Consequently, I did not include a device for continuing to monitor the service. If service-monitor succeeds in resuming the service on host A, it keeps monitoring it; otherwise it automatically takes A's IP address and then exits. (If it kept running, it would just monitor itself, and if the service test should fail for whatever reason, this script would take host B off the network -- a most unpleasant situation.)

Further Development

In this article, I tried to show, rather than the scripts themselves, the general approach to quick redundancy schemes and some of the techniques necessary to implement them. The scripts can be modified to accomplish lots of other things, such as better logging, mailing systems administrators automatically when crashes occur, secure communication between the hosts, or even many-hosts redundancy, where lots of hosts monitor each other in order to supply the same service.

Leo Liberti graduated in Mathematics from Imperial College, London, in 1992 and then received an M.Sc. in Mathematics from Turin University, Italy. He is now a research assistant and part-time system administrator at Imperial College, and the Technical Director at IrisTech, Como, Italy, an Italian firm that supplies customers with Web-based and electronic services.