Keeping Your Web Content in Sync
Adam Olson
This article is all about keeping the content in your Web server
farm synchronized with rsync. rsync is a very handy program that
provides a simple way to mirror content across a number of machines.
I'll show how to design a straightforward content push system
to keep front-end Web server content synchronized. There are plenty
of ways to utilize a program like rsync; this is just one of them.
Obtaining and Building rsync
rsync was written by Andrew Tridgell and Paul Mackerras; the current
version is 2.4.6. Download the compressed source tarball from
http://rsync.samba.org. I ran the following commands on a system
running Solaris 7, and the compilation went smoothly:
# gzip -dc rsync-2.4.6.tar.gz | tar xvf -
# cd rsync-2.4.6
# ./configure
# make
# make install
This installs the rsync binary in /usr/local/bin, along with
the man pages. You will need to repeat this process on every
host involved.
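One way to confirm the install on every host is a quick version check over ssh (ssh is covered later in this article). The following is a minimal sketch rather than part of the build procedure: the host list, the check_rsync helper, and the RUN switch are assumptions made for illustration, and the path assumes the default /usr/local prefix. As written, it only prints the commands it would run.

```shell
#!/bin/sh
# HOSTS is a hypothetical list of machines; replace it with your own.
HOSTS=${HOSTS:-"www1 www2"}
# RUN is a prefix placed before each command. The default "echo" only
# prints the command; set RUN="" to execute it for real.
RUN=${RUN-echo}

# check_rsync is a hypothetical helper: ask each host for its rsync version.
check_rsync() {
    for host in $HOSTS; do
        $RUN ssh "$host" /usr/local/bin/rsync --version ||
            echo "version check failed on $host"
    done
}

check_rsync
```

Running it with RUN="" performs the real check, so a host that was missed during installation shows up as a failed line.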
More on Our Goal
One example of a cookie-cutter Web tier is a design where a number
of front-end Web servers all serve up identical content and the
rest is handled via calls to a back-end database of some kind. Traffic
is load balanced across the Web servers using a method such as DNS
round robin or, if possible, a hardware solution. Because the Web
servers all have the same content tree, using rsync to maintain
these structures from a central distribution point provides a clean
and easy way to maintain the content.
More on rsync
Why does rsync work so well in this configuration? Here are some
of the key factors:
1. You can use ssh as the underlying transport mechanism.
This means you get added security without a lot of extra work. ssh
handles all of the authentication, which is a lot better than leaving
it up to a clear-text protocol like rlogin.
2. Entire filesystems or individual directories can be updated,
therefore making it easy to mirror your document root and subdirectories
to a number of destination hosts.
3. It preserves symbolic and hard links, ownership, permissions,
etc. For example, if rsync is preserving file ownership, the UIDs
of the transferred files will remain the same instead of being owned
by the account initiating the transfer.
rsync also includes an algorithm for determining which portions
of a file need to be synchronized, so it can be more efficient
over slow transmission lines. Personally, I don't usually benefit
from this feature because high-bandwidth paths are increasingly
common. As the following example shows, I am more concerned
with getting our hosts synchronized than with doing it in the
most efficient manner possible. If you are interested in learning
more about the rsync algorithm, a detailed description is provided
in the distribution.
Let's Do Some Syncing
I'll now walk through how to build a basic configuration
that can be expanded to support a multitude of hosts. The following
is an example of using ssh to transfer the files. You need
ssh (http://www.ssh.com) installed on both hosts,
or you can use rsh.
The central distribution point will be located on a host named
dev, and our front-end Web server will be on a host named
www1. The distribution root on dev will be located
at /usr/local/webroot, and the document root on www1
will be located at /usr/local/webroot as well.
The basic command to synchronize www1 to dev looks
like this:
dev# rsync -vrlHpog --delete --rsh=/usr/local/bin/ssh /usr/local/webroot/ www1:/usr/local/webroot/
Here is a breakdown of this command that shows what each part does:
- -v -- Run in verbose mode. Displays the files being
transferred, as well as statistics on how much data was written,
read, and how long it took.
- -r -- Recurse into directories.
- -l -- Preserve soft links.
- -H -- Preserve hard links.
- -p -- Preserve permissions.
- -o -- Preserve owner.
- -g -- Preserve group.
- --delete -- Delete any files on the destination host that
do not exist on the distribution host. Without this option, files
that were removed in a new revision of the content linger on the
front-end Web servers, which could conceivably have bad effects
on your application.
- --rsh=/usr/local/bin/ssh -- The path to ssh.
- /usr/local/webroot/ -- The local content source
directory.
- www1:/usr/local/webroot/ -- The remote host and the
document root on that host.
Another argument you may use often is --exclude. For example,
adding --exclude="*.log" or --exclude="*.old" would
exclude any file ending in .log or .old from being
pushed to the front-end Web servers. Log files or backups made while
on the development server are of little use when synchronized into
production. For a list of all the arguments to rsync, run rsync
without any arguments or check out the man page.
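To make the --exclude usage concrete, here is a minimal sketch that assembles the push command from the walkthrough with any number of exclude patterns and prints it for review. The build_push_cmd helper is a hypothetical name introduced for illustration, and the script only prints the command; it does not run rsync.

```shell
#!/bin/sh
# build_push_cmd is a hypothetical helper: print the rsync push command
# for one host, adding one --exclude option per pattern given.
build_push_cmd() {
    host=$1
    shift
    cmd="rsync -vrlHpog --delete --rsh=/usr/local/bin/ssh"
    for pattern in "$@"; do
        # Single quotes keep the shell from expanding the pattern
        # when the printed command is run later.
        cmd="$cmd --exclude='$pattern'"
    done
    printf '%s\n' "$cmd /usr/local/webroot/ $host:/usr/local/webroot/"
}

# Print the command for www1, skipping log files and old backups.
build_push_cmd www1 '*.log' '*.old'
```

Printing the command first gives you a chance to check the patterns before anything is pushed, which matters because --delete is part of the command.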
Sprucing It Up
Typing the command discussed above works well when you are dealing
with only a few front-end Web servers. Even then, it is always easier
to write a script to do it for you! I am always happier when I have
eliminated repetitious tasks.
Here is a basic script that gets the job done. A useful addition,
if you use RSA authentication in your ssh setup, is to add
support for ssh-agent so a passphrase only needs to be entered
once:
#!/usr/local/bin/perl
#
# a basic script utilizing rsync that will synchronize
# content to a number of front end servers.
#
# adamo@humboldt1.com 10/31/00
#
use strict;
use warnings;

#### DEFINE ####
# array of servers, add your hosts here.
my @servers = qw(www1 www2 www3 www4 www5 www6 www7 www8);
# distribution directory
my $distdir = "/usr/local/webroot/";
# destination directory
my $destdir = "/usr/local/webroot/";
#### END ####

foreach my $server (@servers) {
    print "Initiating content synchronization on $server.\n";
    # Pass the arguments as a list so that no shell is involved and
    # the command cannot be split at a line break.
    system("/usr/local/bin/rsync", "-vrlHpog", "--delete",
           "--rsh=/usr/local/bin/ssh", $distdir, "$server:$destdir");
    # $? holds the exit status of rsync; zero means success.
    if ($? == 0) {
        print "Content synchronization successful on $server.\n";
    } else {
        print "Content synchronization failed on $server.\n";
    }
}
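The ssh-agent addition mentioned before the script could be handled with a small wrapper instead of changes to the Perl code. This is a sketch under stated assumptions: it assumes an ssh-agent that accepts a command to run (as OpenSSH's does), and the SYNC_SCRIPT path, the sync_with_agent helper, and the RUN switch are hypothetical names introduced here. As written, it only prints the command.

```shell
#!/bin/sh
# SYNC_SCRIPT is a hypothetical path to the Perl script shown above.
SYNC_SCRIPT=${SYNC_SCRIPT:-./sync.pl}
# RUN is a prefix for the command. The default "echo" only prints it;
# set RUN="" to execute it for real.
RUN=${RUN-echo}

# sync_with_agent is a hypothetical helper: start a temporary agent,
# load the key with ssh-add (one passphrase prompt), then run the
# sync script. The agent exits when the wrapped command finishes.
sync_with_agent() {
    $RUN ssh-agent sh -c "ssh-add && exec $SYNC_SCRIPT"
}

sync_with_agent
```

Because the agent lives only as long as the wrapped command, no unlocked key is left behind once the push to the last server completes.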
Conclusion
This article covered a relatively painless way to keep the content
on your front-end Web servers synchronized. The same approach can be
expanded to synchronize content across a wide range of different
services. rsync's seamless integration with ssh and its ability
to mirror entire directory trees while keeping permissions and ownership
intact make it a good solution to the problem of content management.
Adam Olson has helped build a successful ISP (http://www.humboldt1.com),
designed and configured portions of the California Power Network
while working at MCI WorldCom, and is currently working for a startup
in Silicon Valley (http://www.quaartz.com).
Adam hopes to be sailing a lot soon. He can be contacted at: adamo@humboldt1.com.