Remote Machine Monitoring with Marvin
Alistair Young
I was asked to produce a monitoring application for the University
of the Highlands and Islands (UHI) Project, an organization consisting
of 13 academic partners, 2 associate institutions, and a directorate.
Our network connects the partner campuses and 70 learning and outreach
centers throughout the highlands and islands of Scotland. We had
found that, occasionally, machines would drop off the network for
reasons such as unannounced engineering work or just plain failure.
Because the WWW unit is responsible for hosting the partner Web
sites and providing access to the Internet for private block IP
machines, it was imperative to know when the Web server and proxy
machines were unavailable and also when the Webmail server was doing
something it shouldn't. Many staff members work from home over
the weekend, and Web access to their email is vital, so any solution
had to take this into account. Student record systems running on
Oracle databases were also pretty important, so I also needed a
way to ensure these machines were doing what they were supposed
to. In all cases, a technical person had to find out about the fault
before the users.
On the Oracle front, Perl was out: I wanted a robust system that
was simple both to implement and to use, and Perl required the entire
Oracle client on the monitoring server. Java didn't, as Oracle supplied
a type 4 JDBC driver. So, with a special user added to the Oracle
systems, I wrote the database monitoring in pure Java, simulating
in code what a user would do to extract information from the database.
If the process failed, then the machine wasn't functioning
and an alert could be produced.
With the other machines, the solution was on two fronts. Is it
serving? Is it responding? Therefore, I needed to check HTTP and
ping. I identified three types of machines:
1. Web server
2. Proxy
3. Router
Armed with this information, and the fact I wanted to use as much
open source as possible, I dived into the murky world of object-oriented
(OO) Perl, and Marvin was born. I named this application after the
intelligent robot in Douglas Adams's The Hitchhiker's Guide to the
Galaxy, who spent all day doing menial tasks such as talking
to doors.
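The two questions above ("Is it serving? Is it responding?") can be sketched in a few lines of Perl. This is only an illustration, not Marvin's code: the mapping of checks to machine types is an assumption, the function names are hypothetical, and HTTP::Tiny is swapped in for libwww-perl simply because it ships with modern Perl. Net::Ping's "tcp" mode is used because ICMP pings normally need root.

```perl
use strict;
use warnings;
use Net::Ping;
use HTTP::Tiny;

# Which checks apply to which machine type (an assumption, not from Marvin).
my %CHECKS = (
    server => [qw(http ping)],
    proxy  => [qw(http ping)],
    router => [qw(ping)],
);

sub checks_for {
    my ($type) = @_;
    return @{ $CHECKS{$type} || [] };
}

# "Is it responding?" Returns (ok flag, round-trip time in ms).
sub ping_check {
    my ($host, $timeout_ms) = @_;
    my $pinger = Net::Ping->new('tcp', $timeout_ms / 1000);
    $pinger->hires(1);                       # sub-second timing via Time::HiRes
    my ($ok, $seconds) = $pinger->ping($host);
    $pinger->close;
    return ($ok ? 1 : 0, sprintf('%.2f', ($seconds // 0) * 1000));
}

# "Is it serving?" Fetch a URL, optionally through a proxy URL.
sub http_check {
    my ($url, $timeout_s, $proxy) = @_;
    my %opts = (timeout => $timeout_s);
    $opts{proxy} = $proxy if defined $proxy;
    my $response = HTTP::Tiny->new(%opts)->get($url);
    return $response->{success} ? 1 : 0;
}
```

A proxy would be tested by passing its address as the third argument to http_check and fetching an outside URL through it; a router only gets the ping.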
Anatomy of an OO Perl Module
In Perl, a class is just a package, so declaring a simple Server
class is easy:
package Server;
{
    my $fields = { _name => "www.sysadmin.com" };   # Reference to an anon hash
    sub new { return bless $fields }                # Bless it into this package
}
1;
The last line reminds you that it is just a package after all. Class
members and methods sit inside the curly braces. This is OO Perl at
its simplest -- no encapsulation/inheritance or interesting stuff
like that. Calling new, you get a reference to an anonymous hash containing
the class members. You have to do the work of the compiler. While
reading about all this, two methods dropped out -- the "Flyweight
pattern" and "Secure hashes". Now, secure hashes give
you just about every OO feature you could want but at a price --
speed and efficiency go down the drain and anyway, it's Perl,
you shouldn't be paranoid. So, this led me to use the flyweight
method, which basically hides an object's data in a hash behind
a random key. You create the object's data, create a unique number,
store the data under that number, and bless a reference to the number.
What you then have is a reference to a hash key, whose data points
to the class you're after:
...
{   # Extract from Server.pm
    # Start of "class"
    my %_data;
    my %_fields = (
        ...
        # Public:
        type => undef,              # server/proxy/router
        name => undef,              # Description
        # Private:
        _watch_interval => undef,   # Watch machine every so many secs
        _last_watch     => undef,   # Epoch time of last watch
        ...
    );
    # Extract from the constructor
    sub new
    {
        my ($class, %args) = @_;
        # Get a reference to a copy of the allowed fields hash
        my $dataref = {%_fields};
        ...
        # Create a unique key to identify this object...
        $dataref->{_key} = rand
            until $dataref->{_key} && !exists $_data{$dataref->{_key}};
        # and store the object's data under it
        $_data{$dataref->{_key}} = $dataref;
        ...
        bless \$dataref->{_key}, $class;
    }
    # "Get" method
    sub get_name { return $_data{${$_[0]}}->{name} }
}   # end of "class"
...
The class contains the "_fields" hash, which lists all the
members and their defaults, and the enclosing braces mean that only
functions declared within can access this hash. Normally, you could
now bless a reference to a copy of this hash, but that would expose
the data to any caller, so instead a unique number is generated and
used as the key under which the copy is stored in the "_data"
hash. This unique number is what I bless. So, what you get back when
you call "new" on this class is a reference to an index into
a hash, which points to the hash storing the actual object data.
The "get_name" method looks horrendous but is actually:
1. $_[0] ==> the object, a blessed reference to the unique number
2. ${$_[0]} ==> dereference it to get the unique number generated
by the call to "new"
3. $_data{${$_[0]}}->{name} ==> get the "name" member from
this object's copy of the "_fields" hash
This is the only way to get this class member's data. It's
OO, sort of.
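Because the extract above elides parts of Server.pm, here is a minimal, self-contained sketch of the same flyweight idea that can be run as is. It is not the real module: the argument checking and the DESTROY method are my own additions, and the key is kept in a plain lexical rather than in a "_key" field.

```perl
use strict;
use warnings;

package Server;
{
    my %_data;                          # key => object data, visible only in this block
    my %_fields = (                     # allowed fields and their defaults
        type            => undef,
        name            => undef,
        _watch_interval => undef,
        _last_watch     => undef,
    );

    sub new {
        my ($class, %args) = @_;
        my $dataref = {%_fields};       # private copy of the defaults
        for my $field (keys %args) {
            die "Unknown field '$field'\n" unless exists $_fields{$field};
            $dataref->{$field} = $args{$field};
        }
        # Pick a random key that is not already in use
        my $key;
        do { $key = rand } until $key && !exists $_data{$key};
        $_data{$key} = $dataref;
        return bless \$key, $class;     # the object is only a reference to the key
    }

    sub get_name { my ($self) = @_; return $_data{$$self}{name} }

    # Free the hidden data when the object goes away (matters in a daemon)
    sub DESTROY { my ($self) = @_; delete $_data{$$self} }
}

package main;

my $server = Server->new(type => 'server', name => 'UHI Web Server');
print $server->get_name, "\n";          # prints "UHI Web Server"
print ref($server), "\n";               # prints "Server"
```

Code outside the braces holds nothing but a blessed scalar, so an attempt such as $server->{name} dies with "Not a HASH reference" instead of quietly reaching the data.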
The Daemon Principle
Rather than use a cron job, daemonizing the application provided
much more flexibility in polling machines. I wanted people to use
the Web front-end to add machines to the list and set a polling
time independent of other machines. If they required the polling
time increased for a short period, they could do it through the
front-end without affecting anything else. This just wasn't
possible with cron. So, I learned about daemons:
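use POSIX qw(setsid);

sub daemonise
{
    chdir '/' or die "Can't chdir to /: $!";                          # (1)
    umask 0;                                                          # (2)
    open STDIN, '<', '/dev/null' or die "Can't read /dev/null: $!";   # (3)
    open STDOUT, '>', $stdout_file or die "Can't write $stdout_file: $!";
    open STDERR, '>', $sterr_file or die "Can't write $sterr_file: $!";
    defined(my $pid = fork) or die "Can't fork: $!";                  # (4)
    exit if $pid;                                                     # (5)
    setsid or die "Can't start a new session: $!";                    # (6)
}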
1. The first thing to do is get out of any weird path, like a mount,
so the daemon doesn't keep that filesystem busy.
2. A umask of 0 means any files the daemon creates get exactly the
permission bits it asks for, rather than a mask inherited from whoever
started it.
3. Read STDIN from /dev/null and redirect STDOUT and STDERR to log
files. The whole point of the daemon is that it doesn't require a terminal.
4-5. Fork a new process and let the parent exit, leaving the child running.
6. Run the child in a new session to detach it from the terminal.
With the app running as a daemon, I could control it by modifying
its conf file and HUP'ing it.
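A rough sketch of how a HUP-controlled loop with independent polling times could look follows. It is an illustration under assumptions, not Marvin's actual loop: run_once, machines_due, and the load_conf and poll callbacks are hypothetical names, and the per-machine hash keys simply mirror the "_watch_interval" and "_last_watch" members shown earlier.

```perl
use strict;
use warnings;

my $reload_requested = 0;
# Keep the handler tiny: just note the request and act on it in the main loop.
$SIG{HUP} = sub { $reload_requested = 1 };

# Return the names of machines whose own polling interval has elapsed.
sub machines_due {
    my ($machines, $now) = @_;
    my @due = grep {
        $now - $machines->{$_}{last_watch} >= $machines->{$_}{watch_interval}
    } keys %$machines;
    return sort @due;
}

# One pass of the daemon loop: reload the conf if asked, then poll what is due.
sub run_once {
    my ($machines, $now, $load_conf, $poll) = @_;
    if ($reload_requested) {
        $reload_requested = 0;
        %$machines = %{ $load_conf->() };   # load_conf returns a fresh hash ref
    }
    for my $name (machines_due($machines, $now)) {
        $poll->($name);
        $machines->{$name}{last_watch} = $now;
    }
}

# In the daemon this would be driven by something like:
#   while (1) { run_once(\%machines, time, \&load_conf, \&poll); sleep 1 }
```

Because each machine carries its own interval and last-poll time, changing one machine's polling rate in the conf file and sending a HUP leaves every other machine's schedule alone, which is what cron could not offer.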
SMS
As I said previously, I wanted to use as much open source as possible
and this included gnokii, the Linux mobile phone driver. All it
requires is a Nokia phone plugged into the serial port, a phone
number, and a message. By wrapping this in a Perl module, Notify.pm,
and populating it with info from the conf file, I had instant access
to various technical bods at all hours of the day and night. To
wake Joe Bloggs at 3am when a router stopped responding was simplicity
itself:
echo "wake up! Your router is down" | gnokii --sendsms 0777745678
gnokii has numerous other functions, which I plan to exploit; see
Further Developments.
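The real Notify.pm is not shown here, so the following is only a guess at the shape of such a wrapper. It pipes the message into "gnokii --sendsms", exactly as the echo example does; the function names and the number check are hypothetical.

```perl
use strict;
use warnings;

# Build the gnokii command for one recipient. Returning a list means the
# command is run without a shell, so a bad number can't inject anything.
sub sms_command {
    my ($number) = @_;
    die "Bad phone number '$number'\n" unless $number =~ /^\+?\d+$/;
    return ('gnokii', '--sendsms', $number);
}

# Send one message to each number; returns how many sends exited cleanly.
sub send_sms {
    my ($message, @numbers) = @_;
    local $SIG{PIPE} = 'IGNORE';          # don't die if gnokii exits early
    my $sent = 0;
    for my $number (@numbers) {
        open(my $gnokii, '|-', sms_command($number)) or next;
        print {$gnokii} $message;
        $sent++ if close $gnokii;         # close is false on a non-zero exit
    }
    return $sent;
}
```

With the phone numbers taken from the conf file, waking Joe becomes send_sms("wake up! Your router is down", '0777745678').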
The conf File
I needed simplicity:
...
# Notification nicknames
# This identifies Joe Bloggs by the nickname "joe". He can be
# contacted via email and SMS (yes,yes) and his contact details follow.
notify,joe,Joe Bloggs,yes,yes,joe@work.com joe@home.com,0775644334
notify,jim,Jim Bloggs,yes,yes,jim@work.com jim@home.com,0775644334 0777745678
# URLs to test HTTP servers
# This defines a set of URLs named "test_urls"
urls,test_urls,www.bbc.co.uk www.annwiddecombemp.com www.cnn.com www.ed.ac.uk
# This identifies a machine to test. In this case it's a proxy
# server, port 8080 with a timeout of 100ms, doesn't implement
# sysbot (see later), has a polling interval of 600s and is tested
# using the "test_urls" set of test URLs. If something goes wrong,
# "joe" and "jim" want to know.
proxy,Squid2,squid2.smo.uhi.ac.uk,8080,100,nosysbot,600,test_urls,joe jim
...
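A format this simple needs very little code to read. The parser below is a sketch based only on the three line types in the extract; the names given to the fields, and the assumption that "server" and "router" lines share the layout of the "proxy" line, are mine rather than Marvin's.

```perl
use strict;
use warnings;

# Parse conf lines into notify contacts, URL sets, and machines.
sub parse_conf {
    my (@lines) = @_;
    my %conf = (notify => {}, urls => {}, machines => []);
    for my $line (@lines) {
        $line =~ s/^\s+|\s+$//g;                    # trim whitespace
        next if $line eq '' || $line =~ /^#/;       # skip blanks and comments
        my ($kind, @f) = split /,/, $line, -1;
        if ($kind eq 'notify') {
            my ($nick, $name, $email, $sms, $addresses, $phones) = @f;
            $conf{notify}{$nick} = {
                name      => $name,
                email     => (($email // '') eq 'yes' ? 1 : 0),
                sms       => (($sms   // '') eq 'yes' ? 1 : 0),
                addresses => [ split ' ', $addresses // '' ],
                phones    => [ split ' ', $phones    // '' ],
            };
        }
        elsif ($kind eq 'urls') {
            my ($set, $urls) = @f;
            $conf{urls}{$set} = [ split ' ', $urls // '' ];
        }
        elsif ($kind =~ /^(?:server|proxy|router)$/) {
            my ($name, $host, $port, $timeout, $sysbot, $interval, $set, $who) = @f;
            push @{ $conf{machines} }, {
                type           => $kind,
                name           => $name,
                host           => $host,
                port           => $port,
                timeout_ms     => $timeout,
                sysbot         => $sysbot,          # kept as written, e.g. "nosysbot"
                watch_interval => $interval,
                urls           => $set,
                notify         => [ split ' ', $who // '' ],
            };
        }
        else {
            warn "Ignoring unrecognised conf line: $line\n";
        }
    }
    return \%conf;
}
```

Feeding it the "proxy,Squid2,..." line above gives a machine record whose watch_interval is 600 and whose notify list is joe and jim.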
The Log File
A sample log line for an HTTP server being monitored:
Sun Sep 8 11:23:59 BST 2002,vader.uhi.ac.uk,ok,http://www.uhi.ac.uk,ok,3.80,100,,
To begin, you get the date/time of the poll, then the machine name,
the HTTP response status, and the URL used to test it. After that
comes the ping status. "ok" means it responded within the timeout set in
the conf file. The actual time it took in ms is next, followed by
the timeout in ms. The last two entries are for sysbot-supplied uptime
and load. As this server isn't running sysbot at the moment,
these are blank. If the server stops handing out HTTP, we can check
whether that server is up from the ping time. If both are not "ok",
then it's a safe bet the machine is down, unless a router near
it is also reporting not "ok", in which case it might just
be unreachable.
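The reasoning in that paragraph can be written down directly. This sketch splits a log line into named fields and applies the "both not ok" rule; the field names, the function names, and the wording of the verdicts are my own labels, and it assumes no comma ever appears inside a field.

```perl
use strict;
use warnings;

# Field order as it appears in the sample log line.
my @LOG_FIELDS = qw(when host http_status url ping_status
                    ping_ms timeout_ms uptime load);

sub parse_log_line {
    my ($line) = @_;
    chomp $line;
    my %rec;
    @rec{@LOG_FIELDS} = split /,/, $line, -1;   # -1 keeps the trailing blanks
    return \%rec;
}

# $router_ok says whether the nearby router is itself reporting "ok".
sub diagnose {
    my ($rec, $router_ok) = @_;
    my $http = $rec->{http_status} eq 'ok';
    my $ping = $rec->{ping_status} eq 'ok';
    return 'ok'                   if $http && $ping;
    return 'up, not serving'      if $ping;
    return 'serving, ping failed' if $http;
    return $router_ok ? 'down' : 'unreachable';
}
```

For the sample line both statuses are "ok", so the verdict is "ok"; if both were anything else, the verdict would be "down" or "unreachable" depending on the router.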
The Web Front-end
To get around the problem of tech bods sending me emails to add
machines to the list, I developed a Web front-end in PHP, which
displayed the current status of all machines Marvin knew about and
allowed technical staff to modify the configuration and add new
computers to the list. Status information was provided by Marvin
via a PHP associative array, which it generated on the fly and which
the PHP front-end included:
...
"UHI Web Server" => "server!!!Mon Sep 9 09:41:14 \
2002!!!ok!!!http://www.uhi.ac.uk!!!ok!!!3.84!!!!!!!!!",
...
The sample line above told PHP that the "UHI Web Server"
was last polled on Mon Sep 9, was serving ok, and was within the ping
timeout. Figure 1 shows an extract from the status page showing a
summary of machines. As long as all lights are green, no further drilling
down is required. Figure 2 shows an example of drilling down through
a particular machine to get its monitor details. Figure 3 shows configuration
details for a Web server.
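On the Perl side, producing one entry of that array is mostly a join. The sketch below reproduces the sample line from its fields; the function names are hypothetical, the escaping rules are my own addition, and since the meaning of the three trailing fields isn't spelled out here, they are simply passed through as empty strings.

```perl
use strict;
use warnings;

# Quote a string as a PHP double-quoted literal, escaping \ " and $.
sub php_quote {
    my ($text) = @_;
    $text =~ s/([\\"\$])/\\$1/g;
    return qq{"$text"};
}

# Build one '"name" => "field!!!field!!!...",' entry for the status array.
sub status_entry {
    my ($name, @fields) = @_;
    return php_quote($name) . ' => ' . php_quote(join('!!!', @fields)) . ',';
}

print status_entry('UHI Web Server', 'server', 'Mon Sep 9 09:41:14 2002',
                   'ok', 'http://www.uhi.ac.uk', 'ok', '3.84', '', '', ''), "\n";
```

The PHP front-end can then recover the fields with a single explode on "!!!".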
I'm also working on porting the original JSP code, which
produced a summary of problems for people looking at the main Web
site. This provided status information along the lines of "Web
server slow today" and other such information. With status
data collecting in the log files, I started to investigate graphing
methods to display it. Originally, I was going to use RRDtool,
but as the polling rates would be variable, it wasn't really
an option. Instead, I chose JpGraph PHP graphing classes because
I had already used these to build graphs of our proxy servers'
memory and CPU rates.
Further Developments
I'm working on enhancements such as user identification on
the front-end to only allow updates to those machines you have added.
Sysbot is coming along with authentication being built in and a
range of other machine variables being added. At the moment, it
monitors uptime and load, allowing me to tell when a machine
has been rebooted and whether the load on it was inordinately high
just before it went offline.
One of the major enhancements will be server control via mobile
phone. If your server goes down, you can reply to the SMS message
with a reboot command. As long as the sysbot port is open and the
server is reachable, the reboot signal will get through.
Usage Example
The application has proven its worth by alerting key personnel
when a fault has occurred (such as Webmail not serving) and they
could fix it before someone important noticed. The SMS function
proved to be a double-edged sword though. As an example, shortly
after the application went live, I drove the forty-odd miles to
Portree to MOT my car. On the way, a blanking plate blew off the
exhaust and I roared into town sounding like a jumbo jet. After
I dropped the car off at the garage, my phone went "ping"
and a message appeared from Marvin -- one of the proxies had
stopped serving! When I get round to implementing sysbot, I will
have the functionality to reboot that machine from Portree high
street. If nothing else, it would make for interesting conversation
in the baker's queue:
"White or brown loaf, dear?" "Er, could you excuse
me please, I'm just rebooting my proxy!"
References
Object Oriented Perl by Damian Conway. Manning, ISBN 1-884777-79-1.
Programming the Network with Perl by Paul Barry. Wiley
Computer Publishing, ISBN 0-471-48670-1.
JpGraph -- http://www.aditus.nu/jpgraph
Dependencies
HTML-Tagset-3.03
HTML-Parser-3.26
libwww-perl-5.65
Mail-Sendmail-0.77
URI-1.19
Time-HiRes-01.20
Net-Ping-2.19
Time-modules-101.062101
gnokii
Links
http://www.uhi.ac.uk
http://www.smo.uhi.ac.uk
http://www.suse.com
Alistair Young graduated with a BSc(Hons) in Physics & Microsystems
from the University of Abertay, Dundee. He spent five years at OKI
(UK) Ltd. writing Windows printer drivers in C/C++ and InstallShield
applications. Alistair joined the UHI Millennium Institute in 2001
as a senior software engineer, producing server monitoring software
and Web applications. He is now happily out of kernel mode and playing
around in Flash/PHP/Java and Perl. Alistair can be contacted at: sm00ay@groupwise.uhi.ac.uk.