Advanced
SNMP Monitoring with RRDTool
Adam Denenberg
In today's economy, monitoring devices via SNMP is crucial
to determining where bottlenecks exist and where we may have
over-provisioned capacity. Many network-monitoring tools have evolved
to handle SNMP data, but they have all been based on the work
of one tool -- MRTG. MRTG (the Multi Router Traffic Grapher) allows
you to visually graph and trend anything imaginable out of SNMP,
no matter what the data represents. MRTG has been around for about
a decade, and most systems or network engineers are probably graphing
something with it. While MRTG is an important tool, RRDTool (another
tool written by the author of MRTG), is going to be the main focus
of this article.
RRDTool is an underutilized tool that, in conjunction with MRTG,
can be used to present your SNMP data in some innovative ways. This
article assumes you have a working knowledge of MRTG, and will focus
on taking RRDTool and SNMP to the next level by using RRDTool's
advanced features and expanding the functionality of the SNMP package
net-snmp. I will illustrate how easy it can be to create custom
graphs of information exposed via SNMP (data not available out of the
box as an OID) simply by adding a few options to the SNMP daemon
configuration file. Most importantly, I will examine using RRDTool
to create custom arrangements of SNMP data to provide new insights
into our infrastructures.
RRDTool Quick Background
RRDTool (http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/)
is a redesign of MRTG's graphing and logging functionality; unlike
MRTG, however, it does not poll SNMP data itself. We will continue to use MRTG
to poll and gather data, but we'll use RRDTool to store and
graph the data.
RRDTool stores data in RRD (Round Robin Database) files, which are
created at their maximum size up front and provide fast access.
Reduced load is another big advantage of using RRDTool with
MRTG. MRTG creates its graphs every time it runs, which can be very
taxing on a server if you poll a few hundred nodes every five minutes.
When used with RRDTool, however, the graphs are only created when
requested and not every time MRTG is run, which drastically reduces
the server load and is a necessity for anyone running a large number
of nodes under MRTG. Fortunately, migrating to RRDTool-style databases
on an existing MRTG system is seamless, since MRTG has existing
hooks built into its code to use RRDTool.
Once RRDTool is installed, add the following three lines (globally)
to your MRTG config file (assuming RRDTool is installed in /usr/local/rrdtool):
LogFormat: rrdtool
PathAdd: /usr/local/rrdtool/bin
LibAdd: /usr/local/rrdtool/lib/perl
The next time MRTG is run with this config file, it will see the LogFormat
statement, convert all the existing data in your .log files from MRTG,
and populate the new RRD file so no data is lost. The new RRD file
will be in the same location (WorkDir) as your .log file, but will
have a new extension -- .rrd.
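To confirm that the conversion worked, you can inspect a converted
database with RRDTool's info command, which dumps the file's structure
and last update time (the path below is hypothetical; substitute your
own WorkDir and file name):
# /usr/local/rrdtool/bin/rrdtool info /home/mrtg/workdir/server1_cpu.rrd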
The power of RRDTool is its ability to take the existing data
stored in the database and manipulate it in new, complex ways. RRDTool
includes a utility called rrdcgi, a script interpreter
that reads special RRD tags embedded in a cgi file and presents
your RRD data in a robust, graphical format. The parameters
to these special RRD tags are the same parameters you would pass to
the command rrdtool graph, which you will need to know in order
to create your own graphs.
Installing SNMP
I will focus on using net-snmp 5.0.7 on a Red Hat 8 Linux box
for the server side of things. (I have used net-snmp on most variants
of Unix and Linux without problems, but Red Hat 8 was the most available
to me.) Net-snmp has many configurable options (such as compiling
in additional MIB modules, which must be done at configure time),
but the default configuration will work just fine for this article:
# tar xvfz net-snmp-5.0.7.tar.gz
# cd net-snmp-5.0.7
# ./configure
# make
# make install
This will put the snmp tools in /usr/local/bin by default, the daemons
in /usr/local/sbin, and the configs in /usr/local/share/snmp. If you
do not like where these files are located, you can change your configure
statement to place them in the directory of your choice, but the defaults
are fine for now.
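For example, to keep the entire installation under one tree, you
could use autoconf's standard prefix option (the path shown is just
an illustration):
# ./configure --prefix=/opt/net-snmp
# make
# make install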
After installing net-snmp, we'll set up the snmpd.conf file
to make our data available to MRTG and RRDTool. Minimally, the snmpd.conf
should contain one line -- the community string -- which
can be specified as follows:
rocommunity mycommunity
This tells the snmp daemon to service read-only queries from requests
that use the proper community string, which here is mycommunity.
Whatever string you choose, avoid "public": it is the well-known
default community on many devices and, therefore, insecure.
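net-snmp also lets you restrict a community to a source host or
network, which is a worthwhile hardening step; a sketch, with a
hypothetical management network:
# Answer queries for "mycommunity" only from 192.168.1.0/24
rocommunity mycommunity 192.168.1.0/24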
Now that the daemon has the proper community string, we'll
add some additional functionality to SNMP. Most of this data will
be coming from the UCDAVIS MIB, so look at the file /usr/local/share/snmp/mibs/UCD-SNMP-MIB.txt
to see what is available. Objects like memory, CPU, and load are
already being tracked in the MIB, but the flexibility of net-snmp
allows us to expand the ucdavis MIB to include important data not
being recorded by default. We'll add three lines to our snmpd.conf
to track new data:
proc /usr/local/apache/bin/httpd
disk /var
exec TCP_CUSTOM /usr/local/sbin/tcp.sh ESTABLISHED
The first line, "proc /usr/local/apache/bin/httpd", contains
the net-snmp keyword "proc", short for process. It lets us track
the number of running processes that match the name we specify.
The name you place here needs to match the exact output of
ps -e (on Linux and Solaris), or ps acx on BSD, so be careful.
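The proc keyword also accepts optional MAX and MIN arguments, and
the MIB's prErrorFlag is raised when the process count falls outside
that range. A sketch, with hypothetical thresholds:
# Flag an error if more than 200 or fewer than 5 httpd processes run
proc /usr/local/apache/bin/httpd 200 5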
The next line, "disk /var", uses the net-snmp keyword
disk, which allows us to track the statistics of a partition we
specify (in this case, /var). Specifying the disk /var command
populates the MIB with data to track free disk space, total space
used, and percentage used.
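The disk keyword likewise takes an optional minimum free space (in
kilobytes, or as a percentage with a trailing %), below which
dskErrorFlag is set. For example, assuming we want a warning below
roughly 100 MB free:
# Set dskErrorFlag for /var when less than 100000 kB remain free
disk /var 100000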
The third line, containing the "exec" statement, is
a unique command in net-snmp that lets you create your own OIDs.
The exec command takes three arguments -- NAME, SCRIPT,
and, optionally, ARGUMENTS. Placing this line in the snmpd.conf file
tells snmpd to run a script it knows as TCP_CUSTOM (the name is
arbitrary), calling the script /usr/local/sbin/tcp.sh
for its data and passing it the argument ESTABLISHED. The contents of
this script are simply:
#!/bin/sh
# Count the netstat entries matching the state passed in ($1)
netstat -an | grep "$1" | wc -l
This script greps all ESTABLISHED connections out of netstat -an
and counts them with the wc command. As shown, we can
run any script to do just about anything, as long as it returns a
number to snmpd. That number then becomes a value that we can graph,
trend, and use to analyze what our systems are doing.
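For instance, a similar hypothetical script (the USER_COUNT label
and path are arbitrary) could report the number of users currently
logged in:
exec USER_COUNT /usr/local/sbin/users.sh

#!/bin/sh
# users.sh -- count currently logged-in sessions
who | wc -l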
Now that we've configured our snmpd.conf file, let's
fire up our snmp daemon and confirm that our data is there. Simply
run the command:
# /usr/local/sbin/snmpd
and the snmp daemon will be started with all our config options.
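While testing, it can also help to run the daemon in the foreground
and point it explicitly at your configuration file, using snmpd's
-f (don't fork) and -c (config file) flags:
# /usr/local/sbin/snmpd -f -c /usr/local/share/snmp/snmpd.conf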
Next, we'll verify that our custom data is being properly
tracked by net-snmp. To do so, we'll use the snmpwalk tool
and run it against our localhost system. Snmpwalk can be used to
walk the MIB tree of any host or device running snmp. We'll
walk only the ucdavis part of our tree for now, since that is where
all our data is stored. Issue the command:
# /usr/local/bin/snmpwalk -c mycommunity localhost ucdavis
to walk the ucdavis tree of our host. This will reveal all the objects
being stored by the ucdavis MIB. The lines that will contain the data
we put in the config will look like the following. For our proc line,
look for:
UCD-SNMP-MIB::prNames.1 = STRING: /usr/local/apache/bin/httpd
UCD-SNMP-MIB::prCount.1 = INTEGER: 190
This uses the proc part of the MIB and tells us that 190 httpd processes
are running, which can be extremely useful information for a Web server
farm. For the "disk" entry, look for these lines:
UCD-SNMP-MIB::dskPath.1 = STRING: /var
UCD-SNMP-MIB::dskDevice.1 = STRING: /dev/sda3
UCD-SNMP-MIB::dskTotal.1 = INTEGER: 198399
UCD-SNMP-MIB::dskAvail.1 = INTEGER: 146029
UCD-SNMP-MIB::dskUsed.1 = INTEGER: 36499
UCD-SNMP-MIB::dskPercent.1 = INTEGER: 20
These lines tell us that our /var partition, which is device /dev/sda3,
is 198399K in size, of which we are using 36499K, 146029K is available,
and we are using 20% of total space. Any combination of these numbers
can be useful in trending disk space usage. Our third line, containing
the "exec" statement will have these lines in the snmpwalk
output:
UCD-SNMP-MIB::extIndex.1 = INTEGER: 1
UCD-SNMP-MIB::extNames.1 = STRING: TCP_CUSTOM
UCD-SNMP-MIB::extCommand.1 = STRING: /usr/local/sbin/tcp.sh
UCD-SNMP-MIB::extResult.1 = INTEGER: 0
UCD-SNMP-MIB::extOutput.1 = STRING: 65
Finally, we see that our custom script exited without error (the
extResult line shows 0) and produced an output of 65, meaning 65 established
connections. As you can see, running a script to return data that
we can monitor via snmp is straightforward and can be quite powerful
in our monitoring application.
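Once you know where a value lives, you can also fetch it individually
with snmpget rather than walking the whole tree (here requesting SNMP
version 2c explicitly):
# /usr/local/bin/snmpget -v 2c -c mycommunity localhost UCD-SNMP-MIB::extOutput.1
UCD-SNMP-MIB::extOutput.1 = STRING: 65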
Our last step is to convert these values into actual OIDs that
we can place in the MRTG config file so we can pull them out via
snmp. We'll use snmptranslate to convert the name into
an OID. If, for example, we wanted to start tracking free disk space
on our /var partition, we can execute the following command:
# snmptranslate -IR -On dskAvail.1
.1.3.6.1.4.1.2021.9.1.7.1
snmptranslate has told us that the OID of available disk space
on our /var partition is .1.3.6.1.4.1.2021.9.1.7.1. Anything under
.1.3.6.1.4.1.2021 is the ucdavis MIB, so when you see that prefix,
you know it is configured under the ucdavis section of net-snmp. Issue
the snmptranslate for any of our custom-configured commands
and you will be able to instantly start graphing and trending all
your system's important data through MRTG and RRDTool.
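As a quick illustration, here is a minimal sketch of an MRTG target
that graphs free space on /var using the OID we just translated. The
target name and hostname are hypothetical; MRTG expects two OIDs per
target, so we simply repeat the same one, and we mark the value as a
gauge since it is not a counter:
Target[server1_var_free]: \
 .1.3.6.1.4.1.2021.9.1.7.1&.1.3.6.1.4.1.2021.9.1.7.1:mycommunity@server1
MaxBytes[server1_var_free]: 198399
Title[server1_var_free]: Free space on /var (server1)
Options[server1_var_free]: gauge, nopercent, growright
YLegend[server1_var_free]: Kilobytes free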
RRDTool -- The Good Stuff
As I mentioned, RRDTool is fast and will drastically reduce the
load on systems running MRTG, since it does not generate every graph
every time MRTG runs. Graphs are created only when they are requested,
and regeneration can be skipped when the data has not changed. However,
we have yet to tap into the real power of RRDTool -- getting
at data not previously possible with MRTG.
Assume you have MRTG and RRDTool set up, and you are monitoring
httpd processes, CPU, load, and memory on all 30 machines in your
Web server farm. All the data and graphs look good and the information
is helpful in watching the load and activity of individual servers.
What if someone asked you for the average CPU usage of your Web
server farm? Without better tools, the only way to answer is to click
through 30 graphs, add up the numbers, and divide by 30. This is where
RRDTool saves the day: we'll use it to provide the information
we need, and more.
We can use RRDTool to turn this request into a numerical reality
-- take all the nodes in the Web cluster, grab the data, and
divide by the number of nodes in the cluster. If a node does not
have data (because the server failed or is down), we remove it from the
average so it does not skew the results, thus providing a true "average"
of the data for all the nodes in our cluster. To do this, we'll
use one of RRDTool's special functions, the CDEF. A CDEF
takes existing data, manipulates it, and provides
a new value to use in graphs. (If you are new to RRDTool, review
the tutorials on the RRDTool site.)
Now that we know what kind of data we are looking for, let's
see whether we can get the cluster-wide average CPU usage for our Web
server farm. The following example would be placed in a cgi file
interpreted by rrdcgi, RRDTool's script interpreter, which lets
you build Web pages and graphs from your RRD files. rrdcgi
reads special tags whose parameters are the same ones you would send
directly to rrdgraph. For this example, assume three
servers in our cluster (instead of 30):
#! /usr/local/rrdtool/bin/rrdcgi
<html><body>
<RRD::GRAPH /www/server_cluster/server_cluster_cpu_daily.png \
--lazy \
--imginfo \
'<IMG SRC=/www/server_cluster/%s WIDTH=%lu HEIGHT=%lu>' \
-a PNG \
-s '-1d' \
--width '600' --height '150' \
--title 'CPU Usage for servercluster' \
'DEF:A01=/www/server1/server1_cpu.rrd:ds0:AVERAGE' \
'DEF:B01=/www/server1/server1_cpu.rrd:ds1:AVERAGE' \
'DEF:A02=/www/server2/server2_cpu.rrd:ds0:AVERAGE' \
'DEF:B02=/www/server2/server2_cpu.rrd:ds1:AVERAGE' \
'DEF:A03=/www/server3/server3_cpu.rrd:ds0:AVERAGE' \
'DEF:B03=/www/server3/server3_cpu.rrd:ds1:AVERAGE' \
CDEF:Anodes=A01,UN,0,1,IF,A02,UN,0,1,IF,A03,UN,0,1,IF,+,+ \
CDEF:Bnodes=B01,UN,0,1,IF,B02,UN,0,1,IF,B03,UN,0,1,IF,+,+ \
CDEF:C=A01,UN,0,A01,IF,Anodes,/,A02,UN,0,A02,IF,Anodes,/, \
A03,UN,0,A03,IF,Anodes,/,+,+ \
CDEF:D=B01,UN,0,B01,IF,Bnodes,/,B02,UN,0,B02,IF,Bnodes,/, \
B03,UN,0,B03,IF,Bnodes,/,+,+ \
-v 'CPU %' \
'AREA:C#00FF00:User Cpu\j' \
'GPRINT:C:MAX:Max User CPU\: %5.3lf %s' \
'GPRINT:C:AVERAGE: Avg User CPU\: %5.3lf %s' \
'GPRINT:C:LAST: Current User CPU\: %5.3lf %s' \
'COMMENT:\n' 'COMMENT:\n' \
'LINE1:D#0000FF:Sys Cpu\j' \
'GPRINT:D:MAX:Max System CPU\: %5.3lf %s' \
'GPRINT:D:AVERAGE: Avg System CPU\: %5.3lf %s' \
'GPRINT:D:LAST: Current System CPU\: %5.3lf %s' \
>
</body></html>
This sample statement, placed in a cgi file, displays
the last 24 hours of average CPU usage across our three Web
servers. Let's break down the <RRD::GRAPH> block and see how
it lets us see the cluster average for CPU usage.
The first line of our block must point to our image file, which
in this case is server_cluster_cpu_daily.png. The next
option (--lazy) is great for reducing server load, since it
tells rrdcgi not to regenerate a graph unless the graph is
older than the rrd data; if the data hasn't changed,
it will not re-create the graph. The --imginfo option allows
us to use a custom <img> tag when the html is printed (in
case you want to point at an image outside of your docroot). The
-a PNG option tells rrdcgi to use PNG format instead of GIF
or GD. The -s option specifies the start of the data
retrieval. Normally, this value is given in seconds since the epoch,
but RRDTool also accepts AT-STYLE TIME SPECIFICATION, like our
example of -1d for one day, which tells RRDTool to grab data
starting exactly 24 hours (1 day) back from
the current point. The --width and --height arguments
simply allow us to specify the size of our graph.
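Other AT-style start specifications work the same way; for example:
-s '-1w'           # one week of data
-s 'end-1month'    # one month back from the graph's end time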
Now for the data definitions, which are the most crucial part.
Assume each Web server has the following target line in its mrtg
config file for CPU usage:
Target[serverX_cpu]: \
.1.3.6.1.4.1.2021.11.50.0&.1.3.6.1.4.1.2021.11.52.0:mycommunity@serverX
These are the two OIDs out of SNMP for USER CPU and SYSTEM CPU running
on a standard Unix machine. This tells MRTG that we want to graph
USER CPU vs. SYSTEM CPU as our two data sources. MRTG expects two
values for a target, and those in turn become ds0 and ds1 when converted
to RRD format. Each DEF statement identifies a special string with
one of the RRD values that we are storing (in this case, SYS and USER
CPU usage). For example, the line DEF:A01=/www/server1/server1_cpu.rrd:ds0:AVERAGE
says to define the string A01 as the value of data source ds0 from
our RRD database, which in turn maps to the value of the User CPU,
using the AVERAGE function. The same concept goes for B02, for example,
which maps the string B02 to the SYSTEM CPU of our server server2,
using the data source ds1. Remember ds0 maps to USER CPU and ds1 maps
to SYSTEM CPU for each RRD file, since that is how we defined our
Target line in MRTG. We can now use these values (A01, B01, etc.)
as actual representations of data later on in our RRD definition.
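In general, every DEF follows this pattern, where the consolidation
function CF is one of AVERAGE, MIN, MAX, or LAST:
DEF:<vname>=<rrdfile>:<ds-name>:<CF>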
Now that we have stored the individual values of CPU data for
each individual server, we need to create our average functions.
Remember, we first must make sure that each node
actually has data, since a failed node with no data can skew our
results. So in our first CDEF, we determine the number of nodes
that actually have data, and do not return a null value (which RRD
refers to as UNDEF). Any formula that has UNDEF as one of its values
will result in UNDEF and the graph will show a value labeled as
NaN (not a number). Thus, if we mistakenly use a failed node in
our calculations that returns UNDEF, our graph will not have any
data since the result will be UNDEF. The following CDEF will result
in the true number of nodes that have data we can average, disregarding
nodes with no data:
CDEF:Anodes=A01,UN,0,1,IF,A02,UN,0,1,IF,A03,UN,0,1,IF,+,+
This CDEF uses RPN expressions to calculate its data. (If you are
not familiar with how RPN expressions work, the tutorial on the RRDTool
Web site explains it well.) This expression means that if A01 is undefined,
assign it the value 0, or else assign it a 1. We then write the same
expression for each node -- one node after another all the way
through A03. Then we follow our three expressions by two "+"
signs, which means take the first two results of our IF statements,
add them, then take that result and add it to our third value. The
result of this expression is the number of nodes we have in our Web
server farm that returned a value. This will be used later to calculate
our average.
Now that we have the number of nodes in our Web server farm, we
can calculate the average CPU usage. We do this by using the following
CDEF in our cgi file:
CDEF:C=A01,UN,0,A01,IF,Anodes,/,A02,UN,0,A02,IF,Anodes,/,A03,UN,0,A03,IF,Anodes,/,+,+
This CDEF function takes each node, checks its value (we do this again
because we need to turn UNDEF values into 0 to prevent our result
being NaN), divides by the number of nodes in our cluster that we
calculated previously, and then adds all those values together. If
we had server1, server2, and server3 with USER CPU values 80, 60,
and UNDEF, substituting these values in our function would yield the
following results. Anodes would become the value 2 since the first
two IF statements result in the value 1, and the last IF statement
turns the UNDEF into 0. The value of C (our user average) would then
equate to 80/2 + 60/2 + 0/2 = 40 + 30 + 0 = 70. This tells us that the average CPU
usage on all our nodes is 70%, which is correct. Imagine how much
time you would save if you had 250 servers to monitor. This is just
one example of how easy it can be to extend your existing data being
gathered by MRTG, SNMP, and RRDTool and generate incredibly useful
trending information.
The remaining lines, containing AREA, LINE, GPRINT, and COMMENT,
put the actual data into the graph. AREA and LINE are
two graph styles we can choose between, depending on how we want our
graphs to look; each just needs to be assigned a color
and told which variable to reference. In this case, we want our
USER CPU usage to show up as an AREA graph, and our SYSTEM CPU usage
to show up as a LINE graph. The GPRINT statements print
some useful numeric information underneath the plotted data,
embedded inside the graph PNG itself.
Conclusion
I hope this article has shown you new ways to collect important
data from your server clusters through net-snmp, as well as how
to use the power of RRDTool to monitor how your clusters are behaving.
The more time you spend with RRDTool, the more you will appreciate
its power to present important information in easy-to-read graphs,
which can save time and take some of the mystery out of your infrastructure.
Adam Denenberg has been a Unix Engineer since 1998, working
mainly at Internet and new media companies with a focus
on high-level Unix systems engineering. He has a BS in Computer
Science with a minor in physics from the State University of New York
at Albany, and has been working with Unix since about 1994.