NetWorker
Savegroup Summarizer -- A Legato NetWorker Reporting Tool
John Stoffel
NetWorker Savegroup Summarizer (NSS) came about when we replaced
our old four-drive, 8mm (5-GB capacity!) tape jukebox with a new
tape jukebox holding two DLT7000 tape drives, driven by a dual-CPU
Sun Ultra 2 server. At the time, we were running Legato's NetWorker
Backup software version 4.2.x to back up about 100 clients of mixed
types, mostly UNIX with some PC and NetWare clients. I wanted to
get a better idea of exactly how much data we were backing up, how
long backups took to complete, and where the performance bottlenecks
were. I also wanted to see how much real improvement we were getting
in our backup performance after the upgrade to the new hardware.
Legato NetWorker is a powerful tool for managing backups, but
it has some frustrating limitations when it comes to reporting useful
statistics from a completed backup. The 5.x version was especially
limited in the reports you could get from the base product.
See Example 1 for a sample savegroup notification report. At the
time this program was first started, the NetWorker Management Console
product was not available, and the GEMS reporting tool was an expensive
add-on. (See the sidebar for an overview of Legato.)
After looking around and asking on the NetWorker mailing list
(see Resources), I ended up writing my own code as presented here.
The goals of the code were as follows:
1. Email the administrator a concise daily report detailing overall
status of the backups completed each night.
2. Report statistics on how much data was backed up and how long
it took.
3. Report on the largest saveset backup(s) in terms of the amount
of data, as well as the saveset backup(s) that took the most time,
since these are not always the same.
4. Keep the report in plain ASCII format, 80 columns maximum width,
so that reports are useful even on dumb terminal displays.
See Example 2 for a sample report generated by NSS.
Problem Description
The initial implementation of NSS simply took the raw information
stored in the /nsr/logs/messages log file and parsed out the needed
information. This worked at first: I could look for the lines that
signified the start of a savegroup report, then look for the matching
end of the report, and then post-process the data to make it more
presentable.
This process broke, however, once we had multiple savegroups, with
different clients, running at different times and overlapping in
their start/end times. Since the messages file was just one ever-growing
file, it was impossible to tell which savegroup a line of output
belonged to without knowing lots of internal details of each savegroup.
Because I didn't want to query NetWorker's internal database, and
because all the information I needed was in the standard savegroup
notification, I wrote the savegroup notifications to a file and
ran them through NSS by hand. This was automated fairly quickly
with a procmail recipe.
Over time, NSS has been expanded with extra features, such as
a tape report showing how many tapes in each tape pool are full,
partially full, or empty. This is a useful check to ensure that
you have enough tapes for the next night's run. The limitation here
is that NSS must call the mminfo command, which may or may
not be available on the system running the report. Another useful
feature is the ability to save easily parsed summaries of the backups
for each savegroup, recording such details as:
- Start and end time
- Time taken to save
- Number of clients
- Number of failed clients
- Size of the backup in kilobytes
- Backup level
The server and savegroup names are implicit in the default directory
structure and filenames used. See the online usage of NSS for details,
and Example 3 for a sample log file.
The eventual goal is to write a plotting program (nss-plot; see
Resources) to graph the output in a nice overview format. This project,
however, has been moving slowly due to time constraints and the
limitations of the various plotting tools.
Breakdown of NSS and How It Works
The first step of the script is parsing the various command-line
options. All settings and actions can be changed via command-line
switches, so you can embed the script inside another script if need
be.
State Machines
The core stage is to process STDIN, scanning the input for the
start and end markers that delimit the savegroup report. This
is really the heart of the code, and it can be thought of as a simple
finite state machine. Any good book on computer theory and programming
will describe state machines in more detail, but one definition
is:
- An initial state
- A set of possible inputs
- A set of new states that may result from the input
- A set of possible actions or output events that result from
a new state
This is a very useful way of thinking when you are trying to write
a program (which is just a fancy state machine) to parse a set of
data and turn it into a more useful representation. Humans are very
good at pattern matching, but computers are not -- or, more accurately,
they are poor at handling the exceptions in pattern matching, as
anyone learns when trying to write a regular expression that covers
all the myriad possibilities just to pull out one piece of data.
This is where the power of Perl comes in: we need regular expressions
both to find the markers that determine which state we are in and
to pull out the needed data inside each section. Luckily, Legato
has kept the format of the savegroup reports quite static over three
major versions of the software.
Another useful feature of Perl 5.x is its support for complex
data structures, which lets us use hashes of hashes to store the
various bits of information pulled from the savegroup report. This
article cannot go into complex data structures in any great depth,
so I recommend the Perl documentation, especially the perldsc
and perlreftut man pages, or the O'Reilly & Associates
books Programming Perl (3rd Edition) by Larry Wall, Tom Christiansen,
and Jon Orwant and Advanced Perl Programming by Sriram Srinivasan.
Breaking the problem down into smaller steps is just one technique
for making it more manageable. In general, when processing an arbitrary
stream of input looking for data, the following states are possible
as you scan through the input stream.
- I have not found the data I need.
- I have found the data I need.
- I have found a different set of data I need.
- I am at the end of the data I need.
- No more data to read.
Transitions are how you move from one state to another; they are
really just how you tell the computer what to do with the next piece
of input.
When parsing input, you can choose the chunk size: characters,
words, lines, paragraphs, etc. It can be tricky to determine which
state to be in (i.e., what to do with the input) when a match spans
multiple chunks. In such situations, it makes sense to break the
span into multiple states: when you find Chunk A, you know to look
for Chunk B; if you fail, you reset to the state you were in before
you found Chunk A, or simply go back to a default state.
For example, say you are reading input one line at a time, but
the marker you need to match is split across two lines, the pieces
being "foo" and "bar", with the end marker being "END" or EOF.
Some basic Perl code to handle this can be seen in Figure 3 and
in the sketch below. Note that while we are reading input, the
variable $state keeps track of the state we are in, which
tells us what we are expecting as input. This lets us skip the
processing for states that don't apply, or that might cause problems
when we can't tell what to do with the input without knowing where
we are in the processing. Also notice how we jump out of the
various "if-then" constructs and back up to the top of the loop
to read another line of input.
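Here is a minimal version of that loop (my own sketch, with placeholder
marker strings and state names, not the actual Figure 3 listing):

#!/usr/bin/perl -w
# Minimal two-line-marker state machine: "foo" must be followed by
# "bar" on the next line; "END" (or EOF) ends the processing.
use strict;

my $state = 'looking';

while (my $line = <STDIN>) {
    chomp $line;

    if ($state eq 'looking') {
        $state = 'found_foo' if $line =~ /foo/;
        next;                      # back to the top for more input
    }

    if ($state eq 'found_foo') {
        # "bar" must be on the very next line, or we reset.
        $state = ($line =~ /bar/) ? 'inside' : 'looking';
        next;
    }

    if ($state eq 'inside') {
        last if $line =~ /^END$/;  # explicit end marker
        print "data: $line\n";     # handle a line between the markers
        next;
    }
}
# Running out of input (EOF) ends the loop naturally.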
Parsing Input in Perl
Using this type of programming, we can process almost any type
of input. When parsing a NetWorker savegroup report, there are four
possible states the input can be in:
1. We haven't found the start of the savegroup report.
2. We've found the header and parsed the info from it.
3. We're done with the header but now we have "Unsuccessful Save
Sets" to read in and process.
4. We've found the "Successful Save Sets" marker and we're processing
them.
In state one, we are looking for the marker that will drop us
into state two, where we will process the actual savegroup. This
lets us skip mail headers or other extraneous information.
In state two, we have matched the regular expression that tells
us we have found the start of the savegroup report. The code currently
supports both Legato NetWorker and "Solstice Backup", which is Sun's
name for their OEM'd version from Legato.
In this state, we pull out information on the number of clients,
the savegroup name, the start (or restart) time, and the end time.
If we're lucky, we may also find the name of the backup server.
This is one of the more frustrating limitations of the savegroup
report format: it never explicitly states the name of the backup
server anywhere.
State three helps us figure out the name of the backup server.
Because we know that all indexes are saved on the server, we can
look for lines that mention "index save" and pull the server name
from there. State three may or may not be entered, depending on
the savegroup report and whether there were any failures.
The main work is handled in the fourth state, which is where we
process the individual client saveset (think filesystems) reports.
Again, we try to determine which host is really the server. Note
that when the server is finished writing all of a client's savesets
to tape, it will write the client index on the server to tape as
well, so we also look for that information. As each client's information
is read in, we find: client name, saveset (directory), level, total
amount of data written, the scale used for this measurement (e.g.,
kilo-, mega-, giga- or terabytes), the time it took to write the
data, and the number of files saved.
There is no finish state. I assume that the input will continue
to match in the fourth state; any input that doesn't match the
NetWorker report format is simply skipped, so processing continues
until EOF.
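In skeleton form, with simplified stand-ins for the real regular
expressions NSS uses, the four states look something like this:

my $state = 1;    # state one: haven't found the report yet

while (<STDIN>) {
    if ($state == 1) {
        # Look for the start of the savegroup report; NSS also
        # accepts the "Solstice Backup" variant here.
        $state = 2 if /NetWorker savegroup:/;
        next;
    }
    if ($state == 2) {
        # Header: number of clients, group name, start/end times.
        if (/Unsuccessful Save Sets/) { $state = 3; next; }
        if (/Successful Save Sets/)   { $state = 4; next; }
        # ... parse the header fields here ...
        next;
    }
    if ($state == 3) {
        # Failed savesets; "index save" lines also reveal the server.
        if (/Successful Save Sets/) { $state = 4; next; }
        # ... record the failures here ...
        next;
    }
    if ($state == 4) {
        # Per-client saveset lines; anything that doesn't match the
        # report format is skipped, so we simply run until EOF.
        # ... parse client, saveset, level, size, time, files ...
        next;
    }
}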
One key step during the processing of client saveset reports in
state four is to take the amount of data saved and scale it into
kilobytes, so that all sizes are consistent. This simplifies the
later processing and printing of reports. (Note that I use the
standard 1024 bytes in a kilobyte, not the marketing-driven
version that uses 1000 bytes in a kilobyte.)
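The scaling itself amounts to a small lookup table. In this sketch,
the hash and subroutine names are my own, and the size and its unit
letter are assumed to have already been captured from the report line:

# Multipliers to bring every size back to kilobytes (1 KB = 1024 bytes).
my %to_kb = (
    KB => 1,
    MB => 1024,
    GB => 1024 * 1024,
    TB => 1024 * 1024 * 1024,
);

sub scale_to_kb {
    my ($size, $unit) = @_;        # e.g., (1.5, 'GB')
    return $size * $to_kb{ uc $unit };
}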
Perl's ability to handle complex data structures is also a key
element here, because it lets us store the parsed data in a hash
of hashes of hashes. This could have been done using split()
and join() to build long keys to hold the information, or by
using multiple parallel hashes, one for each piece of info, but a
complex data structure keeps all the data together in one place.
The basic data structure is as follows:
Client --> Saveset Name --> Level
                        --> Total Data Written
                        --> Time to Write Data
                        --> Number of Files Written
Each level of the above structure is a hash, pointing to one or
more sub-hashes as needed. The first level is the name of the client,
and since each client can have multiple savesets, that leads to
the second level of the hash. At the third level, we could have
used a fixed array to hold the information, but continuing the
use of hashes serves two purposes. One, it's self-documenting --
no need to remember that index 2 is the number of files written
by the client in that particular saveset. Two, the sorting and
report-generation functions are simpler and more consistent, since
the entire data structure is just hashes.
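As an illustration (the hash and key names here are mine, not
necessarily those used inside NSS), storing one parsed saveset and
walking the structure looks like this:

#!/usr/bin/perl -w
use strict;

my %saves;

# Store one parsed saveset; the sample values stand in for the
# regular-expression captures made in state four.
my ($client, $saveset) = ('hostA', '/usr');
$saves{$client}{$saveset}{level} = 'full';
$saves{$client}{$saveset}{size}  = 1_048_576;   # already scaled to KB
$saves{$client}{$saveset}{time}  = '01:23:45';
$saves{$client}{$saveset}{files} = 42_000;

# Walking the structure later is just nested loops over the keys.
foreach my $c (sort keys %saves) {
    foreach my $ss (sort keys %{ $saves{$c} }) {
        printf "%s:%s (%s) wrote %d KB\n",
            $c, $ss, $saves{$c}{$ss}{level}, $saves{$c}{$ss}{size};
    }
}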
Post-Processing and Reporting
After all the data has been parsed, we post-process it to pull
out the start and end times and to sum the total amount of data
written to tape across all savesets from all clients. This is a
simple step: since all the sizes are already in kilobytes, we just
sum them, both per level and as an overall total.
Once this is done, we must determine which scale to use when showing
the reports. Generally, I like to use the biggest scale possible;
if we have written gigabytes of data to tape, it's not very
informative to be told how many megabytes that is.
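A sketch of this post-processing pass, assuming the %saves structure
shown earlier with all sizes already in kilobytes:

my $total_kb = 0;
my %kb_by_level;
foreach my $c (keys %saves) {
    foreach my $ss (keys %{ $saves{$c} }) {
        my $kb = $saves{$c}{$ss}{size};
        $total_kb += $kb;
        $kb_by_level{ $saves{$c}{$ss}{level} } += $kb;
    }
}

# Pick the biggest scale that still gives a number of at least one.
my @scales = ( [ TB => 1024**3 ], [ GB => 1024**2 ],
               [ MB => 1024    ], [ KB => 1       ] );
my ($unit, $div) = ('KB', 1);
foreach my $s (@scales) {
    if ($total_kb >= $s->[1]) {
        ($unit, $div) = @$s;
        last;
    }
}
printf "Total written: %.2f %s\n", $total_kb / $div, $unit;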
After that, we can pick and choose among the various reports and
output them. When printing reports in Perl, most people think of
formats as a quick and dirty option, but there are some limitations
to this method that I found frustrating -- mostly dealing with
empty-line compaction. So, I use formats only for the main summary
section, and use printf() for the various extra reports.
These other reports are fairly self-explanatory, but I will look
at how a couple of them work. The print_top_n_size() function
sorts and displays the clients that wrote the most data. This is
broken into two steps. The first goes through all the clients,
totals up the size of all savesets written by each client, and
puts the results into a temporary hash. The second prints the
header of the report and then loops through the temporary hash of
client totals, printing until we either run out of clients or reach
the maximum number of clients to show. Generally, only the top
five or ten clients are interesting in terms of the amount of data
written.
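In sketch form (again with my own names, and taking a reference to
the %saves structure from earlier), the two steps look like this:

# Sketch of the two-step report; call as print_top_n_size(\%saves, 5).
sub print_top_n_size {
    my ($saves, $n) = @_;    # hash reference and how many to show

    # Step one: total all savesets per client into a temporary hash.
    my %total;
    foreach my $c (keys %$saves) {
        foreach my $ss (keys %{ $saves->{$c} }) {
            $total{$c} += $saves->{$c}{$ss}{size};
        }
    }

    # Step two: print the header, then the clients in descending
    # order of total size, stopping after $n of them.
    print "Top $n clients by data written:\n";
    my $shown = 0;
    foreach my $c (sort { $total{$b} <=> $total{$a} } keys %total) {
        printf "  %-30s %12d KB\n", $c, $total{$c};
        last if ++$shown >= $n;
    }
}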
In contrast, the report of the top N hosts by time is broken
down by saveset, since a client with a very small amount of data
could have a problem and take a very long time to write that data --
interesting information for troubleshooting purposes. The general
structure of the report is the same, though: a first loop through
the data builds a temporary hash of the needed info, then the header
is printed and the report is written. An obvious extension would
be to include the size of the data written as well as the time,
but I haven't felt a need for this, and no one has requested it.
There is also an issue of report width, since I am trying to keep
the entire report under 80 columns if at all possible.
Setting Up and Using NSS
To run these scripts, you need a reasonably up-to-date version
of Perl (see Resources) and the Time::Local and Getopt::Long modules,
both of which come standard with Perl 5.000 and newer.
Edit the file to make sure that you have the correct path to your
locally installed version of Perl. You can also edit the first few
lines that specify the default directories where the logs and
the savegroup input should be saved. These can also be specified
on the command line with the -L and -O options, respectively.
You can feed the raw Savegroup summaries directly to NSS via STDIN
to get a nice report, as shown in Example 2. This is very useful
for testing or just running off a quick report to make sure things
are working correctly for your site.
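For example, assuming you have saved a raw notification to a file,
a quick test run might look like this (the filename is illustrative):

/path/to/nss -s -t < /tmp/savegroup-notification.txt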
Another technique is to use procmail to filter incoming savegroup
notification emails and forward the summaries, while saving the
raw savegroup notifications to a mail folder. This is slightly
trickier, but it eliminates the worry that you'll lose the raw
notifications sent to your sys admins. See Figure 2 for an example
procmail recipe (and the sketch below); you will probably have to
tweak it to recognize the format of email sent to you. In that
example, the email is sent from SERVER, and it is saved in a mail
folder in the user's Mail/ directory. For a more complete discussion
of procmail and how to write recipes, see the Resources section.
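A recipe along those lines might look like the following sketch;
the From and Subject patterns, the folder name, and the path to nss
are all assumptions you would adjust for your site:

# First, file a copy of the raw notification (the "c" flag keeps
# the message flowing on to the next recipe):
:0 c:
* ^From.*SERVER
* ^Subject:.*savegroup
Mail/savegroups

# Then pipe the message through NSS, which mails out the summary:
:0
* ^From.*SERVER
* ^Subject:.*savegroup
| /path/to/nss -s -t -m "admins@foo.com"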
You can call NSS directly from NetWorker so that only summaries
get emailed to the sys admins. Here are instructions for Legato
NetWorker 5.1.1 on Solaris. They might be slightly different depending
on which version of NetWorker and which OS you are running.
1. Start up the /usr/bin/nsr/nwadmin GUI and connect to your NetWorker
server.
2. Click on Customize -> Notifications.
3. A new window will pop up with a list of notifications. Scroll
down and highlight "Savegroup completion".
4. Edit the action to be as follows:
/path/to/nss -o -l -s 10 -t 10 -T -m "admins@foo.com"
5. Click on the "Apply" button.
The above options deserve some explanation, as shown in Figure
1. Note, however, that you can see the online help with an explanation
of all the arguments by passing the -h flag to nss
when you run it from the command line.
The -o option tells NSS to write the savegroup notification
to the default saveinput directory, as specified in the source code
or by the -O option. The filename format option, -F,
defaults to "%S/%G-%D", where %S is the backup server name, %G is
the savegroup name, and %D is the date of the savegroup notification.
This lets you log the data from multiple backup servers, each with
multiple savegroups, into a central and consistent directory structure.
If necessary, you can recreate reports later by running NSS on the
saved file(s).
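The expansion is easy to picture in Perl (this is not the actual
NSS code; the sample values, and especially the date format, are
illustrative only):

my ($server, $group, $date) = ('backhost', 'Daily', '2002-05-01');
my %expand = ( S => $server, G => $group, D => $date );

my $file = '%S/%G-%D';                 # the -F default
$file =~ s/%([SGD])/$expand{$1}/g;
print "$file\n";                       # backhost/Daily-2002-05-01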
The -l option is similar, but it turns on logging of the
summary data, which can later be plotted with some sort of data-analysis
tool, such as nss-plot. Its companion option, -L, specifies
the directory to log the data to.
The -s and -t options can be used with or without an
optional count. They turn on the reporting of the top N (default
5) savesets in terms of size and time, respectively.
The -T option turns on the tape report. Please note that
this option can only show the status of tape use at the time the
report is run. So, if you run the report immediately after the
savegroup notification is sent by NetWorker, you will get a reasonably
accurate picture of the tapes remaining at that time. If you run
the report later, you may encounter problems, because it depends
on the user running the report having sufficient permissions to
use the mminfo command to extract information from NetWorker,
and such access might not be granted to general users.
The -m option specifies the email address to which to send
the Savegroup Summary; if it is left off, the default of STDOUT
is used. Its companion option is -S <string>, which specifies
the subject of the Savegroup Summary being emailed. It defaults to
"Backups %E: %S - %G", where %S and %G are the same as in the
-F option, and %E gives the status of the completed backup,
either "SUCCEEDED" or "FAILURES". The idea here is that if you miss
reading email for several days, you can sort your inbox by "Subject:"
and quickly dispose of all the successful reports, focusing on the
failures, since they are the most important.
Plans for future work include dynamic sizing of report column
widths based on client hostname length, a plotting/visualization
tool for the data saved with the -l option, handling newer
versions of NetWorker, and adding support for Veritas NetBackup.
Conclusion
NSS is still under development as time and energy permit, but
the basic layout has stabilized over the past year because it does
what I need it to do without muss or fuss.
In this article, I've tried to provide both a useful script and
some pointers on the concepts you can use to write your own application
for parsing arbitrary input and generating useful reports.
Resources
NSS Homepage -- http://jfs.ecotarium.org/sources/nss
Perl -- http://www.perl.org
Procmail -- http://www.procmail.org
Legato -- http://www.legato.com
NetWorker Users Mailing List -- http://listmail.temple.edu/archives/networker.html
John Stoffel attended Worcester Polytechnic Institute, where
he earned a degree in computer science and spent way too much time
doing Theatre and Rock'n'Roll lighting on the side. He currently
works as a senior UNIX sys admin for a not-so-large-anymore major
telecommunications company. He is also a board member of the USENIX
SAGE Certification program at http://www.sage-cert.org. He
can be reached at: john@stoffel.org.