Automating Web Reports with Analog
Isaac Sacolick
At my organization, we have a
number of Web applications that we sell to newspapers as an Application
Service Provider (ASP). These products were developed over a six-year
period, and some were acquired from other companies. As a result,
they were built on, and run on, different software platforms and
system architectures. Additionally, we have several traffic-reporting
tools and procedures in place, each with its own weaknesses and
strengths. We were tasked with finding a consolidated traffic-reporting
solution that would work with our different products without requiring
a massive re-engineering of the applications or the back-end tools.
We decided to keep things simple and pilot an approach on one
of the company's products. We aimed to develop a fairly generic
approach, loosely coupled to the application environment, so that
integrating the other products later would be feasible. We knew
that using Web logs would be the easiest route, so we focused our
early efforts on the procedures for managing the log files.
We chose to use the free logfile analysis program Analog, available
at http://analog.cx, as the core of our reporting system.
My company had experimented with two different commercial traffic-analysis
packages without much success. Outsourcing seemed like a quick and
easy way to provide detailed traffic reports, but the costs were
high and ultimately would not bring us closer to reports that tie
traffic metrics to other product metrics. We had already used Analog
for quick product performance reports, so we knew its capabilities
and strengths.
Although there are other open source traffic-reporting systems
with better reporting features, Analog's simplicity really
made it attractive. At its core, it parses one or more log files,
supports various types of filters, and produces a large number of
reporting sections and output formats. A single configuration file
defines all of the processing parameters, rules, and reporting
features. One important feature is its ability to develop a cache
of the raw reporting data. This enables you to parse the log files
once, create a cache output file, then run Analog on the cache file(s)
to generate reports. Cache files are smaller than the original log
files, and processing them is significantly faster.
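The two passes differ in only a handful of directives. A minimal
illustration, with paths invented for the example (OUTPUT NONE tells
Analog to skip the report on the cache-building pass):

# Cache-building pass: parse the raw logs once and write a cache file
LOGFILE /logs/access_log.2004-03-01
CACHEOUTFILE /cache/2004-03-01.cache
OUTPUT NONE

# Reporting pass (a separate configuration): read caches, not raw logs
CACHEFILE /cache/2004-03-01.cache
CACHEFILE /cache/2004-03-02.cache
OUTPUT HTML
OUTFILE /reports/report.html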
Some other features that we liked about Analog are as follows:
- It's incredibly fast, and we could run months' worth of
data on relatively inexpensive hardware. The equivalent recovery
and rerun features of the commercial products we evaluated were
slower and far more complex.
- It has excellent documentation and an active user community, and
is on its fifth major version.
- Analog has several output formats, including a fielded data
(COMPUTER) output. We knew we could use Analog to generate the
reports now and, when we were ready to join traffic data with other
product metrics, use it to generate raw summary information for
import into our data warehouses and higher-level reports.
Implementation
After choosing Analog as the log processing application, we defined
other parts of the system. Figure 1 shows a simplified view of the
network and system architecture. Web servers and Web application
servers log to their local disks and their log rotation times are
standardized. We have two core processing scripts. A log management
script pulls the logs from each server and copies them to a redundant
NAS. The backup server is then tasked with a job to back up the
new log files. The data processing server runs a second log processing
script that automates processing each day's logs. At the end
of this process, the daily batch reports are generated and stored
back on the NAS. The Reporting Server is both an intranet and extranet
server that allows internal users and customers to view reports.
The Data Processing and Reporting Linux Servers have identical configurations
that provide failover capability if needed. The entire configuration
runs on a back-end VLAN so that reporting traffic does not interfere
with customer-facing Web traffic.
The log management script is a relatively straightforward script
that moves data files to the NAS and triggers the backup process.
The log processing script manages the automated processing of the
new logs and provides a simple interface for configuring and running
batch reports. The script supports several independent workflows
that can be chained together by running it with the appropriate
parameters. Workflows supported include:
1. Ensure that all log files are present for a given day.
2. Create a directory tree that organizes the logs by date and
by server.
3. Break each log up by virtual host and store these in a new
directory structure.
4. Run Analog on a day's worth of logs for each virtual host
and create the cached output.
5. Create a report covering one or more virtual hosts for a specific
date range.
There are many ways to support steps 1 and 2 depending on your
server architecture, naming conventions, and reporting needs. Step
3 was developed to optimize the process and to facilitate giving
customers access to their log files. This step may not be required
when managing a limited number of virtual hosts or when the Web
servers are configured to log each virtual host to a separate log
file. Steps 4 and 5 show our approach for simplifying and automating
the nightly jobs; they may be useful for optimizing the nightly log
processing and configuring Analog to deliver batch reports.
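As a sketch of step 3, the split can be a single pass over each
combined log. This assumes the Web servers prepend the virtual host
name to every line (an Apache LogFormat beginning with "%v ", for
example); the leading field is stripped on output so the per-vhost
files keep their original format:

#!/usr/bin/perl -w
# Split one day's combined log by virtual host (sketch of step 3)
use strict;
use File::Path;

my ($infile, $outroot, $date) = @ARGV;
my %handles;

open(my $in, '<', $infile) or die "cannot open $infile: $!";
while (my $line = <$in>) {
    # First whitespace-delimited field is assumed to be the vhost name
    my ($vhost, $rest) = split(' ', $line, 2);
    next unless defined $rest;
    unless ($handles{$vhost}) {
        my $dir = "$outroot/$vhost/$date";
        mkpath($dir);
        open($handles{$vhost}, '>>', "$dir/access_log")
            or die "cannot open $dir/access_log: $!";
    }
    print { $handles{$vhost} } $rest;
}
close($in);
close($_) for values %handles;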
After step 3, the logs are stored in a directory structure that
looks like:
/logs_by_vhost/VirtualHost/YYYY-MM-DD
Inside this directory are all log files for this virtual host on the
specific day. We wanted to create an output cache repository that
looked like:
/cache_by_vhost/VirtualHost/YYYY/MM/YYYY-MM-DD.cache
where the file "YYYY-MM-DD.cache" is the file Analog creates
for the virtual host for data covering a specific day.
To create a cache file, set up an Analog configuration file that
produces cached output. You can automate this by setting up a "template"
Analog configuration file, then running a function in the script
that does the following:
- Makes a list of all virtual hosts to run on
- For each virtual host:
- Loads the template Analog configuration file into a string
- Substitutes the runtime parameters for creating the cache for
this run
- Saves this "instance" version of the Analog configuration
file
- Runs Analog, which creates the cache file
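The template itself can be an ordinary Analog configuration file
with uppercase tokens standing in for the runtime values. The %NAME%
token syntax below is our illustration; any unambiguous marker works:

# analog.cfg.template -- tokens are replaced at runtime by the script
LOGFILE %LOGFILE%
FROM %FROM%
TO %TO%
CACHEOUTFILE %CACHEOUTFILE%
VHOSTINCLUDE %VHOSTINCLUDE%
HOSTNAME "%HOSTNAME%"
HOSTURL %HOSTURL%
OUTFILE %OUTFILE%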
Listing 1 shows how to create the instance version of the configuration
file. This function takes three arguments: the input template filename,
the output instance filename, and a hash of values to substitute.
For the cache run, we set the following variables:
my %config_substitutes = (
    LOGFILE      => "$PATH/logs_by_vhost/$vhost/$date/*",
    FROM         => "$date",
    TO           => "$date",
    CACHEOUTFILE => "$PATH/cache_by_vhost/$vhost/$year/$month/$date.cache",
    VHOSTINCLUDE => "$vhost",
    HOSTNAME     => "$hostname",
    HOSTURL      => "http://$vhost",
    OUTFILE      => "none",
);
If we have two strings, $infile holding the input Analog template
filename and $outfile the output instance version of the config file,
the routine can be called using:
createAnalogConfigFile(\$infile, \$outfile, \%config_substitutes);
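Listing 1 is essentially a search-and-replace over the template. A
minimal sketch of such a routine, assuming the %NAME% placeholder
convention illustrated above:

# Substitute runtime values into the template configuration file.
# Expects string references and a hash reference, as in the call above.
sub createAnalogConfigFile {
    my ($infile, $outfile, $substitutes) = @_;

    open(my $in, '<', $$infile) or die "cannot read $$infile: $!";
    my $config = do { local $/; <$in> };   # slurp the whole template
    close($in);

    # Replace each %KEY% token with its runtime value
    while (my ($key, $value) = each %$substitutes) {
        $config =~ s/%\Q$key\E%/$value/g;
    }

    open(my $out, '>', $$outfile) or die "cannot write $$outfile: $!";
    print $out $config;
    close($out);
}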
The key parts of this substitution are the Analog configuration parameters
that are based on virtual host or date. You may find other parameters
that need to be modified, but the process should be similar. After
creating the instance configuration file, running Analog with it
generates the cache file. We set up this process to run each
evening after the log files are archived. This minimizes the amount
of processing and optimizes the environment to run batched and ad
hoc reports over an arbitrary date range.
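Putting the pieces together, the nightly cache job for one virtual
host reduces to a few lines. Here, $template_cfg and %config_substitutes
are assumed to be set up as described above; analog's -G option skips
its default configuration file and +g names our instance file:

# Build the instance configuration, then run Analog against it
my $instance_cfg = "$PATH/conf/$vhost-$date.cfg";
createAnalogConfigFile(\$template_cfg, \$instance_cfg, \%config_substitutes);

system("analog", "-G", "+g$instance_cfg") == 0
    or warn "analog failed for $vhost on $date: $?";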
Running Reports
Generating reports follows a process similar to creating
the cache. The program takes the virtual host, a "from"
date, and a "to" date as parameters. We then create a
config_substitutes hash that starts with something like:
my %config_substitutes = (
    LOGFILE      => "none",
    FROM         => "$from",
    TO           => "$to",
    CACHEOUTFILE => "none",
    VHOSTINCLUDE => "$vhost",
    HOSTNAME     => "$hostname",
    HOSTURL      => "http://$vhost",
    OUTPUT       => "HTML",
    OUTFILE      => "$PATH/report_by_customer/$vhost/$vhost-$from-$to.html",
);
Additional parameters turn specific report sections on or off.
The only step left is to tell Analog which cache
files to load for this report. The cache files are set using the CACHEFILE
parameter. To run on a date range, a CACHEFILE directive is needed
for every cache file that covers the virtual host (or group of them)
and date within the date range. Listing 2 shows how to create the
CACHEFILE parameter string. From the $from and $to dates, an array
containing every date in the range is created. A set of CACHEFILE
parameters is then created, one for each date in the array. This
string can then be added to the hash as:
$config_substitutes{CACHEFILE} =
    createCacheFileStr("$PATH/cache_by_vhost", $vhost, $from, $to);
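Listing 2's date arithmetic can be handled with the standard
Time::Local module. A sketch, assuming YYYY-MM-DD dates and the
cache layout described earlier:

use Time::Local;

# Build one CACHEFILE directive per day in the range (sketch of Listing 2)
sub createCacheFileStr {
    my ($cacheroot, $vhost, $from, $to) = @_;

    my @f = split('-', $from);
    my @t = split('-', $to);
    # Work from noon to noon so DST changes can't skip or repeat a day
    my $time = timelocal(0, 0, 12, $f[2], $f[1] - 1, $f[0]);
    my $end  = timelocal(0, 0, 12, $t[2], $t[1] - 1, $t[0]);

    my $str = '';
    while ($time <= $end) {
        my ($d, $m, $y) = (localtime($time))[3, 4, 5];
        $str .= sprintf("CACHEFILE %s/%s/%04d/%02d/%04d-%02d-%02d.cache\n",
                        $cacheroot, $vhost, $y + 1900, $m + 1,
                        $y + 1900, $m + 1, $d);
        $time += 24 * 60 * 60;
    }
    return $str;
}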
Analog then runs and creates an output report (in this example, an
HTML report), which is stored in a report output directory "report_by_customer/$vhost".
Viewing the Reports
Because the reports are archived in directories by virtual host,
we configured our Reporting Server with the Apache Web server so
that internal users and customers can log in, browse the appropriate
reporting directories, and review the reports. We also configured
the reporting script to run as both a command-line application and
a CGI: internal users can enter reporting parameters into a Web
form, submit it, and generate reports on the fly.
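Supporting both modes mostly means detecting the execution environment
and emitting an HTTP header when needed. A rough sketch using the
standard CGI module, with $vhost, $from, and $to feeding the reporting
workflow described above:

use CGI;

# Accept parameters from either a Web form or the command line
my ($vhost, $from, $to);
if ($ENV{GATEWAY_INTERFACE}) {              # running as a CGI
    my $q = CGI->new;
    ($vhost, $from, $to) =
        ($q->param('vhost'), $q->param('from'), $q->param('to'));
    print $q->header('text/html');
} else {                                    # running from the command line
    ($vhost, $from, $to) = @ARGV;
}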
Conclusions
From here, there are many features that can be added to this basic
framework. Here are some examples:
- If you rotate servers frequently, consider building a database
that tracks this activity. You can then enhance the quality control
steps by checking the database and ensuring that the proper log
files were rotated and are available before generating the cache.
- If you have many different batch reports to run nightly, consider
building a database that lists each report's duration, frequency,
and configuration.
- If you want to join traffic-reporting metrics with other product
metrics, use Analog's COMPUTER output format to produce summary
data and load it into your database (a sketch follows this list).
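For that last idea, the loader can stay small. Analog's COMPUTER
format writes one record per line with a configurable field separator
(the COMPSEP directive). A sketch that tags each row for a bulk
import; the file names and tagging scheme are our invention:

use strict;

# Stage Analog COMPUTER output for a bulk database import. Field
# meanings vary by report section, so rows are kept verbatim and
# simply tagged with the run's virtual host and date.
my ($vhost, $date, $in_file, $out_file) = @ARGV;

open(my $in, '<', $in_file) or die "cannot open $in_file: $!";
open(my $out, '>', $out_file) or die "cannot write $out_file: $!";
while (my $line = <$in>) {
    chomp $line;
    next if $line =~ /^\s*$/;               # skip blank lines
    print $out join("\t", $vhost, $date, $line), "\n";
}
close($in);
close($out);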
Also, consider exploring the many helper applications that Analog
lists at their Web site:
http://analog.cx/helpers/index.html
Whatever solution you implement, keep in mind the basic principles:
ensure a quality log-management process, automate that process, and
engage all parties in prioritizing the reporting features and management
utilities.
Acknowledgements
Special thanks to several individuals who contributed to this
work, including Bill Castagne, Wei Guo, Nanda Mounasamy, Kevin Phillips,
Wayne Regencia, Oneill Stanleigh, Viktoria Stern, and Yvette Vacca.
Isaac Sacolick is the Chief Technology Officer of PowerOne
Media Inc, a leading Application Service Provider for newspapers,
and also serves as an advisor to several technology startup companies.
His professional interests and expertise are in publishing systems,
application networks, data warehousing, process automation, and
artificial intelligence.