Providing Accurate and Reliable Traffic Reports from Log Data
The integrity of Web server logs usually falls under the responsibility
of the site's systems administrators or Webmasters. The adage
"garbage in, garbage out" is appropriate for reporting
metrics derived from Web logs. If there are quality problems in how
your servers log requests, which fields are logged, or how log files
are processed, then the information reported in
traffic reports will be suspect. If your company provides customers
or investors with financial metrics derived from Web traffic reports,
then sys admins must implement reasonable quality control procedures
to ensure that the logging process is accurate.
Logging issues fall under several categories:
Log File Integrity -- Most sites deploy multiple Web servers,
and log formats and other operating parameters must be consistent
across all of them. Many sites employ methods for distributing Web
server configuration files (such as Apache's httpd.conf) from
a single source to all Web servers in the cluster, or use a management
console that handles this distribution. In any case, inconsistent
logging formats can lead to inaccurate results. Another common problem
occurs if a Web server stops logging, which can happen if a disk
becomes full or if the process has trouble opening or creating the
log file. It's also a good idea to segregate requests for graphics,
audio files, JavaScript files, and other object-level content onto
a different set of virtual servers than those used for serving
page requests. This lets you create and process smaller log
files, which reduces server I/O, storage requirements, and
log-processing time.
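For example, a simple monitoring script can check that each server's access log is still growing and alert you when one has gone quiet. The following Python sketch illustrates the idea; the directory layout, idle threshold, and mail addresses are placeholders you would replace with your own.

#!/usr/bin/env python
# Hypothetical monitoring sketch: alert when any server's access log has
# stopped growing, which can indicate a full disk or a failed log open.
import os
import time
import smtplib
from email.message import EmailMessage

LOG_ROOT = "/var/log/webfarm"        # assumed layout: one directory per server
MAX_IDLE_SECONDS = 15 * 60           # alert if no writes in 15 minutes
ALERT_TO = "sysadmin@example.com"    # placeholder address

def stale_logs():
    """Yield (host, path, idle_seconds) for logs that are missing or quiet."""
    now = time.time()
    for host in sorted(os.listdir(LOG_ROOT)):
        path = os.path.join(LOG_ROOT, host, "access_log")
        if not os.path.exists(path):
            yield host, path, None   # a missing log is also a problem
            continue
        idle = now - os.path.getmtime(path)
        if idle > MAX_IDLE_SECONDS:
            yield host, path, idle

def send_alert(lines):
    msg = EmailMessage()
    msg["Subject"] = "Web log integrity alert"
    msg["From"] = "logcheck@example.com"
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(lines))
    smtplib.SMTP("localhost").send_message(msg)

if __name__ == "__main__":
    problems = ["%s: %s (%s)" % (host, path,
                "missing" if idle is None else "idle for %d seconds" % idle)
                for host, path, idle in stale_logs()]
    if problems:
        send_alert(problems)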
Log File Management -- To archive and process the reporting
data, you will need an approach for rotating a Web server's
logs, which essentially means copying out the data in the log and
replacing the log with a new, empty file. The Apache project documents two
approaches on its Web site: the rotatelogs program:
http://httpd.apache.org/docs/programs/rotatelogs.html
and cronolog:
http://www.cronolog.org/
The important detail here is to standardize when you rotate logs.
You should also aggregate logs from all servers to a single location
to simplify running log analysis software and to facilitate backups.
There are many approaches that work, including secure copying after
log rotation or using NFS mounts to NAS devices. Whatever the approach,
it's important to verify that the archiving and centralizing step
completes correctly and to send alerts if there are any problems.
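As a rough sketch of the centralizing step, the following Python script pulls the previous day's rotated log from each server over scp, archives it centrally, and flags copies that fail or come back suspiciously small. The host names, rotation naming convention, and paths are assumptions, not a prescribed layout.

#!/usr/bin/env python
# Hypothetical sketch: pull yesterday's rotated access log from each Web
# server into a central archive and flag copies that fail or look too small.
# The host names, rotation naming, and paths are assumptions.
import os
import subprocess
import datetime

SERVERS = ["www1.example.com", "www2.example.com"]  # placeholder hosts
REMOTE_LOG = "/var/log/httpd/access_log.%Y%m%d"     # assumed rotation naming
ARCHIVE_DIR = "/archive/weblogs"                    # assumed central location
MIN_BYTES = 1024                                    # near-empty logs are suspicious

def collect(day):
    """Copy each server's rotated log for 'day' and return a list of problems."""
    remote = day.strftime(REMOTE_LOG)
    problems = []
    for host in SERVERS:
        dest = os.path.join(ARCHIVE_DIR,
                            "%s-%s" % (host, os.path.basename(remote)))
        status = subprocess.call(["scp", "-q", "%s:%s" % (host, remote), dest])
        if status != 0:
            problems.append("copy failed from %s" % host)
        elif os.path.getsize(dest) < MIN_BYTES:
            problems.append("log from %s is only %d bytes"
                            % (host, os.path.getsize(dest)))
    return problems

if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    errors = collect(yesterday)
    if errors:
        # In practice this would page or mail the on-call administrator.
        print("LOG ARCHIVE PROBLEMS:\n" + "\n".join(errors))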
Request-Level Data Integrity Issues -- Web servers log HTTP
requests, but as discussed, users are far more interested in page-view
and visit-level metrics. The first issue is to determine when
a request is not a page view. For example, HTTP redirects, page
caching, and server-side content pulls can all lead to server requests
that are not pages. The second issue results from different methods
for managing request- and session-level information. Request parameters
can be passed in as directories in the URL, as parameters in the
query string, or as type=hidden variables in HTML forms. Hidden
variables are passed in using the content body of the HTTP request
and are generally not logged. If the application stores persistent
data in server-side session variables or cookies, then this information
is probably not captured in the Web server logs either.
Still other parameters may be "implied" as default values
when they are not sent in with the request. Before reporting on any
request parameters, it's important to investigate all methods
for passing in or storing these parameters.
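To make the page-view distinction concrete, the following Python sketch reads an Apache combined-format access log and counts only requests that look like page views, filtering out redirects, 304 cache revalidations, and object-level requests. The file-extension list and status-code rules are assumptions you would tune to your own site.

#!/usr/bin/env python
# Hypothetical sketch: read an Apache combined-format access log on stdin
# and count only the requests that look like page views. The extension list
# and status-code rules are assumptions to be tuned for a real site.
import re
import sys

# host, ident, user, [time], "method uri protocol", status, bytes, ...
LINE_RE = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')
OBJECT_EXTENSIONS = (".gif", ".jpg", ".png", ".css", ".js", ".ico", ".wav", ".mp3")

def is_page_view(line):
    match = LINE_RE.match(line)
    if not match:
        return False                     # unparseable lines are not counted
    method, uri, status = match.group(5), match.group(6), int(match.group(7))
    path = uri.split("?", 1)[0].lower()  # ignore the query string
    if status >= 300:
        return False                     # redirects, errors, and 304 cache hits
    if path.endswith(OBJECT_EXTENSIONS):
        return False                     # graphics, scripts, and other objects
    return method in ("GET", "POST")

if __name__ == "__main__":
    pages = total = 0
    for line in sys.stdin:
        total += 1
        if is_page_view(line):
            pages += 1
    print("%d page views out of %d requests" % (pages, total))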