Providing Accurate and Reliable Traffic Reports from Log Data
The integrity of Web server logs usually falls under the responsibility
of the site's systems administrators or Webmasters. The adage
"garbage in, garbage out" is appropriate for reporting
metrics derived from Web logs. If there are quality problems in how
your servers log requests, which fields are logged, or how log files
are processed, then the information reported in
traffic reports will be suspect. If your company provides customers
or investors with financial metrics derived from Web traffic reports,
then sys admins must implement reasonable quality control procedures
to ensure that the logging process is accurate.
Logging issues fall under several categories:
Log File Integrity -- Most sites deploy multiple Web servers,
and log formats and other operating parameters must be consistent
across all of them. Many sites employ methods for distributing Web
server configuration files (such as Apache's httpd.conf) from
a single source to all Web servers in the cluster, or use a management
console that handles this distribution. In any case, inconsistent
logging formats can lead to inaccurate results. Another common problem
occurs if a Web server stops logging, which can happen if a disk
becomes full or if the process has trouble opening or creating the
log file. It's also a good idea to segregate requests for graphics,
audio files, JavaScript files, and other object-level content onto
a different set of virtual servers than those used for serving
page requests. This lets you create and process smaller log
files, which reduces server I/O, storage requirements, and
log-processing time.
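For example, a simple monitoring script can check that each server's access log is still growing and alert you when one has gone quiet. The following Python sketch illustrates the idea; the directory layout, idle threshold, and mail addresses are placeholders you would replace with your own.

#!/usr/bin/env python
# Hypothetical monitoring sketch: alert when any server's access log has
# stopped growing, which can indicate a full disk or a failed log open.
import os
import time
import smtplib
from email.message import EmailMessage

LOG_ROOT = "/var/log/webfarm"        # assumed layout: one directory per server
MAX_IDLE_SECONDS = 15 * 60           # alert if no writes in 15 minutes
ALERT_TO = "sysadmin@example.com"    # placeholder address

def stale_logs():
    """Yield (host, path, idle_seconds) for logs that are missing or quiet."""
    now = time.time()
    for host in sorted(os.listdir(LOG_ROOT)):
        path = os.path.join(LOG_ROOT, host, "access_log")
        if not os.path.exists(path):
            yield host, path, None   # a missing log is also a problem
            continue
        idle = now - os.path.getmtime(path)
        if idle > MAX_IDLE_SECONDS:
            yield host, path, idle

def send_alert(lines):
    msg = EmailMessage()
    msg["Subject"] = "Web log integrity alert"
    msg["From"] = "logcheck@example.com"
    msg["To"] = ALERT_TO
    msg.set_content("\n".join(lines))
    smtplib.SMTP("localhost").send_message(msg)

if __name__ == "__main__":
    problems = ["%s: %s (%s)" % (host, path,
                "missing" if idle is None else "idle for %d seconds" % idle)
                for host, path, idle in stale_logs()]
    if problems:
        send_alert(problems)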
Log File Management -- To archive and process the reporting
data, you will need an approach for rotating a Web server's
logs, which essentially means copying out the data in the log and
replacing the log with a new, empty file. The Apache project documents two
approaches on its Web site: the rotatelogs program:
http://httpd.apache.org/docs/programs/rotatelogs.html
and cronolog:
http://www.cronolog.org/
The important detail here is to standardize when you rotate logs.
You should also aggregate logs from all servers to a single location
to simplify running log analysis software and to facilitate backups.
There are many approaches that work, including secure copying after
log rotation or using NFS mounts to NAS devices. Whatever the approach,
it's important to verify that the archiving and centralizing step
completes correctly and to send alerts if there are any problems.
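As a rough sketch of the centralizing step, the following Python script pulls the previous day's rotated log from each server over scp, archives it centrally, and flags copies that fail or come back suspiciously small. The host names, rotation naming convention, and paths are assumptions, not a prescribed layout.

#!/usr/bin/env python
# Hypothetical sketch: pull yesterday's rotated access log from each Web
# server into a central archive and flag copies that fail or look too small.
# The host names, rotation naming, and paths are assumptions.
import os
import subprocess
import datetime

SERVERS = ["www1.example.com", "www2.example.com"]  # placeholder hosts
REMOTE_LOG = "/var/log/httpd/access_log.%Y%m%d"     # assumed rotation naming
ARCHIVE_DIR = "/archive/weblogs"                    # assumed central location
MIN_BYTES = 1024                                    # near-empty logs are suspicious

def collect(day):
    """Copy each server's rotated log for 'day' and return a list of problems."""
    remote = day.strftime(REMOTE_LOG)
    problems = []
    for host in SERVERS:
        dest = os.path.join(ARCHIVE_DIR,
                            "%s-%s" % (host, os.path.basename(remote)))
        status = subprocess.call(["scp", "-q", "%s:%s" % (host, remote), dest])
        if status != 0:
            problems.append("copy failed from %s" % host)
        elif os.path.getsize(dest) < MIN_BYTES:
            problems.append("log from %s is only %d bytes"
                            % (host, os.path.getsize(dest)))
    return problems

if __name__ == "__main__":
    yesterday = datetime.date.today() - datetime.timedelta(days=1)
    errors = collect(yesterday)
    if errors:
        # In practice this would page or mail the on-call administrator.
        print("LOG ARCHIVE PROBLEMS:\n" + "\n".join(errors))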
Request-Level Data Integrity Issues -- Web servers log HTTP
requests, but as discussed, users are far more interested in page-view
and visit-level metrics. The first issue is to determine when
a request is not a page view. For example, HTTP redirects, page
caching, and server-side content pulls can all lead to server requests
that are not pages. The second issue results from different methods
for managing request- and session-level information. Request parameters
can be passed in as directories in the URL, as parameters in the
query string, or as type=hidden variables in HTML forms. Hidden
variables are passed in using the content body of the HTTP request
and are generally not logged. If the application stores persistent
data in server-side session variables or cookies, then this information
is probably not captured in the Web server logs either.
Still other parameters may be "implied" as default values
when they are not sent in with the request. Before reporting on any
request parameters, it's important to investigate all methods
for passing in or storing these parameters.
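To make the page-view distinction concrete, the following Python sketch reads an Apache combined-format access log and counts only requests that look like page views, filtering out redirects, 304 cache revalidations, and object-level requests. The file-extension list and status-code rules are assumptions you would tune to your own site.

#!/usr/bin/env python
# Hypothetical sketch: read an Apache combined-format access log on stdin
# and count only the requests that look like page views. The extension list
# and status-code rules are assumptions to be tuned for a real site.
import re
import sys

# host, ident, user, [time], "method uri protocol", status, bytes, ...
LINE_RE = re.compile(r'(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+)[^"]*" (\d{3}) (\S+)')
OBJECT_EXTENSIONS = (".gif", ".jpg", ".png", ".css", ".js", ".ico", ".wav", ".mp3")

def is_page_view(line):
    match = LINE_RE.match(line)
    if not match:
        return False                     # unparseable lines are not counted
    method, uri, status = match.group(5), match.group(6), int(match.group(7))
    path = uri.split("?", 1)[0].lower()  # ignore the query string
    if status >= 300:
        return False                     # redirects, errors, and 304 cache hits
    if path.endswith(OBJECT_EXTENSIONS):
        return False                     # graphics, scripts, and other objects
    return method in ("GET", "POST")

if __name__ == "__main__":
    pages = total = 0
    for line in sys.stdin:
        total += 1
        if is_page_view(line):
            pages += 1
    print("%d page views out of %d requests" % (pages, total))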