Compressing
Web Content with mod_gzip and mod_deflate
Stephen Pierzchala
Cost-reduction is a key component in every IT budget. One item
that is scrutinized is the cost of bandwidth. The use of a content
compression solution is one way to conserve bandwidth. With that
in mind, this article will cover some of the compression modules
for Apache: mod_gzip for Apache 1.3.x and 2.0.x; and mod_deflate
for Apache 2.0.x.
Content Compression Basics
Most compression algorithms, when applied against a plain text
file, can reduce its size by 70% or more, depending on the content
in the file. (See the sidebar for an overview of compression levels.)
When using compression algorithms, there is little difference in output
size between the standard and maximum compression levels, especially when
you consider the extra CPU time necessary for the additional compression
effort. This is particularly important when dynamically compressing
Web content. Most software content compression techniques use a
compression level of 6 (out of 9 possible levels) to conserve CPU
cycles. The file size difference between level 6 and level 9 is
usually so minimal that it's not worth the extra time.
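The diminishing returns from higher compression levels can be demonstrated directly with zlib, the library underlying both Apache modules (a minimal Python sketch; the sample text is invented for illustration):

```python
import zlib

# Repetitive sample text standing in for a typical HTML page.
data = b"<p>Compression favors repeated content in text files.</p>\n" * 500

level6 = zlib.compress(data, 6)  # the common default trade-off
level9 = zlib.compress(data, 9)  # maximum compression

# Both levels shrink the file dramatically; the gap between them is tiny.
print(len(data), len(level6), len(level9))
```

On text like this, level 9 saves at most a handful of bytes over level 6, while costing extra CPU time per request.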
Content Compression of Web Content
With files that are identified with the text/.* MIME-types, compression
can be applied to the file before it's placed on the wire,
reducing the number of bytes transferred, and improving performance
at the same time. Testing has also shown that Microsoft Office,
StarOffice/OpenOffice, and PostScript files can be GZIP-encoded for
transport by the compression modules.
Some important MIME-types that cannot be GZIP-encoded are: application/x-javascript
(external JavaScript files); application/pdf (PDF files); and image/.*
(all image files). The caveat against JavaScript files is mainly
due to bugs in browser software, as these files are really text
files and would improve overall performance by being compressed
for transport. PDF and image files are already compressed, and attempting
to compress them again will simply make them larger, as well as
lead to potential rendering issues with browsers.
Before sending a compressed file to a client, the server must
ensure that the client receiving the data understands the compressed
format and can render it correctly. Browsers that understand compressed
content send a variation of the following client request headers:
Accept-encoding: gzip
Accept-encoding: gzip, deflate
All current major browsers include some variation of this message
with every request sent. (The exceptions to this rule are all versions
of Microsoft Internet Explorer when the HTTP/1.1 settings are turned
off. This seems to be a result of the erroneous belief that only HTTP/1.1
clients will send a request for GZIP-encoded content. See:
http://www.microsoft.com/technet/prodtechnol/iis/maintain/featusability/httpcomp.asp)
If the server sees the header and chooses to provide compressed content,
it will respond with the following pair of server response headers:
Content-encoding: gzip
Content-type: [insert appropriate MIME type]
These headers tell the receiving browser to first decompress the content,
and then parse it normally or pass it to the appropriate helper application.
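The negotiation just described can be modeled in a few lines (a much-simplified Python sketch; serve and render are hypothetical stand-ins for the server and browser, not real APIs):

```python
import gzip

def serve(body, request_headers):
    """Toy server: compress only when the client advertises gzip support."""
    if "gzip" in request_headers.get("Accept-Encoding", ""):
        headers = {"Content-Encoding": "gzip", "Content-Type": "text/html"}
        return headers, gzip.compress(body)
    return {"Content-Type": "text/html"}, body

def render(response_headers, body):
    """Toy browser: decompress first, then hand the bytes to the parser."""
    if response_headers.get("Content-Encoding") == "gzip":
        body = gzip.decompress(body)
    return body

page = b"<html><body>Hello</body></html>"
hdrs, wire = serve(page, {"Accept-Encoding": "gzip, deflate"})
assert render(hdrs, wire) == page  # round-trip is lossless
```

A client that never sends Accept-Encoding simply receives the uncompressed body over the same code path.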
The file size advantages of compressing content can be easily
seen by looking at a couple of examples -- one is an HTML file
(Table 1), and the other is a PostScript file (Table 2). Performance
improvements will be discussed later in the article.
Configuring mod_gzip
The mod_gzip module is available for both Apache/1.3.x and Apache/2.0.x,
and can be compiled into Apache as a DSO or a static module. The
Apache/1.3.x version is located at:
http://sourceforge.net/projects/mod-gzip/
and the Apache/2.0.x version is located at:
http://www.gknw.de/development/apache/httpd-2.0/unix/modules/
The compilation for a DSO is simple -- from the uncompressed source
directory, perform the following steps as root:
make APXS=/path/to/apxs
make install APXS=/path/to/apxs
/path/to/apachectl graceful
The mod_gzip module must be loaded last in the module list because Apache/1.3.x
processes content in module order, and compression is the final step
performed before data is written to the wire. (Note that mod_gzip
installs itself in the httpd.conf file, but it is commented out.)
A very basic configuration for mod_gzip in the httpd.conf would
include:
mod_gzip_item_include mime ^text/.*
mod_gzip_item_include mime ^application/postscript$
mod_gzip_item_include mime ^application/ms.*$
mod_gzip_item_include mime ^application/vnd.*$
mod_gzip_item_exclude mime ^application/x-javascript$
mod_gzip_item_exclude mime ^image/.*$
mod_gzip_item_exclude file \.(?:exe|t?gz|zip|bz2|sit|rar)$
This allows Microsoft Office and PostScript files to be GZIP-encoded,
while not compressing PDF files. PDF files should not be GZIP-encoded
because they are compressed in their native format, and compressing
them leads to issues when attempting to display the files in Adobe
Acrobat Reader. Paranoid systems administrators may want
to exclude PDF files from compression explicitly:
mod_gzip_item_exclude mime ^application/pdf$
Configuring mod_deflate
The mod_deflate module for Apache/2.0.x is included with the source
for this server. This makes compiling it into the server very simple:
./configure --enable-modules=all --enable-mods-shared=all --enable-deflate
make
make install
With mod_deflate for Apache/2.0.x, the GZIP-encoding of documents
can be enabled in one of two ways: explicit exclusion of files by
extension, or explicit inclusion of files by MIME type. These methods
are specified in the httpd.conf file.
Explicit exclusion:
SetOutputFilter DEFLATE
DeflateFilterNote ratio
SetEnvIfNoCase Request_URI \.(?:gif|jpe?g|png)$ no-gzip dont-vary
SetEnvIfNoCase Request_URI \.(?:exe|t?gz|zip|bz2|sit|rar)$ no-gzip dont-vary
SetEnvIfNoCase Request_URI \.pdf$ no-gzip dont-vary
Explicit inclusion:
DeflateFilterNote ratio
AddOutputFilterByType DEFLATE text/*
AddOutputFilterByType DEFLATE application/ms* application/vnd* application/postscript
In the explicit exclusion method, note the same exclusions as the
mod_gzip file, namely images and PDF files.
Compressing Dynamic Content
If your site uses dynamic content (XSSI, CGI, etc.), there is
no need to do anything special to compress the output from these
modules. Because mod_gzip and mod_deflate process all outgoing content
before it is placed on the wire, all content from Apache that matches
either the MIME-types or the file extensions mapped in the configuration
directives will be compressed.
The output from PHP, the most popular dynamic scripting language
for Apache, can be compressed in one of three ways:
- Using the built-in output handler, ob_gzhandler
- Using the built-in ZLIB compression
- Using one of the Apache compression modules
Configuring PHP's built-in compression is simply a matter
of compiling PHP with the --with-zlib configure flag and
re-configuring the php.ini file.
Output buffer method:
output_buffering = On
output_handler = ob_gzhandler
zlib.output_compression = Off
ZLIB method:
output_buffering = Off
output_handler =
zlib.output_compression = On
The output buffer method produces marginally better compression, but
either will work. The ob_gzhandler output handler can also be enabled
on a script-by-script basis if you do not want to enable compression
across your whole site.
If you do not want to (or cannot) reconfigure PHP with ZLIB enabled,
the Apache compression modules will compress the content generated
by PHP. I have configured my server so that the Apache modules handle
all of the compression, and so that all pages are compressed in
a consistent manner, regardless of their origin.
Caching Compressed Content
Can compressed content be cached? The answer is an unequivocal
yes. With mod_gzip and mod_deflate, Apache sends the "Vary"
header, indicating to caches that this object differs from other
requests for the same object based on certain criteria -- User-Agent,
Character Set, etc. When a compressed object is received by a cache,
it will note that the server returned a Vary: Accept-Encoding response,
indicating that this response was generated based on the request
containing the Accept-Encoding: gzip header.
This does lead to a situation where a cache can store two copies
of the same document, one compressed and one uncompressed. This
is a design feature of HTTP/1.1, and allows clients with and without
the ability to receive compressed content to still benefit from
the performance enhancements gained from local proxy caches.
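The effect of the Vary header on a cache can be modeled as a cache key that incorporates the request headers the server names in Vary (a hypothetical, much-simplified Python sketch; cache_key is an invented helper, not part of any real cache):

```python
def cache_key(url, request_headers, vary):
    """Build a cache key that varies on the request headers listed in Vary,
    so compressed and uncompressed responses occupy separate slots."""
    varied = tuple(request_headers.get(name.strip(), "")
                   for name in vary.split(","))
    return (url, varied)

# A gzip-capable client and a plain client map to different cache entries.
compressed = cache_key("/index.html", {"Accept-Encoding": "gzip"},
                       "Accept-Encoding")
plain = cache_key("/index.html", {}, "Accept-Encoding")
print(compressed != plain)
```

Because the two keys differ, the cache stores both variants and serves each client the copy its Accept-Encoding header entitles it to.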
Logging Compression Results
When comparing the logging methods of mod_gzip and mod_deflate,
there really is no comparison. The mod_gzip logging is very robust
and configurable, and is based on the Apache log format. This allows
the mod_gzip logs to be configured in pretty much any way you want
for analysis. The default log formats provided when the module is
installed are shown below:
# LogFormat "%h %l %u %t \"%r\" %>s %b mod_gzip: \
%{mod_gzip_compression_ratio}npct." common_with_mod_gzip_info1
LogFormat "%h %l %u %t \"%r\" %>s %b mod_gzip: %{mod_gzip_result}n \
In:%{mod_gzip_input_size}n Out:%{mod_gzip_output_size}n \
Ratio:%{mod_gzip_compression_ratio}npct." common_with_mod_gzip_info2
# LogFormat "%{mod_gzip_compression_ratio}npct." mod_gzip_info1
# LogFormat "%{mod_gzip_result}n In:%{mod_gzip_input_size}n \
Out:%{mod_gzip_output_size}n Ratio:%{mod_gzip_compression_ratio}npct." mod_gzip_info2
The logging allows you to see the file's size prior to compression,
the size after compression, and the compression ratio. After tweaking
the log formats to meet your specific configuration, these can then
be added to a logging system by specifying a CustomLog in the httpd.conf
file:
CustomLog logs/gzip.log common_with_mod_gzip_info2
# CustomLog logs/gzip.log mod_gzip_info2
Logging in mod_deflate is limited to one configuration directive,
DeflateFilterNote, and this is intended to be added to an access_log
file. Be careful about doing this in your production logs, because
it may cause some log analyzers to have issues when examining your
files. It's best to start by logging compression ratios to a
separate file:
DeflateFilterNote ratio
LogFormat '"%r" %b (%{ratio}n) "%{User-agent}i"' deflate
CustomLog logs/deflate_log deflate
Performance Improvement from Compression
How much improvement can you see with compression? The difference
in measured download times on a very lightly loaded server indicates
that the time to download the Base Page (the initial HTML file)
improved by between 1.3 and 1.6 seconds across a very slow connection
when compression was used. See Figure 1.
There is a slightly slower time for the server to respond to a
client requesting a compressed page. Measurements show that the
median response time for the server averaged 0.23 seconds for the
uncompressed page and 0.27 seconds for the compressed page. However,
most Web server administrators should be willing to accept a 0.04-second
increase in response time to achieve a 1.5-second improvement in
file transfer time.
Web pages consist of more than HTML. How do improved HTML (and CSS)
download times affect overall performance? Figure 2 shows that overall
download times for the test page were 1-1.5 seconds faster when
the HTML files were compressed.
To further emphasize the value of compression, I ran a test on
a Web server to see the average compression ratio when requesting
a very large number of files. Furthermore, I wanted to determine
the effect on server response time when requesting large numbers
of compressed files simultaneously. There were 1952 HTML files in
the test directory, and I checked the results using curl across
my local LAN. (The files were the top-level HTML files from the
Linux Documentation Project. They were installed on an Apache 1.3.27
server running mod_gzip. Minimum file size was 80 bytes and maximum
file size was 99419 bytes.) See Table 3.
As expected, the "First Byte" download time was slightly
higher with the compressed files than it was with the uncompressed
files. But this difference was in milliseconds, and is hardly worth
mentioning in terms of on-the-fly compression. It is unlikely that
any user, especially dial-up users, would notice this difference
in performance.
The fact that the delivered data was reduced to 43% of the original
file size should make any Web administrator sit up and take notice.
The compression ratio for the test files ranged from no compression
for files that were less than 300 bytes, to 15% of original file
size for two of the Linux SCSI Programming HOWTOs. Compression ratios
do not increase in a linear fashion when compared to file size;
rather, compression depends heavily on the repetition of content
within a file to gain its greatest successes. The SCSI Programming
HOWTOs have a great deal of repeated characters, making them ideal
candidates for extreme compression.
Smaller files also did not compress as well as larger files for
this same reason. Fewer bytes means a lower probability of repeated
bytes, resulting in a lower compression ratio. See Table 4.
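Both effects, small files compressing poorly and repetitive files compressing extremely well, can be reproduced with a short script (a Python sketch; the synthetic file contents are invented for illustration):

```python
import gzip
import os

# A tiny file with no internal repetition, like the 80-byte test files.
tiny = os.urandom(80)
# A large file full of repeated strings, like the SCSI Programming HOWTOs.
repetitive = b"int scsi_read(void); int scsi_write(void);\n" * 2000

# Ratio of compressed size to original size (lower is better).
tiny_ratio = len(gzip.compress(tiny)) / len(tiny)
rep_ratio = len(gzip.compress(repetitive)) / len(repetitive)

print(round(tiny_ratio, 2), round(rep_ratio, 2))
```

The tiny file actually grows once gzip's fixed overhead is added, while the repetitive file collapses to a small fraction of its original size, mirroring the spread seen in the test results.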
The data shows that compression works best on files larger than
5000 bytes; after that size, average compression gains are smaller,
unless a file has a large number of repeated characters. Some people
argue that compressing files below a certain size is a wasteful
use of CPU cycles. If you agree with these folks, using the 5000-byte
value as a floor for compressing files should be a good starting
point. I, however, compress everything that comes off my servers
because I consider myself an HTTP overclocker, trying to squeeze
every last bit of download performance out of the network.
Conclusion
With a few simple commands, and a little bit of configuration,
an Apache Web server can be configured to deliver a large amount
of content in a compressed format. These benefits are not simply
limited to static pages; dynamic pages generated by PHP and other
dynamic content generators can be compressed by using the Apache
compression modules. When added to other performance tuning mechanisms
and appropriate server-side caching rules, these modules can substantially
reduce the bandwidth for a very low cost.
Stephen Pierzchala is Senior Diagnostic Analyst for Keynote
Systems in San Mateo, California. His focus is on analyzing Web
performance data and evangelizing on Web performance topics, such
as Content Compression, Caching, and Persistent Connections.