Filtering/Replacing
Advertisements with Squid and Apache
Dustin Anders
The proliferation of banner and pop-up ads can be quite annoying
when surfing the Internet. In this article, I will describe a quick
and easy way to block banner ads via Squid and Apache.
Prerequisites
The following applications are necessary for filtering ads in
this manner:
Perl 5 or greater -- available at http://www.perl.com
Apache 1.3.26 -- available at http://www.apache.org
Squid 2.5 -- available at http://www.squid-cache.org
Installation of the above packages will not be discussed in this
article. Installation documents are available on each site to help
you through the process. You will also need to download adfilter.pl,
hosts, and filters file from http://www.unixfun.com/sysadmin/adfilter/
or from the Sys Admin Web site: http://www.sysadminmag.com.
Design Notes
In this configuration, Squid is installed on a Linux machine (running
Red Hat 7.3) that will be used as a gateway for a Windows 2000 machine.
In the workstation's browser configuration, specify port 8010 as
your http gateway. Squid has been configured to bind to port 8010.
See Figure 1 for the proxy settings configuration. You could also
run Squid on a workstation and point locally (127.0.0.1) on port
8010.
The process used for filtering Web ads is as follows:
1. Web requests sent to the Squid Web proxy are redirected to
the Perl script adfilter.pl (described in this article).
2. If the requested URL is a .jpg, .bmp, or .png image file, the
URL passes through the following checks:
- A regular expression filter, which searches for directory paths
that might indicate a banner ad, such as //.*/ads/.*
- A hosts file lookup using a special hosts file that lists known
Internet ad servers (available on the Web from a company called
Smartin Designs).
3. If the URL is identified as an ad by either of the checks, the
URL is redirected to a local Apache Web server, where a replacement
graphic is substituted for the ad.
Squid Configuration
To begin, you should have Squid and Apache installed on your server.
We will make a few additions to the Squid configuration file, squid.conf,
which should be located in the conf directory of your Squid installation
directory.
We will make use of Squid's redirect feature, which allows us
to pass a few parameters to an application of our choosing. Below
are the parameters that are passed:
URL -- The requested Uniform Resource Locator.
IP ADDRESS / FQDN -- The IP address or fully qualified domain
name of the client that requested the page.
IDENT -- The identity of the user running the Web browser.
METHOD -- The request method, which is GET, POST, or HEAD.
You can also specify how many child processes Squid is to fork
for the redirect program. In this example, we redirect all URLs
to a Perl script called adfilter.pl (see Listing 1). Below are the
lines that need to be added to your squid.conf:
redirect_program /usr/local/squid/adfilter/adfilter.pl
redirect_children 10
Depending on how many machines are using your gateway for Web surfing
and the usage, you will also need to increase your child count. That
concludes all the changes that need to be made to the squid.conf.
This script requires Perl 5.x or greater to be installed on the
system. As discussed before, Squid passes several key values to
the redirect application. We will ignore all the values except the
URL.
The script supports reading in a list of expressions in a filters
file and a host file. It compares the passed-in arguments to the
filters file first. Comparison is done for image files only that
have the following extensions: .jpg, .bmp, and .png. You can add
more extensions to check by changing the URL comparison statement.
If no match is returned from the filters comparison, it will compare
the URL to the hosts file. If it finds a match in either file, it
prints out the replacement URL defined within the same line in the
matched entry. In this configuration, everything is simply redirected
to 127.0.0.1. Apache will provide a bit of redirect magic of its
own to allow us to insert an image for any request.
The filters file contains a list of regular expressions to match
the URL against. The following regular expression is a sample from
the file:
127.0.0.1 //.*/ads/.*
The hosts file contains a list of hosts that have been determined
to be ad related by a company called Smartin Designs. The hosts file
is too long to include here, but can be downloaded from:
http://www.unixfun.com/sysadmin/adfilter/
or from the Sys Admin Web site.
A sample hosts entry is:
127.0.0.1 123banners.com
Apache Configuration
The Apache configuration is relatively simple. In this configuration,
we are only using Apache to redirect all requests to an image file
on the server. We could have changed the hosts/filter files to redirect
all requests to a Fully Qualified Domain Name/file. However, since
we are making use of a regularly updated host file that redirects
requests to 127.0.0.1, we need to redirect those requests.
The Apache configuration consists of the following line that performs
the redirect:
AliasMatch ^(.*) /var/www/htdocs/1.png
Aliasmatch allows us to specify a regular expression to compare
to the URL. In this case, we are matching every request to our server.
Every URL that hits the Apache server will be redirected to a 5-byte
image file called 1.png. This image contains my initials.
Final Product
After you have downloaded the filters file and the hosts file
from http://www.unixfun.com/sysadmin/adfilter/, place them
in a subdirectory called adfilter under the Squid installation directory.
If your Squid installation directory does not exist in /usr/local/squid,
update the filter_dir variable at the top of the adfilter.pl to
reflect the proper location. With all this in place and the modifications
to the squid.conf and httpd.conf, you will need to restart Squid
and Apache. After restarting Squid, you will notice that quite a
few more processes were started. See Figure 2.
Assuming you created your own banner replacement image that was
referenced in the changes to the httpd.conf, you should see that
image when you go to a Web site, such as http://www.espn.com.
See Figure 3 for an example of the output.
Updates
As one may expect, new ad servers are popping up daily. It makes
sense to keep the hosts/filters files updated with the latest regular
expressions/entries. The Smartin Designs site at:
http://www.smartin-designs.com/downloads.htm
provides a page where you can download the latest hosts file for ad
sites. The site also provides a notification service when new files
are released. If you would like to automate the update process, you
can make use of Lynx by placing the following command in your system
crontab:
/usr/bin/lynx -dump http://www.smartin-designs.com/ \
downloads/full_127001.txt > /<ADFILTER DIR>/hosts
Unfortunately, I have not found one site that provides a filter list
for ads that's updated on a continual basis. There are quite a few
sites that have lists for pattern matches, so most likely you will
need to build out your own filter list.
Considerations
Currently, three types of images are checked to determine whether
they are banner ads. This was done primarily because the hosts database
is so large. If one checks all requests, it sufficiently slows down
browsing access.
You may want to define your own set of filters rather than using
the hosts file, which currently contains 13,000 entries. Most of
these entries can be eliminated and covered by a few simple regular
expressions. For example, the hosts file contains a lot of ad1.server.com,
ad2.server.com ... ad255.server.com. You could eliminate 254 entries
with one regular expression. Or, you could cover every ad server
of a certain type by simply specifying //ads.* in the filters file.
Another important note is that the potential exists for sites
to be blocked inadvertently. For example, if ESPN's main title bar
has an address of http://www.espn.com/banner/title.jpg, then
the regular expression, //.*/banner/.*, will block the request.
Entries in the hosts file will also cause some non-ad sites to be
blocked.
Dustin Anders, CISSP is a Senior Network Security Consultant
for a large security company. He has more than six years experience
of network and systems security experience. He spends the majority
of his free time writing Perl and PHP scripts and playing with his
son, Ian.
|