Using MySQL for Apache Logging

18.15.1 Problem

You don't want to use MySQL to log accesses for just a few pages, as shown in Recipe 18.14. You want to log all pages accesses, and you don't want to have to put logging actions in each page explicitly.

18.15.2 Solution

Tell Apache to log pages accesses to MySQL.

18.15.3 Discussion

The uses for MySQL in a web context aren't limited just to page generation and processing. You can use it to help you run the web server itself. For example, most Apache servers are set up to log a record of web requests to a file. But it's also possible to send log records to a program instead, from which you can write the records wherever you likesuch as to a database. With log records in a database rather than a flat file, the log becomes more highly structured and you can apply SQL analysis techniques to it. Log file analysis tools may be written to provide some flexibility, but often this is a matter of deciding which summaries to display and which to suppress. It's more difficult to tell a tool to display information it wasn't built to provide. With log entries in a table, you gain additional flexibility. Want to see a particular report? Write the SQL statements that produce it. To display the report in a specific format, issue the queries from within an API and take advantage of your language's output production capabilities.

By handling log entry generation and storage using separate processes, you gain some additional flexibility. Some of the possibilities are to send logs from multiple web servers to the same MySQL server, or to send different logs generated by a given web server to different MySQL servers.

This section shows how to set up web request logging from Apache into MySQL and demonstrates some summary queries you may find useful.

18.15.4 Setting Up Database Logging

Apache logging is controlled by directives in the httpd.conf configuration file. For example, a typical logging setup uses LogFormat and CustomLog directives that look like this:

LogFormat "%h %l %u %t "%r" %>s %b" common CustomLog /usr/local/apache/logs/access_log common

The LogFormat line defines a format for log records and gives it the nickname common. The CustomLog directive indicates that lines should be written in that format to the access_log file in Apache's logs directory. To set up logging to MySQL instead, use the following procedure:[4]

[4] If you're using logging directives such as TransferLog rather than LogFormat and CustomLog, you'll need to adapt the instructions in this section.

Suppose you want to record the date and time of each request, the host that issued the request, the request method and URL pathname, the status code, the number of bytes transferred, and the user agent (typically a browser or spider name). A table that includes columns for these values can be created as follows:

CREATE TABLE httpdlog ( dt DATETIME NOT NULL, # request date host VARCHAR(255) NOT NULL, # client host method VARCHAR(4) NOT NULL, # request method (GET, PUT, etc.) url VARCHAR(255) BINARY NOT NULL, # URL path status INT NOT NULL, # request status size INT, # number of bytes transferred agent VARCHAR(255) # user agent );

Most of the string columns use VARCHAR and are not case sensitive. The exception, url, is declared as a binary string as is appropriate for a server running on a system with case-sensitive filenames. If you're using a server where URL lettercase doesn't matter, you can omit the word BINARY.

The httpdlog table definition shown here doesn't include any indexes. You should add some, because otherwise any summary queries you run will slow down dramatically as the table becomes large. The choice of which columns to index will be based on the types of queries you intend to run to analyze the table contents. For example, queries to analyze the distribution of client host values will benefit from an index on the host column.

Next, you need a program to process log lines produced by Apache and insert them into the httpdlog table. The following script, httpdlog.pl, opens a connection to the MySQL server, then loops to read input lines. It parses each line into column values and inserts the result into the database. When Apache exits, it closes the pipe to the logging program. That causes httpdlog.pl to see end of file on its input, terminate the loop, disconnect from MySQL, and exit.

#! /usr/bin/perl -w # httpdlog.pl - Log Apache requests to httpdlog table use strict; use lib qw(/usr/local/apache/lib/perl); use Cookbook; my $dbh = Cookbook::connect ( ); my $sth = $dbh->prepare (qq{ INSERT INTO httpdlog (dt,host,method,url,status,size,agent) VALUES (?,?,?,?,?,?,?) }); while (<>) # loop reading input { chomp; my ($dt, $host, $method, $url, $status, $size, $agent) = split (/ /, $_); # map "-" to NULL for some columns $size = undef if $size eq "-"; $agent = undef if $agent eq "-"; $sth->execute ($dt, $host, $method, $url, $status, $size, $agent); } $dbh->disconnect ( ); exit (0);

Install the httpdlog.pl script where you want Apache to look for it. On my system, the Apache root directory is /usr/local/apache, so /usr/local/apache/bin is a reasonable installation directory. The path to this directory will be needed shortly for constructing the CustomLog directive that instructs Apache to log to the script.

httpdlog.pl assumes that input lines contain httpdlog column values delimited by tabs (to make it easy to break apart input lines), so Apache must write log entries in a matching format. The LogFormat field specifiers to produce the appropriate values are as follows:

%{%Y-%m-%d %H:%M:%S}

The date and time of the request, in MySQL's DATETIME format

%h

The host from which the request originated

%m

The request method (GET, POST, and so forth)

%U

The URL path

%>s

The status code

%b

The number of bytes transferred

%{User-Agent}i

The user agent

To define a logging format named mysql that produces these values with tabs in between, add the following LogFormat directive to your httpd.conf file:

LogFormat "%{%Y-%m-%d %H:%M:%S}t %h %m %U %>s %b %{User-Agent}i" mysql

Most of the pieces are in place now. We have a log table, a program that writes to it, and a mysql format for producing log entries. All that remains is to tell Apache to write the entries to the httpdlog.pl script. However, until you know that the output format really is correct and that the program can process log entries properly, it's premature to tell Apache to log directly to the program. To make testing and debugging a bit easier, have Apache log mysql-format entries to a file instead. That way, you can look at the file to check the output format, and you can use it as input to httpdlog.pl to verify that the program works correctly. To instruct Apache to log lines in mysql format to the file test_log in Apache's log directory, use this CustomLog directive:

CustomLog /usr/local/apache/logs/test_log mysql

Then restart Apache to enable the new logging directives. After your web server receives a few requests, take a look at the test_log file. Verify that the contents are as you expect, then feed the file to httpdlog.pl. If you're in Apache's logs directory and the bin and logs directories are both under the Apache root, the command looks like this:

% ../bin/httpdlog.pl test_log

After httpdlog.pl finishes, take a look at the httpdlog table to make sure that it looks correct. Once you're satisfied, tell Apache to send log entries directly to httpdlog.pl by modifying the CustomLog directive as follows:

CustomLog "|/usr/local/apache/bin/httpdlog.pl" mysql

The | character at the beginning of the pathname tells Apache that httpdlog.pl is a program, not a file. Restart Apache and new entries should appear in the httpdlog table as visitors request pages from your site.

Nothing you have done to this point changes any logging you may have been doing originally. For example, if you were logging to an access_log file before, you still are now. Thus, Apache will be sending entries both to the original log file and to MySQL. If that's what you want, fine. Apache doesn't care if you log to multiple destinations. But you'll use more disk space if you do. To disable file logging, comment out your original CustomLog directive by placing a # character in front of it, then restart Apache.

18.15.5 Analyzing the Log File

Now that you have Apache logging into the database, what can you do with the information? That depends on what you want to know. Here are some examples that show the kinds of questions you can use MySQL to answer easily:

18.15.6 Other Logging Issues

I've chosen a simple method for hooking Apache to MySQL, which is to write a short script that communicates with MySQL and then tell Apache to write to the script rather than to a file. This works well if you log all requests to a single file, but certainly won't be appropriate for every possible configuration that Apache is capable of. For example, if you have virtual servers defined in your httpd.conf file, you might have separate CustomLog directives defined for each of them. To log them all to MySQL, you can change each directive to write to httpdlog.pl, but then you'll have a separate logging process running for each virtual server. That brings up two issues:

Logging to a database rather than to a file allows you to bring the full power of MySQL to bear on log analysis, but it doesn't eliminate the need to think about space management. Web servers can generate a lot of activity, and log records use space regardless of whether you write them to a file or to a database. One way to save space is to expire records now and then. For example, to remove log records that are more than a year old, run the following query periodically:

DELETE FROM httpdlog WHERE dt < DATE_SUB(NOW( ),INTERVAL 1 YEAR);

Another option is to archive old records into compressible tables. (This requires that you use MyISAM tables so that you can compress them with the myisampack utility.) For example, when the date changes from September 2001 to October 2001, you know that Apache won't generate any more records with September dates and that you can move them into another table that will remain static. Create a table named httpdlog_2001_09 that has the same structure as httpdlog (including any indexes). Then transfer September's log records from httpdlog into httpdlog_2001_09 using these queries:

INSERT INTO httpdlog_2001_09 SELECT * FROM httpdlog WHERE dt >= '2001-09-01' AND dt < '2001-10-01'; DELETE FROM httpdlog WHERE dt >= '2001-09-01' AND dt < '2001-10-01';

Finally, run myisampack on httpdlog_2001_09 to compress it and make it read-only.

This strategy has the potential drawback of spreading log entries over many tables. If you want to treat the tables as a single entity so that you can run queries on your entire set of log records, create a MERGE table that includes them all. Suppose the set of tables includes the current table and tables for September 2001 through April 2002. The statement to create the MERGE table would look like this:

CREATE TABLE httpdlog_all ( dt DATETIME NOT NULL, # request date host VARCHAR(255) NOT NULL, # client host method VARCHAR(4) NOT NULL, # request method (GET, PUT, etc.) url VARCHAR(255) BINARY NOT NULL, # URL path status INT NOT NULL, # request status size INT, # number of bytes transferred agent VARCHAR(255) # user agent ) TYPE = MERGE UNION = (httpdlog, httpdlog_2001_09, httpdlog_2001_10, httpdlog_2001_11, httpdlog_2001_12, httpdlog_2002_01, httpdlog_2002_02, httpdlog_2002_03, httpdlog_2002_04);

The UNION clause should name all the tables that you want to include in the MERGE table. Note that you'll need to drop and recreate the httpdlog_all definition each time you generate a new static monthly log table. (Also, if you add an index, you'll need to add it to each of the individual tables, and recreate the MERGE table to include the index definition as well.)

Reports run against the httpdlog_all table will be based on all log entries. To produce monthly reports, just refer to the appropriate individual table.

With respect to disk space consumed by web logging activity, be aware that if you have query logging enabled for the MySQL server, each request will be written to the httpdlog table and also to the query log. Thus, you may find disk space disappearing more quickly than you expect, so it's a good idea to have some kind of log rotation or expiration set up for the MySQL server.

Категории