Apache Cookbook: Solutions and Examples for Apache Administrators

Apache can, and usually does, record information about every request it processes. Controlling how this is done and extracting useful information out of these logs after the fact is at least as important as gathering the information in the first place.

The logfiles may record two types of data: information about the request itself, and possibly one or more messages about abnormal conditions encountered during processing (such as file permission problems). You, as the webmaster, have a limited amount of control over the logging of error conditions, but a great deal of control over the format and amount of information logged about request processing (activity logging). The server may log activity information about a request in multiple formats in multiple logfiles, but it will only record a single copy of an error message.

One aspect of activity logging you should be aware of is that the log entry is formatted and written after the request has been completely processed. This means that the interval between the time a request begins and when it finishes may be long enough to make a difference.

For example, if your logfiles are rotated while a particularly large file is being downloaded, the log entry for the request will appear in the new logfile when the request completes, rather than in the old logfile when the request was started. In contrast, an error message is written to the error log as soon as it is encountered.

The web server will continue to record information in its logfiles as long as it's running. This can result in extremely large logfiles for a busy site and uncomfortably large ones even for a modest site. To keep the file sizes from growing ever larger, most sites rotate or roll over their logfiles on a semi-regular basis. Rolling over a logfile simply means persuading the server to stop writing to the current file and start recording to a new one. Due to Apache's determination to see that no records are lost, cajoling it to do this according to a specific timetable may require a bit of effort; some of the recipes in this chapter cover how to accomplish the task successfully and reliably (see Recipe 3.8 and Recipe 3.9).
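As a rough illustration of rolling over on a timetable, one commonly used approach is to pipe the activity log through the rotatelogs program that ships with Apache. The sketch below assumes rotatelogs is installed at /usr/local/apache2/bin/rotatelogs and that the logs live under /var/log/apache; both paths are placeholders to adjust for your installation.

```apache
# Nickname for the common log format (the stock httpd.conf usually has this).
LogFormat "%h %l %u %t \"%r\" %>s %b" common

# Hand each entry to rotatelogs, which starts a new file every 86400
# seconds (24 hours) and appends the period's start time to the filename.
# Both paths below are assumptions, not required locations.
CustomLog "|/usr/local/apache2/bin/rotatelogs /var/log/apache/access_log 86400" common
```

Because the server writes to the pipe rather than to a file, it does not need to be restarted when a new logfile is started.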

The log declaration directives, CustomLog and ErrorLog, can appear inside <VirtualHost> containers, outside them (in what's called the main or global server, or sometimes the global scope), or both. Entries will only be logged in one set or the other; if a <VirtualHost> container applies to the request or error and has an applicable log directive, the message will be written only there and won't appear in any globally declared files. On the other hand, if no <VirtualHost> log directive applies, the server will fall back on logging the entry according to the global directives.

However, whichever scope is used for determining what logging directives to use, all CustomLog directives in that scope are processed and treated independently. That is, if you have one CustomLog directive in the global scope and two inside a <VirtualHost> container, both of the directives inside the container will be used for requests handled by that virtual host, and the global one will not. Similarly, if a CustomLog directive uses the env= option, it has no effect on what requests will be logged by other CustomLog directives in the same scope.
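The scoping rules can be sketched with a small configuration. The hostnames, log paths, and the image-request variable name are all invented for the example.

```apache
LogFormat "%h %l %u %t \"%r\" %>s %b" common

# Global scope: used only when no <VirtualHost> log directive applies.
ErrorLog logs/error_log
CustomLog logs/access_log common

# On Apache 2.2 and earlier, name-based hosts also need: NameVirtualHost *:80
<VirtualHost *:80>
    ServerName www.example.com
    ErrorLog logs/example-error_log

    # Mark requests for GIF files with a (hypothetical) environment variable.
    SetEnvIf Request_URI "\.gif$" image-request

    # Both directives are processed independently for this host:
    # the first records every request, the second only the marked ones.
    # The env= test on the second does not remove anything from the first.
    CustomLog logs/example-access_log common
    CustomLog logs/example-images_log common env=image-request
</VirtualHost>

<VirtualHost *:80>
    ServerName plain.example.com
    # No log directives here, so requests and errors for this host
    # fall back to the global logs/access_log and logs/error_log.
</VirtualHost>
```

With this configuration, a request for a GIF on www.example.com should appear in both of that host's activity logs and not in the global logs/access_log.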

Activity logging has been around since the Web first appeared, and it didn't take long for the original users to decide what items of information they wanted logged. The result is called the common log format (CLF). In Apache terms, this format is:

"%h %l %u %t \"%r\" %>s %b"

That is, it logs the client's hostname or IP address, the name of the user on the client (as defined by RFC 1413 and if Apache has been told to snoop for it with an IdentityCheck On directive), the username with which the client authenticated (if weak access controls are being imposed by the server), the time at which the request was received, the actual HTTP request line, the final status of the server's processing of the request, and the number of bytes of content that were sent in the server's response.
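To make the field order concrete, the fragment below declares the format under a nickname and shows what one entry might look like. The log path is a placeholder and the sample entry is invented purely for illustration.

```apache
# Declare the common log format and write activity entries with it.
LogFormat "%h %l %u %t \"%r\" %>s %b" common
CustomLog logs/access_log common

# A made-up entry in this format:
#   192.0.2.10 - alice [10/Oct/2000:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326
#
#   %h   192.0.2.10                      client hostname or IP address
#   %l   -                               RFC 1413 identity (not looked up)
#   %u   alice                           authenticated username
#   %t   [10/Oct/2000:13:55:36 -0700]    time the request was received
#   %r   GET /index.html HTTP/1.0        request line
#   %>s  200                             final status
#   %b   2326                            bytes of content sent
```

A field for which the server has no value, such as %l here, is logged as a hyphen.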

Before long, as the HTTP protocol advanced, the common log format was found to be wanting, so an enhanced format, called the combined log format, was created:

"%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\""

The two additions were the Referer (it's spelled incorrectly in the specifications) and the User-agent. These are the URL of the page that linked to the document being requested, and the name and version of the browser or other client software making the request.

Both of these formats are widely used, and many logfile analysis tools assume log entries are made in one or the other.

The Apache web server's standard activity logging module allows you to create your own formats; it is highly configurable and is called (surprise!) mod_log_config. Apache 2.0 has an additional module, mod_logio, which enhances mod_log_config with the ability to log the number of bytes actually transmitted or received over the network. If these don't meet your requirements, though, there are a significant number of third-party modules available from the module registry at http://modules.apache.org/.
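As a sketch of a custom format, the fragment below extends the combined format with the two byte counters that mod_logio provides, %I (bytes received) and %O (bytes sent). The nickname combinedio and the log path are choices made for this example; the directives are wrapped in <IfModule> so the configuration still loads if mod_logio is absent.

```apache
<IfModule mod_logio.c>
    # Combined format plus bytes received (%I) and bytes sent (%O),
    # both counted on the wire, including request and response headers.
    LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %I %O" combinedio
    CustomLog logs/access_io_log combinedio
</IfModule>
```

Note that %b counts only the response body, so %O will normally be larger than %b for the same request.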

The status code entry in the common and combined log formats deserves some mention, because its meaning is not immediately clear. The status codes are defined by the HTTP protocol specification documents (currently RFC 2616 at ftp://ftp.isi.edu/in-notes/rfc2616.txt). Table 3-1 gives a brief description of the codes defined at the time of this writing.

Table 3-1. HTTP status codes

Code  Abstract
----  -------------------------------

Informational 1xx
100   Continue
101   Switching protocols

Successful 2xx
200   OK
201   Created
202   Accepted
203   Nonauthoritative information
204   No content
205   Reset content
206   Partial content

Redirection 3xx
300   Multiple choices
301   Moved permanently
302   Found
303   See other
304   Not modified
305   Use proxy
306   (Unused)
307   Temporary redirect

Client error 4xx
400   Bad request
401   Unauthorized
402   Payment required
403   Forbidden
404   Not found
405   Method not allowed
406   Not acceptable
407   Proxy authentication required
408   Request timeout
409   Conflict
410   Gone
411   Length required
412   Precondition failed
413   Request entity too large
414   Request-URI too long
415   Unsupported media type
416   Requested range not satisfiable
417   Expectation failed

Server error 5xx
500   Internal server error
501   Not implemented
502   Bad gateway
503   Service unavailable
504   Gateway timeout
505   HTTP version not supported

The one-line descriptions shown in Table 3-1 are sometimes terse to the point of being confusing, but they should at least give you an inkling of what the server thinks happened. The first digit is used to separate the codes into classes or categories; for example, all codes starting with 5 indicate there is a problem handling the request, and the server thinks the problem is on its end rather than on the client's end.

For a complete description of the various status codes, you'll need to read a document about the HTTP protocol or the RFC itself.
