PHP Cookbook: Solutions and Examples for PHP Programmers

13.0. Introduction

Most of the time, PHP is part of a web server, sending content to browsers. Even when you run it from the command line, it usually performs a task and then prints some output. PHP can also be useful, however, playing the role of a web client, retrieving URLs and then operating on the content. Most recipes in this chapter cover retrieving URLs and processing the results, although there are a few other tasks in here as well, such as cleaning up URLs and some JavaScript-related operations.

There are many ways to retrieve a remote URL in PHP. Choosing one method over another depends on your needs for simplicity, control, and portability. The three methods discussed in this chapter are standard file functions, the cURL extension, and the HTTP_Request class from PEAR. These three methods can generally do everything you need, and at least one of them should be available to you whatever your server configuration or ability to install custom extensions. Other ways to retrieve remote URLs include the pecl_http extension (http://pecl.php.net/package/pecl_http), which, while still in development, offers some promising features, and using the fsockopen( ) function to open a socket over which you send an HTTP request that you construct piece by piece.
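To make the "piece by piece" approach concrete, here is a minimal sketch of a GET request sent over a socket opened with fsockopen( ). The function names build_get_request( ) and fetch_with_socket( ) are hypothetical helpers invented for this illustration, and the sketch assumes plain HTTP on port 80 with no redirects, proxies, or SSL.

```php
<?php
// Hypothetical helper: assemble the text of a minimal HTTP/1.0 GET request.
// HTTP/1.0 is used so the server does not reply with chunked encoding.
function build_get_request(string $host, string $path = '/'): string
{
    return "GET $path HTTP/1.0\r\n"
         . "Host: $host\r\n"
         . "Connection: close\r\n"
         . "\r\n"; // A blank line ends the request headers.
}

// Hypothetical helper: send the request over a socket and return the raw
// response (status line, headers, blank line, then body).
function fetch_with_socket(string $host, string $path = '/', int $port = 80, float $timeout = 10.0): string
{
    $socket = @fsockopen($host, $port, $errno, $errstr, $timeout);
    if ($socket === false) {
        throw new RuntimeException("Could not connect to $host:$port: $errstr ($errno)");
    }

    fwrite($socket, build_get_request($host, $path));
    $response = stream_get_contents($socket);
    fclose($socket);

    if ($response === false) {
        throw new RuntimeException("Could not read a response from $host:$port");
    }
    return $response;
}
```

Calling fetch_with_socket('www.example.com', '/people/') returns the headers and body together, so you have to split them and handle any redirect yourself. That extra work is the price of this much control.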

Using a standard file function such as file_get_contents( ) is simple and convenient. It automatically follows redirects, so if you use this function to retrieve the directory http://www.example.com/people and the server redirects you to http://www.example.com/people/, you'll get the contents of the directory index page, not a message telling you that the URL has moved. Standard file functions also work with both HTTP and FTP. The downside to this method is that it requires the allow_url_fopen configuration directive to be turned on.
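As a rough sketch of this approach, the following wraps file_get_contents( ) in a function that first checks allow_url_fopen. The name fetch_with_file_functions( ) is a hypothetical helper made up for this example, and the error handling is only one reasonable choice.

```php
<?php
// Hypothetical helper: retrieve a URL with the standard file functions.
function fetch_with_file_functions(string $url): string
{
    // The file functions can open URLs only when allow_url_fopen is on.
    if (!filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)) {
        throw new RuntimeException('allow_url_fopen is turned off');
    }

    // file_get_contents() returns false on failure; @ silences the warning
    // so the failure can be reported as an exception instead.
    $body = @file_get_contents($url);
    if ($body === false) {
        throw new RuntimeException("Could not retrieve $url");
    }
    return $body;
}
```

With that in place, fetch_with_file_functions('http://www.example.com/people') should return the directory index page after the redirect is followed for you.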

The cURL extension is a powerful jack-of-all-request-trades. It relies on the popular libcurl (http://curl.haxx.se/) to provide a fast, configurable mechanism for handling a wide variety of network requests. If this extension is available on your server, we recommend you use it.
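A minimal cURL request might look like the sketch below. The function name fetch_with_curl( ) and the 10-second default timeout are assumptions made for this illustration; the options shown are a small sample of what the extension offers.

```php
<?php
// Hypothetical helper: retrieve a URL with the cURL extension.
function fetch_with_curl(string $url, int $timeoutSeconds = 10): string
{
    if (!function_exists('curl_init')) {
        throw new RuntimeException('The cURL extension is not loaded');
    }

    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true,            // Return the body instead of printing it.
        CURLOPT_FOLLOWLOCATION => true,            // Follow redirects, as the file functions do.
        CURLOPT_TIMEOUT        => $timeoutSeconds, // Give up after this many seconds.
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        throw new RuntimeException('cURL request failed: ' . curl_error($ch));
    }

    // The handle is released when $ch goes out of scope.
    return $body;
}
```

Unlike the file functions, cURL does not follow redirects unless you ask it to, which is why CURLOPT_FOLLOWLOCATION is set explicitly here.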

If allow_url_fopen is turned off and cURL is not available, the PEAR HTTP_Request module saves the day. Like all PEAR modules, it's plain PHP, so if you can save a PHP file on your server, you can use it. HTTP_Request supports just about anything you'd like to do when requesting a remote URL, including modifying request headers and body, using an arbitrary method, and retrieving response headers.
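The order of preference described above can be sketched as a small decision function. Both function names and the string labels they return are hypothetical, chosen only for this illustration; the sketch picks a method but does not load or call HTTP_Request itself.

```php
<?php
// Hypothetical helper: choose a retrieval method following the preference
// described above: cURL if available, then the standard file functions if
// allow_url_fopen is on, and PEAR's HTTP_Request as the fallback.
function choose_retrieval_method(bool $curlLoaded, bool $allowUrlFopen): string
{
    if ($curlLoaded) {
        return 'curl';
    }
    if ($allowUrlFopen) {
        return 'file';
    }
    return 'http_request';
}

// Hypothetical helper: inspect the running PHP configuration and apply
// the same decision to it.
function detect_retrieval_method(): string
{
    return choose_retrieval_method(
        extension_loaded('curl'),
        filter_var(ini_get('allow_url_fopen'), FILTER_VALIDATE_BOOLEAN)
    );
}
```

Keeping the decision separate from the detection makes it easy to check each combination of settings without changing your server configuration.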

Recipes 13.1 through 13.7 explain how to make various kinds of HTTP requests, tweaking headers, method, body, and timing. Recipe 13.8 helps you go behind the scenes of an HTTP request to examine the headers in a request and response. If a request you're making from a program isn't giving you the results you're looking for, examining the headers often provides clues as to what's wrong.

Once you've retrieved the contents of a web page into a program, use Recipes 13.9 through 13.14 to help you manipulate those page contents. Recipe 13.9 demonstrates how to mark up certain words in a page with blocks of color. This technique is useful for highlighting search terms, for example. Recipe 13.11 provides a function to find all the links in a page. This is an essential building block for a web spider or a link checker. Converting between plain text and HTML is covered in Recipes 13.12 and 13.13. Recipe 13.14 shows how to remove all HTML and PHP tags from a web page.

Recipes 13.15 and 13.16 discuss how PHP and JavaScript can work together. Recipe 13.15 explores using PHP to respond to requests made by JavaScript, in which you have to be concerned about caching and using alternate content types. Recipe 13.16 provides a full-fledged example of PHP-JavaScript integration using the popular and powerful Dojo toolkit.

Two sample programs use the link extractor from Recipe 13.11. The program in Recipe 13.17 scans the links in a page and reports which are still valid, which have been moved, and which no longer work. The program in Recipe 13.18 reports on the freshness of links. It tells you when a linked-to page was last modified and if it's been moved.
