Downloading a Web Page

The simplest and most common task for a web client application is fetching the contents of a web page. The client connects to the server, sends an HTTP GET request, and receives an HTTP response containing the requested page.

3.1.1. How Do I Do That?

Here's where you can begin to experience the usefulness of Twisted's built-in protocol support. The twisted.web package includes a complete HTTP implementation, saving you the work of developing the necessary Protocol and ClientFactory classes. Furthermore, it includes utility functions that allow you to make an HTTP request with a single function call. To fetch the contents of a web page, use the function twisted.web.client.getPage. Example 3-1 is a Python script called webcat.py, which fetches a URL that you specify.

Example 3-1. webcat.py

    from twisted.web import client
    from twisted.internet import reactor
    import sys

    def printPage(data):
        # Called with the page contents once the download has finished
        print data
        reactor.stop()

    def printError(failure):
        # Called with a Failure if the request could not be completed
        print >> sys.stderr, "Error:", failure.getErrorMessage()
        reactor.stop()

    if len(sys.argv) == 2:
        url = sys.argv[1]
        client.getPage(url).addCallback(
            printPage).addErrback(
            printError)
        reactor.run()
    else:
        print "Usage: webcat.py <URL>"

Give webcat.py a URL as its first argument, and it will fetch and print the contents of the page:

$ python webcat.py http://www.oreilly.com/

oreilly.com -- Welcome to O'Reilly Media, Inc. -- computer books, software conferences, online publishing...

3.1.2. How Does That Work?

The printPage and printError functions are simple event handlers that print the downloaded page contents or an error message, respectively. The most important line in Example 3-1 is the call to client.getPage(url). This function returns a Deferred object that will be called back with the contents of the page once it has been completely downloaded.

Notice how the callbacks are added to the Deferred in a single line. This is possible because addCallback and addErrback both return a reference to their Deferred object. Therefore, the statements:

    d = deferredFunction()
    d.addCallback(resultHandler)
    d.addErrback(errorHandler)

can be expressed as:

    deferredFunction().addCallback(resultHandler).addErrback(errorHandler)

Which of these two forms is more readable is probably a matter of personal opinion, but the latter is an idiom that appears frequently in Twisted code.
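Why the chained form works can be seen in a toy model. The sketch below is not Twisted's Deferred: MiniDeferred is a hypothetical stand-in that only records handlers, written to show that returning self from addCallback and addErrback makes the two forms equivalent. Real Deferreds keep callbacks and errbacks in a single ordered chain; the two separate lists here are a simplification.

```python
class MiniDeferred(object):
    """Hypothetical stand-in for a Deferred; it only records handlers."""

    def __init__(self):
        self.callbacks = []
        self.errbacks = []

    def addCallback(self, handler):
        self.callbacks.append(handler)
        return self  # returning self is what makes chaining possible

    def addErrback(self, handler):
        self.errbacks.append(handler)
        return self


def resultHandler(result):
    return result


def errorHandler(failure):
    return failure


# Form 1: separate statements
d1 = MiniDeferred()
d1.addCallback(resultHandler)
d1.addErrback(errorHandler)

# Form 2: chained calls on one line
d2 = MiniDeferred().addCallback(resultHandler).addErrback(errorHandler)

# Both objects end up with the same handlers registered
same_handlers = (d1.callbacks == d2.callbacks and
                 d1.errbacks == d2.errbacks)
print(same_handlers)
```

If addCallback returned None instead of self, the second form would fail with an AttributeError on the call to addErrback.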

3.1.3. What About...

... writing the page to disk as it's being downloaded? One disadvantage of the webcat.py script in Example 3-1 is that it loads the entire contents of the downloaded page into memory, which could present a problem if you're downloading a large file. A better approach might be to write the data to a temporary file on disk as it's being downloaded, and then read the contents back from the temp file once the download is complete.
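The idea can be sketched without any networking. In the example below, which uses only the standard library, the names spoolToTempFile and readBack are hypothetical, and the list of byte strings merely stands in for chunks arriving from a server. Each chunk is written to a temp file as it arrives, and the file is read back in blocks afterwards.

```python
import os
import tempfile


def spoolToTempFile(chunks):
    """Write each chunk to a temp file as it arrives; return the file's name."""
    fd, filename = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        for chunk in chunks:
            f.write(chunk)  # only one chunk is held in memory at a time
    return filename


def readBack(filename, blocksize=4096):
    """Yield the file's contents in blocks, then delete the file."""
    try:
        with open(filename, "rb") as f:
            while True:
                block = f.read(blocksize)
                if not block:
                    break
                yield block
    finally:
        os.unlink(filename)


# Hypothetical data standing in for chunks received from a server
incoming = [b"<html>", b"<body>hello</body>", b"</html>"]
name = spoolToTempFile(incoming)

# Joined here only to check the round trip; a real consumer would
# process each block as it is read.
page = b"".join(readBack(name))
print(page == b"<html><body>hello</body></html>")
```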

twisted.web.client includes downloadPage, a function that is similar to getPage but that writes data to a file. Call downloadPage with a URL as the first argument, and a filename or file object as the second. The script webcat2.py in Example 3-2 does this.

Example 3-2. webcat2.py

    from twisted.web import client
    import os
    import tempfile

    def downloadToTempFile(url):
        """
        Given a URL, returns a Deferred that will be called back with
        the name of a temporary file containing the downloaded data.
        """
        tmpfd, tempfilename = tempfile.mkstemp()
        os.close(tmpfd)
        return client.downloadPage(url, tempfilename).addCallback(
            returnFilename, tempfilename)

    def returnFilename(result, filename):
        return filename

    if __name__ == "__main__":
        import sys
        from twisted.internet import reactor

        def printFile(filename):
            for line in open(filename, 'rb'):
                sys.stdout.write(line)
            os.unlink(filename) # delete file once we're done with it
            reactor.stop()

        def printError(failure):
            print >> sys.stderr, "Error:", failure.getErrorMessage()
            reactor.stop()

        if len(sys.argv) == 2:
            url = sys.argv[1]
            downloadToTempFile(url).addCallback(
                printFile).addErrback(
                printError)
            reactor.run()
        else:
            print "Usage: %s <URL>" % sys.argv[0]

The downloadToTempFile function in Example 3-2 returns the Deferred that results from calling twisted.web.client.downloadPage. Before returning it, downloadToTempFile adds returnFilename as a callback to this Deferred, with the temp filename as an additional argument. This means that when the result of downloadPage comes in, the Deferred will call returnFilename with that result as the first argument and the filename as the second argument.

Example 3-2 registers another callback for the result of downloadToTempFile. Remember that the Deferred returned from downloadToTempFile already has returnFilename as a callback handler. Therefore, when the result comes in, returnFilename will be called first. The return value of returnFilename (the filename) will then be passed to printFile.
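The hand-off between the two callbacks can be modelled in plain Python. This is a simplified sketch, not how Twisted implements Deferred: runChain is a hypothetical helper, the filename is made up, printFile is replaced by a stand-in that records its argument, and the download result is assumed to be None. It shows the two rules at work: extra arguments given to addCallback are passed after the result, and each callback's return value becomes the next callback's input.

```python
def runChain(result, chain):
    """Feed result through (function, extraArgs) pairs in order."""
    for func, extraArgs in chain:
        # The incoming result is always the first argument; any extra
        # arguments registered with the callback follow it.
        result = func(result, *extraArgs)
    return result


def returnFilename(result, filename):
    # Ignores the download result and passes the filename down the chain
    return filename


printed = []


def printFile(filename):
    # Stand-in for Example 3-2's printFile; records what it was called with
    printed.append(filename)


chain = [
    (returnFilename, ("/tmp/example.tmp",)),  # hypothetical temp filename
    (printFile, ()),
]

# None stands in for whatever the download itself produced (an assumption)
runChain(None, chain)
print(printed)
```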
