Checking Whether a Page Has Changed
One popular kind of HTTP application is the RSS (Really Simple Syndication) aggregator, which downloads news items or blog posts in RSS (or Atom) format. RSS aggregators download a new copy of each feed at regular intervals, typically once an hour. This can waste a lot of the feed publisher's bandwidth, though: the contents of the feed may change infrequently, which means that the client will be downloading the same data over and over again.
To prevent this waste of network resources, RSS aggregators (and other applications that request the same page multiple times) are encouraged to use a conditional HTTP GET request. By including conditional HTTP headers with a request, a client instructs the server to return the page data only if certain conditions are met. And, of course, one of those conditions might be whether the page has been modified since it was last checked.
3.4.1. How Do I Do That?
Keep track of the headers returned the first time you download the page. Look for either an ETag header, which identifies the unique revision of the page, or the Last-Modified header, which gives the page's modification time. The next time you request the page, send the headers If-None-Match, with the ETag value, and If-Modified-Since, with the Last-Modified value. If the server supports conditional GET requests, it will return a 304 Not Modified response if the page has not been modified since the last request.
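The header bookkeeping described above can be sketched on its own, independent of any HTTP library. This is a minimal Python 3 illustration, not part of updatecheck.py; the function name build_conditional_headers and the plain-dictionary representation of the saved headers are assumptions made for the example.

```python
def build_conditional_headers(saved_headers):
    """Map validators saved from an earlier response to request headers.

    saved_headers is assumed to be a plain dict of header name -> value.
    """
    # Header names are case-insensitive, so normalize before looking them up.
    lowered = {name.lower(): value for name, value in saved_headers.items()}
    conditional = {}
    if lowered.get("etag"):
        conditional["If-None-Match"] = lowered["etag"]
    if lowered.get("last-modified"):
        conditional["If-Modified-Since"] = lowered["last-modified"]
    return conditional
```

If the first response carried neither header, the function returns an empty dictionary and the next request is simply an ordinary, unconditional GET.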
The getPage and downloadPage functions provided by twisted.web.client are handy, but they don't allow for the level of control necessary to use conditional requests. Therefore, you'll need to use the slightly lower-level HTTPClientFactory interface. Example 3-5 demonstrates using HTTPClientFactory to test whether a page has been updated.
Example 3-5. updatecheck.py
from twisted.web import client

class HTTPStatusChecker(client.HTTPClientFactory):

    def __init__(self, url, headers=None):
        client.HTTPClientFactory.__init__(self, url, headers=headers)
        self.status = None
        self.deferred.addCallback(
            lambda data: (data, self.status, self.response_headers))

    def noPage(self, reason): # called for non-200 responses
        if self.status == '304': # Page hadn't changed
            client.HTTPClientFactory.page(self, '')
        else:
            client.HTTPClientFactory.noPage(self, reason)

def checkStatus(url, contextFactory=None, *args, **kwargs):
    scheme, host, port, path = client._parse(url)
    factory = HTTPStatusChecker(url, *args, **kwargs)
    if scheme == 'https':
        from twisted.internet import ssl
        if contextFactory is None:
            contextFactory = ssl.ClientContextFactory()
        reactor.connectSSL(host, port, factory, contextFactory)
    else:
        reactor.connectTCP(host, port, factory)
    return factory.deferred

def handleFirstResult(result, url):
    data, status, headers = result
    nextRequestHeaders = {}
    eTag = headers.get('etag')
    if eTag:
        nextRequestHeaders['If-None-Match'] = eTag[0]
    modified = headers.get('last-modified')
    if modified:
        nextRequestHeaders['If-Modified-Since'] = modified[0]
    return checkStatus(url, headers=nextRequestHeaders).addCallback(
        handleSecondResult)

def handleSecondResult(result):
    data, status, headers = result
    print 'Second request returned status %s:' % status,
    if status == '200':
        print 'Page changed (or server does not support conditional requests).'
    elif status == '304':
        print 'Page is unchanged.'
    else:
        print 'Unexpected Response.'
    reactor.stop()

def handleError(failure):
    print "Error", failure.getErrorMessage()
    reactor.stop()

if __name__ == "__main__":
    import sys
    from twisted.internet import reactor

    url = sys.argv[1]
    checkStatus(url).addCallback(
        handleFirstResult, url).addErrback(
        handleError)
    reactor.run()
Run updatecheck.py from the command line with a web URL as the first argument. It will download the page once, and then request it again using a conditional GET. It then reports whether the second response was a 304, which means that the server understood the conditional headers and determined that the page had not changed. It's fairly typical for servers to support conditional GET requests for static files, such as RSS feeds, but not for dynamically generated content, such as a site's home page:
$ python updatecheck.py http://slashdot.org/slashdot.rss
Second request returned status 304: Page is unchanged.
$ python updatecheck.py http://slashdot.org/
Second request returned status 200: Page changed (or server does not support conditional requests).
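To see why the two URLs behave differently, it can help to look at the decision from the server's side. The following is a simplified Python 3 sketch of how a server that supports conditional GET might choose between 200 and 304. The function name conditional_status and its arguments are hypothetical, and real servers handle more cases (weak ETags, for example) than this illustration does.

```python
from email.utils import parsedate_to_datetime


def conditional_status(request_headers, current_etag, last_modified):
    """Return 304 if the client's saved copy is still current, else 200.

    last_modified is assumed to be a timezone-aware datetime.
    """
    if_none_match = request_headers.get("If-None-Match")
    if if_none_match is not None:
        # When an ETag condition is present it decides the outcome alone.
        candidates = [tag.strip() for tag in if_none_match.split(",")]
        matched = "*" in candidates or current_etag in candidates
        return 304 if matched else 200

    if_modified_since = request_headers.get("If-Modified-Since")
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            return 304 if last_modified <= since else 200
        except (TypeError, ValueError):
            # An unparseable or incomparable date is ignored.
            return 200

    # No conditional headers at all: always send the full page.
    return 200
```

A server that generates a page dynamically often has no stable ETag or modification time to compare against, so it ends up on the final branch and always returns 200.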
3.4.2. How Does That Work?
The HTTPStatusChecker class is a subclass of client.HTTPClientFactory. It does a couple of notable things. During initialization, it adds an additional callback to self.deferred, using a lambda function. This anonymous function will catch the result of self.deferred before it gets passed to any external callback handlers. It will then replace this result (the downloaded data) with a tuple containing more information: the data, the HTTP status code, and self.response_headers, which is a dictionary of the headers returned with the response.
HTTPStatusChecker also overrides the noPage method, which HTTPClientFactory calls to indicate an unsuccessful response code. If the response status is 304 (the Not Modified status code), the noPage method calls HTTPClientFactory.page, which indicates a successful response, instead of the original noPage method. For any other status code, the noPage in HTTPStatusChecker passes the call on to the overridden noPage in HTTPClientFactory. In this way, it prevents a 304 response from being considered an error.
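The same idea, treating a 304 as a success rather than a failure, shows up outside Twisted too. As a rough analogy only, here is a Python 3 sketch using the standard library's urllib, which raises HTTPError for a 304 by default. The function name fetch_if_changed and its opener parameter (included so the function can be exercised without a network connection) are assumptions made for this example.

```python
import urllib.error
import urllib.request


def fetch_if_changed(url, conditional_headers, opener=urllib.request.urlopen):
    """Return (data, status), reporting a 304 as success with empty data."""
    request = urllib.request.Request(url, headers=conditional_headers)
    try:
        with opener(request) as response:
            return response.read(), response.status
    except urllib.error.HTTPError as error:
        if error.code == 304:
            # Not Modified: the saved copy is still good, so this is not
            # really an error. Hand back an empty body, much as noPage
            # in the example above calls page(self, '').
            return b"", 304
        # Every other error status is passed along unchanged.
        raise
```

As in HTTPStatusChecker, only the 304 case is intercepted; everything else keeps its normal error handling.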
The checkStatus function takes a URL and parses it using the twisted.web.client._parse utility function. It looks at the parts of the URL, gets the hostname it needs to connect to, and whether it's using HTTP (which runs over straight TCP) or HTTPS (which runs over SSL, and establishes the connection using reactor.connectSSL). Next, checkStatus creates an HTTPStatusChecker factory object, and opens the connection. All this code is basically lifted from twisted.web.client.getPage and modified to use the HTTPStatusChecker factory instead of the vanilla HTTPClientFactory.
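Because client._parse is a private helper, its exact behavior is not something to rely on. The pieces checkStatus needs from it can be approximated with the standard library, as in this Python 3 sketch. The function name split_url and the DEFAULT_PORTS table are assumptions for illustration, not Twisted's implementation.

```python
from urllib.parse import urlsplit

# Assumed defaults for the two schemes the example cares about.
DEFAULT_PORTS = {"http": 80, "https": 443}


def split_url(url):
    """Return (scheme, host, port, path) for an http or https URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Use the explicit port if the URL has one, else the scheme's default.
    # Any other scheme raises KeyError in this sketch.
    port = parts.port or DEFAULT_PORTS[scheme]
    # An empty path means the root of the site.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return scheme, parts.hostname, port, path
```

The scheme tells the caller whether to connect over plain TCP or SSL, while the host and port say where to connect.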
When updatecheck.py runs, it calls checkStatus, setting handleFirstResult as the callback handler. handleFirstResult, in turn, makes a second request using the If-None-Match and If-Modified-Since conditional headers, setting handleSecondResult as the callback handler. The handleSecondResult function reports whether the server returned a 304 response, and then stops the reactor.
handleFirstResult actually returns the Deferred from the second call to checkStatus, with handleSecondResult attached as its callback. This allows handleError, the error handler function assigned to the first call to checkStatus, to handle any errors that come up in the second call to checkStatus as well.