Downloading Usenet Articles

The core of Usenet, of course, is downloading and reading articles. This lab shows how to learn which articles are available in a group, and then download the most recent ones.

9.2.1. How Do I Do That?

Call NNTPClient's fetchGroup method to get the article count and first and last message numbers for a newsgroup. Then call fetchArticle to download each article. Example 9-2 demonstrates the technique.

Example 9-2. nntpdownload.py

from twisted.news import nntp
from twisted.internet import protocol, defer
import email

class NNTPGroupDownloadProtocol(nntp.NNTPClient):

    def connectionMade(self):
        nntp.NNTPClient.connectionMade(self)
        self.fetchGroup(self.factory.newsgroup)

    def gotGroup(self, groupInfo):
        articleCount, first, last, groupName = groupInfo
        first = int(first)
        last = int(last)
        # Go back articleCount articles, but never past the first one
        start = max(first, last - self.factory.articleCount + 1)
        self.articlesToFetch = range(start, last + 1)
        self.articleCount = len(self.articlesToFetch)
        self.fetchNextArticle()

    def fetchNextArticle(self):
        if self.articlesToFetch:
            nextArticleIdx = self.articlesToFetch.pop(0)
            # The trailing comma keeps "OK" on the same line
            print "Fetching article %i of %i..." % (
                self.articleCount - len(self.articlesToFetch),
                self.articleCount),
            self.fetchArticle(nextArticleIdx)
        else:
            # all done
            self.quit()
            self.factory.deferred.callback(0)

    def gotArticle(self, article):
        print "OK"
        self.factory.handleArticle(article)
        self.fetchNextArticle()

    def getArticleFailed(self, errorMessage):
        # Report the failure, then move on to the next article
        print errorMessage
        self.fetchNextArticle()

    def getGroupFailed(self, errorMessage):
        self.factory.deferred.errback(Exception(errorMessage))
        self.quit()
        self.transport.loseConnection()

    def connectionLost(self, error):
        if not self.factory.deferred.called:
            self.factory.deferred.errback(error)

class NNTPGroupDownloadFactory(protocol.ClientFactory):
    protocol = NNTPGroupDownloadProtocol

    def __init__(self, newsgroup, outputfile, articleCount=10):
        self.newsgroup = newsgroup
        self.articleCount = articleCount
        self.output = outputfile
        self.deferred = defer.Deferred()

    def handleArticle(self, articleData):
        parsedMessage = email.message_from_string(articleData)
        self.output.write(parsedMessage.as_string(unixfrom=True))
        # Two blank lines separate this article from the next one
        self.output.write('\n\n')

if __name__ == "__main__":
    from twisted.internet import reactor
    import sys

    def handleError(error):
        print >> sys.stderr, error.getErrorMessage()
        reactor.stop()

    if len(sys.argv) != 4:
        print >> sys.stderr, \
            "Usage: %s nntpserver newsgroup outputfile" % sys.argv[0]
        sys.exit(1)

    server, newsgroup, outfile = sys.argv[1:4]
    factory = NNTPGroupDownloadFactory(newsgroup, open(outfile, 'w+b'))
    factory.deferred.addCallback(
        lambda _: reactor.stop()).addErrback(
        handleError)

    reactor.connectTCP(server, 119, factory)
    reactor.run()

Run nntpdownload.py with the name of an NNTP server, a newsgroup, and the filename to which the messages should be written. It will connect to the server, download the most recent 10 messages from that newsgroup, and then quit:

$ python nntpdownload.py freetext.usenetserver.com comp.lang.python comp.lang.python-latest.mbox
Fetching article 1 of 10... OK
Fetching article 2 of 10... OK
Fetching article 3 of 10... OK
Fetching article 4 of 10... OK
Fetching article 5 of 10... OK
Fetching article 6 of 10... OK
Fetching article 7 of 10... OK
Fetching article 8 of 10... OK
Fetching article 9 of 10... OK
Fetching article 10 of 10... OK

 

9.2.2. How Does That Work?

The NNTPGroupDownloadProtocol class, a subclass of nntp.NNTPClient, does most of the work in nntpdownload.py. The self.fetchGroup method asks the server for information about the newsgroup. When the server responds, gotGroup is called with the returned information: the total number of articles in the group, the index of the first article, the index of the last article, and the group name. NNTPGroupDownloadProtocol then goes back the number of articles specified by self.factory.articleCount (unless there aren't that many messages, in which case it just goes back to the first available article) and uses Python's range function to create a list of every number from the starting message index to the ending message index. Then it calls fetchNextArticle to begin downloading the set of messages.
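
The index arithmetic described above can be tried on its own, without a server. The following is a minimal sketch, not part of nntpdownload.py: the function name recent_article_numbers is hypothetical, and it assumes that first and last arrive as strings or integers, as in the group information passed to gotGroup.

```python
def recent_article_numbers(first, last, count):
    """Return the numbers of the newest `count` articles in a group.

    `first` and `last` are the lowest and highest article numbers the
    server reported; they are converted with int() in case they are
    strings.
    """
    first, last = int(first), int(last)
    # Go back `count` articles, but never past the first available one.
    start = max(first, last - count + 1)
    return list(range(start, last + 1))


if __name__ == "__main__":
    # A large group: only the newest 10 numbers are selected.
    print(recent_article_numbers("100", "250", 10))
    # A small group: every available number is selected.
    print(recent_article_numbers(5, 8, 10))
```

Not every number in the resulting list is guaranteed to correspond to an article that still exists on the server, which is one reason the protocol class tolerates individual download failures.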

fetchNextArticle takes the first index off the list of remaining articles and downloads it with a call to self.fetchArticle. The gotArticle method, called when the article has been successfully downloaded, passes the article data to self.factory.handleArticle and then calls fetchNextArticle again. If an article download fails, the getArticleFailed method is called instead. getArticleFailed prints the error message but doesn't abort the entire operation; it simply goes on to the next article. When no articles remain, fetchNextArticle calls quit and fires the factory's deferred.
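
The one-at-a-time pattern can be illustrated without Twisted. The sketch below is an analogue, not the NNTPClient API: the class SequentialFetcher and the fetch(number, on_ok, on_fail) callable are hypothetical stand-ins for the protocol class and for fetchArticle with its success and failure callbacks.

```python
class SequentialFetcher(object):
    """Fetch items one at a time; each result triggers the next fetch."""

    def __init__(self, numbers, fetch, handle):
        self.pending = list(numbers)
        self.fetch = fetch      # hypothetical: fetch(number, on_ok, on_fail)
        self.handle = handle    # called with each article that arrives
        self.failed = []        # (number, message) for each failure
        self.done = False

    def fetch_next(self):
        if not self.pending:
            # Nothing left: the real protocol would quit here.
            self.done = True
            return
        number = self.pending.pop(0)
        self.fetch(number, self.got_article,
                   lambda message: self.article_failed(number, message))

    def got_article(self, article):
        self.handle(article)
        self.fetch_next()

    def article_failed(self, number, message):
        # Record the failure and carry on instead of aborting.
        self.failed.append((number, message))
        self.fetch_next()


if __name__ == "__main__":
    store = {1: "first article", 3: "third article"}  # made-up data

    def fake_fetch(number, on_ok, on_fail):
        if number in store:
            on_ok(store[number])
        else:
            on_fail("no such article")

    received = []
    fetcher = SequentialFetcher([1, 2, 3], fake_fetch, received.append)
    fetcher.fetch_next()
    print(received)
    print(fetcher.failed)
```

Because only one request is outstanding at a time, each reply can be matched to the article that was requested, and progress is easy to report.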

An alternative to the approach used here is to use the fetchNewNews method, which takes a date and returns a list of all the articles posted since. Unfortunately, the underlying NEWNEWS command is not supported by many servers.
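
For reference, the NEWNEWS command defined in RFC 977 carries the group name followed by a date (YYMMDD) and a time (HHMMSS), optionally marked as GMT. The sketch below only builds that command line as text; the function name newnews_command is hypothetical, it assumes the datetime passed in is already in UTC, and it does not show how Twisted itself issues the command.

```python
from datetime import datetime


def newnews_command(group, since):
    """Build a NEWNEWS command line for articles newer than `since`.

    `since` is assumed to be a naive datetime expressed in UTC.
    """
    return "NEWNEWS %s %s GMT" % (group, since.strftime("%y%m%d %H%M%S"))


if __name__ == "__main__":
    print(newnews_command("comp.lang.python", datetime(2005, 3, 1, 12, 30, 0)))
```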

Because Usenet articles are in the same format as email, they can be stored in the same Unix mbox format used by the mail client examples shown in Chapter 7. The NNTPGroupDownloadFactory's handleArticle method parses the message using the email module and writes it to the output file in mbox format, followed by two blank lines to ensure that it will be clearly delimited from the next article in the file.
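
The mbox step can be exercised in isolation. This is a minimal sketch that mirrors handleArticle as a plain function: the name append_article is hypothetical, and output is assumed to be any text-mode file-like object.

```python
import email


def append_article(output, article_text):
    """Parse one raw article and append it to `output` in mbox form."""
    message = email.message_from_string(article_text)
    # unixfrom=True adds the "From " line that starts each mbox entry.
    output.write(message.as_string(unixfrom=True))
    # Blank lines keep this entry clearly separated from the next one.
    output.write("\n\n")
```

A file written this way should be readable with the standard library's mailbox.mbox class, which is a convenient way to check that the articles were delimited correctly.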
