A Simple Feed Aggregator
Credit: Rod Gaither
XML is the basis for many specialized languages. One of the most popular is RSS, an XML format often used to store lists of articles from web pages. With a tool called an aggregator, you can collect weblog entries and articles from several web sites' RSS feeds, and read all those web sites at once without having to skip from one to the other. Here, we'll create a simple aggregator in Ruby.
Before aggregating RSS feeds, let's start by reading a single one. Fortunately, we have several options for parsing RSS feeds into Ruby data structures. The Ruby standard library has built-in support for the three major versions of the RSS format (0.9, 1.0, and 2.0). This example uses the standard rss library to parse an RSS 2.0 feed and print out the titles of the items in the feed:
    require 'rss/2.0'
    require 'open-uri'

    url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
    # URI.open needs Ruby 2.5 or later; older versions use plain open(url).
    feed = RSS::Parser.parse(URI.open(url).read, false)
    puts "=== Channel: #{feed.channel.title} ==="
    feed.items.each do |item|
      puts item.title
      puts " (#{item.link})"
      puts
      puts item.description
    end
    # === Channel: O'Reilly Network Articles ===
    # How to Make Your Sound Sing with Vocoders
    # (http://digitalmedia.oreilly.com/2006/03/29/vocoder-tutorial-and-tips.html)
    # …
Unfortunately, the standard rss library is a little out of date. There's a newer syndication format called Atom, which serves the same purpose as RSS, and the rss library doesn't support it. Any serious aggregator must support all the major syndication formats.
So instead, our aggregator will use Lucas Carlson's Simple RSS library, available as the simple-rss gem. This library supports the three main versions of RSS, plus Atom, and it does so in a relaxed way so that ill-formed feeds have a better chance of being read.
Here's the example above, rewritten to use Simple RSS. As you can see, only the name of the class is different:
    require 'rubygems'
    require 'simple-rss'
    require 'open-uri'

    url = 'http://www.oreillynet.com/pub/feed/1?format=rss2'
    # SimpleRSS.parse accepts a string or any object that responds to read.
    feed = SimpleRSS.parse(URI.open(url))
    puts "=== Channel: #{feed.channel.title} ==="
    feed.items.each do |item|
      puts item.title
      puts " (#{item.link})"
      puts
      puts item.description
    end
Now we have a general method of reading a single RSS or Atom feed. Time to work on some aggregation!
Although the aggregator will be a simple Ruby script, there's no reason not to use Ruby's object-oriented features. Our approach will be to create a class to encapsulate the aggregator's data and behavior, and then write a sample program to use the class.
The RSSAggregator class that follows is a bare-bones aggregator that reads from multiple syndication feeds when instantiated. It uses a few simple methods to expose the data it has read.
    #!/usr/bin/ruby
    # rss-aggregator.rb - Simple RSS and Atom Feed Aggregator

    require 'rubygems'
    require 'simple-rss'
    require 'open-uri'

    class RSSAggregator
      def initialize(feed_urls)
        @feed_urls = feed_urls
        @feeds = []
        read_feeds
      end

      protected

      # Download and parse every feed. URI.open needs Ruby 2.5 or later;
      # older versions use plain open(url).
      def read_feeds
        @feed_urls.each { |url| @feeds.push(SimpleRSS.new(URI.open(url).read)) }
      end

      public

      # Throw away the parsed feeds and download them all again.
      def refresh
        @feeds.clear
        read_feeds
      end

      # Print one line per feed: its index, title, and article count.
      def channel_counts
        @feeds.each_with_index do |feed, index|
          channel = "Channel(#{index.to_s}): #{feed.channel.title}"
          articles = "Articles: #{feed.items.size.to_s}"
          puts channel + ', ' + articles
        end
      end

      # Print the article titles for the feed at the given index.
      def list_articles(id)
        puts "=== Channel(#{id.to_s}): #{@feeds[id].channel.title} ==="
        @feeds[id].items.each { |item| puts '  ' + item.title }
      end

      def list_all
        @feeds.each_with_index { |f, i| list_articles(i) }
      end
    end
Now we just need a few more lines of code to instantiate and use an RSSAggregator object:
    test = RSSAggregator.new(ARGV)
    test.channel_counts
    puts
    test.list_all
Here's the output from a run of the test program against a few feed URLs:
    $ ruby rss-aggregator.rb http://www.rubyriver.org/rss.xml http://rss.slashdot.org/Slashdot/slashdot http://www.oreillynet.com/pub/feed/1 http://safari.oreilly.com/rss/
    Channel(0): RubyRiver, Articles: 20
    Channel(1): Slashdot, Articles: 10
    Channel(2): O'Reilly Network Articles, Articles: 15
    Channel(3): O'Reilly Network Safari Bookshelf, Articles: 10

    === Channel(0): RubyRiver ===
      Mantis style isn't eas…
      It's wonderful when tw…
      Red tailed hawk
      37signals
    …
While a long way from a fully functional RSS aggregator, this program illustrates the basic requirements of any real aggregator. From this starting point, you can expand and refine the features of RSSAggregator.
One very important feature missing from the aggregator is support for the If-Modified-Since HTTP request header. When you call RSSAggregator#refresh, your aggregator downloads the specified feeds, even if it just grabbed the same feeds and none of them have changed since then. This wastes bandwidth.
Polite aggregators keep track of when they last grabbed a certain feed, and when they request it again they make a conditional request by supplying an HTTP request header called If-Modified-Since. The details are a little beyond our scope, but basically the web server serves the requested feed only if it has changed since the last time the RSSAggregator downloaded it. Otherwise it answers with a short 304 ("Not Modified") response that has no body.
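As a rough illustration of such a conditional request, here is a minimal sketch that uses Ruby's standard net/http library instead of open-uri. The method names build_feed_request and fetch_feed_if_modified are our own inventions for this example and are not part of RSSAggregator or Simple RSS. The sketch also assumes the caller remembers the time of the previous fetch; a more careful client would probably echo back the Last-Modified value the server sent.

```ruby
require 'net/http'
require 'uri'
require 'time' # adds Time#httpdate

# Hypothetical helper: build a GET request for the feed. When last_fetched
# (a Time) is given, add an If-Modified-Since header in HTTP date format.
def build_feed_request(uri, last_fetched = nil)
  request = Net::HTTP::Get.new(uri)
  request['If-Modified-Since'] = last_fetched.httpdate if last_fetched
  request
end

# Hypothetical helper: return the feed body as a string, or nil when the
# server answers 304 Not Modified. Any other non-success response raises.
def fetch_feed_if_modified(url, last_fetched = nil)
  uri = URI.parse(url)
  use_ssl = (uri.scheme == 'https')
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: use_ssl) do |http|
    http.request(build_feed_request(uri, last_fetched))
  end

  case response
  when Net::HTTPNotModified then nil
  when Net::HTTPSuccess     then response.body
  else response.error!
  end
end

# Possible use (performs a real network request, so it is left commented out):
#   body = fetch_feed_if_modified('http://www.oreillynet.com/pub/feed/1',
#                                 Time.now - 3600)
#   feed = SimpleRSS.new(body) if body
```

A refresh method built on this idea would keep the previously parsed feed whenever the helper returns nil, and reparse only the feeds that actually changed.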
Another important feature our RSSAggregator is missing is the ability to store the articles it fetches. A real aggregator would store articles on disk or in a database to keep track of which stories are new since the last fetch, and to keep articles available even after they become old news and drop out of the feed.
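One simple way to get this behavior is sketched below: a small store that keeps every article it has seen in a JSON file, keyed by the article's link. The ArticleStore class, its method names, and the choice of JSON on disk are all assumptions made for this illustration; a real aggregator might well prefer a database. Articles are represented as plain hashes with 'title', 'link', and 'description' keys.

```ruby
require 'json'

# Hypothetical on-disk store for fetched articles, keyed by link.
class ArticleStore
  def initialize(path)
    @path = path
    # Load previously saved articles, or start with an empty store.
    @articles = File.exist?(path) ? JSON.parse(File.read(path)) : {}
  end

  # Save any articles we have not seen before and return only those.
  def add(articles)
    fresh = articles.uniq { |article| article['link'] }
                    .reject { |article| @articles.key?(article['link']) }
    fresh.each { |article| @articles[article['link']] = article }
    File.write(@path, JSON.pretty_generate(@articles)) unless fresh.empty?
    fresh
  end

  # Every article ever stored, including ones that have left the feed.
  def all
    @articles.values
  end
end

# Possible use with a parsed feed (commented out because it needs a feed):
#   store = ArticleStore.new('articles.json')
#   articles = feed.items.map do |item|
#     { 'title' => item.title, 'link' => item.link,
#       'description' => item.description }
#   end
#   store.add(articles).each { |article| puts "NEW: #{article['title']}" }
```

Keying on the link is a simplification: it assumes each article has a stable, unique URL, which is not guaranteed for every feed.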
Our simple aggregator counts the articles and lists their titles for review, but it doesn't actually provide access to the article detail. As seen in the first example, each item in a feed has a link attribute containing the URL for the article, and a description attribute containing the (possibly HTML) body of the article. A real aggregator might generate a list of articles in HTML format for use in a browser, or convert the body of each article to text for output to a terminal.
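To give a feel for the HTML option, here is a minimal sketch that turns a channel title and a list of articles into an HTML fragment of linked titles. The articles_to_html method is a made-up name for this example, and the articles are assumed to be plain hashes with 'title' and 'link' keys rather than Simple RSS objects. Titles and links are escaped with ERB::Util.html_escape from the standard library, since feed content should not be trusted to be safe HTML.

```ruby
require 'erb'

# Hypothetical helper: render a channel's articles as an HTML fragment.
# Each article is a hash with 'title' and 'link' keys.
def articles_to_html(channel_title, articles)
  items = articles.map do |article|
    link  = ERB::Util.html_escape(article['link'])
    title = ERB::Util.html_escape(article['title'])
    %(  <li><a href="#{link}">#{title}</a></li>)
  end
  heading = ERB::Util.html_escape(channel_title)
  "<h2>#{heading}</h2>\n<ul>\n#{items.join("\n")}\n</ul>\n"
end

# Possible use with a parsed feed (commented out because it needs a feed):
#   articles = feed.items.map do |item|
#     { 'title' => item.title, 'link' => item.link }
#   end
#   puts articles_to_html(feed.channel.title, articles)
```

The description is left out on purpose: because it may already contain HTML, deciding whether to escape it, sanitize it, or convert it to text is a separate design choice.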
See Also
- Recipe 14.1, "Grabbing the Contents of a Web Page"
- Recipe 14.3, "Customizing HTTP Request Headers"
- Recipe 11.15, "Converting HTML Documents from the Web into Text"
- A good comparison of the RSS and Atom formats (http://www.intertwingly.net/wiki/pie/Rss20AndAtom10Compared)
- Details on the Simple RSS project (http://simple-rss.rubyforge.org/)
- The FeedTools project has a more sophisticated aggregator library that supports caching and If-Modified-Since; see http://sporkmonger.com/projects/feedtools/ for details
- "HTTP Conditional Get for RSS Hackers" is a readable introduction to If-Modified-Since (http://fishbowl.pastiche.org/2002/10/21/http_conditional_get_for_rss_hackers)