Parsing Invalid Markup

Problem

You need to extract data from a document that's supposed to be HTML or XML, but that contains some invalid markup.

Solution

For a quick solution, use Rubyful Soup, written by Leonard Richardson and found in the rubyful_soup gem. It can build a document model even out of invalid XML or HTML, and it offers an idiomatic Ruby interface for searching the document model. It's good for quick screen-scraping tasks or HTML cleanup.

require 'rubygems'
require 'rubyful_soup'

invalid_html = 'A lot of <b class="1">tags are <i class="2">never closed.'
soup = BeautifulSoup.new(invalid_html)
puts soup.prettify
# A lot of
# <b class="1">
#  tags are
#  <i class="2">
#   never closed.
#  </i>
# </b>

soup.b.i                                   # => <i class="2">never closed.</i>
soup.i                                     # => <i class="2">never closed.</i>
soup.find(nil, :attrs => {'class' => '2'}) # => <i class="2">never closed.</i>
soup.find_all('i')                         # => [<i class="2">never closed.</i>]
soup.b['class']                            # => "1"
soup.find_text(/closed/)                   # => "never closed."

If you need better performance, do what Rubyful Soup does and write a custom parser on top of the event-based parser SGMLParser (found in the htmltools gem). It works a lot like REXML's StreamListener interface.

Discussion

Sometimes it seems like the authors of markup parsers do their coding atop an ivory tower. Most parsers simply refuse to parse bad markup, but this cuts off an enormous source of interesting data. Most of the pages on the World Wide Web are invalid HTML, so if your application uses other peoples web pages as input, you need a forgiving parser. Invalid XML is less common but by no means rare.

The SGMLParser class in the htmltools gem uses regular expressions to parse an XML-like data stream. When it finds an opening or closing tag, some data, or some other part of an XML-like document, it calls a hook method that you're supposed to define in a subclass. SGMLParser doesn't build a document model or keep track of the document state: it just generates events. If closing tags don't match up or if the markup has other problems, it won't even notice.
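The event-generating technique can be sketched in a few lines of plain Ruby. The toy class below is an illustration of the idea only, not the htmltools gem's actual implementation: a regular expression walks the input, each tag or run of text triggers a handler, and start tags are dispatched to a do_<tagname> hook if the subclass defines one.

```ruby
# Toy event-driven scanner in the spirit of SGMLParser (an illustration
# of the technique only, NOT the htmltools gem's actual code).
class ToyScanner
  TAG = /<(\/?)(\w+)((?:\s+[^<>]*?)?)>/

  def feed(data)
    pos = 0
    while (match = TAG.match(data, pos))
      handle_data(data[pos...match.begin(0)]) if match.begin(0) > pos
      name = match[2].downcase
      if match[1] == '/'
        handle_endtag(name)
      else
        attrs = match[3].scan(/(\w+)\s*=\s*"([^"]*)"/)
        # Dispatch to a do_<tagname> hook if the subclass defines one,
        # mimicking SGMLParser's do_a-style hook methods.
        hook = "do_#{name}"
        respond_to?(hook) ? send(hook, attrs) : handle_starttag(name, attrs)
      end
      pos = match.end(0)
    end
    handle_data(data[pos..-1]) if pos < data.length
  end

  # Default no-op handlers; subclasses override the events they care about.
  def handle_starttag(name, attrs); end
  def handle_endtag(name); end
  def handle_data(data); end
end

# A subclass that just records the event stream:
class EventLogger < ToyScanner
  attr_reader :events
  def initialize; @events = []; end
  def handle_starttag(name, attrs); @events << [:start, name]; end
  def handle_endtag(name); @events << [:end, name]; end
  def handle_data(data); @events << [:data, data]; end
end

logger = EventLogger.new
logger.feed('A lot of <b>tags are <i>never closed.')
logger.events
# => [[:data, "A lot of "], [:start, "b"], [:data, "tags are "],
#     [:start, "i"], [:data, "never closed."]]
```

Note that the unclosed tags in the input produce no error and no :end events: the scanner just reports what it sees, which is exactly the behavior described above.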

Rubyful Soup's parser classes define SGMLParser hook methods that build a document model out of an ambiguous document. Its BeautifulSoup class is intended for HTML documents: it uses heuristics like a web browser's to figure out what an ambiguous document "really" means. These heuristics are specific to HTML; to parse XML documents, you should use the BeautifulStoneSoup class. You can also subclass BeautifulStoneSoup and implement your own heuristics.
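The core of such a forgiving tree builder can also be sketched briefly. The snippet below is a minimal sketch of the general heuristic, not Rubyful Soup's actual code: a stack tracks open elements, a stray closing tag is silently dropped, and anything still open at end-of-input is closed implicitly.

```ruby
# Minimal forgiving tree builder (a sketch of the heuristic, not Rubyful
# Soup's implementation): unclosed tags are closed implicitly at the end,
# and closing tags that were never opened are silently ignored.
Node = Struct.new(:name, :children)

def forgiving_tree(markup)
  root = Node.new(nil, [])
  stack = [root]
  markup.scan(/<(\/?)(\w+)[^>]*>|([^<]+)/) do |close, name, text|
    if text
      stack.last.children << text
    elsif close == '/'
      # Pop back to the nearest matching open tag; everything left open
      # inside it gets closed implicitly. Stray closers are ignored.
      if (i = stack.rindex { |node| node.name == name })
        stack.slice!(i..-1)
      end
    else
      node = Node.new(name, [])
      stack.last.children << node
      stack << node
    end
  end
  root  # tags still on the stack here are implicitly closed
end

tree = forgiving_tree('A lot of <b>tags are <i>never closed.')
# The <i> element ends up nested inside <b>, both implicitly closed:
tree.children[1].name                 # => "b"
tree.children[1].children[1].name     # => "i"
tree.children[1].children[1].children # => ["never closed."]
```

Real-world heuristics are much subtler than this (a browser knows, for example, that a P tag implicitly closes an open P tag), but the stack-and-recover structure is the same.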

Rubyful Soup builds a densely linked model of the entire document, which uses a lot of memory. If you only need to process certain parts of the document, you can implement the SGMLParser hooks yourself and get a faster parser that uses less memory.

Here's an SGMLParser subclass that extracts URLs from a web page. It checks every A tag for an HREF attribute and keeps the results in a set. Note the similarity to the LinkGrabber class defined in Recipe 11.13.

require 'rubygems'
require 'html/sgml-parser'
require 'set'

html = %{<a name="anchor"><a href="http://www.oreilly.com">O'Reilly</a>
irrelevant <a href="http://www.ruby-lang.org/">Ruby</a>}

class LinkGrabber < HTML::SGMLParser
  attr_reader :urls

  def initialize
    @urls = Set.new
    super
  end

  def do_a(attrs)
    url = attrs.find { |attr| attr[0] == 'href' }
    @urls << url[1] if url
  end
end

extractor = LinkGrabber.new
extractor.feed(html)
extractor.urls
# => #<Set: {"http://www.oreilly.com", "http://www.ruby-lang.org/"}>

The equivalent Rubyful Soup program is quicker to write and easier to understand, but it runs more slowly and uses more memory:

require 'rubyful_soup'

urls = Set.new
BeautifulStoneSoup.new(html).find_all('a').each do |tag|
  urls << tag['href'] if tag['href']
end

You can improve performance by telling Rubyful Soup's parser to ignore everything except A tags and their contents:

puts BeautifulStoneSoup.new(html, :parse_only_these => 'a')
# <a name="anchor"></a>
# <a href="http://www.oreilly.com">O'Reilly</a>
# <a href="http://www.ruby-lang.org/">Ruby</a>

But the fastest implementation will always be a custom SGMLParser subclass. If your parser is part of a full application (rather than a one-off script), you'll need to find the best tradeoff between performance and code legibility.

See Also
