XML and HTML
XML and HTML are the most popular markup languages (textual ways of describing structured data). HTML is used to describe textual documents, like you see on the Web. XML is used for just about everything else: data storage, messaging, configuration files, you name it. Just about every software buzzword forged over the past few years involves XML.
Java and C++ programmers tend to regard XML as a lightweight, agile technology, and are happy to use it all over the place. XML is a lightweight technology, but only compared to Java or C++. Ruby programmers see XML from the other end of the spectrum, and from there it looks pretty heavy. Simpler formats like YAML and JSON usually work just as well (see Recipe 13.1 or Recipe 13.2), and are easier to manipulate. But to shun XML altogether would be to cut Ruby off from the rest of the world, and nobody wants that. This chapter covers the most useful ways of parsing, manipulating, slicing, and dicing XML and HTML documents.
There are two standard APIs for manipulating XML: DOM and SAX. Both are overkill for most everyday uses, and neither is a good fit for Ruby's code-block-heavy style. Ruby's solution is to offer a pair of APIs that capture the style of DOM and SAX while staying true to the Ruby programming philosophy.[1] Both APIs are in the standard library's REXML package, written by Sean Russell.
[1] REXML also provides the SAX2Parser and SAX2Listener classes, which implement the basic SAX2 API.
Like DOM, the Document class parses an XML document into a nested tree of objects. You can navigate the tree with Ruby accessors (Recipe 11.2) or with XPath queries (Recipe 11.4). You can modify the tree by creating your own Element and Text objects (Recipe 11.9). If even Document is too heavyweight for you, you can use the XmlSimple library to transform an XML file into a nested Ruby hash (Recipe 11.6).
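As a rough illustration of the Document style, here is a minimal sketch that parses a small document, reads it with accessors and an XPath query, and then adds an element. The aquarium document and its fish names are invented for this example, and the recipes cited above cover each step in more depth.

```ruby
require 'rexml/document'

# Hypothetical sample document, made up for this sketch.
xml = <<~XML
  <aquarium>
    <fish name="Nemo" color="orange"/>
    <fish name="Dory" color="blue"/>
  </aquarium>
XML

# Parse the whole string into a tree of objects.
doc = REXML::Document.new(xml)

# Navigate with Ruby accessors: root element, first <fish> child, attribute.
root_name   = doc.root.name
first_color = doc.root.elements['fish'].attributes['color']

# Query with XPath: collect the name attribute of every <fish>.
names = REXML::XPath.match(doc, '//fish').map { |e| e.attributes['name'] }

# Modify the tree by creating an Element and attaching it to the root.
shark = REXML::Element.new('fish')
shark.add_attribute('name', 'Bruce')
doc.root.add_element(shark)
fish_count = doc.root.elements.to_a('fish').size
```

Everything here works on a tree that is fully built before the first accessor call, which is the cost the next paragraph describes.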
With a DOM-style API like Document, you have to parse the entire XML file before you can do anything. The XML document becomes a large number of Ruby objects nested under a Document object, all sitting around taking up memory. With a SAX-style parser like the StreamParser class, you can process a document as it's parsed, creating only the objects you want. The StreamParser API is covered in Recipe 11.3.
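The streaming style might look something like the sketch below. The FishNameCollector class and the sample document are hypothetical names invented for illustration; the listener keeps only the strings it cares about and never builds a tree.

```ruby
require 'rexml/document'
require 'rexml/parsers/streamparser'
require 'rexml/streamlistener'

# Hypothetical sample document, made up for this sketch.
xml = <<~XML
  <aquarium>
    <fish name="Nemo" color="orange"/>
    <fish name="Dory" color="blue"/>
  </aquarium>
XML

# A listener receives callbacks as the parser walks the document.
# Including StreamListener supplies no-op defaults for the callbacks
# we don't override.
class FishNameCollector
  include REXML::StreamListener

  attr_reader :names

  def initialize
    @names = []
  end

  # Called once for each opening tag, with the tag name and its attributes.
  def tag_start(name, attrs)
    # to_h tolerates attributes arriving as a hash or as name/value pairs.
    @names << attrs.to_h['name'] if name == 'fish'
  end
end

listener = FishNameCollector.new
REXML::Parsers::StreamParser.new(xml, listener).parse
collected = listener.names
```

The only objects that survive the parse are the ones the listener chose to keep, here a short array of strings.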
The main problem with the REXML APIs is that they're very picky. They'll only parse a document that's valid XML, or close enough to have an unambiguous representation. This makes them nearly useless for parsing HTML documents off the World Wide Web, since the average web page is not valid XML. Recipe 11.5 shows how to use the third-party tools Rubyful Soup and SGMLParser; they give a DOM- or SAX-style interface that handles even invalid XML.
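To see the pickiness in action, consider this small sketch. The HTML fragment is invented for the example; it uses an unclosed <br> tag, which is common in web pages but not allowed in XML.

```ruby
require 'rexml/document'

# Hypothetical HTML fragment: the <br> tag is never closed,
# which browsers tolerate but an XML parser does not.
html = '<html><body><p>Hello<br></p></body></html>'

error = nil
begin
  REXML::Document.new(html)
  parsed = true
rescue REXML::ParseException => e
  # REXML refuses the document instead of guessing at its structure.
  parsed = false
  error = e
end

puts "Parsed cleanly? #{parsed}"
```

A forgiving parser has to guess where the missing end tag belongs; REXML declines to guess and raises an exception instead.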
- http://www.germane-software.com/software/rexml/
- http://www.germane-software.com/software/rexml/docs/tutorial.html