Effective XML: 50 Specific Ways to Improve Your XML
Whether you're writing code with DOM, JDOM, dom4j, Sparta, or some other tree-based API, one of the primary tasks is navigating the document to locate particular nodes, whether for query or update. XPath is a very powerful, very robust language for navigating XML documents. It often works better than explicitly specified navigation using methods such as getChildNode() or getNextSibling() . Navigation in a tree-based API is a surprisingly complex and error-prone operation. For example, consider a simple contacts file that includes names and phone numbers and claims to follow the DTD shown in Example 35-1. Example 35-1 A DTD for a Simple Contacts File
<!ELEMENT Contacts (Contact+)> <!ELEMENT Contact (Name, Phone)> <!ELEMENT Name (#PCDATA)> <!ELEMENT Phone (#PCDATA)> A typical document might look like Example 35-2. Example 35-2 A Simple Contacts Document
<?xml version="1.0"?> <!DOCTYPE Contacts SYSTEM "contacts.dtd"> <Contacts> <Contact> <Name>John Doe</Name> <Phone>626-555-3456</Phone> </Contact> <Contact> <Name>Joe Smith</Name> <Phone>212-555-3456</Phone> </Contact> <Contact> <Name>Jane and Ben Reilly</Name> <Phone>212-555-2341</Phone> </Contact> <Contact> <Name>Mohammed Jones</Name> <Phone>718-555-2349</Phone> </Contact> <Contact> <Name>David Smith</Name> <Phone>914-555-3145</Phone> </Contact> <Contact> <Name>Xao Li</Name> <Phone>212-555-0998</Phone> </Contact> </Contacts> This is very straightforward, very simple record-like data. What can go wrong when processing files like this? Actually, quite a lot, especially when you realize that not all instance documents are likely to be so simple or so valid. In fact, all too often you'll find yourself working with documents like Example 35-3 instead, even if you were promised data that precisely adheres to the DTD in Example 35-1. Invalid data is a fact of life. It's not acceptable to just throw up your hands in disgust because someone sent you a document that's slightly different from what you expected. Example 35-3 A Slightly More Realistic Contacts Document
<?xml version="1.0"?> <Contacts> <Contact> <Name>John Doe</Name> <Phone>626-555-3456</Phone> </Contact> <Contact> <Name>Joe <!--I need to check whether he prefers Joe or Joseph --> Smith </Name> <Phone>212-555-3456</Phone> </Contact> <Contact> <Name>Jane <![CDATA[& Ben]]> Reilly</Name> <Phone>212-555-2341</Phone> </Contact> <Contact> <Name>Mohammed Jones</Name> <Phone>718-555-2349</Phone> <Phone>914-555-7698</Phone> </Contact> <Contact> <Phone>914-555-3145</Phone> <Name>David Smith</Name> </Contact> <Contact> <Name>Jason Smith</Name> <Phone></Phone> </Contact> <Contact> <Name>Xao Li</Name> <Name>Jonathan Li</Name> <Phone>212-555-0998</Phone> </Contact> </Contacts> Potential problems include those listed below.
Some of these problems are mitigated by particular APIs. XOM resolves all entity references. DOM requires parsers to place the maximum possible contiguous run of text in each text node. MSXML throws away white-space-only nodes by default. Nonetheless, all APIs suffer from at least several of these problems. Validation would catch some of these, though arguably they're not mistakes. For instance, if people do have multiple phone numbers but the schema only allows one, it's the schema that's mistaken, not the document. The instance document more correctly reflects reality than the schema does. The same can be said of a program that assumes every person has exactly one phone number. The software should be changed, not the document. Is it possible to write a program that behaves correctly even in the face of unexpected changes in document structure? Surprisingly, the answer is yes, it is. Well-written programs can handle a lot of cases their designers did not explicitly plan for. The trick is to write programs that fit document structures loosely rather than tightly. The program should ask only for the information it needs without tying that to information it doesn't really care about. For example, if you want to find Joe Smith's phone numbers, you need to ask for the Phone siblings of the Name element whose value is Joe Smith. You don't want to ask for the Phone sibling of the Name element whose first text node is Joe Smith and is a child of a Contact element, which is a child of the Contacts element, which is the root element of the document. That's too specific. There are too many extra conditions such as "The root element is Contacts " that don't have anything to do with what you really want to know. It is not possible, of course, to write perfectly robust software. For instance, there's no way to find Joe Smith's phone number if it's not in the document somewhere. However, you can write software that behaves reliably on a much broader class of documents and fails gracefully when the information truly isn't there. XPath is a fourth-generation declarative language that allows you to specify which nodes you want to process without specifying exactly how the processor is supposed to navigate to those nodes. XPath's data model is very well designed to support exactly what almost all developers want from XML. For instance, it merges all adjacent text including that in CDATA sections, skips over comments and processing instructions, includes text from child and descendant elements, and resolves all external entity references. In practice, XPath expressions tend to be much more robust against unexpected but perhaps insignificant changes in the input document. For example, consider a very simple DOM program fragment that prints all the names in the contacts file. Element contactsElement = doc.getDocumentElement(); NodeList contacts = contactsElement.getChildNodes(); for (int i = 0; i < contacts.getLength(); i++) { Node current = contacts.item(i); if (current.getNodeType() == Node.ELEMENT_NODE) { Element contact = (Element) current; NodeList children = contact.getChildNodes(); for (int j = 0; j < contacts.getLength(); j++) { Node candidate = children.item(j); if (candidate.getNodeType() == Node.ELEMENT_NODE) { // Assuming name is the first child element String name = candidate.getFirstChild().getNodeValue(); System.out.println(name); // Assuming there's only one name per contact break; } } } } However, this code is very closely tied to the structure of the document. When actually run on Example 35-3, here's the output: John Doe Joe Jane Mohammed Jones 914-555-3145 Jason Smith Xao Li You can see that it lost half of Joe Smith's name and missed Ben Reilly completely. It also lost track of Jonathan Li and replaced David Smith with his phone number. With a little effort, we can improve the DOM code to fix most of the problems. The most important improvement we can make is relying on element names rather than positions . The second most important improvement is using the getElementsByTagName() method to search the entire tree rather than just some expected parts of it. The third most important improvement (and the most difficult to implement correctly) is not assuming all text nodes are adjacent. The final fix is removing the assumption that there's only one name per contact. The improved code follows . NodeList names = doc.getElementsByTagName("Name"); for (int i = 0; i < names.getLength(); i++) { Node name = names.item(i); String value = getFullText(name); System.out.println(value); } // ... public static String getFullText(Node node) { StringBuffer result = new StringBuffer(12); NodeList children = node.getChildNodes(); for (int i = 0; i < children.getLength(); i++) { Node child = children.item(i); int type = child.getNodeType(); if (type == Node.TEXT_NODE) { result.append(child.getNodeValue()); } else if (type == Node.CDATA_SECTION_NODE) { result.append(child.getNodeValue()); } else if (type == Node.ELEMENT_NODE) { result.append(getFullText(child)); } else if (type == Node.ENTITY_REFERENCE_NODE) { result.append(getFullText(child)); } } return result.toString(); } As you can see, this is somewhat more complex. Note especially the recursion that's necessary to accumulate the complete text of the node. Complexity would increase further in an API that doesn't have an equivalent to getElementsByTagName() , a fairly unique feature of DOM. Is there perhaps an easier way? Yes, there is, and that way is XPath. Consider the following XPath expression. /Contacts/Contact/Name It's about ten times simpler than the DOM version but accomplishes almost as much. This returns the desired content even if the Name elements contain CDATA sections and child elements, and no matter how many of them appear in any given Contact element. This particular XPath expression would still fail if the hierarchy were not just what was expected, for instance, if the root element were something other than Contacts or if a Name element were not an immediate child of a Contact element. However, an XPath expression that uses the descendant axis would be even more robust. The XPath expression below would still succeed. //Name There are numerous ways to integrate XPath processing into your programs. Most XSLT processors such as Xalan, libxslt , and MSXML include APIs you can use to merge XPath expressions with your own code. Unfortunately, there is as yet no standard API for XPath queries. For example, using Xalan, the above query might look like this: NodeList results = XPathAPI.selectNodeList(doc, "//Name"); for (int i = 0; i < results.getLength(); i++) { Node result = results.item(i); XObject value = XPathAPI.eval(result, "string()"); System.out.println(value.str()); } In MSXML, XPath evaluation is built into the DOM implementation classes, which makes for slightly more concise if less portable code. (See Item 31.) For example, here's the same XPath search implemented in JavaScript such as you might use from Internet Explorer. var results = doc.selectNodes("//Name"); for (i=0; i < results.length; i++) { var node = results.item(i); alert(node.text); } Many Java APIs are supported by the Jaxen XPath library (http:// jaxen . sourceforge .net) including DOM, JDOM, and dom4j. Using Jaxen on top of DOM, the above query would look something like this: XPath expression = new org.jaxen.dom.DOMXPath("//Name"); Navigator navigator = expression.getNavigator(); List results = expression.selectNodes(doc); Iterator iterator = results.iterator(); while (iterator.hasNext()) { Node result = (Node) iterator.next(); String value = StringFunction.evaluate(result, navigator); System.out.println(value); } There are a few things you need to be wary of when integrating XPath with more traditional code. First of all, XPath queries return node-sets , not single nodes. You should be prepared to iterate through the result, even if you expect only a single node. This handles cases both where the expected node is missing and where there's more than you expected. Avoid methods like selectSingleNode() in MSXML and Xalan that hide such results from the library client. They significantly reduce robustness. If you absolutely insist that there can be only one result, then by all means validate the document against a schema that requires this (Item 37) before processing the document, and reject it if it fails the schema. Do not simply assume that all documents passed to the program meet the constraints. Such assumptions fail far more often than most developers initially expect, and over a period of several years almost all such assumptions fail. Secondly, be aware that you may take a performance hit for choosing XPath, especially if you use the tree-walking descendant and descendant-or-self axes. The following axis also tends to be slow. Some (but not all) XPath engines have major performance problems with the ancestor and ancestor -or-self axes. You can use the child axis and more explicit descent instead. However, as noted above, this does reduce your program's robustness. Like all other performance issues, this should be evaluated within the context of a particular program, system, library, and use case. Some systems have CPU power to spare. Others may be so CPU or memory limited that the cost of even the most basic XPath processing is prohibitive. Sometimes switching to a different XPath engine can produce an order of magnitude improvement in processing particular XPath expressions. I recommend designing the processing for maximum robustness and optimizing the XPath expressions only when and if profiling proves this is a significant bottleneck. Personally, I'm willing to trade performance for robustness, and I find the opposite opinion (a willingness to trade correctness for performance) completely incomprehensible. XPath-based navigation is both easier to write and easier to write correctly than explicit navigation. It produces much more robust software. If you'd rather process more documents faster, at the cost of processing some of them incorrectly, you can do so; but please at least test the performance of an XPath-based alternative before you rule it out. |