Processing XML with Java™: A Guide to SAX, DOM, JDOM, JAXP, and TrAX

"Ignorable White Space"

One of the more obscure parts of the XML 1.0 specification is the perhaps misleadingly named "ignorable white space." This is white space that occurs between tags in places where the DTD does not allow mixed content. Consider the XML-RPC document in Example 6.13.

Example 6.13 A Document That Uses Ignorable White Space to Prettify the XML

<?xml version="1.0"?>
<!DOCTYPE methodCall [
  <!ELEMENT methodCall (methodName, params)>
  <!ELEMENT params (param+)>
  <!ELEMENT param (value)>
  <!ELEMENT value (string)>
  <!ELEMENT methodName (#PCDATA)>
  <!ELEMENT string (#PCDATA)>
]>
<methodCall>
  <methodName>lookupSymbol</methodName>
  <params>
    <param>
      <value>
        <string>
          Red Hat
        </string>
      </value>
    </param>
  </params>
</methodCall>

This example has quite a bit of white space just for indenting. In particular, the spaces, carriage returns, and linefeeds between the following exist only for indenting:

  • <methodCall> and <methodName>

  • </methodName> and <params>

  • <params> and <param>

  • <param> and <value>

  • <value> and <string>

  • </string> and </value>

  • </value> and </param>

  • </param> and </params>

  • </params> and </methodCall>

Furthermore, the DTD says that these elements cannot contain #PCDATA, so the parser knows that this white space is ignorable. A validating parser therefore does not pass these white space characters to the characters() method. Instead it passes them to the ignorableWhitespace() method. A nonvalidating parser might do the same, or it might pass the ignorable white space to the characters() method instead. If this matters to you, make sure you use a validating parser.

The space and line break characters in the string element are not ignorable because the DTD allows this element to contain #PCDATA. This white space is passed to the characters() method along with the words Red and Hat. White space is considered ignorable only where #PCDATA is invalid.
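The split between the two callbacks can be observed with a small handler. The following is a minimal sketch, not a definitive implementation: the class name WhiteSpaceTally and its tally() helper are hypothetical names introduced here for illustration, and the sketch assumes the JAXP parser bundled with the JDK with validation switched on.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: records which callback received each run of text.
class WhiteSpaceTally extends DefaultHandler {

    final StringBuilder characterData = new StringBuilder();
    int ignorableCount = 0;

    @Override
    public void characters(char[] text, int start, int length) {
        // Text inside elements whose content model allows #PCDATA arrives
        // here, including any white space around the words.
        characterData.append(text, start, length);
    }

    @Override
    public void ignorableWhitespace(char[] text, int start, int length) {
        // White space in element-only content is reported here by a
        // validating parser. Only its length is recorded.
        ignorableCount += length;
    }

    // Parses the document with a validating parser and returns the tally.
    static WhiteSpaceTally tally(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setValidating(true); // needed for guaranteed reporting
        XMLReader reader = factory.newSAXParser().getXMLReader();
        WhiteSpaceTally handler = new WhiteSpaceTally();
        reader.setContentHandler(handler);
        reader.setErrorHandler(handler);
        reader.parse(new InputSource(new StringReader(xml)));
        return handler;
    }
}
```

Run against Example 6.13, characterData should hold lookupSymbol followed by the contents of the string element with its surrounding white space intact, while the indentation between the other tags is counted only in ignorableCount.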

For purposes of this method, white space consists exclusively of the ASCII space (&#x20;), tab (&#x9;), carriage return (&#xD;), and linefeed (&#xA;). Unicode includes many more space characters, including next line (&#x85;), em space (&#x2003;), en space (&#x2002;), and more. However, these characters are never ignorable.
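The four-character definition is easy to encode directly. The sketch below is illustrative only; the class XmlWhiteSpace and both of its methods are hypothetical helpers, not part of SAX or JAXP. It deliberately avoids Character.isWhitespace(), which also accepts characters such as the em space that XML does not treat as white space.

```java
// Hypothetical helper: tests text against XML's definition of white space.
class XmlWhiteSpace {

    // True only for space, tab, carriage return, and linefeed.
    static boolean isXmlWhiteSpace(char c) {
        return c == 0x20 || c == 0x9 || c == 0xD || c == 0xA;
    }

    // True if every character in text[start] .. text[start+length-1]
    // is XML white space. An empty range counts as all white space.
    static boolean isAllXmlWhiteSpace(char[] text, int start, int length) {
        for (int i = start; i < start + length; i++) {
            if (!isXmlWhiteSpace(text[i])) {
                return false;
            }
        }
        return true;
    }
}
```

A handler that receives white space through characters() from a nonvalidating parser could use a check like this to recognize runs that consist purely of XML white space, although without the DTD it still cannot tell whether such a run is ignorable.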

The ignorableWhitespace() method has the same arguments and the same caveats as the characters() method. For example, there's no guarantee that each call to this method will contain the maximum contiguous run of ignorable white space. However, its text[] argument should contain nothing except space characters, tabs, carriage returns, and linefeeds, at least in the subarray that begins at start and ends just before start+length.
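Because one run of ignorable white space may arrive in several calls, a handler that cares about whole runs has to join the pieces itself. The following sketch shows one way to do that, under the assumption that a run ends at the next element boundary; the class name IgnorableRunCollector is hypothetical, and the sketch ignores other events, such as processing instructions, that could also interrupt a run.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: joins consecutive ignorableWhitespace() calls
// into a single run, closing the run at each start-tag or end-tag.
class IgnorableRunCollector extends DefaultHandler {

    final List<String> runs = new ArrayList<>();
    private final StringBuilder current = new StringBuilder();

    @Override
    public void ignorableWhitespace(char[] text, int start, int length) {
        // Only text[start] .. text[start+length-1] belongs to this call;
        // the rest of the array may hold unrelated data.
        current.append(text, start, length);
    }

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) {
        flush();
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flush();
    }

    @Override
    public void endDocument() {
        flush();
    }

    // Stores the pending run, if any, and resets the buffer.
    private void flush() {
        if (current.length() > 0) {
            runs.add(current.toString());
            current.setLength(0);
        }
    }
}
```

Copying from the array inside the callback matters here, since the parser is free to reuse the same buffer for the next call.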
