Real World XML (2nd Edition)
XML documents are made up of markup and character data. (One day they may also hold binary data, but there is no provision yet for enclosing binary data in a document made up of markup and character data; until there is, you refer to external binary data with entity references, as we'll see.) The markup in a document gives it its structure. Markup includes start tags, end tags, empty-element tags, entity references, character references, comments, CDATA section delimiters (we'll see more about CDATA sections in a few pages), document type declarations (which hold a document's DTD), XML declarations, and processing instructions.

So what's the character data in an XML document? All the text in a document that is not markup is character data. Here's a quick example using markup and character data that we've already seen:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello From XML
        </GREETING>
        <MESSAGE>
            Welcome to the wild and woolly world of XML.
        </MESSAGE>
    </DOCUMENT>

Tags begin with < and end with >, so it's easy to see that the markup here consists of the XML declaration, <?xml version="1.0" encoding="UTF-8"?>, and tags such as <DOCUMENT>, <GREETING>, and so on. The text Hello From XML and Welcome to the wild and woolly world of XML. is the character data.

However, markup does not need to begin and end with < and >. Markup can also start with & and end with ; in the case of general entity references (an entity reference is replaced by the entity it refers to when it's parsed), or it can start with % and end with ; for parameter entity references, which are used in DTDs (as we'll see in the next chapter).

Using entity references, some of the markup in a document can become character data when you process that document. For example, the markup &gt; is a general entity reference that is turned into > when parsed. Likewise, the markup &lt; is turned into < when parsed. Here's an example:

Listing ch02_02.xml
    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            This text is inside the &lt;GREETING&gt; element.
        </GREETING>
    </DOCUMENT>

You can see this XML document in Internet Explorer in Figure 2-1, where the markup &lt; was turned into < and the markup &gt; was turned into >.

Figure 2-1. Using markup in Internet Explorer.
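The same replacement can be observed outside a browser. The following is a minimal sketch, assuming Python and its standard-library xml.etree.ElementTree parser (neither is part of the original listing); the variable names are illustrative only. It parses the document from Listing ch02_02.xml and reads back the text of the <GREETING> element.

```python
import xml.etree.ElementTree as ET

# The document from Listing ch02_02.xml, with the entity references intact.
doc = """<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        This text is inside the &lt;GREETING&gt; element.
    </GREETING>
</DOCUMENT>"""

# Parse from bytes so the encoding declaration is honored by the parser.
root = ET.fromstring(doc.encode("utf-8"))
greeting = root.find("GREETING")

# After parsing, &lt; and &gt; have been replaced by < and >.
# strip() only removes the indentation whitespace around the text.
greeting_text = greeting.text.strip()
print(greeting_text)

# The replaced text is character data, not a nested element:
# <GREETING> has no child elements.
child_count = len(greeting)
print(child_count)
```

The point of the second check is that the parser does not treat the replaced < and > as the start of a new tag; they arrive in the application as ordinary text.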
Because some markup can turn into character data when parsed, the character data that results after everything has been parsed (and markup that should be replaced by character data has been replaced) has a special name: parsed character data.

Whitespace
If you've ever been concerned about exactly what characters are legal in XML documents, you'll find them listed in the XML 1.0 specification, under the production named Char. It's worth noting that spaces, carriage returns, linefeeds, and tabs are all treated as whitespace in XML. So, practically speaking, this document:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT>
        <GREETING>
            Hello From XML
        </GREETING>
        <MESSAGE>
            Welcome to the wild and woolly world of XML.
        </MESSAGE>
    </DOCUMENT>

is equivalent to this one:

    <?xml version="1.0" encoding="UTF-8"?>
    <DOCUMENT><GREETING>Hello From XML</GREETING>
    <MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE></DOCUMENT>

(A parser still passes the whitespace inside elements on to the application, which is free to ignore it, so the two documents are equivalent only to applications that do ignore it.)

It's also worth noting that the XML recommendation specifies that line endings are normalized to the Unix convention before a document is parsed, which means that lines are ended with a linefeed character only (ASCII code 10). In DOS files, lines are ended with carriage-return linefeed pairs (ASCII codes 13 and 10), but when parsed, each pair is treated simply as a single linefeed (ASCII code 10).
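Both points can be checked with a short script. This is a sketch under stated assumptions: it uses Python's standard-library xml.etree.ElementTree parser, and the helper name stripped_text is hypothetical, introduced only for this illustration. It compares the two documents after discarding the whitespace around each element's text, and then parses a document with DOS line endings.

```python
import xml.etree.ElementTree as ET

def stripped_text(xml_bytes):
    """Return (tag, text) pairs, ignoring whitespace around each element's text."""
    root = ET.fromstring(xml_bytes)
    return [(el.tag, el.text.strip())
            for el in root.iter()
            if el.text and el.text.strip()]

spread_out = b"""<?xml version="1.0" encoding="UTF-8"?>
<DOCUMENT>
    <GREETING>
        Hello From XML
    </GREETING>
    <MESSAGE>
        Welcome to the wild and woolly world of XML.
    </MESSAGE>
</DOCUMENT>"""

compact = (b'<?xml version="1.0" encoding="UTF-8"?>\n'
           b'<DOCUMENT><GREETING>Hello From XML</GREETING>\n'
           b'<MESSAGE>Welcome to the wild and woolly world of XML.</MESSAGE></DOCUMENT>')

# The parser reports the indentation as part of the text, so the comparison
# holds only once the application strips that whitespace itself.
same_content = stripped_text(spread_out) == stripped_text(compact)
print(same_content)

# A document with DOS line endings: carriage return (13) plus linefeed (10).
dos_doc = b"<DOCUMENT>line one\r\nline two</DOCUMENT>"
dos_text = ET.fromstring(dos_doc).text

# After parsing, the pair has been reduced to a single linefeed.
print(repr(dos_text))
```

Note that the whitespace comparison is a choice made by this script, not by the parser: an application that cares about indentation would see the two documents as different.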
That gets us started with what can go into XML documents: markup and character data. It's time to move to the next step up now and begin working on the actual structure of XML documents, starting with the prolog.