XML in a Nutshell, Third Edition

     

While you can write markup by hand in a text editor, many non-programmers prefer a friendlier, more WYSIWYG approach. There's no reason a standard word processor can't save its data in XML, and indeed several now do, including Microsoft Word 2003 and OpenOffice.org Writer. Harold also wrote a much smaller book in XML using OpenOffice.org Writer ( Effective XML , Addison Wesley).

For what it's worth, in hindsight I regret that decision. If I were doing it again, I would write the XML by hand in DocBook, as I did with Processing XML with Java, rather than using OpenOffice. As much as good GUI tools can improve productivity, bad GUI tools can hinder it. The mere presence of a GUI is no guarantee of ease of use.

Scott and I wrote this book in Microsoft Word, but mostly because the early editions predated the availability of high-quality XML publishing tools. That decision is hurting us now. For instance, the complicated tables in Chapter 27 are well beyond what Word can comfortably handle. In DocBook, they'd be a no-brainer. If we were starting from scratch, we'd write in DocBook.

Example 6-3 shows a fairly simple OpenOffice document. Again, the content comes from the book you're reading now. This differs from TEI and DocBook in several ways. For instance, it uses namespaces; TEI and DocBook don't. The title of the book and the names of the authors are not included because they'd normally be stored in a separate XML document containing only the metadata. Indexes and tables of contents are generated from the internal structure, content, and markup rather than being added explicitly. Perhaps the most unusual distinction is the lack of section elements of any kind. Instead, different chapters, sections, and subsections are identified by text:h elements with different levels. The contents of a section are everything that follows its text:h element until the next text:h element. Less obvious is that this format is more general because it's designed to handle several other OpenOffice document formats, including charts and spreadsheets, besides simple narrative content.
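Because sections are implicit, a program has to rebuild them by walking the children of office:body in order. The following Python sketch shows one way to do that with the standard library. The function name sections and the SECTION_SAMPLE fragment are illustrative assumptions, not part of OpenOffice; the namespace URIs are the ones declared in Example 6-3.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as declared in Example 6-3.
OFFICE_NS = "http://openoffice.org/2000/office"
TEXT_NS = "http://openoffice.org/2000/text"

# Hypothetical, cut-down fragment used only to exercise the function.
SECTION_SAMPLE = """<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <office:body>
    <text:h text:level="1">TEI</text:h>
    <text:p>First paragraph.</text:p>
    <text:h text:level="1">DocBook</text:h>
    <text:p>Second paragraph.</text:p>
    <text:p>Third paragraph.</text:p>
  </office:body>
</office:document-content>"""


def sections(content_xml):
    """Group paragraphs under the text:h heading that precedes them.

    Returns a list of (level, heading_text, [paragraph_text, ...]) tuples.
    Paragraphs that appear before the first heading are skipped.
    """
    root = ET.fromstring(content_xml)
    body = root.find(f"{{{OFFICE_NS}}}body")
    result = []
    current = None
    for child in body:
        if child.tag == f"{{{TEXT_NS}}}h":
            # A new heading closes the previous section and opens a new one.
            level = int(child.get(f"{{{TEXT_NS}}}level", "1"))
            current = (level, "".join(child.itertext()), [])
            result.append(current)
        elif child.tag == f"{{{TEXT_NS}}}p" and current is not None:
            current[2].append("".join(child.itertext()))
    return result
```

Calling sections(SECTION_SAMPLE) yields one tuple per heading. This sketch keeps the sections flat; nesting subsections inside sections would additionally require comparing the text:level values of successive headings.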

Example 6-3. An OpenOffice document

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE office:document-content PUBLIC
  "-//OpenOffice.org//DTD OfficeDocument 1.0//EN" "office.dtd">
<office:document-content
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:style="http://openoffice.org/2000/style"
    xmlns:text="http://openoffice.org/2000/text"
    xmlns:fo="http://www.w3.org/1999/XSL/Format"
    office:class="text" office:version="1.0">
  <office:script/>
  <office:font-decls>
    <style:font-decl style:name="Courier" fo:font-family="Courier"
                     style:font-pitch="variable"/>
    <style:font-decl style:name="Times" fo:font-family="Times"
                     style:font-pitch="variable"/>
    <style:font-decl style:name="Helvetica" fo:font-family="Helvetica"
                     style:font-family-generic="swiss"
                     style:font-pitch="variable"/>
  </office:font-decls>
  <office:automatic-styles>
    <style:style style:name="P1" style:family="paragraph"
                 style:parent-style-name="ChapterLabel"
                 style:master-page-name="First Page">
      <style:properties style:page-number="0"/>
    </style:style>
    <style:style style:name="T1" style:family="text"
                 style:parent-style-name="WW-Comment Reference">
      <style:properties fo:color="#000000"/>
    </style:style>
    <style:style style:name="T2" style:family="text"
                 style:parent-style-name="emphasis">
      <style:properties fo:language="none" fo:country="none"/>
    </style:style>
    <style:style style:name="T3" style:family="text">
      <style:properties fo:language="none" fo:country="none"/>
    </style:style>
  </office:automatic-styles>
  <office:body>
    <text:sequence-decls>
      <text:sequence-decl text:display-outline-level="0"
                          text:name="Illustration"/>
      <text:sequence-decl text:display-outline-level="0"
                          text:name="Table"/>
      <text:sequence-decl text:display-outline-level="0"
                          text:name="Text"/>
      <text:sequence-decl text:display-outline-level="0"
                          text:name="Drawing"/>
    </text:sequence-decls>
    <text:p text:style-name="ChapterTitle">Introducing XML</text:p>
    <text:p text:style-name="Standard"></text:p>
    <text:p text:style-name="ChapterTitle">XML as a Document Format</text:p>
    <text:p text:style-name="Standard">XML is first and foremost a
    document format. It was always intended for web pages, books,
    scholarly articles, poems, short stories, reference manuals,
    tutorials, textbooks, legal pleadings, contracts, instruction
    sheets, and other documents that human beings would read. Its use
    as a syntax for computer data in applications such as order
    processing, object serialization, database exchange and backup, and
    electronic data interchange is mostly a happy accident.</text:p>
    <text:h text:style-name="Heading 1" text:level="1">SGML's Legacy</text:h>
    <text:p text:style-name="Standard"></text:p>
    <text:h text:style-name="Heading 1" text:level="1">TEI</text:h>
    <text:p text:style-name="Standard"></text:p>
    <text:h text:style-name="Heading 1" text:level="1">DocBook</text:h>
    <text:p text:style-name="Standard">DocBook (<text:span
    text:style-name="online item">http://www.docbook.org/</text:span>)
    <text:alphabetical-index-mark text:string-value="DocBook"
    text:key1="narrative-oriented XML documents"/><text:s/>
    <text:alphabetical-index-mark text:string-value="DocBook"/>is an SGML
    application designed for new documents, not old ones. It's especially
    common in computer documentation. Several O'Reilly books have been
    written in DocBook, including <text:alphabetical-index-mark
    text:string-value="Walsh, Norman"/>Norm Walsh and Leonard
    <text:alphabetical-index-mark text:string-value="Muellner, Leonard"/>Muellner's
    <text:span text:style-name="emphasis">DocBook: The Definitive
    Guide</text:span>. No special tools are required to author it. Much of
    the Linux Documentation Project (LDP, <text:span
    text:style-name="online item">http://www.linuxdoc.org/</text:span>)
    corpus is written in DocBook. </text:p>
    <text:p text:style-name="ChapterTitle">XML on the Web</text:p>
    <text:p text:style-name="Standard"></text:p>
  </office:body>
</office:document-content>

This is actually only one piece of what OpenOffice saves (and it's been cleaned up some for display in this book). OpenOffice bundles up several related XML documents into a zip file and saves that. Before you can work with the raw XML, you'll need to unzip it. Once the file is unzipped, a document like the one shown here is found in the file named content.xml. Other XML documents are used to hold styles, metadata, and settings. These can all be bundled into a single office:document element, but this is normally not done. The separation of content from presentation is a very useful feature of this application.
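Unzipping need not be a manual step. As a minimal sketch, Python's standard zipfile module can read content.xml straight out of the saved file. The helper names here are hypothetical, and make_demo_archive only builds a tiny stand-in archive in memory so the example runs without a real OpenOffice file. The member name styles.xml is an assumption about how the styles document is named; only content.xml is named in the text above.

```python
import io
import zipfile


def read_member(path_or_file, member="content.xml"):
    """Return the raw bytes of one XML document inside a saved file."""
    with zipfile.ZipFile(path_or_file) as archive:
        return archive.read(member)


def list_xml_members(path_or_file):
    """Return the sorted names of all .xml members in the archive."""
    with zipfile.ZipFile(path_or_file) as archive:
        return sorted(n for n in archive.namelist() if n.endswith(".xml"))


def make_demo_archive():
    """Build a small in-memory zip that mimics the layout (assumed names)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("content.xml", "<content/>")
        archive.writestr("styles.xml", "<styles/>")
    buffer.seek(0)
    return buffer
```

With a real document you would pass its path instead of the in-memory buffer, for example read_member("chapter6.sxw"), where the filename is of course only a placeholder. The returned bytes can be handed directly to any XML parser.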

Despite that, overall, OpenOffice is a much more presentationally oriented format than either DocBook or TEI. This makes it more suitable as the file format for a WYSIWYG word processor, but less suitable for manipulation with XML tools such as SAX, DOM, and XSLT. Certainly, you can process an OpenOffice document with these tools; it's just that the markup has less semantics to lever off of. A document is really just headings, paragraphs, lists, and tables (the latter two are not seen in this example). The basic semantics are impoverished compared to either DocBook or TEI. Much of the useful information in an OpenOffice document is tied up in style names rather than element names. If the authors did not use named styles, but simply formatted their document with italics, bold, Helvetica, and the like, then the semantics may well be irretrievable.
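In practice, this means a program usually selects content by the value of the text:style-name attribute rather than by element name. The Python sketch below illustrates the idea. The function text_by_style and the STYLE_SAMPLE fragment are hypothetical, with the style names borrowed from Example 6-3.

```python
import xml.etree.ElementTree as ET

TEXT_NS = "http://openoffice.org/2000/text"
STYLE_ATTR = f"{{{TEXT_NS}}}style-name"

# Hypothetical fragment that reuses style names from Example 6-3.
STYLE_SAMPLE = """<office:body
    xmlns:office="http://openoffice.org/2000/office"
    xmlns:text="http://openoffice.org/2000/text">
  <text:p text:style-name="ChapterTitle">Introducing XML</text:p>
  <text:p text:style-name="ChapterTitle">XML as a Document Format</text:p>
  <text:p text:style-name="Standard">See <text:span
    text:style-name="emphasis">DocBook: The Definitive Guide</text:span>.</text:p>
</office:body>"""


def text_by_style(content_xml, style_name):
    """Return the text of every element whose text:style-name matches.

    Only the named style is checked; automatic styles such as T2, which
    merely inherit from a named style, are not resolved here.
    """
    root = ET.fromstring(content_xml)
    return [
        "".join(element.itertext())
        for element in root.iter()
        if element.get(STYLE_ATTR) == style_name
    ]
```

Here text_by_style(STYLE_SAMPLE, "ChapterTitle") collects the two chapter titles, even though they are ordinary text:p elements. The approach only works as well as the authors' discipline in applying named styles, which is exactly the weakness described above.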
