Flash and XML: A Developer's Guide
Let's begin with a historical perspective.
SGML
Today's World Wide Web is built, of course, on HTML. This method of formatting pages shares ancestry with XML: both derive from the Standard Generalized Markup Language (SGML). SGML grew out of work begun around 1969 to unify the Babel of languages then in place to format documents for publication; it became an ISO standard in 1986. These "markup languages" evolved out of the handwritten typesetting instructions used in the printing industry, and they initially served the same function: to convey formatting decisions from designer to compositor. Dr. Charles Goldfarb, originator of SGML, foresaw a world in which computer systems as well as humans process documents. SGML can explain to a computer the structure of a very complex text. At first it was used to prepare documents for the printing press. Later it was used to format documents to be read by people on their computer screens. Then Goldfarb and his colleagues became interested in preparing texts that were never to reach human eyes: messages from computer to computer, or even from human to computer. In other words, data.
Any SGML object is both document and data. This dual nature increases its value: one aspect emphasizes publishing, the other emphasizes data exchange. As a publishing tool it produces documents of very sophisticated structure and integrity. To data exchange it brings the flexibility of text documents.
Publishing
The publishing model concentrates on documents presented to humans. The markup languages based on SGML have advantages over their predecessors (e.g., HTML over RTF): They allow publishers to create documents that can be viewed in many different systems. By differentiating a heading from the body of text (for example), the standard permits machines with different characteristics to display the same document. The pages don't look the same on all machines, but the basic structure of the document is clearly visible on a graphic workstation or a crude terminal. The empowering philosophy is abstraction, an expression of the function of each element of text rather than its appearance. The presentation software (browser) is free to decide precisely how to best set off a <BLOCKQUOTE> segment or whether <STRONG> text is best seen in bold or all caps.
Document with Only Formatting: RTF
{\rtf1\ansi \deff4\deflang1033
{\fonttbl {\f4\froman\fcharset0\fprq2 Times New Roman;} }
\pard\plain \widctlpar \f4\fs20 \par
\pard \qc\widctlpar {\fs36 Who Needs Titles?\par }
\pard \widctlpar \par
\pard \fi720\widctlpar While the body of a document can be analyzed-by word counting or by sophisticated lexical heuristics-it is still difficult to automatically extract the author\rquote s original intent. It is more difficult still to generate a brief text fragment that telegraphs this meaning. On the other hand, by composing a title, the author has done exactly this. \par
\pard \fi720\widctlpar \par
}
Similar Document with Structure: HTML
<HTML>
<HEAD>
<TITLE>Who Needs Titles?</TITLE>
</HEAD>
<BODY>
<P>Software can attempt to analyze a text, but it is very difficult to automatically extract the author's original intent. It is more difficult still to generate a brief text fragment that conveys this meaning. On the other hand, by composing a title, the author has done exactly this.</P>
</BODY>
</HTML>
SGML is very powerful. You can express almost any document in SGML. Unfortunately, nobody could view your document. SGML is so abstract and so complex that no software can reliably display it.
HTML
Enter HTML. In the early 1990s HTML was created as a simplification of the SGML standard. HTML was a slim SGML intended for real-world use in publishing academic and scientific journals. Its page markup capabilities were extremely simple, and the multimedia capabilities it has today did not exist. HTML had two great strengths. Its simplicity was one. Because it demanded little of the device it was viewed on or of the software required to make the presentation, HTML browsers appeared quickly on many kinds of machines. Its other great strength was the hyperlink. The Internet created a single global computer system. The hyperlink created a single global document: the World Wide Web. Hyperlinks and the Internet combined to make HTML an enormous success. Publishing became immediate: simple HTML documents could be displayed meaningfully on almost any device anywhere. In the brief childhood of the web, interoperability and intelligence were more important than graphic design.
The success of HTML has made the web into a mass medium. With its success, HTML has faced pressures that have separated it from its original ideals. There is a new perspective. For better or worse, attractive graphics, not platform independence or metastructure, drive the web. We want attractive screens. Who cares about the document structure?
Actually, the structure of information is more important than ever. The potential data content of the web is increasing enormously, perhaps even faster than its visual sophistication. But while we have done a great job of making the pages prettier, we have not done so well at the hard work of making them smart. The data content of the web is only potential, because most of its organization is lost in descriptions of presentation. We have HTML descriptions like the one below, which is a fragment of the options list for a new Saturn (from edmunds.com).
The table may delight human eyes, but it ignores the real needs of a shopping robot or an affiliate marketer or a car buyer with a Palm. Consider, for example, that the "UP0" option package is simply the "UL0" package plus a CD player. A human can deduce this from the information scattered in the table. But it would take very complex and unreliable software to figure it out. Dr. Goldfarb's vision that a document is a data structure has been lost.
HTML
<TABLE BORDER="0" cellPadding="0" cellSpacing="5" WIDTH="100%">
<TBODY>
  <TR>
    <TD vAlign="top"><FONT FACE="Arial"><I><B>Option<BR>Code</B></I></FONT></TD>
    <TD vAlign="top"><FONT FACE="Arial"><I><B>Option Name</B></I></FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="250"><FONT FACE="Arial"><I><B>Invoice</B></I></FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="65"><FONT FACE="Arial"><I><B>MSRP</B></I></FONT></TD>
  </TR>
  <TR>
    <TD vAlign="top"><FONT FACE="Arial" SIZE="2"><B>UL0</B></FONT></TD>
    <TD vAlign="top"><FONT FACE="Arial" SIZE="2"><B>Radio: AM/FM Stereo Cassette and Automatic Tone Control (SC)</B><BR> Includes theft protection and 4 coaxial speakers. NOT AVAILABLE with UP0.</FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="250"><FONT FACE="Arial" SIZE="2"></FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="65"><FONT FACE="Arial" SIZE="2">0</FONT></TD>
  </TR>
  <TR>
    <TD vAlign="top"><FONT FACE="Arial" SIZE="2"><B>UP0</B></FONT></TD>
    <TD vAlign="top"><FONT FACE="Arial" SIZE="2"><B>Radio: AM/FM Stereo CD and Cassette (SC)</B><BR> Includes auto tone control, theft protection and 4 coaxial speakers. NOT AVAILABLE with UL0.</FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="250"><FONT FACE="Arial" SIZE="2">8</FONT></TD>
    <TD ALIGN="right" vAlign="top" WIDTH="65"><FONT FACE="Arial" SIZE="2">0</FONT></TD>
  </TR>
</TBODY>
</TABLE>
The result of this HTML is this chart.
XML to the Rescue
XML can represent this data in a variety of ways. The most natural formulations make clear the items included in each option package and can easily express a superset relationship like that of the UL0 and UP0 packages.
XML
<OPTION code="UP0" msrp="198" invoice="220" availability="SC">
  Radio: AM/FM Stereo CD and Cassette
  <ITEM>CD player</ITEM>
  <OPTION code="UL0" msrp="90" invoice="100" availability="SC">
    Radio: AM/FM Stereo Cassette and Automatic Tone Control
    <ITEM>theft protection</ITEM>
    <ITEM>4 coaxial speakers</ITEM>
    <ITEM>AM/FM radio</ITEM>
    <ITEM>cassette player</ITEM>
  </OPTION>
</OPTION>
Data Exchange
Conversely, a data structure is also a document. Consider a few: a library catalog entry, a telephone bill, a class schedule. Each of these is a data object with functional work to do. But people need to read them as well. They exist in database format, and programmers can (and do) make them visible by writing middleware to convert the content to HTML. But if these objects were in XML format, browsers could read them as easily as a web page (and sometimes more easily), and display them as attractively if styling files exist. This is the other approach to Dr. Goldfarb's ideal. The data is self-describing and open. The distinction between document and datagram disappears.
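To make the "both document and data" idea concrete, here is a minimal sketch that reads the option-package XML shown earlier with a standard parser. It is written in Python purely for illustration; the helper names `package_items` and `describe` are hypothetical, and the attribute values are copied as they appear in the listing. The same parsed tree answers a data question (what does UP0 add to UL0?) and produces a plain-text rendering for a human reader.

```python
import xml.etree.ElementTree as ET

# The option fragment from the XML listing, with only whitespace normalized.
OPTIONS_XML = """
<OPTION code="UP0" msrp="198" invoice="220" availability="SC">
  Radio: AM/FM Stereo CD and Cassette
  <ITEM>CD player</ITEM>
  <OPTION code="UL0" msrp="90" invoice="100" availability="SC">
    Radio: AM/FM Stereo Cassette and Automatic Tone Control
    <ITEM>theft protection</ITEM>
    <ITEM>4 coaxial speakers</ITEM>
    <ITEM>AM/FM radio</ITEM>
    <ITEM>cassette player</ITEM>
  </OPTION>
</OPTION>
"""


def package_items(option):
    """Return every ITEM in a package, including items of nested packages."""
    return {item.text.strip() for item in option.iter("ITEM")}


def describe(option, depth=0):
    """Render a package as indented plain-text lines for a human reader."""
    pad = "  " * depth
    lines = [f"{pad}{option.get('code')}: {option.text.strip()}"]
    for child in option:
        if child.tag == "ITEM":
            lines.append(f"{pad}  - {child.text.strip()}")
        elif child.tag == "OPTION":
            lines.extend(describe(child, depth + 1))
    return lines


root = ET.fromstring(OPTIONS_XML)

# Data view: map each package code to the full set of items it contains.
packages = {opt.get("code"): package_items(opt) for opt in root.iter("OPTION")}

# The superset relationship falls out of the nesting: UP0 is UL0 plus extras.
extras = packages["UP0"] - packages["UL0"]
print(extras)  # {'CD player'}

# Document view: the same tree rendered as readable text.
print("\n".join(describe(root)))
```

The nesting does the work here: because UL0 sits inside UP0, collecting items recursively yields the superset without any guessing from descriptive text, which is exactly what the HTML table could not offer a program.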