Professional XML (Programmer to Programmer)

Although many people think of XML as a data format, many of the important uses for XML are in layout. Of these, one of the most significant is XHTML, or the Extensible HyperText Markup Language. XHTML is the "XML-ized" version of HTML, cleaning up many of the sloppier features of HTML and creating a more standardized, more easily validated document format. The Cascading Stylesheets (CSS) feature, although not an XML format, is widely viewed as important for XHTML development. CSS is a formatting language that can be used with either HTML or XHTML. It is generally viewed as a cleaner replacement for the Font tag and other similar devices that force a particular view. When used in combination with XHTML, the model is that the XHTML document carries all the content of the page, whereas CSS is used to format it. This chapter looks at these two sets of specifications, as well as some validation tools that help ensure your code is valid. In addition, this chapter looks at microformats, a relatively recent set of uses for both XHTML and CSS.

Understanding XHTML

When people hear that XHTML is the XML version of HTML, the first question is usually, "Isn't HTML already XML?" or "What's wrong with HTML that it has to be XML-ized?" I hope that I'll be able to answer both these questions and more in this chapter. For those who are planning on skipping this chapter or who want the answers now, the answers are, "sort of, but not exactly" and "a few fairly major things."

The Evolution of Markup

Markup is information added to text to describe the text. In HTML and XHTML, these are the tags (for example, <b></b>) that are added around the text. However, markup isn't just HTML and its family. Rich Text Format (RTF) is another example of a markup language. The text, "This is bold, and this isn't" could be marked up in RTF as {\b\insrsid801189\charrsid801189 This is bold}{\insrsid13238650\charrsid13238650, and this isn't}. Other markup languages include TeX and ASN.1. Markup, therefore, is just a way of adding formatting and semantic information. Formatting information includes identifiers such as bold, italic, first level of heading, or beginning of a table. Semantic information includes identifiers such as beginning of a section, a list item or similar notations.

The idea of markup is quite old: separate the content from the description of that content. A number of implementations using this concept arose back in the stone ages of computing (the 1960s), including Standard Generalized Markup Language (SGML). SGML was a strategy for defining markup. That is, you used SGML to define the tags and attributes that someone else could use to mark up a document. This notion was powerful, enabling the production of documents that could be rendered easily in a number of formats.

SGML begat HTML, and it was good. HTML was a markup language loosely defined on the concepts of SGML. It lifted the tagging concept, but simplified it greatly because HTML was intended solely as a means of displaying text on computer screens. Later versions attempted to increase the rigor of the standard, for example, creating a Document Type Definition (DTD, the format SGML used as the means of defining a markup language). HTML slowly evolved in a fairly organic fashion: first adding tags and then becoming a standard (4.01). Meanwhile, on an almost parallel track, SGML begat XML, and it was good. XML was an attempt to simplify SGML, creating a technology that provided many of the same capabilities of language definition. Although it wasn't necessarily inevitable, these two cousins decided to get together and produce an offspring, XHTML. XHTML has XML's eye for rigor: XHTML documents must be well-formed XML documents first, and rules around formatting are specific. However, XHTML still has HTML's looks and broad appeal.

The Basics of XHTML

Unfortunately, no one XHTML standard exists. In fact, there are currently six flavors or versions of XHTML:

This chapter focuses mostly on XHTML 1.0 Strict and XHTML 1.1, primarily 1.1. The remaining current versions are primarily compatibility versions, meant to assist developers in migrating older code. XHTML 2.0 is still in the future, and even the planned broken compatibility may change before it becomes a standard.

Validating XHTML

The one main improvement of XHTML over HTML is in enforcement of what constitutes a valid document. XHTML requires that a document follow these rules:

Listing 3-1: Using CDATA with embedded script

<script type="text/JavaScript">
<![CDATA[
//JavaScript content here
]]>
</script>
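To see why the CDATA wrapper in Listing 3-1 matters, here is a minimal sketch using Python's standard xml.etree.ElementTree module. Python is my choice for illustration and is not something the listing depends on. The sample script body and the helper names are hypothetical. The sketch shows that raw < and & characters in a script break well-formedness, whereas the same text inside a CDATA section is treated as plain character data.

```python
import xml.etree.ElementTree as ET

# Hypothetical script body containing characters that are special in XML.
SCRIPT_BODY = "if (a < b && b < c) { alert('ok'); }"


def is_well_formed(fragment: str) -> bool:
    """Return True if the fragment parses as XML, False otherwise."""
    try:
        ET.fromstring(fragment)
        return True
    except ET.ParseError:
        return False


def script_text(fragment: str) -> str:
    """Return the character data held directly by the root element."""
    return ET.fromstring(fragment).text or ""


# Without CDATA, the raw < and & make the fragment ill-formed XML.
unprotected = '<script type="text/javascript">' + SCRIPT_BODY + "</script>"

# With CDATA, the parser passes the script body through untouched.
protected = (
    '<script type="text/javascript"><![CDATA['
    + SCRIPT_BODY
    + "]]></script>"
)

if __name__ == "__main__":
    print(is_well_formed(unprotected))
    print(is_well_formed(protected))
    print(script_text(protected))
```

The one thing a CDATA section cannot contain is the sequence ]]>, so a script that includes it would still need to be split or moved to an external file.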

The next major set of changes you need to make to convert your HTML pages to XHTML is to remove some of the deprecated HTML tags. XHTML 1.0 (especially the Transitional and Frameset varieties) still permits these elements, but they are invalid in future versions, including XHTML 1.1. (See the following table for more discussion of the deprecated elements.) Most of these elements were removed because they caused an intermixing of content and specific layout. The recommended method of adding layout is now with CSS, as you learn later in this chapter. See Listing 3-2 for a simple XHTML 1.1 file.

| Deprecated Element | Replacement | Discussion |
| --- | --- | --- |
| applet, embed | object | Applet, object, and embed were all methods for including content such as Java Applets and ActiveX objects. Rather than maintain these three elements, the object element is used for embedding any external objects. |
| dir, menu | ul | Dir and Menu were little-used elements that provided much of the same functionality as unordered lists (ul). |
| font, basefont, blockquote, i, strike, center | CSS | These elements enforced a particular view on the content of a page and merged the content with layout. This functionality is now superseded by CSS, and you should use that technology instead. Browsers (such as screen readers for the sight impaired) are free to ignore the CSS, if necessary, leaving the content usable. |
| layer | CSS | A Netscape/Mozilla-specific tag that was used to create dynamic HTML pages. The functionality is roughly replaceable with div and span tags. |
| isindex | input type= | This ancient tag (that I haven't seen for a while) was used to create a search field on a page. This should be replaced with a form containing search fields and "real" server-side search functionality. |
| style (attribute) | CSS | With XHTML 1.1, the style attribute is also considered deprecated. Although it is not yet removed from the standard, it should be avoided. Instead, use id or class attributes and CSS to apply style to individual elements. |
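When converting a large set of pages, it can help to find out which of these elements a page actually uses before editing it. The following is a rough sketch using Python's standard html.parser module (my choice of tool, not part of the chapter's toolset). The DEPRECATED set is only a subset of the table, and the class and function names are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# A subset of the deprecated elements from the table, chosen for illustration.
DEPRECATED = {
    "applet", "embed", "dir", "menu", "font",
    "basefont", "strike", "center", "layer", "isindex",
}


class DeprecatedFinder(HTMLParser):
    """Count deprecated elements and inline style attributes in a page."""

    def __init__(self):
        super().__init__()
        self.found = Counter()
        self.inline_styles = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this,
        # so <Center> and <CENTER> are both reported as "center".
        if tag in DEPRECATED:
            self.found[tag] += 1
        if any(name == "style" for name, _ in attrs):
            self.inline_styles += 1


def find_deprecated(markup):
    """Return (element counts, number of elements carrying a style attribute)."""
    finder = DeprecatedFinder()
    finder.feed(markup)
    finder.close()
    return dict(finder.found), finder.inline_styles


if __name__ == "__main__":
    sample = (
        "<body><Center><font face='arial' size=2>Hi</font></CENTER>"
        "<p style='color:red'>x</p></body>"
    )
    print(find_deprecated(sample))
```

Because html.parser is tolerant of unquoted attributes and unclosed tags, this kind of scan works on legacy HTML that an XML parser would reject outright.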

Finally, to ensure your document is processed in the format you intend, you should include a reference to the DTD of the desired level of XHTML. This provides information to the browser or parser, which should then treat your document appropriately. The following table shows the expected DTD.

| XHTML Level | DocType Declaration |
| --- | --- |
| XHTML 1.0 Transitional | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd"> |
| XHTML 1.0 Frameset | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd"> |
| XHTML 1.0 Strict | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"> |
| XHTML Basic | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML Basic 1.0//EN" "http://www.w3.org/TR/xhtml-basic/xhtml-basic10.dtd"> |
| XHTML 1.1 | <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"> |
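Because the public identifier in the DOCTYPE names the XHTML level, a script can read it back to report which level a page claims to follow. The sketch below is one possible approach in Python, using only the standard re module. It uses the public identifiers as published by the W3C, which have a space between "XHTML" and the version number. The function name and the lookup table are my own illustrative choices.

```python
import re

# Public identifiers for the XHTML levels in the table, as published by the W3C.
PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "XHTML 1.0 Transitional",
    "-//W3C//DTD XHTML 1.0 Frameset//EN": "XHTML 1.0 Frameset",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "XHTML 1.0 Strict",
    "-//W3C//DTD XHTML Basic 1.0//EN": "XHTML Basic",
    "-//W3C//DTD XHTML 1.1//EN": "XHTML 1.1",
}

# Matches the opening of a DOCTYPE declaration and captures the public identifier.
DOCTYPE_RE = re.compile(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"')


def xhtml_level(document):
    """Return the XHTML level named by the DOCTYPE, or None if unrecognized."""
    match = DOCTYPE_RE.search(document)
    if match is None:
        return None
    return PUBLIC_IDS.get(match.group(1))


if __name__ == "__main__":
    page = (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"\n'
        '    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n'
        '<html xmlns="http://www.w3.org/1999/xhtml"></html>'
    )
    print(xhtml_level(page))
```

This only reports what the document declares; whether the document actually conforms to that DTD is a separate question answered by a validator.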

Listing 3-2: A simple XHTML 1.1 file

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head>
    <title>Some Title</title>
  </head>
  <body>
    <p>Page Content</p>
  </body>
</html>

How can you ensure your documents are valid? By validating them, of course. That seems to be a circular argument, doesn't it? A number of XHTML validation services and applications are available to ensure the documents you create are both well-formed and valid, most notably the W3C Validation service and Tidy.
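Well-formedness and validity are separate checks, and the first can be done with any XML parser. As a minimal sketch, the following uses Python's standard xml.etree.ElementTree module to report the first well-formedness error in a document. It does not read the DTD, so it says nothing about validity. The function name and the two sample documents are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional, Tuple


def first_xml_error(document: bytes) -> Optional[Tuple[int, int, str]]:
    """Return (line, column, message) for the first error, or None if well-formed."""
    try:
        ET.fromstring(document)
    except ET.ParseError as err:
        line, column = err.position
        return line, column, str(err)
    return None


# A small XHTML 1.1 page: every element is closed and properly nested.
GOOD = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
    "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
  <head><title>Some Title</title></head>
  <body><p>Page Content</p></body>
</html>"""

# Typical legacy HTML: overlapping <b>/<i> tags and an unclosed <hr>.
BAD = b"<body><p><b><i>Lorem ipsum</b></i></p><hr></body>"

if __name__ == "__main__":
    print(first_xml_error(GOOD))
    print(first_xml_error(BAD))
```

A document that passes this check can still be invalid, for example by using an element the DTD does not allow, which is why the validation tools described next are still needed.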

W3C Validation Service

Because the W3C is the standards body responsible for HTML and XHTML, it seems appropriate that it offers a service for validating XHTML documents. This service (see Figure 3-1) is available at http://www.validator.w3.org, and enables checking a document by URL, file upload, or text input.

Figure 3-1

Tidy

Tidy is an application that was initially developed at the W3C, but later was taken over by the broader development community. It is a command-line application (see Figure 3-2 for some of the command-line arguments) that can validate a document, return a list of errors, or correct the errors. In addition, a number of wrappers are available that provide direct access to the functionality from the programming language of your choice.

Figure 3-2

The two most common uses for Tidy are to create new, compliant versions of your Web pages, and to clean up errors and formatting. Listing 3-3 shows an HTML file that contains a number of issues. Although a browser would still display this file (see Figure 3-3), you can use Tidy to clean up its problems and convert the document to XHTML.

Figure 3-3

Listing 3-3: Not very valid HTML

<head>
<title>Lorem ipsum dolor sit amet, consectetuer adipiscing elit</title></head>
<body lang=EN-US BGCOLOR=white text=black link=blue vlink=purple>
<p><b><i>Lorem ipsum dolor sit amet</b></i>, consectetuer adipiscing elit. Suspendisse sit amet odio. Duis porta pulvinar arcu. Curabitur pellentesque, neque id hendrerit volutpat, ante nulla mattis lacus, sit amet varius augue orci a enim. Suspendisse ornare purus ac nunc. Maecenas cursus congue libero. Aliquam erat volutpat. Nulla interdum dui. Ut purus. Donec pellentesque lorem vitae purus. Pellentesque ultricies consectetuer nisl. Nulla facilisi. Etiam aliquam adipiscing sem. Nam metus ipsum, nonummy eget, vestibulum quis, fringilla non, nulla. Suspendisse placerat tempor tortor. Mauris tortor dolor, sollicitudin eget, gravida rhoncus, vestibulum vel, eros. Proin vitae nunc vel metus mattis viverra. Pellentesque at turpis vel quam laoreet dapibus. Maecenas interdum metus nec eros. Nam ut elit eu nisl ullamcorper tincidunt. Praesent faucibus pede in risus feugiat viverra.</p>
<hr>
<p><font face="arial" size=2>Integer vulputate nibh. Mauris convallis nisi vitae magna. Sed varius, velit eu pretium porta, enim tellus ornare ipsum, vel interdum nisi tellus vitae massa.</font></p>
<p>Maecenas imperdiet nunc sed ipsum.</p>
<li>Cras euismod, lorem et rhoncus placerat, felis nibh lobortis lorem, id eleifend felis eros rutrum dolor.
<li>Nunc euismod, nunc viverra porttitor imperdiet, nibh tellus convallis erat, sit amet laoreet neque nunc ac purus.</li>
</ul>
<Center>
<table border=1>
<tr>
<td width=197 valign=top style='width:2.05in;border:solid windowtext 1.0pt;padding:0in 5.4pt 0in 5.4pt'>
<p>Ut ut lectus</p>
<td width=197 valign=top style='width:2.05in;border:solid windowtext 1.0pt;border-left:none;padding:0in 5.4pt 0in 5.4pt'>
<p>&nbsp;Nunc velit dui, fermentum quis, condimentum viverra, adipiscing quis, nisl</p>
<td><p>&nbsp;Curabitur feugiat</p></tr><tr><td><p>&nbsp;Aliquam libero</p>
<td>
<p>&nbsp;Maecenas at enim</p>
<td><p>Nunc non nulla a nulla molestie ornare&copy;</p>
</table>
</CENTER>
</body>

In the preceding code, a number of errors are present in the HTML (such as a missing root html tag, missing close tags for the last tr, and so on). Also, a number of items that are valid HTML items are not valid in XHTML. For example, the hr tag is an empty tag; therefore, it should be written <hr />. In addition, many unquoted attributes are present, and the center tag is written in mixed case in one place and in all uppercase elsewhere.

Converting a document as shown in Listing 3-3 is not an uncommon task, but it can be quite difficult to do manually. HTML editing software and users have found just too many ways to hide bad code in Web pages. Running Tidy with the following command-line generates the list of warnings in Listing 3-4. As you can see, it detected many of the expected errors, as well as a few others.

tidy -o c:\temp\fixed.htm -f errors.txt -i -w 79 -c -b -asxhtml -utf8 Invalid.htm

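Note: The options set are:

-o c:\temp\fixed.htm: write the cleaned markup to the named output file
-f errors.txt: write errors and warnings to the named file
-i: indent element content
-w 79: wrap text at column 79
-c: replace presentational tags such as font and center with CSS style rules
-b: strip smart quotes, em dashes, and similar characters
-asxhtml: convert the HTML to well-formed XHTML
-utf8: use UTF-8 for both input and output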

Many other command-line options exist. In addition, many other configuration settings alter the output of Tidy. See the documentation for more details. If you want a common set of parameters, it would be easier to create a configuration file for running Tidy. This is a text file, with the configuration elements listed one per line. With this in place, the previous command-line could be simplified to:

tidy -config myconfig.txt Invalid.htm
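If you maintain several Tidy configurations, you may prefer to generate the file from a script. The following Python sketch writes settings one per line in name: value form. The option names in the example (indent, wrap, clean, bare, output-xhtml, char-encoding) are the configuration names I believe correspond to the command-line switches used earlier; treat them as assumptions and check them against the documentation for your Tidy version. The function name is hypothetical.

```python
def render_tidy_config(options):
    """Render a dict of settings as Tidy configuration text, one per line."""
    lines = []
    for name, value in options.items():
        # Check bool first: in Python, bool is a subclass of int.
        if isinstance(value, bool):
            value = "yes" if value else "no"
        lines.append(f"{name}: {value}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Assumed equivalents of: -i -w 79 -c -b -asxhtml -utf8
    settings = {
        "indent": True,
        "wrap": 79,
        "clean": True,
        "bare": True,
        "output-xhtml": True,
        "char-encoding": "utf8",
    }
    with open("myconfig.txt", "w", encoding="utf-8") as handle:
        handle.write(render_tidy_config(settings))
```

Keeping the settings in a dictionary makes it easy to share a base configuration and override one or two values per project.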

Listing 3-4 shows the result of running Tidy on the sample file.

Listing 3-4: Warnings generated

line 1 column 1 - Warning: missing <!DOCTYPE> declaration
line 4 column 7 - Warning: replacing unexpected b by </b>
line 4 column 4 - Warning: replacing unexpected i by </i>
line 3 column 1 - Warning: <li> isn't allowed in <body> elements
line 21 column 2 - Warning: inserting implicit <ul>
line 25 column 1 - Warning: discarding unexpected </ul>
line 21 column 2 - Warning: missing </ul> before <center>
line 27 column 1 - Warning: <table> lacks "summary" attribute
Info: Document content looks like HTML 4.01 Transitional
8 warnings, 0 errors were found!
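Because each warning follows the same "line N column M - Severity: message" pattern, the error file is easy to process in a script, for example to sort messages by line or to fail a build when errors appear. The following Python sketch assumes the report holds one message per line, as in the file written by the -f option. The class and function names are hypothetical.

```python
import re
from typing import List, NamedTuple


class TidyMessage(NamedTuple):
    line: int
    column: int
    severity: str
    text: str


# Matches lines such as: line 4 column 7 - Warning: replacing unexpected b by </b>
MESSAGE_RE = re.compile(r"^line (\d+) column (\d+) - (Warning|Error): (.*)$")


def parse_report(report: str) -> List[TidyMessage]:
    """Extract positioned messages; summary and Info lines are skipped."""
    messages = []
    for raw in report.splitlines():
        match = MESSAGE_RE.match(raw.strip())
        if match:
            line, column, severity, text = match.groups()
            messages.append(TidyMessage(int(line), int(column), severity, text))
    return messages


if __name__ == "__main__":
    sample = (
        "line 1 column 1 - Warning: missing <!DOCTYPE> declaration\n"
        "line 21 column 2 - Warning: inserting implicit <ul>\n"
        "Info: Document content looks like HTML 4.01 Transitional\n"
        "8 warnings, 0 errors were found!\n"
    )
    for message in sorted(parse_report(sample)):
        print(message)
```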

Although the cleaned document may not reflect all the intent of the original (an inappropriate change sometimes occurs), it should be much easier to clean up. Listing 3-5 shows the output of the previous code.

Listing 3-5: Cleaned XHTML output

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta name="generator" content=
  "HTML Tidy for Windows (vers 1 September 2005), see www.w3.org" />
  <title>Lorem ipsum dolor sit amet, consectetuer adipiscing elit</title>
  <style type="text/css">
/*<![CDATA[*/
 body {
  background-color: white;
  color: black;
 }
 :link { color: blue }
 :visited { color: purple }
 div.c4 {text-align: center}
 td.c3 {width:2.05in;border:solid windowtext 1.0pt; border-left:none;padding:0in 5.4pt 0in 5.4pt}
 td.c2 {width:2.05in;border:solid windowtext 1.0pt; padding:0in 5.4pt 0in 5.4pt}
 p.c1 {font-family: arial; font-size: 80%}
/*]]>*/
  </style>
</head>
<body lang="EN-US" xml:lang="EN-US">
  <p><b><i>Lorem ipsum dolor sit amet</i></b>, consectetuer adipiscing elit. Suspendisse sit amet odio. Duis porta pulvinar arcu. Curabitur pellentesque, neque id hendrerit volutpat, ante nulla mattis lacus, sit amet varius augue orci a enim. Suspendisse ornare purus ac nunc. Maecenas cursus congue libero. Aliquam erat volutpat. Nulla interdum dui. Ut purus. Donec pellentesque lorem vitae purus. Pellentesque ultricies consectetuer nisl. Nulla facilisi. Etiam aliquam adipiscing sem. Nam metus ipsum, nonummy eget, vestibulum quis, fringilla non, nulla. Suspendisse placerat tempor tortor. Mauris tortor dolor, sollicitudin eget, gravida rhoncus, vestibulum vel, eros. Proin vitae nunc vel metus mattis viverra. Pellentesque at turpis vel quam laoreet dapibus. Maecenas interdum metus nec eros. Nam ut elit eu nisl ullamcorper tincidunt. Praesent faucibus pede in risus feugiat viverra.</p>
  <hr />
  <p class="c1">Integer vulputate nibh. Mauris convallis nisi vitae magna. Sed varius, velit eu pretium porta, enim tellus ornare ipsum, vel interdum nisi tellus vitae massa.</p>
  <p>Maecenas imperdiet nunc sed ipsum.</p>
  <ul>
    <li>Cras euismod, lorem et rhoncus placerat, felis nibh lobortis lorem, id eleifend felis eros rutrum dolor.</li>
    <li>Nunc euismod, nunc viverra porttitor imperdiet, nibh tellus convallis erat, sit amet laoreet neque nunc ac purus.</li>
  </ul>
  <div class="c4">
    <table border="1">
      <tr>
        <td width="197" valign="top" class='c2'>
          <p>Ut ut lectus</p>
        </td>
        <td width="197" valign="top" class='c3'>
          <p> Nunc velit dui, fermentum quis, condimentum viverra, adipiscing quis, nisl</p>
        </td>
        <td>
          <p> Curabitur feugiat</p>
        </td>
      </tr>
      <tr>
        <td>
          <p> Aliquam libero</p>
        </td>
        <td>
          <p> Maecenas at enim</p>
        </td>
        <td>
          <p>Nunc non nulla a nulla molestie ornare(c)</p>
        </td>
      </tr>
    </table>
  </div>
</body>
</html>

Tidy UI

For those less than comfortable with the command-line, Charles Reitzel created a Windows application to enable working visually with Tidy (see Figure 3-4). This is a handy utility if you have only a small amount of HTML to convert. For larger quantities, the command-line (or one of the code wrappers) is a better solution.

Figure 3-4

Just as with the command-line version, you can easily see the errors and warnings your document generates (see Figure 3-5). Double-clicking the warning or error selects the appropriate line in the edit window.

Figure 3-5

The functionality of Tidy has also been exposed through a number of language wrappers. This allows you to integrate the functionality into your own applications. Wrappers are available for COM, .NET, Java, Perl, Python, and many other languages. See the Tidy home page (http://www.tidy.sourceforge.net/) for the full list.

The included project is a simple text editor that includes the capability to run Tidy (using the .NET wrapper) on the content. It is intentionally simple, but shows how you can integrate the Tidy functionality directly in an application.

First, create a new Windows Forms project. The sample project contains three tabs. The first is an edit window, the second a read-only text box containing the tidied XHTML, and the last is a Web browser window for viewing the resulting content. Next, add a reference to the .NET wrapper (see Figure 3-6). If you receive an error while adding the reference, it may be because the TidyATL.dll is not registered (the .NET wrapper is actually a .NET wrapper of the COM wrapper). Register the TidyATL.dll file using the command-line regsvr32 tidyatl.dll and try adding the reference again.

Figure 3-6

Most of the code in the included project is involved in the menus and file handling. The only code that actually calls the Tidy wrapper is in the TidyText function (see Listing 3-6). This takes a block of HTML, processes it with Tidy, and returns the result (see Figure 3-7). Each of the command-line properties of Tidy is exposed in an enumeration (TidyOptionId). You use the SetOptBool, SetOptInt and SetOptValue methods to set the desired settings. Alternatively, you can load the settings from a configuration file. This file is simply a list containing one parameter per line, along with the value, in the format:

property: value

Figure 3-7

For Boolean values, yes/no, true/false or 1/0 can be used for the value. ParseString loads the HTML, and SaveString returns the cleaned XHTML. You could alternatively use ParseFile and SaveFile to process files on disc or CleanAndRepair to clean a file in place.
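The accepted Boolean spellings are easy to normalize if you ever need to read such a configuration file in your own tools. The following Python sketch parses name: value lines and converts the yes/no, true/false, and 1/0 forms for the options you tell it are Boolean. It cannot know option types on its own, which is why the caller passes the set of Boolean names; those names, like the function name, are hypothetical choices for illustration.

```python
TRUE_WORDS = {"yes", "true", "1"}
FALSE_WORDS = {"no", "false", "0"}


def parse_tidy_config(text, boolean_options):
    """Parse 'name: value' lines into a dict, normalizing the named Booleans."""
    settings = {}
    for raw in text.splitlines():
        if ":" not in raw:
            continue  # skip blank lines and anything that is not a setting
        name, value = raw.split(":", 1)
        name, value = name.strip(), value.strip()
        if name in boolean_options:
            lowered = value.lower()
            if lowered in TRUE_WORDS:
                settings[name] = True
            elif lowered in FALSE_WORDS:
                settings[name] = False
            else:
                raise ValueError(f"{name}: expected a Boolean, got {value!r}")
        else:
            # Non-Boolean values are kept as text; 79 stays "79".
            settings[name] = value
    return settings


if __name__ == "__main__":
    sample = "indent: yes\nclean: 1\nbare: false\nwrap: 79\n"
    print(parse_tidy_config(sample, {"indent", "clean", "bare"}))
```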

Listing 3-6: Using the .NET Tidy wrapper

Private Function TidyText(ByVal text As String) As String
    Dim result As String = String.Empty
    Dim t As New Tidy.Document

    With t
        'set options
        .SetOptBool(TidyOptionId.TidyIndentContent, 1)
        .SetOptBool(TidyOptionId.TidyXhtmlOut, 1)
        .SetOptBool(TidyOptionId.TidyMakeClean, 1)
        .SetOptInt(TidyOptionId.TidyIndentSpaces, 2)
        .SetOptValue(TidyOptionId.TidyCharEncoding, "utf8")

        'or .LoadConfig("tidyconfig.txt")

        'parse and return tidy'd html
        .ParseString(text)
        result = .SaveString()
    End With

    Return result
End Function

The functionality of Tidy and its availability for multiple languages and platforms means you never have an excuse for invalid XHTML pages. Try to develop the habit of running it regularly on your XHTML to ensure it conforms.

The Evil Font Tag

When I first learned HTML, it was a fairly primitive formatting tool. You had your choice of bold, italics, or one of six headline levels. ("And we liked it!") Going further than this meant using the <font> element. Using this element, you could change the look of your Web sites, getting them to look closer to a corporate or other brand, or to make them look more like offline documentation.

However, like a lot of other gifts of technology, things were waiting to bite in this Pandora's Box. Using the <font> element meant that you were hard-coding huge amounts of information directly in the page. Maintaining <font> information as it changed was a chore. In addition, this information was repeated frequently through the document, causing page bloat and slow response times. Fortunately, around the time of HTML 4, CSS came along. As you will soon see, CSS is a way of applying the same type of information as you could using the <font> element (and more), but in a better way. Therefore, the <font> tag has been deprecated, and support for it in browsers will eventually go the way of the <blink> and other extinct HTML elements.
