Encoding Special Characters in Web Output

16.5.1 Problem

Certain characters are special in HTML pages and must be encoded if you want to display them literally. Because database content often contains these characters, scripts that include query results in web pages should encode those results to prevent browsers from misinterpreting the information.

16.5.2 Solution

Use the methods that are provided by your API for performing HTML-encoding and URL-encoding.

16.5.3 Discussion

HTML is a markup languageit uses certain characters as markers that have a special meaning. To include literal instances of these characters in a page, you must encode them so that they are not interpreted as having their special meanings. For example, < should be encoded as < to keep a browser from interpreting it as the beginning of a tag. Furthermore, there are actually two kinds of encoding, depending on the context in which you use a character. One encoding is appropriate for general HTML text, another is used for text that is part of a URL in a hyperlink.

The MySQL show-tables scripts shown in Recipe 16.3 and Recipe 16.4 are simple demonstrations of how to produce web pages using programs. But with one exception, the scripts have a common failing: they take no care to properly encode special characters that occur in the information retrieved from the MySQL server. (The exception is the JSP version of the script; the tag used there handles encoding automatically, as we'll discuss shortly.)

As it happens, I deliberately chose information to display that is unlikely to contain any special characters, so they should work properly even in the absence of any encoding. However, in the general case, it's unsafe to assume that a query result will contain no special characters and thus you must be prepared to encode it. Neglecting to do this often results in scripts that generate pages containing malformed HTML that displays incorrectly.

This section describes how to handle special characters, beginning with some general principles, and then discusses how each API implements encoding support. The API-specific examples show how to process information drawn from a database table, but they can be adapted to any content you include in a web page, no matter its source.

16.5.4 General Encoding Principles

One form of encoding applies to characters that are used in writing HTML constructs, another applies to text that is included in URLs. It's important to understand this distinction so that you don't encode text inappropriately. Note too that encoding text for inclusion in a web page is an entirely different issue than encoding special characters in data values for inclusion in a SQL statement. The latter issue is discussed in Recipe 2.8.

16.5.4.1 Encoding characters that are special in HTML

HTML markup uses < and > characters to begin and end tags, & to begin special entity names (such as   to signify a non-breaking space), and " to quote attribute values in tags (such as

). Consequently, to display literal instances of these characters, you must encode them as HTML entities so that browsers or other clients understand your intent. To do this, convert <, >, &, and " to the corresponding HTML entity designators < (less than), > (greater than), & (ampersand), and " (quote).

Suppose you want to display the following string literally in a web page:

Paragraphs begin and end with

&

tags.

If you send this text to the client browser exactly as shown, the browser will misinterpret it. (The

and

tags will be taken as paragraph markers and the & may be taken as the beginning of an HTML entity designator.) To display the string the way you intend, the special characters should be encoded as the <, >, and &, entities:

Paragraphs begin and end with <p> & </p> tags.

The principle of encoding text this way is also useful within tags. For example, HTML tag attribute values usually are enclosed within double quotes, so it's important to perform HTML-encoding on attribute values. Suppose you want to include a text input box in a form, and you want to provide an initial value of Rich "Goose" Gossage to be displayed in the box. You cannot write that value literally in the tag like this:

 

The problem here is that the double-quoted value attribute includes internal double quotes, which makes the tag malformed. The proper way to write it is to encode the double quotes:

 

When a browser receives this text, it will decode the " entities back to " characters and interpret the value attribute value properly.

16.5.4.2 Encoding characters that are special in URLs

URLs for hyperlinks that occur within HTML pages have their own syntax, and their own encoding. This encoding applies to attributes within several tags:

<a href="<TT><I>URL</I></TT>"> </a>

<a href="<TT><I>URL</I></TT>">Many characters have special meaning within URLs, such as :, /, ?, =, &, and ;. The following URL contains some of these characters: </a>

<a href="<TT><I>URL</I></TT>">http://apache.snake.net/myscript.php?id=428&name=Gandalf</a>

<a href="<TT><I>URL</I></TT>">Here the : and / characters segment the URL into components, the ? character indicates that parameters are present, and the & characters separates the parameters, each of which is specified as a name=value pair. (The ; character is not present in the URL just shown, but commonly is used instead of & to separate parameters.) If you want to include any of these characters literally within a URL, you must encode them to prevent the browser from interpreting them with their usual special meaning. Other characters such as spaces require special treatment as well. Spaces are not allowed within a URL, so if you want to reference a page named my home page.html on the site apache.snake.net, the URL in the following hyperlink won't work: </a>

<a href="http://apache.snake.net/my home page.html">My Home Page</a>

URL-encoding for special and reserved characters is performed by converting each such character to % followed by two hexadecimal digits representing the character's ASCII code. For example, the ASCII value of the space character is 32 decimal, or 20 hexadecimal, so you'd write the preceding hyperlink like this:

<a href="http://apache.snake.net/my%20home%20page.html">My Home Page</a>

Sometimes you'll see spaces encoded as + in URLs. This too is legal.

16.5.4.3 Encoding interactions

Be sure to encode information properly for the context in which you're using it. Suppose you want to create a hyperlink to trigger a search for items matching a search term, and you want the term itself to appear as the link label that is displayed in the page. In this case, the term appears as a parameter in the URL, and also as HTML text between the and tags. If the search term is "cats & dogs", the unencoded hyperlink construct looks like this:

<a href="/cgi-bin/myscript?term=cats & dogs">cats & dogs</a>

That is incorrect because & is special in both contexts and the spaces are special in the URL. The link should be written like this instead:

<a href="/cgi-bin/myscript?term=cats%20%26%20dogs">cats & dogs</a>

Here, & is HTML-encoded as & for the link label, and is URL-encoded as %26 for the URL, which also includes spaces encoded as %20.

Granted, it's a pain to encode text before writing it to a web page, and sometimes you know enough about a value that you can skip the encoding (see the sidebar Do You Always Need to Encode Web Page Output?"). But encoding is the safe thing to do most of the time. Fortunately, most APIs provide functions to do the work for you. This means you need not know every character that is special in a given context. You just need to know which kind of encoding to perform, and call the appropriate function to produce the intended result.

Do You Always Need to Encode Web Page Output?

If you know a value is legal in a particular context within a web page, you need not encode it. For example, if you obtain a value from an integer-valued column in a database table that cannot be NULL, it must necessarily be an integer. No HTML- or URL-encoding is needed to include the value in a web page, because digits are not special in HTML text or within URLs. On the other hand, suppose you solicit an integer value using a field in a web form. You might be expecting the user to provide an integer, but the user might be confused and enter an illegal value. You could handle this by displaying an error page that shows the value and explains that it's not an integer. But if the value contains special characters and you don't encode it, the page won't display the value properly, possibly confusing the user further.

16.5.5 Encoding Special Characters Using Web APIs

The following encoding examples show how to pull values out of MySQL and perform both HTML-encoding and URL-encoding on them to generate hyperlinks. Each example reads a table named phrase that contains short phrases, using its contents to construct hyperlinks that point to a (hypothetical) script that searches for instances of the phrases in some other table. The table looks like this:

mysql> SELECT phrase_val FROM phrase ORDER BY phrase_val; +--------------------------+ | phrase_val | +--------------------------+ | are we "there" yet? | | cats & dogs | | rhinoceros | | the whole > sum of parts | +--------------------------+

The goal here is to generate a list of hyperlinks using each phrase both as the hyperlink label (which requires HTML-encoding) and in the URL as a parameter to the search script (which requires URL-encoding). The resulting links look like this:

<a href="/cgi-bin/mysearch.pl?phrase=are%20we%20%22there%22%20yet%3F"> are we "there" yet?</a> <a href="/cgi-bin/mysearch.pl?phrase=cats%20%26%20dogs"> cats & dogs</a> <a href="/cgi-bin/mysearch.pl?phrase=rhinoceros"> rhinoceros</a> <a href="/cgi-bin/mysearch.pl?phrase=the%20whole%20%3E%20sum%20of%20parts"> the whole > sum of parts</a>

The links produced by some APIs will look slightly different, because they encode spaces as + rather than as %20.

16.5.5.1 Perl

The Perl CGI.pm module provides two methods, escapeHTML( ) and escape( ), that handle HTML-encoding and URL-encoding. There are three ways to use these methods to encode a string $str:

I prefer the last alternative because it is consistent with the CGI.pm function call interface that you use for other imported method names. Just remember to include the encoding method names in the use CGI statement for any Perl script that requires them, or you'll get "undefined subroutine" errors when the script executes.

The following code reads the contents of the phrase table and produces hyperlinks from them using escapeHTML( ) and escape( ):

my $query = "SELECT phrase_val FROM phrase ORDER BY phrase_val"; my $sth = $dbh->prepare ($query); $sth->execute ( ); while (my ($phrase) = $sth->fetchrow_array ( )) { # URL-encode the phrase value for use in the URL # HTML-encode the phrase value for use in the link label my $url = "/cgi-bin/mysearch.pl?phrase=" . escape ($phrase); my $label = escapeHTML ($phrase); print a ({-href => $url}, $label) . br ( ) . " "; }

16.5.5.2 PHP

In PHP, the htmlspecialchars( ) and urlencode( ) functions perform HTML-encoding and URL-encoding. They're used as follows:

$query = "SELECT phrase_val FROM phrase ORDER BY phrase_val"; $result_id = mysql_query ($query, $conn_id); if ($result_id) { while (list ($phrase) = mysql_fetch_row ($result_id)) { # URL-encode the phrase value for use in the URL # HTML-encode the phrase value for use in the link label $url = "/mcb/mysearch.php?phrase=" . urlencode ($phrase); $label = htmlspecialchars ($phrase); printf ("<a href="">%s</a> ", $url, $label); } mysql_free_result ($result_id); }

16.5.5.3 Python

In Python, the cgi and urllib modules contain the relevant encoding methods. cgi.escape( ) performs HTML-encoding and urllib.quote( ) does URL-encoding:

import cgi import urllib query = "SELECT phrase_val FROM phrase ORDER BY phrase_val" cursor = conn.cursor ( ) cursor.execute (query) for (phrase,) in cursor.fetchall ( ): # URL-encode the phrase value for use in the URL # HTML-encode the phrase value for use in the link label url = "/cgi-bin/mysearch.py?phrase=" + urllib.quote (phrase) label = cgi.escape (phrase, 1) print "<a href="">%s</a> " % (url, label) cursor.close ( )

The first argument to cgi.escape( ) is the string to be HTML-encoded. By default, this function converts <, >, and & characters to their corresponding HTML entities. To tell cgi.escape( ) also to convert double quotes to the " entity, pass a second argument of 1, as shown in the example. This is especially important if you're encoding values to be placed into a double-quoted tag attribute.

16.5.5.4 Java

The JSTL tag automatically performs HTML-encoding for JSP pages. (Strictly speaking, it performs XML-encoding, but the set of characters affected is <, >, &, ", and ', which includes all those needed for HTML-encoding.) By using to display text in a web page, you need not even think about converting special characters to HTML entities. If for some reason you want to suppress encoding, invoke like this:

 

To URL-encode parameters for inclusion in a URL, use the tag. Specify the URL string in the tag's value attribute, and include any parameter values and names in tags in the body of the tag. A parameter value can be given either in the value attribute of a tag or in its body. Here's an example that shows both ways:

sky blue

This will URL-encode the values of the id and color parameters and add them to the end of the URL. The result is placed in an object named urlStr, which you can display as follows:

 

The tag does not encode special characters such as spaces in the string supplied in its value attribute. You must encode them yourself, so it's probably best just to avoid creating pages with spaces in their names, to avoid the likelihood that you'll need to refer to them.

The and tags can be used as follows to display entries from the phrase table:

SELECT phrase_val FROM phrase ORDER BY phrase_val <%-- URL-encode the phrase value for use in the URL --%> <%-- HTML-encode the phrase value for use in the link label --%> <a href="<c:out value=">"> </a>

Категории