Mastering Regular Expressions
3.4. Strings, Character Encodings, and Modes
Before getting into the various types of metacharacters generally available, there are a number of global issues to understand: regular expressions as strings, character encodings, and match modes. These are simple concepts, in theory, and in practice, some indeed are. With most, though, the small details, subtleties, and inconsistencies among the various implementations sometimes make it hard to pin down exactly how they work in practice. The next sections cover some of the common and sometimes complex issues you'll face.

3.4.1. Strings as Regular Expressions
The concept is simple: in most languages except Perl, awk, and sed, the regex engine accepts regular expressions as normal strings, often provided as string literals like "^From:(.*)". What confuses many, especially early on, is the need to deal with the language's own string-literal metacharacters when composing a string to be used as a regular expression. Each language's string literals have their own set of metacharacters, and some languages even have more than one type of string literal, so there's no one rule that works everywhere, but the concepts are all the same. Many languages' string literals recognize escape sequences like \t, \\, and \x2A, which are interpreted while the string's value is being composed. The most common regex-related aspect of this is that each backslash in a regex requires two backslashes in the corresponding string literal. For example, "\\n" is required to get the two-character regex \n. If you forget the extra backslash for the string literal and use "\n", with many languages you'd then hand the regex engine a string containing an actual newline character; since most engines treat a literal newline as "match a newline," the mistake can easily go unnoticed.

Table 3-4. A Few String-Literal Examples
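The doubled-backslash requirement is easy to demonstrate; here is a minimal sketch in Python, whose double-quoted string literals follow the escape rules described above (an illustrative aside):

```python
import re

# The string literal "\\n" yields the two characters \ and n,
# which the regex engine then interprets as "match a newline".
regex_from_literal = "\\n"
assert len(regex_from_literal) == 2
assert re.search(regex_from_literal, "line1\nline2") is not None

# Forgetting the extra backslash yields a string holding an actual
# newline character; the engine still matches a newline with it,
# so the mistake can go unnoticed.
forgotten = "\n"
assert len(forgotten) == 1
assert re.search(forgotten, "line1\nline2") is not None
```

Python's raw strings, discussed later in this section, avoid the doubling entirely.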
Every language's string literals are different, but some are quite different in that '\' is not a metacharacter. For example, VB.NET's string literals have only one metacharacter, a double quote. The next sections look at the details of several common languages' string literals. Whatever the individual string-literal rules, the question on your mind when using them should be "what will the regular expression engine see after the language's string processing is done?"

3.4.1.1. Strings in Java
Java string literals are like those presented in the introduction, in that they are delimited by double quotes, and backslash is a metacharacter. Common combinations such as '\t' (tab), '\n' (newline), '\\' (literal backslash), etc. are supported. Using a backslash in a sequence not explicitly supported by literal strings results in an error.

3.4.1.2. Strings in VB.NET
String literals in VB.NET are also delimited by double quotes, but otherwise are quite different from Java's. VB.NET strings recognize only one metasequence: a pair of double quotes in the string literal adds one double quote to the string's value. For example, "he said ""hi""\." results in the value he said "hi"\. (the backslash and period, not being special, are kept as-is).

3.4.1.3. Strings in C#
Although all the languages of Microsoft's .NET Framework share the same regular expression engine internally, each has its own rules about the strings used to create the regular-expression arguments. We just saw Visual Basic's simple string literals. In contrast, Microsoft's C# language has two types of string literals. C# supports the common double-quoted string similar to the kind discussed in this section's introduction, except that \" (rather than VB.NET's "") adds a double quote into the string's value. However, C# also supports "verbatim strings," which look like @"…". Verbatim strings recognize no backslash sequences, but instead just one special sequence: a pair of double quotes inserts one double quote into the target value. This means that you can use "\\t\\x2A" or @"\t\x2A" to create the example regex \t\x2A.

3.4.1.4. Strings in PHP
PHP also offers two types of strings, yet both differ from either of C#'s types. With PHP's double-quoted strings, you get the common backslash sequences like '\n', but you also get variable interpolation as we've seen with Perl (˜77), and also the special sequence {…}, which inserts into the string the result of executing the code between the braces. These extra features of PHP double-quoted strings mean that you'll tend to insert extra backslashes into regular expressions, but there's one additional feature that helps mitigate that need. With Java and C# string literals, a backslash sequence that isn't explicitly recognized as special within strings results in an error, but with PHP double-quoted strings, such sequences are simply passed through to the string's value. PHP strings do recognize \t, though, so you still need "\\t" to get the regex \t. PHP single-quoted strings offer uncluttered strings on the order of VB.NET's strings, or C#'s @"…" strings, but in a slightly different way. Within a PHP single-quoted string, the sequence \' includes one single quote in the target value, and \\ includes a backslash. Any other character (including any other backslash) is not considered special, and is copied to the target value verbatim. This means that '\t\x2A' creates the value \t\x2A, exactly as typed. PHP single-quoted strings are discussed further in Chapter 10 (˜445).

3.4.1.5. Strings in Python
Python offers a number of string-literal types. You can use either single quotes or double quotes to create strings, but unlike PHP, there is no difference between the two. Python also offers "triple-quoted" strings of the form '''…''' and """…""", which are different in that they may contain unescaped newlines. All four types offer the common backslash sequences such as \n, but have the same twist that PHP has in that unrecognized sequences are left in the string verbatim. Contrast this with Java and C# strings, for which unrecognized sequences cause an error. Like PHP and C#, Python offers a more literal type of string, its "raw string." Similar to C#'s @"…" notation, Python uses an 'r' before the opening quote of any of the four quote types. For example, r"\t\x2A" yields the value \t\x2A, exactly as typed.

3.4.1.6. Strings in Tcl
Tcl is different from anything else in that it doesn't really have string literals at all. Rather, command lines are broken into "words," which Tcl commands can then consider as strings, variable names, regular expressions, or anything else as appropriate to the command. While a line is being parsed into words, common backslash sequences like \n are recognized and converted, and backslashes in unknown combinations are simply dropped. You can put double quotes around the word if you like, but they aren't required unless the word has whitespace in it. Tcl also has a raw literal type of quoting similar to Python's raw strings, but Tcl uses braces, {…}, instead of r'…'. Within the braces, everything except a backslash-newline combination is kept as-is, so you can use {\t\x2A} to get the value \t\x2A. Within the braces, you can have additional sets of braces so long as they nest. Those that don't nest must be escaped with a backslash, although the backslash does remain in the string's value.

3.4.1.7. Regex literals in Perl
In the Perl examples we've seen so far in this book, regular expressions have been provided as literals ("regular-expression literals"). As it turns out, you can also provide them as strings. For example:

    $str =~ m/(\w+)/;
can also be written as:

    $regex = '(\w+)';
    $str =~ $regex;
or perhaps:

    $regex = "(\w+)";
    $str =~ $regex;
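The regex-as-string idea is the norm in languages without regex literals; in Python, for example, the pattern is always an ordinary string, optionally precompiled (a Python aside, separate from the Perl discussion):

```python
import re

# The regex is just a string; compiling it once avoids re-parsing
# the pattern each time it is used.
pattern = re.compile(r"(\w+)")
match = pattern.search("hello world")
assert match is not None
assert match.group(1) == "hello"
```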
(Note that using a string can be much less efficient ˜242, 348.) When a regex is provided as a literal, Perl provides extra features that the regular-expression engine itself does not, notably variable interpolation.
In Perl, a regex literal is parsed like a very special kind of string. In fact, these features are also available with Perl double-quoted strings. The point to be aware of is that these features are not provided by the regular-expression engine. Since the vast majority of regular expressions used within Perl are written as regex literals, most think of these features as part of the regex language itself. More details are available in Chapter 7, starting on page 288.

3.4.2. Character-Encoding Issues
A character encoding is merely an explicit agreement on how bytes with various values should be interpreted. A byte with the decimal value 110 is interpreted as the character 'n' with the ASCII encoding, but as '>' with EBCDIC. Why? Because that's what someone decided; there's nothing intrinsic about those values and characters that makes one encoding better than the other. The byte is the same; only the interpretation changes. ASCII defines characters for only half the values that a byte can hold. The ISO-8859-1 encoding (commonly called "Latin-1") fills in the blank spots with accented characters and special symbols, making an encoding usable by a larger set of languages. With this encoding, a byte with a decimal value of 234 is to be interpreted as ê, instead of being undefined as it is with ASCII. The important question for us is: when we intend for a certain set of bytes to be considered in the light of a particular encoding, does the program actually do so? For example, if we have four bytes with the values 234, 116, 101, and 115 that we intend to be considered as Latin-1 (representing the French word "êtes"), we'd like a regex such as êtes to match that text.

3.4.2.1. Richness of encoding-related support
There are many encodings. When you're concerned with a particular one, important questions you should ask include whether the encoding is supported at all, and if so, how fully.
The richness of an encoding's support involves several important issues, including how completely constructs like dot, \w, \d, \s, and \b understand the encoding's characters.
Sometimes things are not as simple as they might seem. For example, the \b of Sun's java.util.regex package properly understands all the word-related characters of Unicode, but its \w does not (it understands only basic ASCII). We'll see more examples of this later in the chapter.

3.4.3. Unicode
There seems to be a lot of misunderstanding about just what "Unicode" is. At the most basic level, Unicode is a character set, or a conceptual encoding: a logical mapping between a number and a character. For example, the Korean character 삵 is mapped to the number 49,333. The number, called a code point, is normally shown in hexadecimal, with "U+" prepended. 49,333 in hex is C0B5, so 삵 is referred to as U+C0B5. Included as part of the Unicode concept is a set of attributes for many characters, such as "3 is a digit," or which uppercase letter corresponds to which lowercase letter. At this level, nothing is yet said about just how these numbers are actually encoded as data on a computer. There are a variety of ways to do so, including the UCS-2 encoding (all characters encoded with two bytes), the UCS-4 encoding (all characters encoded with four bytes), UTF-16 (most characters encoded with two bytes, but some with four), and the UTF-8 encoding (characters encoded with one to six bytes). Exactly which (if any) of these encodings a particular program uses internally is usually not a concern to the user of the program. The user's concern is usually limited to how to convert external data (such as data read from a file) from a known encoding (ASCII, Latin-1, UTF-8, etc.) to whatever the program uses. Programs that work with Unicode usually supply various encoding and decoding routines for doing the conversion. Regular expressions for programs that work with Unicode often support a \unum metasequence that can be used to match a specific Unicode character (˜117). The number is usually given as a four-digit hexadecimal number, so \uC0B5 matches 삵. It's important to realize that \uC0B5 is saying "match the Unicode character U+C0B5," and says nothing about what actual bytes are to be compared, which is dependent on the particular encoding used internally to represent Unicode code points.
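Python's re module, for instance, supports this \u metasequence; a small Python-specific sketch:

```python
import re

# \uC0B5 names the code point itself; nothing here depends on how the
# interpreter represents the string internally.
m = re.search(r"\uC0B5", "ab\uC0B5cd")
assert m is not None
assert m.start() == 2   # the character occupies one position, not N bytes
```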
If the program happens to use UTF-8 internally, that character happens to be represented with three bytes. But you, as someone using the Unicode-enabled program, don't normally need to care. (Sometimes you do, as with PHP's preg suite and its u pattern modifier; ˜447.) There are a few related issues that you may need to be aware of...

3.4.3.1. Characters versus combining-character sequences
What a person considers a "character" doesn't always agree with what Unicode or a Unicode-enabled program (or regex engine) considers to be a character. For example, most would consider à to be a single character, but in Unicode, it can be composed of two code points, U+0061 (a) combined with the grave accent U+0300 (`). Unicode offers a number of combining characters that are intended to follow (and be combined with) a base character. This makes things a bit more complex for the regular-expression engine: for example, should dot match just one code point, or the entire U+0061 plus U+0300 combination? In practice, it seems that many programs treat "character" and "code point" as synonymous, which means that dot matches each code point individually, whether it is a base character or one of the combining characters. Thus, à (U+0061 plus U+0300) is matched by two applications of dot, not one. Perl and PCRE (and by extension, PHP's preg suite) support the \X metasequence, which fulfills what many might expect from dot ("match one character") in that it matches a base character followed by any number of combining characters. See more on page 120. It's important to keep combining characters in mind when using a Unicode-enabled editor to input Unicode characters directly into regular expressions. If an accented character, say à, ends up in a regular expression as 'a' plus the combining accent, it likely can't match a string containing the single-code-point version of à (single-code-point versions are discussed in the next section). Also, it appears as two distinct characters to the regular-expression engine itself, so placing it within a character class, for instance, probably won't do what's intended. In a similar vein, if a two-code-point character like à is followed by a quantifier, the quantifier actually applies only to the second code point, just as if the base character and combining character had been typed out explicitly.

3.4.3.2. Multiple code points for the same character
In theory, Unicode is supposed to be a one-to-one mapping between code points and characters, but there are many situations where one character can have multiple representations. The previous section notes that à is U+0061 followed by U+0300. It is, however, also encoded separately as the single code point U+00E0. Why is it encoded twice? To maintain easier conversion between Unicode and Latin-1. If you have Latin-1 text that you convert to Unicode, à will likely be converted to U+00E0. But it could well be converted to a U+0061, U+0300 combination. Often, there's nothing you can do to automatically allow for these different ways of expressing characters, but Sun's java.util.regex package provides a special match option, CANON_EQ, which causes characters that are "canonically equivalent" to match the same, even if their representations in Unicode differ (˜368). Somewhat related is that different characters can look virtually the same, which could account for some confusion at times among those creating the text you're tasked to check. For example, the Roman letter I (U+0049) could be confused with Ι, the Greek letter Iota (U+0399). Add a dialytika to that to get Ï or Ϊ, and it can be encoded four different ways (U+00CF; U+03AA; U+0049 U+0308; U+0399 U+0308). This means that you might have to manually allow for these four possibilities when constructing a regular expression to match Ï. There are many examples like this. Also plentiful are single characters that appear to be more than one character. For example, Unicode defines a character called "SQUARE HZ" (U+3390), which appears as ㎐. This looks very similar to the two normal characters Hz (U+0048 U+007A). Although the use of special characters like ㎐ is minimal now, their adoption over the coming years will increase the complexity of programs that scan text, so those working with Unicode would do well to keep these issues in the back of their mind.
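Outside of java.util.regex's CANON_EQ, a common workaround is to normalize both the regex and the target text to the same Unicode normalization form before matching; a Python sketch using the standard unicodedata module:

```python
import re
import unicodedata

composed = "\u00E0"     # "à" as the single code point U+00E0
decomposed = "a\u0300"  # "a" followed by the combining grave accent
assert composed != decomposed   # the raw code point sequences differ

# Normalizing to NFC (composed form) makes the two representations equal,
# so a regex built from one can match text containing the other.
normalized = unicodedata.normalize("NFC", decomposed)
assert normalized == composed
assert re.fullmatch(composed, normalized) is not None
```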
Along those lines, one might already expect, for example, the need to allow for both normal spaces (U+0020) and no-break spaces (U+00A0), and perhaps also any of the dozen or so other types of spaces that Unicode defines.

3.4.3.3. Unicode 3.1+ and code points beyond U+FFFF
With the release of Unicode Version 3.1 in mid-2001, characters with code points beyond U+FFFF were added. (Previous versions of Unicode had built in a way to allow for characters at those code points, but until Version 3.1, none were actually defined.) For example, there is a character for the musical symbol C Clef defined at U+1D121. Older programs built to handle only code points U+FFFF and below won't be able to handle this. Most programs' \unum indeed allows only a four-digit hexadecimal number. Programs that can handle characters at these new code points generally offer the \x{num} sequence, where num can be any number of digits. (This is offered instead of, or in addition to, the four-digit \unum notation.) You can then use \x{1D121} to match the C Clef character.

3.4.3.4. Unicode line terminator
Unicode defines a number of characters (and one sequence of two characters) that are to be considered line terminators, shown in Table 3-5.

Table 3-5. Unicode Line Terminators

    U+000A          LF      ASCII line feed ("newline")
    U+000B          VT      ASCII vertical tab
    U+000C          FF      ASCII form feed
    U+000D          CR      ASCII carriage return
    U+000D U+000A   CR/LF   carriage return / line feed sequence
    U+0085          NEL     Unicode next line
    U+2028          LS      Unicode line separator
    U+2029          PS      Unicode paragraph separator
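Support for these terminators is uneven even within a single language; in Python, for example, str.splitlines honors the full Unicode line-terminator set, while splitting on "\n" by hand does not (an illustrative aside):

```python
# U+2028 (LS) is a Unicode line terminator, but not an ASCII newline.
text = "one\u2028two\nthree"

# str.splitlines recognizes the full Unicode terminator set...
assert text.splitlines() == ["one", "two", "three"]
# ...while a naive split sees only the ASCII newline.
assert text.split("\n") == ["one\u2028two", "three"]
```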
When fully supported, line terminators influence how lines are read from a file (including, in scripting languages, the file the program is being read from). With regular expressions, they can influence both what dot matches (˜111), and where the line anchors can match.

3.4.4. Regex Modes and Match Modes
Most regex engines support a number of different modes for how a regular expression is interpreted or applied. We've seen an example of each with Perl's /x modifier (a regex mode that allows free whitespace and comments ˜72) and /i modifier (a match mode for case-insensitive matching ˜47). Modes can generally be applied globally, to the whole regex, or in many modern flavors, partially, to specific subexpressions of the regex. The global application is achieved through modifiers or options, such as Perl's /i, PHP's i pattern modifier (˜446), or java.util.regex's Pattern.CASE_INSENSITIVE flag (˜99). If supported, the partial application of a mode is achieved with a regex construct such as (?i) or (?i:…). How these modes are invoked within a regex is discussed later in this chapter (˜135). In this section, we'll merely review some of the modes commonly available in most systems.

3.4.4.1. Case-insensitive match mode
The almost ubiquitous case-insensitive match mode ignores letter case during matching, so that, for example, 'b' and 'B' are treated as identical. Historically, case-insensitive matching support has been surprisingly fraught with bugs, but most have been fixed over the years. Still, Ruby's case-insensitive matching doesn't apply to octal and hex escapes. There are special Unicode-related issues with case-insensitive matching (which Unicode calls "loose matching"). For starters, not all alphabets have the concept of upper and lower case, and some have an additional title case used only at the start of a word. Sometimes there's not a straight one-to-one mapping between upper and lower case. A common example is that a Greek Sigma, Σ, has two lowercase versions, σ and ς. Another issue is that sometimes a single character maps to a sequence of multiple characters. One well-known example is that the uppercase version of ß is the two-character combination "SS". Only Perl handles this properly. There are also Unicode-manufactured problems. One example is that while there's a single character ǰ (U+01F0), it has no single-character uppercase version. Rather, its uppercase version requires a combining sequence (˜107), U+004A followed by U+030C. Yet, the two should match in a case-insensitive mode. There are even examples like this that involve one-to-three mappings. Luckily, most of these do not involve commonly used characters.

3.4.4.2. Free-spacing and comments regex mode
In this mode, whitespace outside of character classes is mostly ignored. Whitespace within a character class still counts (except in java.util.regex), and comments are allowed between # and a newline. We've already seen examples of this for Perl (˜72), Java (˜98), and VB.NET (˜99). Except for java.util.regex, it's not quite true that all whitespace outside of classes is ignored; rather, it's turned into a do-nothing metacharacter. The distinction is important with something like \12 3, which in this mode is \12 followed by 3, and not \123. Of course, just what is and isn't "whitespace" is subject to the character encoding in effect, and its fullness of support. Most programs recognize only ASCII whitespace.

3.4.4.3. Dot-matches-all match mode (a.k.a., "single-line mode")
Usually, dot does not match a newline. The original Unix regex tools worked on a line-by-line basis, so the thought of matching a newline wasn't an issue until the advent of sed and lex. By that time, the convention that dot does not match a newline was well entrenched. For modern programming languages, a mode in which dot matches a newline can be as useful as one where it doesn't. Which of these is most convenient for a particular situation depends on, well, the situation. Many programs now offer ways for the mode to be selected on a per-regex basis. There are a few exceptions to the common standard. Unicode-enabled systems, such as Sun's Java regex package, may expand what dot normally does not match to include any of the single-character Unicode line terminators (˜109). Tcl's normal state is that its dot matches everything, but in its special "newline-sensitive" and "partial newline-sensitive" matching modes, both dot and a negated character class are prohibited from matching a newline. An unfortunate name. When first introduced by Perl with its /s modifier, this mode was called "single-line mode." This unfortunate name continues to cause no end of confusion because it has nothing whatsoever to do with how the line anchors treat lines.

3.4.4.4. Enhanced line-anchor match mode (a.k.a., "multiline mode")
An enhanced line-anchor match mode influences where the line anchors can match: in this mode, ^ can match not only at the start of the target text, but also after any embedded newline. It's much the same for $, which can also match before any embedded newline. Programs that offer this mode often offer \A and \Z, which normally match only at the start and end of the target text and are not affected by this mode. As with dot, there are exceptions to the common standard. A text editor like GNU Emacs normally lets the line anchors match at embedded newlines, since that makes the most sense for an editor. On the other hand, lex has its $ match only before a newline. Unicode-enabled systems, such as Sun's java.util.regex, may allow the line anchors in this mode to match at any line terminator (˜109). Ruby's line anchors normally do match at any embedded newline, and Python's \Z, which matches only at the very end of the target text, is unaffected by this mode. Traditionally, this mode has been called "multiline mode." Although it is unrelated to "single-line mode," the names confusingly imply a relation. One simply modifies how dot matches, while the other modifies how the line anchors ^ and $ match.

3.4.4.5. Literal-text regex mode
A "literal text" mode is one that doesn't recognize most or all regex metacharacters. For example, a literal-text mode version of |