Mastering Regular Expressions
3.4. Strings, Character Encodings, and Modes
Before getting into the various types of metacharacters generally available, there are a number of global issues to understand: regular expressions as strings, character encodings, and match modes. These are simple concepts, in theory, and in practice, some indeed are. With most, though, the small details, subtleties, and inconsistencies among the various implementations sometimes make it hard to pin down exactly how they work in practice. The next sections cover some of the common and sometimes complex issues you'll face.

3.4.1. Strings as Regular Expressions
The concept is simple: in most languages except Perl, awk, and sed, the regex engine accepts regular expressions as normal strings, often provided as string literals like "^From:(.*)". What confuses many, especially early on, is the need to deal with the language's own string-literal metacharacters when composing a string to be used as a regular expression. Each language's string literals have their own set of metacharacters, and some languages even have more than one type of string literal, so there's no one rule that works everywhere, but the concepts are all the same. Many languages' string literals recognize escape sequences like \t, \\, and \x2A, which are interpreted while the string's value is being composed. The most common regex-related aspect of this is that each backslash in a regex requires two backslashes in the corresponding string literal. For example, "\\n" is required to get the two-character regex \n. If you forget the extra backslash for the string literal and use "\n", with many languages you'd then hand the regex engine a string containing an actual newline character; since most engines treat a literal newline as "match a newline," the mistake can easily go unnoticed.

Table 3-4. A Few String-Literal Examples
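The doubled-backslash requirement is easy to demonstrate; here is a minimal sketch in Python, whose double-quoted string literals follow the escape rules described above (an illustrative aside):

```python
import re

# The string literal "\\n" yields the two characters \ and n,
# which the regex engine then interprets as "match a newline".
regex_from_literal = "\\n"
assert len(regex_from_literal) == 2
assert re.search(regex_from_literal, "line1\nline2") is not None

# Forgetting the extra backslash yields a string holding an actual
# newline character; the engine still matches a newline with it,
# so the mistake can go unnoticed.
forgotten = "\n"
assert len(forgotten) == 1
assert re.search(forgotten, "line1\nline2") is not None
```

Python's raw strings, discussed later in this section, avoid the doubling entirely.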
Every language's string literals are different, but some are quite different in that '\' is not a metacharacter. For example, VB.NET's string literals have only one metacharacter, a double quote. The next sections look at the details of several common languages' string literals. Whatever the individual string-literal rules, the question on your mind when using them should be "what will the regular expression engine see after the language's string processing is done?"

3.4.1.1. Strings in Java
Java string literals are like those presented in the introduction, in that they are delimited by double quotes, and backslash is a metacharacter. Common combinations such as '\t' (tab), '\n' (newline), '\\' (literal backslash), etc. are supported. Using a backslash in a sequence not explicitly supported by literal strings results in an error.

3.4.1.2. Strings in VB.NET
String literals in VB.NET are also delimited by double quotes, but otherwise are quite different from Java's. VB.NET strings recognize only one metasequence: a pair of double quotes in the string literal adds one double quote to the string's value. For example, "he said ""hi""\." results in the value he said "hi"\. (the backslash and period, not being special, are kept as-is).

3.4.1.3. Strings in C#
Although all the languages of Microsoft's .NET Framework share the same regular expression engine internally, each has its own rules about the strings used to create the regular-expression arguments. We just saw Visual Basic's simple string literals. In contrast, Microsoft's C# language has two types of string literals. C# supports the common double-quoted string similar to the kind discussed in this section's introduction, except that \" (rather than VB.NET's "") adds a double quote into the string's value. However, C# also supports "verbatim strings," which look like @"…". Verbatim strings recognize no backslash sequences, but instead just one special sequence: a pair of double quotes inserts one double quote into the target value. This means that you can use "\\t\\x2A" or @"\t\x2A" to create the example regex \t\x2A.

3.4.1.4. Strings in PHP
PHP also offers two types of strings, yet both differ from either of C#'s types. With PHP's double-quoted strings, you get the common backslash sequences like '\n', but you also get variable interpolation as we've seen with Perl (˜77), and also the special sequence {…}, which inserts into the string the result of executing the code between the braces. These extra features of PHP double-quoted strings mean that you'll tend to insert extra backslashes into regular expressions, but there's one additional feature that helps mitigate that need. With Java and C# string literals, a backslash sequence that isn't explicitly recognized as special within strings results in an error, but with PHP double-quoted strings, such sequences are simply passed through to the string's value. PHP strings do recognize \t, though, so you still need "\\t" to get the regex \t. PHP single-quoted strings offer uncluttered strings on the order of VB.NET's strings, or C#'s @"…" strings, but in a slightly different way. Within a PHP single-quoted string, the sequence \' includes one single quote in the target value, and \\ includes a backslash. Any other character (including any other backslash) is not considered special, and is copied to the target value verbatim. This means that '\t\x2A' creates the value \t\x2A, exactly as typed. PHP single-quoted strings are discussed further in Chapter 10 (˜445).

3.4.1.5. Strings in Python
Python offers a number of string-literal types. You can use either single quotes or double quotes to create strings, but unlike PHP, there is no difference between the two. Python also offers "triple-quoted" strings of the form '''…''' and """…""", which are different in that they may contain unescaped newlines. All four types offer the common backslash sequences such as \n, but have the same twist that PHP has in that unrecognized sequences are left in the string verbatim. Contrast this with Java and C# strings, for which unrecognized sequences cause an error. Like PHP and C#, Python offers a more literal type of string, its "raw string." Similar to C#'s @"…" notation, Python uses an 'r' before the opening quote of any of the four quote types. For example, r"\t\x2A" yields the value \t\x2A, exactly as typed.

3.4.1.6. Strings in Tcl
Tcl is different from anything else in that it doesn't really have string literals at all. Rather, command lines are broken into "words," which Tcl commands can then consider as strings, variable names, regular expressions, or anything else as appropriate to the command. While a line is being parsed into words, common backslash sequences like \n are recognized and converted, and backslashes in unknown combinations are simply dropped. You can put double quotes around the word if you like, but they aren't required unless the word has whitespace in it. Tcl also has a raw literal type of quoting similar to Python's raw strings, but Tcl uses braces, {…}, instead of r'…'. Within the braces, everything except a backslash-newline combination is kept as-is, so you can use {\t\x2A} to get the value \t\x2A. Within the braces, you can have additional sets of braces so long as they nest. Those that don't nest must be escaped with a backslash, although the backslash does remain in the string's value.

3.4.1.7. Regex literals in Perl
In the Perl examples we've seen so far in this book, regular expressions have been provided as literals ("regular-expression literals"). As it turns out, you can also provide them as strings. For example:

    $str =~ m/(\w+)/;
can also be written as:

    $regex = '(\w+)';
    $str =~ $regex;
or perhaps:

    $regex = "(\w+)";
    $str =~ $regex;
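The regex-as-string idea is the norm in languages without regex literals; in Python, for example, the pattern is always an ordinary string, optionally precompiled (a Python aside, separate from the Perl discussion):

```python
import re

# The regex is just a string; compiling it once avoids re-parsing
# the pattern each time it is used.
pattern = re.compile(r"(\w+)")
match = pattern.search("hello world")
assert match is not None
assert match.group(1) == "hello"
```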
(Note that using a string can be much less efficient ˜242, 348.) When a regex is provided as a literal, Perl provides extra features that the regular-expression engine itself does not, notably variable interpolation.
In Perl, a regex literal is parsed like a very special kind of string. In fact, these features are also available with Perl double-quoted strings. The point to be aware of is that these features are not provided by the regular-expression engine. Since the vast majority of regular expressions used within Perl are written as regex literals, most think of these features as part of the regex language itself. More details are available in Chapter 7, starting on page 288.

3.4.2. Character-Encoding Issues
A character encoding is merely an explicit agreement on how bytes with various values should be interpreted. A byte with the decimal value 110 is interpreted as the character 'n' with the ASCII encoding, but as '>' with EBCDIC. Why? Because that's what someone decided; there's nothing intrinsic about those values and characters that makes one encoding better than the other. The byte is the same; only the interpretation changes. ASCII defines characters for only half the values that a byte can hold. The ISO-8859-1 encoding (commonly called "Latin-1") fills in the blank spots with accented characters and special symbols, making an encoding usable by a larger set of languages. With this encoding, a byte with a decimal value of 234 is to be interpreted as ê, instead of being undefined as it is with ASCII. The important question for us is: when we intend for a certain set of bytes to be considered in the light of a particular encoding, does the program actually do so? For example, if we have four bytes with the values 234, 116, 101, and 115 that we intend to be considered as Latin-1 (representing the French word "êtes"), we'd like a regex such as êtes to match that text.

3.4.2.1. Richness of encoding-related support
There are many encodings. When you're concerned with a particular one, important questions you should ask include whether the encoding is supported at all, and if so, how fully.
The richness of an encoding's support involves several important issues, including how completely constructs like dot, \w, \d, \s, and \b understand the encoding's characters.
Sometimes things are not as simple as they might seem. For example, the \b of Sun's java.util.regex package properly understands all the word-related characters of Unicode, but its \w does not (it understands only basic ASCII). We'll see more examples of this later in the chapter.

3.4.3. Unicode
There seems to be a lot of misunderstanding about just what "Unicode" is. At the most basic level, Unicode is a character set, or a conceptual encoding: a logical mapping between a number and a character. For example, the Korean character 삵 is mapped to the number 49,333. The number, called a code point, is normally shown in hexadecimal, with "U+" prepended. 49,333 in hex is C0B5, so 삵 is referred to as U+C0B5. Included as part of the Unicode concept is a set of attributes for many characters, such as "3 is a digit," or which uppercase letter corresponds to which lowercase letter. At this level, nothing is yet said about just how these numbers are actually encoded as data on a computer. There are a variety of ways to do so, including the UCS-2 encoding (all characters encoded with two bytes), the UCS-4 encoding (all characters encoded with four bytes), UTF-16 (most characters encoded with two bytes, but some with four), and the UTF-8 encoding (characters encoded with one to six bytes). Exactly which (if any) of these encodings a particular program uses internally is usually not a concern to the user of the program. The user's concern is usually limited to how to convert external data (such as data read from a file) from a known encoding (ASCII, Latin-1, UTF-8, etc.) to whatever the program uses. Programs that work with Unicode usually supply various encoding and decoding routines for doing the conversion. Regular expressions for programs that work with Unicode often support a \unum metasequence that can be used to match a specific Unicode character (˜117). The number is usually given as a four-digit hexadecimal number, so \uC0B5 matches 삵. It's important to realize that \uC0B5 is saying "match the Unicode character U+C0B5," and says nothing about what actual bytes are to be compared, which is dependent on the particular encoding used internally to represent Unicode code points.
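Python's re module, for instance, supports this \u metasequence; a small Python-specific sketch:

```python
import re

# \uC0B5 names the code point itself; nothing here depends on how the
# interpreter represents the string internally.
m = re.search(r"\uC0B5", "ab\uC0B5cd")
assert m is not None
assert m.start() == 2   # the character occupies one position, not N bytes
```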
If the program happens to use UTF-8 internally, that character happens to be represented with three bytes. But you, as someone using the Unicode-enabled program, don't normally need to care. (Sometimes you do, as with PHP's preg suite and its u pattern modifier; ˜447.) There are a few related issues that you may need to be aware of...

3.4.3.1. Characters versus combining-character sequences
What a person considers a "character" doesn't always agree with what Unicode or a Unicode-enabled program (or regex engine) considers to be a character. For example, most would consider à to be a single character, but in Unicode, it can be composed of two code points, U+0061 (a) combined with the grave accent U+0300 (`). Unicode offers a number of combining characters that are intended to follow (and be combined with) a base character. This makes things a bit more complex for the regular-expression engine: for example, should dot match just one code point, or the entire U+0061 plus U+0300 combination? In practice, it seems that many programs treat "character" and "code point" as synonymous, which means that dot matches each code point individually, whether it is a base character or one of the combining characters. Thus, à (U+0061 plus U+0300) is matched by two applications of dot, not one. Perl and PCRE (and by extension, PHP's preg suite) support the \X metasequence, which fulfills what many might expect from dot ("match one character") in that it matches a base character followed by any number of combining characters. See more on page 120. It's important to keep combining characters in mind when using a Unicode-enabled editor to input Unicode characters directly into regular expressions. If an accented character, say à, ends up in a regular expression as 'a' plus the combining accent, it likely can't match a string containing the single-code-point version of à (single-code-point versions are discussed in the next section). Also, it appears as two distinct characters to the regular-expression engine itself, so placing it within a character class, for instance, probably won't do what's intended. In a similar vein, if a two-code-point character like à is followed by a quantifier, the quantifier actually applies only to the second code point, just as if the base character and combining character had been typed out explicitly.

3.4.3.2. Multiple code points for the same character
In theory, Unicode is supposed to be a one-to-one mapping between code points and characters, but there are many situations where one character can have multiple representations. The previous section notes that à is U+0061 followed by U+0300. It is, however, also encoded separately as the single code point U+00E0. Why is it encoded twice? To maintain easier conversion between Unicode and Latin-1. If you have Latin-1 text that you convert to Unicode, à will likely be converted to U+00E0. But it could well be converted to a U+0061, U+0300 combination. Often, there's nothing you can do to automatically allow for these different ways of expressing characters, but Sun's java.util.regex package provides a special match option, CANON_EQ, which causes characters that are "canonically equivalent" to match the same, even if their representations in Unicode differ (˜368). Somewhat related is that different characters can look virtually the same, which could account for some confusion at times among those creating the text you're tasked to check. For example, the Roman letter I (U+0049) could be confused with Ι, the Greek letter Iota (U+0399). Add a dialytika to that to get Ï or Ϊ, and it can be encoded four different ways (U+00CF; U+03AA; U+0049 U+0308; U+0399 U+0308). This means that you might have to manually allow for these four possibilities when constructing a regular expression to match Ï. There are many examples like this. Also plentiful are single characters that appear to be more than one character. For example, Unicode defines a character called "SQUARE HZ" (U+3390), which appears as ㎐. This looks very similar to the two normal characters Hz (U+0048 U+007A). Although the use of special characters like ㎐ is minimal now, their adoption over the coming years will increase the complexity of programs that scan text, so those working with Unicode would do well to keep these issues in the back of their mind.
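Outside of java.util.regex's CANON_EQ, a common workaround is to normalize both the regex and the target text to the same Unicode normalization form before matching; a Python sketch using the standard unicodedata module:

```python
import re
import unicodedata

composed = "\u00E0"     # "à" as the single code point U+00E0
decomposed = "a\u0300"  # "a" followed by the combining grave accent
assert composed != decomposed   # the raw code point sequences differ

# Normalizing to NFC (composed form) makes the two representations equal,
# so a regex built from one can match text containing the other.
normalized = unicodedata.normalize("NFC", decomposed)
assert normalized == composed
assert re.fullmatch(composed, normalized) is not None
```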
Along those lines, one might already expect, for example, the need to allow for both normal spaces (U+0020) and no-break spaces (U+00A0), and perhaps also any of the dozen or so other types of spaces that Unicode defines.

3.4.3.3. Unicode 3.1+ and code points beyond U+FFFF
With the release of Unicode Version 3.1 in mid-2001, characters with code points beyond U+FFFF were added. (Previous versions of Unicode had built in a way to allow for characters at those code points, but until Version 3.1, none were actually defined.) For example, there is a character for the musical symbol C Clef defined at U+1D121. Older programs built to handle only code points U+FFFF and below won't be able to handle this. Most programs' \unum indeed allows only a four-digit hexadecimal number. Programs that can handle characters at these new code points generally offer the \x{num} sequence, where num can be any number of digits. (This is offered instead of, or in addition to, the four-digit \unum notation.) You can then use \x{1D121} to match the C Clef character.

3.4.3.4. Unicode line terminator
Unicode defines a number of characters (and one sequence of two characters) that are to be considered line terminators, shown in Table 3-5.

Table 3-5. Unicode Line Terminators

    U+000A          LF      ASCII line feed ("newline")
    U+000B          VT      ASCII vertical tab
    U+000C          FF      ASCII form feed
    U+000D          CR      ASCII carriage return
    U+000D U+000A   CR/LF   carriage return / line feed sequence
    U+0085          NEL     Unicode next line
    U+2028          LS      Unicode line separator
    U+2029          PS      Unicode paragraph separator
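Support for these terminators is uneven even within a single language; in Python, for example, str.splitlines honors the full Unicode line-terminator set, while splitting on "\n" by hand does not (an illustrative aside):

```python
# U+2028 (LS) is a Unicode line terminator, but not an ASCII newline.
text = "one\u2028two\nthree"

# str.splitlines recognizes the full Unicode terminator set...
assert text.splitlines() == ["one", "two", "three"]
# ...while a naive split sees only the ASCII newline.
assert text.split("\n") == ["one\u2028two", "three"]
```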
When fully supported, line terminators influence how lines are read from a file (including, in scripting languages, the file the program is being read from). With regular expressions, they can influence both what dot matches (˜111), and where the line anchors can match.

3.4.4. Regex Modes and Match Modes
Most regex engines support a number of different modes for how a regular expression is interpreted or applied. We've seen an example of each with Perl's /x modifier (a regex mode that allows free whitespace and comments ˜72) and /i modifier (a match mode for case-insensitive matching ˜47). Modes can generally be applied globally, to the whole regex, or in many modern flavors, partially, to specific subexpressions of the regex. The global application is achieved through modifiers or options, such as Perl's /i, PHP's i pattern modifier (˜446), or java.util.regex's Pattern.CASE_INSENSITIVE flag (˜99). If supported, the partial application of a mode is achieved with a regex construct such as (?i) or (?i:…). How these modes are invoked within a regex is discussed later in this chapter (˜135). In this section, we'll merely review some of the modes commonly available in most systems.

3.4.4.1. Case-insensitive match mode
The almost ubiquitous case-insensitive match mode ignores letter case during matching, so that, for example, 'b' and 'B' are treated as identical. Historically, case-insensitive matching support has been surprisingly fraught with bugs, but most have been fixed over the years. Still, Ruby's case-insensitive matching doesn't apply to octal and hex escapes. There are special Unicode-related issues with case-insensitive matching (which Unicode calls "loose matching"). For starters, not all alphabets have the concept of upper and lower case, and some have an additional title case used only at the start of a word. Sometimes there's not a straight one-to-one mapping between upper and lower case. A common example is that a Greek Sigma, Σ, has two lowercase versions, σ and ς. Another issue is that sometimes a single character maps to a sequence of multiple characters. One well-known example is that the uppercase version of ß is the two-character combination "SS". Only Perl handles this properly. There are also Unicode-manufactured problems. One example is that while there's a single character ǰ (U+01F0), it has no single-character uppercase version. Rather, its uppercase version requires a combining sequence (˜107), U+004A followed by U+030C. Yet, the two should match in a case-insensitive mode. There are even examples like this that involve one-to-three mappings. Luckily, most of these do not involve commonly used characters.

3.4.4.2. Free-spacing and comments regex mode
In this mode, whitespace outside of character classes is mostly ignored. Whitespace within a character class still counts (except in java.util.regex), and comments are allowed between # and a newline. We've already seen examples of this for Perl (˜72), Java (˜98), and VB.NET (˜99). Except for java.util.regex, it's not quite true that all whitespace outside of classes is ignored; rather, it's turned into a do-nothing metacharacter. The distinction is important with something like \12 3, which in this mode is \12 followed by 3, and not \123. Of course, just what is and isn't "whitespace" is subject to the character encoding in effect, and its fullness of support. Most programs recognize only ASCII whitespace.

3.4.4.3. Dot-matches-all match mode (a.k.a., "single-line mode")
Usually, dot does not match a newline. The original Unix regex tools worked on a line-by-line basis, so the thought of matching a newline wasn't an issue until the advent of sed and lex. By that time, the convention that dot does not match a newline was well entrenched. For modern programming languages, a mode in which dot matches a newline can be as useful as one where it doesn't. Which of these is most convenient for a particular situation depends on, well, the situation. Many programs now offer ways for the mode to be selected on a per-regex basis. There are a few exceptions to the common standard. Unicode-enabled systems, such as Sun's Java regex package, may expand what dot normally does not match to include any of the single-character Unicode line terminators (˜109). Tcl's normal state is that its dot matches everything, but in its special "newline-sensitive" and "partial newline-sensitive" matching modes, both dot and a negated character class are prohibited from matching a newline. An unfortunate name. When first introduced by Perl with its /s modifier, this mode was called "single-line mode." This unfortunate name continues to cause no end of confusion because it has nothing whatsoever to do with how the line anchors treat lines.

3.4.4.4. Enhanced line-anchor match mode (a.k.a., "multiline mode")
An enhanced line-anchor match mode influences where the line anchors can match: in this mode, ^ can match not only at the start of the target text, but also after any embedded newline. It's much the same for $, which can also match before any embedded newline. Programs that offer this mode often offer \A and \Z, which normally match only at the start and end of the target text and are not affected by this mode. As with dot, there are exceptions to the common standard. A text editor like GNU Emacs normally lets the line anchors match at embedded newlines, since that makes the most sense for an editor. On the other hand, lex has its $ match only before a newline. Unicode-enabled systems, such as Sun's java.util.regex, may allow the line anchors in this mode to match at any line terminator (˜109). Ruby's line anchors normally do match at any embedded newline, and Python's \Z, which matches only at the very end of the target text, is unaffected by this mode. Traditionally, this mode has been called "multiline mode." Although it is unrelated to "single-line mode," the names confusingly imply a relation. One simply modifies how dot matches, while the other modifies how the line anchors ^ and $ match.

3.4.4.5. Literal-text regex mode
A "literal text" mode is one that doesn't recognize most or all regex metacharacters. For example, a literal-text mode version of |