Regular Expressions
Regular expressions are powerful tools for validating input, for extracting data from input, and for searching and replacing (see Table 13.1). The regular expression pattern matching language is used to describe regular expressions. A regular expression (regex for short) describes constraints on the way a string is composed.
Pattern |
Meaning |
---|---|
hello |
matches the literal string, hello |
c*at |
(quantifier) zero or more occurrances of c, followed by at |
c?at |
zero or 1 occurrances of c, followed by at |
c.t |
(character set: anychar) c followed by any character followed by t: cat, cot, c3t, ct |
c.*t |
(char set and quantifier) c followed by 0 or more anychar followed by t: ct, caaatttt, carsdfsdft |
ca+t |
(quantifier) + means 1 or more of the preceding "thing," so this matches cat, caat, caaaat, but not ct |
c.*t |
(literals) backslashes precede special characters to "escape them," so this matches the literal: c.*t |
c\.t |
(literals) c.t |
c[0-9a-c]+z |
matches c312abbaz caa211bac2z |
the (cat|dog) ate the (fish|mouse) |
(alternation) the cat ate the fish or the dog ate the mouse or the dog ate the fish, and the obvious last one. |
w+ |
(charset) a sequence of alphanumerics, or a word, same as [a-zA-Z0-9]+ |
W |
a character that is not part of a word (punctuation, whitespace, etc.) |
s{5} |
exactly 5 whitespace chars (tabs, spaces, newlines) |
S{1,5} |
at least 1, at most 5 non-whitespace (printable characters) |
d |
a digit [0-9] (and D is a non-digit, i.e., [^0-9]) |
d{3}-d{4} |
7-digit phone numbers: 555-1234 |
m[A-Z]w+ |
means word boundary: matches mBuffer but not StreamBuffer |
Regular expressions were first available in tools such as vi, emacs, awk, sed, and the POSIX Standard Library. Perl was the first mainstream programming language to integrate regular expressions so tightly into the language that it caused many people to learn regular expressions for the first time. Many enhancements have been made to the version of regular expressions that Perl recognizes. The enhancements are part of what we call Perl-style extended regular expressions. These extended regular expressions are also available in Java and Python.
Qt has a class, QRegExp, that implements most of the Perl-style extended regular expression language.
13.2.1. Regular Expression Syntax
A regular expression can be a simple string, in which case it specifies an exact string match, or it can be a string that includes regular expression meta-characters. A meta-character is a character that describes other characters.
Here are some of the most commonly used meta-characters.
- Special characters
- . (the dot matches any character)
- (matches the newline character)
- f (matches the form feed character)
- (matches the tab character)
- xhhhh (matches the Unicode character whose code is the hexadecimal number hhhh in the range 0x0000 to 0xFFFF)
- Quantifiers: Modifiers that specify the number of occurrences of the preceding character (or group) that may appear in the matching expression.
- + (1 or more occurrences)
- ? (0 or 1 occurrences)
- * (0 or more occurrences)
- {i,j} (at least i but not more than j occurrences)
- Character sets: Sets of allowable values for the character in the specified position of the matching expression. Several character sets are predefined:
- s (matches any whitespace character)
- S (matches any non-whitespace character)
- d (matches any digit character: '0' to '9')
- D (matches any non-digit character)
- w (matches any "word" character; i.e., any letter or digit or the underscore '_')
- W (matches any non-word character)
Character sets can also be specified in square brackets:
- [AEIOU] (matches any of the chars ' A ', ' E ', ' I ', ' O ', or ' U ')
- [a-g] (the dash makes this a range from 'a' to 'g')
- [^xyz] (matches any character except for ' x ', ' y ', and ' z ')
- Grouping and capturing characters (round parentheses): Characters that can be used to form a group. Groups can be back-referenced, meaning that if there is a match, the grouped values can be captured and accessed in various ways.
For convenience, up to 9 groups can be referenced within the regular expression by using the identifiers 1 thru 9.
There is also a QRegExp member function cap(int nth) that returns the nth group (as a QString).
- Anchoring characters: Assertions that specify the boundaries of a matching effort.
- The caret (^), if it is the first character in the regex, indicates that the match starts at the beginning of the string.
- The dollar sign ($), when it is the last character in the regex, means that the effort to match must continue to the end of the string.
- In addition, there are word boundary () or non-word boundary (B) assertions that help to focus the attention of the regex.
Backslashes are used for escaping special characters in C++ strings as well, so this means that regular expression strings inside C++ strings must be "double-backslashed." In other words, every becomes \, and to match the backslash character itself you will need four: \\ |
There is much more to regular expressions. Time spent learning to use them is well-invested time. The documentation for QRegExp is a good place to start. For a much more extensive discussion, we recommend [Fried198].
13.2.2. Regular Expressions: Phone Number Recognition
The Problem
We want to specify conditions, in a generic way, that must be satisfied by input data at runtime. For example:
- In a U.S. address, every zip code can have five digits, followed by an optional dash (-) and four more digits.
- A U.S. phone number consists of ten digits, usually grouped 3 + 3 + 4, with optional parentheses and dashes and an optional initial 1.
- A U.S. state abbreviation must be one from the set of 50 approved abbreviations.
How can we impose conditions such as these on incoming data in an object-oriented way?
Suppose that you wanted to write a program that recognized phone number formats, and could accept Dutch or U.S./Canada phone numbers. You would need to take the following things into consideration.
- For any U.S./Canada format numbers, there must AAA EEE NNNN, where A = area code, E = exchange, and N = number.
- For Dutch format numbers, there must be CC MM followed by either NN NN NNN or NNN NNNN, where C = country code, M = municipal code, and N = localNumberDigits.
- There might be dashes or spaces between number clusters.
- There might be + or 00 in front of the country code.
Imagine how you would write this program using the standard tools available to you in C++. It would be necessary to write lengthy parsing routines for each possible format. Example 13.3 shows the desired output of such a program.
Example 13.3. src/regexp/testphone.txt
$> ./testphone Enter a phone Number (or q to quit): 1112223333 validated: (US/Canada) +1 111-222-3333 Enter a phone Number (or q to quit): 20618676017 validated: (Europe) +20 (0)61-86-76-017 Enter a phone Number (or q to quit): 31206472582 validated: (Europe) +31 (0)20-64-72-582 Enter a phone Number (or q to quit): 16175551212 validated: (US/Canada) +1 617-555-1212 Enter a phone Number (or q to quit): +31 20 64 28 258 validated: (Europe) +31 (0)20-64-28-258 Enter a phone Number (or q to quit): 1 (617) 222 3333 validated: (US/Canada) +1 617-222-3333 Enter a phone Number (or q to quit): 31 20 111 1111 validated: (Europe) +31 (0)20-11-11-111 Enter a phone Number (or q to quit): asdf Unknown format Enter a phone Number (or q to quit): 111 2222 333 Unknown format Enter a phone Number (or q to quit): 111223 Unknown format Enter a phone Number (or q to quit): 1112222333 validated: (US/Canada) +1 111-222-2333 Enter a phone Number (or q to quit): q $ > |
A procedural C-style solution that shows how to use QRegExp is shown in Example 13.4.
Example 13.4. src/regexp/testphoneread.cpp
[ . . . . ] QRegExp usformat ("(\+?1[- ]?)?\(?(\d{3})\)?[\s-]?(\d{3})[\s-]?(\d{4}"); <-- 1 QRegExp nlformat ("(\+|00)?[\s\-]?(31)[\s\-]?(\d\d)[\s\-]?(.*)$"); <-- 2 QRegExp nlformat2 ("(\d\d)(\d\d)(\d{3}"); <-- 3 QRegExp filtercharacters ("[\s-\+\(\)\-]"); <-- 4 QString stdinReadPhone() { <-- 5 QString str; bool format=false; do { <-- 6 cout << "Enter a phone Number (or q to quit): "; cout.flush(); str = cin.readLine(); if (str=="q") return str; if (usformat.exactMatch(str)) { format = true; QString areacode = usformat.cap(2); QString exchange = usformat.cap(3); QString number = usformat.cap(4); str = QString("(US/Canada) +1 %1-%2-%3") .arg(areacode).arg(exchange).arg(number); } [ . . . . ] if (format == false) { cout << "Unknown format" << endl; } } while (format == false) ; return str; } int main() { QString str; do { str = stdinReadPhone(); if (str != "q") cout << "validated: " << str << endl; } while (str != "q"); return 0; } [ . . . . ] (1)All usformat numbers have country code 1, and have 3 + 3 + 4 = 10 digits. White spaces, dashes, and parantheses between these digit groups are ignored, but they help to make the digit groups recognizable. (2)Netherlands (country code 31) numbers have 2 + 2 + 7 = 11 digits. (3)The last seven digits will be arranged as 2 + 2 + 3. (4)These are characters we ignore in the last seven digits of NL numbers. (5)Ensures that the user-entered phone string complies with a regular expression and extracts the proper components from it. (6)Keep asking until you get a valid number. |
Exercises: Regular Expressions: Phone Number Recognition
1. |
Rewrite the birthday reminder application from Section 3.6 so that it accepts dates in any of the following formats:
|
2. |
Many operating systems come with a reasonably good list of words that are used by various programs to check spelling. On *nix systems the word list is generally named (at least indirectly) "words". You can locate the file on your *nix system by typing the command locate words | grep dict Piping the output of this command through grep reduces the output to those lines that contain the string "dict". |
Once you have located your system word list file, write a program that will read lines from the file and, using a suitable regex, display all the words that do the following:
- Begin with a pair of repeated letters
- End in "gory"
- Have more than one pair of repeated letters
- Are palindromes
- Consist of letters arranged in strictly increasing alphabetic order (e.g., knot)
If you cannot find such a suitable word list on your system, you can use the file canadian-english-small.gz, in our dist directory.[1] After you download it, you must uncompress it with the command
[1] http://oop.mcs.suffolk.edu/dist
gunzip canadian-english-small.gz