Regular Expressions and Class Regex
Regular expressions are specially formatted strings used to find patterns in text. They can be useful during information validation, to ensure that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must start with a capital letter. Compilers use regular expressions to validate the syntax of programs. If the program code does not match the regular expression, the compiler indicates that there is a syntax error.
The .NET Framework provides several classes to help developers recognize and manipulate regular expressions. Class Regex (of the System.Text.RegularExpressions namespace) represents an immutable regular expression. Regex method Match returns an object of class Match that represents a single regular expression match. Regex also provides method Matches, which finds all matches of a regular expression in an arbitrary string and returns a MatchCollection object containing all the Match objects. A collection is a data structure similar to an array, and its elements can be iterated over with a foreach statement. We discuss collections in more detail in Chapter 27, Collections. To use class Regex, add a using directive for the namespace System.Text.RegularExpressions.
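As a rough illustration of the difference between Match and Matches, the following sketch searches a made-up sample string for the literal pattern "at". The class name, the helper method names and the sample text are assumptions chosen for illustration; they do not come from the chapter's figures.

```csharp
using System;
using System.Text.RegularExpressions;

public static class MatchVersusMatches
{
   // Returns the text of the first match, or an empty string if there is none.
   public static string FirstMatch( string input, string pattern )
   {
      Match first = new Regex( pattern ).Match( input );
      return first.Success ? first.Value : "";
   }

   // Returns the number of non-overlapping matches found in the input.
   public static int CountMatches( string input, string pattern )
   {
      return new Regex( pattern ).Matches( input ).Count;
   }

   public static void Main()
   {
      string text = "cat bat rat"; // hypothetical sample data

      Console.WriteLine( FirstMatch( text, "at" ) );   // at
      Console.WriteLine( CountMatches( text, "at" ) ); // 3

      // a MatchCollection can be traversed with foreach
      foreach ( Match m in Regex.Matches( text, "at" ) )
         Console.WriteLine( m.Value + " found at index " + m.Index );
   }
}
```

Match stops at the first occurrence, whereas Matches keeps scanning from the end of each match until the input is exhausted.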
Regular Expression Character Classes
The table in Fig. 16.18 specifies some character classes that can be used with regular expressions. Please do not confuse a character class with a C# class declaration. A character class is simply an escape sequence that represents a group of characters that might appear in a string.
Character class | Matches |
---|---|
\d | any digit |
\w | any word character |
\s | any whitespace |
\D | any non-digit |
\W | any non-word character |
\S | any non-whitespace |
A word character is any alphanumeric character or underscore. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. A digit is any numeric character. Regular expressions are not limited to the character classes in Fig. 16.18. As you will see in our first example, regular expressions can use other notations to search for complex patterns in strings.
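The character classes can be tried out with a few one-line checks. This is a minimal sketch; the class name CharacterClassDemo, the helper Found and the sample strings are hypothetical and exist only for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterClassDemo
{
   // Returns true if the pattern matches anywhere in the input.
   public static bool Found( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Success;
   }

   public static void Main()
   {
      Console.WriteLine( Found( "Room 7", @"\d" ) ); // True: "7" is a digit
      Console.WriteLine( Found( "Room", @"\d" ) );   // False: no digit present
      Console.WriteLine( Found( "_", @"\w" ) );      // True: underscore is a word character
      Console.WriteLine( Found( "a\tb", @"\s" ) );   // True: a tab is whitespace
      Console.WriteLine( Found( "12345", @"\D" ) );  // False: every character is a digit
      Console.WriteLine( Found( "a_1", @"\W" ) );    // False: every character is a word character
      Console.WriteLine( Found( " \t ", @"\S" ) );   // False: only whitespace present
   }
}
```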
16.16.1. Regular Expression Example
The program of Fig. 16.19 tries to match birthdays to a regular expression. For demonstration purposes, the expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J".
Figure 16.19. Regular expressions checking birthdays.
 1  // Fig. 16.19: RegexMatches.cs
 2  // Demonstrating Class Regex.
 3  using System;
 4  using System.Text.RegularExpressions;
 5
 6  class RegexMatches
 7  {
 8     public static void Main()
 9     {
10        // create regular expression
11        Regex expression =
12           new Regex( @"J.*\d[0-35-9]-\d\d-\d\d" );
13
14        string string1 = "Jane's Birthday is 05-12-75\n" +
15           "Dave's Birthday is 11-04-68\n" +
16           "John's Birthday is 04-28-73\n" +
17           "Joe's Birthday is 12-17-77";
18
19        // match regular expression to string and
20        // print out all matches
21        foreach ( Match myMatch in expression.Matches( string1 ) )
22           Console.WriteLine( myMatch );
23     } // end method Main
24  } // end class RegexMatches
Lines 11-12 create a Regex object and pass a regular expression pattern string to the Regex constructor. Note that we precede the string with @. Recall that backslashes within the double quotation marks following the @ character are regular backslash characters, not the beginning of escape sequences. To define the regular expression without prefixing @ to the string, you would need to escape every backslash character, as in
"J.*\\d[0-35-9]-\\d\\d-\\d\\d"
which makes the regular expression more difficult to read.
The first character in the regular expression, "J", is a literal character. Any string matching this regular expression is required to start with "J". In a regular expression, the dot character "." matches any single character except a newline character. When the dot character is followed by an asterisk, as in ".*", the regular expression matches any number of unspecified characters except newlines. In general, when the operator "*" is applied to a pattern, the pattern will match zero or more occurrences. By contrast, applying the operator "+" to a pattern causes the pattern to match one or more occurrences. For example, both "A*" and "A+" will match "A", but only "A*" will match an empty string.
As indicated in Fig. 16.18, "\d" matches any numeric digit. To specify sets of characters other than those that belong to a predefined character class, characters can be listed in square brackets, []. For example, the pattern "[aeiou]" matches any vowel. Ranges of characters are represented by placing a dash (-) between two characters. In the example, "[0-35-9]" matches only digits in the ranges specified by the pattern, i.e., any digit between 0 and 3 or between 5 and 9; therefore, it matches any digit except 4. You can also specify that a pattern should match anything other than the characters in the brackets. To do so, place ^ as the first character in the brackets. It is important to note that "[^4]" is not the same as "[0-35-9]"; "[^4]" matches any non-digit as well as any digit other than 4.
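The difference between "[0-35-9]" and "[^4]" is easy to check on single-character strings. The sketch below is only an illustration; the class name CharacterSetDemo, the helper Found and the sample inputs are assumptions introduced here.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CharacterSetDemo
{
   // Returns true if the pattern matches anywhere in the input.
   public static bool Found( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Success;
   }

   public static void Main()
   {
      string[] samples = { "3", "4", "7", "x" }; // hypothetical inputs

      // Both sets reject "4", but only "[^4]" accepts the non-digit "x".
      foreach ( string sample in samples )
         Console.WriteLine( sample + ": [0-35-9] = " +
            Found( sample, "[0-35-9]" ) + ", [^4] = " +
            Found( sample, "[^4]" ) );

      Console.WriteLine( Found( "sky", "[aeiou]" ) );   // False: no vowel present
      Console.WriteLine( Found( "cloud", "[aeiou]" ) ); // True: contains "o"
   }
}
```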
Although the "-" character indicates a range when it is enclosed in square brackets, instances of the "-" character outside grouping expressions are treated as literal characters. Thus, the regular expression in line 12 searches for a string that starts with the letter "J", followed by any number of characters, followed by a two-digit number (of which the second digit cannot be 4), followed by a dash, another two-digit number, a dash and another two-digit number.
Lines 21-22 use a foreach statement to iterate through the MatchCollection returned by the expression object's Matches method, which received string1 as an argument. The elements in the MatchCollection are Match objects, so the foreach statement declares variable myMatch to be of type Match. For each Match, line 22 outputs the text that matched the regular expression. The output in Fig. 16.19 indicates the two matches that were found in string1. Notice that both matches conform to the pattern specified by the regular expression.
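Because ".*" is greedy and the dot does not match a newline, the number of matches depends on how the birthday entries are separated. The following sketch, which uses a shortened, hypothetical version of the sample text and an assumed class name GreedyDotDemo, compares newline-separated entries with space-separated ones.

```csharp
using System;
using System.Text.RegularExpressions;

public static class GreedyDotDemo
{
   // the pattern discussed above, with its backslashes intact
   public const string Pattern = @"J.*\d[0-35-9]-\d\d-\d\d";

   // Returns the number of matches of Pattern in the input.
   public static int CountMatches( string input )
   {
      return Regex.Matches( input, Pattern ).Count;
   }

   public static void Main()
   {
      string withNewlines = "Jane's Birthday is 05-12-75\n" +
         "John's Birthday is 04-28-73\n" +
         "Joe's Birthday is 12-17-77";
      string withSpaces = withNewlines.Replace( '\n', ' ' );

      // ".*" cannot cross a newline, so Jane and Joe match separately
      Console.WriteLine( CountMatches( withNewlines ) ); // 2

      // on a single line, ".*" runs from the first "J" to the last valid date
      Console.WriteLine( CountMatches( withSpaces ) );   // 1
   }
}
```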
Quantifiers
The asterisk (*) in line 12 of Fig. 16.19 is more formally called a quantifier. Figure 16.20 lists various quantifiers that you can place after a pattern in a regular expression and the purpose of each quantifier.
Quantifier | Matches |
---|---|
* | Matches zero or more occurrences of the preceding pattern. |
+ | Matches one or more occurrences of the preceding pattern. |
? | Matches zero or one occurrences of the preceding pattern. |
{n} | Matches exactly n occurrences of the preceding pattern. |
{n,} | Matches at least n occurrences of the preceding pattern. |
{n,m} | Matches between n and m (inclusive) occurrences of the preceding pattern. |
We have already discussed how the asterisk (*) and plus (+) quantifiers work. The question mark (?) quantifier matches zero or one occurrences of the pattern that it quantifies. A set of braces containing one number ({n}) matches exactly n occurrences of the pattern it quantifies. We demonstrate this quantifier in the next example. Including a comma after the number enclosed in braces ({n,}) matches at least n occurrences of the quantified pattern. A set of braces containing two numbers ({n,m}) matches between n and m occurrences (inclusive) of the pattern that it quantifies. All of the quantifiers are greedy: they will match as many occurrences of the pattern as possible until the pattern fails to make a match. If a quantifier is followed by a question mark (?), the quantifier becomes lazy and will match as few occurrences as possible as long as there is a successful match.
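The effect of greedy and lazy quantifiers shows up in the length of the matched text. The following is a small sketch; the class name QuantifierDemo, the helper FirstMatch and the sample strings are assumptions made for this illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class QuantifierDemo
{
   // Returns the text of the first match, or "" if there is no match.
   public static string FirstMatch( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Value;
   }

   public static void Main()
   {
      string tags = "<b>bold</b>"; // hypothetical sample data

      Console.WriteLine( FirstMatch( tags, "<.+>" ) );  // <b>bold</b> (greedy)
      Console.WriteLine( FirstMatch( tags, "<.+?>" ) ); // <b> (lazy)

      Console.WriteLine( FirstMatch( "aaaaa", "a{2,3}" ) );  // aaa (greedy takes the maximum)
      Console.WriteLine( FirstMatch( "aaaaa", "a{2,3}?" ) ); // aa (lazy takes the minimum)
      Console.WriteLine( FirstMatch( "aaaaa", "a{4,}" ) );   // aaaaa (at least four)
      Console.WriteLine( FirstMatch( "colour", "colou?r" ) ); // colour (optional "u" present)
      Console.WriteLine( FirstMatch( "color", "colou?r" ) );  // color (optional "u" absent)
   }
}
```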
16.16.2. Validating User Input with Regular Expressions
The Windows application in Fig. 16.21 presents a more involved example that uses regular expressions to validate name, address and telephone number information input by a user.
Figure 16.21. Validating user information using regular expressions.
  1  // Fig. 16.21: Validate.cs
  2  // Validate user information using regular expressions.
  3  using System;
  4  using System.Text.RegularExpressions;
  5  using System.Windows.Forms;
  6
  7  public partial class ValidateForm : Form
  8  {
  9     // default constructor
 10     public ValidateForm()
 11     {
 12        InitializeComponent();
 13     } // end constructor
 14
 15     // handles OkButton Click event
 16     private void okButton_Click( object sender, EventArgs e )
 17     {
 18        // ensures no TextBoxes are empty
 19        if ( lastNameTextBox.Text == "" || firstNameTextBox.Text == "" ||
 20           addressTextBox.Text == "" || cityTextBox.Text == "" ||
 21           stateTextBox.Text == "" || zipCodeTextBox.Text == "" ||
 22           phoneTextBox.Text == "" )
 23        {
 24           // display popup box
 25           MessageBox.Show( "Please fill in all fields", "Error",
 26              MessageBoxButtons.OK, MessageBoxIcon.Error );
 27           lastNameTextBox.Focus(); // set focus to lastNameTextBox
 28           return;
 29        } // end if
 30
 31        // if last name format invalid show message
 32        if ( !Regex.Match( lastNameTextBox.Text,
 33           "^[A-Z][a-zA-Z]*$" ).Success )
 34        {
 35           // last name was incorrect
 36           MessageBox.Show( "Invalid last name", "Message",
 37              MessageBoxButtons.OK, MessageBoxIcon.Error );
 38           lastNameTextBox.Focus();
 39           return;
 40        } // end if
 41
 42        // if first name format invalid show message
 43        if ( !Regex.Match( firstNameTextBox.Text,
 44           "^[A-Z][a-zA-Z]*$" ).Success )
 45        {
 46           // first name was incorrect
 47           MessageBox.Show( "Invalid first name", "Message",
 48              MessageBoxButtons.OK, MessageBoxIcon.Error );
 49           firstNameTextBox.Focus();
 50           return;
 51        } // end if
 52
 53        // if address format invalid show message
 54        if ( !Regex.Match( addressTextBox.Text,
 55           @"^[0-9]+\s+([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$" ).Success )
 56        {
 57           // address was incorrect
 58           MessageBox.Show( "Invalid address", "Message",
 59              MessageBoxButtons.OK, MessageBoxIcon.Error );
 60           addressTextBox.Focus();
 61           return;
 62        } // end if
 63
 64        // if city format invalid show message
 65        if ( !Regex.Match( cityTextBox.Text,
 66           @"^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$" ).Success )
 67        {
 68           // city was incorrect
 69           MessageBox.Show( "Invalid city", "Message",
 70              MessageBoxButtons.OK, MessageBoxIcon.Error );
 71           cityTextBox.Focus();
 72           return;
 73        } // end if
 74
 75        // if state format invalid show message
 76        if ( !Regex.Match( stateTextBox.Text,
 77           @"^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$" ).Success )
 78        {
 79           // state was incorrect
 80           MessageBox.Show( "Invalid state", "Message",
 81              MessageBoxButtons.OK, MessageBoxIcon.Error );
 82           stateTextBox.Focus();
 83           return;
 84        } // end if
 85
 86        // if zip code format invalid show message
 87        if ( !Regex.Match( zipCodeTextBox.Text, @"^\d{5}$" ).Success )
 88        {
 89           // zip was incorrect
 90           MessageBox.Show( "Invalid zip code", "Message",
 91              MessageBoxButtons.OK, MessageBoxIcon.Error );
 92           zipCodeTextBox.Focus();
 93           return;
 94        } // end if
 95
 96        // if phone number format invalid show message
 97        if ( !Regex.Match( phoneTextBox.Text,
 98           @"^[1-9]\d{2}-[1-9]\d{2}-\d{4}$" ).Success )
 99        {
100           // phone number was incorrect
101           MessageBox.Show( "Invalid phone number", "Message",
102              MessageBoxButtons.OK, MessageBoxIcon.Error );
103           phoneTextBox.Focus();
104           return;
105        } // end if
106
107        // information is valid, signal user and exit application
108        this.Hide(); // hide main window while MessageBox displays
109        MessageBox.Show( "Thank You!", "Information Correct",
110           MessageBoxButtons.OK, MessageBoxIcon.Information );
111        Application.Exit();
112     } // end method okButton_Click
113  } // end class ValidateForm
When a user clicks the OK button, the program checks to make sure that none of the fields is empty (lines 19-22). If one or more fields are empty, the program displays a message to the user (lines 25-26) that all fields must be filled in before the program can validate the input information. Line 27 calls lastNameTextBox's Focus method to place the cursor in the lastNameTextBox. The program then exits the event handler (line 28). If there are no empty fields, lines 32-105 validate the user input. Lines 32-40 validate the last name by calling static method Match of class Regex, passing both the string to validate and the regular expression as arguments. Method Match returns a Match object. This object contains a Success property that indicates whether method Match's first argument matched the pattern specified by the regular expression in the second argument. If the value of Success is false (i.e., there was no match), lines 36-37 display an error message, line 38 sets the focus back to the lastNameTextBox so that the user can retype the input and line 39 terminates the event handler. If there is a match, the event handler proceeds to validate the first name. This process continues until the event handler validates the user input in all the TextBoxes or until a validation fails. If all of the fields contain valid information, the program displays a message dialog stating this, and the program exits when the user dismisses the dialog.
In the previous example, we searched a string for substrings that matched a regular expression. In this example, we want to ensure that the entire string in each TextBox conforms to a particular regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". In a regular expression that begins with a "^" character and ends with a "$" character, the characters "^" and "$" represent the beginning and end of a string, respectively. These characters force a regular expression to return a match only if the entire string being processed matches the regular expression.
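The effect of the "^" and "$" anchors can be seen by applying the name pattern with and without them. This is a sketch under stated assumptions: the class name AnchorDemo and the helper Found are hypothetical, and the last two checks go slightly beyond the text by noting a .NET detail, namely that "$" also matches just before a single trailing newline, whereas "\z" matches only at the very end of the string.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AnchorDemo
{
   // Returns true if the pattern matches the input.
   public static bool Found( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Success;
   }

   public static void Main()
   {
      // unanchored: the substring "Smith" is enough for a match
      Console.WriteLine( Found( "9@Smith#", "[A-Z][a-zA-Z]*" ) );   // True

      // anchored: the whole string must conform
      Console.WriteLine( Found( "9@Smith#", "^[A-Z][a-zA-Z]*$" ) ); // False
      Console.WriteLine( Found( "Smith", "^[A-Z][a-zA-Z]*$" ) );    // True

      // .NET detail: "$" tolerates one trailing newline, "\z" does not
      Console.WriteLine( Found( "Smith\n", "^[A-Z][a-zA-Z]*$" ) );   // True
      Console.WriteLine( Found( "Smith\n", @"^[A-Z][a-zA-Z]*\z" ) ); // False
   }
}
```

Text typed into a single-line TextBox normally contains no newline, so the distinction rarely matters for a form like this one; it can matter when the input comes from a file or a multiline control.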
The regular expression in line 33 uses the square bracket and range notation to match an uppercase first letter, followed by letters of any case: a-z matches any lowercase letter, and A-Z matches any uppercase letter. The * quantifier signifies that the second range of characters may occur zero or more times in the string. Thus, this expression matches any string consisting of one uppercase letter, followed by zero or more additional letters.
The notation \s matches a single whitespace character (lines 55, 66 and 77). The expression \d{5}, used in the Zip (zip code) field, matches any five digits (line 87). Note that without the "^" and "$" characters, the regular expression would match any five consecutive digits in the string. By including the "^" and "$" characters, we ensure that only five-digit zip codes are allowed.
The character "|" (lines 55, 66 and 77) matches the expression to its left or the expression to its right. For example, Hi (John|Jane) matches both Hi John and Hi Jane. In line 55, we use the character "|" to indicate that the address can contain a word of one or more characters or a word of one or more characters followed by a space and another word of one or more characters. Note the use of parentheses to group parts of the regular expression. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.
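Alternation and grouping can be explored with a few checks. In this sketch the class name AlternationDemo, the helper Found and the pattern "^(ab)+$" are assumptions added for illustration; only the "Hi (John|Jane)" pattern comes from the text.

```csharp
using System;
using System.Text.RegularExpressions;

public static class AlternationDemo
{
   // Returns true if the pattern matches the input.
   public static bool Found( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Success;
   }

   public static void Main()
   {
      // parentheses limit the alternation to the two names
      Console.WriteLine( Found( "Hi Jane", "^Hi (John|Jane)$" ) ); // True
      Console.WriteLine( Found( "Jane", "^Hi (John|Jane)$" ) );    // False

      // moving the parentheses changes what "|" chooses between
      Console.WriteLine( Found( "Jane", "^(Hi John|Jane)$" ) );    // True

      // a quantifier applied to a group repeats the whole group
      Console.WriteLine( Found( "ababab", "^(ab)+$" ) ); // True
      Console.WriteLine( Found( "aba", "^(ab)+$" ) );    // False
   }
}
```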
The Last name and First name fields both accept strings of any length that begin with an uppercase letter. The regular expression for the Address field (line 55) matches a number of at least one digit, followed by one or more whitespace characters and then either one or more letters or else one or more letters followed by a space and another series of one or more letters. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses. As currently formed, the regular expression in line 55 does not match an address that does not start with a number or that has more than two words. The regular expressions for the City (line 66) and State (line 77) fields match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham and West Newton would match. Again, these regular expressions would not accept names that have more than two words. The regular expression for the Zip code field (line 87) ensures that the zip code is a five-digit number. The regular expression for the Phone field (line 98) indicates that the phone number must be of the form xxx-yyy-yyyy, where the xs represent the area code and the ys the number. The first x and the first y cannot be zero, as specified by the range [1-9] in each case.
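The field patterns can be exercised outside a Windows Form. The console sketch below copies the address, zip code and phone patterns from Fig. 16.21; the class name FieldPatterns, the helper IsValid and the sample values are hypothetical and serve only to illustrate which inputs pass.

```csharp
using System;
using System.Text.RegularExpressions;

public static class FieldPatterns
{
   // patterns taken from lines 55, 87 and 98 of Fig. 16.21
   public const string Address =
      @"^[0-9]+\s+([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$";
   public const string Zip = @"^\d{5}$";
   public const string Phone = @"^[1-9]\d{2}-[1-9]\d{2}-\d{4}$";

   // Returns true if the entire input conforms to the pattern.
   public static bool IsValid( string input, string pattern )
   {
      return Regex.Match( input, pattern ).Success;
   }

   public static void Main()
   {
      Console.WriteLine( IsValid( "10 Broadway", Address ) );          // True
      Console.WriteLine( IsValid( "10 Main Street", Address ) );       // True
      Console.WriteLine( IsValid( "Main Street", Address ) );          // False: no street number
      Console.WriteLine( IsValid( "10 Main Street North", Address ) ); // False: three words

      Console.WriteLine( IsValid( "02134", Zip ) );  // True
      Console.WriteLine( IsValid( "021345", Zip ) ); // False: six digits

      Console.WriteLine( IsValid( "617-555-1234", Phone ) ); // True
      Console.WriteLine( IsValid( "017-555-1234", Phone ) ); // False: area code starts with 0
   }
}
```

Keeping the patterns in named constants is a design choice of this sketch, not of the figure; it lets the same expression be reused and tested without repeating the literal.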
16.16.3. Regex methods Replace and Split
Sometimes it is useful to replace parts of one string with another or to split a string according to a regular expression. For this purpose, the Regex class provides static and instance versions of methods Replace and Split, which are demonstrated in Fig. 16.22.
Figure 16.22. Regex methods Replace and Split.
 1  // Fig. 16.22: RegexSubstitution.cs
 2  // Using Regex method Replace.
 3  using System;
 4  using System.Text.RegularExpressions;
 5
 6  class RegexSubstitution
 7  {
 8     public static void Main()
 9     {
10        string testString1 =
11           "This sentence ends in 5 stars *****";
12        string output = "";
13        string testString2 = "1, 2, 3, 4, 5, 6, 7, 8";
14        Regex testRegex1 = new Regex( @"\d" );
15        string[] result;
16
17        Console.WriteLine( "Original string: " +
18           testString1 );
19        testString1 = Regex.Replace( testString1, @"\*", "^" );
20        Console.WriteLine( "^ substituted for *: " + testString1 );
21        testString1 = Regex.Replace( testString1, "stars",
22           "carets" );
23        Console.WriteLine( "\"carets\" substituted for \"stars\": " +
24           testString1 );
25        Console.WriteLine( "Every word replaced by \"word\": " +
26           Regex.Replace( testString1, @"\w+", "word" ) );
27        Console.WriteLine( "\nOriginal string: " + testString2 );
28        Console.WriteLine( "Replace first 3 digits by \"digit\": " +
29           testRegex1.Replace( testString2, "digit", 3 ) );
30        Console.Write( "string split at commas [" );
31
32        result = Regex.Split( testString2, @",\s*" );
33
34        foreach ( string resultString in result )
35           output += "\"" + resultString + "\", ";
36
37        // Delete ", " at the end of output string
38        Console.WriteLine( output.Substring( 0, output.Length - 2 ) + "]" );
39     } // end method Main
40  } // end class RegexSubstitution
Method Replace replaces text in a string with new text wherever the original string matches a regular expression. We use two versions of this method in Fig. 16.22. The first version (line 19) is static and takes three parameters: the string to modify, the string containing the regular expression to match and the replacement string. Here, Replace replaces every instance of "*" in testString1 with "^". Notice that the regular expression (@"\*") precedes character * with a backslash, \. Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 19, we want to find all occurrences of the literal character *; to do this, we must escape character * with character \. By escaping a special regular expression character with a \, we tell the regular-expression matching engine to find the actual character * rather than use it as a quantifier.
The second version of method Replace (line 29) is an instance method that uses the regular expression passed to the constructor for testRegex1 (line 14) to perform the replacement operation. Line 14 instantiates testRegex1 with argument @"\d". The call to instance method Replace in line 29 takes three arguments: a string to modify, a string containing the replacement text and an int specifying the number of replacements to make. In this case, line 29 replaces the first three instances of a digit ("\d") in testString2 with the text "digit".
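The two forms of Replace can be compared side by side. This is a minimal sketch: the class name ReplaceDemo, its helper methods and the sample strings are assumptions for illustration. The last helper relies on the Regex constructor throwing an ArgumentException for a malformed pattern, which is how .NET reports a quantifier such as an unescaped "*" that has nothing to repeat.

```csharp
using System;
using System.Text.RegularExpressions;

public static class ReplaceDemo
{
   // static Replace: every literal asterisk becomes a caret
   public static string StarsToCarets( string input )
   {
      return Regex.Replace( input, @"\*", "^" );
   }

   // instance Replace: only the first count digits are replaced
   public static string ReplaceFirstDigits( string input, int count )
   {
      Regex digit = new Regex( @"\d" );
      return digit.Replace( input, "digit", count );
   }

   // Returns false if the pattern cannot be parsed as a regular expression.
   public static bool IsValidPattern( string pattern )
   {
      try
      {
         new Regex( pattern );
         return true;
      }
      catch ( ArgumentException )
      {
         return false;
      }
   }

   public static void Main()
   {
      Console.WriteLine( StarsToCarets( "5 stars *****" ) );        // 5 stars ^^^^^
      Console.WriteLine( ReplaceFirstDigits( "1, 2, 3, 4", 2 ) );   // digit, digit, 3, 4
      Console.WriteLine( IsValidPattern( @"\*" ) ); // True: escaped asterisk is a literal
      Console.WriteLine( IsValidPattern( "*" ) );   // False: quantifier with nothing to repeat
   }
}
```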
Method Split divides a string into several substrings. The original string is broken at delimiters that match a specified regular expression. Method Split returns an array containing the substrings. In line 32, we use the static version of method Split to separate a string of comma-separated integers. The first argument is the string to split; the second argument is the regular expression that represents the delimiter. The regular expression @",\s*" separates the substrings at each comma. By matching any whitespace characters that follow the comma (\s* in the regular expression), we eliminate extra spaces from the resulting substrings.
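The role of \s* in the delimiter becomes visible when the input has uneven spacing. In the sketch below, the class name SplitDemo, the helper SplitOnCommas and the sample string are hypothetical; string.Join is used only to print the resulting array compactly.

```csharp
using System;
using System.Text.RegularExpressions;

public static class SplitDemo
{
   // Splits at each comma and discards any whitespace that follows it.
   public static string[] SplitOnCommas( string input )
   {
      return Regex.Split( input, @",\s*" );
   }

   public static void Main()
   {
      string numbers = "1, 2,3,   4"; // hypothetical input with uneven spacing

      // delimiter includes trailing whitespace: clean substrings
      Console.WriteLine( string.Join( "|", SplitOnCommas( numbers ) ) ); // 1|2|3|4

      // delimiter is the comma alone: the spaces stay in the substrings
      Console.WriteLine( string.Join( "|", Regex.Split( numbers, "," ) ) ); // 1| 2|3|   4
   }
}
```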