Visual Basic 2005 for Programmers (2nd Edition)

16.16. Regular Expressions and Class Regex

Regular expressions are specially formatted Strings used to find (and possibly replace) patterns in text. They can be useful during information validation, to ensure that data is in a particular format. For example, a United States ZIP code must consist of five digits, and a last name must start with a capital letter. Compilers use regular expressions to validate the syntax of programs. If the program code does not match the regular expression, the compiler indicates that there is a syntax error.

The .NET Framework provides several classes to help developers work with regular expressions. Class Regex (of the System.Text.RegularExpressions namespace) represents an immutable regular expression. Regex method Match returns an object of class Match that represents a single regular-expression match. Regex also provides method Matches, which finds all matches of a regular expression in an arbitrary String and returns a MatchCollection object containing all the Match objects. A collection is a data structure similar to an array and can be used with a For Each statement to iterate through the collection's elements. We discuss collections in more detail in Chapter 26, Collections. To use class Regex, add an Imports statement for the namespace System.Text.RegularExpressions.
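
As a minimal sketch of the difference between the two methods just described, the following console program asks for the first match with Match and for all matches with Matches. The module name, helper functions, pattern and sample text are illustrative assumptions and do not appear in the chapter's figures.

```vb
Imports System
Imports System.Text.RegularExpressions

Module MatchVersusMatches
   ' returns the text of the first match, or "" if there is none
   Function FirstMatch(ByVal input As String, ByVal pattern As String) As String
      Dim firstResult As Match = Regex.Match(input, pattern)

      If firstResult.Success Then
         Return firstResult.Value
      End If

      Return ""
   End Function

   ' returns the number of non-overlapping matches found in input
   Function CountMatches(ByVal input As String, ByVal pattern As String) As Integer
      Dim allResults As MatchCollection = Regex.Matches(input, pattern)
      Return allResults.Count
   End Function

   Sub Main()
      Dim sample As String = "Order 66 shipped 3 items on day 12"

      Console.WriteLine(FirstMatch(sample, "\d+"))   ' prints 66
      Console.WriteLine(CountMatches(sample, "\d+")) ' prints 3
   End Sub
End Module
```

Match stops at the first run of digits, whereas Matches collects every run, which is why the second call reports three results.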

Regular Expression Character Classes

The table in Fig. 16.18 specifies some character classes that can be used with regular expressions. Do not confuse a character class with a Visual Basic class declaration. A character class is simply an escape sequence that represents a group of characters that might appear in a String.

Figure 16.18. Character classes.

Character class   Matches               Character class   Matches
\d                any digit             \D                any non-digit
\w                any word character    \W                any non-word character
\s                any whitespace        \S                any non-whitespace

A word character is any alphanumeric character or underscore. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. A digit is any numeric character. Regular expressions are not limited to the character classes in Fig. 16.18. As you will see in our first example, regular expressions can use other notations to search for complex patterns in Strings.
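
The character classes in Fig. 16.18 can be explored with a short program such as the one below, which counts how many characters of a sample String fall into each class. This is only a sketch: the module name CharacterClassDemo, the helper CountClass and the sample text are hypothetical.

```vb
Imports System
Imports System.Text.RegularExpressions

Module CharacterClassDemo
   ' counts the characters in input that belong to the given character class
   Function CountClass(ByVal input As String, ByVal characterClass As String) As Integer
      Return Regex.Matches(input, characterClass).Count
   End Function

   Sub Main()
      Dim sample As String = "Room 42, Floor 7" ' 16 characters in total
      Dim classes() As String = New String() {"\d", "\D", "\w", "\W", "\s", "\S"}

      ' prints \d: 3, \D: 13, \w: 12, \W: 4, \s: 3 and \S: 13
      For Each characterClass As String In classes
         Console.WriteLine(characterClass & ": " & CountClass(sample, characterClass))
      Next characterClass
   End Sub
End Module
```

Each uppercase class is the complement of its lowercase partner, so the two counts in every pair add up to the length of the String.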

16.16.1. Regular Expression Example

The program in Fig. 16.19 tries to match birthdays to a regular expression. For demonstration purposes, the expression matches only birthdays that do not occur in April and that belong to people whose names begin with "J".

Figure 16.19. Regular expressions checking birthdays.

 1  ' Fig. 16.19: RegexMatches.vb
 2  ' Demonstrating Class Regex.
 3  Imports System.Text.RegularExpressions
 4
 5  Module RegexMatches
 6     Sub Main()
 7        ' create regular expression
 8        Dim expression As New Regex("J.*\d[0-35-9]-\d\d-\d\d")
 9
10        Dim string1 As String = "Jane's Birthday is 05-12-75" & vbCrLf & _
11           "Dave's Birthday is 11-04-68" & vbCrLf & _
12           "John's Birthday is 04-28-73" & vbCrLf & _
13           "Joe's Birthday is 12-17-77"
14
15        ' match regular expression to string and
16        ' print out all matches
17        For Each myMatch As Match In expression.Matches(string1)
18           Console.WriteLine(myMatch)
19        Next myMatch
20     End Sub ' Main
21  End Module ' RegexMatches

Jane's Birthday is 05-12-75
Joe's Birthday is 12-17-77

Line 8 creates a Regex object and passes a regular-expression pattern string to the Regex constructor. The first character in the regular expression, "J", is a literal character. Any String matching this regular expression is required to start with "J". In a regular expression, the dot character "." matches any single character except a newline character. When the dot character is followed by an asterisk, as in ".*", the regular expression matches any number of unspecified characters except newlines. In general, when the regular-expression operator "*" is applied to a pattern, the pattern will match zero or more occurrences. By contrast, applying the regular-expression operator "+" to a pattern causes the pattern to match one or more occurrences. For example, both "A*" and "A+" will match "A", but only "A*" will match an empty String.
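
The difference between "*" and "+" can be checked directly. The sketch below wraps each pattern in the "^" and "$" characters (discussed later in this section) so that the whole String must match. The module name StarVersusPlus and the helper MatchesWhole are assumptions made for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module StarVersusPlus
   ' True if the entire input, not just a substring, matches pattern
   Function MatchesWhole(ByVal input As String, ByVal pattern As String) As Boolean
      Return Regex.IsMatch(input, "^" & pattern & "$")
   End Function

   Sub Main()
      Console.WriteLine(MatchesWhole("A", "A*"))   ' True
      Console.WriteLine(MatchesWhole("A", "A+"))   ' True
      Console.WriteLine(MatchesWhole("", "A*"))    ' True: zero occurrences are allowed
      Console.WriteLine(MatchesWhole("", "A+"))    ' False: at least one "A" is required
      Console.WriteLine(MatchesWhole("AAA", "A+")) ' True
   End Sub
End Module
```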

As indicated in Fig. 16.18, "\d" matches any numeric digit. To specify sets of characters other than those that belong to a predefined character class, characters can be listed in square brackets, []. For example, the pattern "[aeiou]" matches any vowel. Ranges of characters are represented by placing a dash (-) between two characters. In the example, "[0-35-9]" matches only digits in the ranges specified by the pattern: any digit between 0 and 3 or between 5 and 9. Therefore, it matches any digit except 4. You can also specify that a pattern should match anything other than the characters in the brackets. To do so, place ^ as the first character in the brackets. It is important to note that "[^4]" is not the same as "[0-35-9]"; "[^4]" matches any non-digit as well as any digit other than 4.
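
To make the distinction between "[0-35-9]" and "[^4]" concrete, the following sketch tests single characters against both sets. The module name BracketSetDemo and the helper IsInSet are hypothetical; the "^" and "$" added by the helper simply restrict the test to one-character Strings.

```vb
Imports System
Imports System.Text.RegularExpressions

Module BracketSetDemo
   ' True if input is exactly one character that belongs to the bracketed set
   Function IsInSet(ByVal input As String, ByVal bracketedSet As String) As Boolean
      Return Regex.IsMatch(input, "^" & bracketedSet & "$")
   End Function

   Sub Main()
      Console.WriteLine(IsInSet("7", "[0-35-9]")) ' True: 7 lies in the range 5-9
      Console.WriteLine(IsInSet("4", "[0-35-9]")) ' False: 4 is in neither range
      Console.WriteLine(IsInSet("x", "[0-35-9]")) ' False: not a digit at all
      Console.WriteLine(IsInSet("7", "[^4]"))     ' True
      Console.WriteLine(IsInSet("4", "[^4]"))     ' False
      Console.WriteLine(IsInSet("x", "[^4]"))     ' True: any character except 4
   End Sub
End Module
```

The two sets agree on digits and differ only on non-digits such as "x".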

Although the "-" character indicates a range when it is enclosed in square brackets, instances of the "-" character outside square brackets are treated as literal characters. Thus, the regular expression in line 8 searches for a String that starts with the letter "J", followed by any number of characters, followed by a two-digit number (of which the second digit cannot be 4), followed by a dash, another two-digit number, a dash and another two-digit number.

Lines 17-19 use a For Each statement to iterate through the MatchCollection returned by the expression object's Matches method, which received string1 as an argument. The elements in the MatchCollection are Match objects, so the For Each statement declares variable myMatch to be of type Match. For each Match, line 18 outputs the text that matched the regular expression. The output in Fig. 16.19 indicates the two matches that were found in string1. Note that both matches conform to the pattern specified by the regular expression.

Quantifiers

The asterisk (*) in line 8 of Fig. 16.19 is more formally called a quantifier. Figure 16.20 lists various quantifiers that you can place after a pattern in a regular expression and the purpose of each quantifier.

Figure 16.20. Quantifiers used in regular expressions.

Quantifier   Matches
*            Matches zero or more occurrences of the preceding pattern.
+            Matches one or more occurrences of the preceding pattern.
?            Matches zero or one occurrences of the preceding pattern.
{n}          Matches exactly n occurrences of the preceding pattern.
{n,}         Matches at least n occurrences of the preceding pattern.
{n,m}        Matches between n and m (inclusive) occurrences of the preceding pattern.

We have already discussed how the asterisk (*) and plus (+) quantifiers work. The question mark (?) quantifier matches zero or one occurrences of the pattern that it quantifies. A set of braces containing one number ({n}) matches exactly n occurrences of the pattern it quantifies. We demonstrate this quantifier in the next example. Including a comma after the number enclosed in braces ({n,}) matches at least n occurrences of the quantified pattern. The set of braces containing two numbers ({n,m}) matches between n and m occurrences (inclusive) of the pattern it quantifies. All of the quantifiers are greedy: they will match as many occurrences of the pattern as possible until the pattern fails to make a match. If a quantifier is followed by a question mark (?), the quantifier becomes lazy and will match as few occurrences as possible as long as there is a successful match.
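
Greedy and lazy matching are easiest to see side by side. The sketch below applies a greedy and a lazy version of the same pattern to one String; the module name GreedyVersusLazy, the helper FirstMatch and the sample text are assumptions chosen for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazy
   ' returns the text of the first match of pattern in input ("" if none)
   Function FirstMatch(ByVal input As String, ByVal pattern As String) As String
      Return Regex.Match(input, pattern).Value
   End Function

   Sub Main()
      Dim sample As String = "<b>one</b><b>two</b>"

      ' greedy: .* runs to the last </b> in the String
      Console.WriteLine(FirstMatch(sample, "<b>.*</b>"))  ' <b>one</b><b>two</b>

      ' lazy: .*? stops at the first </b> that completes a match
      Console.WriteLine(FirstMatch(sample, "<b>.*?</b>")) ' <b>one</b>

      ' {n,m} is greedy too, unless followed by ?
      Console.WriteLine(FirstMatch("12345", "\d{2,3}"))   ' 123
      Console.WriteLine(FirstMatch("12345", "\d{2,3}?"))  ' 12
   End Sub
End Module
```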

16.16.2. Validating User Input with Regular Expressions

The Windows application in Fig. 16.21 presents a more involved example that uses regular expressions to validate name, address and telephone number information input by a user.

Figure 16.21. Validating user information using regular expressions.

 1  ' Fig. 16.21: Validate.vb
 2  ' Validate user information using regular expressions.
 3  Imports System.Text.RegularExpressions
 4
 5  Public Class frmValidate
 6     ' handles btnOk Click event
 7     Private Sub btnOk_Click(ByVal sender As System.Object, _
 8        ByVal e As System.EventArgs) Handles btnOk.Click
 9        ' ensures no TextBoxes are empty
10        If txtLastName.Text = "" Or txtFirstName.Text = "" Or _
11           txtAddress.Text = "" Or txtCity.Text = "" Or _
12           txtState.Text = "" Or txtZipCode.Text = "" Or _
13           txtPhone.Text = "" Then
14           ' display popup box
15           MessageBox.Show("Please fill in all fields", "Error", _
16              MessageBoxButtons.OK, MessageBoxIcon.Error)
17           txtLastName.Focus() ' set focus to txtLastName
18           Return
19        End If
20
21        ' if last name format invalid show message
22        If Not Regex.Match(txtLastName.Text, _
23           "^[A-Z][a-zA-Z]*$").Success Then
24           ' last name was incorrect
25           MessageBox.Show("Invalid last name", "Message", _
26              MessageBoxButtons.OK, MessageBoxIcon.Error)
27           txtLastName.Focus()
28           Return
29        End If
30
31        ' if first name format invalid show message
32        If Not Regex.Match(txtFirstName.Text, _
33           "^[A-Z][a-zA-Z]*$").Success Then
34           ' first name was incorrect
35           MessageBox.Show("Invalid first name", "Message", _
36              MessageBoxButtons.OK, MessageBoxIcon.Error)
37           txtFirstName.Focus()
38           Return
39        End If
40
41        ' if address format invalid show message
42        If Not Regex.Match(txtAddress.Text, _
43           "^[0-9]+\s+([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$").Success Then
44           ' address was incorrect
45           MessageBox.Show("Invalid address", "Message", _
46              MessageBoxButtons.OK, MessageBoxIcon.Error)
47           txtAddress.Focus()
48           Return
49        End If
50
51        ' if city format invalid show message
52        If Not Regex.Match(txtCity.Text, _
53           "^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$").Success Then
54           ' city was incorrect
55           MessageBox.Show("Invalid city", "Message", _
56              MessageBoxButtons.OK, MessageBoxIcon.Error)
57           txtCity.Focus()
58           Return
59        End If
60
61        ' if state format invalid show message
62        If Not Regex.Match(txtState.Text, _
63           "^([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$").Success Then
64           ' state was incorrect
65           MessageBox.Show("Invalid state", "Message", _
66              MessageBoxButtons.OK, MessageBoxIcon.Error)
67           txtState.Focus()
68           Return
69        End If
70
71        ' if zip code format invalid show message
72        If Not Regex.Match(txtZipCode.Text, "^\d{5}$").Success Then
73           ' zip was incorrect
74           MessageBox.Show("Invalid zip code", "Message", _
75              MessageBoxButtons.OK, MessageBoxIcon.Error)
76           txtZipCode.Focus()
77           Return
78        End If
79
80        ' if phone number format invalid show message
81        If Not Regex.Match(txtPhone.Text, _
82           "^[1-9]\d{2}-[1-9]\d{2}-\d{4}$").Success Then
83           ' phone number was incorrect
84           MessageBox.Show("Invalid phone number", "Message", _
85              MessageBoxButtons.OK, MessageBoxIcon.Error)
86           txtPhone.Focus()
87           Return
88        End If
89
90        ' information is valid, signal user and exit application
91        Me.Hide() ' hide main window while MessageBox displays
92        MessageBox.Show("Thank You!", "Information Correct", _
93           MessageBoxButtons.OK, MessageBoxIcon.Information)
94        Application.Exit()
95     End Sub ' btnOk_Click
96  End Class ' frmValidate

When a user clicks the OK button, the program checks to make sure that none of the fields is empty (lines 10-13). If one or more fields are empty, the program displays a message to the user (lines 15-16) that all fields must be filled in before the program can validate the input information. Line 17 calls txtLastName's Focus method so that the user can begin typing in that TextBox. The program then exits the event handler (line 18). If there are no empty fields, lines 22-88 validate the user input. Lines 22-29 validate the last name by calling Shared method Match of class Regex, passing both the string to validate and the regular expression as arguments. Method Match returns a Match object. This object contains a Success property that indicates whether method Match's first argument matched the pattern specified by the regular expression in the second argument. If the value of Success is False (i.e., there was no match), lines 25-26 display an error message, line 27 sets the focus back to txtLastName so that the user can retype the input, and line 28 terminates the event handler. If there is a match, the event handler proceeds to validate the first name. This process continues for each TextBox's contents until the event handler validates the user input in all the TextBoxes or until a validation fails. If all of the fields contain valid information, the program displays a message dialog stating this, and the program terminates when the user dismisses the dialog.

In the previous example, we searched a String for substrings that matched a regular expression. In this example, we want to ensure that the entire String in each TextBox conforms to a particular regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". In a regular expression that begins with a "^" character and ends with a "$" character, the characters "^" and "$" represent the beginning and end of a String, respectively. These characters force a regular expression to return a match only if the entire String being processed matches the regular expression.
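
The effect of "^" and "$" can be verified with the last-name pattern from line 23 of Fig. 16.21. In the sketch below, the module name AnchorDemo and the two helper functions are hypothetical; only the pattern itself comes from the figure.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AnchorDemo
   ' True if some substring of input matches the unanchored name pattern
   Function ContainsName(ByVal input As String) As Boolean
      Return Regex.IsMatch(input, "[A-Z][a-zA-Z]*")
   End Function

   ' True only if the entire input matches (pattern from line 23 of Fig. 16.21)
   Function IsName(ByVal input As String) As Boolean
      Return Regex.IsMatch(input, "^[A-Z][a-zA-Z]*$")
   End Function

   Sub Main()
      Console.WriteLine(ContainsName("9@Smith#")) ' True: the substring "Smith" matches
      Console.WriteLine(IsName("9@Smith#"))       ' False: the whole String must match
      Console.WriteLine(IsName("Smith"))          ' True
   End Sub
End Module
```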

The regular expression in line 23 uses the square bracket and range notation to match an uppercase first letter, followed by letters of any case. The range a-z matches any lowercase letter, and A-Z matches any uppercase letter. The * quantifier signifies that the second range of characters may occur zero or more times in the String. Thus, this expression matches any String consisting of one uppercase letter followed by zero or more additional letters.

The notation \s matches a single whitespace character (lines 43, 53 and 63). The expression \d{5}, used in the Zip (zip code) field, matches any five digits (line 72). In general, the expression \d{x}, where x is a positive integer, matches any x consecutive digits. Note that without the "^" and "$" characters, the regular expression would match any five consecutive digits in the String. By including the "^" and "$" characters, we ensure that only five-digit zip codes are allowed.

The character "|" (lines 43, 53 and 63) matches the expression to its left or the expression to its right. For example, Hi (John|Jane) matches both Hi John and Hi Jane. In line 43, we use the character "|" to indicate that the address can contain a word of one or more characters or a word of one or more characters followed by a space and another word of one or more characters. Note the use of parentheses to group parts of the regular expression. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.
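
The following sketch tries the alternation example from the text and then applies a quantifier to a parenthesized group. The module name AlternationDemo, the helper MatchesWhole and the "(ab)+" pattern are illustrative assumptions.

```vb
Imports System
Imports System.Text.RegularExpressions

Module AlternationDemo
   ' True if the entire input matches pattern
   Function MatchesWhole(ByVal input As String, ByVal pattern As String) As Boolean
      Return Regex.IsMatch(input, "^" & pattern & "$")
   End Function

   Sub Main()
      ' "|" chooses between the alternatives inside the parentheses
      Console.WriteLine(MatchesWhole("Hi John", "Hi (John|Jane)")) ' True
      Console.WriteLine(MatchesWhole("Hi Jane", "Hi (John|Jane)")) ' True
      Console.WriteLine(MatchesWhole("Hi Jack", "Hi (John|Jane)")) ' False

      ' a quantifier after ")" applies to the whole group
      Console.WriteLine(MatchesWhole("ababab", "(ab)+")) ' True: "ab" three times
      Console.WriteLine(MatchesWhole("aba", "(ab)+"))    ' False: trailing "a" is left over
   End Sub
End Module
```

Without the parentheses, "Hi John|Jane" would instead mean either "Hi John" or "Jane", because "|" separates everything to its left from everything to its right.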

The Last name and First name fields both accept Strings of any length that begin with an uppercase letter. The regular expression for the Address field (line 43) matches a number of at least one digit, followed by one or more whitespace characters and then either one or more letters, or one or more letters followed by a space and another series of one or more letters. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses. As currently formed, the regular expression in line 43 does not match an address that does not start with a number or that has more than two words. The regular expressions for the City (line 53) and State (line 63) fields match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single space. This means both Waltham and West Newton would match. Again, these regular expressions would not accept names that have more than two words. The regular expression for the Zip code field (line 72) ensures that the zip code is a five-digit number. The regular expression for the Phone field (line 82) indicates that the phone number must be of the form xxx-yyy-yyyy, where the xs represent the area code and the ys the number. The first x and the first y cannot be zero, as specified by the range [1-9] in each case.
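
The address and phone rules described above can be tried outside the Windows form. The sketch below copies the patterns from lines 43 and 82 of Fig. 16.21 into a console module; the module name FieldValidators and its functions are hypothetical, and Regex.IsMatch is used here as a shorthand assumed to be equivalent to testing Regex.Match(...).Success.

```vb
Imports System
Imports System.Text.RegularExpressions

Module FieldValidators
   ' patterns copied from lines 43 and 82 of Fig. 16.21
   Private Const AddressPattern As String = _
      "^[0-9]+\s+([a-zA-Z]+|[a-zA-Z]+\s[a-zA-Z]+)$"
   Private Const PhonePattern As String = "^[1-9]\d{2}-[1-9]\d{2}-\d{4}$"

   Function IsValidAddress(ByVal address As String) As Boolean
      Return Regex.IsMatch(address, AddressPattern)
   End Function

   Function IsValidPhone(ByVal phone As String) As Boolean
      Return Regex.IsMatch(phone, PhonePattern)
   End Function

   Sub Main()
      Console.WriteLine(IsValidAddress("10 Broadway"))        ' True
      Console.WriteLine(IsValidAddress("10 Main Street"))     ' True
      Console.WriteLine(IsValidAddress("Main Street"))        ' False: no leading number
      Console.WriteLine(IsValidAddress("10 Old Main Street")) ' False: three words
      Console.WriteLine(IsValidPhone("555-123-4567"))         ' True
      Console.WriteLine(IsValidPhone("055-123-4567"))         ' False: area code starts with 0
   End Sub
End Module
```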

16.16.3. Regex methods Replace and Split

Sometimes it is useful to replace parts of one String with another or to split a String according to a regular expression. For this purpose, the Regex class provides Shared and instance versions of methods Replace and Split, which are demonstrated in Fig. 16.22.

Figure 16.22. Regex methods Replace and Split.

 1  ' Fig. 16.22: RegexSubstitution.vb
 2  ' Using Regex methods Replace and Split.
 3  Imports System.Text.RegularExpressions
 4
 5  Module RegexSubstitution
 6     Sub Main()
 7        Dim testString1 As String = _
 8           "This sentence ends in 5 stars *****"
 9        Dim output As String = ""
10        Dim testString2 As String = "1, 2, 3, 4, 5, 6, 7, 8"
11        Dim testRegex1 As New Regex("\d")
12        Dim result() As String
13
14        Console.WriteLine("Original string: " & testString1)
15        testString1 = Regex.Replace(testString1, "\*", "^")
16        Console.WriteLine("^ substituted for *: " & testString1)
17        testString1 = Regex.Replace(testString1, "stars", "carets")
18        Console.WriteLine("""carets"" substituted for ""stars"": " & _
19           testString1)
20        Console.WriteLine("Every word replaced by ""word"": " & _
21           Regex.Replace(testString1, "\w+", "word"))
22        Console.WriteLine(vbCrLf & "Original string: " & testString2)
23        Console.WriteLine("Replace first 3 digits by ""digit"": " & _
24           testRegex1.Replace(testString2, "digit", 3))
25        Console.Write("string split at commas [")
26
27        result = Regex.Split(testString2, ",\s*")
28
29        For Each resultString As String In result
30           output &= """" & resultString & """, "
31        Next resultString
32
33        ' Delete ", " at the end of output string
34        Console.WriteLine(output.Substring(0, output.Length - 2) + "]")
35     End Sub ' Main
36  End Module ' RegexSubstitution

Original string: This sentence ends in 5 stars *****
^ substituted for *: This sentence ends in 5 stars ^^^^^
"carets" substituted for "stars": This sentence ends in 5 carets ^^^^^
Every word replaced by "word": word word word word word word ^^^^^

Original string: 1, 2, 3, 4, 5, 6, 7, 8
Replace first 3 digits by "digit": digit, digit, digit, 4, 5, 6, 7, 8
string split at commas ["1", "2", "3", "4", "5", "6", "7", "8"]

Method Replace replaces text in a String with new text wherever the original String matches a regular expression. We use two versions of this method in Fig. 16.22. The first version (line 15) is Shared and takes three parameters: the String to modify, the String containing the regular expression to match and the replacement String. Here, Replace replaces every instance of "*" in testString1 with "^". Note that the regular expression ("\*") precedes character * with a backslash, \. Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 15, we want to find all occurrences of the literal character *; to do this, we must escape character * with a \. By escaping a special regular-expression character with a \, we tell the regular-expression matching engine to find the actual character * rather than use it as a quantifier. Line 17 replaces every occurrence of "stars" in testString1 with "carets". Line 21 replaces every word in testString1 with "word".

The second version of method Replace (line 24) is an instance method that uses the regular expression passed to the constructor for testRegex1 (line 11) to perform the replacement operation. Line 11 instantiates testRegex1 with argument "\d". The call to instance method Replace in line 24 takes three arguments: a String to modify, a String containing the replacement text and an Integer specifying the number of replacements to make. In this case, line 24 replaces the first three instances of a digit ("\d") in testString2 with the text "digit".
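
The two versions of Replace can be wrapped in small helpers, as in the sketch below. The module name ReplaceDemo and both helper functions are hypothetical. The sketch also uses Shared method Regex.Escape, which the figure does not use, as one way to add the escaping backslashes automatically instead of typing "\*" by hand.

```vb
Imports System
Imports System.Text.RegularExpressions

Module ReplaceDemo
   ' Shared version: replaces every occurrence of a literal piece of text
   Function ReplaceLiteral(ByVal input As String, ByVal literal As String, _
      ByVal replacement As String) As String

      ' Regex.Escape inserts the backslashes needed to match literal as-is
      Return Regex.Replace(input, Regex.Escape(literal), replacement)
   End Function

   ' instance version: replaces at most count matches, working left to right
   Function ReplaceFirst(ByVal input As String, ByVal pattern As String, _
      ByVal replacement As String, ByVal count As Integer) As String

      Dim expression As New Regex(pattern)
      Return expression.Replace(input, replacement, count)
   End Function

   Sub Main()
      Console.WriteLine(Regex.Escape("*"))                        ' \*
      Console.WriteLine(ReplaceLiteral("2 * 3 * 4", "*", "x"))    ' 2 x 3 x 4
      Console.WriteLine(ReplaceFirst("1, 2, 3, 4", "\d", "#", 2)) ' #, #, 3, 4
   End Sub
End Module
```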

Method Split divides a String into several substrings. The original String is broken at delimiters that match a specified regular expression. Method Split returns an array containing the substrings. In line 27, we use the Shared version of method Split to separate a String of comma-separated integers. The first argument is the String to split; the second argument is the regular expression that represents the delimiter. The regular expression ",\s*" separates the substrings at each comma. By matching any whitespace characters (\s* in the regular expression), we eliminate extra spaces from the resulting substrings.
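
To see why the delimiter includes \s*, the sketch below splits an unevenly spaced String with "," and with ",\s*" and joins the pieces with "|" so that stray spaces are visible. The module name SplitDemo, the helper SplitAndJoin and the sample String are assumptions made for illustration.

```vb
Imports System
Imports System.Text.RegularExpressions

Module SplitDemo
   ' splits input at each delimiter match and joins the pieces with "|"
   Function SplitAndJoin(ByVal input As String, _
      ByVal delimiterPattern As String) As String

      Dim parts() As String = Regex.Split(input, delimiterPattern)
      Return String.Join("|", parts)
   End Function

   Sub Main()
      Dim sample As String = "1, 2,3,   4"

      ' comma only: the spaces stay attached to the substrings
      Console.WriteLine(SplitAndJoin(sample, ","))    ' 1| 2|3|   4

      ' comma plus any following whitespace: the substrings come out clean
      Console.WriteLine(SplitAndJoin(sample, ",\s*")) ' 1|2|3|4
   End Sub
End Module
```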
