Regular Expressions, Class Pattern and Class Matcher

Regular expressions are sequences of characters and symbols that define a set of strings. They are useful for validating input and ensuring that data is in a particular format. For example, a ZIP code must consist of five digits, and a last name must contain only letters, spaces, apostrophes and hyphens. One application of regular expressions is to facilitate the construction of a compiler. Often, a large and complex regular expression is used to validate the syntax of a program. If the program code does not match the regular expression, the compiler knows that there is a syntax error within the code.

Class String provides several methods for performing regular-expression operations, the simplest of which is the matching operation. String method matches receives a string that specifies the regular expression and matches the contents of the String object on which it is called to the regular expression. The method returns a boolean indicating whether the match succeeded.
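The following minimal sketch illustrates String method matches. The class name MatchesBasics and the helper conforms are hypothetical names chosen for illustration only.

```java
// Minimal sketch of String method matches; names are illustrative.
class MatchesBasics
{
   // returns true only if the whole input conforms to the regular expression
   static boolean conforms( String input, String regex )
   {
      return input.matches( regex );
   }

   public static void main( String[] args )
   {
      System.out.println( conforms( "hello", "hello" ) );       // true
      System.out.println( conforms( "Hello", "hello" ) );       // false: matching is case sensitive
      System.out.println( conforms( "hello world", "hello" ) ); // false: the whole string must match
   }
}
```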

A regular expression consists of literal characters and special symbols. Figure 29.19 specifies some predefined character classes that can be used with regular expressions. A character class is an escape sequence that represents a group of characters. A digit is any numeric character. A word character is any letter (uppercase or lowercase), any digit or the underscore character. A whitespace character is a space, a tab, a carriage return, a newline or a form feed. Each character class matches a single character in the string we are attempting to match with the regular expression.

Figure 29.19. Predefined character classes.

Character   Matches              Character   Matches
\d          any digit            \D          any non-digit
\w          any word character   \W          any non-word character
\s          any whitespace       \S          any non-whitespace
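The predefined character classes are written with a leading backslash (\d, \D, \w, \W, \s and \S), which must be doubled inside a Java string literal. The sketch below, with hypothetical helper names, shows that each class matches exactly one character.

```java
// Sketch: each predefined character class matches a single character.
class PredefinedClasses
{
   static boolean isDigit( String s ) { return s.matches( "\\d" ); }
   static boolean isNonDigit( String s ) { return s.matches( "\\D" ); }
   static boolean isWordChar( String s ) { return s.matches( "\\w" ); }
   static boolean isWhitespace( String s ) { return s.matches( "\\s" ); }

   public static void main( String[] args )
   {
      System.out.println( isDigit( "7" ) );       // true
      System.out.println( isDigit( "42" ) );      // false: two characters
      System.out.println( isNonDigit( "x" ) );    // true
      System.out.println( isWordChar( "_" ) );    // true: underscore is a word character
      System.out.println( isWordChar( "-" ) );    // false
      System.out.println( isWhitespace( "\t" ) ); // true
   }
}
```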

Regular expressions are not limited to these predefined character classes. The expressions employ various operators and other forms of notation to match complex patterns. We examine several of these techniques in the application in Figs. 29.20 and 29.21, which validates user input via regular expressions. [Note: This application is not designed to match all possible valid user input.]

Figure 29.20. Validating user information using regular expressions.


 1  // Fig. 29.20: ValidateInput.java
 2  // Validate user information using regular expressions.
 3
 4  public class ValidateInput
 5  {
 6     // validate first name
 7     public static boolean validateFirstName( String firstName )
 8     {
 9        return firstName.matches( "[A-Z][a-zA-Z]*" );
10     } // end method validateFirstName
11
12     // validate last name
13     public static boolean validateLastName( String lastName )
14     {
15        return lastName.matches( "[a-zA-Z]+([ '-][a-zA-Z]+)*" );
16     } // end method validateLastName
17
18     // validate address
19     public static boolean validateAddress( String address )
20     {
21        return address.matches(
22           "\\d+\\s+([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)" );
23     } // end method validateAddress
24
25     // validate city
26     public static boolean validateCity( String city )
27     {
28        return city.matches( "([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)" );
29     } // end method validateCity
30
31     // validate state
32     public static boolean validateState( String state )
33     {
34        return state.matches( "([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)" );
35     } // end method validateState
36
37     // validate zip
38     public static boolean validateZip( String zip )
39     {
40        return zip.matches( "\\d{5}" );
41     } // end method validateZip
42
43     // validate phone
44     public static boolean validatePhone( String phone )
45     {
46        return phone.matches( "[1-9]\\d{2}-[1-9]\\d{2}-\\d{4}" );
47     } // end method validatePhone
48  } // end class ValidateInput

Figure 29.21. Inputs and validates data from user using the ValidateInput class.


 1  // Fig. 29.21: Validate.java
 2  // Validate user information using regular expressions.
 3  import java.util.Scanner;
 4
 5  public class Validate
 6  {
 7     public static void main( String[] args )
 8     {
 9        // get user input
10        Scanner scanner = new Scanner( System.in );
11        System.out.println( "Please enter first name:" );
12        String firstName = scanner.nextLine();
13        System.out.println( "Please enter last name:" );
14        String lastName = scanner.nextLine();
15        System.out.println( "Please enter address:" );
16        String address = scanner.nextLine();
17        System.out.println( "Please enter city:" );
18        String city = scanner.nextLine();
19        System.out.println( "Please enter state:" );
20        String state = scanner.nextLine();
21        System.out.println( "Please enter zip:" );
22        String zip = scanner.nextLine();
23        System.out.println( "Please enter phone:" );
24        String phone = scanner.nextLine();
25
26        // validate user input and display error message
27        System.out.println( "\nValidate Result:" );
28
29        if ( !ValidateInput.validateFirstName( firstName ) )
30           System.out.println( "Invalid first name" );
31        else if ( !ValidateInput.validateLastName( lastName ) )
32           System.out.println( "Invalid last name" );
33        else if ( !ValidateInput.validateAddress( address ) )
34           System.out.println( "Invalid address" );
35        else if ( !ValidateInput.validateCity( city ) )
36           System.out.println( "Invalid city" );
37        else if ( !ValidateInput.validateState( state ) )
38           System.out.println( "Invalid state" );
39        else if ( !ValidateInput.validateZip( zip ) )
40           System.out.println( "Invalid zip code" );
41        else if ( !ValidateInput.validatePhone( phone ) )
42           System.out.println( "Invalid phone number" );
43        else
44           System.out.println( "Valid input. Thank you." );
45     } // end main
46  } // end class Validate

Please enter first name:
Jane
Please enter last name:
Doe
Please enter address:
123 Some Street
Please enter city:
Some City
Please enter state:
SS
Please enter zip:
123
Please enter phone:
123-456-7890

Validate Result:
Invalid zip code

Please enter first name:
Jane
Please enter last name:
Doe
Please enter address:
123 Some Street
Please enter city:
Some City
Please enter state:
SS
Please enter zip:
12345
Please enter phone:
123-456-7890

Validate Result:
Valid input. Thank you.

Figure 29.20 validates user input. Line 9 validates the first name. To match a set of characters that does not have a predefined character class, use square brackets, []. For example, the pattern "[aeiou]" matches a single character that is a vowel. Ranges of characters can be represented by placing a dash (-) between two characters. In the example, "[A-Z]" matches a single uppercase letter. If the first character in the brackets is "^", the expression accepts any character other than those indicated. However, it is important to note that "[^Z]" is not the same as "[A-Y]", which matches the uppercase letters A-Y: "[^Z]" matches any character other than capital Z, including lowercase letters and non-letters such as the newline character. Ranges in character classes are determined by the letters' integer values. In this example, "[A-Za-z]" matches all uppercase and lowercase letters. The range "[A-z]" matches all letters and also matches those characters (such as [ and \) with an integer value between uppercase Z and lowercase a (for more information on integer values of characters, see Appendix B, ASCII Character Set). Like predefined character classes, character classes delimited by square brackets match a single character in the search object.
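As a small illustration of bracketed character classes, the sketch below contrasts a negated class with a range and shows the extra characters admitted by "[A-z]". The class and method names are hypothetical.

```java
// Sketch: bracketed character classes, negation and ranges.
class BracketClasses
{
   static boolean isVowel( String s ) { return s.matches( "[aeiou]" ); }
   static boolean isNotCapitalZ( String s ) { return s.matches( "[^Z]" ); }
   static boolean isLetter( String s ) { return s.matches( "[A-Za-z]" ); }
   static boolean inLooseRange( String s ) { return s.matches( "[A-z]" ); }

   public static void main( String[] args )
   {
      System.out.println( isVowel( "e" ) );        // true
      System.out.println( isNotCapitalZ( "a" ) );  // true: lowercase letters are accepted
      System.out.println( isNotCapitalZ( "\n" ) ); // true: so is the newline character
      System.out.println( isNotCapitalZ( "Z" ) );  // false
      System.out.println( isLetter( "[" ) );       // false
      System.out.println( inLooseRange( "[" ) );   // true: '[' lies between 'Z' and 'a'
   }
}
```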

In line 9, the asterisk after the second character class indicates that any number of letters can be matched. In general, when the regular-expression operator "*" appears in a regular expression, the application attempts to match zero or more occurrences of the subexpression immediately preceding the "*". Operator "+" attempts to match one or more occurrences of the subexpression immediately preceding "+". So both "A*" and "A+" will match "AAA", but only "A*" will match an empty string.
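The difference between "*" and "+" can be checked directly. The sketch below uses hypothetical helper names and reuses the first-name pattern from line 9 of Fig. 29.20.

```java
// Sketch: zero-or-more (*) versus one-or-more (+).
class StarVersusPlus
{
   static boolean zeroOrMoreA( String s ) { return s.matches( "A*" ); }
   static boolean oneOrMoreA( String s ) { return s.matches( "A+" ); }

   // same pattern as validateFirstName in Fig. 29.20
   static boolean isFirstName( String s ) { return s.matches( "[A-Z][a-zA-Z]*" ); }

   public static void main( String[] args )
   {
      System.out.println( zeroOrMoreA( "AAA" ) );  // true
      System.out.println( oneOrMoreA( "AAA" ) );   // true
      System.out.println( zeroOrMoreA( "" ) );     // true: zero occurrences are allowed
      System.out.println( oneOrMoreA( "" ) );      // false: at least one A is required
      System.out.println( isFirstName( "J" ) );    // true: the * part may match nothing
      System.out.println( isFirstName( "jane" ) ); // false: must start with an uppercase letter
   }
}
```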

If method validateFirstName returns true (line 29 of Fig. 29.21), the application attempts to validate the last name (line 31) by calling validateLastName (lines 13-16 of Fig. 29.20). The regular expression to validate the last name matches one or more letters, optionally followed by further groups of letters, each preceded by a single space, apostrophe or hyphen.

Line 33 validates the address by calling method validateAddress (lines 19-23 of Fig. 29.20). The first character class matches any digit one or more times (\\d+). Note that two \ characters are used, because \ normally starts an escape sequence in a Java string. So \\d in a Java string represents the regular-expression pattern \d. Then we match one or more whitespace characters (\\s+). The character "|" allows a match of the expression to its left or to its right. For example, "Hi (John|Jane)" matches both "Hi John" and "Hi Jane". The parentheses are used to group parts of the regular expression. In this example, the left side of | matches a single word, and the right side matches two words separated by a single whitespace character. So the address must contain a number followed by one or two words. Therefore, "10 Broadway" and "10 Main Street" are both valid addresses in this example. The city (lines 26-29 of Fig. 29.20) and state (lines 32-35 of Fig. 29.20) methods also match any word of at least one character or, alternatively, any two words of at least one character if the words are separated by a single whitespace character. This means both Waltham and West Newton would match.
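A short sketch of alternation and grouping follows. The class and method names are hypothetical; the address pattern is the one used by validateAddress in Fig. 29.20.

```java
// Sketch: the | operator with parentheses for grouping.
class AlternationDemo
{
   static boolean greets( String s )
   {
      return s.matches( "Hi (John|Jane)" );
   }

   // a number, whitespace, then one or two words
   static boolean isAddress( String s )
   {
      return s.matches( "\\d+\\s+([a-zA-Z]+|[a-zA-Z]+\\s[a-zA-Z]+)" );
   }

   public static void main( String[] args )
   {
      System.out.println( greets( "Hi John" ) );                 // true
      System.out.println( greets( "Hi Jane" ) );                 // true
      System.out.println( greets( "Hi Jim" ) );                  // false
      System.out.println( isAddress( "10 Broadway" ) );          // true
      System.out.println( isAddress( "10 Main Street" ) );       // true
      System.out.println( isAddress( "10 Main Street North" ) ); // false: three words
   }
}
```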

The asterisk (*) and plus (+) are formally called quantifiers. Figure 29.22 lists all the quantifiers. We have already discussed how the asterisk (*) and plus (+) quantifiers work. All quantifiers affect only the subexpression immediately preceding the quantifier. Quantifier question mark (?) matches zero or one occurrences of the expression that it quantifies. A set of braces containing one number ({n}) matches exactly n occurrences of the expression it quantifies. We demonstrate this quantifier to validate the zip code in Fig. 29.20 at line 40. Including a comma after the number enclosed in braces ({n,}) matches at least n occurrences of the quantified expression. The set of braces containing two numbers ({n,m}) matches between n and m occurrences of the expression that it quantifies. Quantifiers may be applied to patterns enclosed in parentheses to create more complex regular expressions.

Figure 29.22. Quantifiers used in regular expressions.

Quantifier   Matches
*            Matches zero or more occurrences of the pattern.
+            Matches one or more occurrences of the pattern.
?            Matches zero or one occurrences of the pattern.
{n}          Matches exactly n occurrences.
{n,}         Matches at least n occurrences.
{n,m}        Matches between n and m (inclusive) occurrences.
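The quantifiers can be exercised with a few small patterns. This is an illustrative sketch; the class name, method names and sample patterns (other than the zip-code pattern from Fig. 29.20) are assumptions.

```java
// Sketch: the ?, {n}, {n,} and {n,m} quantifiers.
class QuantifierDemo
{
   static boolean isZip( String s ) { return s.matches( "\\d{5}" ); }                // exactly five digits
   static boolean atLeastFiveDigits( String s ) { return s.matches( "\\d{5,}" ); }   // five or more
   static boolean twoToFourDigits( String s ) { return s.matches( "\\d{2,4}" ); }    // between two and four
   static boolean isColor( String s ) { return s.matches( "colou?r" ); }             // optional u

   public static void main( String[] args )
   {
      System.out.println( isZip( "12345" ) );              // true
      System.out.println( isZip( "1234" ) );               // false
      System.out.println( atLeastFiveDigits( "123456" ) ); // true
      System.out.println( twoToFourDigits( "12345" ) );    // false
      System.out.println( isColor( "color" ) );            // true
      System.out.println( isColor( "colour" ) );           // true
   }
}
```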

All of the quantifiers are greedy. This means that they will match as many occurrences as they can as long as the match is still successful. However, if any of these quantifiers is followed by a question mark (?), the quantifier becomes reluctant (sometimes called lazy). It then will match as few occurrences as possible as long as the match is still successful.
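The effect of a reluctant quantifier is easiest to see when only part of a string is matched. The sketch below uses String method replaceFirst to reveal how much text each quantifier consumed; the class and method names are hypothetical.

```java
// Sketch: greedy versus reluctant quantifiers, observed through replaceFirst.
class GreedyVersusReluctant
{
   // greedy: a+ consumes as many a's as it can
   static String greedy( String s ) { return s.replaceFirst( "a+", "X" ); }

   // reluctant: a+? consumes as few a's as it can (here, one)
   static String reluctant( String s ) { return s.replaceFirst( "a+?", "X" ); }

   public static void main( String[] args )
   {
      System.out.println( greedy( "aaa" ) );    // X
      System.out.println( reluctant( "aaa" ) ); // Xaa
      System.out.println( "12345".replaceFirst( "\\d{2,}", "#" ) );  // #
      System.out.println( "12345".replaceFirst( "\\d{2,}?", "#" ) ); // #345
   }
}
```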

The zip code (line 40 in Fig. 29.20) matches a digit five times. This regular expression uses the digit character class and a quantifier with the digit 5 between braces. The phone number (line 46 in Fig. 29.20) matches three digits (the first one cannot be zero) followed by a dash, followed by three more digits (again the first one cannot be zero), followed by another dash and four more digits.

String method matches checks whether an entire string conforms to a regular expression. For example, we want to accept "Smith" as a last name, but not "9@Smith#". If only a substring matches the regular expression, method matches returns false.

Replacing Substrings and Splitting Strings

Sometimes it is useful to replace parts of a string or to split a string into pieces. For this purpose, class String provides methods replaceAll, replaceFirst and split. These methods are demonstrated in Fig. 29.23.

Figure 29.23. Methods replaceFirst, replaceAll and split.


 1  // Fig. 29.23: RegexSubstitution.java
 2  // Using methods replaceFirst, replaceAll and split.
 3
 4  public class RegexSubstitution
 5  {
 6     public static void main( String args[] )
 7     {
 8        String firstString = "This sentence ends in 5 stars *****";
 9        String secondString = "1, 2, 3, 4, 5, 6, 7, 8";
10
11        System.out.printf( "Original String 1: %s\n", firstString );
12
13        // replace '*' with '^'
14        firstString = firstString.replaceAll( "\\*", "^" );
15
16        System.out.printf( "^ substituted for *: %s\n", firstString );
17
18        // replace 'stars' with 'carets'
19        firstString = firstString.replaceAll( "stars", "carets" );
20
21        System.out.printf(
22           "\"carets\" substituted for \"stars\": %s\n", firstString );
23
24        // replace words with 'word'
25        System.out.printf( "Every word replaced by \"word\": %s\n",
26           firstString.replaceAll( "\\w+", "word" ) );
27
28        System.out.printf( "Original String 2: %s\n", secondString );
29
30        // replace first three digits with 'digit'
31        for ( int i = 0; i < 3; i++ )
32           secondString = secondString.replaceFirst( "\\d", "digit" );
33
34        System.out.printf(
35           "First 3 digits replaced by \"digit\" : %s\n", secondString );
36        String output = "String split at commas: [";
37
38        String[] results = secondString.split( ",\\s*" ); // split on commas
39
40        for ( String string : results )
41           output += "\"" + string + "\", "; // output results
42
43        // remove the extra comma and add a bracket
44        output = output.substring( 0, output.length() - 2 ) + "]";
45        System.out.println( output );
46     } // end main
47  } // end class RegexSubstitution

Original String 1: This sentence ends in 5 stars *****
^ substituted for *: This sentence ends in 5 stars ^^^^^
"carets" substituted for "stars": This sentence ends in 5 carets ^^^^^
Every word replaced by "word": word word word word word word ^^^^^
Original String 2: 1, 2, 3, 4, 5, 6, 7, 8
First 3 digits replaced by "digit" : digit, digit, digit, 4, 5, 6, 7, 8
String split at commas: ["digit", "digit", "digit", "4", "5", "6", "7", "8"]

Method replaceAll replaces text in a string with new text (the second argument) wherever the original string matches a regular expression (the first argument). Line 14 replaces every instance of "*" in firstString with "^". Note that the regular expression ("\\*") precedes character * with two backslashes, \\. Normally, * is a quantifier indicating that a regular expression should match any number of occurrences of a preceding pattern. However, in line 14, we want to find all occurrences of the literal character *; to do this, we must escape character * with character \. By escaping a special regular-expression character with a \, we instruct the regular-expression matching engine to find the actual character, as opposed to what it represents in a regular expression. Since the expression is stored in a Java string and \ is a special character in Java strings, we must include an additional \. So the Java string "\\*" represents the regular-expression pattern \*, which matches a single * character in the search string. In line 19, every match for the regular expression "stars" in firstString is replaced with "carets".
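The need for the escape can be demonstrated by trying the pattern both ways. In the sketch below (hypothetical class and method names), the unescaped "*" is rejected with a PatternSyntaxException because a quantifier has nothing to quantify.

```java
import java.util.regex.PatternSyntaxException;

// Sketch: escaping a regular-expression metacharacter.
class EscapingMetacharacters
{
   // "\\*" in Java source is the regular expression \*, a literal asterisk
   static String starsToCarets( String s )
   {
      return s.replaceAll( "\\*", "^" );
   }

   // returns true if the unescaped pattern "*" is rejected as invalid
   static boolean unescapedStarIsRejected()
   {
      try
      {
         "a*b".replaceAll( "*", "^" );
         return false;
      }
      catch ( PatternSyntaxException e )
      {
         return true;
      }
   }

   public static void main( String[] args )
   {
      System.out.println( starsToCarets( "5 stars *****" ) ); // 5 stars ^^^^^
      System.out.println( unescapedStarIsRejected() );        // true
   }
}
```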

Method replaceFirst (line 32) replaces the first occurrence of a pattern match. Java Strings are immutable; therefore, method replaceFirst returns a new string in which the appropriate characters have been replaced. This line takes the original string and replaces it with the string returned by replaceFirst. By iterating three times, we replace the first three instances of a digit (\d) in secondString with the text "digit".

Method split divides a string into several substrings. The original string is broken in any location that matches a specified regular expression. Method split returns an array of strings containing the substrings between matches for the regular expression. In line 38, we use method split to tokenize a string of comma-separated integers. The argument is the regular expression that locates the delimiter. In this case, we use the regular expression ",\\s*" to separate the substrings wherever a comma occurs. By matching any whitespace characters that follow the comma, we eliminate extra spaces from the resulting substrings. Note that the commas and whitespace are not returned as part of the substrings. Again, note that the Java string ",\\s*" represents the regular expression ,\s*.
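A compact sketch of split with the same delimiter pattern follows. The class name, method name and sample input are illustrative assumptions.

```java
import java.util.Arrays;

// Sketch: splitting on a comma followed by any amount of whitespace.
class SplitTokens
{
   static String[] tokens( String csv )
   {
      return csv.split( ",\\s*" );
   }

   public static void main( String[] args )
   {
      // uneven spacing after the commas is absorbed by \s*
      System.out.println( Arrays.toString( tokens( "1, 2,3,   4" ) ) ); // [1, 2, 3, 4]
   }
}
```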

Classes Pattern and Matcher

In addition to the regular-expression capabilities of class String, Java provides other classes in package java.util.regex that help developers manipulate regular expressions. Class Pattern represents a regular expression. Class Matcher contains both a regular-expression pattern and a CharSequence in which to search for the pattern.

CharSequence is an interface that allows read access to a sequence of characters. The interface requires that the methods charAt, length, subSequence and toString be declared. Both String and StringBuffer implement interface CharSequence, so an instance of either of these classes can be used with class Matcher.

Common Programming Error 29.4

A regular expression can be tested against an object of any class that implements interface CharSequence, but the regular expression must be a String. Attempting to create a regular expression as a StringBuffer is an error.

If a regular expression will be used only once, static Pattern method matches can be used. This method takes a string that specifies the regular expression and a CharSequence on which to perform the match. This method returns a boolean indicating whether the search object (the second argument) matches the regular expression.
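A one-off match with static Pattern method matches might look like the following sketch. The helper name isAllDigits is hypothetical; the second argument can be any CharSequence, shown here with both a String and a StringBuffer.

```java
import java.util.regex.Pattern;

// Sketch: one-time matching with static method Pattern.matches.
class OneTimeMatch
{
   static boolean isAllDigits( CharSequence input )
   {
      return Pattern.matches( "\\d+", input );
   }

   public static void main( String[] args )
   {
      System.out.println( isAllDigits( "2024" ) );                     // true
      System.out.println( isAllDigits( new StringBuffer( "20x4" ) ) ); // false
   }
}
```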

If a regular expression will be used more than once, it is more efficient to use static Pattern method compile to create a specific Pattern object for that regular expression. This method receives a string representing the pattern and returns a new Pattern object, which can then be used to call method matcher. This method receives a CharSequence to search and returns a Matcher object.
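When the same expression is applied to many inputs, the Pattern can be compiled once and reused. The following is a sketch under that assumption; the class, constant and method names are hypothetical.

```java
import java.util.regex.Pattern;

// Sketch: compile a pattern once, then create a Matcher per input.
class CompiledPatternReuse
{
   // compiled a single time and shared by every call
   private static final Pattern ZIP = Pattern.compile( "\\d{5}" );

   static int countValid( String[] candidates )
   {
      int count = 0;

      for ( String candidate : candidates )
         if ( ZIP.matcher( candidate ).matches() )
            count++;

      return count;
   }

   public static void main( String[] args )
   {
      String[] inputs = { "12345", "1234", "abcde", "00000" };
      System.out.println( countValid( inputs ) ); // 2
   }
}
```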

Matcher provides method matches, which performs the same task as Pattern method matches, but receives no arguments; the search pattern and search object are encapsulated in the Matcher object. Class Matcher provides other methods, including find, lookingAt, replaceFirst and replaceAll.

Figure 29.24 presents a simple example that employs regular expressions. This program matches birthdays against a regular expression. The expression only matches birthdays that do not occur in April and that belong to people whose names begin with "J".

Lines 11-12 create a Pattern by invoking static Pattern method compile. The dot character "." in the regular expression (line 12) matches any single character except a newline character.

Figure 29.24. Regular expressions checking birthdays.

 1  // Fig. 29.24: RegexMatches.java
 2  // Demonstrating Classes Pattern and Matcher.
 3  import java.util.regex.Matcher;
 4  import java.util.regex.Pattern;
 5
 6  public class RegexMatches
 7  {
 8     public static void main( String args[] )
 9     {
10        // create regular expression
11        Pattern expression =
12           Pattern.compile( "J.*\\d[0-35-9]-\\d\\d-\\d\\d" );
13
14        String string1 = "Jane's Birthday is 05-12-75\n" +
15           "Dave's Birthday is 11-04-68\n" +
16           "John's Birthday is 04-28-73\n" +
17           "Joe's Birthday is 12-17-77";
18
19        // match regular expression to string and print matches
20        Matcher matcher = expression.matcher( string1 );
21
22        while ( matcher.find() )
23           System.out.println( matcher.group() );
24     } // end main
25  } // end class RegexMatches

Jane's Birthday is 05-12-75
Joe's Birthday is 12-17-77

Line 20 creates the Matcher object for the compiled regular expression and the matching sequence (string1). Lines 22-23 use a while loop to iterate through the string. Line 22 uses Matcher method find to attempt to match a piece of the search object to the search pattern. Each call to this method starts at the point where the last call ended, so multiple matches can be found. Matcher method lookingAt also matches a piece of the search object, except that it always starts from the beginning of the search object and succeeds only if the pattern matches there.

Common Programming Error 29.5

Method matches (from class String, Pattern or Matcher) will return true only if the entire search object matches the regular expression. Methods find and lookingAt (from class Matcher) will return true if a portion of the search object matches the regular expression.
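The three Matcher methods can be compared side by side. This sketch assumes a simple date pattern and hypothetical method names: whole uses matches, prefix uses lookingAt and anywhere uses find.

```java
import java.util.regex.Pattern;

// Sketch: Matcher methods matches, lookingAt and find on the same pattern.
class ThreeWaysToMatch
{
   private static final Pattern DATE = Pattern.compile( "\\d\\d-\\d\\d-\\d\\d" );

   static boolean whole( String s ) { return DATE.matcher( s ).matches(); }     // entire input
   static boolean prefix( String s ) { return DATE.matcher( s ).lookingAt(); }  // start of input
   static boolean anywhere( String s ) { return DATE.matcher( s ).find(); }     // any portion

   public static void main( String[] args )
   {
      String[] samples =
         { "05-12-75", "05-12-75 is Jane's birthday", "Jane: 05-12-75" };

      for ( String sample : samples )
         System.out.printf( "%-28s matches=%b lookingAt=%b find=%b%n",
            sample, whole( sample ), prefix( sample ), anywhere( sample ) );
   }
}
```

Only the first sample satisfies matches, the first two satisfy lookingAt, and all three satisfy find.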

Line 23 uses Matcher method group, which returns the string from the search object that matches the search pattern. The string that is returned is the one that was last matched by a call to find or lookingAt. The output in Fig. 29.24 shows the two matches that were found in string1.
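The find-and-group loop generalizes to collecting every match in a list. The helper allMatches below is a hypothetical utility written for illustration, not part of package java.util.regex.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: gather each successive match returned by find and group.
class CollectMatches
{
   static List< String > allMatches( String regex, CharSequence input )
   {
      List< String > matches = new ArrayList< String >();
      Matcher matcher = Pattern.compile( regex ).matcher( input );

      while ( matcher.find() )           // each call resumes after the previous match
         matches.add( matcher.group() ); // text of the most recent match

      return matches;
   }

   public static void main( String[] args )
   {
      System.out.println( allMatches( "\\d+", "a1b22c333" ) ); // [1, 22, 333]
   }
}
```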

Regular Expression Web Resources

This section presents several of Java's regular-expression capabilities. The following Web sites provide more information on regular expressions.

developer.java.sun.com/developer/technicalArticles/releases/1.4regex

Thoroughly describes Java's regular-expression capabilities.

java.sun.com/docs/books/tutorial/extra/regex/index.html

This tutorial explains how to use Java's regular-expression API.

java.sun.com/j2se/5.0/docs/api/java/util/regex/package-summary.html

This page is the javadoc overview of package java.util.regex.
