Using Regular Expressions
Overview
Regular expressions are not expressions that have a lot of fiber in their diet. Instead, a regular expression is a special type of pattern-matching string that can be very useful for programs that do string manipulation. Regular expression strings contain special pattern-matching characters that can be matched against another string to see if the other string fits the pattern. Regular expressions are very handy for doing complex data validation, for example, making sure users enter properly formatted phone numbers, e-mail addresses, or Social Security numbers.
Regular expressions are also useful for many other purposes, including searching text files to see if they contain certain patterns (can you say, Google?), filtering e-mail based on its contents, or performing complicated search-and-replace functions.
In this chapter, you find out the basics of using regular expressions. I emphasize validation, and focus on comparing strings entered by users against patterns specified by regular expressions to see if they match up. For more complex uses for regular expressions, you have to turn to a more extensive regular expression reference.
Warning: Regular expressions are actually constructed using a simple but powerful mini-language, so they are like little programs unto themselves. Unfortunately, this mini-language is terse (very terse), to the point of sometimes being downright arcane. Much of it depends on single characters that are packed with meaning that's often obscure. So be warned: the syntax for regular expressions takes a little getting used to. But once you get your mind around the basics, you'll find that simple regular expressions aren't that tough to create and can be very useful.
Also, be aware that this chapter only covers a portion of all you can do with regular expressions. If you find that you need to use more complicated patterns, you can find plenty of helpful information on the Internet. Just search any search engine for regular expression.
Tip: A regular expression is often called a regex. Most people pronounce that with a soft g, as if it were spelled rejex. And some pronounce it as if it were spelled rejects.
A Program for Experimenting with Regular Expressions
Before I get into the details of putting together regular expressions, Listing 3-1 presents a short program that can be a very useful tool while you're learning how to create regular expressions. This program lets you enter a regular expression. Then, you can enter a string, and the program tests it against the regular expression and lets you know whether or not the string matches the regex. The program then prompts you for another string to compare. You can keep entering strings to compare with the regex you've already entered. When you're done, just press the Enter key without entering a string. The program then asks if you want to enter another regular expression. If you answer Y, the whole process repeats. If you answer N, the program ends.
Here's a sample run of this program. For now, don't worry about the details of the regular expression string. Just note that it should match any three-letter word that begins with f, ends with r, and has an a, i, or o in the middle.
    Welcome to the Regex Tester

    Enter regex: f[aio]r
    Enter string: for
    Match.
    Enter string: fir
    Match.
    Enter string: fur
    Does not match.
    Enter string: fod
    Does not match.
    Enter string:
    Another? (Y or N) n
In this test, I entered the regular expression f[aio]r. Then I entered the string for. The program indicated that this string matched the expression and asked for another string. So I entered fir, which also matched. Then I entered fur and fod, which didn't match. I then entered a blank string, so the program asked if I wanted to test another regex. I entered n, so the program ended.
Tip: This program uses the Pattern and Matcher classes, which I don't explain until the end of the chapter. However, I suggest you use this program alongside this chapter. Regular expressions make a lot more sense if you actually try them out to see them in action. Plus, you can learn a lot by trying simple variations as you go. (You can always download the source code for this program from this book's Web site if you don't want to enter it yourself.)
In fact, I use portions of console output from this program throughout the rest of this chapter to illustrate regular expressions. There's no better way to see how regular expressions work than to see an expression and some samples of strings that match and don't match the expression.
Listing 3-1: The Regular Expression Test Program
    import java.util.regex.*;
    import java.util.Scanner;

    public final class Reg
    {
        static String r, s;
        static Pattern pattern;
        static Matcher matcher;
        static boolean match, validRegex, doneMatching;

        private static Scanner sc = new Scanner(System.in);

        public static void main(String[] args)
        {
            System.out.println("Welcome to the "
                + "Regex Tester\n");
            do
            {
                // Keep asking until the regex compiles.
                do
                {
                    System.out.print("\nEnter regex: ");
                    r = sc.nextLine();
                    validRegex = true;
                    try
                    {
                        pattern = Pattern.compile(r);
                    }
                    catch (Exception e)
                    {
                        System.out.println(e.getMessage());
                        validRegex = false;
                    }
                } while (!validRegex);

                // Test strings against the regex until a blank line is entered.
                doneMatching = false;
                while (!doneMatching)
                {
                    System.out.print("Enter string: ");
                    s = sc.nextLine();
                    if (s.length() == 0)
                        doneMatching = true;
                    else
                    {
                        matcher = pattern.matcher(s);
                        if (matcher.matches())
                            System.out.println("Match.");
                        else
                            System.out.println(
                                "Does not match.");
                    }
                }
            } while (askAgain());
        }

        private static boolean askAgain()
        {
            System.out.print("Another? (Y or N) ");
            String reply = sc.nextLine();
            if (reply.equalsIgnoreCase("Y"))
                return true;
            return false;
        }
    }
Basic Character Matching
Most regular expressions simply match characters to see if a string complies with a simple pattern. For example, you can check a string to see if it matches the format for Social Security numbers (xxx-xx-xxxx), phone numbers [(xxx) xxx-xxxx], or more complicated patterns such as e-mail addresses. (Well actually, Social Security and phone numbers are more complicated than you might think, too. More on that later.) In the following sections, you find out how to create regex patterns for basic character matching.
Matching single characters
The simplest regex patterns just match a string literal exactly. For example:
    Enter regex: abc
    Enter string: abc
    Match.
    Enter string: abcd
    Does not match.
Here, the pattern abc matches the string abc but not abcd.
Using predefined character classes
A character class represents a particular type of character rather than a specific character. Regular expressions let you use two types of character classes: predefined classes and custom classes. The predefined character classes are shown in Table 3-1.
Regex | Matches …
---|---
. | Any character
\d | Any digit (0–9)
\D | Any non-digit (anything other than 0–9)
\s | Any white-space character (space, tab, newline, carriage return, form feed, or vertical tab)
\S | Any character other than a white-space character
\w | Any word character (a–z, A–Z, 0–9, or an underscore)
\W | Any character other than a word character
The period is like a wildcard that matches any character. For example:
    Enter regex: c.t
    Enter string: cat
    Match.
    Enter string: cot
    Match.
    Enter string: cart
    Does not match.
Here c.t matches any three-letter string that starts with c and ends with t. In this example, the first two strings (cat and cot) match, but the third string (cart) doesn't because it's more than three characters.
The \d class represents a digit and is often used in regex patterns that validate input data. For example, here's a simple regex pattern that validates a U.S. Social Security number, which must be entered in the form xxx-xx-xxxx:
    Enter regex: \d\d\d-\d\d-\d\d\d\d
    Enter string: 779-54-3994
    Match.
    Enter string: 550-403-004
    Does not match.
Here the regex pattern specifies that the string must contain three digits, a hyphen, two digits, another hyphen, and four digits.
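As a rough sketch of how the same check might look inside a Java program, the following class wraps the pattern in a helper method. The class and method names (SsnFormatCheck, looksLikeSsn) are made up for illustration, and each backslash is doubled because the regex lives in a Java string literal, a quirk covered later in this chapter.

```java
public class SsnFormatCheck {
    // Same pattern as the console run above. Each regex backslash is
    // doubled because it sits inside a Java string literal.
    static final String SSN_REGEX = "\\d\\d\\d-\\d\\d-\\d\\d\\d\\d";

    // Hypothetical helper: true if the whole input has the form xxx-xx-xxxx.
    public static boolean looksLikeSsn(String input) {
        return input.matches(SSN_REGEX);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSsn("779-54-3994")); // true
        System.out.println(looksLikeSsn("550-403-004")); // false
    }
}
```

This checks only the shape of the input, not whether the number was ever issued.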
Technical Stuff: Note that this regex pattern isn't really enough to validate real Social Security numbers, because the government places more restrictions on these numbers than just the pattern xxx-xx-xxxx. For example, no Social Security number can begin with 779. Thus, the number 779-54-3994 entered in the preceding example isn't really a valid Social Security number.
Note that the \d class has a counterpart: \D. The \D class matches any character that is not a digit. For example, here's a first attempt at a regex for validating droid names:
    Enter regex: \D\d-\D\d
    Enter string: R2-D2
    Match.
    Enter string: C3-P0
    Match.
    Enter string: C-3PO
    Does not match.
Here the pattern matches strings that begin with a character that isn't a digit, followed by a character that is a digit, followed by a hyphen, followed by another non-digit character, and ending with a digit. Thus, R2-D2 and C3-P0 match. Unfortunately, this regex is far from perfect, as any Star Wars fan can tell you. That's because the proper spelling of the shiny gold protocol droid's name is C-3PO, not C3-P0. Typical.
The \s class matches white-space characters, including spaces, tabs, newlines, and carriage returns. This class is useful when you want to allow the user to separate parts of a string in various ways. For example:
    Enter regex: ...\s...
    Enter string: abc def
    Match.
    Enter string: abc def
    Match.
Here the pattern specifies that the string can be two groups of any three characters separated by one white-space character. In the first string that's entered, the groups are separated by a space. In the second string, they're separated by a tab. The \s class also has a counterpart: \S. It matches any character that isn't a white-space character.
Tip: If you want to limit white-space characters to actual spaces, just use a space in the regex. For example:
    Enter regex: ... ...
    Enter string: abc def
    Match.
    Enter string: abc def
    Does not match.
Here the regex specifies two groups of any character separated by a space. The first input string matches this pattern, but the second does not because the groups are separated by a tab.
The last set of predefined classes is \w and \W. The \w class identifies any character that's normally used in words. That includes upper- and lowercase letters, digits, and the underscore. The \W class matches any character that isn't a word character. An example shows how all that looks:
    Enter regex: \w\w\w\W\w\w\w
    Enter string: abc def
    Match.
    Enter string: 123 456
    Match.
    Enter string: 123A456
    Does not match.
Here the pattern calls for two groups of three word characters separated by a non-word character.
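To see the \s and \w classes at work outside the tester program, here is a small sketch that uses the matches method of the String class. The class and method names are hypothetical, and the tab is written as \t in the test strings so the difference from a plain space is visible.

```java
public class ClassShorthandDemo {
    // Two groups of any three characters separated by one white-space character.
    public static boolean twoGroupsAnyWhitespace(String s) {
        return s.matches("...\\s...");
    }

    // Same idea, but only a literal space is accepted as the separator.
    public static boolean twoGroupsSpaceOnly(String s) {
        return s.matches("... ...");
    }

    // Three word characters, one non-word character, three word characters.
    public static boolean wordGroupsWithSeparator(String s) {
        return s.matches("\\w\\w\\w\\W\\w\\w\\w");
    }

    public static void main(String[] args) {
        System.out.println(twoGroupsAnyWhitespace("abc\tdef"));  // true
        System.out.println(twoGroupsSpaceOnly("abc\tdef"));      // false
        System.out.println(wordGroupsWithSeparator("123 456"));  // true
        System.out.println(wordGroupsWithSeparator("123A456"));  // false
    }
}
```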
Technical Stuff: Isn't it strange that underscores are considered to be word characters? I don't know of too many words in the English language (or any other language, for that matter) that have underscores in them. I guess that's the computer-nerd origins of regular expressions showing through.
Using custom character classes
To create a custom character class, you simply list all the characters that you want included in the class within a set of brackets. Here's an example:
    Enter regex: b[aeiou]t
    Enter string: bat
    Match.
    Enter string: bet
    Match.
    Enter string: bit
    Match.
    Enter string: bot
    Match.
    Enter string: but
    Match.
    Enter string: bmt
    Does not match.
Here the pattern specifies that the string must start with the letter b, followed by a class that can include a, e, i, o, or u, followed by t. In other words, it accepts three-letter words that begin with b, end with t, and have a vowel in the middle.
If you want to let the pattern include uppercase letters as well as lowercase letters, you have to list them both:
    Enter regex: b[aAeEiIoOuU]t
    Enter string: bat
    Match.
    Enter string: BAT
    Does not match.
    Enter string: bAt
    Match.
You can use as many custom classes in a pattern as you want. For example, here's one that defines classes for the first and last characters so they too can be upper- or lowercase:
    Enter regex: [bB][aAeEiIoOuU][tT]
    Enter string: bat
    Match.
    Enter string: BAT
    Match.
This pattern specifies three character classes. The first can be b or B, the second can be any upper-or lowercase vowel, and the third can be t or T.
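The following sketch shows one way the three-class pattern could be used from Java code. Custom classes contain no backslashes, so the regex can be quoted as is. The names VowelWordCheck and isBVowelT are invented for this example.

```java
public class VowelWordCheck {
    // b or B, then any upper- or lowercase vowel, then t or T.
    static final String REGEX = "[bB][aAeEiIoOuU][tT]";

    // Hypothetical helper: true if the whole word fits the pattern.
    public static boolean isBVowelT(String word) {
        return word.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isBVowelT("bat")); // true
        System.out.println(isBVowelT("BAT")); // true
        System.out.println(isBVowelT("bmt")); // false
    }
}
```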
Using ranges
Custom character classes can also specify ranges of letters and numbers. For example:
    Enter regex: [a-z][0-5]
    Enter string: r2
    Match.
    Enter string: b9
    Does not match.
Here the string must be exactly two characters long. The first must be a character from a through z, and the second must be from 0 through 5.
You can also use more than one range in a class, like this:
    Enter regex: [a-zA-Z][0-5]
    Enter string: r2
    Match.
    Enter string: R2
    Match.
Here the first character can be lowercase a through z or uppercase A through Z.
Tip: You can use ranges to build a class that accepts only letters and digits (as opposed to the \w class, which also allows underscores):
    Enter regex: [a-zA-Z0-9]
    Enter string: a
    Match.
    Enter string: N
    Match.
    Enter string: 9
    Match.
Using negation
Regular expressions can include classes that match any character but the ones listed for the class. To do that, you start the class with a caret, like this:
    Enter regex: [^cf]at
    Enter string: bat
    Match.
    Enter string: cat
    Does not match.
    Enter string: fat
    Does not match.
Here the string must be a three-letter word that ends in at, but isn't fat or cat.
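Ranges and negated classes work the same way when called from Java. The sketch below reuses the two patterns from the console runs above. The helper names are hypothetical.

```java
public class RangeAndNegationDemo {
    // One letter (either case) followed by one digit from 0 through 5.
    public static boolean letterThenLowDigit(String s) {
        return s.matches("[a-zA-Z][0-5]");
    }

    // Any single character except c or f, followed by "at".
    public static boolean endsInAtButNotCatOrFat(String s) {
        return s.matches("[^cf]at");
    }

    public static void main(String[] args) {
        System.out.println(letterThenLowDigit("R2"));          // true
        System.out.println(letterThenLowDigit("b9"));          // false
        System.out.println(endsInAtButNotCatOrFat("bat"));     // true
        System.out.println(endsInAtButNotCatOrFat("cat"));     // false
    }
}
```

Keep in mind that a negated class accepts anything not listed, so [^cf]at also matches strings such as 3at or ?at.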
Matching multiple characters
The regex patterns described so far in this chapter require that each position in the input string always match a specific character class. For example, the pattern \d\W[a-z] requires a digit in the first position, a non-word character in the second position, and one of the letters a through z in the third position. These are pretty rigid requirements.
To create more flexible patterns, you can use any of the quantifiers listed in Table 3-2. These quantifiers let you create patterns that match a variable number of characters at a certain position in the string.
Regex | Matches the Preceding Element …
---|---
? | Zero or one time
* | Zero or more times
+ | One or more times
{n} | Exactly n times
{n,} | At least n times
{n,m} | At least n times but no more than m times
To use a quantifier, you code it immediately after the element you want it to apply to. For example, here's a version of the Social Security number pattern that uses quantifiers:
    Enter regex: \d{3}-\d{2}-\d{4}
    Enter string: 779-48-9955
    Match.
    Enter string: 483-488-9944
    Does not match.
Here the pattern matches three digits, followed by a hyphen, followed by two digits, followed by another hyphen, followed by four digits.
Tip: Simply duplicating elements rather than using a quantifier is just as easy, if not easier. For example, \d\d is just as clear as \d{2}.
The ? quantifier lets you create an optional element that may or may not be present in the string. For example, suppose you want to allow the user to enter Social Security numbers without the hyphens. Then, you could use this pattern:
    Enter regex: \d{3}-?\d{2}-?\d{4}
    Enter string: 779-48-9955
    Match.
    Enter string: 779489955
    Match.
    Enter string: 779-489955
    Match.
    Enter string: 77948995
    Does not match.
The question marks indicate that the hyphens are optional. Notice that this pattern lets you include or omit either hyphen. The last string entered doesn't match because it has only eight digits, and the pattern requires nine.
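Here is a minimal sketch of the optional-hyphen pattern in Java, assuming a helper named looksLikeSsn in a made-up class FlexibleSsnCheck. As before, the regex backslashes are doubled inside the string literal.

```java
public class FlexibleSsnCheck {
    // Three digits, optional hyphen, two digits, optional hyphen, four digits.
    static final String REGEX = "\\d{3}-?\\d{2}-?\\d{4}";

    // Hypothetical helper: true if the whole input fits the pattern.
    public static boolean looksLikeSsn(String input) {
        return input.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(looksLikeSsn("779-48-9955")); // true
        System.out.println(looksLikeSsn("779489955"));   // true
        System.out.println(looksLikeSsn("779-489955"));  // true
        System.out.println(looksLikeSsn("77948995"));    // false: only eight digits
    }
}
```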
Using escapes
In regular expressions, certain characters have special meaning. This leads to the question, what if you want to search for one of those special characters? In that case, you escape the character by preceding it with a backslash. Here's an example:
    Enter regex: \(\d{3}\) \d{3}-\d{4}
    Enter string: (559) 555-1234
    Match.
    Enter string: 559 555-1234
    Does not match.
Here, \( represents a left parenthesis, and \) represents a right parenthesis. Without the backslashes, the regular expression treats the parentheses as a grouping element.
Here are a few additional points to ponder about escapes:
- Strictly speaking, you need to use the backslash escape only for characters that have special meanings in regular expressions. However, I recommend you escape any punctuation character or symbol, just to be sure.
- You can't escape alphabetic characters (letters). That's because a backslash followed by certain alphabetic characters represents a character, a class, or some other regex element.
- To escape a backslash, code two backslashes in a row. For example, the regex \d\d\\\d\d accepts strings made up of two digits followed by a backslash and two more digits, such as 23\88 and 95\55.
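The escapes above get a little hairy in Java source, because every regex backslash has to be doubled in a string literal (a point explained later in this chapter). As a sketch, with invented class and method names, the two escaped patterns from this section might be coded like this:

```java
public class EscapeDemo {
    // Regex \(\d{3}\) \d{3}-\d{4}: literal parentheses around the area code.
    public static boolean isParenPhone(String s) {
        return s.matches("\\(\\d{3}\\) \\d{3}-\\d{4}");
    }

    // Regex \d\d\\\d\d: two digits, a literal backslash, two digits.
    // The regex's escaped backslash (\\) becomes four backslashes in Java.
    public static boolean isDigitsBackslashDigits(String s) {
        return s.matches("\\d\\d\\\\\\d\\d");
    }

    public static void main(String[] args) {
        System.out.println(isParenPhone("(559) 555-1234"));      // true
        System.out.println(isParenPhone("559 555-1234"));        // false
        System.out.println(isDigitsBackslashDigits("23\\88"));   // true: the string is 23\88
        System.out.println(isDigitsBackslashDigits("2388"));     // false
    }
}
```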
Using parentheses to group characters
You can use parentheses to create groups of characters to apply other regex elements to. For example:
    Enter regex: (bla)+
    Enter string: bla
    Match.
    Enter string: blabla
    Match.
    Enter string: blablabla
    Match.
    Enter string: bla bla bla
    Does not match.
Here the parentheses treat bla as a group, so the + quantifier applies to the entire sequence. Thus, this pattern looks for one or more occurrences of the sequence bla.
Here's an example that finds U.S. phone numbers that can have an optional area code:
    Enter regex: (\(\d{3}\)\s?)?\d{3}-\d{4}
    Enter string: 555-1234
    Match.
    Enter string: (559) 555-1234
    Match.
    Enter string: (559)555-1239
    Match.
This regex pattern is a little complicated, but if you examine it element by element, you should be able to figure it out. It starts with a group that indicates the optional area code: (\(\d{3}\)\s?)?. This group begins with the left parenthesis, which marks the start of the group. The characters in the group consist of an escaped left parenthesis, three digits, an escaped right parenthesis, and an optional white-space character. Then a right parenthesis closes the group, and the question mark indicates that the entire group is optional. The rest of the regex pattern looks for three digits followed by a hyphen and four more digits.
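The two grouping patterns above can be tried from Java with a sketch like the following. The class and method names are assumptions made for this example.

```java
public class GroupingDemo {
    // One or more repetitions of the sequence "bla".
    public static boolean isBlaBla(String s) {
        return s.matches("(bla)+");
    }

    // Optional "(xxx)" area code with an optional white-space character,
    // then xxx-xxxx.
    public static boolean isPhone(String s) {
        return s.matches("(\\(\\d{3}\\)\\s?)?\\d{3}-\\d{4}");
    }

    public static void main(String[] args) {
        System.out.println(isBlaBla("blablabla"));     // true
        System.out.println(isBlaBla("bla bla bla"));   // false
        System.out.println(isPhone("555-1234"));       // true
        System.out.println(isPhone("(559)555-1239"));  // true
        System.out.println(isPhone("559 555-1234"));   // false: area code needs parentheses
    }
}
```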
When you mark a group of characters with parentheses, the text that matches that group is captured so you can use it later in the pattern. The groups that are captured are called capture groups and are numbered beginning with 1. You can then use a backslash followed by the capture group number to indicate that the text must match the text that was captured for the specified capture group.
For example, suppose that droids named following the pattern \w\d-\w\d must have the same digit in the second and fifth characters. In other words, r2-d2 and b9-k9 are valid droid names, but r2-d4 and d3-r4 are not.
Here's an example that can validate that type of name:
    Enter regex: \w(\d)-\w\1
    Enter string: r2-d2
    Match.
    Enter string: d3-r4
    Does not match.
    Enter string: b9-k9
    Match.
Here \1 refers to the first capture group. Thus, the last character in the string must be the same as the second character, which must be a digit.
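A back-reference works the same way from Java, as long as you remember to double its backslash in the string literal. The following is only a sketch, with an invented class name and method name:

```java
public class MatchingDigitDroid {
    // Regex \w(\d)-\w\1: the digit captured by group 1 must reappear
    // as the last character.
    static final String REGEX = "\\w(\\d)-\\w\\1";

    // Hypothetical helper: true if both digits in the name are the same.
    public static boolean hasMatchingDigits(String droid) {
        return droid.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(hasMatchingDigits("r2-d2")); // true
        System.out.println(hasMatchingDigits("d3-r4")); // false
        System.out.println(hasMatchingDigits("b9-k9")); // true
    }
}
```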
Using the | symbol
The | symbol defines an or operation, which lets you create patterns that accept any of two or more variations. For example, here's an improvement to the pattern for validating droid names:
    Enter regex: (\w\d-\w\d)|(\w-\d\w\w)
    Enter string: r2-d2
    Match.
    Enter string: c-3po
    Match.
Here the | character indicates that either the group on the left or the group on the right can be used to match the string. The group on the left matches a word character, a digit, a hyphen, a word character, and another digit. The group on the right matches a word character, a hyphen, a digit, and two word characters.
You may want to use an additional set of parentheses around the entire part of the pattern that the | applies to. Then you can add additional pattern elements before or after the | groups. For example, what if you want to let a user enter the area code for a phone number with or without parentheses? Here's a regex pattern that does the trick:
    Enter regex: ((\d{3} )|(\(\d{3}\) ))?\d{3}-\d{4}
    Enter string: (559) 555-1234
    Match.
    Enter string: 559 555-1234
    Match.
    Enter string: 555-1234
    Match.
The first part of this pattern is a group that consists of two smaller groups separated by an | character. The first of these groups matches an area code without parentheses followed by a space, and the second matches an area code with parentheses followed by a space. So the outer group matches an area code with or without parentheses. This entire group is marked with a question mark as optional, and then the pattern continues with three digits, a hyphen, and four digits.
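As a sketch of the same alternation in Java (the class and method names are invented, and the backslashes are doubled for the string literal):

```java
public class AreaCodeAlternation {
    // Optional area code, either "xxx " or "(xxx) ", followed by xxx-xxxx.
    static final String REGEX =
        "((\\d{3} )|(\\(\\d{3}\\) ))?\\d{3}-\\d{4}";

    // Hypothetical helper: true if the whole input fits the pattern.
    public static boolean isPhone(String s) {
        return s.matches(REGEX);
    }

    public static void main(String[] args) {
        System.out.println(isPhone("(559) 555-1234")); // true
        System.out.println(isPhone("559 555-1234"));   // true
        System.out.println(isPhone("555-1234"));       // true
        System.out.println(isPhone("(559 555-1234"));  // false: unbalanced parenthesis
    }
}
```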
Using Regular Expressions in Java Programs
So far, this chapter has shown you the basics of creating regular expressions. Now, the following sections show you how to put them to use in Java programs.
The String problem
Before getting into the classes for working with regular expressions, I want to clue you in about a problem that Java has when dealing with strings that contain regular expressions. As you've seen throughout this chapter, regex patterns rely on the backslash character to mark different elements of a pattern. The bad news is that Java treats the backslash character in a string literal as an escape character. Thus, you can't just quote regular expressions in string literals, because Java steals the backslash characters before they get to the regular expression classes.
In most cases, the compiler simply complains that the string literal is not correct. For example, the following line won't compile:
    String regex = "\w\d-\w\d"; // error: won't compile
The compiler sees the backslashes in the string and expects to find a valid Java escape sequence, not a regular expression.
Unfortunately, the solution to this problem is ugly: You have to double the backslashes wherever they occur. Java treats two backslashes in a row as an escaped backslash, and places a single backslash in the string. Thus, you have to code the statement shown in the previous paragraph like this:
    String regex = "\\w\\d-\\w\\d"; // now it will compile
Here, each backslash I want in the regular expression is coded as a pair of backslashes in the string literal.
Tip: If you're in doubt about whether you're coding your string literals right, just use System.out.println to print the resulting string. Then you can check the console output to make sure you wrote the string literal right. For example, if I followed the previous statement with System.out.println(regex), the following output would appear on the console:
    \w\d-\w\d
Thus, I know I coded the string literal for the regular expression correctly.
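The println check can be packaged as a tiny program. This is just a sketch with a made-up class name. It prints the string that the regex classes actually receive, along with its length, so you can confirm that each doubled backslash in the source turned into a single backslash.

```java
public class BackslashDemo {
    // Hypothetical helper that returns the droid regex as Java sees it.
    public static String droidRegex() {
        return "\\w\\d-\\w\\d";
    }

    public static void main(String[] args) {
        String regex = droidRegex();
        System.out.println(regex);          // prints \w\d-\w\d
        System.out.println(regex.length()); // prints 9, not 13
    }
}
```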
Using regular expressions with the String class
If all you want to do with a regular expression is check whether a string matches a pattern, you can use the matches method of the String class. This method accepts a regular expression as a parameter and returns a boolean that indicates whether the string matches the pattern.
For example, here's a static method that validates droid names:
    private static boolean validDroidName(String droid)
    {
        String regex = "(\\w\\d-\\w\\d)|(\\w-\\d\\w\\w)";
        return droid.matches(regex);
    }
Here the name of the droid is passed via a parameter, and the method returns a boolean that indicates whether the droid's name is valid. The method simply creates a regular expression from a string literal, and then uses the matches method of the droid string to match the pattern.
You can also use the split method to split a string into an array of String objects based on delimiters that match a regular expression. One common way to do that is to simply create a custom class of characters that can be used for delimiters. For example:
    String s = "One:Two;Three|Four\tFive";
    String regex = "[:;|\\t]";
    String strings[] = s.split(regex);
    for (String word : strings)
        System.out.println(word);
Here a string is split into words marked by colons, semicolons, vertical bars, or tab characters. When you run this program, here's what's displayed on the console:
    One
    Two
    Three
    Four
    Five
Using the Pattern and Matcher classes
The matches method is fine for occasional use of regular expressions. But if you want your program to do a lot of pattern matching, you should use the Pattern and Matcher classes instead. The Pattern class represents a regular expression that has been compiled into executable form (remember, regular expressions are like little programs). Then you can use the compiled Pattern object to create a Matcher object, which you can then use to match strings.
The Pattern class itself is pretty simple. Although it has about ten methods, you usually use just these two:
- static Pattern compile(String pattern): Compiles the specified pattern. This static method returns a Pattern object. It throws PatternSyntaxException if the pattern contains an error.
- Matcher matcher(String input): Creates a Matcher object to match this pattern against the specified string.
First, you use the compile method to create a Pattern object. (Pattern is one of those weird classes that doesn't have constructors. Instead, it relies on the static compile method to create instances.) Because the compile method throws PatternSyntaxException when the pattern contains an error, you should use a try/catch statement to catch this exception whenever the pattern comes from user input or some other source you can't vouch for. PatternSyntaxException is an unchecked exception, so the compiler doesn't force you to catch it when the pattern is a string literal you already know is correct.
After you have a Pattern instance, you then use the matcher method to create an instance of the Matcher class. This class has more than 30 methods that let you do all sorts of things with regular expressions that aren't covered in this chapter, such as finding multiple occurrences of a pattern in an input string or replacing text that matches a pattern with a replacement string. For purposes of this book, I'm concerned only with the matches method: boolean matches() returns a boolean that indicates whether the entire string matches the pattern.
To illustrate how to use these methods, here's an enhanced version of the validDroidName method that creates a pattern for the droid validation regex and saves it in a static class field:
    private static Pattern droidPattern;

    private static boolean validDroidName(String droid)
    {
        if (droidPattern == null)
        {
            String regex = "(\\w\\d-\\w\\d)|"
                + "(\\w-\\d\\w\\w)";
            droidPattern = Pattern.compile(regex);
        }
        Matcher m = droidPattern.matcher(droid);
        return m.matches();
    }
Here the private class field droidPattern saves the compiled pattern for validating droids. The if statement in the validDroidName method checks whether the pattern has already been created. If not, the pattern is created by calling the static compile method of the Pattern class. Then the matcher method is used to create a Matcher object for the string passed as a parameter, and the string is validated by calling the matches method of the Matcher object.
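When the pattern comes from somewhere you can't vouch for, such as user input, catching PatternSyntaxException is worth the trouble. The following sketch assumes a hypothetical helper named compileOrNull that returns null for a malformed pattern. That return convention is just one possible design, not something the Pattern class prescribes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class SafeCompile {
    // Hypothetical helper: returns the compiled pattern, or null if the
    // regex text is malformed.
    public static Pattern compileOrNull(String regex) {
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            System.out.println("Bad regex: " + e.getDescription());
            return null;
        }
    }

    public static void main(String[] args) {
        // A well-formed pattern compiles and can be reused for many strings.
        Pattern good = compileOrNull("f[aio]r");
        Matcher m = good.matcher("for");
        System.out.println(m.matches());                      // true
        System.out.println(good.matcher("fur").matches());    // false

        // The missing right bracket makes this pattern invalid.
        Pattern bad = compileOrNull("f[aio");
        System.out.println(bad == null);                      // true
    }
}
```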