Mastering Regular Expressions

4.7. Summary

If you understood everything in this chapter the first time you read it, you probably didn't need to read it in the first place. It's heady stuff, to say the least. It took me quite a while to understand it, and then longer still to understand it. I hope this one concise presentation makes it easier for you. I've tried to keep the explanation simple without falling into the trap of oversimplification (an unfortunately all-too-common occurrence which hinders true understanding). This chapter has a lot in it, so I've included a lot of page references in the following summary, for when you'd like to quickly check back on something.

There are two underlying technologies commonly used to implement a regex match engine, " regex-directed NFA" (˜ 153) and "text-directed DFA" (˜ 155). The abbreviations are spelled out on page 156.

Combine the two technologies with the POSIX standard (˜ 178), and for practical purposes, there are three types of engines:

  • Traditional NFA (gas-guzzling, power-on-demand)

  • POSIX NFA (gas-guzzling, standard-compliant)

  • DFA (POSIX or not) (electric, steady-as-she-goes)

To get the most out of a utility, you need to understand which type of engine it uses, and craft your regular expressions appropriately. The most common type is the Traditional NFA, followed by the DFA. Table 4-1 (˜ 145) lists a few common tools and their engine types, and the section "Testing the Engine Type" (˜ 146) shows how you can test the type yourself.

One overriding rule regardless of engine type: matches starting sooner take precedence over matches starting later. This is due to how the engine's "transmission" tests the regex at each point in the string (˜ 148).

For the match attempt starting at any given spot:

DFA Text-Directed Engines

Find the longest possible match, period. That's it. End of discussion (˜ 177). Consistent, very fast (˜ 179), and boring to talk about.

NFA Regex-Directed Engines

Must "work through" a match. The soul of NFA matching is backtracking (˜ 157, 162). The metacharacters control the match: the standard quantifiers (star and friends ) are greedy (˜ 151), while others may be lazy or possessive (˜ 169). Alternation is ordered (˜ 174) in a traditional NFA, but greedy with a POSIX NFA.

POSIX NFA Must find the longest match, period. But, it's not boring, as you must worry about efficiency (the subject of Chapter 6).

Traditional NFA Is the most expressive type of regex engine, since you can use the regex-directed nature of the engine to craft exactly the match you want.

Understanding the concepts and practices covered in this chapter is the foundation for writing correct and efficient regular expressions, which just happens to be the subject of the next two chapters.

Категории