Unix for Mac OS X 10.4 Tiger: Visual QuickPro Guide (2nd Edition)

Searching for Text Inside Files

When using Unix, you will often want to search for specific words or strings of characters inside text files, or to search the long output of some command. The main Unix command for this is grep.

Using grep

To search for a string in a text file:
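
A minimal sketch of this task, assuming a hypothetical sample file named recipes.txt created in a scratch directory. The search string goes first and the file to search goes second:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Apple pie\nbanana bread\napple sauce\n' > "$demo/recipes.txt"

# Print every line of recipes.txt that contains the string "apple".
# The search is case sensitive, so "Apple pie" is not printed.
grep 'apple' "$demo/recipes.txt"
```

Only the line "apple sauce" is printed here, because the capitalized "Apple" does not match the lowercase search string.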

To make the search case insensitive:
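
A minimal sketch, assuming the same kind of hypothetical sample file. The -i option (see Table 4.1) tells grep to ignore case:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Apple pie\nbanana bread\napple sauce\n' > "$demo/recipes.txt"

# -i ignores case, so both "Apple pie" and "apple sauce" are printed.
grep -i 'apple' "$demo/recipes.txt"
```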

Not All greps Are the Same

We cover the version of grep that comes with Darwin/Mac OS X. Different flavors of Unix come with different versions of the grep family (grep, egrep, fgrep, agrep), so the exact behavior of each command will vary slightly depending on the version installed on your system. The best way to see the differences is to read the Unix man pages for each command.

To search for a string in multiple files:
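
A minimal sketch, assuming two hypothetical files named a.txt and b.txt. List every file you want searched after the search string; when grep is given more than one file, it prefixes each matching line with the name of the file it came from:

```shell
# Scratch directory and sample files (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'apple sauce\nbanana bread\n' > "$demo/a.txt"
printf 'cherry tart\napple pie\n' > "$demo/b.txt"

# Search both files; each matching line is shown as filename:line.
grep 'apple' "$demo/a.txt" "$demo/b.txt"
```

A shell wildcard such as "$demo"/*.txt could be used in place of the explicit list of filenames.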

Where grep Gets Its Name

grep gets its name from g/RE/p, which is a representation of the commands in the old Unix editor ed to "globally search for a regular expression and print."

Regular expressions make up a complex and powerful system for matching patterns and are available in many Unix programs. We'll cover a small part of regular expressions in this chapter.

The grep program is the main Unix command for searching text files or a stream of text (such as the output of another program). The output of grep is every line that contains the search string. The default is for searches to be case sensitive.

To recursively search all the files in a directory:
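
A minimal sketch, assuming a hypothetical directory named notes that contains a subdirectory. The -r option (see Table 4.1) makes grep descend into the directory and search every file beneath it:

```shell
# Scratch directory tree (hypothetical names, for illustration only).
demo=$(mktemp -d)
mkdir -p "$demo/notes/old"
printf 'apple sauce\n' > "$demo/notes/today.txt"
printf 'apple pie\ncherry tart\n' > "$demo/notes/old/last-year.txt"

# -r searches notes and everything below it, including notes/old.
grep -r 'apple' "$demo/notes"
```

Both files contain a matching line, so two lines are printed, each prefixed with the path of its file. The order in which files are visited is not guaranteed.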

To find the lines that do not match:
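
A minimal sketch, assuming a hypothetical sample file. The -v option (see Table 4.1) inverts the search so that only the non-matching lines are printed:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Apple pie\nbanana bread\napple sauce\n' > "$demo/recipes.txt"

# -v prints the lines that do NOT contain "apple".
# The search is still case sensitive, so "Apple pie" counts as a non-match.
grep -v 'apple' "$demo/recipes.txt"
```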

To search the output of another command:
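
A minimal sketch, assuming a hypothetical directory holding three files. Pipe the output of the first command into grep and omit the filename argument; grep then searches what it receives on standard input:

```shell
# Scratch directory with a few empty files (hypothetical names, for illustration only).
demo=$(mktemp -d)
touch "$demo/notes.txt" "$demo/photo.jpg" "$demo/report.txt"

# ls lists the directory; grep keeps only the lines containing "txt".
ls "$demo" | grep 'txt'
```

The same approach works with any command that produces lines of text.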

The grep and egrep (e for extended) programs have a huge number of options. Table 4.1 lists some of the more common ones. See the Unix manual for more (type man grep).

Table 4.1. Options for grep and egrep

OPTION     MEANING
-i         Ignore case.
-v         Show only lines that do not match.
-n         Add line numbers.
-l         Show only the names of files in which matches were found.
-L         Show only the names of files without a match. [*]
-r         Recursively search directories. [*]
-d skip    Skip arguments that are directories. [*]

[*] These options are not as common as the others, but Mac OS X does have them.
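
A minimal sketch of three options from Table 4.1 that have no task of their own in this section, assuming two hypothetical sample files:

```shell
# Scratch directory and sample files (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'cherry tart\napple pie\n' > "$demo/a.txt"
printf 'banana bread\n' > "$demo/b.txt"

# -n adds the line number in front of each matching line.
grep -n 'apple' "$demo/a.txt"

# -l prints only the names of files that contain a match (a.txt here).
grep -l 'apple' "$demo/a.txt" "$demo/b.txt"

# -L prints only the names of files that contain no match (b.txt here).
grep -L 'apple' "$demo/a.txt" "$demo/b.txt"
```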

Using patterns in your search

You will often want to search for something more complicated than a literal string of characters. You might want to search only for lines that begin with a certain string, or for lines that contain a range of dates, such as Feb 15 or Feb 16.

The egrep command supports an extremely powerful (and complex) pattern-matching system called regular expressions. The re in egrep stands for regular expression. (The grep command also supports regular expressions, but only a more limited set of their features. To avoid switching back and forth, we will stick with egrep here.)

Regular expressions are used in a large number of situations in Unix, not only with the grep and egrep commands. For example, the Unix programs sed, awk, and vi all use regular expressions, as do the C, Perl, Tcl, Python, and Java programming languages. The basic syntax of regular expressions is the same or very similar in a variety of situations, so once you learn how to use them in one area, you have a head start on using them in another.

Regular expressions are built up like mathematical formulas (see the sidebar "Learning More About Regular Expressions"). You can do a lot with the few rules we'll show you here.

An important concept to grasp in using regular expressions is that when you search for "hello," you are really searching for a pattern consisting of five atoms (h, e, l, l, and o). In regular expressions, an atom is a part of the overall expression that matches one character. The most common kind of atom is simply a literal character, so the atom h matches the letter h. But atoms don't stop there. For example, the atom [a-d] matches one letter from the range a, b, c, or d. (The [ and ] are used to define a set of characters.) So when you see the word atom used in the examples below, keep in mind that an atom can be as simple as one character, or it can be a more complex notation that matches one character from a list of possibilities.

Regular expressions have a few major rules, which are demonstrated in the examples below.

Also, note that you always enclose the pattern inside single quotes; this is to prevent the shell from misinterpreting any of the characters used in regexes (as they're traditionally called) that have special meaning to the shell: [] {} . *

Compare with Aqua

Mac OS X provides a nice graphical interface for finding strings of text in files on your Mac, and in some ways it can do a better job than grep. Spotlight (introduced in Mac OS X 10.4) is a powerful and elegantly designed graphical interface for searching your hard disk. Spotlight searches the names and contents of files and presents the results grouped by types of files. It also builds indexes of files, which speeds up searching. And, of course, it is mostly a point-and-click interface.

In practice, grep is often easier to use than Spotlight (once you get used to the command line); for example, it is rather hard in Spotlight to focus your search on a few specific files, whereas with grep you can supply any arbitrary list of files as arguments on the command line.

To find lines starting with a specific string of characters:
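
A minimal sketch, assuming a hypothetical sample file. The ^ anchor at the start of the pattern means "look at the beginning of the line." This pattern needs no extended features, so plain grep is used:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Hello world\nSay Hello\nhello again\n' > "$demo/greetings.txt"

# ^ anchors the match to the start of the line,
# so only "Hello world" is printed.
grep '^Hello' "$demo/greetings.txt"
```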

To find lines ending with a string:
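
A minimal sketch, assuming a hypothetical sample file. The $ anchor at the end of the pattern ties the match to the end of the line. The single quotes keep the shell from treating $ as the start of a variable name:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Hello world\nSay Hello\nhello again\n' > "$demo/greetings.txt"

# $ anchors the match to the end of the line,
# so only "Say Hello" is printed.
grep 'Hello$' "$demo/greetings.txt"
```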

Testing regular expressions

Regular expressions can get quite complex, and learning how to use them takes practice. Luckily there is an easy way to test them to see if they match what you think they will match.

If you use one of the grep commands (grep or egrep) with a pattern but without giving it a filename or input from a pipe, then it waits for you to type input and repeats back to you any lines that match.

In most of our examples we use the egrep command because it supports extended regular expressions. In cases where we don't need extended regexes, we will use plain old grep.

To test a regular expression:

1.

egrep 'regular expression'

For example:

egrep '^[hH]ello'

(Using both upper- and lowercase letters means you're looking for both instances.)

Notice you are not giving egrep a file to search. When you press Return, you get a blank line. egrep is waiting for you to type in a line of text, which it will check against the pattern. Figure 4.26 shows the next few steps in this task with the text you type highlighted (in bold) and the Mac's response in plain text.

Figure 4.26. Testing a regular expression.

localhost:~ vanilla$ egrep '^[hH]ello'
Hello, nice to meet you.
Hello, nice to meet you.
Say, Hello world
^C
localhost:~ vanilla$

2.

Type in a line of text that you think should (or should not) match the pattern, and press Return. For example:

Hello, nice to meet you.

If egrep displays (repeats back to you) the line you typed, then the expression matched (the example above should match). Otherwise, it did not match.

3.

Type in another line of text to check:

Say, Hello world

This does not match the pattern in step 1 because the line does not match the ^ anchor ("look at the beginning of the line").

4.

To exit from the test, press Control-C (shown as ^C in Figure 4.26).
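
The same check can also be run without typing interactively, by piping the candidate lines into the command. A minimal sketch, using grep -E (the option form of egrep) and the two test lines from Figure 4.26:

```shell
# Feed two candidate lines to the pattern; only matching lines are echoed back.
# The first line starts with "Hello" and matches; the second does not.
printf 'Hello, nice to meet you.\nSay, Hello world\n' | grep -E '^[hH]ello'
```

Only "Hello, nice to meet you." is printed, which agrees with the interactive session.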

Learning More About Regular Expressions

Regular expressions are used not only with the grep program but also with multiple Unix programs and programming languages.

Here are a few places to learn more about regexes (as they are known to Unix experts):

  • Learning to Use Regular Expressions (http://gnosis.cx/publish/programming/regular_expressions.html).

    A nice online tutorial, though it assumes you are working with regular expressions in one of the many programming languages that use them.

  • Electronic Text Center: Using Regular Expressions (http://etext.lib.virginia.edu/helpsheets/regex.html).

    An introduction to regular expressions that describes the history and main concepts, and gives examples of their use.

  • Mastering Regular Expressions: Powerful Techniques for Perl and Other Tools, by Jeffrey E. F. Friedl (O'Reilly, 1997; www.oreilly.com/catalog/regex).

    Considered by many to be the standard in-depth work on regular expressions.

To find lines containing a string in which one character can vary:
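
A minimal sketch, assuming a hypothetical sample file. A square-bracket list is an atom that matches exactly one character from the list, so gr[ae]y matches both spellings of the word:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'gray cat\ngrey dog\ngreen frog\n' > "$demo/animals.txt"

# [ae] matches one character, either a or e,
# so "gray cat" and "grey dog" are printed.
grep 'gr[ae]y' "$demo/animals.txt"
```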

To create an atom that is anything not in a list:
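
A minimal sketch, assuming a hypothetical sample file. Putting ^ first inside a square-bracket list negates the list, so the atom matches any one character that is not listed:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'gray cat\ngrey dog\n' > "$demo/animals.txt"

# [^e] matches any single character except e,
# so "gray cat" is printed and "grey dog" is not.
grep 'gr[^e]y' "$demo/animals.txt"
```

The negated atom still has to match one character; it does not match "nothing at all."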

To create an atom from a range of numbers:
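
A minimal sketch, assuming a hypothetical log file, that finds the dates Feb 15 and Feb 16 mentioned earlier. A hyphen inside a square-bracket list defines a range, so [5-6] matches one digit, either 5 or 6:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'Feb 14 rain\nFeb 15 sun\nFeb 16 snow\nFeb 17 fog\n' > "$demo/weather.log"

# 1[5-6] matches "15" or "16", so only those two days are printed.
grep 'Feb 1[5-6]' "$demo/weather.log"
```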

To create an atom from a range of letters:
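
A minimal sketch, assuming a hypothetical sample file. Letter ranges work the same way as number ranges: [a-c] matches one character that is a, b, or c:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'room a1\nroom b2\nroom d4\n' > "$demo/rooms.txt"

# [a-c] matches a, b, or c, so "room d4" is not printed.
grep 'room [a-c]' "$demo/rooms.txt"
```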

To use a wildcard character:
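
A minimal sketch, assuming a hypothetical sample file. In a regular expression the period is the wildcard atom: it matches any single character:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'bat\nbit\nbt\nboot\n' > "$demo/words.txt"

# b.t needs exactly one character between b and t,
# so "bat" and "bit" are printed, but "bt" and "boot" are not.
grep 'b.t' "$demo/words.txt"
```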

To find lines in which an atom is repeated zero or more times:
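
A minimal sketch, assuming a hypothetical sample file. The * quantifier applies to the atom just before it and means "zero or more of the preceding atom":

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'bt\nbot\nboot\nbat\n' > "$demo/words.txt"

# o* matches zero or more o characters between b and t,
# so "bt", "bot", and "boot" are printed, but "bat" is not.
grep 'bo*t' "$demo/words.txt"
```

Note that this differs from the shell's use of *, where the asterisk stands on its own for any run of characters in a filename.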

More rules and tools for building regex atoms

The regex examples shown above allow you to perform some fairly sophisticated matching, but there are a lot more ways to create atoms and patterns. Table 4.2 describes several additional tools you will find useful in constructing more complex patterns. All the tools and rules in Table 4.2 require egrep. The real key is to experiment using the testing approach described above.

Table 4.2. Rules and Tools for Regex Atoms

RULE                      TOOL/MEANING
Match 1 or more           Use the + quantifier, "one or more of the preceding atom."
Match 0 or 1              Use the ? quantifier, "zero or one of the preceding atom."
Exact number              Put the number in braces; [a-c]{3} means "any character from the list a-c repeated exactly three times."
Alternatives              Put each alternative in parentheses, and separate them with the pipe character; '(Fox)|(Hound)' means "match lines containing either Fox or Hound."
Match special characters  If you want to match characters that have special meanings in a regex, such as [ or ^, then escape them (that is, remove any special meaning from them) with a \; for example, \[ will match a literal [. Inside a square-bracket list, you do not need to escape anything.
Match ^ inside a list     To include the ^ character in a square-bracket list, put it anywhere except first in the list; for example, [a-c^] matches a, b, c, or ^.
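
A minimal sketch of four rules from Table 4.2, assuming a hypothetical sample file. It uses grep -E, which is the option form of egrep and accepts the same extended patterns:

```shell
# Scratch directory and sample file (hypothetical names, for illustration only).
demo=$(mktemp -d)
printf 'bt\nbot\nboot\ncolor\ncolour\nFox\nHound\nabc\n' > "$demo/words.txt"

# + : one or more o characters, so "bot" and "boot" match but "bt" does not.
grep -E 'bo+t' "$demo/words.txt"

# ? : zero or one u, so both "color" and "colour" match.
grep -E 'colou?r' "$demo/words.txt"

# {3} : exactly three characters in a row from the list a-c; only "abc" matches.
grep -E '[a-c]{3}' "$demo/words.txt"

# | : alternatives, so lines containing either Fox or Hound match.
grep -E '(Fox)|(Hound)' "$demo/words.txt"
```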
