Introduction to Pointer-Based String Processing

In this section, we introduce some common C++ Standard Library functions that facilitate string processing. The techniques discussed here are appropriate for developing text editors, word processors, page layout software, computerized typesetting systems and other kinds of text-processing software. We have already used the C++ Standard Library string class in several examples to represent strings as full-fledged objects. For example, the GradeBook class case study in Chapters 3–7 represents a course name using a string object. In Chapter 18 we present class string in detail. Although using string objects is usually straightforward, we use null-terminated, pointer-based strings in this section. Many C++ Standard Library functions operate only on null-terminated, pointer-based strings, which are more complicated to use than string objects. Also, if you work with legacy C++ programs, you may be required to manipulate these pointer-based strings.


8.13.1. Fundamentals of Characters and Pointer-Based Strings

Characters are the fundamental building blocks of C++ source programs. Every program is composed of a sequence of characters that, when grouped together meaningfully, is interpreted by the compiler as a series of instructions used to accomplish a task. A program may contain character constants. A character constant is an integer value represented as a character in single quotes. The value of a character constant is the integer value of the character in the machine's character set. For example, 'z' represents the integer value of z (122 in the ASCII character set; see Appendix B), and '\n' represents the integer value of newline (10 in the ASCII character set).
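The correspondence between character constants and integer values can be checked directly. The following is a minimal sketch; the helper names charCode and digitValue are hypothetical (they are not Standard Library functions), and the specific codes mentioned afterward assume an ASCII-based character set.

```cpp
#include <cassert>

// Returns the integer code of c in the machine's character set.
// (Hypothetical helper for illustration.)
int charCode( char c )
{
   return static_cast< int >( static_cast< unsigned char >( c ) );
}

// Converts a digit character to its numeric value. C++ guarantees that
// '0' through '9' have consecutive codes, so this does not depend on ASCII.
int digitValue( char c )
{
   assert( c >= '0' && c <= '9' ); // precondition: c is a decimal digit
   return c - '0';
}
```

On an ASCII system, charCode( 'z' ) yields 122 and charCode( '\n' ) yields 10. The call digitValue( '7' ) yields 7 regardless of the character set.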

A string is a series of characters treated as a single unit. A string may include letters, digits and various special characters such as +, -, *, / and $. String literals, or string constants, in C++ are written in double quotation marks as follows:

"John Q. Doe" (a name)
"9999 Main Street" (a street address)
"Maynard, Massachusetts" (a city and state)
"(201) 555-1212" (a telephone number)

A pointer-based string in C++ is an array of characters ending in the null character ('\0'), which marks where the string terminates in memory. A string is accessed via a pointer to its first character. The value of a string is the address of its first character. Thus, in C++, it is appropriate to say that a string is a constant pointer (in fact, a pointer to the string's first character). In this sense, strings are like arrays, because an array name is also a pointer to its first element.
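Because a pointer-based string is just a pointer to its first character, code typically processes it by advancing a pointer until it reaches the null character. The following sketch illustrates that pattern; countChar is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>

// Counts how many times target occurs in the null-terminated string s.
// (Hypothetical helper for illustration.)
std::size_t countChar( const char *s, char target )
{
   assert( s != NULL ); // precondition: s points to a valid string

   std::size_t count = 0;

   // p starts at the first character and stops at the terminating '\0'
   for ( const char *p = s; *p != '\0'; ++p )
   {
      if ( *p == target )
         ++count;
   } // end for

   return count;
}
```

For example, countChar( "Maynard, Massachusetts", 'a' ) returns 4, and countChar( "", 'x' ) returns 0 because the loop stops immediately at the null character.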

A string literal may be used as an initializer in the declaration of either a character array or a variable of type char *. The declarations

char color[] = "blue";
const char *colorPtr = "blue";

each initialize a variable to the string "blue". The first declaration creates a five-element array color containing the characters 'b', 'l', 'u', 'e' and '\0'. The second declaration creates pointer variable colorPtr that points to the letter b in the string "blue" (which ends in '\0') somewhere in memory. String literals have static storage class (they exist for the duration of the program) and may or may not be shared if the same string literal is referenced from multiple locations in a program. Also, string literals in C++ are constant: their characters cannot be modified.

The declaration char color[] = "blue"; could also be written

char color[] = { 'b', 'l', 'u', 'e', '\0' };

When declaring a character array to contain a string, the array must be large enough to store the string and its terminating null character. The preceding declaration determines the size of the array, based on the number of initializers provided in the initializer list.
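The difference between an array's size and the length of the string it holds can be made concrete. The sketch below uses two hypothetical helpers, elementCount and fitsInArray, together with the library function strlen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Reports the number of elements in a character array, including the
// element that holds the terminating null character.
// (Hypothetical helper for illustration.)
template < std::size_t N >
std::size_t elementCount( const char ( & )[ N ] )
{
   return N;
}

// Returns true if string s, plus its terminating null character, fits in
// an array of capacity elements. (Hypothetical helper for illustration.)
bool fitsInArray( const char *s, std::size_t capacity )
{
   assert( s != NULL ); // precondition: s points to a valid string
   return std::strlen( s ) + 1 <= capacity;
}
```

Given char color[] = "blue";, elementCount( color ) is 5 while strlen( color ) is 4. Likewise, fitsInArray( "blue", 5 ) is true, but fitsInArray( "blue", 4 ) is false because no element would remain for the null character.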


Common Programming Error 8.15

Not allocating sufficient space in a character array to store the null character that terminates a string is an error.

Common Programming Error 8.16

Creating or using a C-style string that does not contain a terminating null character can lead to logic errors.

Error-Prevention Tip 8.4

When storing a string of characters in a character array, be sure that the array is large enough to hold the largest string that will be stored. C++ allows strings of any length to be stored. If a string is longer than the character array in which it is to be stored, characters beyond the end of the array will overwrite data in memory following the array, leading to logic errors.

A string can be read into a character array using stream extraction with cin. For example, the following statement can be used to read a string into character array word[ 20 ]:

cin >> word;

The string entered by the user is stored in word. The preceding statement reads characters until a white-space character or end-of-file indicator is encountered. Note that the string should be no longer than 19 characters to leave room for the terminating null character. The setw stream manipulator can be used to ensure that the string read into word does not exceed the size of the array. For example, the statement

cin >> setw( 20 ) >> word;

specifies that cin should read a maximum of 19 characters into array word and save the 20th location in the array to store the terminating null character for the string. The setw stream manipulator applies only to the next value being input. If more than 19 characters are entered, the remaining characters are not saved in word, but will be read in and can be stored in another variable.
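The effect of setw on input can be observed without a keyboard by substituting a string stream for cin. This is a sketch only: readWord and firstWordLength are hypothetical helpers, and the istringstream merely stands in for user input.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <iomanip>
#include <istream>
#include <sstream>

// Reads one white-space-delimited word of at most 19 characters into word.
// Returns false if nothing could be read. (Hypothetical helper.)
bool readWord( std::istream &input, char ( &word )[ 20 ] )
{
   if ( !( input >> std::setw( 20 ) >> word ) )
      return false; // extraction failed, for example at end-of-file

   assert( std::strlen( word ) <= 19 ); // element 20 is left for '\0'
   return true;
}

// Feeds typed through a string stream (standing in for the keyboard) and
// returns the length of the first word stored. (Hypothetical helper.)
std::size_t firstWordLength( const char *typed )
{
   std::istringstream input( typed );
   char word[ 20 ];
   return readWord( input, word ) ? std::strlen( word ) : 0;
}
```

With the 26-letter input "abcdefghijklmnopqrstuvwxyz", the first call to readWord stores the 19 characters a through s, and a second call on the same stream stores the remaining "tuvwxyz".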

In some cases, it is desirable to input an entire line of text into an array. For this purpose, C++ provides the function cin.getline in header file <iostream>. In Chapter 3 you were introduced to the similar function getline from header file <string>, which read input until a newline character was entered, and stored the input (without the newline character) into a string specified as an argument. The cin.getline function takes three arguments: a character array in which the line of text will be stored, a length and a delimiter character. For example, the program segment

char sentence[ 80 ];
cin.getline( sentence, 80, '\n' );

declares array sentence of 80 characters and reads a line of text from the keyboard into the array. The function stops reading characters when the delimiter character '\n' is encountered, when the end-of-file indicator is entered or when the number of characters read so far is one less than the length specified in the second argument. (The last character in the array is reserved for the terminating null character.) If the delimiter character is encountered, it is read and discarded. The third argument to cin.getline has '\n' as a default value, so the preceding function call could have been written as follows:

cin.getline( sentence, 80 );


Chapter 15, Stream Input/Output, provides a detailed discussion of cin.getline and other input/output functions.
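The behavior of getline described above can be tried with a string stream in place of cin. The following is a sketch under that assumption; readLine and firstLineLength are hypothetical helpers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <istream>
#include <sstream>

// Reads one line of at most size - 1 characters into buffer. The newline
// delimiter is read and discarded. Returns false if the read failed.
// (Hypothetical helper for illustration.)
bool readLine( std::istream &input, char *buffer, std::streamsize size )
{
   assert( buffer != NULL && size > 0 ); // precondition: usable buffer
   input.getline( buffer, size );        // third argument defaults to '\n'
   return !input.fail();
}

// Feeds typed through a string stream (standing in for the keyboard) and
// returns the length of the first line stored. (Hypothetical helper.)
std::size_t firstLineLength( const char *typed )
{
   std::istringstream input( typed );
   char sentence[ 80 ];
   return readLine( input, sentence, 80 ) ? std::strlen( sentence ) : 0;
}
```

For the input "first line\nsecond line", two successive calls to readLine store "first line" and then "second line"; neither stored string contains the newline character.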

Common Programming Error 8.17

Processing a single character as a char * string can lead to a fatal runtime error. A char * string is a pointer, probably a respectably large integer. However, a character is a small integer (ASCII values range 0–255). On many systems, dereferencing a char value causes an error, because low memory addresses are reserved for special purposes such as operating system interrupt handlers, so "memory access violations" occur.

Common Programming Error 8.18

Passing a string as an argument to a function when a character is expected is a compilation error.

 

8.13.2. String Manipulation Functions of the String-Handling Library

The string-handling library provides many useful functions for manipulating string data, comparing strings, searching strings for characters and other strings, tokenizing strings (separating strings into logical pieces such as the separate words in a sentence) and determining the length of strings. This section presents some common string-manipulation functions of the string-handling library (from the C++ standard library). The functions are summarized in Fig. 8.30; then each is used in a live-code example. The prototypes for these functions are located in header file <cstring>.

Figure 8.30. String-manipulation functions of the string-handling library.


Function prototype
Function description

char *strcpy( char *s1, const char *s2 );
Copies the string s2 into the character array s1. The value of s1 is returned.

char *strncpy( char *s1, const char *s2, size_t n );
Copies at most n characters of the string s2 into the character array s1. The value of s1 is returned.

char *strcat( char *s1, const char *s2 );
Appends the string s2 to s1. The first character of s2 overwrites the terminating null character of s1. The value of s1 is returned.

char *strncat( char *s1, const char *s2, size_t n );
Appends at most n characters of string s2 to string s1. The first character of s2 overwrites the terminating null character of s1. The value of s1 is returned.

int strcmp( const char *s1, const char *s2 );
Compares the string s1 with the string s2. The function returns a value of zero, less than zero (usually -1) or greater than zero (usually 1) if s1 is equal to, less than or greater than s2, respectively.

int strncmp( const char *s1, const char *s2, size_t n );
Compares up to n characters of the string s1 with the string s2. The function returns zero, less than zero or greater than zero if the n-character portion of s1 is equal to, less than or greater than the corresponding n-character portion of s2, respectively.

char *strtok( char *s1, const char *s2 );
A sequence of calls to strtok breaks string s1 into "tokens" (logical pieces such as words in a line of text). The string is broken up based on the characters contained in string s2. For instance, if we were to break the string "this:is:a:string" into tokens based on the character ':', the resulting tokens would be "this", "is", "a" and "string". Function strtok returns only one token at a time, however. The first call contains s1 as the first argument, and subsequent calls to continue tokenizing the same string contain NULL as the first argument. A pointer to the current token is returned by each call. If there are no more tokens when the function is called, NULL is returned.

size_t strlen( const char *s );
Determines the length of string s. The number of characters preceding the terminating null character is returned.


Note that several functions in Fig. 8.30 contain parameters with data type size_t. This type is defined in the header file <cstring> to be an unsigned integral type such as unsigned int or unsigned long.

Common Programming Error 8.19

Forgetting to include the <cstring> header file when using functions from the string-handling library causes compilation errors.

 

Copying Strings with strcpy and strncpy

Function strcpy copies its second argument (a string) into its first argument (a character array that must be large enough to store the string and its terminating null character, which is also copied). Function strncpy is equivalent to strcpy, except that strncpy specifies the number of characters to be copied from the string into the array. Note that function strncpy does not necessarily copy the terminating null character of its second argument; a terminating null character is written only if the number of characters to be copied is at least one more than the length of the string. For example, if "test" is the second argument, a terminating null character is written only if the third argument to strncpy is at least 5 (four characters in "test" plus one terminating null character). If the third argument is larger than 5, null characters are appended to the array until the total number of characters specified by the third argument is written.


Common Programming Error 8.20

When using strncpy, the terminating null character of the second argument (a char * string) will not be copied if the number of characters specified by strncpy's third argument is not greater than the second argument's length. In that case, a fatal error may occur if the programmer does not manually terminate the resulting char * string with a null character.

Figure 8.31 uses strcpy (line 17) to copy the entire string in array x into array y and uses strncpy (line 23) to copy the first 14 characters of array x into array z. Line 24 appends a null character ('\0') to array z, because the call to strncpy in the program does not write a terminating null character. (The third argument is less than the string length of the second argument plus one.)

Figure 8.31. strcpy and strncpy.

 1  // Fig. 8.31: fig08_31.cpp
 2  // Using strcpy and strncpy.
 3  #include <iostream>
 4  using std::cout;
 5  using std::endl;
 6
 7  #include <cstring> // prototypes for strcpy and strncpy
 8  using std::strcpy;
 9  using std::strncpy;
10
11  int main()
12  {
13     char x[] = "Happy Birthday to You"; // string length 21
14     char y[ 25 ];
15     char z[ 15 ];
16
17     strcpy( y, x ); // copy contents of x into y
18
19     cout << "The string in array x is: " << x
20        << "\nThe string in array y is: " << y << '\n';
21
22     // copy first 14 characters of x into z
23     strncpy( z, x, 14 ); // does not copy null character
24     z[ 14 ] = '\0'; // append '\0' to z's contents
25
26     cout << "The string in array z is: " << z << endl;
27     return 0; // indicates successful termination
28  } // end main

The string in array x is: Happy Birthday to You
The string in array y is: Happy Birthday to You
The string in array z is: Happy Birthday
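Because strncpy may leave its destination without a terminating null character, programs often wrap it in a function that always terminates the result. The following is one possible sketch; copyTruncated is a hypothetical helper, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies as much of src as fits into dest, an array of destSize elements,
// and always null-terminates the result. (Hypothetical helper.)
char *copyTruncated( char *dest, std::size_t destSize, const char *src )
{
   assert( dest != NULL && src != NULL && destSize > 0 );

   std::strncpy( dest, src, destSize - 1 ); // may not write a '\0'
   dest[ destSize - 1 ] = '\0';             // so terminate explicitly
   return dest;
}
```

With char z[ 15 ];, the call copyTruncated( z, sizeof( z ), "Happy Birthday to You" ) leaves "Happy Birthday" in z, matching the result that Fig. 8.31 produces by terminating the array manually. When the source is shorter than the array, strncpy fills the remaining elements with null characters.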

Concatenating Strings with strcat and strncat

Function strcat appends its second argument (a string) to its first argument (a character array containing a string). The first character of the second argument replaces the null character ('\0') that terminates the string in the first argument. The programmer must ensure that the array used to store the first string is large enough to store the combination of the first string, the second string and the terminating null character (copied from the second string). Function strncat appends a specified number of characters from the second string to the first string and appends a terminating null character to the result. The program of Fig. 8.32 demonstrates function strcat (lines 19 and 29) and function strncat (line 24).


Figure 8.32. strcat and strncat.

 1  // Fig. 8.32: fig08_32.cpp
 2  // Using strcat and strncat.
 3  #include <iostream>
 4  using std::cout;
 5  using std::endl;
 6
 7  #include <cstring> // prototypes for strcat and strncat
 8  using std::strcat;
 9  using std::strncat;
10
11  int main()
12  {
13     char s1[ 20 ] = "Happy "; // length 6
14     char s2[] = "New Year "; // length 9
15     char s3[ 40 ] = "";
16
17     cout << "s1 = " << s1 << "\ns2 = " << s2;
18
19     strcat( s1, s2 ); // concatenate s2 to s1 (length 15)
20
21     cout << "\n\nAfter strcat(s1, s2):\ns1 = " << s1 << "\ns2 = " << s2;
22
23     // concatenate first 6 characters of s1 to s3
24     strncat( s3, s1, 6 ); // places '\0' after last character
25
26     cout << "\n\nAfter strncat(s3, s1, 6):\ns1 = " << s1
27        << "\ns3 = " << s3;
28
29     strcat( s3, s1 ); // concatenate s1 to s3
30     cout << "\n\nAfter strcat(s3, s1):\ns1 = " << s1
31        << "\ns3 = " << s3 << endl;
32     return 0; // indicates successful termination
33  } // end main

s1 = Happy
s2 = New Year

After strcat(s1, s2):
s1 = Happy New Year
s2 = New Year

After strncat(s3, s1, 6):
s1 = Happy New Year
s3 = Happy

After strcat(s3, s1):
s1 = Happy New Year
s3 = Happy Happy New Year
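Since the programmer is responsible for ensuring that the destination array is large enough, one defensive approach is to compute the remaining space and pass it to strncat. The sketch below shows this idea; appendBounded is a hypothetical helper.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Appends as much of src as fits to the string in dest, an array of
// destSize elements. The result is always null-terminated.
// (Hypothetical helper for illustration.)
char *appendBounded( char *dest, std::size_t destSize, const char *src )
{
   assert( dest != NULL && src != NULL );

   std::size_t used = std::strlen( dest );
   assert( used < destSize ); // dest must already hold a terminated string

   // strncat appends at most the given number of characters, then a '\0'
   std::strncat( dest, src, destSize - used - 1 );
   return dest;
}
```

Starting from char s1[ 20 ] = "Happy ";, appending "New Year " yields "Happy New Year " (15 characters). Appending "to everyone" afterward stores only "to e", because the array has room for just four more characters plus the null character.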


Comparing Strings with strcmp and strncmp

Figure 8.33 compares three strings using strcmp (lines 21, 22 and 23) and strncmp (lines 26, 27 and 28). Function strcmp compares its first string argument with its second string argument character by character. The function returns zero if the strings are equal, a negative value if the first string is less than the second string and a positive value if the first string is greater than the second string. Function strncmp is equivalent to strcmp, except that strncmp compares up to a specified number of characters. Function strncmp stops comparing characters if it reaches the null character in one of its string arguments. The program prints the integer value returned by each function call.

Figure 8.33. strcmp and strncmp.


 1  // Fig. 8.33: fig08_33.cpp
 2  // Using strcmp and strncmp.
 3  #include <iostream>
 4  using std::cout;
 5  using std::endl;
 6
 7  #include <iomanip>
 8  using std::setw;
 9
10  #include <cstring> // prototypes for strcmp and strncmp
11  using std::strcmp;
12  using std::strncmp;
13
14  int main()
15  {
16     const char *s1 = "Happy New Year"; // string literals are constant
17     const char *s2 = "Happy New Year";
18     const char *s3 = "Happy Holidays";
19
20     cout << "s1 = " << s1 << "\ns2 = " << s2 << "\ns3 = " << s3
21        << "\n\nstrcmp(s1, s2) = " << setw( 2 ) << strcmp( s1, s2 )
22        << "\nstrcmp(s1, s3) = " << setw( 2 ) << strcmp( s1, s3 )
23        << "\nstrcmp(s3, s1) = " << setw( 2 ) << strcmp( s3, s1 );
24
25     cout << "\n\nstrncmp(s1, s3, 6) = " << setw( 2 )
26        << strncmp( s1, s3, 6 ) << "\nstrncmp(s1, s3, 7) = " << setw( 2 )
27        << strncmp( s1, s3, 7 ) << "\nstrncmp(s3, s1, 7) = " << setw( 2 )
28        << strncmp( s3, s1, 7 ) << endl;
29     return 0; // indicates successful termination
30  } // end main

s1 = Happy New Year
s2 = Happy New Year
s3 = Happy Holidays

strcmp(s1, s2) =  0
strcmp(s1, s3) =  1
strcmp(s3, s1) = -1

strncmp(s1, s3, 6) =  0
strncmp(s1, s3, 7) =  1
strncmp(s3, s1, 7) = -1

Common Programming Error 8.21

Assuming that strcmp and strncmp return one (a true value) when their arguments are equal is a logic error. Both functions return zero (C++'s false value) for equality. Therefore, when testing two strings for equality, the result of the strcmp or strncmp function should be compared with zero to determine whether the strings are equal.
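One way to avoid this error is to hide the comparison with zero inside small, clearly named functions. The following sketch uses two hypothetical helpers, stringsEqual and prefixesEqual.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Returns true if strings a and b contain the same characters.
// (Hypothetical helper for illustration.)
bool stringsEqual( const char *a, const char *b )
{
   assert( a != NULL && b != NULL );
   return std::strcmp( a, b ) == 0; // zero means "equal"
}

// Returns true if the first n characters of a and b match.
// (Hypothetical helper for illustration.)
bool prefixesEqual( const char *a, const char *b, std::size_t n )
{
   assert( a != NULL && b != NULL );
   return std::strncmp( a, b, n ) == 0; // zero means "equal"
}
```

For the strings of Fig. 8.33, stringsEqual( s1, s2 ) is true and stringsEqual( s1, s3 ) is false, while prefixesEqual( s1, s3, 6 ) is true because both strings begin with "Happy ".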

To understand just what it means for one string to be "greater than" or "less than" another string, consider the process of alphabetizing a series of last names. The reader would, no doubt, place "Jones" before "Smith," because the first letter of "Jones" comes before the first letter of "Smith" in the alphabet. But the alphabet is more than just a list of 26 letters; it is an ordered list of characters. Each letter occurs in a specific position within the list. "Z" is more than just a letter of the alphabet; "Z" is specifically the 26th letter of the alphabet.


How does the computer know that one letter comes before another? All characters are represented inside the computer as numeric codes; when the computer compares two strings, it actually compares the numeric codes of the characters in the strings.

In an effort to standardize character representations, most computer manufacturers have designed their machines to utilize one of two popular coding schemes, ASCII or EBCDIC. Recall that ASCII stands for "American Standard Code for Information Interchange." EBCDIC stands for "Extended Binary Coded Decimal Interchange Code." There are other coding schemes, but these two are the most popular.

ASCII and EBCDIC are called character codes, or character sets. Most readers of this book will be using desktop or notebook computers that use the ASCII character set. IBM mainframe computers use the EBCDIC character set. As Internet and World Wide Web usage becomes pervasive, the newer Unicode character set is growing rapidly in popularity. For more information on Unicode, visit www.unicode.org. String and character manipulations actually involve the manipulation of the appropriate numeric codes and not the characters themselves. This explains the interchangeability of characters and small integers in C++. Since it is meaningful to say that one numeric code is greater than, less than or equal to another numeric code, it becomes possible to relate various characters or strings to one another by referring to the character codes. Appendix B contains the ASCII character codes.

Portability Tip 8.5

The internal numeric codes used to represent characters may be different on different computers, because these computers may use different character sets.

Portability Tip 8.6

Do not explicitly test for ASCII codes, as in if ( rating == 65 ); rather, use the corresponding character constant, as in if ( rating == 'A' ).

[Note: With some compilers, functions strcmp and strncmp always return -1, 0 or 1, as in the sample output of Fig. 8.33. With other compilers, these functions return 0 or the difference between the numeric codes of the first characters that differ in the strings being compared. For example, when s1 and s3 are compared, the first characters that differ between them are the first character of the second word in each string: N (numeric code 78) in s1 and H (numeric code 72) in s3, respectively. In this case, the return value will be 6 (or -6 if s3 is compared to s1).]
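Code that must behave the same way with every compiler should rely only on the sign of the result. As a sketch, the hypothetical helper compareSign below reduces any strcmp result to -1, 0 or 1.

```cpp
#include <cassert>
#include <cstring>

// Returns -1, 0 or 1 according to whether a is less than, equal to or
// greater than b, whatever magnitude strcmp itself returns.
// (Hypothetical helper for illustration.)
int compareSign( const char *a, const char *b )
{
   assert( a != NULL && b != NULL );

   int result = std::strcmp( a, b );
   return ( result > 0 ) - ( result < 0 ); // each comparison yields 0 or 1
}
```

Whether strcmp( s1, s3 ) returns 1 or 6 on a given system, compareSign( s1, s3 ) returns 1 for the strings of Fig. 8.33, and compareSign( s3, s1 ) returns -1.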


Tokenizing a String with strtok

Function strtok breaks a string into a series of tokens. A token is a sequence of characters separated by delimiting characters (usually spaces or punctuation marks). For example, in a line of text, each word can be considered a token, and the spaces separating the words can be considered delimiters.

Multiple calls to strtok are required to break a string into tokens (assuming that the string contains more than one token). The first call to strtok contains two arguments, a string to be tokenized and a string containing characters that separate the tokens (i.e., delimiters). Line 19 in Fig. 8.34 assigns to tokenPtr a pointer to the first token in sentence. The second argument, " ", indicates that tokens in sentence are separated by spaces. Function strtok searches for the first character in sentence that is not a delimiting character (space). This begins the first token. The function then finds the next delimiting character in the string and replaces it with a null ('\0') character. This terminates the current token. Function strtok saves (in a static variable) a pointer to the next character following the token in sentence and returns a pointer to the current token.


Figure 8.34. strtok.


 1  // Fig. 8.34: fig08_34.cpp
 2  // Using strtok.
 3  #include <iostream>
 4  using std::cout;
 5  using std::endl;
 6
 7  #include <cstring> // prototype for strtok
 8  using std::strtok;
 9
10  int main()
11  {
12     char sentence[] = "This is a sentence with 7 tokens";
13     char *tokenPtr;
14
15     cout << "The string to be tokenized is:\n" << sentence
16        << "\n\nThe tokens are:\n\n";
17
18     // begin tokenization of sentence
19     tokenPtr = strtok( sentence, " " );
20
21     // continue tokenizing sentence until tokenPtr becomes NULL
22     while ( tokenPtr != NULL )
23     {
24        cout << tokenPtr << '\n';
25        tokenPtr = strtok( NULL, " " ); // get next token
26     } // end while
27
28     cout << "\nAfter strtok, sentence = " << sentence << endl;
29     return 0; // indicates successful termination
30  } // end main

The string to be tokenized is:
This is a sentence with 7 tokens

The tokens are:

This
is
a
sentence
with
7
tokens

After strtok, sentence = This

Subsequent calls to strtok to continue tokenizing sentence contain NULL as the first argument (line 25). The NULL argument indicates that the call to strtok should continue tokenizing from the location in sentence saved by the last call to strtok. Note that strtok maintains this saved information in a manner that is not visible to the programmer. If no tokens remain when strtok is called, strtok returns NULL. The program of Fig. 8.34 uses strtok to tokenize the string "This is a sentence with 7 tokens". The program prints each token on a separate line. Line 28 outputs sentence after tokenization. Note that strtok modifies the input string; therefore, a copy of the string should be made if the program requires the original after the calls to strtok. When sentence is output after tokenization, note that only the word "This" prints, because strtok replaced each blank in sentence with a null character ('\0') during the tokenization process.

Common Programming Error 8.22

Not realizing that strtok modifies the string being tokenized and then attempting to use that string as if it were the original unmodified string is a logic error.
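A simple way to avoid this error is to tokenize a copy and leave the original string untouched. The sketch below does so with a temporary buffer; countTokens is a hypothetical helper, and the use of std::vector for the scratch copy is an illustrative choice rather than a requirement.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Counts the tokens in text without modifying it, by tokenizing a copy.
// (Hypothetical helper for illustration.)
std::size_t countTokens( const char *text, const char *delimiters )
{
   assert( text != NULL && delimiters != NULL );

   // copy text, including its terminating '\0', into a scratch buffer
   std::vector< char > copy( text, text + std::strlen( text ) + 1 );

   std::size_t count = 0;

   // the first call passes the buffer; later calls pass NULL to continue
   for ( char *token = std::strtok( &copy[ 0 ], delimiters );
         token != NULL; token = std::strtok( NULL, delimiters ) )
      ++count;

   return count;
}
```

For example, countTokens( "This is a sentence with 7 tokens", " " ) returns 7, and the string passed as the first argument is unchanged afterward because only the copy was modified.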

 

Determining String Lengths

Function strlen takes a string as an argument and returns the number of characters in the string; the terminating null character is not included in the length. The length is also the index of the null character. The program of Fig. 8.35 demonstrates function strlen.

Figure 8.35. strlen returns the length of a char * string.


 1  // Fig. 8.35: fig08_35.cpp
 2  // Using strlen.
 3  #include <iostream>
 4  using std::cout;
 5  using std::endl;
 6
 7  #include <cstring> // prototype for strlen
 8  using std::strlen;
 9
10  int main()
11  {
12     const char *string1 = "abcdefghijklmnopqrstuvwxyz";
13     const char *string2 = "four";
14     const char *string3 = "Boston";
15
16     cout << "The length of \"" << string1 << "\" is " << strlen( string1 )
17        << "\nThe length of \"" << string2 << "\" is " << strlen( string2 )
18        << "\nThe length of \"" << string3 << "\" is " << strlen( string3 )
19        << endl;
20     return 0; // indicates successful termination
21  } // end main

The length of "abcdefghijklmnopqrstuvwxyz" is 26
The length of "four" is 4
The length of "Boston" is 6
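To see why the length equals the index of the null character, it can help to write the count by hand. The following is a sketch of how such a function might work, not the library's actual implementation; stringLength is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>

// Counts the characters that precede the terminating null character.
// (Hypothetical stand-in for strlen, for illustration only.)
std::size_t stringLength( const char *s )
{
   assert( s != NULL ); // precondition: s points to a valid string

   std::size_t length = 0;

   while ( s[ length ] != '\0' ) // stop at the null character
      ++length;

   return length; // also the index of the null character
}
```

For the strings of Fig. 8.35, stringLength returns 26, 4 and 6, the same values that strlen reports.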
