Splitting a String

Problem

You want to split a delimited string into multiple strings. For example, you may want to split the string "Name|Address|Phone" into three separate strings, "Name", "Address", and "Phone", with the delimiter removed.

Solution

Use basic_string's find member function to advance from one occurrence of the delimiter to the next, and substr to copy each substring out of the original string. You can use any standard sequence to hold the results; Example 4-10 uses a vector.

Example 4-10. Split a delimited string

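#include <string>
#include <vector>
#include <iostream>

using namespace std;

void split(const string& s, char c, vector<string>& v) {
   string::size_type i = 0;          // start of the current field
   string::size_type j = s.find(c);  // position of the next delimiter

   while (j != string::npos) {
      v.push_back(s.substr(i, j-i)); // copy the field, without the delimiter
      i = ++j;                       // the next field starts after the delimiter
      j = s.find(c, j);

      // Last field: everything after the final delimiter
      if (j == string::npos)
         v.push_back(s.substr(i, s.length()));
   }
}

int main() {
   vector<string> v;
   string s = "Account Name|Address 1|Address 2|City";

   split(s, '|', v);

   for (vector<string>::size_type i = 0; i < v.size(); ++i) {
      cout << v[i] << ' ';
   }
}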

Discussion

Making the example above a function template that accepts any kind of character is trivial; just parameterize the character type and change references to string to basic_string<T>:

template<typename T>
void split(const basic_string<T>& s, T c,
           vector<basic_string<T> >& v) {
   // typename is required because size_type is a dependent name
   typename basic_string<T>::size_type i = 0;
   typename basic_string<T>::size_type j = s.find(c);

   while (j != basic_string<T>::npos) {
      v.push_back(s.substr(i, j-i));
      i = ++j;
      j = s.find(c, j);

      if (j == basic_string<T>::npos)
         v.push_back(s.substr(i, s.length()));
   }
}

The logic is identical.

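Notice, though, that I put an extra space between the last two right-angle brackets on the last line of the function header. Before C++11, you have to do this so the compiler does not read >> as a right-shift operator. C++11 and later parse >> correctly in this position, so the space matters only if the code must build with older compilers.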

Example 4-10 splits a string using a simple algorithm. Starting at the beginning, it looks for the first occurrence of the delimiter c, then treats everything between the current starting position and that delimiter as the next meaningful chunk of text. The example uses the find member function to locate the first occurrence of a character at or after a particular index in the original string, and substr to copy the characters in that range to a new string, which is pushed onto a vector. The search then resumes just past the delimiter. This is similar to the behavior of the split function in most scripting languages, and is actually a special case of tokenizing a stream of text, which is described in Recipe 4.7.

Splitting strings based on single-character delimiters is a common requirement, and it probably won't surprise you that the Boost String Algorithms library supports it. It is easy to use; Example 4-11 shows how to split a string with Boost's split function.

Example 4-11. Splitting a string with Boost

#include <iostream>
#include <string>
#include <list>
#include <boost/algorithm/string.hpp>

using namespace std;
using namespace boost;

int main() {
   string s = "one,two,three,four";
   list<string> results;

   split(results, s, is_any_of(","));  // Note this is boost::split

   for (list<string>::const_iterator p = results.begin();
        p != results.end(); ++p) {
      cout << *p << endl;
   }
}

split is a function template that takes three required arguments and an optional fourth, which controls whether adjacent delimiters are merged. Its declaration looks like this:

template<typename Seq, typename Coll, typename Pred>
Seq& split(Seq& s, Coll& c, Pred p,
           token_compress_mode_type e = token_compress_off);

The types Seq, Coll, and Pred represent the types of the result sequence, the input collection, and the predicate used to decide whether something is a delimiter. The sequence argument is a sequence, in the C++ standard's sense of the term, whose elements can hold pieces of the input collection. In Example 4-11 I used a list<string>, but you could use something else, such as a vector<string>. The collection argument is the input to be split. A collection is a nonstandard concept that is similar to a sequence, but with fewer requirements (see the Boost documentation at www.boost.org for specifics). The predicate argument is a unary function object or function pointer that returns a bool indicating whether its argument is a delimiter. It is invoked on each element of the input collection in the form f(*it), where it is an iterator that refers to that element.

is_any_of is a convenient function template in the String Algorithms library that makes your life easier if you are using multiple delimiters. It constructs a unary function object that returns true if the argument you pass in is a member of the set of characters it was built from. In other words:

bool b = is_any_of("abc")('a'); // b = true

This makes it easy to test for multiple delimiters without having to write the function object yourself.
