Problems in Strings
PROBLEM MAXIMIZE A COMBINATION UNDER CONSTRAINTS
Given n arbitrary strings S1, S2,…,Sn, maximize the function
f(Si, Si+1, …, Sj) = length(Si) + length(Si+1)+ … + length(Sj)
under the given constraints c(Sp,…,Sq), which means strings Sp,…,Sq cannot be taken together.
Program
#include #define NANSWERS 3 // max no of strings in the answer. #define MAXCONS 5 // max length of any constraint. typedef enum {FALSE, TRUE} bool; bool isAbsent(int strnum, int *answer, int nans) { /* * returns TRUE if answer[nans] does NOT contain strnum. */ int i; for(i=0; i−1. */ int i, j; for(i=0; i−1; ++j) if(isAbsent(constraints[i][j], answer, nans)) break; if(constraints[i][j] == −1) return FALSE; } return TRUE; } void findMaxComb(int *lengths, int nstr, int constraints[][MAXCONS+1], int ncons, int *answer, int nans, int *maxsum, int startstr, int startans, int currsum ) { /* * find the max sum of lengths of nans strings out of nstr strings of * lengths lengths[] satisfying ncons constraints constraints[]. * save the max sum in *maxsum and the string indices in answer[]. */ int i; if(startans < nans) { for(i=startstr; i *maxsum) { if(satisfies(answer, nans, constraints, ncons)) *maxsum = currsum; } } int main() { int lengths[] = {9, 8, 6, 5, 4, 3}; // lengths[i] = length(string[i]). int answer[NANSWERS]; // indices of strings. int constraints[][MAXCONS+1] = { // index of string starts with 1 so that {1,3,−1}, // end of constraint can be signified {2,−1}, // by 0. {1,5,−1}, // we decrement every index so that {3,4,−1}, // each constraint ends with −1. {0,4,5,−1}, {0,4,−1}, {0,3,5,−1} }; int ncons = sizeof(constraints)/sizeof(int)/(MAXCONS+1); // no of constraints. int nstr = sizeof(lengths)/sizeof(int); // no of strings. int i, j; int maxsum = 0; for(i=1; i<=NANSWERS; ++i) { findMaxComb(lengths, nstr, constraints, ncons, answer, i, &maxsum, 0, 0, 0); printf("After %d strings: maxsum=%d. ", i, maxsum); //maxsum = 0; // this will keep all length combinations separate. } return 0; }
Explanation
- We represent the given strings' lengths in an integer array. Each constraint consists of an array of indices of strings that cannot appear together in the final answer. Since each constraint can contain different number of indices, we end each constraint by −1. If the final answer we require consists of at most NANSWERS strings, we find the combination of i strings, the sum of whose lengths will be maximum, where 1<=i<=NANSWERS. The function findMaxComb() finds such a combination for given value of i, if it exists. The variable maxsum signifies the current maximum sum of the combination of strings found so far. In order to find such a combination for every length, one needs to set maxsum=0 in the loop in main(). Also, if one wants to find a combination consisting of only NANSWERS strings, then the loop in main() can be changed as for(i=NANSWERS; i<=NANSWERS; ++i). In our implementation, the loop traverses from 1 to NANSWERS and maxsum is not set to 0 inside the loop. Thus we find a combination with any number of strings, up to NANSWERS.
- The function findMaxSum() recursively finds a combination of maximum length and successively fills the answer[] array, which contains the final answer. The variable currsum contains the sum of the current strings selected in answer[]. The number of entries in answer[] to be filled is sent as a parameter from main() in the variable nans. If the nans fields in the array answer[] are filled, we check whether the current sum is greater than the current maximum sum found. If it is, then we check whether the new combination satisfies all constraints. If it satisfies all constraints, we have found a new combination that becomes the current maximum. So maxsum is updated.
- Whether a string combination satisfies all constraints is checked in the function satisfies(). If there exists a constraint, all of whose string indices are present in the answer[] array of string indices, then that means the string combination does not satisfy that constraint, because each constraint c(Sp,…,Sq) says that the strings Sp…Sq cannot be combined. Thus, if we find that the string combination in answer[] satisfies all the constraints, then the function satisifes() returns TRUE, otherwise, it returns FALSE.
- Example: Let the strings have lengths 9, 8, 6, 5, 4, and 3. The string indices are from 0 to 5. Let the constraints be { {1,3}, {2}, {1,5}, {3,4}, {0,4,5}, {0,4}, and {0,3,5} }, which means strings 1 and 3 cannot be included together, string 2 cannot be taken, strings 1 and 5 cannot appear together, strings 0, 4, and 5 cannot appear together, strings 0 and 4 cannot appear together, and strings 0, 3, and 5 cannot appear together. Note that the constraint {0,4,5} is included in the constraint {0,4}. Let NANSWERS = 3. The different steps of the algorithm are presented in the following table.
I |
string combination |
currsum |
maxsum |
currsum > maxsum |
satisfies-all-constraints? |
maxsum |
---|---|---|---|---|---|---|
1 |
0 |
9 |
0 |
yes |
yes |
9 |
1 to 5 |
<9 |
9 |
no |
— |
9 |
|
2 |
0, 1 |
17 |
9 |
yes |
yes |
17 |
other combinations |
<17 |
17 |
no |
— |
17 |
|
3 |
0, 1, 2 |
23 |
17 |
yes |
no |
17 |
0, 1, 3 |
22 |
17 |
yes |
no |
17 |
|
0, 1, 4 |
21 |
17 |
yes |
no |
17 |
|
0, 1, 5 |
20 |
17 |
yes |
no |
17 |
|
0, 2, 3 |
20 |
17 |
yes |
no |
17 |
|
0, 2, 4 |
19 |
17 |
yes |
no |
17 |
|
0, 2, 5 |
18 |
17 |
yes |
no |
17 |
|
0, 3, 4 |
18 |
17 |
yes |
no |
17 |
|
0, 3, 5 |
17 |
17 |
no |
— |
17 |
|
0, 4, 5 |
16 |
17 |
no |
— |
17 |
|
1, 2, 3 |
19 |
17 |
yes |
no |
17 |
|
1, 2, 4 |
18 |
17 |
yes |
no |
17 |
|
1, 2, 5 |
17 |
17 |
no |
— |
17 |
|
2, 3, 4 |
15 |
17 |
no |
— |
17 |
|
other combinations |
<17 |
17 |
no |
— |
17 |
Points to Remember
- The complexity of the algorithm is exponential over the number of strings.
- Some speed-up can be achieved by sorting the lengths array first, which will result in avoiding some combinations. In that case A* algorithm can be used to find the answer.
- If each constraint consists of a fixed number of strings, then this problem can be solved in polynomial time by reducing this problem to the longest path problem.
PROBLEM MAXIMIZE A COMBINATION OF STRINGS THE SECOND METHOD
Given n arbitrary strings S1, S2, …, Sn, maximize the function
f(Si, Si+1, …, Sj) = length(Si) + length(Si+1)+ … + length(Sj)
under the given constraints c(Si, Sj) which means Si and Sj cannot be taken together.
Program
#include #define MININT -1000 #define MAXVERTICES 10 #define MAXPATHVERT 3 void printCosts( int a[][MAXVERTICES], int nvert, int pathvert[][MAXVERTICES] ) { /* * prints min cost matrix a. */ int i, j; for( i=0; i a[i][j]; * and h1+h2−1 < MAXPATHVERT; * return the sum. */ int p, q; int maxsum = 0; *h1 = *h2 = −1; for(p=2; p−1 <= MAXPATHVERT && b[p][i][k]>0 && b[q][k][j]>0 && b[p][i][k]+b[q][k][j] > maxsum) maxsum=b[p][i][k]+b[q][k][j], *h1=p, *h2=q; } if(maxsum > a[i][j]) return maxsum; *h1 = *h2 = −1; return −1; } void allCosts(int cost[][MAXVERTICES], int a[][MAXVERTICES], int nvert){ int i, j, k; int pathvert[MAXVERTICES][MAXVERTICES] ; int b[MAXPATHVERT+1][MAXVERTICES][MAXVERTICES] ; int sum, h1, h2; int l; for( i=0; i−1 && h1 != −1 && a[i][j]>=0) { //printf("a[i][j]=%d, h1=%d, h2=%d. ", sum, h1, h2); a[i][j] = sum, b[h1+h2−1][i][j] = sum, pathvert[i][j]=h1+h2−1; } } printCosts(a, nvert, pathvert); } } int main() { int cost[MAXVERTICES][MAXVERTICES] = { {0,50,10,MININT,45,MININT}, {MININT,0,15,MININT,10,MININT}, {20,MININT,0,15,MININT,MININT}, {MININT,20,MININT,0,35,MININT}, {MININT,MININT,MININT,30,0,MININT}, {MININT,MININT,MININT,3,MININT,0} }; /* {20,30,40,50}, {MININT,40,50,60}, {MININT,MININT,60,70}, {MININT,MININT,MININT,80} , {2,3,4,5}, {3,4,5,6}, {4,5,6,MININT}, {5,6,MININT,8} , {0,1,2,MININT}, {3,0,MININT,4}, {MININT,MININT,0,6}, {MININT,5,MININT,0} }; */ int a[MAXVERTICES][MAXVERTICES] ; int nvert = 6; // no of vertices. /*print( cost, nvert ); */ allCosts( cost, a, nvert ); //printCosts( a, nvert ); return 0; }
Explanation
- Consider a graph represented by a cost adjacency matrix cost[n][n], where n is the number of vertices in the graph cost[i][i] == 0. cost[i][j] represents the length of the edge from vertex i to vertex j.
- We used this cost adjacency matrix to solve the problem of finding the shortest path between any pair of vertices in the graph in O(n^3). We set cost[i][j] = infinity whenever there is no edge from vertex i to vertex j.
We define a similar problem of finding the longest path between any pair of vertices in the graph, called the longest path problem. If we allow inclusion of a vertex more than once, there is a possibility of getting into an infinite loop and the procedure may not terminate! So we put a limit on the number of vertices that can be included in the longest path.
- We reduce the given problem to the longest path problem. Each string Si is mapped to a vertex Vi. In the cost adjacency matrix, an entry cost[i][j] contains length(Si)+length(Sj). Thus the matrix is symmetric. Also, cost[i][i] is no longer zero. It contains the value length(Si)+length(Si). After this, we put in the constraints given by c(Si, Sj) and we mark such entries cost[i][j] as -infinity (MININT). Thus we get the cost adjacency matrix to represent the given constraints.
- The function allCosts() finds the cost of the longest path between any pair of vertices. It contains a matrix a[][],which contains the current maximum path between any pair of vertices at any given time. It is initialized with cost[][]. In the loop, we find longest paths of lengths 2, 3, … up to the limit defined by MAXPATHVERT. It starts with length == 2 because cost[][] contains the path containing 2 vertices, that is, an edge. Thus the original three loops of the shortest path problem are enclosed in another loop that goes from 2 to MAXPATHVERT. We also maintain a 3-dimensional array b[][][]. An entry b[h][i][j] contains the length of the longest path from vertex i to vertex j of length h. We also maintain a matrix pathvert[][] in which pathvert[i][j] contains the length of the longest path from i to j. Inside the innermost loop, the function getMaxSum() is called. In this function, for a[i][j], all combinations of paths from vertex i to vertex k and from vertex k to vertex j are checked and the one with the maximum cost is returned. a[i][j], then contains this new value if it is greater than the earlier one. The program finally calculates the cost of the longest path between any pair of vertices in the graph.
- Let m=MAXPATHVERT. Then the complexity of getMaxSum() is O(m^2). The complexity of allCosts() is then O(m^3n^3) where n is the number of strings in the given problem. The complexity of getMaxSum() may be reduced to O(m logm) using a procedure similar to merge sort.
Points to Remember
- Reducing one problem to another known problem can help in reusing the code from the earlier problem, as well as in the complexity analysis.
- The complexity of the problem stated is O(m^3n^3), where n is the number of strings and m is the number of maximum strings allowed.
- If we allow the constraint function c() to contain an arbitrary number of strings, then the complexity becomes exponential O(n^n).
PROBLEM CLOSURE OF SETS
Write a program to find closure of a set of characters input as a string.
Program
#include #define MAXLEN 80 void init(char *answer, int slen) { /* * initialize first slen entries in answer[] to 0. */ int i; for(i=0; i−1; i>=0; -i) { // i represents the bit pattern. // 1 in the bit pattern means char should be displayed. fillBitwise(i, str, s, slen); init(answer, strlen(str)); printf("printComb(%s). ", str); printComb(str, strlen(str), answer); getchar(); } } int main() { char s[MAXLEN]; printf("Enter characters for closure: "); gets(s); while(*s) { //init(answer, strlen(s)); //printComb(s, strlen(s), answer); findClosure(s, strlen(s)); printf("Enter characters for closure(press enter to end): "); gets(s); } return 0; }
Explanation
- Closure of string ‘abc’ is a set of strings that are combinations of all 3- character strings plus combinations of all 2-character strings plus combinations of all 1-character strings plus combinations of all strings of length zero.
Thus, closure(abc) = comb(abc)
/* strings of length 3. */
+ comb(ab) + comb(bc) + comb(ac)
/* strings of length 2. */
+ comb(a) + comb(b) + comb(c)
/* strings of length 1. */
+ comb(‘’)
/* empty string. */
- Note that the number of times comb() gets called is 2^n where n is the length of the input string. We can visualize this using bits. For a 3-character string, invocation of comb() with different input parameters can be represented by 3-bit strings as follows:
abc
ó
111
ab
ó
110
ac
ó
101
a
ó
100
bc
ó
011
b
ó
010
c
ó
001
‘’
ó
000
Thus, we generate bit patterns as shown to represent different strings and then call comb() with these input parameters. Note further that these bit patterns are nothing but numbers 0 to 2^n−1.
- The function findClosure() contains a loop which goes from 2^n−1 to 0 generating 2^n bit patterns representing 2^n strings, as just shown. According to these bit patterns, the function fillBitwise() builds the input string. This input string is then given to function printComb(), which finds all combinations of the input string.
- The complexity of printComb() is O(n!) and it is called 2^n times in the loop. So the complexity of findClosure() is O(n!2^n).
Points to Remember
- We use bitwise operators to check whether a bit is on (1) or off (0). This speeds the processing.
- Note carefully the mapping between the bit pattern of an integer and the character string, and how this mapping helped us build the algorithm.
- Note the reuse of the procedure printComb().
- The complexity of findClosure() is O(n!2^n).
PROBLEM DISTANCE BETWEEN TWO STRINGS
Write a program to find the edit distance between two character strings.
Program
#include #define MAXLEN 80 int findMin(int d1, int d2, int d3) { /* * return min of d1, d2 and d3. */ if(d1 < d2 && d1 < d3) return d1; else if(d1 < d3) return d2; else if(d2 < d3) return d2; else return d3; } int findEditDistance(char *s1, char *s2) { /* * returns edit distance between s1 and s2. */ int d1, d2, d3; if(*s1 == 0) return strlen(s2); if(*s2 == 0) return strlen(s1); if(*s1 == *s2) d1 = findEditDistance(s1+1, s2+1); else d1 = 1 + findEditDistance(s1+1, s2+1); // update. d2 = 1+findEditDistance(s1, s2+1); // insert. d3 = 1+findEditDistance(s1+1, s2); // delete. return findMin(d1, d2, d3); } int main() { char s1[MAXLEN], s2[MAXLEN]; printf("Enter string 1: "); gets(s1); while(*s1) { printf("Enter string 2: "); gets(s2); printf("Edit distance(%s, %s) = %d. ", s1, s2, findEditDistance(s1, s2)); printf("Enter string 1(enter to end): "); gets(s1); } return 0; }
Explanation
- The edit distance between two strings is the minimum number of characters in one string to be updated, inserted, or deleted to get the second string.
Example: The edit distance between ‘abc’ and ‘abd’ is 1, as one character in ‘abc’ needs to be updated to get ‘abd’.
The edit distance between ‘abc’ and ‘bd’ is 2, as one character ‘a’ in ‘abc’ needs to be deleted and one character ‘c’ should be updated to ‘d’ to get ‘bd’. The edit distance between ‘abc’ and ‘abc’ is 0.
- We solve this problem elegantly using recursion. The function findEditDistance(s1, s2) checks whether any of its input strings, s1 and s2, are empty. If so, then it returns the length of the other string as the edit distance. If not, it checks whether the first characters of the two strings match. If they do, then a count d1 is obtained by calling findEditDistance() recursively with inputs s1+1 and s2+1. If the first two characters of s1 and s2 do not match, then it is assumed to be one updation and the count d1 is obtained by adding one (for updation) to the edit distance between strings s1+1 and s2+1. The function then calls itself recursively again to get a count d2 by adding one to the edit distance between strings s1 and s2+1, to account for the deletion of one character from s1 to get s2. A symmetrical thing is done for s2 to get the count d3. After finding d1, d2, and d3, the minimum of the three counts is the edit distance between s1 and s2. The function findMin() does this job.
- Since for each character in s1, the function calls itself recursively 3 times, the complexity can be calculated using the following recurrence relation:
T(n) = 3*T(n−1)
where n is the minimum of the two lengths of the strings. Solving this recurrence relation gives us the complexity O(3^n).
Points to Remember
- Edit distance between two strings is the minimum number of insertions, deletions, or updations required in one string to get the other string.
- Messages over a noisy channel can be compared with some approximation using the edit distance technique. This technique is also useful in voice and image recognition.
- The complexity of findEditDistance() is O(3^n) where n is the minimum of the lengths of the two strings. The factor of 3 comes into the picture because for approximately each character, the function calls itself 3 times. To see how enormously this exponential complexity grows, try inputs to this program in increasing order of lengths.
PROBLEM FINDING THE MAXIMUM MATCHING PATTERN IN THE STRING
Find the maximum matching pattern in an input string. Note that the matching pattern may be separated by some other patterns.
Program
#include #define MAXLEN 80 void findMaxPat(char *s, char *pat, int *maxpat) { int i; for(i=0; *s && *pat; ++s, ++i) if(*s == *pat) *maxpat++=i, pat++; if(!*pat) printf("whole pat found. "); else printf("whole pat NOT found. "); *maxpat = −1; // end of maxpat. } void printMaxPat(char *s, int *maxpat) { char *sptr = s; puts(s); for(; *sptr && *maxpat != −1; ++sptr) { if(sptr-s == *maxpat) { printf("^"); maxpat++; } else printf("%c", ' '); } printf(" "); } int main() { char s[MAXLEN]; char pat[MAXLEN]; int maxpat[MAXLEN]; printf("Enter main string: "); gets(s); while(*s) { printf("Enter pattern to be searched: "); gets(pat); findMaxPat(s, pat, maxpat); printMaxPat(s, maxpat); printf("Enter main string: "); gets(s); } return 0; }
Explanation
- This program finds the maximum match of a pattern in a string. The characters matching the pattern in the string may be separated by other characters which are of no interest to us. Thus the job of this program is like a noise disposal parser that parses a valid syntactic entity separated by noise. We restrict ourselves to one string and one pattern.
- Example: Let the input string be ‘hello world’ and the matching pattern be ‘lord’. Then the program searches for each character in the pattern in the input string and marks each matching character as follows:
h e l l o w o r l d ^ ^ ^ ^.
- main() iterates and asks for the input string and pattern until the input string is empty. It then calls findMaxPat() to find indices of characters in the input string that match characters in the input pattern. This array of indices is then passed to printMaxPat(), which marks the indexed characters.
- The complexity of findMaxPat() is O(n) where n is the length of the input string.
Points to Remember
- Noise disposal parsing is useful in parsing languages such as English. It is also useful in filtering of data (signal).
- The procedure findMaxPat() can be useful in approximate pattern- matching algorithms.
- The complexity of findMaxPat() is O(n) where n is the length of the input string.
PROBLEM IMPLEMENTATION OF THE SOUNDEX FUNCTION
Write a function to compare two strings using the soundex method.
Program
#include #define MAXLEN 80 #define NALPHA 26 char *soundexGroups[] = { "aeiouhyw", "kcgjqsxz", "td", "bpfv", "l", "mn", "r" }; int soundexCodes[NALPHA]; void soundexInit() { /* * build an inverted index from the global table soundexGroups[]. * the inverted index is stored in global soundexCodes[]. */ int i; char *sptr; for(i=sizeof(soundexGroups)/sizeof(char *)−1; i>=0; -i) for(sptr=soundexGroups[i]; *sptr; ++sptr) soundexCodes[*sptr-'a'] = i; } int compareCodes(int *soundex1, int *soundex2) { int *ptr1, *ptr2; for(ptr1=soundex1, ptr2=soundex2; *ptr1!=−1 && *ptr2!=−1 && *ptr1==*ptr2; ++ ptr1, ++ptr2) ; return *ptr1 == *ptr2; } void findSoundex(char *s, int *soundex, char lastchar) { /* * find the soundex code for s and save in soundex. * the stored value is the index in the array of soundex codes. * function is recursive. * start by changing multiple occurrences of chars in consecutive positions * by single occurrences. * end soundex by −1. * lastchar == −1 implies this is the first call to this function. */ if(!*s) *soundex = −1, lastchar = 0; else if(*s == lastchar) findSoundex(s+1, soundex, lastchar); else if(lastchar == −1) // *s is the first char. *soundex=soundexCodes[*s-'a'], findSoundex(s+1, soundex+1, *s); else if(soundexCodes[*s-'a'] == 0) // vowel group. findSoundex(s+1, soundex, *s); else *soundex=soundexCodes[*s-'a'], findSoundex(s+1, soundex+1, *s); } int compareSoundex(char *s1, char *s2) { /* * find soundex codes for s1 and s2. * return 1 if codes are equal else 0. */ int soundex1[MAXLEN], soundex2[MAXLEN]; findSoundex(s1, soundex1, −1); findSoundex(s2, soundex2, −1); return compareCodes(soundex1, soundex2); } int main() { char s1[MAXLEN]; char s2[MAXLEN]; soundexInit(); printf("Enter string 1: "); gets(s1); while(*s1) { printf("Enter string 2: "); gets(s2); printf("(%s == %s) = %d. ", s1, s2, compareSoundex(s1, s2)); printf("Enter string 1(enter to end): "); gets(s1); } return 0; }
Explanation
- Soundex is a technique in phonetics used to compare various phonetic elements. This technique is useful to compare voices. The soundex scheme can also help in correcting an incorrect phonetic element against its dictionary.
- The soundex scheme groups similar-sounding characters in one group. When applied to the English alphabet, the groups are as follows.
Group Number
Characters
0
aeiouhwy
1
bpfv
2
cgjkqsxz
3
dt
4
l
5
mn
6
r
Thus, the soundex code for the word ‘cross’ is ‘26022’. By soundex, the words ‘think’ and ‘thing’ will be the same as they both have the same soundex code, ‘30052’. Note that they both sound similar.
- The results become more interesting if we do the following:
- The leading character is retained.
- The consecutive duplicate characters are changed to a single character.
- The vowels' group is dropped.
Thus, the new soundex code for ‘cross’ is ‘262’ and the one for ‘alpha’ is ‘041’.
- The function soundexInit() builds an inverted index soundexCodes[] from the soundex groups in soundexGroups[]. The function compareSoundex(s1, s2) compares strings s1 and s2 using the soundex scheme. It uses functions findSoundex() to find the soundex code for a string and compareCodes() to compare the two soundex codes found. A soundex code of a string is simply an array of integers, so the function compareCodes() is straightforward. The function compareSoundex(s1, s2) returns TRUE if s1 and s2 are equal by soundex, otherwise it returns FALSE.
- The function findSoundex() finds the soundex code for its input string. The code is terminated by −1. The function is recursive and goes characterwise. The input variable lastchar is used to remove multiple consecutive occurrences of characters. If the soundex group is 0, it is not added to the code.
Points to Remember
- The soundex scheme is used to compare phonetically equal strings. This is useful in voice recognition.
- Building an inverted index such as soundCodes[] proves to be more efficient than using soundexGroups[] directly.
- There are additional rules depending on language, dialect, and accents.
Категории