The Art of Assembly Language

5.3 Character Sets

Like strings, character sets are another composite data type built upon the character data type. A character set is a mathematical set of characters. Membership in a set is a binary relation. A character is either in the set or it is not in the set; you cannot have multiple copies of the same character in a character set. Furthermore, the concept of sequence (whether one character comes before another, as in a string) is foreign to a character set. If two characters are members of a set, their order in the set is irrelevant.

Table 5-3 lists some of the more common character set functions to give you an idea of the types of operations applications typically perform on character sets.

Table 5-3: Common Character Set Functions

| Function/Operator | Description |
|---|---|
| Membership (IN) | Checks to see if a character is a member of a character set (returns true/false). |
| Intersection | Returns the intersection of two character sets (that is, the set of characters that are members of both sets). |
| Union | Returns the union of two character sets (that is, all the characters that are members of either set or both sets). |
| Difference | Returns the difference of two sets (that is, those characters in one set that are not in the other). |
| Extraction | Extracts a single character from a set. |
| Subset | Returns true if one character set is a subset of another. |
| Proper subset | Returns true if one character set is a proper subset of another. |
| Superset | Returns true if one character set is a superset of another. |
| Proper superset | Returns true if one character set is a proper superset of another. |
| Equality | Returns true if one character set is equal to another. |
| Inequality | Returns true if one character set is not equal to another. |

5.3.1 Powerset Representation of Character Sets

There are many different ways to represent character sets. Several languages implement character sets using an array of Boolean values (one Boolean value for each possible character code). Each Boolean value determines whether its corresponding character is a member of the character set: true indicates that the character is a member of the set; false indicates that it is not. To conserve memory, most character set implementations allocate only a single bit for each possible character code; therefore, such character sets consume 16 bytes (128 bits) of memory when supporting 128 characters, or 32 bytes (256 bits) when supporting up to 256 possible characters. This representation of a character set is known as a powerset.

The HLA language uses an array of 16 bytes to represent the 128 possible ASCII characters. This array of 128 bits is organized in memory as shown in Figure 5-5.

Figure 5-5: HLA character set representation

Bit zero of byte zero corresponds to ASCII code zero (the NUL character). If this bit is one, then the character set contains the NUL character; if this bit is zero, then the character set does not contain the NUL character. Likewise, bit one of byte eight corresponds to ASCII code 65 (8 × 8 + 1), an uppercase A. Bit 65 contains a one if A is currently a member of the character set; it contains a zero if A is not a member of the set.
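The bit numbering just described can be sketched in C. This is only an illustration of the layout, not HLA's own library code; the `CharSet` type and the `cset_*` function names are hypothetical. The byte index is the character code divided by 8, and the bit within that byte is the remainder.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical type: 128 bits, one per ASCII code, packed into 16 bytes. */
typedef struct {
    uint8_t bits[16];
} CharSet;

/* Make the set empty by clearing all 128 bits. */
void cset_clear(CharSet *s) {
    memset(s->bits, 0, sizeof s->bits);
}

/* Add code c (0..127) to the set; codes above 127 are ignored. */
void cset_add(CharSet *s, unsigned char c) {
    if (c < 128) {
        s->bits[c / 8] |= (uint8_t)(1u << (c % 8));
    }
}

/* Remove code c from the set; codes above 127 are ignored. */
void cset_remove(CharSet *s, unsigned char c) {
    if (c < 128) {
        s->bits[c / 8] &= (uint8_t)~(1u << (c % 8));
    }
}

/* Membership test: true if the bit for code c is one. */
bool cset_contains(const CharSet *s, unsigned char c) {
    return c < 128 && ((s->bits[c / 8] >> (c % 8)) & 1u) != 0;
}
```

With this layout, adding an uppercase A (code 65) to an empty set leaves byte 8 holding the value 2, that is, only bit one is set.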

Pascal (for example, Delphi/Kylix) uses a similar scheme to represent character sets. Delphi allows up to 256 characters in a character set, so Delphi/Kylix character sets consume 256 bits (or 32 bytes) of memory.

While there are other possible ways to implement character sets, this bit vector (array) implementation has the advantage that it makes set operations like union, intersection, difference, comparison, and membership tests very easy to implement.
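To see why these operations are easy on a bit vector, consider the following C sketch. It assumes a 16-byte (128-bit) set like the HLA layout described earlier; the `CharSet` type and all function names are hypothetical and chosen only for illustration. Each operation reduces to a bitwise OR, AND, or AND-NOT applied byte by byte.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define CSET_BYTES 16  /* 128 bits, one per ASCII code (assumed size) */

typedef struct {
    uint8_t bits[CSET_BYTES];
} CharSet;

/* Helper for examples: build a set from the characters of a C string. */
CharSet cset_from(const char *chars) {
    CharSet s;
    memset(s.bits, 0, sizeof s.bits);
    for (; *chars != '\0'; ++chars) {
        unsigned char c = (unsigned char)*chars;
        if (c < 128) s.bits[c / 8] |= (uint8_t)(1u << (c % 8));
    }
    return s;
}

/* Union: a bit is set if it is set in a OR in b. */
CharSet cset_union(CharSet a, CharSet b) {
    for (int i = 0; i < CSET_BYTES; ++i) a.bits[i] |= b.bits[i];
    return a;
}

/* Intersection: a bit is set only if it is set in both a AND b. */
CharSet cset_intersection(CharSet a, CharSet b) {
    for (int i = 0; i < CSET_BYTES; ++i) a.bits[i] &= b.bits[i];
    return a;
}

/* Difference: bits set in a that are NOT set in b. */
CharSet cset_difference(CharSet a, CharSet b) {
    for (int i = 0; i < CSET_BYTES; ++i) a.bits[i] &= (uint8_t)~b.bits[i];
    return a;
}

/* Equality: the two bit arrays are identical. */
bool cset_equal(CharSet a, CharSet b) {
    return memcmp(a.bits, b.bits, CSET_BYTES) == 0;
}

/* Subset: no bit is set in a that is clear in b. */
bool cset_subset(CharSet a, CharSet b) {
    for (int i = 0; i < CSET_BYTES; ++i) {
        if ((a.bits[i] & ~b.bits[i]) != 0) return false;
    }
    return true;
}
```

A proper subset test could combine the last two functions (subset and not equal), and the superset tests are the subset tests with the arguments swapped.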

5.3.2 List Representation of Character Sets

Sometimes a powerset bitmap just isn't the right representation for a character set. For example, if your sets are always very small (no more than three or four members), using 16 or 32 bytes to represent such a set can be overkill. For very small sets, using a character string to represent a list of characters is probably the best way to go. [10] If you rarely have more than a few characters in a set, scanning through a string to locate a particular character is probably efficient enough for most applications.
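A minimal C sketch of the string-based approach follows. The `SmallCharSet` type, its capacity of eight members, and the function names are assumptions made for this example. Because a NUL-terminated string is used, the NUL character itself cannot be a member in this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SMALL_SET_MAX 8  /* assumed capacity, for illustration only */

typedef struct {
    char members[SMALL_SET_MAX + 1];  /* NUL-terminated, no duplicates */
} SmallCharSet;

/* Start with an empty set (an empty string). */
void small_init(SmallCharSet *s) {
    s->members[0] = '\0';
}

/* Membership: a linear scan of the string. */
bool small_contains(const SmallCharSet *s, char c) {
    return c != '\0' && strchr(s->members, c) != NULL;
}

/* Add c only if it is absent, which preserves set semantics.
   Returns false if c is NUL or the set is already full. */
bool small_add(SmallCharSet *s, char c) {
    if (c == '\0') return false;
    if (small_contains(s, c)) return true;  /* already a member */
    size_t n = strlen(s->members);
    if (n >= SMALL_SET_MAX) return false;
    s->members[n] = c;
    s->members[n + 1] = '\0';
    return true;
}
```

The duplicate check in `small_add` is what keeps the string behaving like a set; without it, the same character could appear more than once.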

On the other hand, if your character set has a large number of possible characters, then the powerset representation could become quite large (for example, a 16-bit Unicode character set, with 65,536 possible codes, would require 8,192 bytes of memory as a powerset). For these reasons (and more), the powerset representation isn't always the best. A list or character string representation could be more appropriate in such situations.
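The memory figures quoted in this section all follow from the same arithmetic: one bit per possible character code, rounded up to whole bytes. The hypothetical helper below, written in C, makes the calculation explicit; the 8,192-byte figure corresponds to 65,536 possible codes.

```c
#include <stddef.h>

/* Bytes needed for a bitmap holding one bit per possible character code,
   rounded up to a whole number of bytes. */
size_t powerset_bytes(size_t code_count) {
    return (code_count + 7) / 8;
}
```

For 128, 256, and 65,536 possible codes, this yields 16, 32, and 8,192 bytes respectively.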

[10] Though it is up to you to ensure that the character string maintains set semantics. That is, you never allow duplicate characters in such a string.
