Appendix Character Sets

The first 128 Unicode characters (that is, characters 0 through 127) are identical to the ASCII character set. 32 is the ASCII space; therefore, 32 is the Unicode space. 33 is the ASCII exclamation point; therefore, 33 is the Unicode exclamation point, and so on. Table A-1 lists this character set.
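The identity between ASCII and the first 128 Unicode values can be checked from Java. The following is a minimal sketch, not a definitive test; the class name AsciiUnicodeDemo and the method sameInAsciiAndUnicode are illustrative assumptions, not part of any library.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; the names here are illustrative only.
class AsciiUnicodeDemo {

    // Returns true if every character of text lies in the range 0-127, so that
    // its US-ASCII byte, its UTF-8 byte, and its Unicode value all agree.
    static boolean sameInAsciiAndUnicode(String text) {
        byte[] ascii = text.getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        if (!Arrays.equals(ascii, utf8)) {
            return false; // at least one character fell outside ASCII
        }
        for (int i = 0; i < text.length(); i++) {
            if (ascii[i] != text.charAt(i)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println((int) ' ');  // prints 32
        System.out.println((char) 33);  // prints !
        System.out.println(sameInAsciiAndUnicode("Hello, world!")); // prints true
    }
}
```

A string containing any character above 127 makes the method return false, because its UTF-8 form no longer matches its ASCII form.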

Table A-1. The first 128 Unicode characters and the ASCII character set

Code  Character                        Code  Character  Code  Character  Code  Character
0     NUL (null)                       32    space      64    @          96    `
1     SOH (start of header)            33    !          65    A          97    a
2     STX (start of text)              34    "          66    B          98    b
3     ETX (end of text)                35    #          67    C          99    c
4     EOT (end of transmission)        36    $          68    D          100   d
5     ENQ (enquiry)                    37    %          69    E          101   e
6     ACK (acknowledge)                38    &          70    F          102   f
7     BEL (bell)                       39    '          71    G          103   g
8     BS (backspace)                   40    (          72    H          104   h
9     TAB (tab)                        41    )          73    I          105   i
10    LF (linefeed)                    42    *          74    J          106   j
11    VTB (vertical tab)               43    +          75    K          107   k
12    FF (formfeed)                    44    ,          76    L          108   l
13    CR (carriage return)             45    -          77    M          109   m
14    SO (shift out)                   46    .          78    N          110   n
15    SI (shift in)                    47    /          79    O          111   o
16    DLE (data link escape)           48    0          80    P          112   p
17    DC1 (device control 1, XON)      49    1          81    Q          113   q
18    DC2 (device control 2)           50    2          82    R          114   r
19    DC3 (device control 3, XOFF)     51    3          83    S          115   s
20    DC4 (device control 4)           52    4          84    T          116   t
21    NAK (negative acknowledge)       53    5          85    U          117   u
22    SYN (synchronous idle)           54    6          86    V          118   v
23    ETB (end of transmission block)  55    7          87    W          119   w
24    CAN (cancel)                     56    8          88    X          120   x
25    EM (end of medium)               57    9          89    Y          121   y
26    SUB (substitute)                 58    :          90    Z          122   z
27    ESC (escape)                     59    ;          91    [          123   {
28    IS4 (file separator)             60    <          92    \          124   |
29    IS3 (group separator)            61    =          93    ]          125   }
30    IS2 (record separator)           62    >          94    ^          126   ~
31    IS1 (unit separator)             63    ?          95    _          127   DEL (delete)

In the first column, characters 0 through 31 are referred to as control characters because they were traditionally entered by holding down the Control key and a letter key (on at least some dumb terminals). For instance, Ctrl-H is often ASCII 8, backspace. Ctrl-S is often mapped to ASCII 19, DC3 (XOFF). Ctrl-Q is often mapped to ASCII 17, DC1 (XON). Generally, each control character is entered by pressing the Control key and the printable character whose ASCII value is the ASCII value of the character you want plus 64 (or plus 96, if you count from the lowercase letters). Character 127, delete, is also a control character.
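The offset rule for control keys can be written as a small calculation. This is a sketch of the traditional terminal convention only; the class ControlKeys and the method controlCode are hypothetical names, and real keyboards and terminal emulators may map keys differently.

```java
// Hypothetical helper illustrating the traditional Ctrl-key convention.
class ControlKeys {

    // Returns the ASCII code traditionally produced by Ctrl plus the given key.
    // Keys '@' through '_' (64-95) are offset by 64; lowercase letters by 96.
    static int controlCode(char key) {
        if (key >= '@' && key <= '_') {
            return key - 64;
        }
        if (key >= 'a' && key <= 'z') {
            return key - 96;
        }
        throw new IllegalArgumentException("No control character for key: " + key);
    }

    public static void main(String[] args) {
        System.out.println(controlCode('H')); // prints 8  (BS, backspace)
        System.out.println(controlCode('S')); // prints 19 (DC3, XOFF)
        System.out.println(controlCode('q')); // prints 17 (DC1, XON)
    }
}
```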

The common abbreviation for the character is given first, followed by its common meaning. Some of these codes are pretty much obsolete. For instance, I'm not aware of any modern system that actually uses characters 28 through 31 as file, group, record, and unit separators. Those control codes that are still used often have different meanings on different platforms. For example, character 10, the linefeed, originally meant "move the platen on the printer up one line," while character 13, the carriage return, meant "return the print-head to the beginning of the line." On paper-based teletype terminals, this could be used to position the print-head anywhere on a page and perhaps overtype characters that had already been typed. This no longer makes sense in an era of glass terminals and GUIs, so linefeed has come to mean a generic end-of-line character.

The next 128 Unicode characters (that is, 128 through 255) have the same values as the equivalent characters in the Latin-1 character set defined in ISO standard 8859-1. Latin-1, a slight variation of which is used by Windows, adds the various accented characters, umlauts, cedillas, upside-down question marks, and other characters needed to write text in most Western European languages. Table A-2 shows these characters. The first 128 characters in Latin-1 are the ASCII characters shown in Table A-1.

Table A-2. Unicode characters between 128 and 255, also the second half of the ISO 8859-1 Latin-1 character set

Code  Character                                      Code  Character               Code  Character  Code  Character
128   PAD (padding character)                        160   non-breaking space      192   À          224   à
129   HOP (high octet preset)                        161   ¡                       193   Á          225   á
130   BPH (break permitted here)                     162   ¢                       194   Â          226   â
131   NBH (no break here)                            163   £                       195   Ã          227   ã
132   IND (index)                                    164   ¤                       196   Ä          228   ä
133   NEL (next line)                                165   ¥                       197   Å          229   å
134   SSA (start of selected area)                   166   ¦                       198   Æ          230   æ
135   ESA (end of selected area)                     167   §                       199   Ç          231   ç
136   HTS (character tabulation set)                 168   ¨                       200   È          232   è
137   HTJ (character tabulation with justification)  169   ©                       201   É          233   é
138   VTS (line tabulation set)                      170   ª                       202   Ê          234   ê
139   PLD (partial line forward)                     171   «                       203   Ë          235   ë
140   PLU (partial line backward)                    172   ¬                       204   Ì          236   ì
141   RI (reverse line feed)                         173   soft (optional) hyphen  205   Í          237   í
142   SS2 (single-shift two)                         174   ®                       206   Î          238   î
143   SS3 (single-shift three)                       175   ¯                       207   Ï          239   ï
144   DCS (device control string)                    176   ° (degree)              208   Ð          240   ð
145   PU1 (private use one)                          177   ±                       209   Ñ          241   ñ
146   PU2 (private use two)                          178   ²                       210   Ò          242   ò
147   STS (set transmit state)                       179   ³                       211   Ó          243   ó
148   CCH (cancel character)                         180   ´                       212   Ô          244   ô
149   MW (message waiting)                           181   µ                       213   Õ          245   õ
150   SPA (start of guarded area)                    182   ¶                       214   Ö          246   ö
151   EPA (end of guarded area)                      183   ·                       215   ×          247   ÷
152   SOS (start of string)                          184   ¸ (cedilla)             216   Ø          248   ø
153   SGI (single graphic character introducer)      185   ¹                       217   Ù          249   ù
154   SCI (single character introducer)              186   º                       218   Ú          250   ú
155   CSI (control sequence introducer)              187   »                       219   Û          251   û
156   ST (string terminator)                         188   ¼                       220   Ü          252   ü
157   OSC (operating system command)                 189   ½                       221   Ý          253   ý
158   PM (privacy message)                           190   ¾                       222   Þ          254   þ
159   APC (application program command)              191   ¿                       223   ß          255   ÿ
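The correspondence between Latin-1 byte values and Unicode values can be confirmed by decoding each byte as ISO 8859-1 and comparing the result with the byte's numeric value. This is a minimal sketch; the class Latin1Demo and the method latin1ToUnicode are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class Latin1Demo {

    // Decodes a single byte value (0-255) as ISO 8859-1 and returns
    // the Unicode value of the resulting character.
    static int latin1ToUnicode(int byteValue) {
        byte[] one = { (byte) byteValue };
        return new String(one, StandardCharsets.ISO_8859_1).charAt(0);
    }

    public static void main(String[] args) {
        boolean allMatch = true;
        for (int i = 0; i <= 255; i++) {
            if (latin1ToUnicode(i) != i) {
                allMatch = false;
            }
        }
        System.out.println(allMatch);             // prints true
        System.out.println(latin1ToUnicode(233)); // prints 233, the code for é
    }
}
```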

Characters 128 through 159 are nonprinting control characters, much like characters 0 through 31 of the ASCII set. Unicode does not specify any meanings for these 32 characters, but their common interpretations are listed in the table. On Windows, most of these positions are used for noncontrol characters not included in Latin-1. These alternate interpretations are given in Table A-3.

Table A-3. Windows characters between 128 and 159

Code  Character   Code  Character   Code  Character   Code  Character
128   €           136   ˆ           144   undefined   152   ˜
129   undefined   137   ‰           145   ‘           153   ™
130   ‚           138   Š           146   ’           154   š
131   ƒ           139   ‹           147   “           155   ›
132   „           140   Œ           148   ”           156   œ
133   …           141   undefined   149   •           157   undefined
134   †           142   Ž           150   –           158   ž
135   ‡           143   undefined   151   —           159   Ÿ
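The difference between the two interpretations of positions 128 through 159 shows up when the same byte is decoded with each character set. The sketch below assumes the Java runtime supplies a charset named windows-1252, which is common but is not one of the charsets every Java platform is required to support, so the code checks first. The class Cp1252Demo and the method decode are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the names here are illustrative only.
class Cp1252Demo {

    // Decodes a single byte value (0-255) with the given charset and
    // returns the Unicode value of the first resulting character.
    static int decode(int byteValue, Charset charset) {
        return new String(new byte[] { (byte) byteValue }, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // In Latin-1, byte 140 is the control character PLU.
        System.out.println(decode(140, StandardCharsets.ISO_8859_1)); // prints 140

        // Assumption: windows-1252 is available on this runtime.
        if (Charset.isSupported("windows-1252")) {
            Charset cp1252 = Charset.forName("windows-1252");
            // On Windows, byte 140 is the ligature Œ, Unicode character 338.
            System.out.println(decode(140, cp1252)); // prints 338
        }
    }
}
```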

Values beyond 255 encode characters from various other character sets. Where possible, character blocks describing a particular group of characters map onto established encodings for that set of characters by simple transposition. For instance, Unicode characters 884 through 1011 encode the Greek alphabet and associated characters like the Greek question mark (;). For the Greek letters, this is a direct transposition by 720 of the corresponding characters in the upper half (128 through 255) of the ISO 8859-7 character set, which is in turn based on the Greek national standard ELOT 928. For example, the small letter delta, δ, Unicode character 948, is ISO 8859-7 character 228. A small epsilon, ε, Unicode character 949, is ISO 8859-7 character 229. In general, the Unicode value for a Greek letter equals the ISO 8859-7 value for the letter plus 720. Other character sets are included in Unicode in a similar fashion whenever possible.
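The transposition for the two letters named above can be worked through in code. This is a sketch of the arithmetic only; the class GreekOffset and the method greekToUnicode are hypothetical names, and the offset is assumed to apply only to ISO 8859-7 positions that hold Greek letters, not to every position in the upper half of that set.

```java
// Hypothetical helper illustrating the 720 offset for Greek letters.
class GreekOffset {

    static final int OFFSET = 720;

    // Adds the offset to an ISO 8859-7 code for a Greek letter.
    // Assumption: the caller passes a position that holds a Greek letter.
    static char greekToUnicode(int iso88597Code) {
        return (char) (iso88597Code + OFFSET);
    }

    public static void main(String[] args) {
        System.out.println((int) greekToUnicode(228)); // prints 948, small delta
        System.out.println((int) greekToUnicode(229)); // prints 949, small epsilon
    }
}
```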

As much as I'd like to include complete tables for all Unicode characters, if I did so, this book would be little more than that table. For complete lists of all the Unicode characters and associated glyphs, the canonical reference is The Unicode Standard Version 4.0 by the Unicode Consortium, ISBN 0-321-18578-1. Updates to that book can be found at http://www.unicode.org/. Online charts can be found at http://unicode.org/charts.

About the Author

Elliotte Rusty Harold is originally from New Orleans, to which he returns periodically in search of a decent bowl of gumbo. However, he currently resides in the Prospect Heights neighborhood of Brooklyn with his wife, Beth, and cats Charm (named after the quark) and Marjorie (named after his mother-in-law). He's an adjunct professor of computer science at Polytechnic University, where he teaches Java, XML, and object-oriented programming. His Cafe au Lait web site (http://www.cafeaulait.org) is one of the most popular independent Java sites on the Internet, and his spin-off site, Cafe con Leche (http://www.cafeconleche.org), has become one of the most popular XML sites. He's currently working on the XOM library for XML, the Jaxen XPath engine, and the Amateur media player. His previous books include Java Network Programming (O'Reilly) and Processing XML with Java (Addison-Wesley).