Efficient Storage and Use of Unicode Property Data
Intended Audience: Software Engineer, Systems Analyst
Session Level: Intermediate
When single-byte character sets ruled the earth, C programmers had a
very limited set of character properties to deal with. ANSI C defined a
primitive set of 11 character typing functions including isalpha(),
iscntrl(), isdigit(), isprint(), islower() and isupper(), along with the
simple case mapping functions toupper() and tolower(). Most runtime
libraries stored each property in a single bit, so an array of 256 entries
was enough to store all the data needed for the iswhatever() functions.
Two more tables of 256 bytes each for upper and lower casing and you had
all you needed for a
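
The single-byte scheme described above can be sketched roughly as follows. This is a minimal illustration, not code from any particular runtime library: the names (sb_props, sb_init, sb_isalpha and so on) and the choice of four flag bits are hypothetical, and only the ASCII range is populated, on the assumption that the execution character set is ASCII-compatible.

```c
/* Hypothetical flag names: one bit per character property. */
enum {
    SB_UPPER = 1 << 0,
    SB_LOWER = 1 << 1,
    SB_DIGIT = 1 << 2,
    SB_CNTRL = 1 << 3
};

/* One 256-entry table of property bits, plus two 256-byte case tables. */
static unsigned char sb_props[256];
static unsigned char sb_upper[256];
static unsigned char sb_lower[256];

/* Fill the tables for the ASCII range only (assumes an ASCII-compatible
   character set). A real library would ship precomputed tables. */
void sb_init(void)
{
    int c;

    for (c = 0; c < 256; c++) {
        sb_props[c] = 0;
        sb_upper[c] = (unsigned char)c;   /* default: maps to itself */
        sb_lower[c] = (unsigned char)c;
    }
    for (c = 0; c < 0x20; c++)
        sb_props[c] |= SB_CNTRL;
    sb_props[0x7F] |= SB_CNTRL;
    for (c = '0'; c <= '9'; c++)
        sb_props[c] |= SB_DIGIT;
    for (c = 'A'; c <= 'Z'; c++) {
        sb_props[c] |= SB_UPPER;
        sb_lower[c] = (unsigned char)(c + ('a' - 'A'));
    }
    for (c = 'a'; c <= 'z'; c++) {
        sb_props[c] |= SB_LOWER;
        sb_upper[c] = (unsigned char)(c - ('a' - 'A'));
    }
}

/* Each test is a single table lookup and a bit mask. */
int sb_isalpha(int c) { return (sb_props[(unsigned char)c] & (SB_UPPER | SB_LOWER)) != 0; }
int sb_isdigit(int c) { return (sb_props[(unsigned char)c] & SB_DIGIT) != 0; }
int sb_iscntrl(int c) { return (sb_props[(unsigned char)c] & SB_CNTRL) != 0; }

/* Case mapping is a plain array index. */
int sb_toupper(int c) { return sb_upper[(unsigned char)c]; }
int sb_tolower(int c) { return sb_lower[(unsigned char)c]; }
```

After a call to sb_init(), sb_isalpha('q') is nonzero and sb_toupper('q') yields 'Q'. The whole data set occupies 768 bytes, which is why the approach was adequate for single-byte character sets.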
Now we have a Unicode Character Database that is somewhat more
complicated. There are 15 properties in the database for Unicode 3.0.
Some of these, such as character name, are seldom used in programming logic,
but most are needed for proper support of Unicode in modern programs. Many
of these properties are not simple on/off bit values. For instance, the
bidirectional property can have 11 different values. The Character
Decomposition property is variable width. And what does isdigit() mean
anyway in the Unicode world?
This paper will describe a method of storing Unicode property data and
character mapping tables in a way that is efficient for both size and
speed. It will also discuss some of the decisions that must be made before
creating such tables. A program for creating optional mapping tables will
also be made available.