G. Adam Stanislav wrote:
> I am a bit confused about the planes in ISO-10646. Where on the web can I
> find a description of these planes?
A plane is a block of 65536 consecutive code positions. Planes are numbered
from 0, so Plane 0 is 0 to 0xFFFF, Plane 1 is 0x10000 to 0x1FFFF, and so on.
The planes are allocated as follows:

Plane (hex)  Purpose
0            Basic Multilingual Plane (BMP)
1            Archaic and esoteric writing systems
2            Rare and ad-hoc CJK characters
3-D          Reserved
E            IETF language tagging
F-10         Private use
11-7FFF      Almost certainly never going to be used for anything

Neither Unicode 2.x nor Unicode 3.0 (in preparation) assigns any
non-private characters in planes other than 0.
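
The plane arithmetic above is simple enough to sketch in a few lines. This is
only an illustration in Python; the names plane_of, plane_range and in_bmp are
made up for the example and are not part of either implementation mentioned
in this message.

```python
PLANE_SIZE = 0x10000  # 65536 code positions per plane


def plane_of(code_point):
    """Return the number of the plane that contains code_point."""
    if code_point < 0:
        raise ValueError("code point must be non-negative")
    return code_point >> 16  # same as code_point // PLANE_SIZE


def plane_range(plane):
    """Return (first, last) code positions of a plane, both inclusive."""
    first = plane * PLANE_SIZE
    return first, first + PLANE_SIZE - 1


def in_bmp(code_point):
    """True if code_point lies in Plane 0, the Basic Multilingual Plane."""
    return plane_of(code_point) == 0
```

For instance, plane_of(0x10000) gives 1, and plane_range(1) gives
(0x10000, 0x1FFFF), matching the ranges stated above.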
> Are there any algorithms for the implementation of these functions that I
> should be aware of before trying to reinvent the wheel?
Mark Leisher and I have each implemented a common API, which could easily
be front-ended with the C standard API. Mark's implementation reads its
data from a binary file, à la TZ, so it is easy to extend as the Unicode
Standard grows without affecting applications. Mine uses compact (about 6K)
compiled-in tables, built by a Perl script.
Both implementations use X-style licensing, and I suggest you adopt one
or the other, or both, with whatever modifications you want.
Considerable effort has gone into the implementation strategy of both.
Mark's implementation is at
ftp://crl.nmsu.edu/CLR/multiling/unicode/unidata-1.9.tar.gz
with a patch at .../ucdata-1.9.patch1
Mine is at http://www.ccil.org/~cowan/uctype-2.0.tar.gz
with a patch at .../uctype-2.0.1.patch.txt
We built a new API because we wanted to represent all the Unicode
categories, which are a much richer set than the POSIX ones.
See the file THEORY in my implementation.
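
To give a feel for how much richer the Unicode categories are, here is a
small illustration using Python's standard unicodedata module. It does not
use the API of either package above; the coarse_class function and its
mapping are assumptions invented for this example.

```python
import unicodedata


def coarse_class(ch):
    """Collapse a Unicode general category into a rough ctype-style class.

    The mapping is a hypothetical one, chosen only to show how much
    detail is lost when the categories are flattened.
    """
    cat = unicodedata.category(ch)  # two-letter general category, e.g. "Lu"
    if cat == "Nd":
        return "digit"
    if cat.startswith("L"):
        return "alpha"
    if cat.startswith("P"):
        return "punct"
    return "other"


# Upper-, lower- and titlecase letters are distinct categories ...
letters = [unicodedata.category(c) for c in ("A", "a", "\u01C5")]
# ... and so are decimal digits, letter-like numbers and other numbers.
numbers = [unicodedata.category(c) for c in ("5", "\u2167", "\u00BD")]
```

Here letters comes out as Lu, Ll, Lt and numbers as Nd, Nl, No, whereas
coarse_class can only say "alpha" for all three letters and "digit" for the
first of the numbers.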
> Alas, this
> seems an imperfect solution as there is no way of knowing what future
> extensions will be added to ISO-10646 (I have seen quite a number of
> proposals for such extensions on your web site and there, no doubt, will be
> more).
There is no getting away from this problem: Unicode, unlike typical
8-bit coded character sets, is inherently extensible. New characters
will go on being added for years. Mark's implementation assumes
the existence and accessibility of a file; mine is for space-and-speed-tight
situations where you either don't mind compiling or are willing to
live with "unknown character" situations for new characters.
> Is there a better way? Is there a system to this? What I mean is, is there
> some way of knowing that if for example a specific bit in character code is
> set, it is a digit? Or if another bit is set, it is an alphabetic letter?
No. Tables are inevitable, and the question is, how should they be
compacted? My implementation represents a Plane 0 character as a
bit-vector of size 32 (i.e. a long) specifying its Unicode properties.
Then each *distinct* bit-vector (there are fewer than 512 of them)
is stored in a table, so a 9-bit index into the table can
represent the bit-vector. I then generate a run-length-encoding
of successive indexes, using 7 bits for each length, and compute 512
useful offsets (one for every 128-character half-row) into the
run-length-encoding table so that it does not have to be searched very
far.
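
The scheme just described can be sketched roughly as follows. This is a
Python illustration written for this message, not the code in uctype. It
makes two assumptions that the description above does not state: runs are
never allowed to cross a half-row boundary, so each offset points at the
first run of its half-row, and the 7-bit field stores the run length minus
one, so a run can cover 1 to 128 characters. With a 9-bit index and a 7-bit
length, each run fits in 16 bits. The property bits LETTER and DIGIT and
the toy data are hypothetical.

```python
HALF_ROW = 128                         # characters per half-row
NUM_HALF_ROWS = 0x10000 // HALF_ROW    # 512 half-rows in Plane 0


def build_tables(props):
    """Compress 65536 property bit-vectors into three small tables."""
    assert len(props) == 0x10000
    distinct = sorted(set(props))      # table of distinct bit-vectors
    assert len(distinct) <= 512, "index must fit in 9 bits"
    index_of = {vec: i for i, vec in enumerate(distinct)}

    runs = []                          # (9-bit index << 7) | (length - 1)
    offsets = []                       # first run of each half-row
    for half in range(NUM_HALF_ROWS):
        offsets.append(len(runs))
        cp = half * HALF_ROW
        end = cp + HALF_ROW
        while cp < end:
            vec = props[cp]
            length = 1
            while cp + length < end and props[cp + length] == vec:
                length += 1
            runs.append((index_of[vec] << 7) | (length - 1))
            cp += length
    return distinct, runs, offsets


def lookup(tables, code_point):
    """Return the property bit-vector of a Plane 0 character."""
    distinct, runs, offsets = tables
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("only Plane 0 is covered")
    i = offsets[code_point >> 7]       # jump to the half-row's first run
    pos = code_point & 0x7F            # position within the half-row
    while True:
        word = runs[i]
        length = (word & 0x7F) + 1
        if pos < length:
            return distinct[word >> 7]
        pos -= length
        i += 1


# Toy data: only ASCII digits and letters carry any property bits.
LETTER, DIGIT = 1 << 0, 1 << 1         # hypothetical property bits
toy = [0] * 0x10000
for cp in range(0x30, 0x3A):
    toy[cp] = DIGIT
for cp in list(range(0x41, 0x5B)) + list(range(0x61, 0x7B)):
    toy[cp] = LETTER
tables = build_tables(toy)
```

With these tables, lookup(tables, ord("7")) returns DIGIT and
lookup(tables, ord("a")) returns LETTER. A lookup never scans more than the
runs of a single half-row, which is the point of keeping the 512 offsets.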
Feel free to contact me privately for further assistance.
-- 
John Cowan   http://www.ccil.org/~cowan   cowan@ccil.org
You tollerday donsk? N. You tolkatiff scowegian? Nn.
You spigotty anglease? Nnn. You phonio saxo? Nnnn.
Clear all so! 'Tis a Jute.... (Finnegans Wake 16.5)