From: John Cowan (jcowan@reutershealth.com)
Date: Fri Nov 15 2002 - 11:38:48 EST
This is not a proposal to change standards in any respect. It's just a
thought-out (well, somewhat) approach for people who have to represent
character codes as opposed to characters, and have 32 bits to play with.
The intent is to represent all the codes of all the registered character
sets, present and future, as individual unsigned 31-bit integers.
All further numbers in this post, except 94, 96, and 2022, are base 16.
Unicode codes are mapped onto the integers 0-10FFFF in the obvious way.
Registered character sets of ISO 2022 are represented by codes from 20000000 upward.
The detailed roadmap is as follows:
00000000-0010FFFF: Unicode
00110000-1FFFFFFF: reserved
20000000-2003FFFF: ISO 2022 94-char, 96-char, C0, and C1 character sets
20040000-2093FFFF: ISO 2022 94x94/96x96-char character sets
20940000-5693FFFF: ISO 2022 94x94x94/96x96x96-char character sets
56940000-7FFFFFFF: reserved
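For anyone who wants to poke at the roadmap, here is a minimal sketch in Python of a lookup that says which region a given 31-bit value falls in. The names REGIONS and region_of are made up for illustration; only the ranges come from the table above.

```python
# Hypothetical helper: REGIONS and region_of are illustrative names,
# only the numeric ranges are taken from the roadmap.
REGIONS = [
    (0x00000000, 0x0010FFFF, "Unicode"),
    (0x00110000, 0x1FFFFFFF, "reserved"),
    (0x20000000, 0x2003FFFF, "ISO 2022 one-byte sets (94, 96, C0, C1)"),
    (0x20040000, 0x2093FFFF, "ISO 2022 two-byte sets (94x94, 96x96)"),
    (0x20940000, 0x5693FFFF, "ISO 2022 three-byte sets (94x94x94, 96x96x96)"),
    (0x56940000, 0x7FFFFFFF, "reserved"),
]


def region_of(code: int) -> str:
    """Return the roadmap region that contains a 31-bit code value."""
    if not 0 <= code <= 0x7FFFFFFF:
        raise ValueError("code must be an unsigned 31-bit integer")
    for low, high, name in REGIONS:
        if low <= code <= high:
            return name
    # Unreachable as long as the ranges tile the whole 31-bit space.
    raise AssertionError("roadmap ranges should cover every 31-bit value")


if __name__ == "__main__":
    print(region_of(0x41))        # Unicode
    print(region_of(0x20040000))  # ISO 2022 two-byte sets (94x94, 96x96)
```

The ranges are contiguous, so every value from 0 to 7FFFFFFF lands in exactly one region.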
Definitions for ISO 2022 character sets:
Every character set has an ISO-specified value between 40 and 7E, called F.
Some character sets have an ISO-specified value between 21 and 2F, called I.
If I is not present, it is deemed for our purposes to be 20.
Individual characters in one-byte character sets have a value between 20
and 7F, called H.
Individual characters in two-byte character sets have two values between 20
and 7F, called H and L.
Individual characters in three-byte character sets have three values between 20
and 7F, called H, M, and L.
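The definitions above amount to a handful of range checks. A rough sketch in Python follows, assuming nothing beyond the ranges just stated; the function names normalize_i, check_f and check_char_bytes are mine and purely illustrative.

```python
from typing import Optional, Sequence, Tuple


def normalize_i(i: Optional[int]) -> int:
    """Return I, substituting 20 (hex) when the set has no I value."""
    if i is None:
        return 0x20
    if not 0x21 <= i <= 0x2F:
        raise ValueError(f"I out of range 21-2F: {i:#x}")
    return i


def check_f(f: int) -> int:
    """Check that F lies between 40 and 7E (hex), inclusive."""
    if not 0x40 <= f <= 0x7E:
        raise ValueError(f"F out of range 40-7E: {f:#x}")
    return f


def check_char_bytes(values: Sequence[int], expected: int) -> Tuple[int, ...]:
    """Check the H / H,L / H,M,L values of a one-, two- or three-byte set."""
    if len(values) != expected:
        raise ValueError(f"expected {expected} values, got {len(values)}")
    for v in values:
        if not 0x20 <= v <= 0x7F:
            raise ValueError(f"value out of range 20-7F: {v:#x}")
    return tuple(values)
```

Note that this accepts the full 20-7F range for every set, as the definitions do; it does not try to exclude 20 and 7F for 94-char sets.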
Values:
The value of a character in Unicode is its code value.
The value of a character in a 94-char character set
is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H.
The value of a character in a 96-char character set
is 20000000 + (I - 20) * 4000 + (F - 40) * 100 + H + 80.
The value of a character in a 94x94-char or 96x96-char character set
is 20040000 + (I - 20) * 90000 + (F - 40) * 2400 +
(H - 20) * 60 + (L - 20).
The value of a character in a 94x94x94-char or 96x96x96-char character set
is 20940000 + (I - 20) * 3600000 + (F - 40) * D8000 +
(H - 20) * 2400 + (M - 20) * 60 + (L - 20).
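To make the arithmetic concrete, here is a sketch of the ISO 2022 formulas in Python. It reads the two-byte formula with a "+" between the I and F terms and takes (L - 20) as the last term of the three-byte formula, which is what makes the results start exactly at the roadmap boundaries. The function names and the is_96 flag are illustrative only. The examples take F = 42 for ASCII and for JIS X 0208-1983, and F = 41 for the right-hand part of Latin-1, which as far as I recall are their registered final bytes.

```python
ONE_BYTE_BASE = 0x20000000
TWO_BYTE_BASE = 0x20040000
THREE_BYTE_BASE = 0x20940000


def one_byte_value(f: int, h: int, i: int = 0x20, is_96: bool = False) -> int:
    """94-char set: H lands in 20-7F of the set's 100-wide slot.
    96-char set: the extra 80 moves it to A0-FF of the same slot."""
    value = ONE_BYTE_BASE + (i - 0x20) * 0x4000 + (f - 0x40) * 0x100 + h
    return value + 0x80 if is_96 else value


def two_byte_value(f: int, h: int, l: int, i: int = 0x20) -> int:
    """94x94 or 96x96 set: 60 * 60 = 2400 slots per F, 40 * 2400 = 90000 per I."""
    return (TWO_BYTE_BASE + (i - 0x20) * 0x90000 + (f - 0x40) * 0x2400
            + (h - 0x20) * 0x60 + (l - 0x20))


def three_byte_value(f: int, h: int, m: int, l: int, i: int = 0x20) -> int:
    """94x94x94 or 96x96x96 set: 60 * 2400 = D8000 slots per F."""
    return (THREE_BYTE_BASE + (i - 0x20) * 0x3600000 + (f - 0x40) * 0xD8000
            + (h - 0x20) * 0x2400 + (m - 0x20) * 0x60 + (l - 0x20))


if __name__ == "__main__":
    # ASCII 'A' (H = 41) with F = 42
    print(f"{one_byte_value(0x42, 0x41):X}")              # 20000241
    # Latin-1 right half, byte E9 (H = 69), F = 41, 96-char set
    print(f"{one_byte_value(0x41, 0x69, is_96=True):X}")  # 200001E9
    # JIS X 0208 code 3021 (H = 30, L = 21) with F = 42
    print(f"{two_byte_value(0x42, 0x30, 0x21):X}")        # 20044E01
```

With the largest permitted I, F and byte values, each formula still stays inside its roadmap range, since the multipliers leave room for 40 values of F even though only 3F are allowed.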
This scheme was inspired by a related scheme by Markus Kuhn.
--
John Cowan   http://www.ccil.org/~cowan   <jcowan@reutershealth.com>
"Any legal document draws most of its meaning from context. A telegram
that says 'SELL HUNDRED THOUSAND SHARES IBM SHORT' (only 190 bits in
5-bit Baudot code plus appropriate headers) is as good a legal document
as any, even sans digital signature." --me