From: myrkraverk@users.sourceforge.net
Date: Thu Sep 25 2003 - 20:53:07 EDT
Hi,
In a plain text environment, there is often a need to encode more than
just the plain character.  A console, or terminal emulator, is such an
environment.  Therefore I propose the following as a technical report
for internal encoding of Unicode characters, with one goal in mind:
character equality is binary equality.
Since I'm using 64 bits, I call it Excessive Memory Usage Encoding, or
EMUE.
I thought of dividing the 64-bit code space into 32 variably wide
plains, one for control characters, one for Latin characters, one for
Han characters, and so on.  The plain number takes the top 5 bits, and
the next 3 bits are fixed to zero (for future expansion and alignment
to an octet).
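The plain selector described above can be sketched as follows.  This is
only an illustration: the constant and function names are made up here,
and the sole facts taken from the proposal are that the plain occupies
bits 63..59 and that bits 58..56 are zero.

```python
# Sketch only: names are hypothetical; bit positions follow the text above.
PLAIN_SHIFT = 59   # the plain number occupies bits 63..59
PLAIN_MASK = 0x1F  # 5 bits, so 32 plains
ZERO_SHIFT = 56    # bits 58..56 are fixed to zero
ZERO_MASK = 0x7


def plain_of(code: int) -> int:
    """Return the plain number (0..31) of a 64-bit EMUE code."""
    return (code >> PLAIN_SHIFT) & PLAIN_MASK


def is_well_formed(code: int) -> bool:
    """True if the code fits in 64 bits and the three zero bits are clear."""
    return 0 <= code < (1 << 64) and ((code >> ZERO_SHIFT) & ZERO_MASK) == 0
```

Checking the zero bits on input keeps them free for the future expansion
mentioned above.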
I call plain 0 control characters and won't discuss it further.
I had intended plain 1 for Latin characters, with the following
encoding method in mind:
bits 63..59  58..56 55..40 39..32 31..24 23..16 15..8  7..0
    +-------+------+------+------+------+------+------+------+
    | plain | zero | attr | res  | uacc | lacc | res  | char |
    +-------+------+------+------+------+------+------+------+
* Plain     Plain                    (5 bits)
* Zero      Zero bits                (3 bits)
* Attr      Attributes               (16 bits)
* Res       Reserved                 (8 bits)
* Uacc      Upper Accent             (8 bits)
* Lacc      Lower Accent             (8 bits)
* Res       Reserved                 (8 bits)
* Char      Character                (8 bits)
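A minimal sketch of packing and unpacking this layout, assuming the
reserved fields are simply left at zero.  The function names are
hypothetical, and since the field contents are implementation defined,
only the bit positions come from the diagram.

```python
# Sketch only: field positions follow the plain 1 diagram; names are made up.
LATIN_FIELDS = {
    # name: (shift, width in bits)
    "plain": (59, 5),
    "attr": (40, 16),
    "uacc": (24, 8),
    "lacc": (16, 8),
    "char": (0, 8),
}


def pack_latin(attr: int, uacc: int, lacc: int, char: int, plain: int = 1) -> int:
    """Pack the fields into one 64-bit integer; reserved bits stay zero."""
    values = {"plain": plain, "attr": attr, "uacc": uacc, "lacc": lacc, "char": char}
    code = 0
    for name, (shift, width) in LATIN_FIELDS.items():
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
        code |= value << shift
    return code


def unpack_latin(code: int) -> dict:
    """Split a 64-bit code back into its named fields."""
    return {
        name: (code >> shift) & ((1 << width) - 1)
        for name, (shift, width) in LATIN_FIELDS.items()
    }
```

For example, `pack_latin(attr=0, uacc=1, lacc=0, char=ord("e"))` puts 1 in
the plain field, 1 in the upper accent field, and 0x65 in the character
field; what accent number 1 means is left to the implementation.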
All of these fields are actually implementation defined, with just one
rule for char: don't include characters that can be made with
combinations; that's what the accent fields are for.  This allows for
255 upper and 255 lower accents, which should be enough -- for now.
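To illustrate the goal that character equality is binary equality, here
is a rough sketch that maps Unicode text onto the plain 1 layout.  The
accent tables and their numbering are invented for this example, as is
the choice of using the base letter's code point for the char field; the
proposal leaves all of that to the implementation.

```python
import unicodedata

# Hypothetical accent numbering; 0 is assumed to mean "no accent".
UPPER_ACCENTS = {"\u0301": 1, "\u0300": 2}  # combining acute, combining grave
LOWER_ACCENTS = {"\u0327": 1}               # combining cedilla


def encode_latin(text: str) -> int:
    """Encode one base letter plus optional combining marks as a plain 1 code."""
    # NFD splits precomposed letters into a base letter and combining marks.
    base, *marks = unicodedata.normalize("NFD", text)
    uacc = lacc = 0
    for mark in marks:
        if mark in UPPER_ACCENTS:
            uacc = UPPER_ACCENTS[mark]
        elif mark in LOWER_ACCENTS:
            lacc = LOWER_ACCENTS[mark]
        else:
            raise ValueError(f"no accent slot for U+{ord(mark):04X}")
    if ord(base) > 0xFF:
        raise ValueError("base character does not fit in the 8-bit char field")
    # plain 1 in bits 63..59, uacc in 31..24, lacc in 23..16, char in 7..0
    return (1 << 59) | (uacc << 24) | (lacc << 16) | ord(base)


# Precomposed U+00E9 and "e" followed by U+0301 give the same 64-bit value.
same = encode_latin("\u00e9") == encode_latin("e\u0301")
```

Because the accents live in fixed fields rather than in a trailing
sequence of combining marks, comparing two characters comes down to
comparing two integers.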
For Han characters I thought of the following encoding method (with no
particular plain in mind):
bits 63..59  58..56 55..40 39..32  31         ..            0
    +-------+------+------+-------+--------------------------+
    | plain | zero | attr | style |          char            |
    +-------+------+------+-------+--------------------------+
* Plain     Plain                    (5 bits)
* Zero      Zero bits                (3 bits)
* Attr      Attributes               (16 bits)
* Style     Stylistic Variation      (8 bits)
* Char      Character                (32 bits)
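The Han layout can be sketched the same way.  The function names are
hypothetical, and because no particular plain has been chosen, the plain
number is passed in by the caller; the value 2 in the usage note below
is only a placeholder.

```python
# Sketch only: bit positions follow the Han diagram; names are made up.
def pack_han(plain: int, attr: int, style: int, char: int) -> int:
    """Pack plain, attr, style and a 32-bit char into one 64-bit code."""
    for name, value, width in (
        ("plain", plain, 5),
        ("attr", attr, 16),
        ("style", style, 8),
        ("char", char, 32),
    ):
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name} does not fit in {width} bits")
    # plain in bits 63..59, attr in 55..40, style in 39..32, char in 31..0
    return (plain << 59) | (attr << 40) | (style << 32) | char


def unpack_han(code: int) -> tuple:
    """Return (plain, attr, style, char) from a 64-bit code."""
    return (
        (code >> 59) & 0x1F,
        (code >> 40) & 0xFFFF,
        (code >> 32) & 0xFF,
        code & 0xFFFFFFFF,
    )
```

For instance, `pack_han(plain=2, attr=0, style=3, char=0x4E2D)` and
`unpack_han` round-trip the four fields; what style 3 selects is up to
the implementation.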
Again, all fields are implementation defined.  For attributes, there
are standardized escape sequences.  Telling something like a terminal
emulator what stylistic variation to use is outside the scope of this
email, but I suspect language tags can be used.
I was also thinking of a plain for punctuation and symbolic characters.
I will be pleased if anyone can come up with better encoding methods
than I did, and I call upon other people to come up with encodings for
scripts I know nothing about, such as Arabic and others.  Then let's
wrap it up in a technical report and be done with it ;)
Any comments?
Johann
-- Sometimes I do not think at all! Does that mean I don't exist in the mean time?