From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Feb 22 2011 - 10:26:16 CST
2011/2/22 Doug Ewell <doug@ewellic.org>:
> Now if Cropley's algorithm is being presented as a replacement or
> alternative to UTF-8, then it does need to be evaluated on criteria like
> these, and Suzuki-san's observations become very relevant.
I had posted the same two observations in a prior message.
But I also explained that the BOM-like system was ill-conceived and
unnecessary. You can perfectly well implement code switching without
this hack, and without breaking the UTF requirements.
Cropley has not seen that his scheme leaves room for more separate
codes to do that: it is safe to reuse the surrogates range to encode
such a special function as a 3-byte sequence, specifying that a code
page switch has occurred on the previously encoded character, where
the code page starts or how long it is, and which base page was
altered (if several pages, each smaller than 64 characters, can be
remapped). This requires just a few data bits, and
the 15 bits in the unused surrogates range are ample to specify
this in a single 3-byte function code, without needing any "magic"
table, and to support all evolutions of the standard. And if more bits
are needed, there are still a lot of unused values starting at
0x110000, encodable with 4-byte sequences.
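To make "a few data bits" concrete, here is a minimal sketch of the
arithmetic. It assumes (this is not stated in the proposal) that a
remappable page always starts on a 64-code-point boundary, so that a
switch function only needs to carry a page index. The names
PAGE_SIZE, page_index and page_start are illustrative only.

```python
PAGE_SIZE = 64             # size of the selectable 1-byte page
CODE_SPACE = 0x110000      # number of code points U+0000..U+10FFFF

# Number of distinct 64-aligned pages, and bits needed to name one.
PAGE_COUNT = CODE_SPACE // PAGE_SIZE
BITS_NEEDED = (PAGE_COUNT - 1).bit_length()


def page_index(start):
    """Return the payload naming the page that begins at `start`."""
    if not 0 <= start < CODE_SPACE or start % PAGE_SIZE:
        raise ValueError("page start must be a 64-aligned code point")
    return start // PAGE_SIZE


def page_start(index):
    """Inverse of page_index: first code point of page `index`."""
    if not 0 <= index < PAGE_COUNT:
        raise ValueError("no such page")
    return index * PAGE_SIZE
```

Under that assumption there are 17408 possible pages, so a 15-bit
payload is enough to name any of them; the Cyrillic page starting at
0x0400 would be page 16.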
(However, inserting code switching functions raises problems such as
correctly sizing the target buffer for the worst case to avoid buffer
overflows, something that should not occur if code switching is used
properly, that is, only where it effectively reduces the encoded size.)
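The worst case can be bounded with simple arithmetic. This is only a
sketch under assumptions taken from this thread: no character needs
more than 4 bytes, each switch function costs 3 bytes, and the encoder
caps the number of switches it may emit. The function name and the cap
are hypothetical.

```python
MAX_CHAR_BYTES = 4    # longest character sequence discussed in this thread
SWITCH_BYTES = 3      # a switch function encoded as one 3-byte sequence


def worst_case_size(n_chars, max_switches):
    """Upper bound, in bytes, for encoding n_chars characters with at
    most max_switches code switching functions inserted."""
    if n_chars < 0 or max_switches < 0:
        raise ValueError("counts must be non-negative")
    return n_chars * MAX_CHAR_BYTES + max_switches * SWITCH_BYTES
```

For 1000 characters and at most 10 switches the bound is 4030 bytes.
An encoder that emits a switch only when it shortens the overall
output never exceeds the no-switch bound of 4 bytes per character.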
Yes, there's currently a sync problem with 2-byte encoded characters
(if one byte gets deleted), but these cover a Unicode range
(0x80..0x407F) whose characters extremely rarely occur in very long
uninterrupted runs (this range is used by scripts that also make
abundant use of spaces and ASCII punctuation, in addition to controls
and line breaks), so the need to resynchronize on newlines is already
satisfied.
Note also that if the selected 1-byte encoded range (of 64 characters)
falls within 0x80..0x4080, then part of this range is also encodable
as 2 bytes (but Cropley wanted to exclude this case by forcing the
shortest code). This means that the 2-byte encodable range may extend
to 0x80..0x40BF if the selected page falls anywhere in this range,
so the 3-byte encoded sequences could start at 0x41C0 instead of
0x4180 (not much of an improvement).
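The shift of the 2-byte range can be sketched as follows. The 2-byte
range 0x80..0x407F and the 64-character page come from this message;
the helper names, and the choice to require the whole page to lie
inside the extended range, are assumptions made for illustration.

```python
TWO_BYTE_FIRST = 0x80
TWO_BYTE_COUNT = 0x4000                            # 16384 two-byte codes
BASE_LAST = TWO_BYTE_FIRST + TWO_BYTE_COUNT - 1    # 0x407F
PAGE_SIZE = 64


def page_frees_codes(page):
    """True if the 64-character page starting at `page` lies inside the
    extended 2-byte range, so its 64 two-byte codes can be reused."""
    return (TWO_BYTE_FIRST <= page
            and page + PAGE_SIZE - 1 <= BASE_LAST + PAGE_SIZE)


def two_byte_last(page):
    """Last code point reachable with 2 bytes for the selected page."""
    return BASE_LAST + PAGE_SIZE if page_frees_codes(page) else BASE_LAST


def two_byte_index(cp, page):
    """Index 0..16383 of `cp` among the 2-byte codes, or None if `cp`
    is encoded as 1 byte (in the page) or lies outside the range."""
    if page <= cp < page + PAGE_SIZE:
        return None
    if not TWO_BYTE_FIRST <= cp <= two_byte_last(page):
        return None
    index = cp - TWO_BYTE_FIRST
    if page_frees_codes(page) and cp > page:
        index -= PAGE_SIZE    # skip over the hole left by the page
    return index
```

With the Cyrillic page at 0x0400 selected, two_byte_last returns
0x40BF and the indexes stay contiguous across the hole: 0x03FF maps to
0x37F and 0x0440 to 0x380.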
An alternative could instead use this conditionally unused range of 64
codes (depending on the selected code page) for some extra code
switching functions, or for no-op resync codes (in very long runs
of 2-byte encoded characters).
Another variant could also use the 2-byte encoded range to encode
larger scripts (of up to 4096 characters), using code switching as
well (in that case, there would still be 192 characters encoded as 1
byte, including the ASCII page and the selectable 64-character page).
It could be used for syllabaries or large alphabets (including Nko,
basic CJK ideographs, Hiragana+Katakana, or Hangul in decomposed
jamo form, but also extended Latin, Cyrillic, Arabic), with all other
characters still requiring 3-byte or 4-byte sequences in this case.
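As a rough illustration of this variant, the sketch below classifies a
code point by encoded length, assuming a hypothetical 4096-character
window for the 2-byte forms alongside the 64-character page. The
window and page positions used in the example (0x3000 and 0x3040) are
chosen to cover Hiragana and Katakana; they are not values from any
proposal.

```python
ASCII_SIZE = 128
PAGE_SIZE = 64
WINDOW_SIZE = 4096

# Characters reachable in 1 byte: the ASCII page plus the selected page.
ONE_BYTE_COUNT = ASCII_SIZE + PAGE_SIZE


def encoded_length(cp, page, window):
    """Return 1 or 2 for code points covered by the 1-byte and 2-byte
    forms, or None for those left to 3-byte or 4-byte sequences."""
    if 0 <= cp < ASCII_SIZE or page <= cp < page + PAGE_SIZE:
        return 1
    if window <= cp < window + WINDOW_SIZE:
        return 2
    return None
```

With the page at 0x3040 and the window at 0x3000, HIRAGANA LETTER A
(0x3042) takes 1 byte, KATAKANA LETTER KA (0x30AB) takes 2 bytes, and
the ideograph 0x4E00 falls through to the longer forms.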