From: William_J_G Overington (wjgo_10009@btinternet.com)
Date: Sat Feb 26 2011 - 04:59:17 CST
Philippe Verdy <verdy_p@wanadoo.fr> wrote:
> | 6 bits : 11.yyxxxx
> | Encodes U+00C0..U+00FF (by default):
> |   yyxxxx = Unicode scalar value - BASE
> |   BASE should necessarily be a multiple of 16 (policy of ISO/IEC 10646-1 for block allocations).
> |   BASE must then be able to store up to 15 bits if arbitrary positions in the UCS are possible.
> |   BASE is then constrained to 0x80 .. 0x10FFF0 (in steps of 16).
> |   Same as ISO-8859-1 only if BASE = 0xC0.
> |   (BASE may be different from 0xC0 if a switch code has been explicitly used in the stream.)
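To make the arithmetic concrete before my questions, here is a minimal sketch in C of how I read the quoted scheme, assuming a single BASE register and the 6-bit payload; the function name is mine, not part of the proposal.

#include <stdint.h>

/* Decode one short-form byte of the pattern 11.yyxxxx under a single
   BASE register.  With base = 0xC0 this reproduces the ISO-8859-1
   default range U+00C0..U+00FF. */
uint32_t decode_short_form(uint8_t byte, uint32_t base)
{
    return base + (byte & 0x3F);  /* yyxxxx is the offset from BASE */
}

For example, decode_short_form(0xE9, 0xC0) yields 0xE9, that is U+00E9.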
When a byte starting with 11 is used in isolation, why is it represented as 11.yyxxxx, please?
Is it because there are four possible values of BASE, namely BASE[0], BASE[1], BASE[2] and BASE[3]?
If BASE has a non-negative value less than 0x80, could that value of BASE be used to signal access to a decoding tree, so that the most common code points in the text beyond the range U+0000..U+007F could be represented using a single byte starting with 11? The contents of the decoding tree could be dynamically altered using switching codes.
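A minimal sketch of that idea, again in C; the sentinel value and the flat 64-entry table standing in for the decoding tree are assumptions of mine, purely for illustration.

#include <stdint.h>

/* Hypothetical: a BASE value below 0x80 (here 0x00) signals that the
   6-bit payload indexes a table of frequent code points rather than
   being an offset.  A flat 64-entry table stands in for the decoding
   tree; switching codes would rewrite its entries at run time. */
#define TREE_SENTINEL 0x00u

static uint32_t frequent[64];

uint32_t decode_with_tree(uint8_t byte, uint32_t base)
{
    uint8_t payload = byte & 0x3F;           /* yyxxxx */
    if (base == TREE_SENTINEL)
        return frequent[payload];            /* tree (table) lookup */
    return base + payload;                   /* ordinary offset */
}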
If the idea of four values for BASE, in BASE[0], BASE[1], BASE[2] and BASE[3], is used, then access to a decoding tree would be possible simultaneously with one-byte access to a contiguous block of other Unicode characters, if so desired. However, if all four are used in that way, the range of possible values of each BASE register would need to be 17 bits.
For example, at some particular time in some particular application of the format, BASE[0] might have a value of 0x00 and BASE[1] might have a value of 0x100.
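Continuing that example, here is a sketch of the four-register reading, in which yy selects the register and xxxx is a 4-bit offset into a block of 16 code points; the values for BASE[2] and BASE[3] are arbitrary placeholders of mine.

#include <stdint.h>

/* Hypothetical four-register reading of 11.yyxxxx: yy picks one of
   BASE[0]..BASE[3], xxxx is an offset into that register's block of
   16 code points.  BASE[0] and BASE[1] follow the example above. */
static uint32_t BASE[4] = { 0x00, 0x100, 0xC0, 0xD0 };

uint32_t decode_four_bases(uint8_t byte)
{
    unsigned yy   = (byte >> 4) & 0x3;  /* register selector */
    unsigned xxxx =  byte       & 0xF;  /* offset within the block */
    return BASE[yy] + xxxx;
}

So the byte 0xD3 (binary 11 01 0011) would decode to 0x100 + 3, that is U+0103.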
William Overington
26 February 2011