Re: UTF-8 ill-formed question

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 15 Dec 2012 21:20:24 +0100

Your "Magic encoders" do not really help. They are sort of magic, yes, but it
is probably even more difficult to see how to use them (with quaternary
numbers for computing octets) than to understand the algorithm as described
(with binary numbers for computing octets).

The way I usually think about the conversion is with binary numbers.
Converting from hex to binary is trivial, as is shifting binary digits or
regrouping them to set the octet values, and converting the result back to
hex. The only non-trivial conversion is between binary (or hex) and decimal:
you need a mental conversion table of octet values, which is easy to remember
only up to 4 bits, or for a few specific octets with a single bit set to 1
(like 0x10=16, 0x20=32, 0x40=64) and these values minus 1 (like 0xFF=255).
Beyond that, you may want to memorize the full range of octet values, in
order to reduce the number of operations to compute mentally and to reduce
errors.

But a computer is simple to program without any such conversion table. It
converts numbers between binary, hex and decimal for you, and in fact no
numeric base conversion is even needed to convert between code points and
their octet encodings: the computer works directly with binary numbers and
only has to care about how to group subsequences of bits (like octets or a
full code point) into code units for storage (e.g. bytes) and how to pad
these bits in code units.
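
As an illustration of this bit regrouping, here is a minimal Python sketch
(not taken from the pocket encoders; the function name utf8_octets is
hypothetical). It only shifts and masks bits, and uses hex solely for display.
It also rejects the surrogate range, which well-formed UTF-8 excludes.

```python
def utf8_octets(cp: int) -> bytes:
    """Encode one code point into UTF-8 octets by regrouping its bits."""
    if cp < 0 or cp > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogates have no well-formed UTF-8 encoding")
    if cp < 0x80:
        # 7 bits: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:
        # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# U+4E8C is 0100 1110 1000 1100 in binary, regrouped as 0100 | 111010 | 001100:
# 1110 0100 = E4, 10 111010 = BA, 10 001100 = 8C
print(utf8_octets(0x4E8C).hex(" ").upper())  # E4 BA 8C
```

The middle octet BA is simply the marker bits 10 followed by the six middle
bits 111010 of the code point.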

But there's still a bug (or request for enhancement) in your Pocket
converters:

- For UTF-16 you correctly exclude the range U+D800..U+DFFF (surrogates)
from the sets of convertible codepoints.

- But you don't exclude this range in your UTF-8 and UTF-32 "magic
encoders", which overlook this case. Of course your encoders would produce
distinct sequences for these code points, but those sequences are not valid
UTF-8 or valid UTF-32 encodings.

- So one row in the UTF-8 magic encoder concerns the whole range
U+0800..U+FFFF. This row should be split in two disjoint parts
U+0800..U+D7FF and U+E000..U+FFFF.

- Same remark about your 1-row magic encoder for UTF-32 (two rows should be
used).
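
The split rows suggested above could be written as range tables. This is
only a sketch: the table layout and the helper names utf8_length and
utf32_encodable are hypothetical, not part of the pocket encoders.

```python
# Each row: (first code point, last code point, number of octets).
# The 3-octet range is split in two so that U+D800..U+DFFF falls in no row.
UTF8_ROWS = [
    (0x0000, 0x007F, 1),
    (0x0080, 0x07FF, 2),
    (0x0800, 0xD7FF, 3),
    (0xE000, 0xFFFF, 3),
    (0x10000, 0x10FFFF, 4),
]

# UTF-32 needs two rows instead of one, for the same reason.
UTF32_ROWS = [
    (0x0000, 0xD7FF),
    (0xE000, 0x10FFFF),
]


def utf8_length(cp):
    """Return the number of UTF-8 octets for cp, or None if not encodable."""
    for first, last, octets in UTF8_ROWS:
        if first <= cp <= last:
            return octets
    return None


def utf32_encodable(cp):
    """Return True if cp falls in one of the UTF-32 rows."""
    return any(first <= cp <= last for first, last in UTF32_ROWS)


print(utf8_length(0x4E8C))      # 3
print(utf8_length(0xD800))      # None
print(utf32_encodable(0xDC00))  # False
```

With these rows a surrogate code point matches nothing, so an encoder driven
by the table cannot emit an ill-formed sequence for it.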

2012/12/12 Otto Stolz <Otto.Stolz_at_uni-konstanz.de>

> Hello,
>
> On 2012-12-11 20:16, James Lin wrote:
>
> If i have a code point: U+4E8C or "二"
>> In UTF-8, it's "E4 BA 8C" while in UTF-16, it's "4E8C".
>> Where is this "BA" comes from?
>>
>
> Cf. <http://skew.org/cumped/>.
>
> Enclosed are the (almost original) version of “Cima’s Magic
> UTF-8 Pocket encoder” (2004), and its two followers for
> more UTFs. Display or print with a fixed-pitch font,
> such as Lucida Console or Courier New. Enjoy!
>
> Cheers,
> Otto Stolz
>
>
>
Received on Sat Dec 15 2012 - 14:24:51 CST
