Mark,
> This is too strong a statement. Yes, UTF-FSS was designed to represent code
> points above FFFF in 4 bytes. But let's look at the path that the software
> would take over history. If you take the original UCS-2 to UTF-8 mechanism
> (back when UTF-8 was called UTF-FSS) and apply it to surrogates, the
> sequence D800 DC00 would map to the sequence ED A0 80 ED B0 80. The sequence
> D800 DC00 was changed in UTF-16 to represent U+10000. If one did not correct
> the UCS-2 software, and simply interpreted it according to UTF-16 semantics,
> then one would end up with a (flawed) UTF-8 sequence representing U+10000.
> Nobody pulled this out of a hat. It is simply the natural result of not
> fixing your mapping to UTF-8 when starting to reinterpret 16-bit codes as
> UTF-16.
>
It seems to me that UTF-8 was designed to encode UCS-4, not UCS-2. I seem to
recall that using 6 bytes you can encode up to 31 bits. Until surrogates,
you could assume that the numeric values of UCS-2 and UCS-4 were the same,
so you could write UTF-8 encoders and decoders that handled only a limited
range of codes.
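
To make the byte layout concrete, here is a rough Python sketch of the original
1-to-6-byte scheme as I understand it. The function name encode_ucs4 is my own
invention, and the sketch deliberately does no surrogate checking, so it behaves
like the older UCS-2 era encoders discussed above rather than a conforming
modern one.

```python
def encode_ucs4(value):
    """Encode a 31-bit UCS-4 value using the original 1..6 byte UTF-8 layout.

    Hypothetical helper for illustration only: it does not reject
    D800..DFFF, which a conforming UTF-8 encoder must do.
    """
    if value < 0 or value > 0x7FFFFFFF:
        raise ValueError("value does not fit in 31 bits")
    if value < 0x80:
        return bytes([value])

    # (exclusive upper limit, total length, lead-byte marker)
    layouts = [
        (0x800, 2, 0xC0),
        (0x10000, 3, 0xE0),
        (0x200000, 4, 0xF0),
        (0x4000000, 5, 0xF8),
        (0x80000000, 6, 0xFC),
    ]
    for limit, length, lead in layouts:
        if value < limit:
            break

    # Peel off 6 bits at a time for the trailing bytes, lowest bits first.
    trail = []
    for _ in range(length - 1):
        trail.append(0x80 | (value & 0x3F))
        value >>= 6

    # Whatever is left goes into the lead byte.
    return bytes([lead | value] + trail[::-1])


print(encode_ucs4(0xD800).hex())   # the surrogate treated as a plain 16-bit code
print(encode_ucs4(0x10000).hex())  # the same character encoded from UCS-4
```

Fed the two surrogates one at a time, this gives the ED A0 80 ED B0 80 sequence
from Mark's note; fed U+10000 directly, it gives the 4-byte form.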

I agree with you; the problem is that the D800 to DFFF codes were never
defined as valid Unicode characters. Encoding them as ED xx xx sequences has
never produced valid Unicode code points in UTF-8. Therefore any of these
codes in the database were never valid Unicode characters at any point in
the Unicode standard. As a consequence, there is no backwards compatibility
issue.
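
As a quick check of this point, the sketch below asks a strict UTF-8 decoder
whether a byte string is acceptable. It relies on the behavior of Python's
built-in "utf-8" codec, which rejects encoded surrogates; the helper name
is_strict_utf8 is mine and only there for illustration.

```python
def is_strict_utf8(data):
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Surrogate pair D800 DC00 pushed through a UCS-2 style encoder.
print(is_strict_utf8(b"\xed\xa0\x80\xed\xb0\x80"))

# U+10000 in its 4-byte UTF-8 form.
print(is_strict_utf8(b"\xf0\x90\x80\x80"))
```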

UTF-8s is not a Unicode encoding. It is a UTF-8 encoding of the numeric
values of UTF-16 encoded Unicode. If you don't double decode (UTF-8s ->
UTF-16 -> Unicode), you are limited to the UCS-2 Unicode code points.
Because you must decode UTF-8s to UTF-16 before you use it, I don't
understand why they don't just use UTF-16 in the first place. There does
not seem to be any real point to UTF-8s.
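
The double decode I mean can be sketched in Python as follows. The function
name decode_utf8s is hypothetical, and I am assuming the "surrogatepass" error
handler of Python's codecs as a stand-in for a lenient UTF-8s reader; this is
not anyone's actual implementation.

```python
def decode_utf8s(data):
    """Decode UTF-8s bytes in two steps: UTF-8s -> UTF-16 -> Unicode.

    Hypothetical helper for illustration only.
    """
    # Step 1: undo the UTF-8 layer, letting encoded surrogates through
    # as individual 16-bit values.
    code_units = data.decode("utf-8", "surrogatepass")

    # Step 2: write those values out as UTF-16 and decode that, which is
    # where surrogate pairs finally become supplementary code points.
    utf16 = code_units.encode("utf-16-be", "surrogatepass")
    return utf16.decode("utf-16-be")


text = decode_utf8s(b"\xed\xa0\x80\xed\xb0\x80")
print(hex(ord(text)))
```

Stopping after step 1 leaves two lone surrogates rather than one character,
which is the UCS-2 limitation described above.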

I could UTF-8 encode EUC code page data, but what value does it add to the
data?

Carl
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:17:18 EDT