Re: UTF-16 Encoding Scheme and U+FFFE from Richard Wordingham on 2014-06-04 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Wed, 4 Jun 2014 20:01:48 +0100

On Wed, 4 Jun 2014 00:23:53 +0000
"Whistler, Ken" <ken.whistler_at_sap.com> wrote:

> You cannot even be "very confident" of not finding actual ill-formed
> UTF-16, like unpaired surrogates, in an external file, let alone
> noncharacters.

I thought unpaired surrogates were normally mojibake, broken
characters, or sabotage attempts.
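
For illustration, unpaired surrogates can be located by scanning the 16-bit code units. This is only a minimal sketch: the function name and the sample data are hypothetical, and the input is assumed to be a list of integers already split into code units in the correct byte order.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogate code units.

    `units` is a sequence of 16-bit integers (UTF-16 code units).
    """
    bad = []
    i = 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


# 'A', a valid pair, a lone high surrogate, 'B', a lone low surrogate
sample_units = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042, 0xDC00]
unpaired = find_unpaired_surrogates(sample_units)
print(unpaired)
```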

> Any one of those test strings could be
> trivially turned into a text file by piping out that one UTF-16
> string to a file.

At that point, you should be in detailed control of the Unicode encoding
scheme. Also, would not the system be using one of UTF-16 with a byte
order mark, UTF-16BE or UTF-16LE?
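
The three choices can be compared byte for byte. The following is a minimal sketch using Python's standard codecs; the sample string and variable names are illustrative only, and the BOM is prepended by hand so that the result does not depend on the machine's native byte order.

```python
import codecs

text = "a\uFFFE"  # 'a' followed by the noncharacter U+FFFE

# UTF-16BE and UTF-16LE: the byte order comes from the label, no BOM is written.
be_bytes = text.encode("utf-16-be")
le_bytes = text.encode("utf-16-le")

# UTF-16 with a byte order mark (big-endian chosen here for determinism).
bom_bytes = codecs.BOM_UTF16_BE + be_bytes

print(be_bytes.hex(" "))   # 00 61 ff fe
print(le_bytes.hex(" "))   # 61 00 fe ff
print(bom_bytes.hex(" "))  # fe ff 00 61 ff fe

# A BOM-aware decoder consumes the leading BOM and recovers the text.
round_trip = bom_bytes.decode("utf-16")
print(round_trip == text)  # True
```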

> And I could then write conformant test software
> that would read UTF-16 string input data from that file and run it
> through the UCA algorithm to construct sortkeys for it.

Given the number of control characters in that file, I wouldn't be
confident of getting the output back the same as it went out unless the
input were controlled at a binary level.

> As Peter said, the main thing that prevents running into these is
> that it isn't very *useful* to start off files (or strings) with
> U+FFFE.

Actually, for sorting records using the CLDR collation algorithm, it
may be very useful to use U+FFFE as a field separator. If the most
significant field for sorting is sometimes empty (e.g. surname in a list
of contacts), then the field separator could very easily be the first
non-BOM character after sorting. I suppose one had better use
something like <COMMA, U+FFFE> as a field separator instead.
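
The hazard can be sketched in Python. Everything named here (the `make_record` helper, the two-field record, the sample name) is hypothetical; the sketch assumes the sorted output is written as big-endian UTF-16 without a BOM and then read back by a decoder that sniffs for one.

```python
import codecs

SEP = "\uFFFE"  # noncharacter used as the field separator


def make_record(surname, given, sep=SEP):
    """Join two fields with the separator (hypothetical record layout)."""
    return surname + sep + given


# An empty surname puts the separator at the very start of the data.
record = make_record("", "Jo")
raw = record.encode("utf-16-be")  # bytes: ff fe 00 4a 00 6f

# The first two bytes are indistinguishable from a little-endian BOM.
looks_like_le_bom = raw[:2] == codecs.BOM_UTF16_LE

# A BOM-sniffing decoder strips them and byte-swaps the rest.
misread = raw.decode("utf-16")

# With <COMMA, U+FFFE> as the separator, the data starts with 00 2c instead.
safe_raw = make_record("", "Jo", sep=",\uFFFE").encode("utf-16-be")
safe_starts_with_bom = safe_raw[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

print(looks_like_le_bom, repr(misread), safe_starts_with_bom)
```

Here "Jo" comes back as U+4A00 U+6F00, while the comma-prefixed form no longer begins with anything that resembles a BOM.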

> (And, additionally, in the case of UTF-16 text data files, it
> would be confusing and possibly lead to misinterpretation of byte
> order, if you were somehow depending solely on initial BOMs -- which
> I wouldn't advise, anyway.)

Interesting. Goodbye UTF-16 encoding scheme and hello automatic
encoding detection. I'm not sure how automatic detection is supposed
to work with a file consisting of just a test string from the
collation test.
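
A small Python sketch of why well-formedness alone cannot settle the byte order. The sample string is a hypothetical stand-in made of control characters, not an actual line from the collation test data.

```python
# Hypothetical stand-in for a short test string of control characters.
sample = "\u0001\u0002\u0009"

# Written out big-endian, with no BOM.
raw = sample.encode("utf-16-be")  # bytes: 00 01 00 02 00 09

# Both interpretations decode without error, so validity does not
# tell a detector which byte order was intended.
as_be = raw.decode("utf-16-be")
as_le = raw.decode("utf-16-le")

print([hex(ord(c)) for c in as_be])  # ['0x1', '0x2', '0x9']
print([hex(ord(c)) for c in as_le])  # ['0x100', '0x200', '0x900']
```

A detector would have to fall back on a guess about which reading is more plausible, and for a file holding a single short string there is very little to go on.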

> Basically, the rules of standards (e.g., you shouldn't try to
> publicly interchange noncharacters) are not like laws of
> physics. Just because the standard says you shouldn't do
> it doesn't mean it doesn't happen.

Just as theft happens.

Richard.
Received on Wed Jun 04 2014 - 14:02:59 CDT