UTF-16 Encoding Scheme and U+FFFE
richard.wordingham at ntlworld.com
Wed Jun 4 14:01:48 CDT 2014
On Wed, 4 Jun 2014 00:23:53 +0000
"Whistler, Ken" <ken.whistler at sap.com> wrote:
> You cannot even be "very confident" of not finding actual ill-formed
> UTF-16, like unpaired surrogates, in an external file, let alone
I thought unpaired surrogates were normally mojibake, broken
characters, or sabotage attempts.
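
For anyone who wants to check a file for them, here is a minimal sketch of
a scan for unpaired surrogates. It assumes the data has already been split
into 16-bit code units; the function name find_unpaired_surrogates is made
up for illustration and is not taken from any library.

```python
def find_unpaired_surrogates(units):
    """Return the indices of unpaired surrogates in a sequence of
    16-bit code units (plain ints in the range 0..0xFFFF)."""
    bad = []
    i = 0
    n = len(units)
    while i < n:
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:
            # High surrogate: well-formed only if a low surrogate follows.
            if i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            bad.append(i)
        elif 0xDC00 <= u <= 0xDFFF:
            # Low surrogate with no preceding high surrogate.
            bad.append(i)
        i += 1
    return bad


well_formed = [0x0041, 0xD83D, 0xDE00]   # "A" followed by one surrogate pair
ill_formed = [0xD800, 0x0041, 0xDC00]    # lone high, "A", lone low
print(find_unpaired_surrogates(well_formed))
print(find_unpaired_surrogates(ill_formed))
```

The scan only reports where the ill-formed units are; whether they are
mojibake, truncation, or deliberate is a separate question.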
> Any one of those test strings could be
> trivially turned into a text file by piping out that one UTF-16
> string to a file.
At that point, you should be in detailed control of the Unicode encoding
scheme. Also, would the system not be using one of UTF-16 with a byte
order mark, UTF-16BE, or UTF-16LE?
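
To make the three choices concrete, here is a rough sketch of writing the
same string out under each of them. It uses Python's standard codecs; the
helper name serialise and its dictionary keys are my own invention, and I
have arbitrarily picked big-endian for the BOM-marked form.

```python
import codecs


def serialise(text):
    """Return the bytes of `text` under three UTF-16 serialisations."""
    return {
        # UTF-16 encoding scheme: the BOM announces the byte order.
        "UTF-16 with BOM": codecs.BOM_UTF16_BE + text.encode("utf-16-be"),
        # UTF-16BE and UTF-16LE: byte order fixed by the label, no BOM.
        "UTF-16BE": text.encode("utf-16-be"),
        "UTF-16LE": text.encode("utf-16-le"),
    }


for label, data in serialise("A").items():
    print(label, data.hex(" "))
```

Only the first form carries its byte order in the data; for the other two
the reader has to be told the encoding scheme out of band.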
> And I could then write conformant test software
> that would read UTF-16 string input data from that file and run it
> through the UCA algorithm to construct sortkeys for it.
Given the number of control characters in that file, I wouldn't be
confident of getting the output back the same as it went out unless the
input were controlled at a binary level.
> As Peter said, the main thing that prevents running into these is
> that it isn't very *useful* to start off files (or strings) with
Actually, for sorting records using the CLDR collation algorithm, it
may be very useful to use U+FFFE as a field separator. If the most
significant field for sorting is sometimes empty (e.g. surname in a list
of contacts), then the field separator could very easily be the first
non-BOM character after sorting. I suppose one had better use
something like <COMMA, U+FFFE> as a field separator instead.
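
A small sketch of why the bare separator is awkward, assuming the record
is written as big-endian UTF-16 without a BOM and that the reader sniffs
the first two bytes. The contact record and the helper name
starts_like_le_bom are hypothetical.

```python
import codecs

SEP = "\ufffe"              # bare U+FFFE as field separator
SAFER_SEP = ",\ufffe"       # <COMMA, U+FFFE>


def starts_like_le_bom(data):
    """True if the first two bytes are FF FE, the UTF-16LE form of the BOM."""
    return data[:2] == codecs.BOM_UTF16_LE


# A contact whose most significant field (surname) is empty.
fields = ["", "Alice"]

bare = SEP.join(fields).encode("utf-16-be")
safer = SAFER_SEP.join(fields).encode("utf-16-be")

print(starts_like_le_bom(bare))    # the record begins with U+FFFE itself
print(starts_like_le_bom(safer))   # the record begins with COMMA

# A BOM-sniffing decoder takes FF FE as a little-endian BOM and
# byte-swaps everything after it.
misread = bare.decode("utf-16")
print(misread == SEP.join(fields))
```

With the comma in front, the first code unit is U+002C, so the leading
bytes can no longer be mistaken for a byte order mark.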
> (And, additionally, in the case of UTF-16 text data files, it
> would be confusing and possibly lead to misinterpretation of byte
> order, if you were somehow depending solely on initial BOMs -- which
> I wouldn't advise, anyway.)
Interesting. Goodbye UTF-16 encoding scheme and hello automatic
encoding detection. I'm not sure how automatic detection is supposed
to work with a file consisting of just a test string from the
> Basically, the rules of standards (e.g., you shouldn't try to
> publicly interchange noncharacters) are not like laws of
> physics. Just because the standard says you shouldn't do
> it doesn't mean it doesn't happen.
Just as theft happens.