Re: Corrigendum #9 clarifies noncharacter usage in Unicode

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Thu, 21 Feb 2013 15:26:09 -0800

On Thu, Feb 21, 2013 at 2:12 PM, Richard Wordingham <
richard.wordingham_at_ntlworld.com> wrote:

> Microsoft chose WEOF=0xffff. I don't think it can easily be changed to
> a better value until an incompatible processor architecture is used.
> Changing it is likely to break existing executables and object
> libraries.
>

If this is true, it's certainly a poor choice, and might violate the C
standard. (I have not checked the actual standard for getwc(), wint_t &
WEOF.)

> 16-bit wchar_t doesn't exactly support 21-bit Unicode.

Right -- that's why the standard library uses a separate type, wint_t,
which can be wider if necessary.
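
A rough sketch of the idiom, in case it helps: the result of getwc() is
kept in a wint_t, not a wchar_t, and compared against WEOF. The helper
names count_wide_chars() and roundtrip_count() are made up for this
illustration. Whether WEOF can collide with a real code unit depends on
the platform's wint_t and WEOF definitions, which I am not asserting here.

```c
#include <stdio.h>
#include <wchar.h>

/* Hypothetical helper: counts wide characters until getwc() reports WEOF.
   The result is stored in wint_t, the type the library uses so that WEOF
   can be represented alongside the wide-character values. */
long count_wide_chars(FILE *f) {
    long n = 0;
    wint_t wc;
    while ((wc = getwc(f)) != WEOF) {
        n++;
    }
    return n;
}

/* Hypothetical helper: writes text to a temporary file, reads it back,
   and returns the number of wide characters seen (-1 on failure). */
long roundtrip_count(const wchar_t *text) {
    FILE *f = tmpfile();
    long n;
    if (f == NULL) {
        return -1;
    }
    if (fputws(text, f) == -1) {
        fclose(f);
        return -1;
    }
    rewind(f);
    n = count_wide_chars(f);
    fclose(f);
    return n;
}
```

If wint_t were only 16 bits wide and WEOF were 0xffff, as reported above,
a U+FFFF in the input would end such a loop early.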

Nothing requires a library that processes 16-bit Unicode strings to have a
16-bit type for a single-character return value, just as the C standard's
getc() returns a *negative* EOF value, in an integer type that is wider
than a byte.
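
To illustrate the getc() point with a small sketch: because the return
type is int, the byte 0xFF comes back as 255 and stays distinct from the
negative EOF. The helpers read_back_byte() and read_past_end() are names
invented for this example, and it uses tmpfile() only to get a stream
to read from.

```c
#include <stdio.h>

/* Hypothetical helper: writes one byte to a temporary file and returns
   whatever getc() reports when reading it back (EOF on I/O failure). */
int read_back_byte(unsigned char byte) {
    FILE *f = tmpfile();
    int c;
    if (f == NULL) {
        return EOF;
    }
    if (fputc(byte, f) == EOF) {
        fclose(f);
        return EOF;
    }
    rewind(f);
    c = getc(f);   /* int, so 0..255 and EOF are all representable */
    fclose(f);
    return c;
}

/* Hypothetical helper: returns what getc() reports on an empty stream. */
int read_past_end(void) {
    FILE *f = tmpfile();
    int c;
    if (f == NULL) {
        return EOF;
    }
    c = getc(f);
    fclose(f);
    return c;
}
```

Storing the result in a plain char instead would lose that distinction on
implementations where char is signed.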

> The UTC is now applying additional pressure for the making of the
> distinction between UTF-16 and UTF-16LE.

The UTC is doing no such thing. Nothing has changed with regard to the
UTF-16 encoding scheme and the BOM.

U+FFFE has always been a code point that will never have a real character
assigned to it. That is why it is *unlikely* to appear as the first
character in a text file and thus useful as a "reverse BOM". However, it
was never forbidden from occurring in the text.

Best practice for file encodings has always been to declare the encoding.

Second best for UTF-16 is to always include the BOM, even if the byte order
is big-endian. And since most computers are little-endian, they need to
include the BOM in UTF-16 file encodings anyway (if they use their native
endianness).
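
As a minimal sketch of what a reader of such a file might do with the
first two bytes: FE FF means big-endian, and FF FE (U+FFFE when read
big-endian, the "reverse BOM") means little-endian. The enum and the
function detect_utf16_bom() are hypothetical names for this example, and
it assumes the data is already known to be UTF-16. It does not try to
tell UTF-16LE apart from UTF-32LE, which starts with the same two bytes.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum utf16_order { UTF16_NO_BOM, UTF16_BE, UTF16_LE };

/* Hypothetical helper: inspects the first two bytes of a buffer that is
   assumed to hold UTF-16 text and reports the byte order the BOM implies. */
enum utf16_order detect_utf16_bom(const unsigned char *buf, size_t len) {
    if (len >= 2) {
        if (buf[0] == 0xFE && buf[1] == 0xFF) {
            return UTF16_BE;   /* U+FEFF stored high byte first */
        }
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            return UTF16_LE;   /* U+FEFF stored low byte first */
        }
    }
    return UTF16_NO_BOM;       /* fall back to a declared encoding */
}
```

Without a BOM the caller has to rely on the declared encoding, which is
why declaring it remains the better practice.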

markus
Received on Thu Feb 21 2013 - 17:30:09 CST
