Re: Corrigendum #9 clarifies noncharacter usage in Unicode from Richard Wordingham on 2013-02-21 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 21 Feb 2013 22:12:51 +0000

On Thu, 21 Feb 2013 11:52:07 -0800
Markus Scherer <markus.icu_at_gmail.com> wrote:

> On Thu, Feb 21, 2013 at 11:06 AM, Richard Wordingham <
> richard.wordingham_at_ntlworld.com> wrote:

> "fgetwc returns, as a
> wint_t<http://msdn.microsoft.com/en-us/library/323b6b3k.aspx>,
> the wide character that corresponds to the character read or returns
> WEOF to indicate an error or end of file. For both functions, use
> feof or ferror to distinguish between an error and an end-of-file
> condition." http://msdn.microsoft.com/en-us/library/c7sskzc1.aspx

> In other words, the wint_t value WEOF is supposed to be out-of-range
> for normal characters, and if in doubt, the API docs tell you to call
> feof().

Actually, you have to call both! If both return zero, then you have
U+FFFF. Just calling feof() would lead one, by UTC ruling, to
misdiagnose an error.
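
To make the point concrete, here is a minimal sketch in C of the
three-way check. The names read_status, classify_wide and read_wide are
made up for illustration and belong to no standard API. It assumes the
stream's end-of-file and error indicators were clear before the call.

```c
#include <assert.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical result codes for this sketch; not part of any standard API. */
enum read_status { READ_CHAR, READ_EOF, READ_ERROR };

/*
 * Decide what a value returned by fgetwc() means, given the stream's
 * end-of-file and error indicators as sampled right after the call.
 */
enum read_status classify_wide(wint_t c, int at_eof, int has_error)
{
    if (c != WEOF)
        return READ_CHAR;   /* ordinary character */
    if (has_error)
        return READ_ERROR;  /* WEOF reported a read error */
    if (at_eof)
        return READ_EOF;    /* WEOF reported end of file */
    /* Neither indicator is set: the value is data that happens to equal
       WEOF, i.e. U+FFFF on a platform where WEOF is 0xFFFF. */
    return READ_CHAR;
}

/* Read one wide character and classify it with both feof() and ferror(). */
enum read_status read_wide(FILE *stream, wint_t *out)
{
    assert(stream != NULL && out != NULL);
    wint_t c = fgetwc(stream);
    enum read_status status = classify_wide(c, feof(stream), ferror(stream));
    if (status == READ_CHAR)
        *out = c;
    return status;
}
```

Where WEOF lies outside the range of wchar_t (the 0xffffffffu case
quoted below), the last branch of classify_wide should never be reached
for real data, so the extra test costs nothing there.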

> On my Ubuntu laptop, wchar.h defines WEOF=0xffffffffu which is
> thoroughly out of range for Unicode.

Microsoft chose WEOF=0xffff. I don't think it can easily be changed to
a better value until an incompatible processor architecture is used.
Changing it is likely to break existing executables and object
libraries.

> The comment for *wint_t* says
> /* Integral type unchanged by default argument promotions that can
> hold any value corresponding to members of the extended character
> set, as well as *at least one value that does not correspond to any
> member of the extended character set*. */
>
> I don't have a Windows system handy to check for the value there. I
> assume that it follows the standard:

16-bit wchar_t doesn't exactly support 21-bit Unicode. Hitherto, one
could always have tried claiming that reading U+FFFF when expecting
ordinary characters was tantamount to interchanging code containing it,
or claimed that this part of internal usage was one of the restrictions
of the system. The 'correction' destroys that defence. One can still
note that U+FFFF is not an assigned character and never will be!

>> U+FFFE at the start of a UTF-16 file should also cause some
>> headaches!
>> Doesn't Microsoft Windows still interpret this as a byte-order mark
>> without asking whether there may be a byte-order mark?

> In the UTF-16 *encoding scheme*, such as in an otherwise unmarked
> file, the leading bytes FF FE and FE FF have special meaning. Again,
> this has nothing to do with the first character in a string in code.
> None of this has changed.

Those believing the restrictive interpretation would not expect UTF-16LE
or UTF-16BE files to start with U+FFFE, so if the first character
appeared to be U+FFFE, they could get away with assuming it was actually
a UTF-16 file and deducing that it was not in the default endianness
assigned by the higher protocol.

The UTC is now applying additional pressure to distinguish between
UTF-16 and UTF-16LE. To be precise, if the text of a file using the
UTF-16 encoding scheme with x-endian content is to start with U+FFFE as
its first character, it must start with what would be interpreted as
U+FEFF U+FFFE if it were declared to be in the UTF-16xE encoding scheme.
What has changed is that previously such a file could be regarded as
erroneous: it should not have escaped from the application that spawned
it. Now the question of whether it is in the UTF-16 encoding scheme or
the UTF-16xE encoding scheme needs to be resolved.
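
A rough sketch in C of what resolving that question involves. The enum
and function names are invented for illustration. The sketch assumes the
declared encoding scheme is known from a higher protocol, and that
unmarked UTF-16 defaults to big-endian as in the Unicode Standard (a
higher protocol may say otherwise).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for the declared encoding scheme. */
enum scheme { SCHEME_UTF16, SCHEME_UTF16LE, SCHEME_UTF16BE };
enum byte_order { ORDER_BE, ORDER_LE };

struct bom_result {
    enum byte_order order; /* byte order of the text that follows */
    size_t skip;           /* bytes of byte-order mark to discard */
};

/* Only the UTF-16 scheme treats leading FF FE / FE FF as a byte-order mark.
   In UTF-16LE and UTF-16BE the same bytes are ordinary text. */
struct bom_result resolve_utf16(enum scheme declared,
                                const unsigned char *buf, size_t len)
{
    struct bom_result r = { ORDER_BE, 0 }; /* assumed default: big-endian */
    assert(buf != NULL || len == 0);
    if (declared == SCHEME_UTF16LE) {
        r.order = ORDER_LE;
    } else if (declared == SCHEME_UTF16 && len >= 2) {
        if (buf[0] == 0xFF && buf[1] == 0xFE) {
            r.order = ORDER_LE;
            r.skip = 2;
        } else if (buf[0] == 0xFE && buf[1] == 0xFF) {
            r.skip = 2; /* big-endian mark; order is already ORDER_BE */
        }
    }
    return r;
}

/* First 16-bit code unit of the text proper, or -1 if there is none. */
long first_code_unit(struct bom_result r, const unsigned char *buf, size_t len)
{
    if (len < r.skip + 2)
        return -1;
    const unsigned char *p = buf + r.skip;
    return r.order == ORDER_LE ? (long)(p[0] | (p[1] << 8))
                               : (long)((p[0] << 8) | p[1]);
}
```

With the bytes FF FE FE FF, a UTF-16 declaration yields little-endian
text whose first character is U+FFFE, while a UTF-16LE declaration
yields U+FEFF followed by U+FFFE. With just FE FF, UTF-16LE gives a
leading U+FFFE, but UTF-16 consumes both bytes as a big-endian mark and
leaves no text at all.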

Richard.
Received on Thu Feb 21 2013 - 16:16:41 CST