richard.wordingham at ntlworld.com
Wed Jun 25 12:58:55 CDT 2014
On Tue, 24 Jun 2014 09:16:00 -0400
CE Whitehead <cewcathar at hotmail.com> wrote:
> ME: if two sequences are canonically equivalent except that one has
> noncharacters in it, are these still canonically equivalent?
Canonical equivalences are defined for all sequences of scalar values;
it is just that it changes from version to version for most unassigned
code points. Non-characters only decompose to themselves and do not
occur in the canonical (or indeed compatibility) decomposition of
anything else, so a sequence containing a non-character cannot be
canonically equivalent to a sequence not containing a non-character.
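
As a rough illustration of the point about non-characters and canonical
equivalence, here is a minimal Python sketch using the standard
unicodedata module. The helper names (noncharacters,
canonically_equivalent, appears_in_other_decompositions) are made up for
this example, and the results reflect whatever Unicode version the Python
build ships with.

```python
import unicodedata


def noncharacters():
    """Return the 66 Unicode noncharacters as one-character strings."""
    # U+FDD0..U+FDEF, plus the last two code points of each of the 17 planes.
    block = [chr(cp) for cp in range(0xFDD0, 0xFDF0)]
    plane_ends = [chr((plane << 16) | low)
                  for plane in range(0x11)
                  for low in (0xFFFE, 0xFFFF)]
    return block + plane_ends


def canonically_equivalent(a, b):
    """Two strings are canonically equivalent iff their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def appears_in_other_decompositions(nonchars):
    """Scan all scalar values for a decomposition containing a noncharacter."""
    nonchar_set = set(nonchars)
    for cp in range(0x110000):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # surrogates are not scalar values
        ch = chr(cp)
        if ch in nonchar_set:
            continue
        # NFKD covers both canonical and compatibility decompositions.
        if nonchar_set.intersection(unicodedata.normalize("NFKD", ch)):
            return True
    return False


NONCHARS = noncharacters()

# Every noncharacter is unchanged by all four normalization forms.
ALL_FIXED = all(unicodedata.normalize(form, ch) == ch
                for ch in NONCHARS
                for form in ("NFC", "NFD", "NFKC", "NFKD"))

precomposed = "\u00e9"           # e with acute, precomposed
decomposed = "e\u0301"           # e followed by combining acute
with_nonchar = "e\u0301\ufffe"   # same, with U+FFFE appended
```

With these definitions, canonically_equivalent(precomposed, decomposed)
holds, while canonically_equivalent(precomposed, with_nonchar) does not,
because U+FFFE survives normalization on one side only.
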
> Regarding the sentinels; I am an outsider but assume that with
> Corrigendum 9 U+FFFE will continue to be mentioned as having
> generally (not always?) standard use throughout; in Chapter 16.7 it
> is currently mentioned; I assume it will still be -- according to
> info. in the FAQ and elsewhere:
> http://www.unicode.org/faq/private_use.html "U+FFFE. The 16-bit
> unsigned hexadecimal value U+FFFE is not a Unicode character value,
> and should be taken as a signal that Unicode characters should be
> byte-swapped before interpretation. U+FFFE should only be interpreted
> as an incorrectly byte-swapped version of U+FEFF"
There is a lot of untruth in that FAQ entry, alas. I think U+FFFE
and possibly U+FFFF should be treated differently to the other 64
non-characters. At present there is no certainty as to whether
an interchanged file in the UTF-16 encoding scheme that appears to
contain a BOM contains a BOM or starts with U+FFFE. The only
promise is that such a file contains an even number of data bytes.
Any such sequence is valid! Will the UTF-16 encoding scheme be
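
To make the BOM ambiguity concrete, here is a small Python sketch. The
four-byte sample is invented for this illustration; the codec names are
Python's own. It shows the same bytes read once as a BOM followed by
little-endian text, and once as BOM-less big-endian text that starts with
U+FFFE.

```python
# Four bytes, an even number, as an interchanged UTF-16 file might contain.
data = b"\xff\xfe\x41\x00"

# Reading 1: the first two bytes are a BOM (U+FEFF, little-endian order),
# so the remaining bytes are little-endian text.
as_bom_then_le = data.decode("utf-16")

# Reading 2: there is no BOM and the text is big-endian, so the first
# code unit is the non-character U+FFFE and the second is U+4100.
as_be_without_bom = data.decode("utf-16-be")

print(repr(as_bom_then_le))                              # 'A'
print([f"U+{ord(c):04X}" for c in as_be_without_bom])    # ['U+FFFE', 'U+4100']
```

Both decodes succeed, which is the problem: nothing in the bytes
themselves says which reading was intended.
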