From: Richard T. Gillam (rgillam@las-inc.com)
Date: Fri Jan 21 2005 - 13:26:10 CST
Jill--
>[Jill's Important Question 1]:
>So the first question I must ask is: Which of these two clauses takes
>precedence, C8 or C12b?
>
>If C12b takes precedence, then when a process interprets a byte sequence which
>purports to be in the Unicode Encoding Scheme UTF-8, it shall interpret that
>byte sequence according to the specifications for the use of the byte order
>Mark established by the Unicode Standard for the Unicode Encoding Scheme UTF-8.
>
>But if C8 takes precedence, then a process shall not assume that it is required
>to interpret U+FEFF.
>
>They can't both be right.
Peter Kirk had this one right. Certain encoding SCHEMES treat the byte
sequence FEFF (or some variant of it) as a byte order mark when it
appears at the beginning of a text stream. In these contexts, it's not
a character at all; it's part of the communication protocol. A process
operating on the actual text, after it's been deserialized and converted
into an in-memory representation (an encoding FORM), doesn't see it.
Other encoding schemes don't treat FEFF as special. A process operating
on the actual text after it's been deserialized will see this as the
character U+FEFF, the ZWNBSP.
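To make the distinction concrete, here is a minimal sketch using Python's built-in codecs as stand-ins for the two kinds of scheme. The sample bytes are made up for illustration: FE FF followed by U+0041 in big-endian order.

```python
# Same bytes, two encoding schemes.
data = b"\xfe\xff\x00\x41"

# "utf-16" treats a leading FE FF as a byte order mark and consumes it,
# so the decoded text never contains it.
as_utf16 = data.decode("utf-16")

# "utf-16-be" has a fixed byte order and does no BOM handling,
# so FE FF comes through as the character U+FEFF.
as_utf16be = data.decode("utf-16-be")

print(repr(as_utf16))
print(repr(as_utf16be))
```

In the first case the process working on the decoded text never sees the mark; in the second it sees U+FEFF as an ordinary character.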
>[Jill's Important Question 2]:
>And the second question I must ask is: if a file is labelled by some higher
>level protocol (for example, Unix locale, HTTP header, etc) as "UTF-8", should
>a conformant process interpret that as UTF-8, the Unicode Encoding FORM (which
>prohibits a BOM) or as UTF-8, the Unicode Encoding SCHEME (which allows one)?
UTF-8 is both an encoding form and an encoding scheme, and it doesn't do
anything special with EF BB BF. It always comes through as U+FEFF, the
ZWNBSP. Applications that use EF BB BF as a signal that the text stream
is in UTF-8 and not some other encoding are implementing a higher-level
protocol based on UTF-8. UTF-8 itself doesn't treat this sequence as
special.
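As a rough illustration, Python happens to expose both behaviors as separate codecs: "utf-8" passes EF BB BF through as U+FEFF, while "utf-8-sig" is one example of a convention layered on top that strips a leading signature. The sample bytes are made up.

```python
data = b"\xef\xbb\xbfhi"

# Plain UTF-8: EF BB BF is just the three-byte encoding of U+FEFF.
plain = data.decode("utf-8")

# "utf-8-sig" implements the signature convention: a leading
# EF BB BF is dropped before the text is handed to the caller.
stripped = data.decode("utf-8-sig")

print(repr(plain))
print(repr(stripped))
```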
For that matter, applications that use the full panoply of
signature-byte sequences (0000FEFF for UTF-32BE, FFFE0000 for UTF-32LE,
FEFF for UTF-16BE, FFFE for UTF-16LE, EF BB BF for UTF-8, etc.) to
determine whether a byte stream is Unicode and what Unicode encoding
scheme it is are also implementing a higher-level protocol based on
Unicode.
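A sketch of what such a signature-sniffing protocol might look like follows. The names `SIGNATURES` and `sniff_signature` are hypothetical, and the table covers only the sequences listed above. Note that the four-byte UTF-32 signatures have to be tested before the two-byte UTF-16 ones, since FF FE 00 00 begins with FF FE.

```python
# Hypothetical signature table; longer signatures come first so that
# FF FE 00 00 is not mistaken for the shorter FF FE.
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]

def sniff_signature(data: bytes):
    """Return (scheme_name, signature_length), or (None, 0) if no match."""
    for sig, name in SIGNATURES:
        if data.startswith(sig):
            return name, len(sig)
    return None, 0

print(sniff_signature(b"\xff\xfeA\x00"))
print(sniff_signature(b"plain ASCII"))
```

This is a heuristic, not a guarantee: FF FE 00 00 could also be little-endian UTF-16 text that starts with U+0000 after its mark, so a real application would need some other tie-breaker.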
>What with all the BOM difficulties, and the fact that U+FEFF doubles up as ZERO
>WIDTH NO-BREAK SPACE, a new possibility occurred to me.
>
>Imagine if the codepoint U+D7FD were reserved as NOP, having properties which
>essentially made it completely ignorable and invisible. It could simply be
>thrown away, wherever it were encountered.
This isn't a bad idea, but it's pretty much unnecessary. With Unicode
3.2, the meaning of U+FEFF as ZWNBSP was deprecated and a new character,
U+2060 WORD JOINER, was introduced to fulfill the ZWNBSP function. Over
time, this means you'll see more and more applications that use U+2060
to glue things together and treat U+FEFF as a no-op. These applications
will have some backward-compatibility problems (older documents will
have some "glued" sequences coming "unglued"), but this will die out.
In fact, I think the more recent versions of Unicode make it legal to
turn U+FEFF into U+2060 without documenting you're changing the text.
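For what it's worth, the conversion might be sketched like this. The function name `modernize_joiners` is hypothetical, and leaving a leading U+FEFF untouched (on the theory that it may be a signature rather than a joiner) is my assumption, not something the standard requires.

```python
ZWNBSP = "\ufeff"       # U+FEFF, deprecated as a joiner since Unicode 3.2
WORD_JOINER = "\u2060"  # U+2060 WORD JOINER, its replacement

def modernize_joiners(text: str) -> str:
    """Replace interior U+FEFF with U+2060, keeping any leading U+FEFF."""
    # Assumption: a U+FEFF at the very start may be a signature, so keep it.
    head = ZWNBSP if text.startswith(ZWNBSP) else ""
    body = text[len(head):]
    return head + body.replace(ZWNBSP, WORD_JOINER)

print(repr(modernize_joiners("a\ufeffb")))
```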
--Rich Gillam
Language Analysis Systems