Re: UTF-8N?

From: Peter_Constable@sil.org
Date: Wed Jun 21 2000 - 17:24:40 EDT


On 06/21/2000 04:36:35 PM <jcowan@reutershealth.com> wrote:

[various snips]

>Encodings are mappings between sequences of characters and sequences
>of bytes. Suppose we have a character sequence that begins with
>the character U+0020. Here are some possible encodings of that sequence
>into bytes:
>
>UTF-16: 0xFE 0xFF 0x00 0x20 ...
>UTF-16: 0xFF 0xFE 0x20 0x00 ...
>UTF-8B: 0xEF 0xBB 0xBF 0x20 ...

Here you're wrong. The BOM is explicitly not to be interpreted as part of
the text stream. D35 (U3, p47) states (at least for UTF-16):

"The byte order mark is not considered part of the content of the text."

So, the character sequence consisting of an initial U+0020 gets encoded in
UTF-16 as 0x00 0x20 or 0x20 0x00. The standard doesn't ever discuss the BOM
in the context of UTF-8, but it would be a logical extension to say that a
BOM in a UTF-8 text stream (if such can be defined - but that's the crux of
the problem, and I'll return to it below) should not be considered part of
the content of the text. So, U+0020 gets encoded in UTF-8 as 0x20, whether
or not the file begins with a BOM.
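
To make the UTF-16 case concrete, here is a small sketch in Python. Python
and its codec names ("utf-16", "utf-16-be", "utf-16-le") are only my choice
for illustration and are not something the standard prescribes. The sketch
assumes the behaviour of Python's built-in codecs, where the unmarked
"utf-16" decoder consumes an initial BOM and the explicit BE/LE decoders do
not.

```python
# Illustration only: Python's built-in codecs stand in for the encoding schemes.
SPACE = "\u0020"

# Without a BOM, U+0020 serializes to exactly two bytes in either byte order.
be_bytes = SPACE.encode("utf-16-be")   # 0x00 0x20
le_bytes = SPACE.encode("utf-16-le")   # 0x20 0x00

# A leading BOM is consumed by the "utf-16" decoder and is not returned as text.
with_bom_be = b"\xfe\xff\x00\x20"
with_bom_le = b"\xff\xfe\x20\x00"
decoded_be = with_bom_be.decode("utf-16")
decoded_le = with_bom_le.decode("utf-16")

# The explicitly big-endian decoder keeps the initial U+FEFF as a character.
decoded_as_be = with_bom_be.decode("utf-16-be")

print(be_bytes.hex(), le_bytes.hex())
print(repr(decoded_be), repr(decoded_le), repr(decoded_as_be))
```

Under those assumptions the first two decodes yield just U+0020, while the
"utf-16-be" decode yields U+FEFF followed by U+0020.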

By the way, I don't know why you singled out U+0020 here; your claim could
equally have been made about any other character (and would have been
equally inaccurate).

>Now suppose we have a character sequence beginning with U+FEFF U+0020. This
>would be encoded as follows...

An unlikely initial character sequence, and the same objections raised
above still apply.

>Without distinct labels UTF-8N and UTF-8B (or whatever), we cannot tell if the
>byte sequence 0xEF 0xBB 0xBF 0x20 should be decoded as U+0020 or U+FEFF
>U+0020. This is exactly analogous to the statement that without distinct
>labels UTF-16 and UTF-16BE, we cannot tell if the byte sequence 0xFE 0xFF 0x00
>0x20 should be decoded as U+0020 or U+FEFF U+0020.

Now, this is valid, at least for UTF-8. (Again, though, it would be valid
for U+0020 or any other character.) This isn't analogous to UTF-16 since
D33 - D35 spell out how an initial U+FEFF is to be interpreted (though it
would be analogous if D33 - D35 didn't make that clear - perhaps that's
what you meant). But you are right about UTF-8, because there is no
definition for the BOM in the context of UTF-8.

Perhaps the unstated intention of the authors of the standard is that, as
with UTF-16BE and UTF-16LE, an initial sequence in UTF-8 corresponding to
U+FEFF is interpreted as ZWNBSP (D33, D34); in other words, that there is
no BOM in the context of UTF-8. But if so, they certainly didn't make this
clear, and obviously there is confusion on this issue. We've seen that some
specifically put this sequence at the beginning of a UTF-8 file to identify
it as such, and that others specifically assume it will not be there.

Summarising:

- Initial U+0020 gets serialized in UTF-8 as 0x20, regardless of whether or
not the file begins with a BOM. (On this we disagree.)

- A UTF-8 file that begins with the byte sequence 0xEF 0xBB 0xBF 0x20 ...
could be interpreted as either < ZWNBSP U+0020 ... >, or as BOM < U+0020
... > (where I'm using angle brackets to denote the start and end of the
content of text). Furthermore, there is nothing to indicate which
interpretation is correct. (On this we agree.)

- U3 does not indicate whether or not the notion of the BOM is valid in the
context of UTF-8.
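
The two readings in the second point can be sketched in Python. This is only
an illustration: it assumes Python's built-in "utf-8" codec (which returns an
initial U+FEFF as a character) and its "utf-8-sig" codec (which treats the
same three bytes as a signature and drops them). Those codec names are
Python's own convention, not labels defined by U3.

```python
# The same four bytes, decoded under the two competing interpretations.
data = b"\xef\xbb\xbf\x20"

# Reading 1: the first three bytes are the character U+FEFF (ZWNBSP).
as_content = data.decode("utf-8")

# Reading 2: the first three bytes are a signature and not part of the text.
as_signature = data.decode("utf-8-sig")

# Show the resulting code points for each reading.
print([f"U+{ord(c):04X}" for c in as_content])
print([f"U+{ord(c):04X}" for c in as_signature])
```

Nothing in the bytes themselves says which decoder to pick; the choice has
to come from a label or convention outside the data.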

I think the second point is probably the point you were trying to make all
along. This is a problem, and I'm inclined to say that it ought to be
resolved by UTC.

>The counterargument is that the sequence U+FEFF U+0020 simply makes no sense,
>and the case is not worth worrying about. The rejoinders to *that* are: 1) it
>can be represented in UTF-16 of any flavor, and the mapping from UTF-16 to
>UTF-8 must be 1-1 and reversible, and 2) there is no such thing in Unicode as a
>forbidden sequence of characters.

I agree.
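
As a rough check of the reversibility point, here is a Python sketch that
carries the sequence U+FEFF U+0020 from UTF-16BE to UTF-8 and back. It
assumes Python's "utf-16-be" and "utf-8" codecs, neither of which strips an
initial U+FEFF; the variable names are just for illustration.

```python
# The character sequence under discussion: U+FEFF followed by U+0020.
text = "\ufeff\u0020"

# Serialize as UTF-16BE, where an initial U+FEFF is an ordinary character.
utf16be = text.encode("utf-16-be")

# Convert UTF-16BE -> UTF-8 by way of the character sequence.
utf8 = utf16be.decode("utf-16-be").encode("utf-8")

# Convert back UTF-8 -> UTF-16BE; a reversible mapping must return the same bytes.
round_tripped = utf8.decode("utf-8").encode("utf-16-be")

print(utf16be.hex(), utf8.hex(), round_tripped.hex())
```

The round trip only works because neither decoder discards the leading
U+FEFF; a decoder that treated it as a signature would lose a character.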

Peter Constable



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:04 EDT