UTF-16 Encoding Scheme and U+FFFE
petercon at microsoft.com
Wed Jun 4 10:54:37 CDT 2014
How did the word “prohibited” enter this conversation?
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Philippe Verdy
Sent: June 3, 2014 11:54 PM
To: Richard Wordingham
Cc: unicode at unicode.org
Subject: Re: UTF-16 Encoding Scheme and U+FFFE
U+FFFE is prohibited in interchanges because, if an interchange specifies the UTF-16 encoding scheme (not UTF-16BE or UTF-16LE), it would be interpreted as a BOM when it occurs at the start of a stream (with the consequence of reparsing it as U+FEFF with bytes swapped). In all other positions where it cannot be a BOM.
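A minimal sketch of this misreading, assuming Python's built-in "utf-16" codec as a stand-in for a reader that only knows the stream is "UTF-16" (the variable names are hypothetical, for illustration only):

```python
# Text that really starts with U+FFFE, serialized in big-endian order
# and sent without a label saying "UTF-16BE".
big_endian_bytes = "\ufffeA".encode("utf-16-be")  # bytes FF FE 00 41

# A reader told only "UTF-16" sees FF FE, takes it for a little-endian BOM,
# consumes it, and decodes the rest with the bytes swapped.
misread = big_endian_bytes.decode("utf-16")

# A reader told "UTF-16BE" keeps U+FFFE as an ordinary code point.
correct = big_endian_bytes.decode("utf-16-be")

print(repr(misread))   # the leading U+FFFE is gone and 'A' has become U+4100
print(repr(correct))
```

The same four bytes give two different texts depending only on the encoding label, which is the hazard described above.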
BOMs are *normally* only authorized in interchanges at the "start" of streams.
But this is a problem for "live" streams that have no defined "start" but can be synced at random positions (such as on the next newline, or at the start of a network datagram; note, however, that some network layers may fragment datagrams, so that BOMs could be repeated, and may also reunite them, leaving multiple BOMs in the same datagram). So we can assume that U+FFFE anywhere in a UTF-16 "live" stream (not a UTF-16BE or UTF-16LE stream) is each time a BOM, and not a legacy ZWNBSP or a noncharacter.
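One possible reading of this rule, sketched in Python under stated assumptions: every 16-bit unit that reads as 0xFFFE in the current byte order is taken as a byte-swapped BOM (so the order flips), and every 0xFEFF is taken as a BOM in the current order; both are dropped. The function name and the default starting order are hypothetical, not taken from any standard.

```python
def decode_live_utf16(data: bytes, big_endian: bool = True) -> str:
    """Decode a UTF-16 "live" stream, treating every BOM as a resync marker."""
    units = bytearray()
    # Walk the stream two bytes at a time; a trailing odd byte is ignored.
    for i in range(0, len(data) - 1, 2):
        order = "big" if big_endian else "little"
        unit = int.from_bytes(data[i:i + 2], order)
        if unit == 0xFFFE:
            # Byte-swapped BOM: the sender uses the other byte order.
            big_endian = not big_endian
            continue
        if unit == 0xFEFF:
            # BOM in the current order: nothing to change, just drop it.
            continue
        # Re-serialize every other unit in one fixed order for decoding.
        units += unit.to_bytes(2, "big")
    return units.decode("utf-16-be")


# Two fragments from senders with different byte orders, each with its BOM.
stream = "\ufeffHi".encode("utf-16-le") + "\ufeff!".encode("utf-16-be")
print(decode_live_utf16(stream))
```

Under this reading, U+FFFE and U+FEFF can never survive as text content in such a stream, which is why neither can be used for anything but a BOM there.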
Streams that are known to be UTF-16BE or UTF-16LE are also not recommended for interchange if these files or live streams may be transmitted without metadata specifying their encoding explicitly (many remote readers will then interpret them as UTF-16 instead, possibly with multiple BOMs in resynchronizable live streams).
The problem of live streams is also a good reason why ZWNBSP (U+FEFF) has been strongly discouraged in interchanges in favor of WORD JOINER (U+2060). This also applies to all other conforming UTFs (including UTF-8, UTF-16BE, UTF-16LE, UTF-32, UTF-32LE, UTF-32BE), where it is strongly recommended not to use U+FEFF and U+FFFE except for BOMs (possibly repeated on live streams).
You should note that conforming processes working in interchanges (or storage) should always be allowed to switch from one standard UTF to another. And the same encoded streams may be consumed by various clients having different native byte orders. It has now become difficult to define what a "local" system is, when applications are converted to work in a cloud with more and more heterogeneous clients and more intermediate third parties (providing things like caching, archiving, proxying, backup of data and restoration on another system...).
For long-term reusability of data, we are strongly encouraged not to use U+FFFE and U+FEFF except for BOMs, and we should be tolerant about the number of BOMs found. In my opinion, UCA implementations should discard them on input, treating them as fully ignorable, except for delimiting combining base characters for the purpose of normalisation, which conforming applications or intermediate filters should be allowed to perform as they want. And we should absolutely forget the legacy semantics of ZWNBSP.
But this complexity and tolerance for one or more BOMs also means that all UTFs not based on 8-bit code units should also be discouraged in interchanges. This means that UTF-16 and UTF-32 should be discouraged, leaving only UTF-16BE or UTF-16LE or UTF-32BE, not for storage or networking, but for temporary streams in memory used by the "black box" internally implementing each conforming process. For all the rest, most applications now use UTF-8, possibly packaged within a generic compressed stream (binary compression of live streams remains possible, even if you cannot predict from the text encoding where the resynchronization points will occur: it's up to the protocol using this transport compression to properly define the resynchronization points).
In UTF-8 streams we can completely omit U+FFFE and U+FEFF, whether as BOMs, ZWNBSP or noncharacters (and we can also expect that many applications will just discard them silently, as they only have a "no-op" role as BOMs in 8-bit streams). If an application outputs an 8-bit stream that is not UTF-8, it will drop all U+FEFF and U+FFFE found in its input, and will often output the encoding of U+FEFF in the non-UTF-8 encoding it generates, frequently as a "magic" signature of this encoding. Secure digital signatures of text streams should also ignore these code units silently, as these code units won't be relevant elsewhere in the chain of producers or consumers of this data. These secure digital signatures should be computed by dropping these discardable U+FEFF and U+FFFE, normalizing the data (for example to NFC or NFD), and producing a specific UTF; the easiest one to avoid complications is UTF-32BE or UTF-32LE with a predetermined byte order, as specified by the digital signature algorithm.
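A minimal sketch of such a canonicalization in Python, assuming NFC, UTF-32BE and SHA-256 as the choices a signature algorithm might fix (all three are assumptions for illustration, as are the function names; a real signature would sign the digest with a key):

```python
import hashlib
import unicodedata

# Code points treated as discardable before signing (per the text above).
DISCARDABLE = {"\ufeff", "\ufffe"}


def canonical_bytes(text: str) -> bytes:
    """Drop U+FEFF/U+FFFE, normalize to NFC, serialize as BOM-less UTF-32BE."""
    stripped = "".join(ch for ch in text if ch not in DISCARDABLE)
    normalized = unicodedata.normalize("NFC", stripped)
    # The "utf-32-be" codec has a fixed byte order and writes no BOM.
    return normalized.encode("utf-32-be")


def text_digest(text: str) -> str:
    """Hash the canonical form; this digest is what would then be signed."""
    return hashlib.sha256(canonical_bytes(text)).hexdigest()


# Same text, once with a leading BOM and decomposed, once precomposed
# with a stray U+FEFF at the end: both give the same digest.
print(text_digest("\ufeffcafe\u0301") == text_digest("caf\u00e9\ufeff"))
```

Note that this sketch strips U+FEFF before normalizing, so a U+FEFF sitting between a base character and a combining mark no longer separates them; a scheme that wants to keep that delimiting role would have to normalize first.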
Additionally, it will be very easy to use as many U+FEFF code units as needed as ignorable extra BOMs, for cases where a protocol needs a safe "padding filler", if it wants to use fixed-size block I/O with random access and easy resynchronization (in live streams), when the producer safely breaks data blocks at boundaries of combining sequences (allowing these blocks to be normalized separately and reunited later without creating problems).
2014-06-04 1:50 GMT+02:00 Richard Wordingham <richard.wordingham at ntlworld.com>:
On Tue, 3 Jun 2014 21:28:05 +0000
Peter Constable <petercon at microsoft.com> wrote:
> There's never been anything preventing a file from containing and
> beginning with U+FFFE. It's just not a very useful thing to do, hence
> not very likely.
Well, while U+FFFE was apparently prohibited from public interchange,
one could be very confident of not finding it in an external file. As
an internally generated file, it would then be much more likely to be
in the UTF-16BE or UTF-16LE encoding scheme.