From: verdy_p (verdy_p@wanadoo.fr)
Date: Sun Dec 27 2009 - 17:26:56 CST
"Doug Ewell" wrote:
> "verdy_p" wrote:
> > If I look at UTF-32BE or UTF-32LE, it has only 4 states (you have to
> > merge the final states with the initial state). Mixing them and
> > supporting the optional BOM requires adding 3 other states so you have
> > finally 11 states for UTF-32. With UTF-8 you only have 10 states (if
> > you count them for each possible length, and merge the final states
> > with the initial state), one less than UTF-32. So UTF-8 still wins: it is
> > LESS stateful than UTF-32...
>
> Usually, at least on this list, the transient information needed while
> parsing multiple bytes into a single code point isn't thought of as
> "state." When you parse multiple bytes into an integer value of some
> sort, and still have to apply additional knowledge to turn THAT into a
> code point (as in ISO 2022 or UTF-16), that is state.
I disagree: I am not counting the additional integer state variables needed to store the semantic values of the
first bytes of a sequence, but only the enumerated states of the finite state automaton needed to parse the stream.
Clearly UTF-16LE (or BE) wins. Unfortunately, HTML5 now absolutely wants the BOM to be encoded (meaning that parsers
will have to handle its states and verify them), and this means that UTF-16BE and UTF-16LE are disqualified.
This means that you have to handle the BOM, and the two BE and LE alternatives of UTF-16, in the same automaton. Who
said that HTML5 wanted to promote the simplest implementations?
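To illustrate what handling the BOM costs, here is a minimal sketch in C (the names are mine, purely illustrative,
not from the HTML5 spec or any library): before the steady-state decoding loop can even start, the automaton needs
extra states just to examine the first two bytes:

    #include <stddef.h>

    /* Minimal sketch: the extra front-end states needed to detect the
       byte order of a UTF-16 stream from an optional BOM. */
    typedef enum { UTF16_BE, UTF16_LE, UTF16_UNKNOWN } Utf16Order;

    Utf16Order detect_utf16_bom(const unsigned char *buf, size_t len) {
        if (len >= 2) {
            if (buf[0] == 0xFE && buf[1] == 0xFF) return UTF16_BE; /* U+FEFF, big-endian */
            if (buf[0] == 0xFF && buf[1] == 0xFE) return UTF16_LE; /* U+FEFF, little-endian */
        }
        return UTF16_UNKNOWN; /* no BOM: the caller must assume a default order */
    }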
UTF-32 does not even need any BOM (because it is self-ordered by the position of the NUL byte). The bad thing is
that it is clearly a waste of space, and UTF-32 does not work at all within null-terminated C/C++ strings (the same
is true for UTF-16 in its three variants, because the code units are 16 bits, and the low-order byte of each code
unit can be null).
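To make the "self-ordered" argument concrete, here is a minimal sketch (the names are mine): every code point is at
most U+10FFFF, so the high-order byte of each 32-bit code unit is always 0x00, and its position within a code unit
reveals the byte order without any BOM:

    #include <stddef.h>

    typedef enum { UTF32_BE, UTF32_LE, UTF32_UNKNOWN } Utf32Order;

    Utf32Order detect_utf32_order(const unsigned char *buf, size_t len) {
        if (len >= 4) {
            if (buf[0] == 0x00 && buf[3] != 0x00) return UTF32_BE; /* 00 xx xx xx */
            if (buf[3] == 0x00 && buf[0] != 0x00) return UTF32_LE; /* xx xx xx 00 */
        }
        /* Ambiguous for this unit (e.g. a code point whose low byte is
           also zero): examine the following code units the same way. */
        return UTF32_UNKNOWN;
    }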
But null-terminated strings should really be considered unsafe datatypes anyway: they add their own complication of
handling BOTH the null termination of strings AND the allocated length of buffers, when a single representation of
the length would be enough. Pascal-style strings are safe (but limited in their length, if it is stored as a single
byte). Java-style strings are safe and do not require any specific handling of the null byte (which can be
eliminated very early in input stream decoders, since NULL bytes are already invalid in conforming documents,
including all HTML4, HTML5, and XML, when using a multibyte decoder for encodings that do not use the NULL byte to
represent code units larger than 8 bits).
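A minimal sketch of the length-prefixed alternative I mean (the type is mine, illustrative only):

    #include <stddef.h>
    #include <stdint.h>

    /* Length-prefixed ("Java-style") string: the length is stored once,
       so an embedded 0x0000 code unit cannot silently truncate the
       string, and no separate buffer size needs to be tracked. */
    typedef struct {
        size_t    length;  /* number of code units, not a sentinel */
        uint16_t *units;   /* UTF-16 code units; may legally contain 0x0000 */
    } String16;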
BOCU-1 is also compatible with the safe-encoding requirement (where NUL bytes are rejected to avoid security issues
related to unexpected string truncation, for example in SQL requests where the encoded string could be injected, if
using embedded strings instead of the variable-binding mechanism).
> > Clearly, UTF-16BE and UTF-16LE are the simplest encodings, with fewer
> > states; they will probably be more secure and definitely faster to
> > compute for very large volumes at high rates (such as in memory).
>
> Because of the surrogate mechanism, there is no way I personally would
> consider UTF-16 to be "simpler" than UTF-32. In the best case, it is
> "as simple as" UTF-32. It has other advantages, mostly related to size,
> but simplicity over UTF-32 is not one of them.
Really, you can't make a distinction between states like you do here. A state is a state (in terms of finite state
automata: it is NOT an integer value but an arbitrarily numbered enumerated value). And all multibyte encodings need
at least one additional integer value to store the weighted values of the previous bytes composing a sequence, or at
least a small buffer for storing these bytes so that they can be decoded only at the end of the operation separating
the byte sequences.
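To make this concrete, here is a minimal UTF-8 decoding sketch (my own illustration, not ICU code; error handling
for invalid or overlong sequences is omitted): the enumerated automaton states on one side, the single integer
accumulator for the weighted byte values on the other:

    #include <stdint.h>

    /* States count how many continuation bytes are still expected. */
    typedef enum { S_INITIAL, S_NEED1, S_NEED2, S_NEED3 } Utf8State;

    typedef struct {
        Utf8State state;  /* enumerated automaton state */
        uint32_t  accum;  /* weighted value of the bytes consumed so far */
    } Utf8Decoder;

    /* Feed one byte; returns the decoded code point, or -1 while more
       bytes are still needed. */
    int32_t utf8_feed(Utf8Decoder *d, uint8_t b) {
        switch (d->state) {
        case S_INITIAL:
            if (b < 0x80) return b;                      /* single-byte sequence */
            if ((b & 0xE0) == 0xC0)      { d->accum = b & 0x1F; d->state = S_NEED1; }
            else if ((b & 0xF0) == 0xE0) { d->accum = b & 0x0F; d->state = S_NEED2; }
            else if ((b & 0xF8) == 0xF0) { d->accum = b & 0x07; d->state = S_NEED3; }
            return -1;
        case S_NEED3: d->accum = (d->accum << 6) | (b & 0x3F); d->state = S_NEED2; return -1;
        case S_NEED2: d->accum = (d->accum << 6) | (b & 0x3F); d->state = S_NEED1; return -1;
        case S_NEED1:
            d->state = S_INITIAL;
            return (int32_t)((d->accum << 6) | (b & 0x3F));
        }
        return -1;
    }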
Separating the byte sequences that make up a single character is definitely simpler with UTF-16 (even with the
surrogates, which are easy to pair and count directly within the finite state automaton) than with UTF-8. The source
code for the decoder of the three UTF-16 variants in ICU is even smaller than for UTF-8 (this is a good indicator as
well of code correctness and security, as longer code requires more complex tests to reach full code coverage). It
is also much simpler to reject valid sequences representing forbidden characters with UTF-16 than with UTF-8
(considering the subset of UCS characters allowed in HTML4, HTML5, XHTML, XML, CSS, Javascript...).
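For comparison, here is the UTF-16 equivalent under the same conventions (again my own sketch, not the ICU decoder;
unpaired-surrogate errors are omitted): pairing surrogates requires only one extra enumerated state:

    #include <stdint.h>

    typedef enum { U16_INITIAL, U16_GOT_HIGH } Utf16State;

    typedef struct {
        Utf16State state;
        uint32_t   high;   /* pending high (lead) surrogate */
    } Utf16Decoder;

    /* Feed one 16-bit code unit; returns a code point, or -1 while the
       low (trail) surrogate is still expected. */
    int32_t utf16_feed(Utf16Decoder *d, uint16_t unit) {
        if (d->state == U16_GOT_HIGH) {
            d->state = U16_INITIAL;
            /* combine: 0x10000 + (high - 0xD800) * 0x400 + (low - 0xDC00) */
            return 0x10000 + (int32_t)((d->high - 0xD800) << 10) + (unit - 0xDC00);
        }
        if (unit >= 0xD800 && unit <= 0xDBFF) {   /* high surrogate */
            d->high = unit; d->state = U16_GOT_HIGH; return -1;
        }
        return unit;  /* BMP code point */
    }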
Using BOM-less UTF-32 (with local native byte ordering, using 32-bit code units as a whole instead of multiple
bytes, and aligning them in memory for performance reasons) will still remain less efficient than UTF-16.
Note that applications do not simply have to consider only the code units for treating characters in isolation: very
often, they have to consider sequences of characters, for handling string normalization or just because a single
character is not meaningful enough linguistically:
There's no use in actual data for defective sequences, when applications will need to work at least at the level of
grapheme clusters. At this level, everything has variable length (independently of the encoding chosen). What is
chosen at the single-character level is not relevant to application design and its security. We are always speaking
about how to handle variable-length strings, and it is most often at this level that security issues appear: the
fact that this text will use UTF-16 (for compactness, and so for faster processing with higher efficiency of data
caches, as long as memory remains addressable at least at every 16-bit boundary) or UTF-32 will not change this.
UTF-8 will still keep its advantages for transmission in heterogeneous environments like networks and storage
(including storage on network services like database servers, or on removable media and mounted filesystems, which
can both be used directly by external applications, including for purely local administration purposes with
alternate tools), but only because of its independence from byte ordering (and it is really poor for Asian texts),
when it is used for storing relatively small documents. But for massive storage of many texts or very large texts,
UTF-8 will remain quite poor: you'll still need an external compressor. For their transmission over a relatively
slow network like an Internet link, you'll still use classic binary compressors (either within an archive file
format, or within the transmission protocol).
Asian users will hate HTML5 if it forbids them to use BOCU-1 or SCSU and forces them to use the costly UTF-8
encoding (it's possibly not a problem in Japan or South Korea, where Internet speed is much higher than in the rest
of the world, but billions of users in China and India will hate HTML5: one third of the whole of humanity, isn't
that important enough?). Now consider users in Russia, or South-Eastern Europe, and in the Middle East. Their
presence on the Internet is also very developed, but HTML5 will not be for them. Clearly HTML5 is extremely biased
in favor of countries that mostly speak only English (or some languages that use a Latin alphabet with a relatively
low usage of non-ASCII characters) and that are still quite late at delivering decent Internet speed across their
whole territory because they have very large rural areas with low population density (this includes the USA and
Canada, but also Brazil, and in fact many European countries as well, except the smallest ones without complicated
geographies, like the Benelux).
Most small island countries that also have very slow or costly Internet use English or French. They won't be
impacted much by the encoding bias chosen in HTML5. But they represent a very tiny market and a small population.
Ignoring the Middle East, China, India, Russia, Thailand, and Indonesia (at least) is a severe error. HTML5 is just
saying to them: simply don't use any Unicode-based encoding; keep your existing national encodings (or use one of
the legacy Windows encodings). I really think it is stupid to forbid any Unicode-based encoding in HTML5.
In fact, I would have much preferred to see HTML5 forbid all non-Unicode-based encodings, or those that are not free
of patents, or that may be mapped ambiguously: this would have meant forbidding the Windows encodings, as well as
US-ASCII, the ISO 8859 encodings, BOCU-1, ISCII, VISCII, GB2312 and GB18030, TIS-620, and all EBCDIC variants, and
the various PC codepages made by IBM, Microsoft, Apple, Adobe, ... This would have really saved a lot of
programmers' time.
There's absolutely no time lost in accepting all the standard UTFs or SCSU, as this effort will benefit ALL and will
allow interesting alternatives for specialized environments or computing architectures where alternate encodings
could be better. The rejection of SCSU in HTML5 is completely stupid and counter-productive.