Re: HTML5 encodings

From: verdy_p (verdy_p@wanadoo.fr)
Date: Fri Jan 01 2010 - 10:13:14 CST

    "Doug Ewell" wrote:
    > "verdy_p" wrote:
    >
    > > The [BOCU-1] reset byte can be used for something more useful: it can
    > > be used as a key separator when sorting for example lists of
    > > multicolumn output with priority between columns, even if each column
    > > is sorted in binary codepoint order. The separator is actually not a
    > > character, but represents a metacharacter that will be higher than
    > > everything else, so it can effectively terminate all binary encoded
    > > strings (when they are different), and maintain their relative
    > > ordering; the following sort keys (further data columns) appended
    > > after it will not break the sort order of distinct level-1 keys, but
    > > you'll be able to binary sort on the second column when two rows have
    > > binary identical first columns...
    >
    > Unicode, and even ASCII, contains plenty of seldom-used control
    > characters, with defined semantics if that is desirable, which an
    > internal process can safely insert, use, and remove for purposes like
    > this.

    No, you're wrong: there is no such character. If it existed, it would also have a use within normal strings that
    can be part of a primary key, and that would break the logic. A "seldom used" character does not qualify, because
    the separator will conflict with that rare use, so it is unavoidably UNUSABLE to insert/use/remove for such a
    purpose.
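
    The conflict described above can be shown in a few lines. This is only a sketch: it uses UTF-8 rather than BOCU-1
    (Python ships no BOCU-1 codec), and the helper name join_in_band and the choice of U+001F as the "seldom-used"
    separator are assumptions made for illustration.

```python
# Hypothetical separator: U+001F (UNIT SEPARATOR), seldom used but still a valid character.
SEP_CHAR = "\u001f"


def join_in_band(col1: str, col2: str) -> bytes:
    """Build a compound key by joining two columns with an in-band character."""
    return (col1 + SEP_CHAR + col2).encode("utf-8")


# Two different rows whose column data happens to contain the separator character.
row_a = ("a\u001fb", "c")
row_b = ("a", "b\u001fc")

key_a = join_in_band(*row_a)
key_b = join_in_band(*row_b)

# The rows differ, yet their compound keys are byte-for-byte identical,
# so the key can no longer be split back into its columns unambiguously.
collision = key_a == key_b
```

    As soon as arbitrary plain text is allowed in the columns, any valid character chosen as separator can produce
    this kind of collision.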

    The BOCU-1 RESET code is NOT a character, and what I wrote was exactly the kind of use where it can be beneficial,
    because BOCU-1 was designed with the express purpose of being a binary-ordered encoding suitable for collation
    according to the scalar values of code points. If you reserve one Unicode character anywhere in the UCS as such a
    separator, you'll have to parse the string to exclude it (= filter it) in order to split the string into multiple
    components, then sort the data columns hierarchically with separate binary string compares.
    BOCU-1 was made explicitly to allow the encoded string to ALSO be a collation key (for the binary Unicode collation
    order).

    I DID NOT say that a RESET code needed to be inserted in the plain text, but inserting it in a collation key as a
    key separator DOES NOT violate the rule, as we can fully guarantee that it will:
    - never be present in encoded plain text
    - always sort AFTER any valid Unicode character
    - not be ignored.

    Even within the UCA algorithm, there's a clear indication that such a code, higher than every collation element
    for characters, can be used for separating the multi-level parts of the full collation key; it is also given as the
    simplest way to create a single compound collation key from multiple hierarchical collation elements.
    Just think about a SQL "ORDER BY" clause followed by at least two columns: if these two columns can contain
    arbitrary Unicode plain text, and the collation order must still be Unicode binary, you CANNOT reserve any valid
    character for this purpose, and you'll need an additional separator code that is not a character. This code MUST
    be binary higher than everything else, and the RESET code in BOCU-1 fulfills this requirement.
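
    A minimal sketch of the two-column "ORDER BY" case follows. It is an analogy, not BOCU-1 itself: the text is
    encoded as UTF-8 (whose byte order also follows code point order and which never produces the byte 0xFF), and
    0xFF stands in for the BOCU-1 RESET byte. The names compound_key and split_key and the sample rows are
    hypothetical.

```python
# Stand-in for the BOCU-1 RESET byte: 0xFF never occurs in valid UTF-8.
SEP = b"\xff"


def compound_key(*columns: str) -> bytes:
    """Concatenate the encoded columns into one binary sort key."""
    return SEP.join(c.encode("utf-8") for c in columns)


def split_key(key: bytes) -> tuple:
    """Recover the columns; safe because SEP cannot occur inside a column."""
    return tuple(part.decode("utf-8") for part in key.split(SEP))


rows = [("beta", "2"), ("alpha", "z"), ("alpha", "a"), ("gamma", "\u001f")]

# One flat binary compare on the compound key ...
by_key = sorted(rows, key=lambda r: compound_key(*r))
# ... versus a hierarchical, column-by-column binary compare.
by_tuple = sorted(rows, key=lambda r: tuple(c.encode("utf-8") for c in r))

same_order = by_key == by_tuple
```

    For these rows both orderings agree, and rows with identical first columns are ordered by the second column. One
    case to check against your own data: when one first-column value is a strict prefix of another ("a" and "ab"), a
    separator higher than every content byte sorts the shorter value after the longer one, whereas a column-by-column
    compare sorts it first. The sample rows deliberately avoid that case.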

    Compound collation keys are also useful for sorted indexes in a database: they are simpler than using
    hierarchical compares, notably if the index is compressed (for example in representations that store common prefixes
    only once). Each time, you'll need a string terminator or separator which MUST be different from all the rest. And
    if you need binary ordering, this cannot be the NULL character.

    And I still maintain that the special RESET code in BOCU-1 should NEVER be present in any encoded plain text (as
    it effectively has the potential to create multiple distinct encodings for equivalent texts). But I still think
    that it can be used as a VERY SAFE separator or terminator of plain-text strings (much safer than the NULL
    character commonly used in C strings, which is ambiguous, and which would not satisfy the binary ordering
    requirement even if it were reserved for this role alone and forbidden in plain text by a security check. That
    NULL byte is kept unchanged in BOCU-1, where it remains reserved for encoding the NULL character and also acts as
    a reset byte, but it has the wrong value to serve as a separator for the binary sort order of multiple keys).

    The RESET code of BOCU-1 can of course be a security problem, but you can safely exclude it early: it is
    absolutely not needed for plain text, so excluding it will not alter the text. Forbidding it will not cause more
    problems by creating new interpretations, whereas filtering it silently will cause problems such as turning Basic
    Latin into Greek.
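
    "Forbid, don't filter" could look like the following sketch. It works on the raw bytes only and assumes the RESET
    byte is 0xFF, as defined for BOCU-1; the function name reject_reset is hypothetical, and no BOCU-1 decoding is
    attempted here.

```python
# The BOCU-1 reset byte; assumed to be the only byte this check cares about.
BOCU1_RESET = b"\xff"


def reject_reset(data: bytes) -> bytes:
    """Return the data unchanged, or fail loudly if it contains a reset byte.

    The byte is deliberately NOT stripped: removing it would change how the
    following difference-encoded bytes are interpreted.
    """
    pos = data.find(BOCU1_RESET)
    if pos != -1:
        raise ValueError(f"BOCU-1 reset byte found at offset {pos}")
    return data
```

    A caller would run this check on incoming BOCU-1 data before decoding it, and refuse the input on ValueError
    rather than trying to repair it.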

    As a MIME encoding (for interchange), BOCU-1 should never have been registered the way it is: the RESET code should
    have been excluded (and reserved for purely internal purposes within processes like collation and sorting).

    "RESET-less BOCU-1", by contrast, is as safe as UTF-8, and like all the other UTFs it can contain null bytes. And
    like UTF-8 it is independent of the byte order.

    So it does not strictly need a leading BOM. (My opinion is that, for interchange purposes, BOMs should be allowed
    in ALL encodings that can represent the U+FEFF code point, and that this code point should exclusively represent
    a BOM, with no ZWNBSP semantics: if needed, one could replace every ZWNBSP by ZWJ, making sure that all final
    renderers are able to render it.)

    All the legacy problems with the BOM would have been much simpler if it had been mapped to a noncharacter (exactly
    like U+FFFE) instead of a legacy format control (like U+FEFF), but it is now too late to change it or to recommend
    some other code point. All that can be done is to make sure that U+FEFF is used exclusively as a BOM in
    interchange formats (anywhere in a stream, not just at its beginning, even if this looks superfluous), in a
    way that allows it to be freely inserted or deleted anywhere in any internal or external text processing.
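
    Under that proposal (U+FEFF is only ever a BOM, never a ZWNBSP), inserting and deleting it becomes trivial. The
    sketch below assumes exactly that rule, which is an opinion stated here and not current Unicode practice; the
    function names strip_boms and with_leading_bom are hypothetical.

```python
import codecs

BOM = "\ufeff"


def strip_boms(text: str) -> str:
    """Delete U+FEFF wherever it occurs, treating it as a pure signature."""
    return text.replace(BOM, "")


def with_leading_bom(text: str, encoding: str) -> bytes:
    """Encode text with exactly one leading U+FEFF in the given encoding."""
    return (BOM + strip_boms(text)).encode(encoding)


# The same code point yields the familiar signatures in each encoding form.
utf8_bytes = with_leading_bom("abc", "utf-8")
utf16le_bytes = with_leading_bom("a\ufeffbc", "utf-16-le")

utf8_ok = utf8_bytes == codecs.BOM_UTF8 + b"abc"
utf16le_ok = utf16le_bytes == codecs.BOM_UTF16_LE + "abc".encode("utf-16-le")
```

    Any encoding able to represent U+FEFF could be handled the same way, which is the point of allowing the BOM in
    all of them.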
