Re: IJ joint in spaced lettering

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Jan 11 2006 - 17:30:58 CST

    From: "Kenneth Whistler" <kenw@sybase.com>
    >> Another related question: Why isn't there a standard 16-bit UTF
    >> that preserves the binary ordering of codepoints?
    >> (I mean for example UTF-16 modified simply by moving all
    >> code units or code points in E000..FFFF down to D800..F7FF
    >> and moving surrogate code units in D800..DFFF up to F800..FFFF).
    >
    > Huh? Because it would confuse the hell out of everybody and lead
    > to problems, just like any other putative fixes by proliferation
    > of UTF's.
    >
    > Sorting UTF-16 in binary order is easy. See "UTF-16 in UTF-8 Order",
    > p. 136 of TUS 4.0.

    I don't say it is not easy to do. What I just indicated is that there are applications where one really wants a pure binary sort order, and where it would also be good for that order to preserve the order of codepoints (as with UTF-8 and UTF-32, but not with UTF-16).
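
    To make the point concrete, here is a minimal Python sketch of the remapping described in the quoted text above. The function names (remap_unit, utf16_units, remapped_units) are made up for this illustration and belong to no standard or library; the result is a private variant, not UTF-16.

```python
def remap_unit(u):
    """Remap one 16-bit UTF-16 code unit (hypothetical private variant)."""
    if 0xD800 <= u <= 0xDFFF:
        return u + 0x2000   # surrogates D800..DFFF move up to F800..FFFF
    if 0xE000 <= u <= 0xFFFF:
        return u - 0x0800   # E000..FFFF moves down to D800..F7FF
    return u                # 0000..D7FF is left unchanged


def utf16_units(text):
    """Return the standard UTF-16 code units of a string as integers."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def remapped_units(text):
    """Return the code units of a string in the remapped variant."""
    return [remap_unit(u) for u in utf16_units(text)]


a = "\uFF5E"       # a BMP character above the surrogate range
b = "\U00010000"   # the first supplementary character

print(a < b)                                   # codepoint order
print(utf16_units(a) < utf16_units(b))         # binary order of plain UTF-16
print(remapped_units(a) < remapped_units(b))   # binary order after remapping
```

    Plain UTF-16 sorts U+FF5E after U+10000, because the lead surrogate D800 is smaller than FF5E, while the remapped units compare in the same order as the codepoints.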

    Maybe what you are replying there is that Unicode does not want to add more standard UTFs, and instead prefers to insist that such UTFs remain private (requiring explicit agreements between users, or using private internal interfaces and APIs, so that no public standard needs to be defined).

    Hey! I did not want to name such a variant "UTF-16", a name which is clearly permanently reserved. It's just that alternative UTFs are still possible without affecting full conformance with the Unicode standard, provided they have the same properties required of all UTFs: they MUST preserve the exact encoding of all valid codepoints between U+0000 and U+10FFFF, including non-characters, and they must not change their relative encoding order in strings, so that all normalization forms and denormalizations are preserved. All this means that there must exist a bijection between all UTFs applied to all Unicode strings.
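
    The bijection requirement can be checked mechanically for the three documented UTFs. The following is only a sketch using Python's built-in codecs; the helper names (round_trips, convert) and the sample string are assumptions chosen for this illustration.

```python
# Sample covering ASCII, Latin-1, two non-characters, and both ends
# of the supplementary range.
SAMPLE = "A\u00E9\uFDD0\uFFFF\U00010000\U0010FFFF"


def round_trips(text, encodings=("utf-8", "utf-16-be", "utf-32-be")):
    """True if encoding then decoding gives back the same codepoint sequence."""
    return all(text.encode(enc).decode(enc) == text for enc in encodings)


def convert(data, src, dst):
    """Convert encoded bytes from one UTF to another via the codepoints."""
    return data.decode(src).encode(dst)


utf8 = SAMPLE.encode("utf-8")
utf16 = convert(utf8, "utf-8", "utf-16-be")
utf32 = convert(utf16, "utf-16-be", "utf-32-be")

print(round_trips(SAMPLE))
print(convert(utf32, "utf-32-be", "utf-8") == utf8)
```

    A private UTF would be expected to pass the same kind of round-trip test against each of the standard ones, non-characters included.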

    If this is still not clear enough, the standard should insist that it documents 3 UTFs explicitly, with several byte ordering options for endianness, but that this does not restrict full conformance to these only. In fact Unicode also approves SCSU and BOCU-1, and because they respect the bijection rule, they are already compliant UTFs. (Last year the case of GB18030 was discussed, and it was shown that it is not a compliant UTF unless its exact version is first specified explicitly.)

    But it should be clear in the standard that they are just examples of valid UTFs, recommended for interchange across heterogeneous systems or networks, and that applications can use their own alternate representation as needed to comply with other needs. For example, any attempt to make a standard UTF fit on platforms with a 64-bit or 80-bit word size would already require an extension, which cannot strictly be equal to any standardized UTF; even if it is just a simple zero-bit padding, it requires an additional specification for the validity of binary interfaces.


