RE: Roundtripping Solved

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 15 2004 - 09:49:20 CST


    Arcane Jill wrote:

    > solution, again without breaking the Unicode model. If I have

    > It is for reasons of requirement (4) that Lars proposes the introduction
    > of 128 BMP codepoints. His intention is that they be marked as "reserved -
    > do not use", so that requirement 4 is met.

    Actually, Jill, they are not reserved. No more than U+0041 is reserved.
    They are simply dedicated to a particular use, which is not true of my PUA
    solution.

    And my solution does not break the Unicode model. The proposal would break
    the Unicode model if my conversion were to replace the now-standard
    conversion. I can even show that the consequences of that would be no more
    serious than the filesystem problem I am solving. But at this point, I am
    not proposing that. I am proposing merely that these codepoints be assigned.

    Breaking the model is not why UTC is refusing to consider this proposal.
    Some possible reasons:

    * UTC feel that allowing (well, encouraging) a new way of handling invalid
    sequences might slow down the transition.
    * UTC feel that allowing (well, encouraging) a new way of handling invalid
    sequences might lead to late detection of mislabelled data.
    * UTC feel that the problem in question has nothing to do with Unicode.
    * UTC feel that by stating filenames are binary data, they have solved the
    problem, ignoring the cost they may be causing.
    * UTC should have realized the need for these codepoints years ago, but now
    prefer to stick with the original decision.

    As for your solution, I didn't really analyze it. But it is escaping, isn't
    it? With a lot of overhead. Filesystems have limitations. Say up to 255
    characters for a filename. Representing a filename that is 255 (Unicode)
    characters long from Windows on UNIX (in UTF-8) is not always possible.
    There is not much we can do about it. But representing a filename that is
    255 characters (chars) long from UNIX on a Windows system? Currently always
    possible. An escaping technique with a lot of overhead breaks that. Hence my
    pleas to consider assigning the 128 codepoints in the BMP, because otherwise
    an invalid sequence consisting of a single Latin 1 character maps to 2
    UTF-16 shorts. And if filesystem limitations can be seen as a somewhat
    unnecessary goal, there is transmission overhead and one other thing: in C,
    you can guess (for performance reasons) the maximum amount of memory you
    need for a certain conversion. And the multipliers are typically around 2
    (bytes per byte). Even a plane other than the BMP raises that to 4; other
    escaping techniques are far worse.
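
    The arithmetic above can be sketched roughly as follows. This is only an
    illustration in Python, not the proposed conversion itself: the two base
    codepoints (U+E000 and U+F0000) are placeholders picked from today's
    private-use ranges, and the handler names and the utf16_units helper are
    made up for the example. Each byte that fails UTF-8 decoding is mapped to
    one escape codepoint, and the result is measured in UTF-16 code units.

```python
import codecs

# Hypothetical escape blocks, chosen only for illustration (both are
# private-use ranges today, not codepoints taken from the proposal).
BMP_BASE = 0xE000       # inside the BMP: one UTF-16 code unit per escape
PLANE15_BASE = 0xF0000  # outside the BMP: a surrogate pair per escape


def _make_handler(base):
    """Build a decode error handler mapping each bad byte to base + (byte - 0x80)."""
    def handler(exc):
        if not isinstance(exc, UnicodeDecodeError):
            raise exc
        bad = exc.object[exc.start:exc.end]
        # Bytes rejected by the UTF-8 decoder are always >= 0x80.
        return "".join(chr(base + b - 0x80) for b in bad), exc.end
    return handler


codecs.register_error("sketch_escape_bmp", _make_handler(BMP_BASE))
codecs.register_error("sketch_escape_plane15", _make_handler(PLANE15_BASE))


def utf16_units(raw, errors):
    """Number of UTF-16 code units needed for a byte-string filename."""
    text = raw.decode("utf-8", errors=errors)
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # 255 bytes of 0xE9 (Latin-1 e-acute), each one invalid as UTF-8.
    name = b"\xe9" * 255
    print(utf16_units(name, "sketch_escape_bmp"))
    print(utf16_units(name, "sketch_escape_plane15"))
```

    Under these assumptions the 255-byte name needs 255 code units with the
    BMP block and 510 with the plane 15 block, so only the first still fits a
    255-unit filename limit. In bytes that is the multiplier of 2 versus 4.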

    Lars


