RE: Roundtripping Solved

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Dec 16 2004 - 02:07:38 CST

    -----Original Message-----
    From: Lars Kristan [mailto:lars.kristan@hermes.si]

    >As for your solution, I didn't really analyze it. But it is escaping, isn't
    >it?

    Yes

    >With a lot of overhead.

    If you call string length "overhead", yes. This was to provide reasonable
    assurance that an escape sequence won't be encountered by accident.

    >Filesystems have limitations. Say up to 255 characters for a filename.
    >Representing a 255 (Unicode) characters long filename from Windows on UNIX
    >(in UTF-8) is not always possible. There is not much we can do about it.
    >But representing a 255 characters (chars) long filename from UNIX on a
    >Windows system? Currently always possible. An escaping technique with a lot
    >of overhead breaks that.

    Now that's frustrating, Lars. Each time I make a suggestion, you come up
    with a new requirement. So now, in addition to all previous requirements,
    you have this additional requirement:

    # for all possible octet sequences s:
    # length of UTF-8(f(s)) <= length of s

    And yet, your own scheme, in which f(x) = { U+EE00 + x } for non-UTF-8 bytes
    x, does not meet that requirement.
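To make the size argument concrete, here is a minimal Python sketch of the U+EE00 + x mapping for a single byte. The helper names `private_use_escape` and `utf8_length` are hypothetical, and the sketch assumes that "non-UTF-8 bytes" means bytes in the range 0x80 to 0xFF.

```python
# Hypothetical helper: the U+EE00 + x mapping applied to one byte.
# Assumption: an isolated invalid byte is always in 0x80..0xFF.
def private_use_escape(x: int) -> str:
    if not 0x80 <= x <= 0xFF:
        raise ValueError("expected a byte in the range 0x80..0xFF")
    return chr(0xEE00 + x)


def utf8_length(text: str) -> int:
    """Number of octets in the UTF-8 encoding of text."""
    return len(text.encode("utf-8"))


# Collect the UTF-8 length of every possible escaped byte.
lengths = {utf8_length(private_use_escape(x)) for x in range(0x80, 0x100)}
print(lengths)  # {3}
```

Every code point from U+EE80 to U+EEFF lies in the three-octet range of UTF-8, so each escaped byte grows from one octet to three, and the length bound stated above cannot hold.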

    >Hence my pleas to consider assigning the 128 codepoints in BMP,

    I'm aware that you're trying to deal with a real issue here, and I have
    sympathy for that, but you really need to drop these pleas. It will never
    happen. (I suspect that campaigning to get the automobile banned in the USA
    is probably a more achievable goal.) That's why I've been trying to help
    out by suggesting alternatives. That's why you need to start thinking along
    those lines.

    >because otherwise an invalid sequence consisting of a single Latin 1
    >character maps to 2 UTF-16 shorts. And if filesystem limitations can be seen
    >as a somewhat unnecessary goal,

    I kind of got the impression that only "escape-aware" applications would
    need to actually get filehandles on the files. "Escape-unaware" processes
    were intermediate things like: being stored in a database; roundtripping
    through UTF-16; etc. And an "escape-aware" application can simply unescape
    before opening the file. Have I misunderstood you?

    > in C, you can guess (for performance reasons) the maximum amount of memory
    > you need for a certain conversion.

    You can do better than guess. You figure it out exactly, to the byte. And
    then allocate exactly the right amount of memory you require. (I am actually
    a programmer).
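As an illustration of computing the size exactly, the sketch below counts the octets first and only then allocates the buffer. It is written in Python rather than C for brevity, and the names `utf8_size` and `exact_utf8_length` are hypothetical. It assumes the input contains only Unicode scalar values (no lone surrogates).

```python
def utf8_size(code_point: int) -> int:
    """Number of octets UTF-8 uses for one Unicode scalar value."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def exact_utf8_length(text: str) -> int:
    """First pass: total octets needed, without encoding anything."""
    return sum(utf8_size(ord(ch)) for ch in text)


sample = "a\u00e9\uee9f\U0001f600"    # 1 + 2 + 3 + 4 octets
needed = exact_utf8_length(sample)
buffer = bytearray(needed)            # allocate exactly what is required
buffer[:] = sample.encode("utf-8")    # second pass: fill the buffer
print(needed)  # 10
```

The same two-pass idea carries over to C: one loop to sum the sizes, one allocation, one loop to write.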

    Anyway, sorry to start sounding negative on this occasion. I have one more
    suggestion for you. I think it's better than your U+EE00+n solution, but
    Doug is likely to tell me it's non-conformant (though I think it should have
    just the same status as an escape sequence, being a private thing outside
    the realm of Unicode). Here we go:

    DEFINITION (1): let H(x) be the function:
    # H( 8) = U+0001
    # H( 9) = U+0002
    # H(10) = U+0003
    # H(11) = U+0004
    # H(12) = U+0005
    # H(13) = U+0006
    # H(14) = U+000E
    # H(15) = U+000F
    with H(x) undefined for all other values of x

    DEFINITION (2): let L(x) be the function L(x) = U+0010 + x

    RULE: to escape an isolated byte in the range 0x80 to 0xFF: let h = the high
    nibble; let l = the low nibble; emit the sequence { H(h), L(l) }. For
    example - to escape the byte 0x9F, you would emit { U+0002, U+001F }. In
    UTF-8 that's { 02 1F }, which is one byte shorter than the UTF-8 encoding of
    U+EE9F.
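The definitions and the rule above can be sketched in Python as follows. The function names `escape_byte` and `unescape_pair` are hypothetical, and the inverse function is an assumption added for illustration: the rule itself only describes escaping.

```python
# DEFINITION (1): high nibble (8..15) -> control code point.
H = {8: 0x01, 9: 0x02, 10: 0x03, 11: 0x04,
     12: 0x05, 13: 0x06, 14: 0x0E, 15: 0x0F}
H_INVERSE = {code: nibble for nibble, code in H.items()}


def L(x: int) -> int:
    """DEFINITION (2): low nibble -> U+0010 + x."""
    return 0x10 + x


def escape_byte(b: int) -> str:
    """RULE: escape one isolated byte in 0x80..0xFF as { H(h), L(l) }."""
    if not 0x80 <= b <= 0xFF:
        raise ValueError("expected a byte in the range 0x80..0xFF")
    return chr(H[b >> 4]) + chr(L(b & 0x0F))


def unescape_pair(pair: str) -> int:
    """Assumed inverse: recover the original byte from an escape pair."""
    high = H_INVERSE[ord(pair[0])]
    low = ord(pair[1]) - 0x10
    return (high << 4) | low


escaped = escape_byte(0x9F)
print([hex(ord(ch)) for ch in escaped])   # ['0x2', '0x1f']
print(len(escaped.encode("utf-8")))       # 2
```

Because both output code points are below U+0080, each encodes as a single octet in UTF-8, so an escaped byte always occupies two octets.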

    RATIONALE: The characters produced by H() and L() are ASCII control
    characters with no defined meaning. Therefore, they shouldn't appear in
    ASCII text. They also shouldn't appear in any encoding which is a superset
    of ASCII, which is almost all of them. This is a pretty high guarantee that
    the characters won't appear in plain text in any encoding. (If you want a
    better guarantee, I imagine you'll need a longer escape sequence).

    I'm starting to run out of ideas now. I hope you find something useful soon.
    Jill


