From: Arcane Jill (arcanejill@ramonsky.com)
Date: Thu Dec 16 2004 - 02:07:38 CST
-----Original Message-----
From: Lars Kristan [mailto:lars.kristan@hermes.si]
>As for your solution, I didn't really analyze it. But it is escaping, isn't
>it?
Yes.
>With a lot of overhead.
If you call string length "overhead", yes. This was to provide reasonable
assurance that an escape sequence won't be encountered by accident.
>Filesystems have limitations. Say up to 255 characters for a filename.
>Representing a 255 (Unicode) characters long filename from Windows on UNIX
>(in UTF-8) is not always possible. There is not much we can do about it.
>But representing a 255 characters (chars) long filename from UNIX on a
>Windows system? Currently always possible. An escaping technique with a lot
>of overhead breaks that.
Now that's frustrating, Lars. Each time I make a suggestion, you come up
with a new requirement. So now, in addition to all previous requirements,
you have this additional requirement:
# for all possible octet sequences s:
# length of UTF-8(f(s)) <= length of s
And yet, your own scheme, in which f(x) = { U+EE00 + x } for non-UTF-8 bytes
x, does not meet that requirement.
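To make the byte count concrete, here is a minimal Python sketch. The function names are invented for illustration, and it assumes the scheme is applied only to bytes in the range 0x80 to 0xFF. Every code point from U+EE80 to U+EEFF needs three bytes in UTF-8, so one input byte becomes three.

```python
def private_use_escape(byte: int) -> str:
    """Map a byte 0x80..0xFF to U+EE00 + byte (the scheme quoted above)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    return chr(0xEE00 + byte)


def violates_length_requirement(byte: int) -> bool:
    """True if the UTF-8 form of the escape is longer than the one input byte."""
    return len(private_use_escape(byte).encode("utf-8")) > 1


# The single byte 0x9F becomes U+EE9F, which is EE BA 9F in UTF-8.
encoded = private_use_escape(0x9F).encode("utf-8")
print(encoded.hex(" "), len(encoded))
```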
>Hence my pleas to consider assigning the 128 codepoints in BMP,
I'm aware that you're trying to deal with a real issue here, and I have
sympathy for that, but you really need to drop these pleas. It will never
happen. (I suspect that campaigning to get the automobile banned in the USA
is probably a more achievable goal). That's why I've been trying to help
out by suggesting alternatives. That's why you need to start thinking along
those lines.
>because otherwise an invalid sequence consisting of a single Latin 1
>character maps to 2 UTF-16 shorts. And if filesystem limitations can be seen
>as a somewhat unnecessary goal,
I kind of got the impression that only "escape-aware" applications would
need to actually get filehandles on the files. "Escape-unaware" processes
were intermediate things like: being stored in a database; roundtripping
through UTF-16; etc. And an "escape-aware" application can simply unescape
before opening the file. Have I misunderstood you?
> in C, you can guess (for performance reasons) the maximum amount of memory
> you need for a certain conversion.
You can do better than guess. You figure it out exactly, to the byte. And
then allocate exactly the right amount of memory you require. (I am actually
a programmer).
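As an illustration of computing the size exactly, the sketch below counts the UTF-8 bytes a string of code points will need before any buffer is allocated. It is written in Python for brevity, and `utf8_length` is a hypothetical helper name. The same per-code-point thresholds carry over directly to a C loop.

```python
def utf8_length(text: str) -> int:
    """Return the exact number of bytes the UTF-8 encoding of text needs."""
    total = 0
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:          # U+0000..U+007F: 1 byte
            total += 1
        elif cp < 0x800:       # U+0080..U+07FF: 2 bytes
            total += 2
        elif cp < 0x10000:     # U+0800..U+FFFF: 3 bytes
            total += 3
        else:                  # U+10000..U+10FFFF: 4 bytes
            total += 4
    return total


sample = "a\u00e9\u20ac\U0001F600"   # 1 + 2 + 3 + 4 bytes
print(utf8_length(sample))
```

A first pass like this costs one extra scan of the input, but the allocation that follows is exact rather than a worst-case guess.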
Anyway, sorry to start sounding negative on this occasion. I have one more
suggestion for you. I think it's better than your U+EE00+n solution, but
Doug is likely to tell me it's non-conformant (though I think it should have
just the same status as an escape sequence, being a private thing outside
the realm of Unicode). Here we go:
DEFINITION (1): let H(x) be the function:
# H( 8) = U+0001
# H( 9) = U+0002
# H(10) = U+0003
# H(11) = U+0004
# H(12) = U+0005
# H(13) = U+0006
# H(14) = U+000E
# H(15) = U+000F
with H(x) undefined for all other values of x
DEFINITION (2): let L(x) be the function L(x) = U+0010 + x
RULE: to escape an isolated byte in the range 0x80 to 0xFF: let h = the high
nibble; let l = the low nibble; emit the sequence { H(h), L(l) }. For
example - to escape the byte 0x9F, you would emit { U+0002, U+001F }. In
UTF-8 that's { 02 1F }, which is one byte shorter than the UTF-8 encoding of
U+EE9F.
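The rule can be sketched in a few lines of Python. This is an illustration only: the names `escape_byte` and `unescape_pair` are invented here, and the sketch covers a single isolated byte, not the surrounding logic that decides which bytes of a string are invalid UTF-8.

```python
# DEFINITION (1): H maps the high nibble (8..15) to a control character.
H = {8: 0x01, 9: 0x02, 10: 0x03, 11: 0x04,
     12: 0x05, 13: 0x06, 14: 0x0E, 15: 0x0F}
H_INVERSE = {cp: nibble for nibble, cp in H.items()}


def escape_byte(byte: int) -> str:
    """Escape one byte in 0x80..0xFF as the two-character sequence H(h), L(l)."""
    if not 0x80 <= byte <= 0xFF:
        raise ValueError("only bytes 0x80..0xFF are escaped")
    high, low = byte >> 4, byte & 0x0F
    # DEFINITION (2): L(x) = U+0010 + x
    return chr(H[high]) + chr(0x10 + low)


def unescape_pair(pair: str) -> int:
    """Recover the original byte from an H(h), L(l) pair."""
    if len(pair) != 2:
        raise ValueError("an escape is exactly two characters")
    h, l = ord(pair[0]), ord(pair[1])
    if h not in H_INVERSE or not 0x10 <= l <= 0x1F:
        raise ValueError("not a valid escape pair")
    return (H_INVERSE[h] << 4) | (l - 0x10)


escaped = escape_byte(0x9F)                # U+0002 followed by U+001F
print(escaped.encode("utf-8").hex(" "))    # two bytes in UTF-8
```

Because both characters are below U+0080, every escape costs exactly two bytes in UTF-8 and two code units in UTF-16.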
RATIONALE: The characters produced by H() and L() are ASCII control
characters with no defined meaning. Therefore, they shouldn't appear in
ASCII text. They also shouldn't appear in any encoding which is a superset
of ASCII, which is almost all of them. This is a pretty high guarantee that
the characters won't appear in plain text in any encoding. (If you want a
better guarantee, I imagine you'll need a longer escape sequence).
I'm starting to run out of ideas now. I hope you find something useful soon.
Jill