From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 15 2004 - 09:49:20 CST
Arcane Jill wrote:
> solution, again without breaking the Unicode model. If I have
> It is for reasons of requirement (4) that Lars proposes the introduction of
> 128 BMP codepoints. His intention is that they be marked as "reserved - do
> not use", so that requirement 4 is met.
Actually, Jill, they are not reserved, any more than U+0041 is reserved.
They are simply dedicated to a particular use, which is not true of my PUA
solution.
And my solution does not break the Unicode model. The proposal would break
the Unicode model if my conversion were to replace the now-standard
conversion. I can even show that the consequences of that would be no more
serious than the filesystem problem I am solving. But at this point, I am
not proposing that. I am proposing merely that these codepoints be assigned.
Breaking the model is not why UTC is refusing to consider this proposal. A
couple of possible reasons:
* UTC feel that allowing (well, encouraging) a new way of handling invalid
sequences might slow down the transition.
* UTC feel that allowing (well, encouraging) a new way of handling invalid
sequences might lead to late detection of mislabelled data.
* UTC feel that the problem in question has nothing to do with Unicode.
* UTC feel that by stating filenames are binary data, they have solved the
problem, while ignoring the cost they may be causing.
* UTC should have realized the need for these codepoints years ago, but now
prefer to stick with the original decision.
As for your solution, I didn't really analyze it. But it is escaping, isn't
it? With a lot of overhead. Filesystems have limitations. Say up to 255
characters for a filename. Representing a 255-character (Unicode) filename
from Windows on UNIX (in UTF-8) is not always possible. There is not much we
can do about it. But representing a 255-char filename from UNIX on a Windows
system? Currently always possible. An escaping technique with a lot of
overhead breaks that. Hence my pleas to consider assigning the 128
codepoints in the BMP, because otherwise an invalid sequence consisting of a
single Latin-1 character maps to two UTF-16 shorts.
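To make the length argument concrete, here is a minimal sketch in Python. It
assumes a conversion that maps each invalid byte 0x80..0xFF to one code point
at a fixed offset from a base. The two base values are placeholders (the 128
code points were never assigned), and the helper names are hypothetical.
Python's built-in surrogateescape error handler is used only as a convenient
way to find the invalid bytes; it is not the conversion proposed here.

```python
# Placeholder bases: the 128 code points discussed here were never assigned,
# so both values are assumptions picked only to make the sketch runnable.
BMP_BASE = 0xE000    # somewhere in the BMP (here: Private Use Area)
SUPP_BASE = 0xF0000  # somewhere outside the BMP (here: Plane 15 private use)


def decode_with_escapes(raw: bytes, base: int) -> str:
    """Decode UTF-8, turning each invalid byte 0x80..0xFF into base + (byte - 0x80)."""
    # surrogateescape marks each undecodable byte as U+DC80..U+DCFF;
    # strict UTF-8 never produces those code points from valid input.
    text = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(base + ord(ch) - 0xDC80) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )


def utf16_units(text: str) -> int:
    """Number of 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


name = b"caf\xe9.txt"  # Latin-1 byte 0xE9 is not valid UTF-8 here
print(utf16_units(decode_with_escapes(name, BMP_BASE)))   # 8: one unit per byte
print(utf16_units(decode_with_escapes(name, SUPP_BASE)))  # 9: the bad byte costs two

worst = b"\xff" * 255  # a 255-byte name made only of invalid bytes
print(utf16_units(decode_with_escapes(worst, BMP_BASE)))   # 255
print(utf16_units(decode_with_escapes(worst, SUPP_BASE)))  # 510
```

With a BMP base the 255-byte worst case still fits in 255 UTF-16 units; with
a base outside the BMP it doubles to 510.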
And if filesystem limitations are seen as a somewhat unnecessary concern,
there is transmission overhead and one other thing: in C, you can guess (for
performance reasons) the maximum amount of memory you need for a certain
conversion. And the multipliers are typically around 2 (bytes per byte).
Even a plane other than the BMP raises that to 4; other escaping techniques
are far worse.
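The multipliers can be worked out with a small sketch. It is written in
Python rather than C for brevity, and the function name and its parameter
are hypothetical. It assumes UTF-8 input converted to UTF-16, where the only
variable is how many 16-bit code units a single invalid byte costs.

```python
def worst_case_utf16_bytes(input_len: int, units_per_invalid_byte: int) -> int:
    """Upper bound on UTF-16 output size, in bytes, for input_len input bytes.

    Valid UTF-8 never yields more than one 16-bit code unit per input byte,
    so the bound is driven by what one invalid byte costs.
    """
    units_per_byte = max(1, units_per_invalid_byte)
    return input_len * units_per_byte * 2  # 2 bytes per UTF-16 code unit


# Sanity check of the "one unit per input byte" claim for valid UTF-8,
# using 1-, 2-, 3- and 4-byte sequences.
for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    assert len(ch.encode("utf-16-le")) <= 2 * len(ch.encode("utf-8"))

for label, units in (("BMP escape", 1), ("escape outside the BMP", 2)):
    bound = worst_case_utf16_bytes(255, units)
    print(f"{label}: {bound} bytes for 255 input bytes (x{bound // 255})")
```

This prints a factor of 2 for an escape in the BMP and 4 for one outside it,
so a buffer preallocated at twice the input size is only safe in the first
case.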
Lars