From: Lars Kristan (lars.kristan@hermes.si)
Date: Sat Dec 11 2004 - 13:05:38 CST
Marcin 'Qrczak' Kowalczyk wrote:
> Lars Kristan <lars.kristan@hermes.si> writes:
> 
> > All assigned codepoints do roundtrip even in my concept.
> > But unassigned codepoints are not valid data.
> 
> Please make up your mind: either they are valid and programs are
> required to accept them, or they are invalid and programs are required
> to reject them.
I don't know what they should be called. The fact is that there shouldn't be
any, and yet current software should treat them as valid. So, they are not
valid but cannot (and must not) be validated. As stupid as it sounds. I am sure
one of the standardizers will find a Unicodally correct way of putting it.
> 
> > Furthermore, I was proposing this concept to be used, but not
> > unconditionally. So, you can, possibly even should, keep using
> > whatever you are using.
> 
> So you prefer to make programs misbehave in unpredictable ways
> (when they pass the data from a component which uses relaxed rules
> to a component which uses strict rules) rather than have a clear and
> unambiguous notion of a valid UTF-8?
I am not particularly thrilled about it. In fact it should be discussed.
Constructively. Simply assuming everything will break is not helpful. But if
you want an answer, yes, I would go for it. Actually, there are fewer
concerns involved than people think. Security is definitely an issue. But
again, one shouldn't assume it breaks just like that. Let me risk a bold
statement: security is typically implicitly centralized. And if comparison
is always done in the same UTF, it won't break. The simple fact that two
different UTF-16 strings compare equal in UTF-8 (after relaxed conversion)
does not introduce a security issue. Today, two invalid UTF-8 strings
compare the same in UTF-16, after a valid conversion (using a single
replacement char, U+FFFD) and they compare different in their original form,
if you use strcmp. But you probably don't. Either you do everything in
UTF-8, or everything in UTF-16. Not always, but typically. If comparisons
are not always done in the same UTF, then you need to validate. And not
validate while converting, but validate on its own. And now many designers
will remember that they didn't. So, all UTF-8 programs (of that kind) will
need to be fixed. Well, one might as well adopt my broken conversion and fix
all UTF-16 programs instead. Again, only those of that kind, not all in
general, so there are few. And even those would not all be affected. It would
depend on which conversion is used where. Things could be worked out, even if
we started changing all the conversions. Even more so if a new conversion is
added and only used when specifically requested.
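
To make the comparison argument concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: Python's `surrogateescape` error handler is used as a stand-in for a "relaxed" conversion and is not the conversion discussed in this thread, and the helper names (`replace_decode`, `relaxed_decode`, `relaxed_encode`, `is_valid_utf8`) are hypothetical.

```python
def replace_decode(data: bytes) -> str:
    """Conversion that maps every invalid byte to U+FFFD."""
    return data.decode("utf-8", errors="replace")


def relaxed_decode(data: bytes) -> str:
    """Stand-in relaxed conversion: invalid bytes become lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def relaxed_encode(text: str) -> bytes:
    """Inverse of relaxed_decode: escaped surrogates become raw bytes again."""
    return text.encode("utf-8", errors="surrogateescape")


def is_valid_utf8(data: bytes) -> bool:
    """Validation on its own, separate from any conversion."""
    try:
        data.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# Two different byte strings, neither of them valid UTF-8.
a = b"file\x80"
b = b"file\x81"

# They differ as bytes (the strcmp view) ...
assert a != b
# ... but compare equal once both bad bytes have become U+FFFD.
assert replace_decode(a) == replace_decode(b)

# The relaxed stand-in keeps them distinct and round-trips them exactly.
assert relaxed_decode(a) != relaxed_decode(b)
assert relaxed_encode(relaxed_decode(a)) == a

# The flip side: two different decoded strings can yield the same bytes.
s1 = "\u00e9"        # U+00E9, encoded as the bytes C3 A9
s2 = "\udcc3\udca9"  # the raw bytes C3 and A9, each escaped separately
assert s1 != s2
assert relaxed_encode(s1) == relaxed_encode(s2)

# Standalone validation catches the bad input before any comparison.
assert not is_valid_utf8(a)
assert is_valid_utf8(relaxed_encode(s1))
```

Either direction produces collisions of some kind; what matters for the argument is whether all comparisons happen on the same side of the conversion.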
There is cost and there are risks. Nothing should be done hastily. But let's
go back and ask ourselves what the benefits are. And evaluate the whole.
> 
> > Perhaps I can convert mine, but I cannot convert all filenames on
> > a user's system.
> 
> Then you can't access his files.
Yes, this is where it all started. I cannot afford not to access the files.
I am not writing a notepad.
> 
> With your proposal you couldn't as well, because you don't make them
> valid unconditionally. Some programs would access them and some would
> break, and it's not clear what should be fixed: programs or filenames.
It is important to have a way to write programs that can. And, there is
definitely nothing to be fixed about the filenames. They are there and
nobody will bother to change them. It is the programs that need to be fixed.
And if Unicode needs to be fixed to allow that, then that is what is
supposed to happen. Eventually.
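
As a rough sketch of what "programs that can" might look like, the Python fragment below carries a filename that is not valid UTF-8 through a text string and back without losing any bytes. The assumptions: the `surrogateescape` error handler again stands in for a relaxed conversion (it is not the mechanism proposed here), the filename is invented for the example, and `name_to_text` / `text_to_name` are hypothetical helpers.

```python
def name_to_text(raw: bytes) -> str:
    """Decode a filename leniently so that no byte is ever lost."""
    return raw.decode("utf-8", errors="surrogateescape")


def text_to_name(text: str) -> bytes:
    """Recover the exact original bytes of the filename."""
    return text.encode("utf-8", errors="surrogateescape")


# An invented Latin-1 encoded filename; 0xE9 followed by ASCII is invalid UTF-8.
raw_name = b"r\xe9sum\xe9.txt"

# A strict program cannot even represent this name as text.
try:
    raw_name.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
assert not strict_ok

# A lenient program can hold it as text and still hand the same bytes
# back to the operating system when it opens the file.
text_name = name_to_text(raw_name)
assert text_to_name(text_name) == raw_name
```

The filenames themselves stay untouched; only the program's conversion changes.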
Lars