From: Lars Kristan (lars.kristan@hermes.si)
Date: Mon Dec 13 2004 - 10:00:53 CST
Marcin 'Qrczak' Kowalczyk wrote:
> UTF-8 is painful to process in the first place. You are making it
> even harder by demanding that all functions which process UTF-8 do
> something sensible for bytes which don't form valid UTF-8. They even
> can't temporarily convert it to UTF-32 for internal processing for
> convenience.
My point exactly. I am proposing a conversion so that you can. All that is
needed is to assign 128 codepoints, one for each byte value from 0x80 to
0xFF, and define their properties. They would be printable, non-space
characters with no upper/lower case properties, and they would collate (for
example) after all letters but before any special characters, and so on.
Then you don't need to fix anything in the functions themselves. You just
need to convert at the boundaries where you expect such data (you can even
convert from a byte stream straight to UTF-8), and decide whether anything
must be rejected for security reasons. If not, then you're done.
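To make the proposed conversion concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: the 128
codepoints have not been assigned, so the range U+EE80..U+EEFF (in the
Private Use Area) and the names ESCAPE_BASE, bytes_to_text and text_to_bytes
are placeholders made up for this example. It uses Python's "surrogateescape"
error handler to find the invalid bytes and then remaps them.

```python
# Hypothetical stand-in for the 128 proposed codepoints: byte 0x80..0xFF
# maps to U+EE80..U+EEFF (Private Use Area). This is not an assigned range.
ESCAPE_BASE = 0xEE00


def bytes_to_text(raw: bytes) -> str:
    """Decode UTF-8, turning each invalid byte into one escape codepoint."""
    # surrogateescape marks an undecodable byte 0xNN as U+DCNN.
    marked = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ESCAPE_BASE + ord(ch) - 0xDC00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in marked
    )


def text_to_bytes(text: str) -> bytes:
    """Inverse direction: escape codepoints become the original raw bytes."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE + 0x80 <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


raw_name = b"caf\xe9.txt"              # Latin-1 bytes, not valid UTF-8
text_name = bytes_to_text(raw_name)    # 0xE9 becomes U+EEE9
utf8_name = text_name.encode("utf-8")  # now valid UTF-8
restored = text_to_bytes(text_name)    # back to the original bytes
```

The sketch does not address input that already contains the escape
codepoints as valid UTF-8; such input would not survive the round trip
unchanged.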
So, no, I am not demanding that UTF-8 functions behave differently. Existing
functions work perfectly well, provided you convert to UTF-8 (that is, use
three bytes to represent each invalid byte as a valid codepoint). It would
be beneficial if the functions could also handle invalid bytes directly, but
that is a separate issue. It would need to be determined which functions
could do so. Maybe all could, maybe only some could, maybe none should. That
needs to be investigated before anything is changed. This is in line with
what I said about validation. Processing functions may validate implicitly,
but that is not a requirement unless you make it one. In my opinion it is
better to separate validation from processing. In that case you can even
prescribe exactly what processing functions should do with invalid data:
exactly what they would do if the data had been converted to UTF-8 according
to my conversion. But again, this is the next step, and it needn't be taken
at all.
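The point about three bytes per invalid byte, and about keeping validation
apart from processing, can be sketched as follows. Again this is only an
illustration: the range U+EE80..U+EEFF and the helper names convert and
contains_escaped_bytes are assumptions made for the example, not part of any
standard.

```python
# Hypothetical escape range standing in for the 128 proposed codepoints.
ESCAPE_FIRST = 0xEE80
ESCAPE_LAST = 0xEEFF


def convert(raw: bytes) -> str:
    """Boundary conversion: each invalid byte 0xNN becomes U+EENN."""
    marked = raw.decode("utf-8", errors="surrogateescape")
    return "".join(
        chr(ord(ch) - 0xDC00 + 0xEE00) if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in marked
    )


def contains_escaped_bytes(text: str) -> bool:
    """Validation, kept separate: did the source hold invalid UTF-8?"""
    return any(ESCAPE_FIRST <= ord(ch) <= ESCAPE_LAST for ch in text)


names = [convert(b"b\xff.txt"), convert(b"a.txt")]

# Ordinary processing needs no special cases for the converted data.
ordered = sorted(names)
uppercased = [name.upper() for name in names]

# Each invalid byte costs three bytes once the text is encoded as UTF-8.
encoded_len = len(names[0].encode("utf-8"))   # 1 + 3 + 4 = 8

# Callers that care can validate explicitly; others simply skip this step.
flags = [contains_escaped_bytes(name) for name in names]
```

Sorting and case conversion run unchanged here because the converted strings
contain only valid codepoints; whether to reject names with escaped bytes is
left to the caller.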
>
> > Listing files in a directory should not signal anything. It MUST
> > return all files and it should also return them in a way that this
> > list can be used to access each of the files.
>
> Which implies that they can't be interpreted as UTF-8.
>
> By masking an error you are not encouraging users to fix it.
> Using non-UTF-8 filenames in a UTF-8 locale is IMHO an error.
Failure to process such files is also an error. Think of virus scanners and
backup tools.
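As an illustration of the listing requirement: on POSIX systems, Python's
os.listdir, when given a str path, returns undecodable name bytes in escaped
form (its "surrogateescape" scheme, which uses lone surrogates rather than
newly assigned codepoints), so every returned name can still be passed back
to open. The helper name read_every_file and the sample file names below are
made up for this sketch. Some file systems refuse to create non-UTF-8 names
at all, which the sketch tolerates.

```python
import os
import tempfile


def read_every_file(directory: str) -> dict:
    """List a directory and open each entry by the name that was returned."""
    sizes = {}
    for name in os.listdir(directory):   # str in, str out
        with open(os.path.join(directory, name), "rb") as handle:
            sizes[name] = len(handle.read())
    return sizes


with tempfile.TemporaryDirectory() as tmp:
    created = 0
    # The second name is Latin-1, so it is not valid UTF-8.
    for raw_name in (b"plain.txt", b"caf\xe9.txt"):
        try:
            with open(os.path.join(os.fsencode(tmp), raw_name), "wb") as handle:
                handle.write(b"data")
            created += 1
        except (OSError, ValueError):
            pass  # this platform rejects the name; nothing to list for it
    sizes = read_every_file(tmp)
```

A scanner or backup tool written this way never has to skip a file just
because its name is not valid UTF-8.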
> > The interesting thing is that if you do start using my conversion,
> > you can actually get rid of the need to validate UTF-8 strings
> > in the first scenario. That of course means you will allow users
> > with invalid UTF-8 sequences, but if one determines that this is
> > acceptable (or even desired), then it makes things easier. But the
> > choice is yours.
>
> For me it's not acceptable, so I will not support declaring it valid.
As I said, the choice is yours. My proposal does not prevent you from doing it
your way. You don't need to change anything and it will still work the way
it worked before. OK? I just want 128 codepoints so I can make my own
choice. And once and for all, you can treat those 128 codepoints just as you
do today.
Lars