RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 08 2004 - 04:23:39 CST

Next message: Marcin 'Qrczak' Kowalczyk: "Re: Nicest UTF"

Previous message: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Maybe in reply to: Doug Ewell: "Invalid UTF-8 sequences (was: Re: Nicest UTF)"
Next in thread: Lars Kristan: "RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)"

> Needless to say, these systems were badly designed at their origin, and
> newer filesystems (and OS APIs) offer much better alternative, by either
> storing explicitly on volumes which encoding it uses, or by forcing all
> user-selected encodings to a common kernel encoding such as Unicode encoding
> schemes (this is what FAT32 and NTFS do on filenames created under Windows,
> since Windows 98 or NT).
>
The UNIX principle (I also call it the variant principle) has the problem of
not knowing the encoding.
The Windows principle (I also call it the invariant principle) has the problem
that it HAS to know the encoding.
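
To make the two principles concrete, here is a minimal sketch in Python. The
file name, the two encodings and the helper names (variant_view,
invariant_store) are made up for illustration; they are not taken from any
real filesystem API.

```python
# A filename as a variant (UNIX-style) store keeps it: raw bytes, with no
# record of the encoding. Hypothetical name, written by a Latin-1 user.
raw_name = b"caf\xe9.txt"

def variant_view(name_bytes, user_encoding):
    """Variant store: every reader decodes the same bytes in their own encoding."""
    return name_bytes.decode(user_encoding)

def invariant_store(name_bytes, source_encoding):
    """Invariant store: convert to UTF-16 once, at write time.

    This only works if the source encoding is known at that moment.
    """
    return name_bytes.decode(source_encoding).encode("utf-16-le")

# Variant: the same bytes read differently depending on who is looking.
latin1_view = variant_view(raw_name, "latin-1")   # 0xE9 is read as U+00E9
cp1251_view = variant_view(raw_name, "cp1251")    # 0xE9 is read as U+0439

# Invariant: the stored form is fixed, but a wrong guess about the source
# encoding is baked in permanently.
stored_right = invariant_store(raw_name, "latin-1")
stored_wrong = invariant_store(raw_name, "cp1251")
```

The variant store never loses the bytes but cannot say what they mean; the
invariant store always knows what it holds, but only if it was told the right
encoding on the way in.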

The Windows principle has another problem. It can store data from any
encoding, and it does a good job of trying to represent that data in any
encoding, but it cannot guarantee identification in just any encoding. An
invariant store can be implemented as UTF-8 or UTF-16. Windows uses UTF-16,
and guaranteed identification used to be possible only in UTF-16. Thanks to
UTF-8, it can now also be done in 8-bit environments (console, telnet). But
for some reason, support for UTF-8 is still limited in some areas, and the
missing roundtrip capability may have something to do with it.
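
The missing roundtrip can be shown with a small Python sketch. The byte
strings are made-up file names. The last step uses Python's "surrogateescape"
error handler, a later Python-specific mechanism (PEP 383) that I include only
to show what a roundtrip-capable conversion looks like; it is not something
Windows or the UTF-8 definition provides.

```python
# Two distinct names that are not valid UTF-8 (hypothetical Latin-1 leftovers).
name_a = b"caf\xe9.txt"
name_b = b"caf\xe8.txt"

# A strict UTF-8 decoder rejects the data outright.
try:
    name_a.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# A lenient decoder substitutes U+FFFD for the bad byte. The original byte is
# gone, so the conversion cannot be reversed ...
lossy_a = name_a.decode("utf-8", errors="replace")
lossy_b = name_b.decode("utf-8", errors="replace")
back_a = lossy_a.encode("utf-8")

# ... and two different names collapse into one, so identification is lost.
collide = lossy_a == lossy_b

# Python's surrogateescape handler maps each bad byte to a lone surrogate
# instead, which allows the original bytes to be restored.
escaped_a = name_a.decode("utf-8", errors="surrogateescape")
restored_a = escaped_a.encode("utf-8", errors="surrogateescape")
```

With the lossy conversion both names end up as the same string, so one of the
two files can no longer be addressed by name; with an escaping scheme the
original bytes come back unchanged.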

I basically agree that the variant approach is not a good one. But the
invariant one is not an easy path. It was easier for Windows to take it,
because at the time the transition was made, those systems were still
single-user. Hence, typically all data was in a single encoding.

Lars


This archive was generated by hypermail 2.1.5 : Wed Dec 08 2004 - 04:24:53 CST