RE: Invalid UTF-8 sequences (was: Re: Nicest UTF)

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Dec 08 2004 - 04:23:39 CST


    > Needless to say, these systems were badly designed at their origin, and
    > newer filesystems (and OS APIs) offer much better alternative, by either
    > storing explicitly on volumes which encoding it uses, or by forcing all
    > user-selected encodings to a common kernel encoding such as Unicode encoding
    > schemes (this is what FAT32 and NTFS do on filenames created under Windows,
    > since Windows 98 or NT).
    >
    The UNIX principle (I also call it the variant principle) has the problem of
    not knowing the encoding.
    The Windows principle (I also call it the invariant principle) has the
    problem that it HAS to know the encoding.
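
    To make the contrast concrete, here is a minimal Python sketch. The file
    name and the two ISO-8859 encodings are made-up examples, not taken from
    any real system. It only shows that a variant store keeps the bytes intact
    but has to guess how to display them, while an invariant store has to be
    told the encoding before it can store anything.

```python
# Hypothetical example: a file name written by a user working in ISO-8859-2.
raw_name = "žaba.txt".encode("iso-8859-2")  # 'ž' becomes the single byte 0xBE

# Variant (UNIX-like) store: the bytes are kept as they are.
# A reader has to guess the encoding, and different guesses show different names.
as_latin2 = raw_name.decode("iso-8859-2")   # the name the writer intended
as_latin1 = raw_name.decode("iso-8859-1")   # the same bytes under a wrong guess
names_differ = as_latin2 != as_latin1

# The identity of the name is still intact, because the bytes never changed.
bytes_preserved = as_latin1.encode("iso-8859-1") == raw_name

# Invariant (Windows-like) store: the name is converted to one common encoding
# at write time, so the writer's encoding MUST be known at that moment.
stored_utf16 = raw_name.decode("iso-8859-2").encode("utf-16-le")
```

    Under the wrong guess the first character comes out as a different
    symbol, although both readers are looking at exactly the same stored
    bytes.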

    The Windows principle has another problem: it can store data from any
    encoding, and it also does a good job of trying to represent the data in any
    encoding, but it cannot guarantee identification in just any encoding. An
    invariant store can be implemented as UTF-8 or UTF-16. Windows uses UTF-16,
    and guaranteed identification used to be possible only in UTF-16. Thanks to
    UTF-8, it can now also be done in 8-bit environments (console, telnet). But
    for some reason, support for UTF-8 is still limited in some areas, and the
    missing roundtrip capability may have something to do with it.
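
    The roundtrip problem can be sketched in Python as follows. The byte
    strings are invented examples. The "surrogateescape" error handler used in
    the last step is a mechanism Python added later (PEP 383); it is shown
    only as one possible way to get a roundtrip, not as something this thread
    proposes or as what any particular OS does.

```python
# Hypothetical file names stored as Latin-1 bytes; neither is valid UTF-8.
raw = b"caf\xe9.txt"
other = b"caf\xe8.txt"

def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# Lossy conversion: each invalid byte is replaced by U+FFFD.
lossy = raw.decode("utf-8", errors="replace")
lossy_back = lossy.encode("utf-8")

# Two different stored names now look identical, so a name can no longer
# identify the object it came from.
collide = other.decode("utf-8", errors="replace") == lossy

# Roundtrip-preserving conversion: each invalid byte is mapped to a lone
# surrogate code point and mapped back to the same byte on encoding.
escaped = raw.decode("utf-8", errors="surrogateescape")
restored = escaped.encode("utf-8", errors="surrogateescape")
```

    With the lossy conversion the original bytes cannot be recovered and the
    two names collide; with the escaping conversion the original bytes come
    back unchanged.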

    I basically agree that the variant approach is not a good one. But the
    invariant one is not an easy path. It was easier for Windows to take it,
    because at the time the transition was made, those systems were still
    single-user. Hence, typically all data was in a single encoding.

    Lars


