"Lars Kristan" <lars.kristan@hermes.si> wrote:
> AFAIK, UTF-8 files are NOT supposed to have a BOM in them.
Different operating systems and applications have different preferences.
There is no universal "right" or "wrong" about this. This is
unfortunate, but true.
> Why is UTF-16 perceived as UNICODE? Well, we all know it's because
> UCS-2 used to be the ONLY implementation of Unicode. But there is
> another important difference between UTF-16 and UTF-8. It is barely
> possible to misinterpret UTF-16, because it uses shorts and not bytes.
> On the other hand, UTF-8 and ASCII are in extreme cases identical.
At the risk of being mistaken for juuitchan by citing a Japanese
example: A non-BOM file that starts with the bytes 0x30 0x42 could be
the UTF-8 characters "0B", or it could be the UTF-16BE character
HIRAGANA LETTER A. (A similar situation applies for UTF-16LE.) Now,
"0B" might not be the first two characters of many novels, but in a
techie Unix environment it could easily be the start of a text-format
data file.
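To see how clean the ambiguity is, here is a small Python sketch
(purely illustrative): the very same two bytes decode without error
under both interpretations.

    data = b"\x30\x42"
    print(data.decode("utf-8"))      # '0B'
    print(data.decode("utf-16-be"))  # U+3042 HIRAGANA LETTER A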
Two common heuristics for determining whether a file is UTF-16 are to
check whether every other byte is 0x00, or whether every other byte is
the same. The former fails for non-Latin scripts; the latter fails
(less frequently) for scripts that are larger than a smallish alphabet.
That's the problem with no BOM: you have to resort to heuristics, or
external tagging.
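In code, those two heuristics might look something like the sketch
below. (A rough illustration only; the name looks_like_utf16 is mine,
and a real detector would be considerably more careful.)

    def looks_like_utf16(data: bytes) -> bool:
        if len(data) < 2:
            return False
        evens, odds = data[0::2], data[1::2]
        # Heuristic 1: every other byte is 0x00
        # (fails for non-Latin scripts).
        if all(b == 0 for b in evens) or all(b == 0 for b in odds):
            return True
        # Heuristic 2: every other byte is the same (fails, less often,
        # for scripts larger than one 256-character block).
        return len(set(evens)) == 1 or len(set(odds)) == 1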
> Why not have BOM in UTF-8? Probably because of the applications that
> don't really need to know that a file is in UTF-8, especially since it
> may be pure ASCII in many cases (e.g. system configuration files). And
> if Unicode is THE codeset to be used in the future, then at some point
> in time all files would begin with a UTF-8 BOM. Quite unnecessary.
> Further problems arise when you concat files or start reading in the
> middle.
That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
then be used *solely* as a BOM. Eventually, if this happens, it will
become safe to strip BOMs as they appear. (Of course, if you are
splitting or concatenating files, you should not do any interpretation
anyway.)
I have never seen a non-pathological example where stripping a file- or
stream-initial U+FEFF was harmful because of the possibility that it was
intended as ZWNBSP. ZWNBSP (or WORD JOINER) affects the behavior of the
characters before and after it. If there is no character before ZWNBSP,
it doesn't belong there.
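In practice, the stripping itself is trivial; something like this
Python sketch (the function name is just for illustration):

    def strip_bom(text: str) -> str:
        # Drop a single leading U+FEFF, treating it as a BOM
        # rather than as ZWNBSP content.
        return text[1:] if text.startswith("\ufeff") else text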
> [O]n UNIX, it is essential that the user is aware of the codeset that
> is being used.
Unix users are accustomed to dealing with such details.
> Anyway, some invalid sequences will be encountered by the editor, but
> then hopefully it will simply display some replacement characters (or
> ask if it can do so). Hopefully it will allow me to save the file,
> with invalid sequences intact. Editing invalid sequences (or inserting
> new ones) would be too much to ask, right?
>
> What bothers me a little bit is that I would not be able to save such
> a file as UTF-16 because of the invalid sequences in it. Why would I?
> Well, Windows has more and more support for UTF-8, so maybe I don't
> really need to. I still wish I had an option though.
>
> This again makes me think that UTF-8 and UTF-16 are not both Unicode.
> Maybe UTF-16 is 'more' Unicode right now, because of the past. But
> maybe UTF-8 will be 'more' Unicode in the future, because it can
> contain invalid sequences and these can be properly interpreted by
> someone at a later time. Unless UTF-16 has that same ability, it will
> lose the battle of being an 'equally good Unicode format'.
I don't think the fact that invalid sequences are possible in UTF-8 and
not in UTF-16 makes UTF-8 inferior, or any less "Unicode." It was
designed that way. Invalid sequences always represent a problem, just
like line noise. They should not be treated as a normal situation.
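To make concrete what an invalid sequence is: 0xC0 0x80, for instance,
is an overlong encoding of U+0000, and a conformant UTF-8 decoder must
reject it. A quick illustrative sketch:

    bad = b"\x41\xc0\x80\x42"
    try:
        bad.decode("utf-8")
    except UnicodeDecodeError as e:
        print("invalid sequence detected:", e)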
-Doug Ewell
Fullerton, California