Re: Unicode and end users

From: Doug Ewell (dewell@adelphia.net)
Date: Sat Feb 16 2002 - 15:37:20 EST


David Hopwood <david.hopwood@zetnet.co.uk> wrote:

> [I've thought about this a bit more, and I'm now convinced that it's
> useful to have a separate, standardised code for this - say
> U+FDEF ILL-FORMED INPUT MARKER. (Can noncharacters have names?)

Nope. They're noncharacters. They do not exist; they never existed.

Why would anyone, faced with a UTF-8 file that contains invalid
sequences, want to retain the invalid sequences, much less convert the
file to another encoding form that either (a) preserves the invalid
sequences or (b) leaves a marker showing where they were? Invalid
sequences are garbage. They don't represent anything, and you can't
always even tell what they were supposed to represent.
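
To make the point concrete, here is a minimal Python sketch, not anyone's recommended converter. The sample bytes and the helper name decode_strict_or_none are made up for illustration. It shows a strict UTF-8 decoder rejecting an ill-formed sequence, a lenient one substituting U+FFFD, and two legacy code pages giving two different readings of the same stray byte.

```python
from typing import Optional


def decode_strict_or_none(data: bytes) -> Optional[str]:
    """Hypothetical helper: return the decoded text, or None if the
    input is not well-formed UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Illustrative input: 0xE9 is a UTF-8 lead byte with no continuation
# bytes after it, so the sequence is ill-formed.
raw = b"caf\xe9 ok"

# A strict decoder refuses the input outright.
strict = decode_strict_or_none(raw)

# A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER.
replaced = raw.decode("utf-8", errors="replace")

# The same byte is a different letter in different legacy code pages,
# so the original intent cannot be recovered from the bytes alone.
guesses = {enc: raw.decode(enc) for enc in ("latin-1", "cp1251")}
```

The two entries in guesses differ, which is the sense in which you can't always tell what the bytes were supposed to represent.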

>> That's why U+2060 WORD JOINER is being introduced in Unicode 3.2.
>> Hopefully it will take over the ZWNBSP semantics from U+FEFF, which can
>> then be used *solely* as a BOM. Eventually, if this happens, it will
>> become safe to strip BOM's as they appear.
>
> No it won't: silently stripping characters without considering that to be
> a change to the string is a potential security problem. It's unlikely
> that this would be a problem at the start of a *file*, but "UTF-16" in
> the sense of the IANA-registered charset of that name (i.e. swap byte
> order every time you see "U+FFFE", and strip U+FEFF anywhere it appears),
> is simply a bad idea IMHO.

You can never strip or convert anything in complete blindness, of
course; even converting LF to CRLF when moving a file from a Unix system
to a Windows system would affect the CRC, which might cause some alarms
to go off.
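
A small Python sketch of the CRC point, assuming CRC-32 as computed by zlib.crc32 and a made-up two-line sample:

```python
import zlib

# Illustrative sample: the same two lines with Unix and Windows line ends.
unix_text = b"line one\nline two\n"
windows_text = unix_text.replace(b"\n", b"\r\n")

# The text is identical to a human reader, but the checksums differ
# because the byte streams differ.
crc_unix = zlib.crc32(unix_text)
crc_windows = zlib.crc32(windows_text)

# Converting back restores the original bytes, and with them the CRC.
round_trip = windows_text.replace(b"\r\n", b"\n")
```

Anything that verifies the file by checksum will see the converted copy as a different file, even though nothing a reader cares about has changed.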

This is where I agree with you: silently converting non-initial ZWNBSP
to WORD JOINER is risky, strongly as I support removing the ZWNBSP
semantics from U+FEFF. If we are talking about a system or application
that only needs to preserve certain semantics for the human reader, such
conversions are fine (as are LF->CRLF, stripped BOMs, and maybe even some
edge cases like converting between tabs and spaces). If there are any
security or spoofing concerns, it's best to leave everything completely
untouched.
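
The difference between stripping a BOM at the start of the text and stripping U+FEFF everywhere can be sketched in Python as follows. The function names and the sample string are hypothetical, chosen only to illustrate the spoofing concern.

```python
BOM = "\ufeff"  # U+FEFF: BOM at the start of text, ZWNBSP elsewhere


def strip_initial_bom(text: str) -> str:
    """Hypothetical helper: drop U+FEFF only when it is the first character."""
    return text[1:] if text.startswith(BOM) else text


def strip_all_feff(text: str) -> str:
    """Hypothetical helper: drop U+FEFF wherever it appears."""
    return text.replace(BOM, "")


# Illustrative sample: a leading BOM plus an embedded U+FEFF.
sample = BOM + "pass" + BOM + "word"

# Only the leading BOM goes; the embedded character is preserved.
conservative = strip_initial_bom(sample)

# Stripping everywhere makes a distinct string compare equal to "password".
aggressive = strip_all_feff(sample)
```

With the aggressive version, two strings that differed before the conversion become identical after it, which is exactly the kind of silent change that matters when security or spoofing is a concern.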

-Doug Ewell
 Fullerton, California


