Re: Variations of UTF-16

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Apr 25 2002 - 10:57:28 EDT


Shlomi Tal <shlompi@hotmail.com> wrote:

> If you're going to take the trouble of making text tools 16-bit
> aware, then you can afford to make them BOM-aware too.
>
> type a.txt b.txt c.txt > d.txt
>
> on Windows 2000, assuming that they are all UTF-16 (with an FFFE at
> the beginning of each, as is usual in MS-Windows Unicode files),
> strips every BOM except the first, so that d.txt has only the usual
> one initial FFFE. So it's not an immovable obstacle.

Someone will undoubtedly claim that this breaks data integrity in the
case of files that start with a genuine zero-width no-break space. This
scenario makes no sense to me, since the whole purpose of ZWNBSP is to
affect the breaking and spacing behavior *between* two characters, but
it seems to be legal Unicode nonetheless.

When U+2060 WORD JOINER becomes widespread enough that Unicode version X
(for some X >= 4.0) can strongly deprecate the use of U+FEFF as a
zero-width no-break space, then it will make more sense for all
Unicode-aware text tools (regardless of UTF) to handle BOMs in the way
Shlomi describes.
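
To make the behavior concrete, here is a minimal sketch in Python of the
kind of BOM-aware concatenation Shlomi describes. It is not how the
Windows "type" command is implemented; the function name concat_utf16 is
hypothetical, and the sketch assumes that all inputs are UTF-16 in the
same byte order.

```python
import codecs

# The two possible UTF-16 byte order marks, as raw bytes.
BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def concat_utf16(chunks):
    """Join UTF-16 byte strings, keeping only the first chunk's leading BOM.

    Assumption: every chunk uses the same byte order. A chunk whose first
    code unit is a genuine U+FEFF (with no BOM in front of it) would lose
    that character, which is the data-integrity concern discussed above.
    """
    out = bytearray()
    for index, chunk in enumerate(chunks):
        # Strip one leading BOM from every chunk after the first.
        if index > 0 and chunk[:2] in BOMS:
            chunk = chunk[2:]
        out += chunk
    return bytes(out)


if __name__ == "__main__":
    bom = codecs.BOM_UTF16_LE
    files = [bom + text.encode("utf-16-le") for text in ("a", "b", "c")]
    joined = concat_utf16(files)
    # Only the initial BOM survives; the decoder consumes it.
    print(joined.decode("utf-16"))
```

Only one leading U+FEFF is removed per chunk, so a file that carries a
BOM followed by a deliberate ZWNBSP keeps the latter. A BOM-less file
that begins with a deliberate ZWNBSP does not, and the tool has no way
to tell the two cases apart.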

-Doug Ewell
 Fullerton, California


