From: Peter Kirk (peterkirk@qaya.org)
Date: Sat Jan 22 2005 - 11:58:46 CST
On 22/01/2005 16:50, Lars Kristan wrote:
> ...
>
> > ... system's default code page. This cannot be
> > UTF-8, and so these files cannot start with a BOM
>
> Actually, they're not that far from it. Try "mode CON CP
> SELECT=65001". It is unsupported. Why?
>
> ...
>
> Now consider that user's (!) default code page is UTF-8 (so 65001).
> You would get proper output and no dropping for Unicode data. But what
> happens is that applications start dropping data on the stdin. Because
> invalid sequences are dropped. And with dropped I make no distinction
> between skipping them and replacing them with U+FFFD. It is dropping data.
>
> It would be nice to have UTF-8 as a default code page, wouldn't it?
> Someone must have realized that dropping data on the stdin is more
> than users would be willing to accept. Well, we can wait a couple of
> years to get all the out of band data sorted out. Or clutter
> everything with BOMs. Maybe then we'll know when the data is UTF-8 and
> when it is not. Maybe we will, maybe we won't. How about defining how
> to convert invalid UTF-8 sequences to codepoints? It would start
> working. Indeed no better than things work today. But the "current
> code page" concept did not differentiate between different encodings.
> Why should we differentiate UTF-8 from the rest? Of course it would be
> useful, but can it be done reliably? Can it be done in near future?
>
>
This is interesting speculation. But with any code page there are bytes
or combinations of bytes which are illegal or undefined in that code
page. When Windows (NT/2000/XP, and so internally Unicode, represented as
UTF-16) reads code page files as text, it converts them to Unicode.
The correct behaviour when an illegal or undefined byte is found is to
replace it with U+FFFD, and I think this is what Windows does. You
might also call this dropping of data, although in fact it is not data but
garbage, or data wrongly labelled and so misinterpreted as garbage.

And if, speculatively, Windows were to support UTF-8 as a code page, the
situation would be unchanged. Byte sequences which are illegal UTF-8 are
garbage in that code page and so would correctly be replaced by U+FFFD.
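
To make the replacement behaviour concrete, here is a minimal sketch in Python. It is only an illustration: Python's own codecs stand in for whatever conversion Windows actually performs, and the helper name decode_replacing is made up for this example. Python's "cp1252" table also leaves a few bytes (such as 0x81) undefined, which may not match what a given Windows version does with them.

```python
# Sketch only: Python's codecs are an assumed stand-in for the Windows conversion.

def decode_replacing(raw: bytes, encoding: str) -> str:
    """Decode raw bytes, substituting U+FFFD for anything illegal in `encoding`."""
    return raw.decode(encoding, errors="replace")

# 0x81 is undefined in Python's Windows-1252 table, so it becomes U+FFFD.
legacy = decode_replacing(b"caf\xe9 \x81", "cp1252")

# 0xE9 followed by a space is not a legal UTF-8 sequence, so 0xE9 becomes U+FFFD.
utf8 = decode_replacing(b"caf\xe9 ok", "utf-8")

print(ascii(legacy))
print(ascii(utf8))
```

In both cases the rule is the same: the offending byte is replaced and the rest of the text is converted normally, whether the code page is a single-byte one or UTF-8.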

But then even if UTF-8 were supported as a code page I think I would
keep Windows-1252 as my system code page. There is too much Windows-1252
legacy data around which would be treated as garbage if UTF-8 were my
system code page. The code page is used only by obsolescent legacy
applications, and by modern applications reading legacy data. Windows
Unicode support is adequate without trying to reinterpret legacy data as
Unicode. And rather than try to trick old applications into supporting
Unicode through UTF-8, the Windows strategy has rightly been to update
the applications for proper Unicode support.
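
The point about legacy data can be sketched in Python in the same way. This is an illustration under assumptions, not a description of Windows itself: the sample string is invented, and Python's "cp1252" and "utf-8" codecs stand in for the system code page conversion.

```python
# Sketch only: the sample text is hypothetical and Python codecs stand in for Windows.

# Legacy text with three non-ASCII characters (u-umlaut, e-acute, pound sign).
legacy_text = "Z\u00fcrich caf\u00e9 \u00a35"
legacy_bytes = legacy_text.encode("cp1252")

# Read with the code page it was written in: nothing is lost.
as_cp1252 = legacy_bytes.decode("cp1252")

# Read as if the code page were UTF-8: each non-ASCII byte is illegal
# in this context and is replaced by U+FFFD.
as_utf8 = legacy_bytes.decode("utf-8", errors="replace")

print(as_cp1252 == legacy_text)
print(ascii(as_utf8))
```

Every accented letter or symbol in the legacy file is lost when it is read as UTF-8, while pure ASCII content survives, which is why switching the system code page would damage existing Windows-1252 files.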
...
> ... Very Windows-like. Much like hiding the extensions in Explorer. ...
>
This is optional, and it is an option which anyone who knows much about
computers should switch off.
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/