Re: cp1252 decoder implementation from Martin J. Dürst on 2012-11-27 (Unicode Mail List Archive)

From: Martin J. Dürst <duerst_at_it.aoyama.ac.jp>
Date: Tue, 27 Nov 2012 18:50:35 +0900

On 2012/11/17 12:54, Buck Golemon wrote:
> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
>
>> Buck Golemon wrote:
>>
>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>> to map it to the equally-non-semantic U+81 ?

U+0081 (there are always at least four digits in this notation) just by
chance doesn't have any definition. But if we take the next of the
"holes" in windows-1252, 0x8D, we get "REVERSE LINE FEED". This isn't
exactly non-semantic (although of course browsers and quite a bit of
other software ignore that meaning).
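
One way to see these holes is to ask a strict decoder which bytes it
rejects. The following is a minimal sketch, assuming Python's built-in
"cp1252" codec (which follows the Unicode mapping table and leaves the
holes undefined); the helper name undefined_bytes is made up for
illustration.

```python
def undefined_bytes(encoding):
    """Return the byte values that a strict decoder for `encoding` rejects."""
    holes = []
    for b in range(256):
        try:
            bytes([b]).decode(encoding)
        except UnicodeDecodeError:
            holes.append(b)
    return holes


holes = undefined_bytes("cp1252")
# The bytes that windows-1252 leaves unassigned.
print([f"0x{b:02X}" for b in holes])
# The C1 code points they would land on if simply passed through.
print([f"U+{b:04X}" for b in holes])
```

Passing "latin-1" instead should give an empty list, since that codec
assigns every byte value.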

> Why do you make this conditional on targeting html5?
>
> To me, replacement and error is out because it means the system loses data
> or completely fails where it used to succeed.

There are cases where one wants to avoid as many failures as possible,
at the cost of GIGO (garbage in, garbage out). Browsers are definitely
in that category.

There are other cases where one wants to catch garbage early, and not
let it pollute the rest of the data.

> Currently there's no reasonable way for me to implement the U+0081 option
> other than inventing a new "cp1252+latin1" codec, which seems undesirable.

Well, the above two cases cannot be met with one and the same codec
(unless, of course, the codec has additional options that allow
switching between one and the other).
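
As a rough illustration of such an option, Python lets a caller pick an
error handler per decode call. The sketch below assumes Python's
"cp1252" codec and the standard codecs.register_error hook; the handler
name "c1-passthrough" and the function names c1_passthrough and
decode_cp1252 are hypothetical, not part of any library.

```python
import codecs


def c1_passthrough(exc):
    """Map each undecodable byte to the code point of the same value."""
    if isinstance(exc, UnicodeDecodeError):
        bad = exc.object[exc.start:exc.end]
        # Resume decoding right after the offending bytes.
        return "".join(chr(b) for b in bad), exc.end
    raise exc


# Hypothetical handler name, registered for this sketch only.
codecs.register_error("c1-passthrough", c1_passthrough)


def decode_cp1252(data, lenient=False):
    """Strict by default (catch garbage early); lenient keeps the bytes (GIGO)."""
    errors = "c1-passthrough" if lenient else "strict"
    return data.decode("cp1252", errors=errors)


print(repr(decode_cp1252(b"a\x81b", lenient=True)))
try:
    decode_cp1252(b"a\x81b")
except UnicodeDecodeError as err:
    print("strict mode rejected the input:", err.reason)
```

Note that the lenient result does not encode back to cp1252 with the
stock encoder; a matching encode-side handler would be needed for a
round trip.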

> I feel like you skipped a step. The byte is 0x81 full stop. I agree that it
> doesn't matter how it's defined in latin1 (also it's not defined in latin1).
> The section of the unicode standard that says control codes are equal to
> their unicode characters doesn't mention latin1. Should it?
> I was under the impression that it meant any single-byte encoding, since it
> goes out of its way to talk about "8-bit" control codes.

I'd say it intends to apply to any single-byte encoding with a full C1
range, or in other words, any single-byte encoding conforming to the ISO
C0/G0/C1/G1 model (the model used in, if not defined by, ISO 2022). So that
would include any encoding of the ISO-8859-X family but not windows-XXXX
or macintosh encodings.
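
The difference can be checked directly. This is a small sketch assuming
Python's built-in "iso8859-1", "iso8859-15" and "cp1252" codecs; the
variable names are only illustrative.

```python
c1_bytes = bytes(range(0x80, 0xA0))

# ISO-8859-X encodings keep a full C1 range: each byte decodes to the
# control code point with the same value.
for name in ("iso8859-1", "iso8859-15"):
    text = c1_bytes.decode(name)
    assert [ord(ch) for ch in text] == list(c1_bytes)

# windows-1252 reuses most of that range for graphic characters instead.
remapped = {}
for b in c1_bytes:
    try:
        remapped[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        continue  # one of the unassigned "holes"

print(len(remapped), "of", len(c1_bytes), "bytes carry graphic characters")
print("0x80 ->", f"U+{ord(remapped[0x80]):04X}")
```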

In other words, the C1 range isn't just a dumping ground for cases where
the conversion would fail otherwise.

Regards, Martin.
Received on Tue Nov 27 2012 - 03:53:02 CST