Re: cp1252 decoder implementation

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Mon, 19 Nov 2012 00:58:28 +0100

The same chapter makes a normative reference to ISO/IEC 2022 for C0
controls; it does not say that this concerns ISO/IEC 8859 (which itself
references ISO/IEC 2022 only informatively, not normatively, just to say
that it is compatible with it, as well as with ISO 6429 and a wide range of
other international or national norms and various private standards, but
not all of them: e.g. the VISCII national standard is not compatible with
ISO/IEC 2022).
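
On the decoder question quoted below, here is a minimal sketch in Python of
the "cp1252+latin1" approach Buck describes. It is an illustration only, not
something the Unicode standard specifies. It assumes Python's built-in
"cp1252" codec, which rejects the five bytes 0x81, 0x8D, 0x8F, 0x90 and 0x9D;
the function names are hypothetical.

```python
# Bytes that Python's built-in cp1252 codec leaves undefined.
UNDEFINED_CP1252 = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})


def decode_cp1252_lenient(data: bytes) -> str:
    """Decode as cp1252, mapping undefined bytes to the same-numbered
    C1 code point (0x81 -> U+0081), as Latin-1 would."""
    out = []
    for b in data:
        if b in UNDEFINED_CP1252:
            out.append(chr(b))  # adaptation-layer choice, not part of cp1252
        else:
            out.append(bytes([b]).decode("cp1252"))
    return "".join(out)


def encode_cp1252_lenient(text: str) -> bytes:
    """Inverse of decode_cp1252_lenient for any string it produced."""
    out = bytearray()
    for ch in text:
        if ord(ch) in UNDEFINED_CP1252:
            out.append(ord(ch))  # U+0081 -> 0x81, and so on
        else:
            out += ch.encode("cp1252")  # raises for unmappable characters
    return bytes(out)
```

With this convention every byte string survives a decode/encode round trip,
but the mapping of those five bytes is a private agreement layered on top of
cp1252, which is the kind of adaptation layer discussed in this thread.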

2012/11/17 Buck Golemon <buck_at_yelp.com>

> > So don't say that there are one-for-one equivalences.
>
> I was just quoting this section of the standard:
> http://www.unicode.org/versions/Unicode6.2.0/ch16.pdf
>
> > There is a simple, one-to-one mapping between 7-bit (and 8-bit) control
> > codes and the Unicode control codes: every 7-bit (or 8-bit) control code
> > is numerically equal to its corresponding Unicode code point.
>
> A one-to-one equivalency between bytes and unicode-points is exactly what
> is specified here, limited to the domain of "8-bit control codes".
>
>
> On Fri, Nov 16, 2012 at 9:48 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>> If you are thinking about "byte values" you are working at the encoding
>> scheme level (in fact another lower level which defines a protocol
>> presentation layer, e.g. "transport syntaxes" in MIME). Unicode codepoints
>> are conceptually not an encoding scheme, just a coded character set
>> (independent of the encoding scheme).
>>
>> Separate the levels of abstraction and you'll be much better off. Forget
>> the apparent homonymies that exist between distinct layers of abstraction
>> and use each standard for what it is designed for (including the Unicode
>> "character/glyph model", which does not define an encoding scheme).
>>
>> So don't say that there are one-for-one equivalences. This is wrong: an
>> adaptation layer must exist between abstraction levels and between separate
>> standards, but the Unicode standard does not specify them completely (with
>> the only exception of the standard UTF encoding schemes, which are just one
>> possible adaptation across some abstraction levels, and are not made to
>> adapt alone to standards other than the Unicode standard itself).
>>
>>
>>
>> 2012/11/17 Buck Golemon <buck_at_yelp.com>
>>
>>> On Fri, Nov 16, 2012 at 4:11 PM, Doug Ewell <doug_at_ewellic.org> wrote:
>>>
>>>> Buck Golemon wrote:
>>>>
>>>>> Is it incorrect to say that 0x81 is a non-semantic byte in cp1252, and
>>>>> to map it to the equally-non-semantic U+81 ?
>>>>>
>>>>> This would allow systems that follow the html5 standard and use cp1252
>>>>> in place of latin1 to continue to be binary-faithful and reversible.
>>>>>
>>>>
>>>> This isn't quite as black-and-white as the question about Latin-1. If
>>>> you are targeting HTML5, you are probably safe in treating an incoming 0x81
>>>> (for example) as either U+0081 or U+FFFD, or throwing some kind of error.
>>>
>>>
>>> Why do you make this conditional on targeting html5?
>>>
>>> To me, replacement and error are out because they mean the system loses
>>> data or completely fails where it used to succeed.
>>> Currently there's no reasonable way for me to implement the U+0081
>>> option other than inventing a new "cp1252+latin1" codec, which seems
>>> undesirable.
>>>
>>>
>>>> HTML5 insists that you treat 8859-1 as if it were CP1252, so it no
>>>> longer matters what the byte is in 8859-1.
>>>
>>>
>>> I feel like you skipped a step. The byte is 0x81 full stop. I agree that
>>> it doesn't matter how it's defined in latin1 (also it's not defined in
>>> latin1).
>>> The section of the unicode standard that says control codes are equal to
>>> their unicode characters doesn't mention latin1. Should it?
>>> I was under the impression that it meant any single-byte encoding, since
>>> it goes out of its way to talk about "8-bit" control codes.
>>>
>>
>>
>
Received on Sun Nov 18 2012 - 18:05:59 CST
