Re: UTF-8 signature in web and email

From: John Cowan (cowan@mercury.ccil.org)
Date: Wed May 23 2001 - 07:30:11 EDT


Marco Cimarosti scripsit:

> Now, imagine a compiler for some C-like language. If it supports UTF-8 (or
> Latin-1, or EUC-GB), when it receives strings like these:
>
> int \0xEF\0xBB\0xBF i; /* Unicode (UTF-8) */
> int \0xA0 i; /* ISO-8859-1, aka "Latin 1" */
> int \0xA1\0xA1 i; /* GB12345-80 (EUC) */
>
> It will correctly interpret the sequences of bytes >= 0x80 as being "white
> space" in the respective encoding, so it will parse the expression as "int
> i;".
>
> On the other hand, if the compiler does NOT understand these encodings, it
> will parse it as "int WHATSTHAT i;", and issue a syntax error.

Well, "C-like language" is a hedge. IIRC, C99 thinks everything above U+007F is
a letter.
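
For anyone who wants to check what those three byte sequences actually
decode to, here is a rough Python sketch. The helper name describe() is made
up for illustration, and using Python's "gb2312" codec for the EUC-encoded GB
example is an assumption on my part.

```python
import unicodedata

# Byte sequences from the quoted example, paired with a Python codec name.
# "gb2312" is an assumption: it stands in for the EUC-encoded GB set.
SAMPLES = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xa0", "latin-1"),
    (b"\xa1\xa1", "gb2312"),
]


def describe(raw, codec):
    """Decode raw bytes; return (code point, general category) per character."""
    text = raw.decode(codec)
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in text]


for raw, codec in SAMPLES:
    print(codec, describe(raw, codec))
```

The Latin-1 and GB sequences come out as space separators (category Zs),
while the UTF-8 one comes out as U+FEFF, which Unicode classes as a format
character (Cf). What a given compiler does with any of them is a separate
question.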

> DOS users always had these annoying problems importing Unix text files,
> because of Unix's fantasy reinterpretation of ASCII control 0x0A as "line
> break".

The ambiguity of 0x0A as "line feed" versus "new line" was present from the
beginning: at least some Teletypes had a mode to treat 0x0A as "new line".
For this reason, the C1 control characters discriminate: 0x84 (IND) is
unambiguously "index" (move down one line in the same column), whereas 0x85
(NEL) is unambiguously "next line" (move to the start of the next line).
Unfortunately, this never caught on.
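
As a small illustration of the difference between the two C1 controls, here
is a Python sketch. The constant and function names are hypothetical; the
only library call is str.splitlines(), which treats 0x85 as a line boundary
but leaves 0x84 alone.

```python
IND = "\x84"  # C1 control at 0x84
NEL = "\x85"  # C1 control at 0x85, "next line"


def split_lines(text):
    """Split text at the line boundaries that str.splitlines() recognizes."""
    return text.splitlines()


sample = "ab" + IND + "cd" + NEL + "ef"
print(split_lines(sample))
```

Only the 0x85 character ends a line here; the 0x84 character stays inside
the first piece.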

> If Unix designers followed standards, they would have seen that the only
> standard way of having a "line break" in ASCII is to combine 0x0A (meaning
> "move the cursor down one line") with 0x0D (meaning "move the cursor to the
> beginning of the line"), and today we wouldn't have this cross-system
> inconsistency.

Nope. The standard just wasn't that precise. Unix folk wanted a single
line separator character, and 0x0A was the obvious choice. And it
worked on their Model 37 Teletypes, whereas Windows descends
(Windows > DOS > CP/M > RT-11 > TOPS-10, OS-8) from operating systems
that expected Model 33 Teletypes, which required both CR and LF to reach the
next line.
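
In practice the two conventions are easy enough to convert between. A
minimal Python sketch, with function names invented for illustration and
assuming the text is already decoded to a string:

```python
def to_lf(text):
    """Convert CR LF pairs to a single LF (the Unix convention)."""
    return text.replace("\r\n", "\n")


def to_crlf(text):
    """Convert every line ending to CR LF (the DOS convention)."""
    # Normalize to LF first so existing CR LF pairs are not doubled up.
    return to_lf(text).replace("\n", "\r\n")


print(repr(to_crlf("one\ntwo\r\nthree")))
print(repr(to_lf("one\r\ntwo\nthree")))
```

Normalizing to LF before expanding keeps text that already mixes both
conventions from ending up with CR CR LF.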

-- 
John Cowan                                   cowan@ccil.org
One art/there is/no less/no more/All things/to do/with sparks/galore
	--Douglas Hofstadter


