Re: Why Nothing Ever Goes Away

From: Sean Leonard <lists+unicode_at_seantek.com>
Date: Fri, 9 Oct 2015 11:05:54 -0700

Satisfactory answers, thank you very much.

Going back to doing more research. (Silence does not imply abandoning
the C1 Control Pictures project; just a lot to synthesize.)

Regarding the three code points U+0080, U+0081, and U+0099: the fact that
Unicode defers mostly to ISO 6429 and other standards that preceded it
(e.g., ANSI X3.32 / ISO 2047 / ECMA-17) means that it is not
particularly urgent for those code points to get Unicode names. Nor do I
find that their lack of definition precludes pictorial representations.
For the current U+2400 block, the Standard says: "The diagonal lettering
glyphs are only exemplary; alternate representations may be, and often
are used in the visible display of control codes" (Section 22.7).

I am now in possession of a copy of ANSI X3.32-1973 and ECMA-17:1968
(the latter is available on ECMA's website). I find it worthwhile to
point out that the Transmission Controls and Format Effectors were not
standardized by the time of ECMA-17:1968, but the symbols are the same
nonetheless. ANSI X3.32-1973 has the standardized control names for
those characters.

Sean

On 10/6/2015 6:57 AM, Philippe Verdy wrote:
>
> 2015-10-06 14:24 GMT+02:00 Sean Leonard <lists+unicode_at_seantek.com>:
>
>     2. The Unicode code charts are (deliberately) vague about U+0080,
>     U+0081, and U+0099. All other C1 control codes have aliases to the
>     ISO 6429 set of control functions, but in ISO 6429, those three
>     control codes don't have any assigned functions (or names).
>
>
> On 10/5/2015 3:57 PM, Philippe Verdy wrote:
>
> Also the aliases for C1 controls were formally registered in
> 1983 only for the two ranges U+0084..U+0097 and U+009B..U+009F
> for ISO 6429.
>
>
> If I may, I would appreciate another history lesson:
> In ISO 2022 / 6429 land, it is apparent that the C1 controls are
> mainly aliases for ESC 4/0 - 5/15 (@ through _). This might vary
> depending on what is loaded into the C1 register, but overall, it
> just seems like saving one byte.
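
(An aside for illustration, not part of the quoted message: a minimal
Python sketch of that aliasing. The byte arithmetic follows the ISO 2022
rule that a C1 control 0x80+n has the 7-bit form ESC followed by 0x40+n;
the helper names are invented here.)

    ESC = 0x1B

    def c1_to_esc(c1_byte: int) -> bytes:
        # Map an 8-bit C1 control (0x80..0x9F) to its 7-bit alias
        # ESC 4/0 .. ESC 5/15, i.e. ESC '@' .. ESC '_'.
        assert 0x80 <= c1_byte <= 0x9F
        return bytes([ESC, c1_byte - 0x40])

    def esc_to_c1(seq: bytes) -> int:
        # Reverse mapping: ESC '@'..'_' back to the single C1 byte.
        assert seq[0] == ESC and 0x40 <= seq[1] <= 0x5F
        return seq[1] + 0x40

    # CSI (0x9B) in its 7-bit form is ESC '[', the familiar prefix of
    # ANSI escape sequences -- the "saving one byte" mentioned above.
    assert c1_to_esc(0x9B) == b'\x1b['
    assert esc_to_c1(b'\x1b[') == 0x9B
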
>
> Why was C1 invented in the first place?
>
>
> Look at the history of EBCDIC and its adaptation/conversion to
> ASCII-compatible encodings: round-trip conversion was needed (using
> only a simple reordering of byte values, with no duplicates). EBCDIC
> used many controls that were not part of C0, and those were kept in
> the C1 set. Ignore the 7-bit compatibility encoding using ESC pairs;
> it was only needed for ISO 2022, and ISO 6429 defines a profile where
> those longer sequences are not needed and are even forbidden in 8-bit
> contexts, or in contexts where aliases are undesirable and
> invalidated, such as security environments.
>
> By that reasoning, one would conclude that assigning characters in the
> G1 set was also a duplication, because each of them is reachable with
> a C0 "shifting" control plus a position of the G0 set. In that case
> ISO 8859-1 or Windows-1252 was also an unneeded duplication! And we
> would live today in a 7-bit-only world.
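
(Another illustrative aside, assuming Python and invented helper names:
the same point in code, showing how a G1/GR character is reachable in a
7-bit ISO 2022 stream through the C0 locking shifts SO and SI.)

    SO, SI = 0x0E, 0x0F   # Shift Out (invoke G1 into GL), Shift In (back to G0)

    def gr_via_shift(gr_byte: int) -> bytes:
        # Represent one 8-bit GR byte (0xA0..0xFF) as SO + 7-bit position + SI.
        assert 0xA0 <= gr_byte <= 0xFF
        return bytes([SO, gr_byte - 0x80, SI])

    # With G1 designated as the ISO 8859-1 right half, 0xE9 (e-acute)
    # becomes SO 'i' SI in the 7-bit stream -- "reachable", but hardly a
    # reason to call the dedicated 8-bit form an unneeded duplicate.
    assert gr_via_shift(0xE9) == b'\x0e\x69\x0f'
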
>
> C1 controls have their own identity. The 7-bit encoding using ESC is
> just a hack to make them fit in 7 bits, and it only works where the
> ESC control is assumed to play this function according to ISO 2022,
> ISO 6429, or other similar old 7-bit protocols such as Videotex
> (which was widely used in France on the free "Minitel" terminal, long
> before the Internet reached the general public around 1992-1995).
>
> Today Videotex is definitely dead: the old call numbers for this slow
> service are defunct, and the Minitel terminals have been recycled as
> waste; they stopped being distributed once they were replaced by PC
> applications connected to the Internet. All the old services are now
> directly on the Internet, and none of them use 7-bit encodings for
> their HTML pages or their mobile applications. France has also
> definitively abandoned its old French version of ISO 646; there are
> no longer any printers supporting versions of ISO 646 other than
> ASCII, though they still support various 8-bit encodings.
>
> 7-bit encodings are things of the past. They were only justified at a
> time when communication links were slow and generated lots of
> transmission errors, and the only implemented mechanism to check them
> was a single parity bit per character. Today we transmit long
> datagrams and prefer check codes over the whole datagram (such as
> CRCs or error-correcting codes). 8-bit encodings are much easier and
> faster to process for transmitting not just text but also binary data.
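
(A small sketch, assuming Python, of the two checking schemes contrasted
above: one parity bit per 7-bit character versus one check value for a
whole datagram. The helper name and the sample payload are invented.)

    import zlib

    def with_even_parity(ch: int) -> int:
        # Pack a 7-bit character plus an even-parity bit into one octet,
        # as old serial links did.
        assert 0 <= ch < 0x80
        parity = bin(ch).count('1') & 1
        return ch | (parity << 7)

    assert with_even_parity(0x41) == 0x41   # 'A': even number of 1-bits
    assert with_even_parity(0x43) == 0xC3   # 'C': odd number, parity bit set

    # Today a whole datagram gets a single check value instead, e.g. CRC-32.
    datagram = b'a much longer payload...'
    crc = zlib.crc32(datagram)
    assert 0 <= crc <= 0xFFFFFFFF
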
>
> Let's leave the 7-bit world behind for good. We have also abandoned
> the old UTF-7 in Unicode! I have not seen it used anywhere except in
> a few old emails sent at the end of the '90s, because many mail
> servers were still not 8-bit clean and silently transformed non-ASCII
> bytes in unpredictable ways or with unspecified encodings, or just
> silently dropped the high bit, assuming it was just a parity bit. At
> that time, emails were not sent with SMTP but with the old UUCP
> protocol, and could take weeks to be delivered to the final
> recipient, as there was still no global routing infrastructure and
> many hops were necessary over non-permanent modem links. My opinion
> of UTF-7 is that it was just a temporary and experimental solution to
> help system admins and developers adopt the new UCS, including in
> their old 7-bit environments.

On 10/6/2015 8:33 AM, Asmus Freytag (t) wrote:
> On 10/6/2015 5:24 AM, Sean Leonard wrote:
>> And, why did Unicode deem it necessary to replicate the C1 block at
>> 0x80-0x9F, when all of the control characters (codes) were equally
>> reachable via ESC 4/0 - 5/15? I understand why it is desirable to
>> align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF with
>> Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and all the
>> other non-ISO-standardized 8-bit encodings got this much right:
>> duplicating control codes is basically a waste of very precious
>> character code real estate.
>
> Because Unicode aligns with ISO 8859-1, so that transcoding from that
> was a simple zero-fill to 16 bits.
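
(A minimal sketch of that zero-fill, assuming Python for illustration:
every ISO 8859-1 byte value is numerically equal to its Unicode code
point, so transcoding is pure widening.)

    data = bytes([0x41, 0x9B, 0xE9])   # 'A', C1 CSI, e-acute in ISO 8859-1

    # Zero-extend each byte to 16 bits: the byte value *is* the code point.
    code_points = [b for b in data]
    assert code_points == [0x0041, 0x009B, 0x00E9]

    # Python's latin-1 codec performs exactly this widening.
    assert [ord(c) for c in data.decode('latin-1')] == code_points
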
>
> 8859-1 was the most widely used single byte (full 8-bit) ISO standard
> at the time, and making that transition easy was beneficial, both
> practically and politically.
>
> Vendor standards all disagreed on the upper range, and it would not
> have been feasible to single out any of them. Nobody wanted to follow
> the IBM code page 437 (then still the most widely used single byte
> vendor standard).
>
>
> Note that by "then" I refer to dates earlier than the dates of the
> final drafts, because many of those decisions date back to earlier
> periods when the drafts were first developed. Also, the overloading
> of 0x80-0xFF by Windows did not happen all at once; earlier versions
> had left much of that space open, but then people realized that as
> long as you were still limited to 8 bits, throwing away 32 codes was
> an issue.
>
> Now, for Unicode, 32 values out of 64K (initially) or 1,114,112 (now)
> don't matter, so being "clean" didn't cost much. (Note that even for
> UTF-8, there's no special benefit to a value being inside that second
> range of 128 codes.)
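
(A sketch of that parenthetical, assuming Python: in UTF-8 the C1 code
points U+0080..U+009F cost two bytes, the same as every other code point
up to U+07FF, so keeping the range "clean" neither saved nor wasted
anything there.)

    assert '\u0085'.encode('utf-8') == b'\xc2\x85'   # NEL, a C1 control: 2 bytes
    assert '\u00e9'.encode('utf-8') == b'\xc3\xa9'   # e-acute: also 2 bytes
    # Every code point in U+0080..U+07FF takes exactly two bytes in UTF-8.
    assert all(len(chr(cp).encode('utf-8')) == 2 for cp in range(0x80, 0x800))
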
>
> Finally, even if the range had not been dedicated to C1, the 32 codes
> would have had to be given space, because the translation into ESC
> sequences is not universal; in transcoding data you need a way to
> retain the difference between the raw code and the ESC sequence, or
> your round trip would not be lossless.
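
(One last illustrative sketch, assuming Python: if a transcoder folded
the raw C1 byte and its ESC-sequence alias into a single Unicode form,
the original distinction could not be recovered. Keeping both
representations, U+009B versus U+001B U+005B, makes the round trip
lossless.)

    raw_csi = b'\x9b'      # the single-byte C1 CSI
    esc_csi = b'\x1b['     # its 7-bit ESC alias

    # Each form decodes to a distinct Unicode sequence and re-encodes to
    # exactly the bytes that came in.
    assert raw_csi.decode('latin-1').encode('latin-1') == raw_csi
    assert esc_csi.decode('latin-1').encode('latin-1') == esc_csi
    assert raw_csi.decode('latin-1') != esc_csi.decode('latin-1')
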
>
> A./