On 10/6/2015 5:24 AM, Sean Leonard wrote:
And,
why did Unicode deem it necessary to replicate the C1 block at
0x80-0x9F, when all of the control characters (codes) were equally
reachable via ESC 4/0 - 5/15? I understand why it is desirable to
align U+0000 - U+007F with ASCII, and maybe even U+0000 - U+00FF
with Latin-1 (ISO-8859-1). But maybe Windows-1252, MacRoman, and
all the other non-ISO-standardized 8-bit encodings got this much
right: duplicating control codes is basically a waste of very
precious character code real estate.
Because Unicode aligns with ISO 8859-1, so that transcoding from it
was a simple zero-fill to 16 bits. 8859-1 was the most widely used
single-byte (full 8-bit) ISO standard at the time, and making that
transition easy was beneficial, both practically and politically.
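
To make that concrete, here is a minimal Python sketch of the
zero-fill (the sample string is just illustrative):

    # ISO 8859-1 bytes map to Unicode by zero-extending each byte,
    # because U+0000..U+00FF carry the same assignments as 0x00..0xFF.
    latin1 = b"caf\xe9"                    # "café" in ISO 8859-1
    code_points = [b for b in latin1]      # zero-fill: byte value == code point
    text = "".join(chr(cp) for cp in code_points)
    assert text == latin1.decode("iso-8859-1") == "café"
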
Vendor standards all disagreed on the upper range, and it would not
have been feasible to single out any of them. Nobody wanted to follow
IBM code page 437 (then still the most widely used single-byte vendor
standard).
Note, that by "then" I refer to dates
earlier than the dates of the final drafts, because may of those
decisions date back to earlier periods where the drafts were first
developed. Also, the overloading of
0x80-0xFF by Windows did not happen all at once, earlier versions
had left much of that space open, but then people realized that as
long as you were still limited to 8 bits, throwing away 32 codes
was an issue.
Now, for Unicode, 32 out of 64K values (initially) or 1,114,112 (now)
don't matter, so being "clean" didn't cost much. (Note that even for
UTF-8, there's no special benefit to a value being inside that second
range of 128 codes.)
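
As a quick illustration of that parenthetical (the code points below
are arbitrary picks from U+0080..U+00FF; any of them behave the same):

    # In UTF-8, every code point in U+0080..U+00FF takes two bytes,
    # whether it is a C1 control (U+0080..U+009F) or a Latin-1 graphic
    # character, so the C1 block gains no encoding-length advantage
    # from sitting in that second range.
    for cp in (0x0085, 0x009F, 0x00A0, 0x00E9):
        encoded = chr(cp).encode("utf-8")
        print(f"U+{cp:04X} -> {encoded!r} ({len(encoded)} bytes)")
    # Every line reports a 2-byte sequence, e.g. U+0085 -> b'\xc2\x85'.
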
Finally, even if the range had not been dedicated to C1, the 32 codes
would have had to be given space, because the translation into ESC
sequences is not universal; so, in transcoding data, you needed a way
to retain the difference between a raw code and the corresponding ESC
sequence, or the round trip would not be lossless.
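
To sketch that last point, assume a (hypothetical) transcoder that
folds raw C1 controls into their 7-bit ESC form (C1 0x80+n <-> ESC
0x40+n, per ISO 2022):

    ESC = 0x1B

    def c1_to_esc(data: bytes) -> bytes:
        # Replace each raw C1 control with ESC followed by the Fe byte.
        out = bytearray()
        for b in data:
            if 0x80 <= b <= 0x9F:
                out += bytes([ESC, b - 0x40])
            else:
                out.append(b)
        return bytes(out)

    raw_nel = bytes([0x85])        # raw C1 NEL
    esc_nel = bytes([ESC, 0x45])   # ESC E, the 7-bit form of NEL

    # Both inputs collapse to the same output, so no reverse mapping can
    # restore the distinction: the round trip is lossy unless the raw C1
    # codes keep code points of their own.
    assert c1_to_esc(raw_nel) == c1_to_esc(esc_nel) == esc_nel
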
A./