Concise term for non-ASCII Unicode characters
lists+unicode at seantek.com
Tue Sep 22 05:18:46 CDT 2015
On 9/22/2015 1:45 AM, Philippe Verdy wrote:
> I would not use the "clumsy 7-bit ASCII" due to the confusion created
> since long when it could refer to any national version of ISO 646,
> which reassign some code positions in the range 0x00 to 0x7F to other
> characters outside the range U+0000 to U+007F, while still remaining
> 7-bit encodings.
> So instead of "7-bit ASCII" I highly prefer the term "US-ASCII" to make
> sure it refers to the encoding of 7-bit code positions effectively to
> So for code positions outside 0x00..0x7F, I would call them "not
> US-ASCII" (none of them are bound to any Unicode "character" or "code
> point" or "scalar value", they are just "code positions" or more
> precisely "octet values with their most significant bit set to 1"
> which is really long: "not US-ASCII" is fine as a shorter term).
Again having just read through ANSI X3.4-1986 (R1997), I would like to
clarify some things.
The standard itself is titled:
    American National Standard for Information Systems - Coded Character
    Sets - 7-Bit American National Standard Code for Information Interchange
However, Clause 1.1 states:
    This standard specifies a set of 128 characters (control characters and
    graphic characters, such as letters, digits, and symbols) with their
    coded representation. The American National Standard Code for
    Information Interchange may also be identified by the acronym ASCII
    (pronounced ask-ee). To explicitly designate a particular (perhaps
    prior) edition of this standard, the last two digits of the year of
    issue may be appended, as in "ASCII 68" or "ASCII 86".
According to the title, "7-Bit ASCII" is proper. However, according to
the text, "ASCII" is sufficient. The "7-Bit" part really just emphasizes
the fact that it is a 7-bit standard. The eighth bit is outside the
scope of the standard (but see Clause 2.1.1). (Incidentally, Clause 1.1
is not Y2K compliant! Thus you should '86 that part of ASCII 86...hehe)
The term "US-ASCII" (see also RFC 2046 for a lot of discussion) is
similarly redundant. After all, it is the *American* *National* Standard
Code for Information Interchange. Even if you remove the term "National"
(which does not appear in ASCII 68 or ASCII 63), it's still American.
However, ASCII 68 (partially reprinted in RFC 20:
<https://tools.ietf.org/html/rfc20>) actually permits "the notation
ASCII (pronounced as'-key) or USASCII (pronounced you-sas'-key) [...] to
mean the code prescribed by the latest issue of the standard". That is
probably the genesis of US-ASCII. I wasn't alive at the time so I don't
know. My suspicion is that "US-ASCII" was meant to disambiguate ASCII 86
from ASCII 68 (which is referred to as "ASCII" in RFC 821) without
referring to the year, and since 68 and 86 are transposed numerals,
"US-ASCII" eliminates possible mix-ups.
My conclusion here is that "ASCII" is sufficient when talking about the
range of (code or character) positions 0 - 127, regardless of how they
are encoded, so long as they logically evaluate to the bit combinations
of the 7-bit code described in ANSI X3.4-1986.
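To make the "positions 0 - 127, regardless of how they are encoded" point concrete, here is a minimal sketch in Python. The function names is_ascii_text and is_ascii_octets are made up for illustration; they are not taken from any standard or library.

```python
def is_ascii_text(text: str) -> bool:
    """Return True if every code point lies in the range 0..127 (U+0000..U+007F)."""
    return all(ord(ch) <= 0x7F for ch in text)


def is_ascii_octets(data: bytes) -> bool:
    """Return True if no octet has its most significant (eighth) bit set."""
    return all((octet & 0x80) == 0 for octet in data)


if __name__ == "__main__":
    # The same logical characters, carried in two different encodings,
    # still evaluate to positions 0..127 once decoded.
    as_utf8 = "ASCII".encode("utf-8")
    as_utf16 = "ASCII".encode("utf-16-le")
    print(is_ascii_text(as_utf8.decode("utf-8")))
    print(is_ascii_text(as_utf16.decode("utf-16-le")))

    # A character outside the range fails both checks.
    print(is_ascii_text("caf\u00e9"))
    print(is_ascii_octets("caf\u00e9".encode("utf-8")))
```

Python 3.7 and later also provide str.isascii() and bytes.isascii(), which perform the same range test; the explicit loops above are only there to show what is being compared.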
"Basic Latin" also works if you want to avoid the historic reference.
But there are many systems in use that are ASCII-based (including the
Internet, as RFC 20 is still in force), and the term "ASCII" is peppered
throughout the Unicode Standard 8.0 with greater frequency than "Basic
Latin" (which is acknowledged to be a synonym for "ASCII" in Sections
5.7 and 6.2).