Concise term for non-ASCII Unicode characters
petercon at microsoft.com
Mon Sep 21 19:17:28 CDT 2015
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Sean Leonard
Sent: Monday, September 21, 2015 1:22 AM
> Well what I am getting at is that when writing standards documents in various SDOs (or any other
> computer science text, for that matter), it is helpful to identify these characters/code points.
> However, in contexts where ASCII is getting extended or supplemented (e.g., in the DNS or in e-mail),
> one needs to be really clear that the octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose),
> and not something else.
Well, if you are writing standards that "extend ASCII", then you need to be completely clear that what is being discussed is _not ASCII_. In that sense, I agree with Tony Jollans's comments: be clear about what is being discussed, including which coded character set, or which encoding form for which coded character set.
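To illustrate why the encoding has to be stated, here is a minimal Python sketch. The helper name decode_under and the choice of Latin-1 as the "something else" are my own assumptions for illustration, not anything taken from a standard. It shows that octets in the 0x80 - 0xFF range have no meaning in ASCII and mean different things under different encodings.

```python
def decode_under(data, encodings=("ascii", "utf-8", "latin-1")):
    """Decode the same octets under several encodings.

    Hypothetical helper: maps each encoding name to the decoded text,
    or to None when the octets are not valid in that encoding.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# The two octets C3 A9 are "é" in UTF-8 but two characters in Latin-1.
print(decode_under(b"\xc3\xa9"))
# The single octet E9 is "é" in Latin-1 but is not valid UTF-8 on its own.
print(decode_under(b"\xe9"))
```

Neither byte string is ASCII, and which character each represents depends entirely on the encoding the specification names.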
> FWIW, the term "non-ASCII" is used in e-mail address internationalization ("EAI") in the IETF; its
> opposite is "all-ASCII" (or simply "ASCII"). (RFCs 6530, 6531, 6532). The term also appears in RFC
> 2047 from November 1996 but there it has the more expansive meaning (i.e., not limited or
> targeted to Unicode).
Glancing at the Introduction for RFC 6530, it seems to have clear terminology:
"Without the extensions specified in this document, the mailbox name is restricted to a subset of 7-bit ASCII [RFC5321]. Though MIME [RFC2045] enables the transport of non-ASCII data..."
Here, "ASCII" means ASCII: the 7-bit encoding originally defined as ANSI X3.4. And "non-ASCII data" appears to mean data involving any characters other than those in the ASCII coded character set, or any data in an encoded representation other than ASCII. The term "all-ASCII" is used in section 4.2, but it is immediately defined:
"In this document, an address is "all-ASCII", or just an "ASCII address", if every character in the address is in the ASCII character repertoire [ASCII]; an address is "non-ASCII", or an "i18n-address", if any character is not in the ASCII character repertoire."
So, it seems like they had a terminology need similar to what you describe, and they handled it in a satisfactory, clear way.
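The RFC 6530 definition quoted above is simple enough to state as code. This is only a sketch in Python: the function names are hypothetical, and it treats an address as an already-decoded string of Unicode characters, checking nothing but the repertoire test (no address syntax validation).

```python
def is_all_ascii(address):
    """True if every character is in the ASCII repertoire (U+0000..U+007F)."""
    return all(ord(ch) <= 0x7F for ch in address)


def classify_address(address):
    """Label an address using the two terms RFC 6530 defines.

    Hypothetical helper: applies only the repertoire test, nothing else.
    """
    return "all-ASCII" if is_all_ascii(address) else "non-ASCII"


print(classify_address("user@example.com"))
print(classify_address("用户@example.com"))
```

Note that a single character outside the ASCII repertoire is enough to make the whole address "non-ASCII" in this sense.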
If what you need to describe is UTF-8 sequences of two or more bytes, then I would be clear that the context is Unicode UTF-8, not ASCII or any other coded character set / encoding form; and I would say, "Unicode UTF-8 code unit sequences of two to four bytes" or "Unicode UTF-8 multi-byte sequences" or something along those lines.
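As a rough sketch of what "UTF-8 multi-byte sequences" picks out, the Python below lists the characters of a string whose UTF-8 encoded form is two to four code units long. The function name is hypothetical, and the input is assumed to be well-formed text (a lone surrogate would make the encode call raise an error).

```python
def utf8_multibyte_sequences(text):
    """Return (character, UTF-8 bytes) pairs for every character whose
    UTF-8 encoded form is two to four bytes long.

    Hypothetical helper: assumes `text` contains no lone surrogates.
    """
    result = []
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) >= 2:  # single-byte forms are exactly the ASCII range
            result.append((ch, encoded))
    return result


for ch, encoded in utf8_multibyte_sequences("aé€😀"):
    print(f"U+{ord(ch):04X} -> {len(encoded)} bytes: {encoded.hex(' ')}")
```

In UTF-8 the single-byte sequences are exactly the ASCII range, so this selects the same characters as "outside the ASCII repertoire". The wording suggested above differs in making the encoding form explicit.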
If you think it's a serious problem that there isn't one conventional term for "characters outside the ASCII repertoire" or "UTF-8 multi-code-unit encoded representations" (since different authors could devise different terminology solutions), then I suggest you submit a document to UTC explaining why it's a problem, documenting inconsistent or unclear terminology that's been used in some standards / public specifications, and requesting that Unicode formally define terminology for these concepts. I can't guarantee that UTC will do it, but I can predict with confidence that it _won't_ do anything of that nature if nobody submits such a document.