Re: Concise term for non-ASCII Unicode characters

From: Ken Whistler <kenwhistler_at_att.net>
Date: Tue, 29 Sep 2015 11:50:40 -0700

On 9/29/2015 10:30 AM, Sean Leonard wrote:
> On 9/29/2015 9:40 AM, Daniel Bünzli wrote:
>> I would say there's already enough terminology in the Unicode world
>> to add more to it. This thread already hinted at enough ways of
>> expressing what you'd like, the simplest one being "scalar values
>> greater than U+001F". This is the clearest you can come up with and
>> anybody who has basic knowledge of the Unicode standard
> Uh...I think you mean U+007F? :)

I agree that "scalar values greater than U+007F" doesn't exactly trip
off the tongue, and while technically accurate, it is bad terminology --
precisely because it raises the question "wtf are 'scalar values'?!"
for the average engineer.

>
> Perhaps it's because I'm writing to the Unicode crowd, but honestly
> there are a lot of very intelligent software engineers/standards folks
> who do not have the "basic knowledge of the Unicode standard" that is
> being presumed. They want to focus on other parts of their systems or
> protocols, and when it comes to the "text part", they just hand-wave
> and say "Unicode!" and call it a day. ...

Well, from this discussion, and from my experience as an engineer, I
think this comes down to people in other standards, practices, and
protocols dealing with the age-old problem of on beyond zebra for
characters, where the comfortable assumption that byte=character
breaks down and people have to special-case their code and
documentation. Where buffers overrun, where black hat hackers rub
their hands in glee, and where engineers exclaim, "Oh gawd! I can't
just cast this character, because it's actually an array!"
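
Just to make that concrete, here is a minimal C sketch (my own
illustration, not anything from the thread): strlen() counts bytes,
so byte count and character count diverge the moment a non-ASCII
character shows up in a UTF-8 string.

    /* Illustrative only: strlen() counts bytes, not characters. */
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        const char *ascii = "cafe";          /* 4 characters, 4 bytes in UTF-8 */
        const char *utf8  = "caf\xC3\xA9";   /* "café": 4 characters, 5 bytes  */

        printf("%zu\n", strlen(ascii));  /* prints 4 */
        printf("%zu\n", strlen(utf8));   /* prints 5 -- byte != character */
        return 0;
    }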

And nowadays, we are in the age of universal Unicode. All (well, much,
anyway) would be cool
if everybody were using UTF-32, because then at least we'd be back to
32-bit-word=character,
and the programming would be easier. But UTF-32 doesn't play well with
existing protocols
and APIs and storage and... So instead, we are in the age of "universal
Unicode and almost
always UTF-8."

So that leaves us with two types of characters:

1. "Good characters"

These are true ASCII. U+0000..U+007F. Good because they are all single
bytes in UTF-8
and because then UTF-8 strings just work like the Computer Science God
always intended,
and we don't have to do anything special.

2. "Bad characters"

Everything else: U+0080..U+10FFFF. Bad because they require multiple
bytes to represent
in UTF-8 and so break all the simple assumptions about string and buffer
length.
They make for bugs and more bugs and why oh why do I have to keep
dealing with
edge cases where character boundaries don't line up with allocated
buffer boundaries?!!

I think we can agree that there are two types of characters -- and that
those code point
ranges correctly identify the sets in question.
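
To pin the two sets down in code, here is a minimal C sketch of my own
(the ranges are exactly the ones listed above; the helper names are
just illustrative, not from any particular library):

    /* Set #1 ("good"): U+0000..U+007F, exactly one byte in UTF-8.
     * Set #2 ("bad"):  U+0080..U+10FFFF, two to four bytes in UTF-8.
     * (Surrogate code points U+D800..U+DFFF are not scalar values
     * and are ignored here.) */
    #include <stdbool.h>
    #include <stdint.h>

    static bool is_ascii_scalar(uint32_t cp) {
        return cp <= 0x7F;
    }

    static bool is_non_ascii_scalar(uint32_t cp) {
        return cp >= 0x80 && cp <= 0x10FFFF;
    }

    /* Bytes needed to encode a scalar value in UTF-8. */
    static int utf8_length(uint32_t cp) {
        if (cp <= 0x7F)   return 1;   /* the "good" set */
        if (cp <= 0x7FF)  return 2;
        if (cp <= 0xFFFF) return 3;
        return 4;                     /* up through U+10FFFF */
    }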

The problem then just becomes a matter of terminology (in the standards
sense of "terminology") -- coming up with usable, clear terms for the
two sets. To be good terminology, the terms have to be identifiable and
neither too generic ("good characters" and "bad characters") nor too
abstruse or wordy ("scalar values less than or equal to U+007F" and
"scalar values greater than U+007F").

They also need to not be confusing. For example, "single-byte UTF-8"
and "multi-byte UTF-8" might work for engineers, but they draw a
confusing distinction, because UTF-8 as an encoding form is inherently
multi-byte, and such terminology would undermine the meaning of UTF-8
itself.

Finally, to be good terminology, the terms need to have some reasonable
chance of catching on and actually being used. It is fairly pointless
to have a "standardized way"
"standardized way"
of distinguishing the #1 and #2 types of characters if people either
don't know about
that standardized way or find it misleading or not helpful, and instead
continue groping
about with their existing ad hoc terms anyway.

>
> In the twenty minutes since my last post, I got two different
> responses...and as you pointed out, there are a lot of ways to express
> what one would like. I would prefer one, uniform way (hence,
> "standardized way").

Mark's point was that it is hard to improve on what we already have:

1. ASCII Unicode [characters] (i.e. U+0000..U+007F)

2. Non-ASCII Unicode [characters] (i.e. U+0080..U+10FFFF)

If we just highlight that terminology more prominently, emphasize it in the
Unicode glossary, and promote it relentlessly, it might catch on more
generally,
and solve the problem.

More irreverently, perhaps we could come up with complete neologisms that
might be catchy enough to go viral -- at least among the protocol
writers and
engineers who matter for this. Riffing on the small/big distinction and
connecting
it to "u-*nichar*" for the engineers, maybe something along the lines of:

1. skinnichar

2. baloonichar

Well, maybe not those! But you get the idea. I'm sure there is a budding
terminologist
out there who could improve on that suggestion!

At any rate, any formal contribution that suggests coming up with
terminology for
the #1 and #2 sets should take these considerations under advisement.
And unless
it suggests something that would pretty easily gain consensus as
demonstrably better than
the #1 and #2 terms suggested above by Mark, it might not result in any
change in actual usage.

--Ken