Concise term for non-ASCII Unicode characters

Daniel Bünzli daniel.buenzli at
Mon Sep 21 06:55:04 CDT 2015

On Monday, 21 September 2015 at 09:22, Sean Leonard wrote:
> I think we can limit our inquiry to "characters" and "code points". Both
> of those are well-defined in Unicode (see  
> <>).  

I wouldn't say so. If you actually have a look at the definition of character on this page, there are at least 4 different definitions for the notion of character. If you take the one that has a formal definition attached, i.e. the synonym for abstract character (D7), then an abstract character can actually be represented by a *sequence* of Unicode scalar values.
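To illustrate the point about sequences, here is a minimal sketch in Python (the language choice and the helper name `scalar_values` are assumptions made for illustration only). It shows one abstract character, "e with acute accent", represented once by a single scalar value and once by a sequence of two.

```python
import unicodedata

# The same abstract character in two representations.
precomposed = "\u00E9"       # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "\u0065\u0301"  # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def scalar_values(s):
    """Hypothetical helper: list the scalar values of s in U+XXXX notation."""
    return [f"U+{ord(c):04X}" for c in s]


print(scalar_values(precomposed))
print(scalar_values(decomposed))

# Normalizing to NFC composes the sequence into the single scalar value.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The two strings differ in length and content, yet both denote the same abstract character, which is why "character" alone is an ambiguous unit.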

If you are operating in the context of a standard or technical documentation, please do use either code points (D9, D10) or scalar values (D76). These notions have precise definitions, which makes for saner discussions and understanding.
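The difference between the two notions is small enough to state as code. The following is a sketch only, with hypothetical function names: a code point is any integer in the codespace 0 to 0x10FFFF, and a scalar value is any code point outside the surrogate range 0xD800 to 0xDFFF.

```python
def is_code_point(n: int) -> bool:
    """True if n lies in the Unicode codespace (0 to 0x10FFFF inclusive)."""
    return 0x0000 <= n <= 0x10FFFF


def is_scalar_value(n: int) -> bool:
    """True if n is a code point that is not a surrogate (0xD800 to 0xDFFF)."""
    return is_code_point(n) and not (0xD800 <= n <= 0xDFFF)


# U+D800 is a code point but not a scalar value; U+00E9 is both.
print(is_code_point(0xD800), is_scalar_value(0xD800))
print(is_code_point(0x00E9), is_scalar_value(0x00E9))
```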

> I wish that "non-ASCII characters" and "non-ASCII code points" (and  
> non-ASCII scalar values) were sufficient for me. Maybe they can be.  
> However, in contexts where ASCII is getting extended or supplemented  
> (e.g., in the DNS or in e-mail), one needs to be really clear that the  
> octets 0x80 - 0xFF are Unicode (specifically UTF-8, I suppose), and not  
> something else.

So it seems that you want terminology to talk about the *encoding* of Unicode scalar values, rather than the scalar values themselves. Then I think you should specifically avoid terminology like "octets 0x80-0xFF are Unicode", since this doesn't really make sense: there is no Unicode property on octets. You should rather say something like "these octets may belong to the UTF-8 encoding scheme (D95) of Unicode scalar values greater than U+007F".
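The relationship between scalar values and octets can be checked with a short sketch (Python's built-in UTF-8 codec is used; the helper names are hypothetical). Scalar values U+0000 to U+007F encode to a single octet below 0x80, while any greater scalar value encodes to two to four octets that all lie in the range 0x80-0xFF.

```python
def utf8_octets(scalar_value: int) -> bytes:
    """Encode one scalar value to UTF-8 (surrogates are not handled here)."""
    return chr(scalar_value).encode("utf-8")


def has_high_octet(data: bytes) -> bool:
    """True if any octet in data is in the range 0x80-0xFF."""
    return any(b >= 0x80 for b in data)


for sv in (0x001F, 0x007F, 0x0080, 0x00E9, 0x1F600):
    octets = utf8_octets(sv)
    print(f"U+{sv:04X}", octets.hex(" "), has_high_octet(octets))
```

Note that the converse does not hold: an octet in 0x80-0xFF on its own says nothing about Unicode unless the surrounding octet sequence is known to be UTF-8.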


