Concise term for non-ASCII Unicode characters
richard.wordingham at ntlworld.com
Mon Sep 21 13:18:29 CDT 2015
On Mon, 21 Sep 2015 12:46:48 +0100
"Tony Jollans" <Tony at jollans.com> wrote:
> These days, it is pretty sloppy coding that cares how many bytes an
> encoding of something requires, although there may be many
> circumstances where legacy support is required.
Wow! Are you saying that code chopping up arbitrary character sequences
for legibility (and editability!) and to avoid buffering issues should
generally assume it will be read as UTF-8, and avoid splitting
well-formed UTF-8 characters? (If the text is actually Windows-1252,
there may be a lot of apparently ill-formed UTF-8 characters/gibberish.)
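
To make the splitting question concrete, here is a minimal sketch in Python of a chunker that backs up to a character boundary before cutting. The names utf8_safe_cut and split_chunks are made up for illustration and are not from any library. It assumes the input is meant to be UTF-8; if a would-be cut point is preceded by more than three continuation-looking bytes (as can happen with Windows-1252 text), it gives up and cuts at the byte limit.

```python
def utf8_safe_cut(data: bytes, limit: int) -> int:
    """Return a cut index <= limit that avoids splitting a UTF-8 sequence.

    Hypothetical helper: assumes limit >= 1 and that data is intended
    to be UTF-8. Falls back to a plain cut at `limit` when no boundary
    is found within 3 bytes.
    """
    if limit >= len(data):
        return len(data)
    cut = limit
    # Continuation bytes have the bit pattern 10xxxxxx (0x80..0xBF).
    # A well-formed sequence has at most 3 of them, so back up at most 3.
    for _ in range(3):
        if cut > 0 and (data[cut] & 0xC0) == 0x80:
            cut -= 1
        else:
            break
    if cut == 0 or (data[cut] & 0xC0) == 0x80:
        # Either the limit is smaller than one character, or the data is
        # not well-formed UTF-8 here: cut at the byte limit regardless.
        return limit
    return cut


def split_chunks(data: bytes, limit: int) -> list[bytes]:
    """Split data into pieces of at most `limit` bytes each."""
    if limit < 1:
        raise ValueError("limit must be at least 1")
    pieces = []
    pos = 0
    while pos < len(data):
        cut = utf8_safe_cut(data[pos:], limit)
        pieces.append(data[pos:pos + cut])
        pos += cut
    return pieces
```

With a limit of 3 bytes, "naïve café" encoded as UTF-8 comes out as five pieces, each of which decodes on its own; a Windows-1252 byte string still gets split, just at the plain byte limit where no boundary is found.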
> You say that, in some
> contexts, one needs to be really clear that the octets 0x80 - 0xFF
> are Unicode. Either something "is" Unicode, or it isn't. Either
> something uses a recognised encoding, or it doesn't. Using these
> octets to represent Unicode code points is not ASCII, is not UTF-8,
> and is not UCS-2/UTF-16; it could, perhaps, be EBCDIC.
But most of these octets *are* used to represent non-ASCII scalar
values. It's just that they have to operate in combinations for UTF-8.
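
As a rough illustration of "most of these octets", the following Python sketch sorts each octet by the role it can play in well-formed UTF-8. The function name utf8_role and the role labels are invented for this example; the ranges are those of the UTF-8 definition, in which 0xC0, 0xC1 and 0xF5 - 0xFF never occur.

```python
from collections import Counter


def utf8_role(octet: int) -> str:
    """Classify one octet by its possible role in well-formed UTF-8."""
    if octet < 0x80:
        return "ascii"
    if octet <= 0xBF:
        return "continuation"   # 10xxxxxx, never first in a sequence
    if octet in (0xC0, 0xC1) or octet >= 0xF5:
        return "unused"         # cannot appear in well-formed UTF-8
    if octet <= 0xDF:
        return "lead-2"         # starts a 2-byte sequence
    if octet <= 0xEF:
        return "lead-3"         # starts a 3-byte sequence
    return "lead-4"             # 0xF0..0xF4, starts a 4-byte sequence


# Tally the roles of the 128 octets 0x80..0xFF.
roles = Counter(utf8_role(b) for b in range(0x80, 0x100))

# U+00E9 needs a lead byte and a continuation byte working together.
e_acute = "\u00e9".encode("utf-8")
```

Of the 128 octets from 0x80 to 0xFF, 115 can appear in well-formed UTF-8 (64 continuation bytes and 51 lead bytes) and 13 cannot. U+00E9, for instance, is the pair 0xC3 0xA9: one lead byte and one continuation byte.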