Re: What does it mean to "not be a valid string in Unicode"?

From: Stephan Stiller <stephan.stiller_at_gmail.com>
Date: Sun, 6 Jan 2013 12:59:53 -0800

On Sun, Jan 6, 2013 at 12:34 PM, Mark Davis ☕ <mark_at_macchiato.com> wrote:

> [...]
>

What you write makes sense to me, including that the UTFs carry historical
artifacts in their design.

> (There are many, many discussions of this in the Unicode email archives if
> you have more questions.)
>

Okay. I am fine with ending this thread. *But ...*

I do want to rephrase what baffled me. After sleeping on it, it's clearer
what the issue was: most Unicode discourse talks about code points, with the
implication (pretty much everywhere) that we're encoding *code points* in the
encoding forms. Maybe I've just read this into the discourse, but if Unicode
discussions used the expression "scalar value" more often, there would be no
potential for such a misunderstanding.
(1) Any expression containing "surrogate" *should* be relevant only for
UTF-16.
(2) The notion of "code point" covers scalar values *plus* the surrogate code
points U+D800..U+DFFF.
(3) The expression "code point" is used in an encoding form–independent
context, for the most part.
(4) So it's very confusing to ever write surrogate values (say, D813_hex)
in "U+"-notation. Surrogate values are UTF-16-internal code unit values;
nobody should be thinking about them outside of UTF-16. As it stands, the
terminology is a jumble. (A small sketch below illustrates the distinction.)
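To make the distinction concrete, here is a minimal Python sketch of my own
(not from Mark's mail, and the helper names are just illustrative): surrogate
code points are valid *code points* but not *scalar values*, and surrogates
only arise as 16-bit code units inside UTF-16.

# Unicode code points: U+0000..U+10FFFF.
# Unicode scalar values: code points minus the surrogate range U+D800..U+DFFF.

def is_code_point(cp: int) -> bool:
    return 0x0000 <= cp <= 0x10FFFF

def is_scalar_value(cp: int) -> bool:
    return is_code_point(cp) and not (0xD800 <= cp <= 0xDFFF)

# U+D813 is a code point but not a scalar value, so no UTF can encode it.
assert is_code_point(0xD813) and not is_scalar_value(0xD813)

def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for a scalar value."""
    assert is_scalar_value(cp)
    if cp <= 0xFFFF:
        return [cp]
    cp -= 0x10000
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

# A supplementary character is one scalar value but a surrogate *pair* of
# UTF-16 code units; the surrogates exist only at the code-unit level.
assert utf16_code_units(0x1F600) == [0xD83D, 0xDE00]

# Python mirrors the standard: a lone surrogate code point is not encodable.
try:
    "\ud813".encode("utf-16")
except UnicodeEncodeError as e:
    print("lone surrogate is not encodable:", e.reason)

That is, "code point" is the broader set, "scalar value" is what the encoding
forms actually encode, and surrogate values matter only as UTF-16 code units.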

Stephan
Received on Sun Jan 06 2013 - 15:01:27 CST
