[AF:]
> It is the wording in your posts that adds to the confusion.
My fundamental point is, has been, and continues to be that whenever 
people use the more general word "code point" instead of the more 
appropriate "scalar value", that will "add to the confusion". If you 
make the presupposition <http://en.wikipedia.org/wiki/Presupposition> 
that your sequence of "code points" or "scalar values" contains no 
surrogate values, then, yes, this will be
> [DE:] truly a distinction without a difference
but if you're using these words without an explicitly stated 
presupposition, then one will assume that when you mean "code point" you 
do (surprise, surprise) actually mean "code point", which /according to 
the official definitions/ will include "surrogate code points". I 
mentioned this a while ago in a question about ICU, and KenW replied 
that the real world contains bad data. I also think that this
> [DE:] it is very unlikely that Twitter and others are storing and interchanging loose surrogates
is incorrect. Not sure whether the Twitter hack I linked to made use of 
/loose/ surrogates, but it was based on encoding and storing surrogates.
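To make the "code point" versus "scalar value" distinction concrete, here is a minimal Python sketch. The ranges are the ones given in the official definitions; the helper names are only illustrative and do not come from the standard or from any library.

```python
# Code points span U+0000..U+10FFFF; surrogate code points are U+D800..U+DFFF.
# Scalar values are all code points EXCEPT the surrogate code points.
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n: int) -> bool:
    """True for any integer in the Unicode codespace."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """True for high- and low-surrogate code points alike."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST


def is_scalar_value(n: int) -> bool:
    """True for code points that are not surrogates."""
    return is_code_point(n) and not is_surrogate(n)


def can_encode_utf8(s: str) -> bool:
    """Python's str may hold a lone surrogate, but UTF-8 encoding rejects it."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False
```

With these definitions, `is_code_point(0xD800)` holds while `is_scalar_value(0xD800)` does not, which is exactly the difference that disappears only under the presupposition that no surrogates are present.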
[AF:]
> [some paragraphs terminating in:]
> Some people writing end user materials may have shown terminological 
> muddle
Sorry to say, but that's apparently the way Twitter misconstrued it. The 
alternative to characterizing their reading of the word "code point" as 
muddle (a reading which is rather un-crazy, though your email minimizes 
the extent to which such interpretations or "mis"construals exist 
online) is to say that Twitter has been, for a 
long time, /blatantly/ wrong in their official attempt at clarifying the 
details of the distinguishing feature of their product, after having the 
product out for an even longer time.
From time to time I will encounter products that appear to handle 
Unicode but whose string handling gets deeply confused once you 
enter/paste anything beyond the BMP; you can blame this on confusing 
"code point" with "code unit" instead, but if the first word didn't 
exist (because it shouldn't), there would be no confusion.
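The kind of miscount that trips up such products can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and it assumes a Python 3 `str`, which is indexed by code point.

```python
def count_code_points(s: str) -> int:
    # A Python 3 str is a sequence of code points, so len() counts those.
    return len(s)


def count_utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is 2 bytes; "utf-16-le" adds no byte order mark.
    return len(s.encode("utf-16-le")) // 2


def count_utf8_code_units(s: str) -> int:
    # Each UTF-8 code unit is 1 byte.
    return len(s.encode("utf-8"))


# One BMP character followed by one supplementary character (U+1F600).
text = "a\U0001F600"

code_points = count_code_points(text)
utf16_units = count_utf16_code_units(text)
utf8_units = count_utf8_code_units(text)
```

For this `text` the three counts are 2, 3, and 5 respectively. Software that equates "length" with the UTF-16 count will report three "characters" as soon as anything beyond the BMP is pasted in.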
This qualification
> [AF:] by those who have the requisite technical background
of this statement
> [AF:] to insinuate that the definitions are widely confused
of course makes it true. As long as "high-surrogate code point" and 
"low-surrogate code point" aren't officially deprecated, confusion will 
persist. They should be deprecated, because, /as you say/:
> [AF:] Once you add the UTF-prefix, you are, by force, speaking of code 
> units.
So the high-low distinction for "surrogate" code points is misleading, 
and the "surrogate" attribute for "code point" shouldn't be there, 
because, as I've in fact written in a much earlier thread and as people 
know, surrogates are UTF-16-specific.
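As an illustration of why surrogates belong to UTF-16 only, the following Python sketch applies the UTF-16 arithmetic for supplementary scalar values and compares the result with Python's own encoders. The helper names are illustrative, not part of any API.

```python
def to_surrogate_pair(scalar: int) -> tuple[int, int]:
    """Split a supplementary scalar value into UTF-16 high/low code units."""
    if not 0x10000 <= scalar <= 0x10FFFF:
        raise ValueError("only supplementary scalar values need a pair")
    offset = scalar - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits
    return high, low


def from_surrogate_pair(high: int, low: int) -> int:
    """Recombine a high/low code unit pair into the scalar value."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a well-formed surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


pair = to_surrogate_pair(0x1F600)

# The same scalar value in the three encoding forms: only UTF-16
# produces the values D83D and DE00.
utf16_bytes = "\U0001F600".encode("utf-16-be")
utf8_bytes = "\U0001F600".encode("utf-8")
utf32_bytes = "\U0001F600".encode("utf-32-be")
```

Here `pair` is `(0xD83D, 0xDE00)`, matching the UTF-16 bytes `D8 3D DE 00`, while the UTF-8 form is `F0 9F 98 80` and the UTF-32 form is `00 01 F6 00`. Neither of the latter involves a surrogate at any point.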
Stephan
Received on Tue Sep 17 2013 - 16:58:40 CDT