David Hopwood <email@example.com> wrote, and rewrote:
> IMHO it's the definition of "Unicode code point" that is problematic,
> not "Unicode scalar value". They should be synonyms and should have
> domain 0..0xD7FF union 0xE000..0x10FFFF. (This is consistent with the
> definition of "code point" for other CCSs, where there is no requirement
> for the domain to be a contiguous range of integers. The fact that
> there are properties in the UCD for 0xD800..0xDFFF is just a
> It is UTF-16 (CEF) code units that can include 0xD800..0xDFFF.
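
To make the domain in the quoted text concrete, here is a minimal Python sketch. The function names are illustrative only and are not taken from the Standard; the ranges are the ones given in the quote.

```python
# Illustrative names; the ranges come from the quoted text above.
SURROGATE_MIN = 0xD800
SURROGATE_MAX = 0xDFFF
MAX_CODE_POINT = 0x10FFFF


def is_code_point(n: int) -> bool:
    """Any integer in the full codespace 0..0x10FFFF."""
    return 0 <= n <= MAX_CODE_POINT


def is_surrogate(n: int) -> bool:
    """Values that UTF-16 reserves: 0xD800..0xDFFF."""
    return SURROGATE_MIN <= n <= SURROGATE_MAX


def is_scalar_value(n: int) -> bool:
    """A code point that is not a surrogate: 0..0xD7FF or 0xE000..0x10FFFF."""
    return is_code_point(n) and not is_surrogate(n)


# Number of scalar values: the whole codespace minus the 0x800 surrogates.
scalar_value_count = sum(1 for n in range(MAX_CODE_POINT + 1) if is_scalar_value(n))
```

Under these definitions every scalar value is a code point, but the 2,048 surrogates are code points that are not scalar values.
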
I agree, mostly. The definition of "Unicode scalar value" as
"nonsurrogate code point" is completely appropriate.

Another way to look at it is that Unicode scalar values are those values
that could, architecturally, have characters assigned to them. The word
"architecturally" is the key here. The code points U+xxFFFE and
U+xxFFFF, plus U+FDD0 through U+FDEF, are noncharacters by
administrative rule only. There is nothing in the architecture of
Unicode that precludes their use as character points. By contrast,
U+D800 through U+DFFF cannot be characters because of the
architecture of UTF-16, which reserves them for use as surrogate code
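
The contrast between the two kinds of exclusion can be sketched in Python. The helper names are hypothetical; the noncharacter ranges are the ones listed in the paragraph above (U+FDD0 through U+FDEF, plus the last two code points of each of the 17 planes).

```python
def is_surrogate(cp: int) -> bool:
    """Excluded by architecture: reserved by UTF-16."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """Excluded by administrative rule only."""
    in_fdd0_range = 0xFDD0 <= cp <= 0xFDEF
    # U+xxFFFE and U+xxFFFF: low 16 bits are FFFE or FFFF.
    last_two_of_plane = (cp & 0xFFFE) == 0xFFFE
    return 0 <= cp <= 0x10FFFF and (in_fdd0_range or last_two_of_plane)


noncharacters = [cp for cp in range(0x110000) if is_noncharacter(cp)]

# The two sets are disjoint: no noncharacter is a surrogate, so every
# noncharacter is still a scalar value that any UTF can encode.
overlap = [cp for cp in noncharacters if is_surrogate(cp)]
```

Here `noncharacters` holds 32 + 17 * 2 = 66 values and `overlap` is empty, which is the point: noncharacters remain encodable scalar values, while surrogates do not.
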
I'm not so sure that "code point" needs to be redefined to match "USV."
I suggest that the subtle distinction between the two concepts be
retained. Just edit the sentence in D28 that equates "USV" with "code
point," to explain the difference between the two.

As usual, I believe the culprit in all of this is that troublesome
paragraph after D29, which states that all UTFs must be able to
round-trip unpaired surrogates. This wasn't true when it was written,
as UTF-16 can't do this (it converts the two unpaired surrogate code
points U+D800 U+DC00 to a single code point U+10000, which is not
round-tripping), and now with the tightened definition of UTF-8
introduced in Unicode 3.2, it can't be done in UTF-8 either. I've been
tempted to ignore that paragraph, on the technicality that it appears
outside the numbered definitions and hence isn't normative (the
indenting style bolsters this argument), but I'd rather see the UTC just
remove the unpaired-surrogate reference based on a real-life examination
of the way UTFs actually work.
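
The round-trip failure can be reproduced with Python 3's built-in codecs. This is only a sketch of how one current implementation behaves, not a statement about what the Standard requires; the `surrogatepass` error handler is used solely to get the surrogate code units written out in the first place.

```python
# Two surrogate code points, held as two separate code points in a str.
pair = "\ud800\udc00"

# UTF-16: write each surrogate as a 16-bit code unit, then decode.
utf16_bytes = pair.encode("utf-16-le", "surrogatepass")
decoded = utf16_bytes.decode("utf-16-le")

# The decoder sees a valid surrogate pair and returns one code point,
# so the original two-code-point sequence is not recovered.
round_trips_in_utf16 = decoded == pair


def utf8_can_encode(s: str) -> bool:
    """True if the strict UTF-8 encoder accepts the string."""
    try:
        s.encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def utf8_can_decode(b: bytes) -> bool:
    """True if the strict UTF-8 decoder accepts the bytes."""
    try:
        b.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Strict UTF-8 rejects a lone surrogate in both directions.
# ED A0 80 is the three-byte form that U+D800 would take.
lone_surrogate_encodes = utf8_can_encode("\ud800")
surrogate_bytes_decode = utf8_can_decode(b"\xed\xa0\x80")
```

In this sketch `decoded` is the single code point U+10000, and both UTF-8 checks come back false.
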
This archive was generated by hypermail 2.1.2 : Wed May 08 2002 - 12:44:16 EDT