Surrogates and noncharacters

Philippe Verdy verdy_p at
Sun May 10 14:19:52 CDT 2015

The way I read D77 (code unit), it is not bound to any Unicode encoding form:
"The minimal bit combination that can represent a unit of encoded text for
processing or interchange" can be of any bit length and can even use a
non-binary representation (not bit-based; it could be ternary, or floating
point, or base ten, with the remaining bit patterns possibly used for other
functions such as clock synchronization/calibration or polarization balancing,
leaving only some patterns distinguishable, but not necessarily an exact power of
I don't see why a 32-bit code unit or an 8-bit code unit has to be bound to
UTF-32 or UTF-8 in D77; the code unit is just a code unit. It does not have
to be assigned any Unicode scalar value or to occur in a specific pattern
valid for UTF-32 or UTF-8 (in addition, these two UTFs are not the only
ones supported; look at SCSU for example, or GB18030, which are also
conforming UTFs):
The code unit is just one element within an enumerable and finite set of
elements that is transmissible to some interface and interchangeable.

It's up to each UTF to define how it can use them: these UTFs are usable
on these sets provided that the sets are large enough to contain at least
the number of code units required for this UTF to be supported (which means
that the actual bit count of the transported code units does not matter;
this is out of scope of TUS, which just requires sets with sufficient

For these reasons I absolutely do not see why you argue that 0xFFFFFFFF
cannot be a valid 32-bit code unit, and then why <0xFFFFFFFF> can't be a
valid 32-bit string (or "Unicode 32-bit string", as TUS renames it in
D80-D83 in a way that is really unproductive, and in fact confusing).
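
To make the two categories in this dispute concrete, here is a minimal Python
sketch. The function names are hypothetical and do not come from TUS; the only
facts relied on are that Unicode scalar values run from 0 to 0x10FFFF and
exclude the surrogate range 0xD800..0xDFFF. The code does not settle which
category deserves the name "Unicode 32-bit string"; it only shows that
<0xFFFFFFFF> passes the first test and fails the second.

```python
def is_32bit_string(units):
    """True if every element fits in an unsigned 32-bit code unit."""
    return all(isinstance(u, int) and 0 <= u <= 0xFFFFFFFF for u in units)


def is_scalar_value(u):
    """True if u is a Unicode scalar value (a code point that is not a surrogate)."""
    return 0 <= u <= 0x10FFFF and not (0xD800 <= u <= 0xDFFF)


def is_well_formed_utf32(units):
    """True if every 32-bit code unit is a Unicode scalar value."""
    return is_32bit_string(units) and all(is_scalar_value(u) for u in units)
```

Note that a noncharacter such as 0xFFFF is still a scalar value, so it passes
both checks; only surrogates and values above 0x10FFFF fail the second one.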

Likewise, nothing prohibits supporting the UTF-32 encoding form over a 21-bit
stream, using another "encoding scheme" (which cannot also be named "UTF-32",
"UTF-32BE" or "UTF-32LE", but could be named "UTF-32-21"): the result will be a
21-bit string, and the 21-bit code unit 0x1FFFFF will still be valid.
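
As an illustration only, a 21-bit stream of this kind could be sketched in
Python as below. "UTF-32-21" is the hypothetical name used above, not a defined
encoding scheme; the function names, the most-significant-bit-first order and
the zero padding up to a byte boundary are all assumptions made for the
example. The packer accepts any 21-bit value, including 0x1FFFFF, and leaves
the question of scalar-value validity to a separate layer.

```python
def pack_21bit(units):
    """Pack 21-bit code units into bytes, most significant bit first."""
    acc = 0
    for u in units:
        if not 0 <= u <= 0x1FFFFF:
            raise ValueError("code unit does not fit in 21 bits")
        acc = (acc << 21) | u
    nbits = 21 * len(units)
    pad = -nbits % 8  # zero bits appended to reach a byte boundary (assumption)
    return (acc << pad).to_bytes((nbits + pad) // 8, "big")


def unpack_21bit(data, count):
    """Recover `count` 21-bit code units from bytes produced by pack_21bit."""
    nbits = 21 * count
    pad = -nbits % 8
    acc = int.from_bytes(data, "big") >> pad
    return [(acc >> (21 * (count - 1 - i))) & 0x1FFFFF for i in range(count)]
```

The receiver has to know the number of code units separately, because the
padding bits cannot be told apart from data once a whole unit's worth of them
accumulates.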

2015-05-10 12:23 GMT+02:00 Richard Wordingham <
richard.wordingham at>:

> On Sun, 10 May 2015 07:42:14 +0200
> Philippe Verdy <verdy_p at> wrote:
> I am replying out of order for greater coherence of my reply.
> > However I wonder what would be the effect of D80 in UTF-32: is
> > <0xFFFFFFFF> a valid "32-bit string" ? After all it is also
> > containing a single 32-bit code unit (for at least one Unicode
> > encoding form), even if it has no "scalar value" and then does not
> > have to validate D89 (for UTF-32)...
> The value 0xFFFFFFFF cannot appear in a UTF-32 string.  Therefore it
> cannot represent a unit of encoded text in a UTF-32 string.  By D77
> paragraph 1, "Code unit:  The minimal bit combination that can
> represent a unit of encoded text for processing or interchange", it is
> therefore not a code unit.  The effect of D77, D80 and D83 is that
> <0xFFFFFFFF> is a 32-bit string but not a Unicode 32-bit string.
> > - D80 defines "Unicode string" but in fact it just defines a generic
> > "string" as an arbitrary stream of fixed-size code units.
> No - see argument above.
> > These two rules [D80 and D82 - RW] are not productive at all, except
> > for saying that all values of fixed size code units are acceptable
> > (including for example 0xFF in 8-bit strings, which is invalid in
> > UTF-8)
> Do you still maintain this reading of D77?  D77 is not as clear as it
> should be.
> > <snip> D80 and D82 have no purpose, except adding the term "Unicode"
> > redundantly to these expressions.
> I have the cynical suspicion that these definitions were added to
> preserve the interface definitions of routines processing UCS-2
> strings when the transition to UTF-16 occurred.  They can also have the
> (intentional?) side-effect of making more work for UTF-8 and UTF-32
> processing, because arbitrary 8-bit strings and 32-bit strings are not
> Unicode strings.
> Richard.