Re: Origin of the U+nnnn notation

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Nov 08 2005 - 08:04:58 CST

Next message: Hohberger, Clive: "RE: Origin of the U+nnnn notation"

Previous message: Dominikus Scherkl: "RE: Origin of the U+nnnn notation"
In reply to: Dominikus Scherkl: "RE: Origin of the U+nnnn notation"
Next in thread: Antoine Leca: "Re: Origin of the U+nnnn notation"
Reply: Antoine Leca: "Re: Origin of the U+nnnn notation"
Reply: Hans Aberg: "Re: Origin of the U+nnnn notation"

From: "Dominikus Scherkl" <lyratelle@gmx.de>
>> I have been unable to hunt down the historical origin of the
>> notation U+nnnn (where nnnn are hexadecimal digits) that we
>> use to refer to characters (and code points).
>> Presumably "U" stands for "UCS" or for "Unicode", but where
>> does the plus sign come from?
> Maybe it was thought of as an offset from the unit (character null)
> like in ETA+5 minutes (expected time of arrival passed five minutes
> ago - a euphemism for being 5 minutes late).

U-nnnn already exists (or I should say, it has existed). It referred to
16-bit code units, not really to characters, and was a fixed-width notation
(with 4 hexadecimal digits). The "U" meant "Unicode" (1.0 and before).

U+[n...n]nnnn was created to avoid the confusion with the past 16-bit only
Unicode 1.0 standard (which was not fully compatible with ISO/IEC 10646 code
points). It is a variable-width notation that refers to ISO/IEC 10646 code
points. The "U" means "UCS" or "Universal Character Set". At that time, the
UCS code point range was up to 31 bits wide.

The U-nnnn notation is abandoned now, except for references to Unicode 1.0.
If one uses it, it will refer to one or more 16-bit code units needed to
encode each code point (possibly with surrogate pairs). It does not
designate abstract characters or code points unambiguously.
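
To illustrate the difference between a code point and the 16-bit code units
that encode it, here is a minimal Python sketch of the UTF-16 surrogate-pair
arithmetic. The function name utf16_code_units is made up for this example
and is not taken from either standard.

```python
def utf16_code_units(cp):
    """Return the list of 16-bit code units that encode code point cp."""
    if not 0 <= cp <= 0x10FFFF:
        raise ValueError("not a standard code point")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("surrogate code points cannot be encoded")
    if cp < 0x10000:
        # Code points in plane 0 fit in a single code unit.
        return [cp]
    # Supplementary planes: split the 20-bit offset into two 10-bit halves.
    offset = cp - 0x10000
    high = 0xD800 + (offset >> 10)    # leading (high) surrogate
    low = 0xDC00 + (offset & 0x3FF)   # trailing (low) surrogate
    return [high, low]


# One code point may need one or two code units.
print([f"{u:04X}" for u in utf16_code_units(0x0041)])
print([f"{u:04X}" for u in utf16_code_units(0x1D11E)])
```

So a single code point such as U+1D11E corresponds to two code units, which
is why a notation based on code units cannot name code points one-to-one.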

Later, the variable-width U+[n...n]nnnn notation was restricted to allow
only code points in the first 17 planes of the joint ISO/IEC 10646-1 and
Unicode standards (so the only standard code points are between U+0000 and
U+10FFFF, some of them being permanently assigned to noncharacters).
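
As a rough sketch of that restriction, the Python below formats and parses
the U+ notation and rejects anything outside U+0000..U+10FFFF. The helper
names are hypothetical, and the choice to accept between 4 and 8 hexadecimal
digits on input is an assumption made for this example only.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # last code point of plane 16

# Assumption: accept 4 to 8 hex digits so that out-of-range values
# can be parsed and then rejected with a clear error.
_U_PLUS = re.compile(r"U\+([0-9A-Fa-f]{4,8})")


def format_code_point(cp):
    """Format an integer as U+nnnn, using at least four hex digits."""
    if not 0 <= cp <= MAX_CODE_POINT:
        raise ValueError("outside U+0000..U+10FFFF")
    return f"U+{cp:04X}"


def parse_code_point(text):
    """Parse U+nnnn notation and return the code point as an integer."""
    match = _U_PLUS.fullmatch(text)
    if match is None:
        raise ValueError("not in U+nnnn notation")
    cp = int(match.group(1), 16)
    if cp > MAX_CODE_POINT:
        raise ValueError("outside U+0000..U+10FFFF")
    return cp


print(format_code_point(0x41))
print(format_code_point(0x10FFFF))
print(hex(parse_code_point("U+1F600")))
```

Note that the notation has no sign at all in this sketch: the "+" is a fixed
part of the prefix, and negative integers are simply rejected.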

References to larger code points with U+[n...n]nnnn are discouraged, as
they no longer designate valid code points in either standard. Their
definition and use are then application-specific.

There are no negative code points in either standard (U-0001 does not
designate the 32-bit code unit that you could store in a signed wide-char
datatype; in the past standard it designated the same code point as U+0001
does now). Using "+" makes the statement about signs clear: standard code
points all have non-negative values.

So if you want a representation for negative code units, you need another
notation (for example N-0001 to represent the code unit with value -1);
such a notation is application-specific.

This archive was generated by hypermail 2.1.5 : Tue Nov 08 2005 - 08:09:13 CST