From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Sep 12 2007 - 15:55:48 CDT
Philippe Verdy wrote:
> Kenneth Whistler wrote:
> > Note, however, as regards names in particular, that some
> > Unicode characters (e.g., noncharacters, private-use characters) don't
> > have character names, ...)
>
> I won't discuss the case of CJK and Hangul ranges, because they do have
> complete properties including standard names.
> But I still don't understand why the assigned controls and PUAs don't have
> at least one default character name, at least computed algorithmically (like
> Hangul and CJK ideographs).
>
> For the stability of applications using these characters, it seems that
> these controls and PUAs should still have a standard name (maybe this name
> is "U+xxx"...)
That is the "short identifier" (ISO/IEC 10646, Clause 6.5), not
the "standard name".
And short identifiers don't follow the name syntax restrictions,
because they allow one character, "+", that is not allowed in character
names.
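To make the distinction concrete, here is a minimal C sketch. It assumes the
common "U+XXXX" form of the short identifier and the basic repertoire allowed
in character names (A-Z, 0-9, SPACE, HYPHEN-MINUS). The function names
format_short_identifier and has_only_name_characters are hypothetical, and the
check is deliberately simplified: it ignores the additional rules about where
spaces and hyphens may appear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: write the common "U+XXXX" short identifier form
   (at least four uppercase hex digits) into buf and return buf. */
static const char *format_short_identifier(unsigned long cp, char *buf, size_t size)
{
    assert(buf != NULL && size >= 9);   /* room for "U+10FFFF" plus terminator */
    snprintf(buf, size, "U+%04lX", cp);
    return buf;
}

/* Hypothetical helper: return 1 if s is non-empty and every character is one
   that a character name may contain (A-Z, 0-9, SPACE, HYPHEN-MINUS), else 0.
   Simplified: placement rules for spaces and hyphens are not checked. */
static int has_only_name_characters(const char *s)
{
    assert(s != NULL);
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; ++s) {
        char c = *s;
        int ok = (c >= 'A' && c <= 'Z') || (c >= '0' && c <= '9') ||
                 c == ' ' || c == '-';
        if (!ok)
            return 0;
    }
    return 1;
}
```

Under these assumptions, a string such as "LATIN CAPITAL LETTER A" passes the
check, while the short identifier for U+E000 fails it because of the "+".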
> to avoid any possible future conflicts with other characters
> that will get their own standard names,
How can there be a future conflict between a character that
has no name (noncharacter, private-use character) and
a character that gets a name in the future?
> if the application needs to define a
> name property for these characters instead of returning a non-unique empty
> name or raising an exception (as if the characters were unassigned).
Bad programming assumptions lead to bad program behavior. The
fix for this is the test:
    if (name == NULL)
    {
        // do something interesting, instead of terminating with access fault
    }
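A slightly fuller sketch of the same test, in C. The lookup function and its
one-entry table are hypothetical stand-ins for a real name database. The only
facts relied on are the private-use and noncharacter ranges, for which the
lookup returns NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Private-use code points: U+E000..U+F8FF, plus planes 15 and 16
   except their last two code points. */
static int is_private_use(unsigned long cp)
{
    return (cp >= 0xE000UL && cp <= 0xF8FFUL) ||
           (cp >= 0xF0000UL && cp <= 0xFFFFDUL) ||
           (cp >= 0x100000UL && cp <= 0x10FFFDUL);
}

/* Noncharacters: U+FDD0..U+FDEF and the last two code points of each plane. */
static int is_noncharacter(unsigned long cp)
{
    return (cp >= 0xFDD0UL && cp <= 0xFDEFUL) ||
           (cp <= 0x10FFFFUL && (cp & 0xFFFEUL) == 0xFFFEUL);
}

/* Hypothetical toy lookup: returns NULL when the code point has no name. */
static const char *lookup_name(unsigned long cp)
{
    assert(cp <= 0x10FFFFUL);
    if (is_private_use(cp) || is_noncharacter(cp))
        return NULL;
    if (cp == 0x0041UL)
        return "LATIN CAPITAL LETTER A";
    return NULL;    /* the toy table knows nothing else */
}

/* Caller-side handling: test for NULL before using the name. */
static const char *name_or_placeholder(unsigned long cp)
{
    const char *name = lookup_name(cp);
    if (name == NULL)
        return "(no character name)";   /* placeholder text is an arbitrary choice */
    return name;
}
```

The point is only that the caller decides what to do with a missing name; the
placeholder string here is one arbitrary choice among many.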
> The most obvious missing names that we frequently encounter in texts encoded
> with valid UTF are with controls.
And that is a problem because... ?
> Why Unicode still does not endorse the existing ISO 646 and ISO 8859 names
> for these C0 and C1 controls?
Have you read ISO 646 or ISO 8859-1 (or any other part) recently?
They do not contain any character names for C0 or C1 controls. They
define characters (with names) for the G0 and G1 sets, 0x20..0x7E
and 0xA0..0xFF (in the case of ISO 646 just G0). ISO 8859-1 depends
on ISO 2022 and ISO 4873 (normatively) for its use of control
codes, and the control functions are defined elsewhere by other
standards.
In short, there is no such thing as "ISO 8859 names for ... C0 and
C1 controls."
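The byte ranges involved can be summarized in a small C classifier. This is
only a sketch: the enum and function names are hypothetical, the C0 and C1
ranges are the conventional 0x00..0x1F and 0x80..0x9F, and treating 0x7F
(DELETE) as its own category is an assumption of the sketch, since the graphic
range given above stops at 0x7E.

```c
#include <assert.h>

/* Hypothetical labels for the areas of an 8-bit code table. */
enum code_area { AREA_C0, AREA_G0, AREA_DEL, AREA_C1, AREA_G1 };

/* Classify a byte value by the area of the code table it falls in. */
static enum code_area classify_byte(unsigned int b)
{
    assert(b <= 0xFFu);
    if (b <= 0x1Fu) return AREA_C0;   /* control codes, not named by ISO 8859-1 */
    if (b <= 0x7Eu) return AREA_G0;   /* graphic characters, 0x20..0x7E */
    if (b == 0x7Fu) return AREA_DEL;  /* DELETE, kept separate in this sketch */
    if (b <= 0x9Fu) return AREA_C1;   /* control codes, not named by ISO 8859-1 */
    return AREA_G1;                   /* graphic characters, 0xA0..0xFF */
}

/* Returns 1 if the byte lies in a graphic area (G0 or G1), else 0. */
static int is_graphic_area(unsigned int b)
{
    enum code_area a = classify_byte(b);
    return a == AREA_G0 || a == AREA_G1;
}
```

With this, 0x1B (the escape control) classifies as C0 and so falls outside the
areas for which ISO 8859-1 supplies character names.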
> Why would it be a problem to assign such name
Well, one problem might be that they don't exist.
But I'll cut you some slack. Presumably you have in mind
ISO 6429:1992 names. But even ISO 6429 doesn't have names
for *all* C1 controls. And ISO 6429 simply specifies one
widely-used definition of C0 and C1 controls -- it isn't
their exclusive definition.
> (a name is just a name, not a description of its semantic or intended use in
> applications).
In which case, why go down the road of specifying a name,
when not all applications in fact use the same control
function definitions for C0 and C1 controls? Where does
that lead except into trouble and confusion?
> So:
> * instead of having just "<control>" for U+001B, why not having "<control>
> ESC" for the ASCII escape character
"<control>" is a metalabel used in the generation of
code charts, just like "<reserved>" and "<not a character>"
are. None of those are character names; they violate
both the uniqueness requirement for character names and
the syntax for character names -- intentionally.
> * instead of having just "<private use>" for U+E000, why not having
> "<private use> E000" computed algorithmically for the standard name?
1. Because it isn't necessary.
2. Because it violates character name syntax.
> As an alternative, you could say that some applications could generate the
> comment field or use it algorithmically, so that the strict compatibility
> will be preserved for the existing name field. This would give the extended
> names (respectively for the examples above):
> * "<control> #ESC"
> * "<private use> #E000"
And "#" isn't allowed in character names, either.
> I don't see which other standard it will break.
Well, the Unicode Standard and ISO/IEC 10646 for starters.
See the character name specifications for both.
--Ken