Character repertoire and Graphic character repertoire
(Was: Re: Repertoire, encoding, and representation;
formerly Was: Charsets + encoding + codesets)
Keld had written via unicode@unicode.org:
> > You can both have a 10646 encoding and an 10646 repertoire...
> >
> > The trouble is that the "repertoire" of Unicode and 10646 is different.
Ken Whistler wrote, in response to Keld Simonsen, in message
<9710070110.AA03013@unicode.org> via unicode@unicode.org:
> I'll state this one more time, because Keld keeps claiming it isn't
> so:
>
> The repertoire of the Unicode Standard and of ISO/IEC 10646 are
> *exactly* the same.
The argument between Ken Whistler and Keld Simonsen surely derives
from different (and equally valid) understandings of the word
repertoire, and possibly of character and graphic character.
ISO/IEC 10646 defines repertoire thus:
Repertoire: a specified set of characters that are represented in a
coded character set (clause 4.28)
ISO/IEC 10646 defines character thus:
character: a member of a set of elements used for the organisation,
control or representation of data (clause 4.6).
To me, this seems to correspond to the term code-point, rather than
including abstract characters.
However, the definition in ISO/IEC 10646 of graphic character does
seem to include the notion of abstract characters:
graphic character: A character, other than a control function, that
has a visual representation normally handwritten, printed or
displayed.
* * * * * * * *
Perhaps to avoid such problems in the future, we ought to use the
following terms:
- character repertoire (to specify what Keld means) and
- graphic character repertoire (to specify what Ken means).
To expand:
Ken Whistler wrote, in response to Keld Simonsen, in message
<9710070110.AA03013@unicode.org> via unicode@unicode.org:
Keld had written:
> > You can both have a 10646 encoding and an 10646 repertoire...
> >
> > The trouble is that the "repertoire" of Unicode and 10646 is different.
> > 10646 is clear on what is the repertoire: it is the characters of all
> > its code points. Unicode is clear on "abstract characters" that
> > you can make abstract characters by combining a number of characters
> > such as a base letter and then one or more combining accents.
> > But the combinations are not defined or limited, so for Unicode
> > you have an unlimited repertoire of Unicode abstract characters.
> >
Ken Whistler replied:
> I'll state this one more time, because Keld keeps claiming it isn't
> so:
>
> The repertoire of the Unicode Standard and of ISO/IEC 10646 are
> *exactly* the same.
Ken - surely not: rather
The _code_points_ of the Unicode Standard and of ISO/IEC 10646 are
*exactly* the same.
Ken resumes:
> WG2 and the Unicode Technical Committee go to great lengths to ensure
> that this is and remains the case. Additions to the repertoire of 10646
> are matched by additions to the repertoire of the Unicode Standard,
Again, surely it is the case that additions to the _code_points_ of 10646
are matched by additions to the _code_points_ of the Unicode Standard,
> and the two standards groups work together to synchronize the various
> steps of balloting and publication, so that publication of the Unicode
> Standard can be directly correlated with a known sequence of approved
> and published amendments to 10646.
And this works extremely well, so much so that for most users of
ISO/IEC 10646 and Unicode, this argument will not be of great
significance.
* * * * * * * *
My main point, broadly supporting Keld Simonsen's point, is this:
ISO/IEC 10646's repertoire includes the sum of its code points - both
standalone and combining characters. ISO/IEC 10646 does not specify
anything about the _repertoire_ of combined characters that may
result from using combining characters alongside other characters.
Abstract characters (what might be called combined characters, or the
results of using combining characters) are beyond its scope, and
beyond the definition of character in ISO/IEC 10646.
Unicode's repertoire includes all the above _and_ the abstract
characters (or graphic characters) available through using Unicode.
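
The distinction drawn above can be made concrete with a small sketch.
It assumes Python and its standard unicodedata module, neither of which
appears in this discussion, and it uses NFC normalization (defined by
Unicode later than the versions discussed here) only as a convenient
test for whether a single code point exists for a given combination.
The sample letters are illustrative choices.

```python
import unicodedata

COMBINING_TILDE = "\u0303"

# A base letter followed by a combining accent: two coded characters.
x_tilde = "x" + COMBINING_TILDE
n_tilde = "n" + COMBINING_TILDE

# Each element of the sequence is an individually coded character.
names = [unicodedata.name(c) for c in x_tilde]

# A non-zero combining class marks the accent as a combining character.
is_combining = unicodedata.combining(COMBINING_TILDE) != 0

# NFC composes a sequence only where a precomposed code point exists.
composed_x = unicodedata.normalize("NFC", x_tilde)  # remains a sequence
composed_n = unicodedata.normalize("NFC", n_tilde)  # one code point, U+00F1

print(names)
print(is_combining)
print(len(composed_x), len(composed_n))
```

In the first case the combination exists only as a sequence of two
coded characters, which is the kind of combination that a count of
code points alone leaves out.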
The basic fact is that if the definition of repertoire comprised a
set of _graphic_characters_, Ken Whistler would be absolutely
correct.
In terms of ISO/IEC 10646, however, Keld is completely right, as
ISO/IEC 10646 defines repertoire as a specified set of _characters_
that are represented in a coded character set (clause 4.28).
I do not have a copy of Unicode 2.0 to hand: are there any
differences of emphasis in its definitions?
Perhaps to avoid such problems in the future, we ought to use the
following terms:
- character repertoire (to specify what Keld means) and
- graphic character repertoire (to specify what Ken means).
Any other comments are welcome.
Best wishes
John Clews
-- John Clews (Chair of ISO/TC46/SC2: Conversion of Written Languages)
SESAME Computer Projects, 8 Avenue Road, Harrogate, HG2 7PG, England
Email: 10646er@sesame.demon.co.uk; tel: +44 (0) 1423 888 432