Re: Repertoire, encoding, and representation (Was: Charsets + encoding + codesets)

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Tue Oct 07 1997 - 16:46:47 EDT


Kenneth Whistler writes:

> Keld responded to Yve:
>
> >
> > You can have both a 10646 encoding and a 10646 repertoire.
> > The canonical encoding of 10646 is UCS-4. That means if you are not
> > more specific than saying "10646 coded character set" then you mean
> > UCS-4.
>
> I can't let that one go by.
>
> The "encoding" (sense 4 of my last note) is the specification of all
> the numbers associated with the characters in the repertoire. It is
> neither UCS-4 nor UCS-2.

Depends on your terminology. This process is seldom done explicitly,
though you may of course do it virtually when you define the coded
character set. I was not talking about the 10646 encoding, but the
10646 coded character set.

> UCS-2 and UCS-4 are defined in Clause 14 of 10646 as "coded representation
> forms of the UCS". UCS-2 is called the "Two-octet BMP form", and
> UCS-4 is called the "Four-octet canonical form." What is "canonical" about
> the UCS-4 form is that it enables the representation of any character
> encoded in 10646, whether or not it is encoded on the Basic Multilingual
> Plane (BMP), whereas UCS-2 only enables the representation of characters
> encoded on the BMP.
>
> However, canonical form does *not* mean default form, as implied by
> Keld's statement above. 10646 does not define any concept of default
> form of use. Instead, 10646 defines the alternatives, and then states
> that the mechanism for identifying it is outside the scope of the
> standard:
>
> "The identification of ISO/IEC 10646 (including the form), the
> implementation level, and any subset of the coding space that
> have been adopted by the originator must also be available to
> the recipient. The route by which such identification is
> communicated is outside the scope of ISO/IEC 10646." -- 17.1
>
> 10646 then goes on to say that *if* you are using ISO/IEC 2022 escape
> sequences, one of a specified list of escape sequences can be used
> to identify the form and implementation level, and other escape
> sequences can be used to identify designated subsets of the repertoire.
>
> The same applies to the specification of one of the two "transformation
> formats", UTF-8 or UTF-16 (both of which are "encoding schemes" in the
> sense identified earlier), which can also be identified by ISO/IEC 2022
> escape sequences. Either or both can, however, be designated by other
> means not involving 2022.

This text is for when you have to specify the data that you are
exchanging. I was not addressing that. So your text is out of
context.

I was using my wording in a context where I was using normal ISO
terminology, such as "coded character set", and in that sense
I said that UCS-4 must be understood if you are talking about
the 10646 coded character set. Only UCS-2 and UCS-4 serve as
coded character sets in ISO/IEC 10646, and UCS-2 only covers a subset of
that standard, so that leaves UCS-4 if you do not specify anything
else, in common talk like what we do here on this list.
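
To illustrate the coverage argument, here is a minimal sketch that models
the two forms as fixed-width big-endian integers. The function names are
hypothetical and are not taken from 10646; the only assumptions are that
UCS-2 is limited to the BMP (0000..FFFF) and that UCS-4 spans the 31-bit
code space.

```python
import struct

def ucs4_octets(code_point: int) -> bytes:
    """Four-octet form: any value in the 31-bit code space."""
    if not 0 <= code_point <= 0x7FFFFFFF:
        raise ValueError("outside the 31-bit code space")
    return struct.pack(">I", code_point)

def ucs2_octets(code_point: int) -> bytes:
    """Two-octet form: only values on the BMP can be represented."""
    if not 0 <= code_point <= 0xFFFF:
        raise ValueError("outside the BMP; not representable in UCS-2")
    return struct.pack(">H", code_point)

# LATIN SMALL LETTER A WITH ACUTE fits in both forms.
print(ucs4_octets(0x00E1).hex())  # 000000e1
print(ucs2_octets(0x00E1).hex())  # 00e1
```

A value such as 0x10000 is accepted by ucs4_octets but makes ucs2_octets
raise ValueError, which is the subset relation described above.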

C'mon Ken, I know that Unicode is in love with 16 bits, but
you must admit that 10646 is canonically defined as a 32 (31) bit
coded character set.

I did not say anything about what is the default encoding in 10646.

> The Unicode Standard can be considered a profile of 10646 that
> designates UTF-16 as the preferred encoding scheme. In that sense it
> clearly *does* designate a default encoding scheme, unlike 10646.

That must be 2.0. Was it not different in 1.0?
Why did you not choose UTF-8 - everybody else seems to go that way?
>
> >
> > The trouble is that the "repertoire" of Unicode and 10646 is different.
> > 10646 is clear on what is the repertoire: it is the characters of all
> > its code points. Unicode is clear that you can make "abstract
> > characters" by combining a number of characters,
> > such as a base letter and then one or more combining accents.
> > But the combinations are not defined or limited, so for Unicode
> > you have an unlimited repertoire of Unicode abstract characters.
> >
>
> I'll state this one more time, because Keld keeps claiming it isn't
> so:
>
> The repertoire of the Unicode Standard and of ISO/IEC 10646 are
> *exactly* the same.

That is possible, but then the definitions of "repertoire"
are different for the two specifications. "I have 3 apples and
you have 3 oranges. We have the same." :-) And what about the
"surrogates"? These are genuine characters in Unicode
but not so in 10646.
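
For readers unfamiliar with the surrogate mechanism mentioned here, the
following is a rough sketch of how UTF-16 maps a code point beyond the BMP
onto a pair of 16-bit code units taken from the surrogate ranges. The
function name is hypothetical, and the sketch takes no position on whether
the surrogate values should be called characters.

```python
def utf16_code_units(code_point: int) -> list:
    """Return the UTF-16 code units for one code point (a sketch)."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not reachable through UTF-16")
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogate values do not stand for characters alone")
    if code_point <= 0xFFFF:
        return [code_point]            # BMP: one code unit
    v = code_point - 0x10000           # 20 bits remain
    high = 0xD800 + (v >> 10)          # top 10 bits
    low = 0xDC00 + (v & 0x3FF)         # bottom 10 bits
    return [high, low]

print([hex(u) for u in utf16_code_units(0x00E1)])   # ['0xe1']
print([hex(u) for u in utf16_code_units(0x10000)])  # ['0xd800', '0xdc00']
```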

> WG2 and the Unicode Technical Committee go to great lengths to ensure
> that this is and remains the case. Additions to the repertoire of 10646
> are matched by additions to the repertoire of the Unicode Standard,
> and the two standards groups work together to synchronize the various
> steps of balloting and publication, so that publication of the Unicode
> Standard can be directly correlated with a known sequence of approved
> and published amendments to 10646.
>
> So what is Keld talking about? Combining marks, of course.
>
> So once more, into the breach.
>
> The Unicode Standard talks about abstract characters. <a-acute> is an
> example of an abstract character in the Latin script. <d-dental-voiceless>
> is another example of an abstract character in the Latin script.
>
> <a-acute> is a part of the repertoire of 10646 (and Unicode). It is
> encoded at U+00E1 (= U-000000E1). The name of this encoded character is
> LATIN SMALL LETTER A WITH ACUTE.
>
> <d-dental-voiceless> is not part of the repertoire of 10646 (or of Unicode).
> It is not encoded. It has no name in 10646 (or Unicode).
>
> <a-acute> can also be represented (note, *not* encoded) by a combining
> character sequence. In particular, it can be represented by:
> U+0061 LATIN SMALL LETTER A + U+0301 COMBINING ACUTE ACCENT
>
> <d-dental-voiceless> can be represented (note, *not* encoded) by a
> combining character sequence. In particular, it can be represented by:
> U+0064 LATIN SMALL LETTER D + U+032A COMBINING BRIDGE BELOW +
> U+0325 COMBINING RING BELOW
>
> The Unicode Standard recognizes that for most purposes, the two different
> representations of <a-acute> should be treated as identical. Users neither
> know nor care what the underlying representation is, and will expect
> that any <a-acute> they see will be the same as any other <a-acute>.
> Because that is the case, the Unicode Standard defines a concept of
> canonically equivalent sequences. The two representations of <a-acute>
> are an example of a canonically equivalent sequence. The details for
> Unicode conformance include treating canonically equivalent sequences
> correctly. (Note that this is a stricter specification for conformance
> with Unicode than for conformance with 10646 itself. 10646 does not define
> canonical equivalence; nor does it specify many other aspects of the
> "semantics" of the characters it encodes.)
>
> Note that canonical equivalence does *not* mean duplicate encoding of
> characters. It means two different representations of the same abstract
> character--representations which under most circumstances should be
> *interpreted* the same.

Ken, you are doing tricks with words. Your "represented" term is
what others would call "encoding" of the abstract character.

> Note also that canonical equivalence also does not mean exact identity.
> If your software process is allocating buffer space, it better not
> treat U+00E1 the same as the sequence U+0061 + U+0301, or it will
> overrun memory.

But on the semantic level, abstract character level, I understand the
two "representations" to be equivalent by definition in Unicode.
Am I correct?
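
The distinction between "equivalent" and "identical" can be tried out with
a small sketch. It assumes Python's unicodedata module and its NFD
normalization as a stand-in for the canonical equivalence described in the
quoted text; the helper names are hypothetical.

```python
import unicodedata

def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after full canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

def ucs4_size(s: str) -> int:
    """Number of octets needed in the four-octet form."""
    return len(s.encode("utf-32-be"))

precomposed = "\u00E1"       # LATIN SMALL LETTER A WITH ACUTE
sequence = "\u0061\u0301"    # LATIN SMALL LETTER A + COMBINING ACUTE ACCENT

print(canonically_equivalent(precomposed, sequence))  # True
print(precomposed == sequence)                        # False
print(ucs4_size(precomposed), ucs4_size(sequence))    # 4 8
```

The two strings compare as equivalent once decomposed, yet they are not the
same sequence and need different amounts of buffer space.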

> Keld is, of course, correct that the repertoire of abstract characters
> is open. I just gave an example of an abstract character that could have
> meaningful use in the transcription of a language, but it has never (to
> my knowledge) been brought up before or discussed as a candidate to
> be *encoded* as a character in 10646. That is not because it has two
> accents; there are already such characters encoded in 10646, e.g.
> U+01DF LATIN SMALL LETTER A WITH DIAERESIS AND MACRON. But the nature
> of the Latin script is that it allows relatively free application of
> accent marks to letter baseforms, either as diacritics to create new
> "letters" for a particular orthography, or as accents to modify in various
> ways the sounds represented by letters.

So Unicode has an open repertoire of abstract characters, while
10646 has a finite repertoire of (abstract) characters?
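
One way to probe the question is to ask whether a combining sequence has a
single encoded character that is canonically equivalent to it. The sketch
below assumes Python's unicodedata module, whose character data is far
newer than the editions discussed here, and uses NFC composition as the
test; the helper name is hypothetical.

```python
import unicodedata

def has_precomposed_form(sequence: str) -> bool:
    """True if NFC composes the sequence into one encoded character."""
    return len(unicodedata.normalize("NFC", sequence)) == 1

a_acute = "\u0061\u0301"                 # a + COMBINING ACUTE ACCENT
a_diaeresis_macron = "\u0061\u0308\u0304"  # a + DIAERESIS + MACRON
d_dental_voiceless = "\u0064\u032A\u0325"  # d + BRIDGE BELOW + RING BELOW

print(has_precomposed_form(a_acute))             # True  (U+00E1)
print(has_precomposed_form(a_diaeresis_macron))  # True  (U+01DF)
print(has_precomposed_form(d_dental_voiceless))  # False (sequence only)
```

The third sequence is still perfectly representable; it simply has no
code point of its own in the encoded repertoire.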

Keld


