Re: Charsets + encoding + codesets

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Tue Oct 07 1997 - 15:27:52 EDT


Kenneth Whistler writes:

> Yves asked some follow-up questions:
>
> > with this in mind I can't help but have still questions:
> >
> > -- If UNICODE is an "encoded character set" what is the name of the
> > "character set" it implements? (UNICODE as well?). In other words, how
> > should I call the character repertoire that UNICODE and 10646 encode?
>
> The character repertoire for Unicode and ISO/IEC 10646 is the
> "Universal Character Set", or UCS for short. Unlike most other
> encoded character sets, which in principle, at least, start with a limited
> set of characters, the stated goal of Unicode/10646 is to serve as a
> universal encoding of all characters required for information technology.

I would rather say that the character set of 10646 is the repertoire
of 10646, that is, the characters at the code points of 10646. This
is a finite repertoire, although it may differ with each
amendment. But you can always count the characters in there.

For Unicode it is a different story. Unicode can represent an
undefined number of "abstract characters" which is the Unicode
equivalent of the ISO term "character". (I even use that
term to clarify the difference from a "coded character").
Unicode's repertoire is thus infinite.
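
The combining-sequence argument can be illustrated with a minimal Python sketch. It assumes Python's unicodedata module, which reflects a much later Unicode version than the one under discussion here, and the helper name nfc_length is made up for the illustration. It shows that some base-plus-mark sequences collapse to a single coded character under canonical composition, while others exist only as sequences and can be extended with further marks.

```python
import unicodedata


def nfc_length(text: str) -> int:
    """Number of coded characters left after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))


# 'e' + COMBINING ACUTE ACCENT has a precomposed coded character (U+00E9).
e_acute = "e\u0301"
# 'q' + COMBINING TILDE has none, so it exists only as a combining sequence.
q_tilde = "q\u0303"
# Further combining marks can be stacked on the same base letter.
stacked = "q\u0303\u0301\u0308"

for label, text in [("e + acute", e_acute),
                    ("q + tilde", q_tilde),
                    ("q + 3 marks", stacked)]:
    print(label, "code points:", len(text), "after NFC:", nfc_length(text))
```

Nothing in the sketch bounds how many marks may follow a base letter, which is the sense in which the set of representable "abstract characters" is open-ended while the set of coded characters stays countable.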

> The content of the UCS (the enumeration of the members of the set, to
> use Keld's more mathematical approach) is a moving target. Each
> amendment to 10646 which encodes more characters also expands the
> repertoire officially covered by the standard. Each publication of
> a new version of the Unicode Standard likewise expands the repertoire,
> since many new characters are added to the overall encoded set.

True.
> >
> "Encoding" actually has several meanings.
>
> 3. An entire encoded character set. This is a synonym for "encoded character
> set", or "code page".
>
> "The encoding we used for that data was cp437."

I think in that case "encoding" actually means a more complex structure
than "coded character set", viz. the encoding of UTF-8 or EUC.
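
As a rough illustration of why an encoding such as EUC is more than a coded character set, the following Python sketch encodes one JIS X 0208 character under three different schemes. The choice of character and the use of Python's euc_jp, shift_jis and iso2022_jp codecs are assumptions made for the example; they are not taken from this thread.

```python
ch = "\u6f22"  # the kanji 漢, a character of JIS X 0208

euc = ch.encode("euc_jp")
sjis = ch.encode("shift_jis")
iso2022 = ch.encode("iso2022_jp")

# ISO-2022-JP wraps the two 7-bit JIS X 0208 bytes in escape sequences:
# ESC $ B switches to JIS X 0208, ESC ( B switches back to ASCII.
jis_bytes = iso2022[len(b"\x1b$B"):-len(b"\x1b(B")]

print("EUC-JP     :", euc.hex())
print("Shift_JIS  :", sjis.hex())
print("ISO-2022-JP:", iso2022)

# EUC-JP carries the same two bytes with the high bit set on each.
print(bytes(b | 0x80 for b in jis_bytes) == euc)
```

The coded character set assigns the character one position; the three schemes then serialize that position in three different ways.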

> 4. The mathematical relation (a unique and symmetric mapping function) between a
> character repertoire and coded representations. This is synonymous with the
> term "coded character set" as defined in 10646: "A set of unambiguous
> rules that establishes a character set and the relationship between
> the characters of the set and their coded representations." [The important
> thing here is that each character is associated with a number, and each
> numerical value is unambiguously related to a character.]

I think there are some subtle differences here. I believe that
the coded character set does imply a binary representation.
All coded character sets that I know of have a binary representation.

Also, the numbering is not normally done, and even if you say
there is an implied numbering, a number of coded character sets
have smaller or bigger holes in this numbering.
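
The holes can be made visible with a short Python sketch. It assumes that Python's codec tables are a fair stand-in for the code pages themselves; cp1252 is picked here only because its table is known to have gaps, and the function name unassigned_positions is hypothetical.

```python
def unassigned_positions(codec: str) -> list[int]:
    """Return the byte values (0-255) that the codec maps to no character."""
    holes = []
    for value in range(256):
        try:
            bytes([value]).decode(codec)
        except UnicodeDecodeError:
            # The codec table has no character at this position.
            holes.append(value)
    return holes


for codec in ("latin_1", "cp437", "cp1252"):
    holes = unassigned_positions(codec)
    print(codec, "unassigned:", [hex(v) for v in holes])
```

A table with gaps still relates each assigned position to exactly one character, but it is not a contiguous numbering of the repertoire.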

> "The Unicode Standard uses a 16-bit encoding for characters."
>
> 5. The mathematical relation (non-unique and asymmetric) between bit values
> used in character data representation for information interchange and
> the characters that data represents. This is synonymous with the term
> "character encoding scheme" as used by the Internet Architecture Board.
> It also seems to be what Keld is defining above. [The important
> distinction is that for some encoding schemes, such as ISO 2022, the
> relation between any particular sequence of bits and characters may
> be non-unique in both directions. The "encoding" in this sense defines
> how to get from the bits to the characters, but not necessarily the
> reverse.]
>
> "Each different encoding requires registration of a different
> MIME charset."

I do distinguish between "encoding" and "encoding scheme", see my
paper.
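
The non-uniqueness that Ken describes for ISO 2022 can be sketched in Python. The example assumes Python's iso2022_jp codec as a representative ISO 2022 based scheme; the byte strings are constructed for the illustration.

```python
CODEC = "iso2022_jp"

# Direction 1: different byte sequences, same character.
plain = b"A"
redundant = b"\x1b(BA"  # ESC ( B designates ASCII, which is already active
print(plain.decode(CODEC), redundant.decode(CODEC))

# Direction 2: same bytes, different characters, depending on shift state.
in_ascii = b"4A"
in_jis = b"\x1b$B4A\x1b(B"  # the same two bytes after switching to JIS X 0208
print(in_ascii.decode(CODEC), in_jis.decode(CODEC))

# The encoder picks one canonical form, so decode-then-encode need not
# reproduce the original bytes.
print(redundant.decode(CODEC).encode(CODEC))
```

The decoder is a function from bytes to characters, but it has no single inverse, which is the asymmetry in Ken's sense 5.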
>
> Maybe the easiest way to clarify this is to quote from some mail
> I sent out privately a few months ago regarding a better way to specify
> a character set registry.
>
> <start quote>
>
> A form I would like to see a consistent registry expressed in would
> include the following information:
>
> Standard(s)/PAS         Repertoire       CCS     CES     Short Tag  Etc.
>
> Where the first 5 fields are *all* obligatory for an encoded character
> set entry to be complete. e.g.:
>
> ISO 8859-1:1987         Latin-1          8859-1  8bit-I  iso8859-1
> ISO 10646-1:1993        UCS              10646   UCS-4   ucs4
> ISO 10646-1:1993, USV2  UCS              10646   UTF-8   utf8
> ISO 10646-1:1993, USV2  UCS              10646   UTF-16  utf16
> ISO 10646-1:1993        BMP              10646   UCS-2   ucs2
> CDRA (CCSID 00437)      CS01212          CP437   8bit    cp437
> CDRA (CCSID 00850)      CS01106          CP850   8bit    cp850
> CDRA (CCSID 00037)      CS00697          CP037   8bit-E  cp037
> CDRA (CCSID 00938)      CS00103+         CP904+  DBCS-M  cp938
>                         CS00935          CP927
> CDRA (CCSID 00937)      CS01175+         CP037+  SISO    cp937
>                         CS00935          CP835
> Mac OS Cyrillic         Mac Cyrillic     MacCyr  8bit    mac-cyr
> Microsoft tables        JIS X-0201+      CP932   DBCS-M  cp932ms
>                         JIS X-0208+
>                         IBM extensions+
>                         MS extensions
>
> and so on. The etc. columns would contain all the other useful
> information about the coded character set (its usage, and all the
> various crossmappings to vendor and ISO and InterNet id's, etc.)
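
The proposed registry layout can be sketched as a small Python data structure. The field names follow the five obligatory columns above; the class name RegistryEntry, the helper functions, and the choice of which rows to include are assumptions made for this illustration and are not part of any actual registry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RegistryEntry:
    standard: str    # Standard(s)/PAS
    repertoire: str  # character repertoire
    ccs: str         # coded character set
    ces: str         # character encoding scheme
    tag: str         # short tag


# A few rows copied from the table quoted above.
REGISTRY = [
    RegistryEntry("ISO 8859-1:1987", "Latin-1", "8859-1", "8bit-I", "iso8859-1"),
    RegistryEntry("ISO 10646-1:1993", "UCS", "10646", "UCS-4", "ucs4"),
    RegistryEntry("ISO 10646-1:1993, USV2", "UCS", "10646", "UTF-8", "utf8"),
    RegistryEntry("CDRA (CCSID 00437)", "CS01212", "CP437", "8bit", "cp437"),
]


def by_tag(tag: str) -> RegistryEntry:
    """Look up the single entry registered under a short tag."""
    return next(entry for entry in REGISTRY if entry.tag == tag)


def schemes_for_ccs(ccs: str) -> list[str]:
    """List every encoding scheme registered for one coded character set."""
    return [entry.ces for entry in REGISTRY if entry.ccs == ccs]


print(by_tag("utf8"))
print(schemes_for_ccs("10646"))
```

Keeping the CCS and the CES in separate fields is what lets one coded character set appear under several short tags.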

I think there are a number of problems in this, as noted
above. Anyway, how would this work for Unicode? Unicode has an
infinite repertoire, so you cannot number the characters.

keld


