Charsets + encoding + codesets

From: Yves Savourel (YvesS@ile.com)
Date: Mon Oct 06 1997 - 06:44:35 EDT


Thanks for the various answers. Keld's paper was also very useful.
Two things now seem clear to me:

-- A "character set" has no code-points associated with its characters.
-- I should use the term "encoded character set" to name the
implementation of a character set according to a specific "encoding
scheme".

With this in mind, I still can't help but have questions:

-- If UNICODE is an "encoded character set" what is the name of the
"character set" it implements? (UNICODE as well?) In other words, what
should I call the character repertoire that UNICODE and 10646 encode?

-- In Ken's definitions the border between "encoding" and "encoded
character set" is not completely clear to me. I thought cp437 would be an
encoded character set. It also doesn't seem to correspond to Keld's
definition of "encoding" in his paper that says: "encoding: the relation
from the binary representation via coded character sets to (abstract)
characters. The encoding defines the meaning of a binary data stream. It
can consist of more than one coded character set, and an encoding scheme
can be applied to regulate how these coded character sets are encoded.
Also symbolic characters can be encoded in the encoding." If the
definition is correct and cp437 is an encoding, then what are the
encoded character set and the encoding scheme?

Maybe a little table will better illustrate my puzzlement. It seems that
we have to start with a character set, apply an encoding scheme to it, and
get an encoded character set. (Maybe I'm too simplistic?)
Therefore we have something like the following table. But, to me,
there are missing pieces:

[character set]   [encoding scheme]   [encoded character set]
                  [encoding?]
JIS-xxx           EUC                 EUC-JA
?                 ?                   UNICODE
IBM 919           cp437(?)            ?(cp437?)
?                 UTF-8               ?
?                 UCS-2(?)            ?(UCS-2?)
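
To make the columns of the table concrete, here is a minimal Python sketch that takes one abstract character and shows its Unicode code point next to the bytes produced by several of the names mentioned in this thread. The helper names (code_point, encoded_bytes) are made up for illustration, the codec labels are Python's own and are only assumed to correspond to the code pages meant here, and the split into layers is one reading of the terminology, not an authoritative definition.

```python
# Illustrative sketch: one abstract character, its code point in the
# Unicode coded character set, and its byte form under several codecs.
# Codec labels are Python's names (an assumption about what is meant above).

CHAR = "\u00e9"  # LATIN SMALL LETTER E WITH ACUTE

CODECS = ["utf-8", "utf-16-be", "cp437", "cp850", "cp1252", "mac_roman"]


def code_point(ch: str) -> str:
    """Return the Unicode code point of a single character as U+XXXX."""
    return f"U+{ord(ch):04X}"


def encoded_bytes(ch: str, codec: str) -> str:
    """Return the bytes of ch under the given codec as spaced hex pairs."""
    return " ".join(f"{b:02X}" for b in ch.encode(codec))


if __name__ == "__main__":
    print(f"code point: {code_point(CHAR)}")
    for codec in CODECS:
        print(f"{codec:10} {encoded_bytes(CHAR, codec)}")
```

The same character has one code point but a different byte sequence under each name, which is why a single label such as "cp437" appears to fill two columns of the table at once.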

But beyond that (and more importantly), now here is my real question:

I'm working with several other people from various localization and tools
vendors to set up a standard format for translation memory
exchange (TMX). We use an XML-compliant format for this. One of the
problems we have run into is naming an attribute of some of the
elements.

That attribute specifies what "encoded character set" the original text
was in (the text in TMX always being in Unicode, using ISO 646 and
character references for code-points above 127). The two terms proposed
are CODESET and CHARSET.

Note that CHARSET is used in HTML, and according to your various answers it
should not be. Note also that the IANA page where the names of the
"charsets/codesets" are listed (see
ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets) happily
names everything "character set" (including Unicode, UTF-8, UCS-2,
Shift-JIS, etc.)

The values for that attribute will be Unicode (UCS-2), UTF-8, cp850,
cp1252, Shift-JIS, EUC-JA, MacRoman, HPRoman8, etc.: basically any of (and
more than) the "codesets/charsets" listed in the IANA page.
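
Whatever the attribute ends up being called, a tool reading it has to map the value to something it can decode with. A minimal sketch, assuming Python's codec registry as a stand-in for the IANA list (the two are not the same, and not every label above is guaranteed to resolve); resolve_codeset is a hypothetical helper name:

```python
import codecs
from typing import Optional


def resolve_codeset(label: str) -> Optional[str]:
    """Return Python's canonical codec name for a label, or None if the
    label is unknown to Python's registry (not the IANA registry)."""
    try:
        return codecs.lookup(label).name
    except LookupError:
        return None


if __name__ == "__main__":
    for label in ["UTF-8", "cp850", "cp1252", "Shift-JIS", "no-such-codeset"]:
        print(f"{label:16} -> {resolve_codeset(label)}")
```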

What attribute name should we use?
CHARSET looks incorrect according to your various answers (and I agree).
CODESET does not seem to be much in favor.
ENCODING, then? But some of the values are "encoding schemes" (Keld makes a
clear distinction between encoding and encoding scheme).

Any suggestions would be immensely appreciated.

Thanks.
--Yves


