Re: Encoding vs Charset

From: Keld Jørn Simonsen (keld@dkuug.dk)
Date: Wed Mar 27 2002 - 17:13:40 EST


On Wed, Mar 27, 2002 at 11:59:10AM -0500, Jungshik Shin wrote:
> On Wed, 27 Mar 2002, Dan Kogai wrote:
>
> > On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote:
> > > IMHO, you're also misusing the term 'charset' here. MIME charset
> > > can be used synonymously with 'encodings' (or
> > > character set encoding scheme: see CJKV Information Processing,
> > > IETF RFC 2130 and RFC 2278). What has to be distinguished
> > > is 'coded character set' on the one hand (JIS X 0208, JIS X 0212,
> > > KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII,
> > > ISO-8859-x) and 'encoding/character
> > > set encoding scheme/MIME charset' on the other hand (EUC-JP,
> > > EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
> > > ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC)
> >
> > I do not think so. This time I can confidently say it is IANA that
> > has goofed. To make my point clear, let me define Charset and Encoding
> > once again.
> >
> > Character Set:
> >
> > a collection of characters in which each character is distinguished
> > by a unique ID (in most cases, the ID is a number).
> >
> > Character Encoding:
> >
> > A way to represent characters in a byte stream. A given character
> > encoding may contain a single character set (e.g. US-ASCII) or multiple
> > character sets (e.g. EUC-JP, which contains US-ASCII, JIS X 0201 Kana,
> > JIS X 0208 and JIS X 0212). A given character encoding may also encode
> > a character set as-is (raw; US-ASCII) or processed (for EUC-JP, US-ASCII
> > is as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 has 0x8080
> > added, and JIS X 0212 has 0x8080 added and is then prepended with \x8F).
>
> You got me wrong. I don't have any objection to 'coded character set'
> and 'encoding' defined this way. The problem is that you're using '(coded)
> character set' and 'charset' interchangeably. They're two different
> things depending on where you come from. My point is that because
> 'charset' is already overloaded with two or more different meanings (as a
> MIME Content-Type header parameter, it means 'encoding' as you defined
> above), you'd better not use it when comparing coded character set on the
> one hand and encoding/character set encoding scheme on the other hand.
> Simply, it'd be much better for you to say '(coded) character set vs
> encoding' instead of 'charset vs encoding'.

I think you are getting closer to agreement. The IETF 'charset' term is
indeed defined very closely to what is named "encoding" above.
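
A minimal Python sketch of that point, assuming the standard library's
email package and its euc_jp codec: the MIME charset parameter is used
directly as the name of a byte-level decoder, which is the job of an
"encoding" in the sense defined above. The message below is a made-up
example, not one taken from this thread.

```python
from email import message_from_bytes

# Hypothetical message: the body holds the text "日本" as EUC-JP bytes.
body = "日本".encode("euc_jp")
raw = (
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=EUC-JP\r\n"
    b"Content-Transfer-Encoding: 8bit\r\n"
    b"\r\n"
) + body + b"\r\n"

msg = message_from_bytes(raw)

# The MIME "charset" label, as found in the Content-Type header.
label = msg.get_content_charset()

# The label is all that is needed to turn the payload bytes back into text,
# so it names a bytes-to-characters mapping, not just a set of characters.
payload = msg.get_payload(decode=True)
text = payload.decode(label)

print(label, repr(text))
```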

Let me point out that "coded character set" and "character set"
are two quite different things. In the former, each character also has
a code associated with it, while in the latter there are no
codes associated. A "character set" consists of
"abstract characters" in Unicode parlance.
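
The three levels can be sketched in Python, assuming the standard
euc_jp and iso2022_jp codecs. The JIS X 0208 code 0x2422 for the
character "あ" (row 4, cell 2) and the JIS X 0201 code 0xB1 for the
half-width katakana "ｱ" are written out by hand for illustration;
JIS X 0212 is left out to keep the sketch short.

```python
import unicodedata

ch = "あ"

# Abstract character: identified by name only, no code or bytes involved.
name = unicodedata.name(ch)

# Coded character set: JIS X 0208 assigns this character a code.
# (Value written by hand for this sketch: row 4, cell 2 -> 0x2422.)
JISX0208_CODE = 0x2422

# Encodings: two different byte representations of the same coded character.
euc = ch.encode("euc_jp")       # code + 0x8080, as described in the quote
iso = ch.encode("iso2022_jp")   # code as-is, wrapped in escape sequences

euc_expected = (JISX0208_CODE + 0x8080).to_bytes(2, "big")
iso_expected = b"\x1b$B" + JISX0208_CODE.to_bytes(2, "big") + b"\x1b(B"

# JIS X 0201 kana in EUC-JP: the one-byte code prefixed with 0x8E.
JISX0201_CODE = 0xB1            # half-width katakana A, written by hand
kana = "ｱ".encode("euc_jp")
kana_expected = bytes([0x8E, JISX0201_CODE])

print(name)
print(euc.hex(), euc == euc_expected)
print(iso.hex(), iso == iso_expected)
print(kana.hex(), kana == kana_expected)
```

The same JIS X 0208 code ends up as different bytes in EUC-JP and
ISO-2022-JP, which is why the coded character set and the encoding
need separate names.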

Kind regards
keld


