Re: Encoding vs Charset

From: Jungshik Shin (jshin@mailaps.org)
Date: Wed Mar 27 2002 - 11:59:10 EST


On Wed, 27 Mar 2002, Dan Kogai wrote:

> On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote:
> > IMHO, you're also misusing the term 'charset' here. MIME charset
> > can be used synonymously with 'encodings' (or
> > character set encoding scheme: see CJKV Information Processing,
> > IETF RFC 2130 and RFC 2278). What has to be distinguished
> > is 'coded character set' on the one hand (JIS X 0208, JIS X 0212,
> > KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII,
> > ISO-8859-x) and 'encoding/character
> > set encoding scheme/MIME charset' on the other hand (EUC-JP,
> > EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
> > ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC)
>
> I do not think so. This time I can confidently say it is IANA that
> has goofed. To make my point clear, let me define Charset and Encoding
> once again.
>
> Character Set:
>
> a collection of characters in which each character is distinguished
> by a unique ID (in most cases, the ID is a number).
>
> Character Encoding:
>
> A way to represent characters in a byte stream. A given character
> encoding may contain a single character set (e.g. US-ASCII) or multiple
> character sets (e.g. EUC-JP, which contains US-ASCII, JIS X 0201 Kana,
> JIS X 0208 and JIS X 0212). A given character encoding may also encode
> a character set as-is (raw; US-ASCII) or processed (for EUC-JP, US-ASCII
> is as-is, JIS X 0201 is prepended with \x8E, JIS X 0208 has 0x8080
> added, and JIS X 0212 has 0x8080 added and is then prepended with \x8F).

  You got me wrong. I don't have any objection to 'coded character set'
and 'encoding' defined this way. The problem is that you're using
'(coded) character set' and 'charset' interchangeably. They're two
different things depending on where you come from. My point is that
'charset' is already overloaded with two or more different meanings (as
the MIME Content-Type header parameter, it means 'encoding' as you
defined it above), so you'd better not use it when comparing coded
character sets on the one hand with encodings/character set encoding
schemes on the other. Simply put, it'd be much better for you to say
'(coded) character set vs encoding' instead of 'charset vs encoding'.
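
  To make the distinction concrete, here is a minimal sketch in Python
(my choice of language for illustration; it is not something used in this
thread). It applies the EUC-JP arithmetic quoted above to code points
taken from coded character sets, then shows a MIME 'charset' label
selecting an encoding. The helper names are hypothetical, and I assume
JIS X 0201 kana are given in their 8-bit form (0xA1-0xDF).

```python
import email

# Hypothetical helpers: they map a code point in a *coded character set*
# to the bytes used by the *encoding* EUC-JP, per the rules quoted above.

def euc_jp_from_jisx0208(code: int) -> bytes:
    """JIS X 0208 two-byte code (e.g. 0x2422): add 0x8080."""
    return (code + 0x8080).to_bytes(2, "big")

def euc_jp_from_jisx0201_kana(code: int) -> bytes:
    """JIS X 0201 kana, assumed in 8-bit form: prepend 0x8E."""
    return b"\x8e" + bytes([code])

def euc_jp_from_jisx0212(code: int) -> bytes:
    """JIS X 0212 two-byte code: add 0x8080, then prepend 0x8F."""
    return b"\x8f" + (code + 0x8080).to_bytes(2, "big")

HIRAGANA_A = "\u3042"   # HIRAGANA LETTER A
JISX0208_A = 0x2422     # its code in the coded character set JIS X 0208

# One coded character set, two encodings: EUC-JP sets the high bits,
# while ISO-2022-JP carries the raw JIS code between escape sequences.
euc = euc_jp_from_jisx0208(JISX0208_A)
assert euc == HIRAGANA_A.encode("euc_jp")
assert JISX0208_A.to_bytes(2, "big") in HIRAGANA_A.encode("iso2022_jp")

# The MIME 'charset' parameter names the encoding (EUC-JP), not any one
# of the coded character sets that the encoding contains.
msg = email.message_from_string("Content-Type: text/plain; charset=EUC-JP\n\n")
label = msg.get_content_charset()
decoded = euc.decode(label)
```

Note that the same JIS X 0208 code comes out as different bytes under
EUC-JP and ISO-2022-JP, and the 'charset' label is what tells the two
apart.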

  Jungshik Shin

P.S. I'm wondering why you posted this to the Unicode list (where it's
not very relevant) without posting to perl-unicode. I was forced to
post my response to the Unicode list, but I'd rather keep this thread
(if there's a need to continue) where it began (perl-unicode).


