Re: Encoding vs Charset

From: Dan Kogai (dankogai@dan.co.jp)
Date: Wed Mar 27 2002 - 05:17:50 EST


On Wednesday, March 27, 2002, at 11:22 , Jungshik Shin wrote:
> IMHO, you're also misusing the term 'charset' here. MIME charset
> can be used synonymously with 'encodings' (or
> character set encoding scheme: see CJKV Information Processing,
> IETF RFC 2130 and RFC 2278). What has to be distinguished
> is 'coded character set' on the one hand (JIS X 0208, JIS X 0212,
> KS X 1001, KS X 1003, GB 2312, CNS 11xxx, ISO 10646, ISO 646, US-ASCII,
> ISO-8859-x) and 'encoding/character
> set encoding scheme/MIME charset' on the other hand (EUC-JP,
> EUC-KR, EUC-TW, EUC-CN, ISO-2022-JP, ISO-2022-KR, ISO-2022-CN,
> ISO-8859-x, UTF-8, UTF-32, UTF-7, UTF-16, Big5, UHC)

   I do not think so. This time I can confidently say it is IANA that
has goofed. To make my point clear, let me define Charset and Encoding
once again.

Character Set:

   A collection of characters in which each character is identified by
a unique ID (in most cases, the ID is a number).

Character Encoding:

   A way to represent characters in a byte stream. A given character
encoding may contain a single character set (e.g. US-ASCII) or multiple
character sets (e.g. EUC-JP, which contains US-ASCII, JIS X 0201 Kana,
JIS X 0208 and JIS X 0212). A given character encoding may also encode a
character set as-is (raw; US-ASCII) or processed (for EUC-JP, US-ASCII
is as-is, JIS X 0201 Kana is prepended with \x8E, JIS X 0208 has 0x8080
added, and JIS X 0212 has 0x8080 added and is then prepended with \x8F).
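The EUC-JP rules above can be sketched in a few lines of Python. This is only an illustration of the byte arithmetic described in this message, not code from the Encode module; the function name `euc_jp_bytes` and the charset labels used as keys are made up for the example.

```python
def euc_jp_bytes(charset: str, code: int) -> bytes:
    """Turn a (character set, code point) pair into EUC-JP bytes.

    Hypothetical helper: `charset` is one of the four labels below and
    `code` is the character's ID inside that character set.
    """
    if charset == "US-ASCII":
        if not 0x00 <= code <= 0x7F:
            raise ValueError("not a US-ASCII code")
        return bytes([code])                      # encoded as-is
    if charset == "JIS X 0201 Kana":
        if not 0xA1 <= code <= 0xDF:
            raise ValueError("not a JIS X 0201 Kana code")
        return bytes([0x8E, code])                # prepend \x8E
    if charset in ("JIS X 0208", "JIS X 0212"):
        hi, lo = code >> 8, code & 0xFF
        if not (0x21 <= hi <= 0x7E and 0x21 <= lo <= 0x7E):
            raise ValueError("not a 94x94 double-byte code")
        shifted = (code + 0x8080).to_bytes(2, "big")   # add 0x8080
        if charset == "JIS X 0212":
            return b"\x8f" + shifted              # then prepend \x8F
        return shifted
    raise ValueError("character set not carried by EUC-JP")


# HIRAGANA LETTER A is 0x2422 in JIS X 0208.
print(euc_jp_bytes("JIS X 0208", 0x2422))        # b'\xa4\xa2'
# HALFWIDTH KATAKANA LETTER A is 0xB1 in JIS X 0201.
print(euc_jp_bytes("JIS X 0201 Kana", 0xB1))     # b'\x8e\xb1'
```

The point of the sketch is that the same numeric ID yields different bytes depending on which character set it belongs to, which is why the character set and the encoding are separate things.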

   They are different indeed. But for better or for worse,
being correct has little, if anything, to do with being standard.
IANA's usage of charset is just one example.
   But I believe in the case of charset, IANA is not the only one to
blame. Whoever submitted the names to IANA should take more of it. Japan
was lucky here because those who submitted did know the difference.
Taiwan was lucky, too, because Big5 is raw-encoded :)

>> really means euc-cn and charset="ks_c_5601-1987" really means euc-kr.
>> Sadly this misconception is embedded in popular browsers.
>
> Well, use of 'ks_c_5601-1987' is the result of an 'evil'
> act of Microsoft. We furiously objected it, but M$ went on
> to use that name in their products instead of then-well-established
> EUC-KR around 1997. Please, refer to Ken Lunde's CJKV Information
> Processing
> about that 'epic war' between two camps. (see p.197 of
> the book and http://jshin.net/faq/qa8.html)
> We even set up a web page to prevent M$ from spreading that
> ill-defined name. Anyway,
> their designation couldn't withstand the test of the time because
> KS C 5601-1987 was renamed KS X 1001:1998. Still, M$ IE and
> M$ OE, M$ Frontpage keep producing html docs. However,
> it also has to be noted that the encoding
> designated as 'ks_c_5601-1987' by M$ is NOT the same as
> EUC-KR BUT their proprietary extension of EUC-KR, namely
> CP949/UHC/(X-)-Windows-949.

   Okay, then here is the new canon-alias mapping:

euc-kr unaliased (kr.yahoo.com uses this)
ks_c_5601-1987 -> cp949
ksc5601-raw stays unchanged to make NI-S happy in Perl::Tk
                   (raw encodings needed for font loading)
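As a rough illustration of what that mapping means for a lookup, here is a minimal Python sketch. It is not how the Encode module stores its aliases; the names `ALIASES`, `CANONICAL`, and `resolve` are hypothetical, and the table holds only the three names discussed here.

```python
# Illustrative alias table only; it lists just the names from this message.
ALIASES = {
    "ks_c_5601-1987": "cp949",   # the MS label is treated as CP949/UHC
}

# Names that resolve to themselves ("unaliased" or "stays unchanged").
CANONICAL = {"euc-kr", "cp949", "ksc5601-raw"}


def resolve(name: str) -> str:
    """Map a charset label to its canonical encoding name."""
    key = name.strip().lower()       # labels are matched case-insensitively
    key = ALIASES.get(key, key)      # follow the alias if there is one
    if key not in CANONICAL:
        raise LookupError(f"unknown encoding: {name}")
    return key


print(resolve("ks_c_5601-1987"))   # cp949
print(resolve("EUC-KR"))           # euc-kr
```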

>> Sadly this misconception is embedded in popular browsers.
>
> MS IE certainly counts as a popular browser, but Mozilla/Netscape
> never used 'ks_c_5601-1987' to mean EUC-KR. They always have
> used 'EUC-KR'. Mozilla uses 'X-Windows-949' to mean CP949/UHC
> and 'ks_c_5601-1987' is an alias to 'X-Windows-949' (but
> Mozilla will never have 'ks_c_5601-1987' in outgoing messages/docs.
> It only accepts html/emails labeled that way as in X-Windows-949).

   Noted.

> In case of 'GB2312' in place of 'EUC-CN',
> the situation was beyond repair (Ken Lunde's book
> was too late and an error-prone book by a Japanese engineer
> working at MS published a few years earlier spread the
> misconception too widely) so that the name just stuck.

   Okay, then I will leave the current aliasing unchanged, until maybe
the Zhonghuanese :) raise an objection.

> As for Taiwan, the reason there's no confusion between
> coded character set and encoding is not because they're
> technically correct but because in their case EUC-TW
> has never been used widely while the popular encoding
> Big5 has much more complex relationship with CNS 11xxx
> than EUC-KR with KS X 1001 and EUC-CN with GB 2312.
> (Big5 vs CNS 11xxx is similar to Shift_JIS vs JIS X 0208)

Noted, and fortunately in this case I already knew.

Dan the Encode Maintainer


