Re: Chapter on character sets

From: brendan_murray@lotus.com
Date: Thu Jun 15 2000 - 09:17:31 EDT


Lars Marius Garshol <larsga@garshol.priv.no> wrote
> I would be glad if people here could read through it and tell me if
> they see any mistakes (or other kinds of things that could be
> improved).

This seems to have a nice chatty style. Just a couple of points:
   - When defining characters, you omit anything about control characters.
   - The book references omit the Bible - the Unicode Standard :-)
   - The description of the ranges in 8859 should explicitly state that
   0x00-0x1F and 0x80-0x9F are control characters. The C1 range (0x80-0x9F)
   is really, truly a range for control characters, and not reserved
   because of stripping upper bits from ASCII.
   - The language assignments for the various parts are largely correct,
   but:
     * French encoding in 8859-1 has been officially deprecated in favor of
     8859-15, although the reality is that waaaaaaaay more French data is
     encoded as Latin-1 than Latin-9.
     * Croatian is encoded using Latin-2 (you don't need to ask Sylvester)
     * The use of 8859-3 for Turkish has been deprecated - use 8859-9
     instead.
     * You may want to look at 8859-16 - a proposal for Romanian which, in
     my opinion, is based on a misguided sense of patriotism and is likely
     to cause major problems for Romanian users.
   - The C1 range wasn't empty: Microsoft simply took advantage of the fact
   that this range isn't needed on PCs, and filled it with graphic
   characters in the Windows codepages.
   - UTF-7 has been deprecated.
   - In the discussion about UTF-16, you should mention that the upper
   scalar limit for a Unicode character is 0x10FFFF.
   - The Asian character sets miss a few encodings (use "Traditional
   Chinese" instead of "Taiwan"):
     KSC 5601 (Korean)
     Big5 (Traditional Chinese)
     EUC-TW (Traditional Chinese)
     GB 2312 (Simplified Chinese)
     GBK (Simplified Chinese), supersedes GB 2312
   - The reason using a signed char to hold a byte in C is a problem is
   that C sign-extends the value when converting it to an int, as many C
   functions do, so a single SBCS character of 0x80 or above ends up with
   its upper bits set (0xE9 becomes 0xFFE9 in a 16-bit int). I suggest that
   text data should use an unsigned quantity.
   - C++ has the same issues as C. However, you can use the Standard C++
   Library classes such as wstring and, of course, the MFC CString. On the
   other hand, ICU is cross-platform and open-source so it may be the best
   solution.
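
To make two of the byte-level points above concrete, here is a minimal
sketch in Python (chosen purely for brevity; the function names are made up
for illustration). It shows that a byte in the C1 range decodes to a control
character under ISO 8859-1 but to a graphic character under Windows codepage
1252, and it mimics what happens when a signed versus an unsigned 8-bit
value is widened to a 16-bit int.

```python
import struct
import unicodedata

def describe(byte_value, encoding):
    """Decode one byte and return (code point, Unicode general category)."""
    ch = bytes([byte_value]).decode(encoding)
    return ord(ch), unicodedata.category(ch)

def widen_signed(byte_value):
    """Mimic C converting a signed char to a 16-bit int (sign extension)."""
    (as_signed,) = struct.unpack("b", bytes([byte_value]))  # reinterpret as signed 8-bit
    return as_signed & 0xFFFF  # bit pattern of the resulting 16-bit int

def widen_unsigned(byte_value):
    """Mimic C converting an unsigned char to a 16-bit int (zero extension)."""
    (as_unsigned,) = struct.unpack("B", bytes([byte_value]))
    return as_unsigned & 0xFFFF

# 0x80 is a C1 control in ISO 8859-1 but a currency symbol in Windows-1252.
print(describe(0x80, "iso-8859-1"))
print(describe(0x80, "cp1252"))

# 0xE9 (e-acute in Latin-1) picks up set upper bits when treated as signed.
print(hex(widen_signed(0xE9)), hex(widen_unsigned(0xE9)))
```

Category "Cc" means a control character and "Sc" a currency symbol. Bytes
below 0x80 come out the same either way, which is why the signed char
problem only shows up once non-ASCII data arrives.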

Points I thought could be mentioned are:
   endianness - big versus little
   more emphasis on the glyph vs character issue
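
As a rough illustration of the endianness point, and of the 0x10FFFF limit
mentioned for UTF-16, here is a short Python sketch. The helper name
to_surrogates is hypothetical; the codec names are the ones Python ships
with.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000      # 20 bits remain
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low

# The largest possible pair, (0xDBFF, 0xDFFF), corresponds to U+10FFFF.
print([hex(unit) for unit in to_surrogates(0x10FFFF)])

# The same character serialized in both byte orders.
print("A".encode("utf-16-be").hex())  # 0041
print("A".encode("utf-16-le").hex())  # 4100

# Plain "utf-16" prepends a byte order mark so a reader can tell which is which.
print("A".encode("utf-16")[:2] in (b"\xff\xfe", b"\xfe\xff"))
```

Two surrogates carry 10 bits each, so the pair mechanism covers exactly
0x10000 through 0x10FFFF, which is where the upper limit comes from.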

B


