Lars Marius Garshol <larsga@garshol.priv.no> wrote:
> I would be glad if people here could read through it and tell me if
> they see any mistakes (or other kinds of things that could be
> improved).
This seems to have a nice chatty style. Just a couple of points:
- When defining characters, you omit anything about control characters.
- The book references omit the Bible - the Unicode Standard :-)
- The description of the ranges in 8859 should explicitly state that
0x00-0x1F and 0x80-0x9F are control characters. The C1 range (0x80-0x9F)
is really, truly a range for control characters, and not reserved
because of stripping upper bits from ASCII.
- The language assignments for the various parts are largely correct,
but:
* French encoding in 8859-1 has been officially deprecated in favor of
8859-15, although the reality is that waaaaaaaay more French data is
encoded as Latin-1 than Latin-9.
* Croatian is encoded using Latin-2 (no need to ask Sylvester)
* The use of 8859-3 for Turkish has been deprecated - use 8859-9
instead.
* You may want to look at 8859-16 - a proposal for Romanian which, in
my opinion, is based on a misguided sense of patriotism and is likely
to cause major problems for Romanian users.
- The C1 range wasn't empty: Microsoft simply took advantage of the fact
that this range isn't needed on PCs, and filled it with graphic
characters in the Windows codepages.
- UTF-7 has been deprecated.
- In the discussion about UTF-16, you should mention that the upper
scalar limit for a Unicode character is 0x10FFFF.
- The Asian character sets section is missing a few encodings (use
"Traditional Chinese" instead of "Taiwan"):
KSC 5601 (Korean)
Big5 (Traditional Chinese)
EUC-TW (Traditional Chinese)
GB 2312 (Simplified Chinese)
GBK (Simplified Chinese), supersedes GB 2312
- Using a signed char to hold a byte in C is a problem because C
sign-extends the value when converting it to an int, as many C
functions do, so a single SBCS byte at or above 0x80 ends up as a
negative int with all the upper bits set. Suggest that text data should
use an unsigned quantity.
- C++ has the same issues as C. However, you can use the Standard C++
Library classes such as wstring and, of course, the MFC CString. On the
other hand, ICU is cross-platform and open-source so it may be the best
solution.
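
To make the signed char point concrete, here is a minimal sketch in C.
The function names (widen_signed, widen_unsigned, is_8859_control) are
made up for illustration and are not from any library. It reads the same
byte through a signed and an unsigned char, and tests the two 8859
control ranges named above (0x00-0x1F and 0x80-0x9F), which only works
reliably on an unsigned value.

```c
#include <stddef.h>

/* Read byte i of buf through a signed char and widen it to int.
   Bytes at or above 0x80 come back negative because of sign extension. */
int widen_signed(const void *buf, size_t i)
{
    return ((const signed char *)buf)[i];
}

/* Read the same byte through an unsigned char: the result stays in 0..255. */
int widen_unsigned(const void *buf, size_t i)
{
    return ((const unsigned char *)buf)[i];
}

/* Returns 1 if b falls in the C0 (0x00-0x1F) or C1 (0x80-0x9F) range of an
   ISO 8859 part, 0 otherwise. Only the two ranges named above are tested;
   0x7F (DEL) is deliberately left out of this sketch.
   With a signed char parameter the b >= 0x80 test could never be true. */
int is_8859_control(unsigned char b)
{
    return b <= 0x1F || (b >= 0x80 && b <= 0x9F);
}
```

With the Latin-1 byte 0xE9 (e with acute accent), widen_signed gives -23
on a two's complement machine while widen_unsigned gives 233, which is
why table lookups and range checks on text should go through the
unsigned form.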
Points I thought could be mentioned are:
endianness - big versus little
more emphasis on the glyph vs character issue
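
The UTF-16 limit and the endianness point can be sketched together. This
is only an illustration under my own naming (utf16_length, utf16_high,
utf16_low and read_u16 are hypothetical helpers, not from ICU or any
other library): it shows why 0x10FFFF is the highest scalar value a
surrogate pair can reach, and how the same two bytes give a different
code unit depending on byte order.

```c
#include <stdint.h>

#define UNICODE_MAX 0x10FFFFu  /* upper scalar limit for a Unicode character */

/* Number of UTF-16 code units needed for cp: 1 or 2.
   Returns 0 if cp is not a valid scalar value (a surrogate, or above the limit). */
int utf16_length(uint32_t cp)
{
    if (cp > UNICODE_MAX) return 0;
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;
    return cp < 0x10000 ? 1 : 2;
}

/* High (leading) surrogate for a supplementary character (cp >= 0x10000). */
uint16_t utf16_high(uint32_t cp)
{
    return (uint16_t)(0xD800 + ((cp - 0x10000) >> 10));
}

/* Low (trailing) surrogate for a supplementary character (cp >= 0x10000). */
uint16_t utf16_low(uint32_t cp)
{
    return (uint16_t)(0xDC00 + ((cp - 0x10000) & 0x3FF));
}

/* Assemble one 16-bit code unit from two bytes in the given byte order. */
uint16_t read_u16(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}
```

0x10FFFF encodes as the pair 0xDBFF 0xDFFF, the last high surrogate
followed by the last low surrogate, so nothing above it fits. The bytes
FE FF read as 0xFEFF in big-endian order and as 0xFFFE in little-endian
order, which is what makes the byte order mark usable for detecting
endianness.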
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:03 EDT