From: Peter Edberg (pedberg@apple.com)
Date: Wed Apr 05 2006 - 18:03:49 CST
We are planning to use CLDR exemplar character set data for various
purposes, and so I have been looking into some current exemplar sets
(as indicated by the survey tool at
<http://unicode.org/cldr/apps/survey>). The exemplar sets are
"supposed to represent the set of
characters needed to write the form of the language that is currently
in use" (per Deborah Goldsmith). By this criterion, several of them
include characters that seem to me to be inappropriate. I wanted to
get some feedback on these, and then try to get them changed ASAP
(via bugs etc.). In all cases below I am referring to the standard
exemplar set, not the auxiliary characters.
1. Arabic (ar) & Persian (fa):
- Both of these include 200C and 200D (ZWNJ and ZWJ). I would argue
that these characters are not required in order to write Arabic or
Persian.
2. Armenian (hy):
- This currently includes 0559. According to TUS 4.0 (pg. 181), 0559
is not used, appears to be a duplicate, and 02BB should be used instead.
- This also currently includes the presentation form ligatures FB13-
FB17. Again according to TUS 4.0, these forms are traditionally found
in handwriting and in fonts that mimic handwriting. However, these do
not seem to be required in order to write Armenian (e.g. in a
non-handwriting style), and it also seems odd to include any
compatibility characters in exemplar sets.
3. Hebrew (he):
- This currently includes points 05B0-05B9, 05BB-05BC, 05BD, 05BF,
05C1-05C2, 05C4. Points are not required for writing modern Hebrew,
so these should not be in the standard set. Perhaps these should be
in an auxiliary set.
4. Thai (th):
- This currently includes 200B (ZWSP). This is not required in order
to write Thai (though ZWSP can optionally be used to indicate word
breaks).
5. For other locales that I checked whose exemplar sets seemed to
include extraneous or incorrect characters, there are already
proposals or bugs to address the changes, e.g.:
- Romanian (ro) currently has 015F/0163 and not 0219/021B, but there
is a bug for this,
<http://dev.icu-project.org/cgi-bin/locale-bugs/data?id=434>
- Norwegian Bokmål (nb) currently includes 01CE, but proposal u223-1
deletes this.
- Swedish (sv) currently includes standalone 0300 (in addition to the
composed chars), but proposals u30-1 & u219-1 delete this.
- Hindi (hi) currently includes characters unnecessary for Hindi, but
proposal u109-1 deletes these. Similar issues apply to other Indic
locales such as Gurmukhi and Punjabi.
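
Several of the cases above (format controls such as ZWNJ/ZWJ/ZWSP,
compatibility presentation forms, standalone combining marks) could in
principle be flagged mechanically. The following is only a rough sketch
using Python's standard unicodedata module; the function name
flag_suspect and the three heuristics are my own assumptions for
illustration, not anything CLDR or the survey tool actually does. It
takes the exemplar set as a plain string of characters and does not
parse UnicodeSet syntax. A flag only means "worth a second look":
nonspacing marks, for example, are legitimately required in many
scripts.

```python
import unicodedata

def flag_suspect(chars):
    """Return {char: [reasons]} for characters that merit review.

    Hypothetical helper: the heuristics below are assumptions,
    not CLDR policy.
    """
    flagged = {}
    for ch in chars:
        reasons = []
        category = unicodedata.category(ch)
        if category == "Cf":
            # Format controls such as U+200B, U+200C, U+200D.
            reasons.append("format control")
        if category == "Mn":
            # Nonspacing marks such as Hebrew points or U+0300.
            reasons.append("nonspacing mark")
        if unicodedata.decomposition(ch).startswith("<"):
            # A tagged mapping (e.g. "<compat> ...") marks a
            # compatibility character such as U+FB13.
            reasons.append("compatibility character")
        if reasons:
            flagged[ch] = reasons
    return flagged

if __name__ == "__main__":
    # A few of the code points discussed above, plus U+0561
    # (an ordinary Armenian letter) as a control.
    sample = "\u200C\u200D\uFB13\u05B0\u200B\u0300\u0561"
    for ch, reasons in flag_suspect(sample).items():
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {', '.join(reasons)}")
```

Note that a check like this cannot catch cases such as 0559 in
Armenian or 015F/0163 in Romanian, where the character is simply the
wrong choice for the language; those still need review by hand.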
-Peter Edberg, Apple Computer