CLDR: Bad exemplar chars for some locales

From: Peter Edberg (pedberg@apple.com)
Date: Wed Apr 05 2006 - 18:03:49 CST

    We are planning to use CLDR exemplar character set data for various
    purposes, and so I have been looking into some current exemplar sets
    (as indicated by the survey tool at
    <http://unicode.org/cldr/apps/survey>). The exemplar sets are
    "supposed to represent the set of
    characters needed to write the form of the language that is currently
    in use" (per Deborah Goldsmith). By this criterion, several of them
    include characters that seem to me to be inappropriate. I wanted to
    get some feedback on these, and then try to get them changed ASAP
    (via bugs etc.). In all cases below I am referring to the standard
    exemplar set, not the auxiliary characters.

    1. Arabic (ar) & Persian (fa):
    - Both of these include 200C and 200D (ZWNJ and ZWJ). I would argue
    that these characters are not required in order to write Arabic or
    Persian.

    2. Armenian (hy):
    - This currently includes 0559. According to TUS 4.0 (pg. 181), 0559
    is not used, appears to be a duplicate, and 02BB should be used instead.
    - This also currently includes the presentation form ligatures
    FB13-FB17. Again according to TUS 4.0, these forms are traditionally
    found in handwriting and in fonts that mimic handwriting. However, they
    do not seem to be required in order to write Armenian (e.g. in a
    non-handwriting style), and it also seems odd to include any
    compatibility characters in exemplar sets.

    3. Hebrew (he):
    - This currently includes points 05B0-05B9, 05BB-05BC, 05BD, 05BF,
    05C1-05C2, 05C4. Points are not required for writing modern Hebrew,
    so these should not be in the standard set. Perhaps these should be
    in an auxiliary set.

    4. Thai (th):
    - This currently includes 200B (ZWSP). This is not required in order
    to write Thai (though ZWSP can optionally be used to indicate word
    breaks).

    5. For other locales that I checked whose exemplar sets seemed to
    include extraneous or incorrect characters, there are already
    proposals or bugs to address the changes, e.g.:
    - Romanian (ro) currently has 015F/0163 and not 0219/021B, but there
    is a bug for this,
    <http://dev.icu-project.org/cgi-bin/locale-bugs/data?id=434>
    - Norwegian Bokmål (nb) currently includes 01CE, but proposal u223-1
    deletes this.
    - Swedish (sv) currently includes standalone 0300 (in addition to the
    composed chars), but proposals u30-1 & u219-1 delete this.
    - Hindi (hi) currently includes characters unnecessary for Hindi, but
    proposal u109-1 deletes these. Similar issues exist for other Indic
    locales such as Gurmukhi and Punjabi.
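
    As a rough illustration of how an exemplar set could be screened for
    the kinds of characters questioned above, here is a minimal Python
    sketch. It is not part of CLDR or the survey tool; the function name
    flag_suspect_chars and the sample list are hypothetical, and the three
    heuristics (format controls, nonspacing marks, compatibility
    decompositions) are assumptions drawn from the cases in this message.
    It only uses the standard unicodedata module.

```python
import unicodedata


def flag_suspect_chars(code_points):
    """Return {code point: [reasons]} for characters worth a second look.

    Hypothetical helper: the heuristics are assumptions based on the
    cases discussed in this message, not a CLDR rule.
    """
    suspects = {}
    for cp in code_points:
        ch = chr(cp)
        reasons = []
        category = unicodedata.category(ch)
        if category == "Cf":
            # e.g. ZWNJ (200C), ZWJ (200D), ZWSP (200B)
            reasons.append("format control")
        if category == "Mn":
            # e.g. Hebrew points, standalone combining grave (0300)
            reasons.append("nonspacing mark")
        if unicodedata.decomposition(ch).startswith("<"):
            # A leading <tag> marks a compatibility decomposition,
            # e.g. the Armenian ligatures FB13-FB17
            reasons.append("compatibility character")
        if reasons:
            suspects[cp] = reasons
    return suspects


# Hypothetical sample: a few code points mentioned above plus one
# ordinary Armenian letter (0561) that should not be flagged.
sample = [0x200C, 0x200D, 0x200B, 0xFB13, 0x05B0, 0x0300, 0x0561]
for cp, reasons in flag_suspect_chars(sample).items():
    name = unicodedata.name(chr(cp), "<unnamed>")
    print(f"U+{cp:04X} {name}: {', '.join(reasons)}")
```

    A flagged character is only a candidate for review, not necessarily
    an error. The sketch also cannot catch cases such as 0559 or the
    Romanian 015F/0163, which are ordinary letters by category and need
    language-specific knowledge to identify.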

    -Peter Edberg, Apple Computer


