L2/05-051
Public Review Issue #63: POSIX Data for CLDR
There is a new tool that creates POSIX locale data files from CLDR. It has been used to generate
draft POSIX locale data files for public review. We encourage review of this data; any feedback can
be filed at
http://unicode.org/cldr/filing_bug_reports.html. (Note: the CLDR 1.3 freeze date has been
extended to allow for feedback on this and other locale data.)
The draft files are available at
http://unicode.org/cldr/data/common/posix/. Because POSIX locale data files are specific to
a charset, there are two kinds of files:
- generated with the UTF-8 charset, such as
  http://unicode.org/cldr/data/common/posix/hi_IN.UTF-8.src
  - These include all the locales.
- generated with other charsets, such as
  http://unicode.org/cldr/data/common/posix/de_DE.ISO8859-15.src
  - These include just a few samples, for checking.
The main remaining issue at this point appears to be the repertoire of characters to be used for
the UTF-8 locales. Currently the mechanism is to use the following heuristic:
- start with the exemplar characters (main + auxiliary)
- add the collation-tailored characters (including the contractions and prefixes for LC_COLLATE)
- add characters in the same script (for script values associated with sets of letters, and
excluding Han),
- add characters in the same block (excluding letters and unassigned characters),
- add ASCII
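
The heuristic above can be sketched in a few lines of Java. This is only an illustration and not the code used by the tool: the class and method names are hypothetical, it relies on the JDK's own Unicode property data rather than on ICU, contractions are assumed to have been flattened to their individual code points, and "script values associated with sets of letters" is interpreted here as every script except Common, Inherited and Unknown.

```java
import java.util.Arrays;
import java.util.EnumSet;
import java.util.HashSet;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch of the repertoire heuristic; not the GeneratePOSIX source.
class RepertoireSketch {

    // Han is excluded per the heuristic; the other three are an assumption
    // about which script values do not denote "sets of letters".
    static final Set<Character.UnicodeScript> SKIPPED_SCRIPTS = EnumSet.of(
            Character.UnicodeScript.HAN,
            Character.UnicodeScript.COMMON,
            Character.UnicodeScript.INHERITED,
            Character.UnicodeScript.UNKNOWN);

    /** Returns the repertoire as a sorted set of code points. */
    static Set<Integer> build(Set<Integer> exemplars, Set<Integer> collationTailored) {
        // Steps 1 and 2: exemplar characters plus collation-tailored characters.
        Set<Integer> seeds = new TreeSet<>(exemplars);
        seeds.addAll(collationTailored);

        // Record which scripts and blocks the seed characters belong to.
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        Set<Character.UnicodeBlock> blocks = new HashSet<>();
        for (int cp : seeds) {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (!SKIPPED_SCRIPTS.contains(script)) {
                scripts.add(script);
            }
            Character.UnicodeBlock block = Character.UnicodeBlock.of(cp);
            if (block != null) {
                blocks.add(block);
            }
        }

        Set<Integer> repertoire = new TreeSet<>(seeds);
        for (int cp = 0; cp <= Character.MAX_CODE_POINT; cp++) {
            if (!Character.isDefined(cp)) {
                continue; // unassigned characters are never added
            }
            // Step 3: characters in the same script as a seed character.
            boolean sameScript = scripts.contains(Character.UnicodeScript.of(cp));
            // Step 4: non-letters in the same block as a seed character.
            boolean sameBlockNonLetter = !Character.isLetter(cp)
                    && blocks.contains(Character.UnicodeBlock.of(cp));
            // Step 5: ASCII.
            boolean ascii = cp < 0x80;
            if (sameScript || sameBlockNonLetter || ascii) {
                repertoire.add(cp);
            }
        }
        return repertoire;
    }

    public static void main(String[] args) {
        // Two Devanagari letters stand in for a real exemplar set.
        Set<Integer> exemplars = new TreeSet<>(Arrays.asList(0x0915, 0x0916));
        Set<Integer> repertoire = build(exemplars, new TreeSet<>());
        System.out.println("repertoire size: " + repertoire.size());
    }
}
```

With two Devanagari letters as input, the result picks up the remaining Devanagari letters through the script rule, and punctuation such as the danda (U+0964, whose script is Common) through the block rule. The exact contents depend on the Unicode version bundled with the JDK.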
Feedback on this and other issues is welcome.
Notes:
- The tools actually have the ability to override the above heuristic, and to force the main
repertoire set and/or the collation repertoire set by specifying a UnicodeSet pattern from the
command line:
GeneratePOSIX -u [\\u0000-\\U10ffff] -x [\\u0000-\\U10ffff] -m de_DE -c UTF-8
- There are known bugs in the draft; those will show up in
http://www.jtcsv.com/cgibin/locale-bugs/posix.
- The LC_CTYPE values are taken from Unicode.
- Since the POSIX model can't represent all of CLDR, the tool needs to "downcast" to the closest
equivalent, e.g.:
  - d_fmt, t_fmt: using the medium format
  - d_t_fmt: using the long format
- LC_MESSAGES data will be updated with new data from CLDR 1.3.
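
The "downcast" note above amounts to a fixed choice of CLDR format length for each POSIX keyword. A minimal sketch of that selection follows, assuming hypothetical class and method names and placeholder patterns. It does not show the actual tool, and it leaves out the conversion from CLDR pattern syntax to the conversion specifications POSIX expects.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the format-length selection; not the GeneratePOSIX source.
class DowncastSketch {

    enum Length { FULL, LONG, MEDIUM, SHORT }

    // POSIX LC_TIME keyword -> CLDR format length, as listed in the notes.
    static final Map<String, Length> LENGTH_FOR_KEYWORD = new HashMap<>();
    static {
        LENGTH_FOR_KEYWORD.put("d_fmt", Length.MEDIUM);
        LENGTH_FOR_KEYWORD.put("t_fmt", Length.MEDIUM);
        LENGTH_FOR_KEYWORD.put("d_t_fmt", Length.LONG);
    }

    /** Picks the CLDR pattern that feeds the given POSIX keyword. */
    static String select(String keyword, Map<Length, String> available) {
        Length wanted = LENGTH_FOR_KEYWORD.get(keyword);
        if (wanted == null) {
            throw new IllegalArgumentException("no downcast rule for " + keyword);
        }
        String pattern = available.get(wanted);
        if (pattern == null) {
            throw new IllegalStateException("locale has no " + wanted + " pattern");
        }
        return pattern;
    }

    public static void main(String[] args) {
        // Placeholder date patterns in CLDR syntax, for illustration only.
        Map<Length, String> datePatterns = new EnumMap<>(Length.class);
        datePatterns.put(Length.SHORT, "dd.MM.yy");
        datePatterns.put(Length.MEDIUM, "dd.MM.yyyy");
        datePatterns.put(Length.LONG, "d. MMMM yyyy");
        System.out.println("d_fmt <- " + select("d_fmt", datePatterns));
    }
}
```

The other CLDR lengths (full and short) have no slot in these POSIX keywords, which is the information lost in the downcast.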