Parsers for the UnicodeSet notation?
emuller at adobe.com
Thu Jul 24 01:51:15 CDT 2014
Thanks for the answers.
I take it from Steve's answer that Roozbeh's parser may work today but
may break tomorrow.
A couple of suggestions:
- a full "parser" for UnicodeSet is non-trivial, since it requires
access to property values. That does not seem necessary for
exemplars, so maybe it would be good to restrict the UnicodeSet
notation allowed there.
- alternatively, since the extent of a UnicodeSet can involve property
values, the extent can depend on the version of Unicode from which
those values are taken. So there ought to be a Unicode version number
in the CLDR data; it would be nice for that number to be present in
the data files (I don't see one in he.xml).
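
To illustrate the first suggestion, here is a rough sketch in Python of
what a parser for a restricted notation could look like: literal
characters, ranges, {strings} and a few backslash escapes, with anything
that needs property data rejected outright. The function name and the
exact subset accepted are my own assumptions, not something CLDR
defines.

```python
def parse_simple_unicode_set(text):
    """Parse a restricted UnicodeSet: literals, ranges, {strings}, escapes.

    Hypothetical subset for illustration. Nested sets, negation, set
    operators and property expressions ([:L:], \\p{L}) are rejected.
    """
    text = text.strip()
    if not (text.startswith("[") and text.endswith("]")):
        raise ValueError("expected a set enclosed in [ ]")
    body = text[1:-1]

    def read_char(pos):
        """Return (character, next position), decoding backslash escapes."""
        ch = body[pos]
        if ch != "\\":
            return ch, pos + 1
        if pos + 1 >= len(body):
            raise ValueError("dangling backslash")
        nxt = body[pos + 1]
        if nxt == "u":  # \uXXXX
            return chr(int(body[pos + 2:pos + 6], 16)), pos + 6
        if nxt == "U":  # \UXXXXXXXX
            return chr(int(body[pos + 2:pos + 10], 16)), pos + 10
        if nxt in "pPN":
            raise ValueError("property and name escapes are not supported")
        return nxt, pos + 2  # e.g. \- or \[ stands for itself

    items = set()
    prev = None  # last single character read, possible start of a range
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():  # unescaped whitespace is ignored
            i += 1
        elif ch in "[]&^$":
            raise ValueError("unsupported syntax at offset %d" % i)
        elif ch == "{":  # multi-character string such as {ch}
            end = body.index("}", i)
            chars = []
            j = i + 1
            while j < end:
                if body[j].isspace():
                    j += 1
                    continue
                c, j = read_char(j)
                chars.append(c)
            items.add("".join(chars))
            prev = None
            i = end + 1
        elif ch == "-" and prev is not None and i + 1 < len(body):
            high, i = read_char(i + 1)  # range prev-high
            if ord(high) < ord(prev):
                raise ValueError("range end precedes range start")
            items.update(chr(cp) for cp in range(ord(prev), ord(high) + 1))
            prev = None
        else:
            c, i = read_char(i)
            items.add(c)
            prev = c
    return items
```

With this, "[a-c {ch} \- \u05D0]" yields the six strings a, b, c, ch,
- and U+05D0, while "[[:L:]]" raises an error instead of needing the
Unicode Character Database.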
> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ascii letters. Somehow, that seems
> incorrect. What did I do wrong?
Sorry, I took the UnicodeSet straight out of he/characters.json, without
handling the JSON serialization (or rather deserialization) of strings.
Taking it straight out of he.xml (where there is no serialization
effect) gives a much more reasonable set of twenty strings. XML wins.
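
For anyone hitting the same thing, a small Python sketch of the effect.
The fragment below is made up for illustration and is not the actual
content of he/characters.json; the point is only that JSON doubles every
backslash, so the raw text must be decoded before it is read as a
UnicodeSet. My guess is that the undecoded form gets read as a literal
backslash followed by a range operator, which would account for a count
in the thousands.

```python
import json

# The string value as it would appear between the quotes in a JSON file
# (illustrative fragment, not the real CLDR data).
raw_value = r'[\\- \u2010 , ; \\: ! ?]'

# Proper deserialization turns \\ back into a single backslash, so the
# set starts with an escaped hyphen, as in the XML file.
decoded = json.loads('"' + raw_value + '"')

# Pasted undecoded, "\\" plausibly reads as a literal backslash (U+005C)
# and the following "-" as a range operator running up to U+2010.
range_size = 0x2010 - ord("\\") + 1
contains_ascii_letters = ord("\\") <= ord("a") <= 0x2010

print(decoded)
print(range_size, contains_ascii_letters)
```

Under that reading the accidental range alone covers 8,117 code points,
lowercase ASCII letters included.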