Re: Parsers for the UnicodeSet notation?

From: Steven R. Loomis <srl_at_icu-project.org>
Date: Wed, 23 Jul 2014 15:31:24 -0700

On 07/23/2014 03:23 PM, Eric Muller wrote:
> I would like to work with the exemplarCharacters data in the CLDR.
> That uses the UnicodeSet notation. Is there somewhere a parser for
> that notation, that would return me just the list of characters in the
> set? Something a bit like the UnicodeSet utility at
> <http://unicode.org/cldr/utility/list-unicodeset.jsp>, but for use in
> apps/shell.
>
> I suspect that the exemplarCharacters use a restricted form of the
> UnicodeSet notation (e.g. do not use property values). Is that
> correct, and if so, what's the subset?
>
> Incidentally, I copy/pasted the punctuation exemplar characters for
> he.xml into the utility, and it reported that the set contains 8,130
> code points, including the ascii letters. Somehow, that seems
> incorrect. What did I do wrong?
>
> Thanks,
> Eric.
>

Eric,
UnicodeSet is a class available in ICU4J and ICU4C (C/C++), so you can
parse and query sets using the ICU API. I wrote a little command-line
utility, badly named "ucd", that is similar to the web page mentioned above.

It is here: http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/
and here is the readme:
http://source.icu-project.org/repos/icu/icuapps/trunk/ucd/readme.txt

Let me know what platform you are on and I can send you build instructions.
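
If building ICU is more than you want for a quick script, and assuming
the sets you care about really do stick to a small subset of the notation
(single characters, a-b ranges, {multi-character strings}, and backslash
escapes), a hand-rolled expander may be enough. The Python below is only a
sketch under that assumption. The name parse_simple_unicode_set is made up
for this example; it is not part of ICU or CLDR, and it rejects property
expressions, nested sets, negation, and set operators instead of handling
them.

```python
def _read_char(s, i):
    # Read one character at s[i], decoding a backslash escape if present.
    # Supported escapes: \uXXXX, \UXXXXXXXX, and backslash + punctuation
    # (for example \- or \[) meaning that literal character.
    if s[i] != "\\":
        return s[i], i + 1
    if i + 1 >= len(s):
        raise ValueError("dangling backslash")
    kind = s[i + 1]
    if kind in ("u", "U"):
        width = 4 if kind == "u" else 8
        digits = s[i + 2:i + 2 + width]
        if len(digits) != width:
            raise ValueError("truncated escape")
        return chr(int(digits, 16)), i + 2 + width
    if kind.isalnum():
        # Escapes such as \x.. or \n are outside this sketch's subset.
        raise ValueError("unsupported escape: \\" + kind)
    return kind, i + 2


def parse_simple_unicode_set(pattern):
    # Expand a restricted UnicodeSet pattern into a sorted list of strings.
    pattern = pattern.strip()
    if len(pattern) < 2 or pattern[0] != "[" or pattern[-1] != "]":
        raise ValueError("expected a pattern of the form [...]")
    body = pattern[1:-1]
    elements = set()
    prev = None  # last single character read; a following '-' starts a range
    i = 0
    while i < len(body):
        ch = body[i]
        if ch.isspace():
            i += 1  # unescaped whitespace is not significant
        elif ch == "{":
            # {abc} is one multi-character element
            text, j = "", i + 1
            while j < len(body) and body[j] != "}":
                if body[j].isspace():
                    j += 1
                else:
                    c, j = _read_char(body, j)
                    text += c
            if j >= len(body):
                raise ValueError("unterminated {")
            elements.add(text)
            prev, i = None, j + 1
        elif ch == "-" and prev is not None:
            # unescaped hyphen after a character: range prev..end, inclusive
            j = i + 1
            while j < len(body) and body[j].isspace():
                j += 1
            if j >= len(body):
                raise ValueError("range has no end")
            end, i = _read_char(body, j)
            if ord(end) < ord(prev):
                raise ValueError("range end precedes range start")
            elements.update(chr(cp) for cp in range(ord(prev), ord(end) + 1))
            prev = None
        elif ch in "[]&^$:":
            raise ValueError("syntax outside the supported subset: " + ch)
        else:
            prev, i = _read_char(body, i)
            elements.add(prev)
    return sorted(elements)


if __name__ == "__main__":
    print(parse_simple_unicode_set(r"[a-c {ch} \u05BE \-]"))
```

One thing the sketch makes easy to see: an unescaped hyphen between two
characters forms a range, so a pattern like [!-z] expands to 90 code
points, ASCII letters included. If a set turns out much larger than
expected, it may be worth checking whether the backslash in front of a
hyphen survived the copy/paste.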
-s

-- 
IBMer but all opinions are mine.
https://www.ohloh.net/accounts/srl295 // fingerprint @ https://ssl.icu-project.org/trac/wiki/Srl
_______________________________________________
Unicode mailing list
Unicode_at_unicode.org
http://unicode.org/mailman/listinfo/unicode
Received on Wed Jul 23 2014 - 17:32:19 CDT
