From: Mark Davis (mark.davis@icu-project.org)
Date: Thu Feb 16 2006 - 18:59:26 CST
You have to be very careful. UnicodeData.txt is just one file of many
that contain the data for Unicode character properties. And in a great
many cases, the recommended property is *not* the one in UnicodeData.txt.
For examples of this, see the following:
http://www.unicode.org/reports/tr18/#Compatibility_Properties
http://www.unicode.org/reports/tr31/
For other examples, such as determining whether letters are lowercase or
not, see "Case Conversion" in http://www.macchiato.com/slides/gotchas.html
(I'll be talking about these issues at the upcoming Unicode conference.)
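
As a rough illustration of the lowercase case, here is a small Python sketch. It is not taken from any of the links above, and the helper names are made up for the example. It assumes that Python's unicodedata.category() reflects the General_Category field of UnicodeData.txt, and that str.islower() in recent CPython versions follows the derived Lowercase property rather than the general category alone.

```python
import unicodedata


def lowercase_by_general_category(ch: str) -> bool:
    """Naive test: only the General_Category value (as in UnicodeData.txt)."""
    return unicodedata.category(ch) == "Ll"


def lowercase_by_derived_property(ch: str) -> bool:
    """Assumption: str.islower() tracks the derived Lowercase property."""
    return ch.islower()


def disagreements(chars):
    """Return the characters on which the two tests give different answers."""
    return [
        ch
        for ch in chars
        if lowercase_by_general_category(ch) != lowercase_by_derived_property(ch)
    ]


# Hypothetical sample set: two ASCII letters plus U+2170
# (SMALL ROMAN NUMERAL ONE), whose general category is Nl, not Ll.
SAMPLES = ["a", "A", "\u2170"]

if __name__ == "__main__":
    for ch in disagreements(SAMPLES):
        print(
            f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
            f"gc={unicodedata.category(ch)}, islower={ch.islower()}"
        )
```

A parser that reads only the general category would classify U+2170 as a letter number and nothing more, while the derived property treats it as lowercase.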
Mark
Kit Peters wrote:
>
>
> On 2/16/06, Jukka K. Korpela <jkorpela@cs.tut.fi> wrote:
>
> On Wed, 15 Feb 2006, Kit Peters wrote:
>
> > I am interested in the characters whose properties are
> > defined in UnicodeData.txt.
>
> But do you really mean that? That is, do you mean Unicode characters
> except Han characters and Hangul syllables? Why would this be a
> relevant subset? If it is, I don't think there is any shorter
> expression you could use.
>
>
> The reason I am only interested right now in the characters from
> UnicodeData.txt is that the larger project I am working on (CLforJava,
> a pure Java Common Lisp implementation) only parses UnicodeData.txt.
> While we eventually plan to parse Unihan.txt, at present I am
> concentrating on parsing all the numbers in UnicodeData.txt.
>
> Besides, the formulation is vague.
>
>
> What would be a more accurate formulation?
>
> Kit Peters