From: spir (denis.spir@free.fr)
Date: Sat Feb 06 2010 - 05:56:53 CST
Hello,
I have a bunch of questions on the topic.
The provided test data holds a huge list of specific and generic cases, about 11,500 of which are Hangul ones.
-1- Why so many? Is it necessary to test all of them? I would guess, for instance, that if a function correctly transforms one, two, or three Hangul LVT syllables, then it correctly transforms all of them, no?
-2- Since Hangul code points are normalised algorithmically (as opposed to via a mapping), shouldn't they be in a separate part? (See the sketch just below these questions.)
-3- What are the specific cases (part 0), and why are they kept apart?
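
For reference, here is a minimal sketch of the algorithmic Hangul decomposition I have in mind for question -2-, using the standard constants from the Unicode specification, and cross-checked against Python's unicodedata:

# Minimal sketch of the algorithmic Hangul NFD decomposition
# (Unicode chapter 3.12); the constants are the standard ones.
import unicodedata

S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT          # 588
S_COUNT = L_COUNT * N_COUNT          # 11172 precomposed syllables

def decompose_hangul(ch):
    """Return the canonical (NFD) decomposition of a precomposed
    Hangul syllable, or the character itself if it is not one."""
    s_index = ord(ch) - S_BASE
    if not (0 <= s_index < S_COUNT):
        return ch
    l = L_BASE + s_index // N_COUNT
    v = V_BASE + (s_index % N_COUNT) // T_COUNT
    t = T_BASE + s_index % T_COUNT
    jamo = [chr(l), chr(v)]
    if t != T_BASE:                  # trailing consonant present (LVT)
        jamo.append(chr(t))
    return "".join(jamo)

# Cross-check against the library implementation:
assert decompose_hangul("\uAC01") == unicodedata.normalize("NFD", "\uAC01")
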
I also wonder about the source code points to be normalised.
-4- Does each code point / group of code points represent a whole, consistent "user-perceived character"?
-5- Would their concatenation build a valid character string (text)?
-6- Should the (NFD) normalisation of this text result in the concatenation of the individually normalised cases? (See the small check below.)
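
Regarding question -6-, here is a small illustrative check (the characters are my own choice, not taken from the test file) showing that NFD is not, in general, closed under concatenation, because a combining mark at the boundary may have to be reordered across it:

# Illustration: NFD(a + b) need not equal NFD(a) + NFD(b).
import unicodedata

a = "\u00E1"    # LATIN SMALL LETTER A WITH ACUTE
b = "\u0316"    # COMBINING GRAVE ACCENT BELOW (combining class 220)

nfd_joined  = unicodedata.normalize("NFD", a + b)
joined_nfds = unicodedata.normalize("NFD", a) + unicodedata.normalize("NFD", b)

print([hex(ord(c)) for c in nfd_joined])   # ['0x61', '0x316', '0x301']
print([hex(ord(c)) for c in joined_nfds])  # ['0x61', '0x301', '0x316']
assert nfd_joined != joined_nfds
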
My intention is to do the following; please tell me whether it makes sense:
* Build separate test sets for the specific / Hangul / generic cases (done).
* Select all specific cases, plus N randomly chosen Hangul and generic cases.
* Using the complete data columns, run and check case-per-case normalisation, with the given assertions c3 == NFD(c1) == NFD(c2) == NFD(c3) and c5 == NFD(c4) == NFD(c5) (a sketch follows after this list).
* Using only the source and NFD columns, run and check whole-text normalisation.
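
For the case-per-case step, here is a minimal sketch of the check I have in mind, assuming the usual NormalizationTest.txt layout: five semicolon-separated columns c1..c5 of space-separated hex code points, with '#' comments and '@PartN' headers:

# Sketch of a case-per-case NFD check over NormalizationTest.txt.
import unicodedata

def parse_line(line):
    """Return [c1..c5] as strings, or None for comment/part lines."""
    line = line.split("#", 1)[0].strip()
    if not line or line.startswith("@"):
        return None
    cols = line.split(";")[:5]
    return ["".join(chr(int(cp, 16)) for cp in col.split()) for col in cols]

def check_nfd(path):
    nfd = lambda s: unicodedata.normalize("NFD", s)
    with open(path, encoding="utf-8") as f:
        for n, raw in enumerate(f, 1):
            cols = parse_line(raw)
            if cols is None:
                continue
            c1, c2, c3, c4, c5 = cols
            assert c3 == nfd(c1) == nfd(c2) == nfd(c3), f"line {n}"
            assert c5 == nfd(c4) == nfd(c5), f"line {n}"

check_nfd("NormalizationTest.txt")
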
________________________________
life is strange