Hello,
Markus Scherer <markus.icu_at_gmail.com> wrote:
|On Thu, Apr 24, 2014 at 12:56 PM, Steffen Nurpmeso <sdaoden_at_yandex.com> wrote:
|> Markus Scherer <markus.icu_at_gmail.com> wrote:
|>|I strongly recommend you parse the derived properties rather than trying to
|>|follow the derivation formula, because that can change over time.
|>
|> ..this file includes only those core properties that have
|> themselves a derivation-may-change property?
|
|I don't know what that means.
|What I tried to say is, if you need ID_Start, then parse ID_Start from
|DerivedCoreProperties.txt. That's more stable (and easier than parsing the
|pieces and deriving
|
|# Lu + Ll + Lt + Lm + Lo + Nl
|# + Other_ID_Start
|# - Pattern_Syntax
|# - Pattern_White_Space
|
|yourself).
But i *do* need to parse many of those pieces (since i'm
interested in much more than ID_Start)!
Unicode has DerivedAge.txt (i don't know where that is derived
from) and i need to parse PropList.txt anyway (to get the full
list of whitespace characters, for example).
So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»,
says <http://www.dict.cc/?s=Kraut+und+R%C3%BCben>).
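For reference, pulling a single property out of PropList.txt or
DerivedCoreProperties.txt takes only a few lines. The following is
a minimal sketch, assuming the usual two-field layout
"code point or range ; property name # comment"; the helper name
parse_property and the sample lines are made up for illustration
and are not taken from the real files.

```python
import re

# Matches the data part of lines such as
#   "0009..000D    ; White_Space"  or  "0020          ; White_Space"
_LINE = re.compile(
    r"^([0-9A-Fa-f]{4,6})(?:\.\.([0-9A-Fa-f]{4,6}))?\s*;\s*(\w+)"
)

def parse_property(lines, wanted):
    """Return the set of code points that `lines` assign to `wanted`."""
    result = set()
    for line in lines:
        line = line.split("#", 1)[0]      # drop the trailing comment
        m = _LINE.match(line)
        if m is None or m.group(3) != wanted:
            continue                      # blank, comment-only, or other property
        first = int(m.group(1), 16)
        last = int(m.group(2) or m.group(1), 16)  # single code point: last == first
        result.update(range(first, last + 1))
    return result

# Illustrative sample in the PropList.txt layout (assumption, not the real file).
sample = [
    "# sample data",
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
    "0020          ; White_Space # Zs       SPACE",
    "0021..0023    ; Pattern_Syntax # Po   [3] EXCLAMATION MARK..NUMBER SIGN",
]
white_space = parse_property(sample, "White_Space")
```

The same function would be called once per property of interest,
which is exactly why the number of files to walk through matters.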
|For example, at least one of the derivation formulas (for Alphabetic) is
|changing from 6.3 to 7.0.
That is interesting or frightening, i don't know yet.
Wouldn't it make sense to introduce a single PropListsJoined.txt
that does it all? Or, for the sake of small and possibly
space-constrained projects..
?0[steffen_at_sherwood ]$ (cd ~/arena/docs.coding/unicode/data;
> ll DerivedCore* PropList*)
100 [.] 99531 25 Sep 2013 PropList.txt
820 [.] 836985 25 Sep 2013 DerivedCoreProperties.txt
..and this is what i would do: offer a new file, say, Formula.txt,
which defines exactly the necessary formula, e.g., to quote your
example
ID_Start
< UnicodeData.txt
< PropList.txt
+ Lu + Ll + Lt
+ Lm
+ Lo + Nl
+ Other_ID_Start
- Pattern_Syntax
- Pattern_White_Space
=
That concept seems to be scalable at first glance. Old parsers
will no longer generate correct data in the future, if
i understood correctly? At least there should be
a formula-compatibility version tag added somewhere, so that
parsers can automatically prevent themselves from generating
incorrect data.
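To make the idea a bit more concrete, here is a rough sketch of how
a parser could evaluate one such entry, using the ID_Start
derivation quoted further up. Everything about it is an assumption:
Formula.txt does not exist, the syntax ("<" names a source file,
"+" adds a set, "-" removes one, "=" ends the entry) is only the
one proposed here, and the property sets are toy data rather than
real UCD content.

```python
def evaluate_formula(text, sets):
    """Evaluate one hypothetical Formula.txt entry, left to right.

    `sets` maps a category or property name to a set of code points.
    Returns (property name, list of source files, resulting set).
    """
    name = None
    sources = []
    result = set()
    for raw in text.splitlines():
        tokens = raw.split()
        if not tokens:
            continue
        if tokens[0] == "=":              # end of entry
            break
        if tokens[0] == "<":              # source file declaration
            sources.append(tokens[1])
            continue
        if tokens[0] not in ("+", "-"):   # first bare word: the property name
            name = tokens[0]
            continue
        op = None
        for tok in tokens:
            if tok in ("+", "-"):
                op = tok
            elif op == "+":
                result |= sets[tok]       # union
            else:
                result -= sets[tok]       # difference
    return name, sources, result

FORMULA = """\
ID_Start
< UnicodeData.txt
< PropList.txt
+ Lu + Ll + Lt
+ Lm
+ Lo + Nl
+ Other_ID_Start
- Pattern_Syntax
- Pattern_White_Space
=
"""

# Toy sets with a handful of code points each (illustrative only).
toy_sets = {
    "Lu": {0x41}, "Ll": {0x61}, "Lt": {0x1C5},
    "Lm": {0x2B0, 0x2E2F}, "Lo": {0xAA}, "Nl": {0x16EE},
    "Other_ID_Start": {0x2118},
    "Pattern_Syntax": {0x21, 0x2E2F},
    "Pattern_White_Space": {0x20},
}

name, sources, id_start = evaluate_formula(FORMULA, toy_sets)
```

With the toy data, 0x2E2F is added via Lm and removed again via
Pattern_Syntax, so it does not end up in the result. A version tag
would simply be one more line that the parser checks before it
trusts the entry.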
I don't know why there need to be megabytes of duplicated data.
Ach; and i'm not gonna start to dream of better support for ISO
C / POSIX character classes. (Oh. ...It's surely sapless.)
Ciao,
--steffen
Received on Fri Apr 25 2014 - 08:06:46 CDT