Re: Just an observation from Steffen on 2013-08-06 (Unicode Mail List Archive)

From: Steffen <sdaoden_at_gmail.com>
Date: Tue, 06 Aug 2013 20:54:01 +0200

"Whistler, Ken" <ken.whistler_at_sap.com> wrote:
|Steffen Daode Nurpmeso continued:
|
|> Hmm. To me, this raises the question why these constraints were
|> introduced at all. Imho either one adds constraints due to solid
|> considerations, and enforces them after some period of backward
|> compatibility, or there simply should be no constraints.
|
|What you are talking about in the notes about the case mapping
|fields in UnicodeData.txt do not really constitute constraints, but
|rather are attempts to clearly document what the nature of the
|data is. The Unicode Consortium does maintain true constraints
|on various aspects of the data files: those are generally referred
|to as the "stability guarantees" or the stability policy:
|
|http://www.unicode.org/policies/stability_policy.html
|
|See also:
|
|http://www.unicode.org/policies/property_value_stability_table.html
|
|There is no stability policy (yet) regarding the titlecase field in particular,
|although there could be, I suppose, if the Unicode Technical Committee
|(and the Unicode Consortium officers) decided there was a good enough
|reason to add one.
|
|In the meantime, the Unicode Technical Committee also runs various
|tests on the UCD for each release checking what are termed
|"invariants", to look for possible problems when adding new repertoire
|or changing properties for existing characters. Some of those
|invariants are the subject of stability policies and *must* be honored
|when changing the UCD. Others are simply existing patterns (like
|the relationship between the titlecase mapping and the uppercase
|mapping) which are checked to look for inadvertent introduction
|of bonehead errors in the data.
|
|>
|> There are parsers (i know of one) which use *only* UnicodeData.txt
|> for generating tables (using patterns like `SPACE' etc. to join
|> characters into sets; which seems to have been common practice in
|> the past -- as in [3], „Case Mappings“: „derivable from the
|> presence of the terms "CAPITAL" or "SMALL" in the character
|> name“).
|
|That is very bad practice, and should be avoided. The UCD documentation
|warns against making assumptions about character properties based
|only on character names. It leads to many bad results.

I haven't seen those warnings yet. (I have the 3.0 book, but at
that time we decided that Unicode was beyond what we needed /
wanted to support; i actually came back with 6.0 or so, but so far
haven't read the whole book from beginning to end.)
But i assumed it would be like that; the name-based sets are known
not to cover all alphabetics, for example.
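
To illustrate the difference, a minimal sketch (Python, standard
library only; this is not the code of either parser mentioned in
this thread). The sample lines are typed in the UnicodeData.txt
layout for illustration and should be checked against the real
file; the names parse_line and guess_from_name are made up for this
example. It compares a guess based on "CAPITAL" / "SMALL" in the
character name with the General_Category field of the same line.

```python
# Sample lines in the UnicodeData.txt layout (15 semicolon-separated fields).
# Assumption: typed in for illustration; verify against the real file.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
01C5;LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON;Lt;0;L;<compat> 0044 017E;;;;N;LATIN LETTER CAPITAL D SMALL Z HACEK;;01C4;01C6;01C5
1D00;LATIN LETTER SMALL CAPITAL A;Ll;0;L;;;;;N;;;;;
""".splitlines()


def parse_line(line):
    """Pick the fields of interest out of one UnicodeData.txt-style line."""
    f = line.split(";")
    return {
        "cp": int(f[0], 16),   # code point
        "name": f[1],          # character name
        "gc": f[2],            # General_Category
        "upper": f[12],        # simple uppercase mapping (may be empty)
        "lower": f[13],        # simple lowercase mapping (may be empty)
        "title": f[14],        # simple titlecase mapping (may be empty)
    }


def guess_from_name(name):
    """Hypothetical name-based heuristic, shown only to demonstrate its limits."""
    words = name.split()
    has_capital = "CAPITAL" in words
    has_small = "SMALL" in words
    if has_capital and has_small:
        return None            # the name gives contradictory hints
    if has_capital:
        return "Lu"
    if has_small:
        return "Ll"
    return None


records = [parse_line(line) for line in SAMPLE]
mismatches = [r for r in records if guess_from_name(r["name"]) != r["gc"]]

for r in mismatches:
    print("U+%04X %s: name guess %r, data says %s"
          % (r["cp"], r["name"], guess_from_name(r["name"]), r["gc"]))
```

On this sample the name-based guess fails for U+01C5 and U+1D00,
whereas the fields give Lt and Ll directly.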

|> If there is no such extensive guaranteed backward compatibility
|> for UnicodeData.txt content already today then that should be
|> noted (i wouldn't know where that is true?), but otherwise it
|> cannot be that labour-intensive to drop these constraints again,
|> since nothing had to be done at all?
|> I.e., are these parsers already broken today?
|> Just curious…
|
|Parsers which deduce properties based on character names are
|definitely broken -- and that would include any case mapping information.

That is hard. Once my project is ready enough to provide the
(hopefully correct) data, i'll run some tests against data created
by the mentioned one, and will reply with some comparison results!

|As regards actual constraints, please refer to the stability policies to
|see what the Unicode Consortium officially claims to be required
|constraints on data changes.
|
|And if the odd edge cases for parsing the legacy data files (and
|UnicodeData.txt is the ur-data file with the most legacy status)
|seem problematical, the ultimate fix is just to refer to the UCD in XML:
|
|http://www.unicode.org/Public/UCD/latest/ucdxml/
|
|which has a fully rationalized and regular structure, well documented
|in UAX #42.

Well, thanks for the pointer; i had not discovered that yet.
I personally would favour JSON, since it could be parsed pretty
easily even with awk(1), whereas XML is somewhat hard to handle
with a basic Unix / POSIX installation, and that is what my
personal project is (or will be) based upon. (I.e., users should
be able to update the Unicode version, and it should simply work
after recompiling the library, except for ambiguities in the
standard, e.g., setting a visual width of 1 for SOFT HYPHEN even
though it is a Cf, and possibly some adjustments needed for newly
added code points.)
So i hope to be able to manage it based on the plain text version
instead…
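
A minimal sketch of how the plain text route might look (Python;
the override table, the crude width rule and all function names are
assumptions made up for this example, not part of the standard or
of any existing library). It pairs the <..., First> / <..., Last>
range lines of UnicodeData.txt and keeps local decisions such as
the SOFT HYPHEN width in a separate table, so that a new data file
can be dropped in without touching them. The sample lines are typed
in for illustration, and the end of the CJK range differs between
Unicode versions.

```python
# Sample lines in the UnicodeData.txt layout.
# Assumption: typed in for illustration; verify against the real file.
SAMPLE = """\
00AD;SOFT HYPHEN;Cf;0;BN;;;;;N;;;;;
200B;ZERO WIDTH SPACE;Cf;0;BN;;;;;N;;;;;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FCC;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
""".splitlines()

# Hypothetical local policy, kept apart from the generated data.
WIDTH_OVERRIDES = {0x00AD: 1}   # SOFT HYPHEN: width 1 although it is Cf


def iter_entries(lines):
    """Yield (first_cp, last_cp, general_category), pairing First/Last lines."""
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        f = line.split(";")
        cp, name, gc = int(f[0], 16), f[1], f[2]
        if name.endswith(", First>"):
            range_start = cp
        elif name.endswith(", Last>"):
            if range_start is None:
                raise ValueError("range end without start at U+%04X" % cp)
            yield range_start, cp, gc
            range_start = None
        else:
            yield cp, cp, gc


def width(cp, gc):
    """Crude illustrative rule: overrides first, then 0 for Cf/Mn/Me, else 1."""
    if cp in WIDTH_OVERRIDES:
        return WIDTH_OVERRIDES[cp]
    return 0 if gc in ("Cf", "Mn", "Me") else 1


entries = list(iter_entries(SAMPLE))
for first, last, gc in entries:
    print("U+%04X..U+%04X %s width=%d" % (first, last, gc, width(first, gc)))
```

Keeping the overrides in their own table means that an updated
UnicodeData.txt only changes the generated part.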

Thanks!

|--Ken

--steffen
Received on Tue Aug 06 2013 - 13:58:48 CDT
