On Thu, Mar 10 2016 at 22:40 CET, kenwhistler@att.net writes:

[...]

The *reason* that NamesList.txt exists at all is to drive the tool, unibook, that formats the full Unicode code charts for posting. It is only posted in the Unicode Character Database at all as a matter of convenience, to give people access to a text-only version of the names list that appears in the fully formatted pdf versions of the code charts that contain all the representative glyphs. NamesList.txt should *not* be data mined.

I've just noticed that NamesList.txt is in a sense data mined by the Unicode Consortium itself. I mean the "Unicode Utilities: Character Properties", which e.g. for LATIN SMALL LETTER P WITH FLOURISH (http://unicode.org/cldr/utility/character.jsp?a=A753) displays in particular

subhead: Medievalist addition

Am I right that this information is available only in NamesList.txt?
In my opinion this is important information and should be officially available for character data mining engines.
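
To make concrete what "lifting" a subhead out of the file involves, here is a minimal Python sketch. It assumes the commonly described NamesList.txt layout: block headers start with "@@" and a tab, subheaders start with a single "@" followed by one or more tabs, and character entries start with a hexadecimal code point and a tab. The sample text is a hand-written excerpt in that style, not a copy of the real file, and the function name `subheads` is hypothetical.

```python
import re

# Hand-written excerpt in the assumed NamesList.txt layout (illustrative only).
SAMPLE = (
    "@@\tA720\tLatin Extended-D\tA7FF\n"
    "@\t\tMedievalist additions\n"
    "A752\tLATIN CAPITAL LETTER P WITH FLOURISH\n"
    "A753\tLATIN SMALL LETTER P WITH FLOURISH\n"
    "\t* example annotation line\n"
)

BLOCK = re.compile(r"^@@\t")                 # block header: resets the subhead
SUBHEAD = re.compile(r"^@\t+(.*)$")          # subheader: "@" then tab(s), then text
CHAR = re.compile(r"^([0-9A-F]{4,6})\t")     # character entry: code point, tab, name


def subheads(lines):
    """Map each code point to the most recent subheader seen in its block."""
    current = None
    result = {}
    for line in lines:
        line = line.rstrip("\n")
        if BLOCK.match(line):
            current = None  # a new block starts without a subhead
            continue
        m = SUBHEAD.match(line)
        if m:
            current = m.group(1)
            continue
        m = CHAR.match(line)
        if m:
            result[int(m.group(1), 16)] = current
    return result


table = subheads(SAMPLE.splitlines())
print(table.get(0xA753))
```

The sketch only tracks the nearest preceding subheader; it says nothing about whether that label is applied consistently across blocks.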
There's an additional reason why we discourage the kind of data
mining that treats these as if they were character properties:
just because they are easy to lift out of the file doesn't mean
that they represent information that is more useful than, for
example, information contained in the discussion of the script or
character block in the text of the core specification.
If you seriously wanted to present "all that is known about a
character" you would need to excerpt all mentions of it in the
core specification, as well as (potentially) any additional
details presented in the version of the proposal document that was
approved by the UTC as part of encoding the character. (That is in
addition to every explicit or implicit mention in the text of a UAX
that is not already covered by a formal character property.)
The reason nobody provides such a comprehensive summary, although
perhaps they should, is that the way the information is presented
in the core specification is, while equally useful(!), simply not
formatted in a way that makes data mining easy.
If you take a shortcut, and only present the information that's
easy to scrape, you are not necessarily doing your users any
service.
It's not quite garbage-in/garbage-out, because the subheaders were
selected with some care and in some cases will give users a
necessary or useful hint. But that comes at the cost of misleading
the same users into assuming that these hints are supplied
consistently and uniformly, which they are not. And by ignoring the
discussion in the core specification, you leave out a lot of more
useful and often more important information.
Best regards
Janusz
This archive was generated by hypermail 2.2.0 : Sat Mar 26 2016 - 23:39:56 CDT