On 3/12/2016 10:55 PM, Janusz S. "Bień" wrote:
> In fact, the possibility of reuse in this context is probably among the
> unstated rationales for making the information and syntax available in
> the first place.
I understand there is no intention to make an official XML version of
the file as it would require changes in Unibook?
In principle, the tooling that the editorial committee maintains could
be modified to write out some XML version of the information. It's only
software. By the same token, someone could in principle write a new
parser for Unibook that can read the XML.
Both would consume a significant amount of resources, for absolutely no
gain when it comes to the core purpose: the production of the code
charts.
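For a sense of what the "write out some XML" half would look like for a
single entry, here is a minimal Python sketch. The element and attribute
names (char, cp, name, alias) and the dict layout are made up for
illustration; they do not correspond to any official schema or to the
editorial committee's actual tooling.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-parsed entry; the field names are illustrative only.
entry = {
    "cp": "018D",
    "name": "LATIN SMALL LETTER TURNED DELTA",
    "aliases": ["reversed Polish-hook o"],
}


def entry_to_xml(entry):
    """Serialize one names-list entry to a made-up XML shape."""
    char = ET.Element("char", cp=entry["cp"], name=entry["name"])
    for alias in entry["aliases"]:
        ET.SubElement(char, "alias").text = alias
    return ET.tostring(char, encoding="unicode")


xml_text = entry_to_xml(entry)
```

The serialization itself is the small part; the cost sits in every tool
upstream and downstream that would have to change along with it.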
In fact, the work would not end there, because the code chart process
requires the use of some namelist-aware tools for draft preparation. All
of these would have to be converted to the new format as well.
Finally, Unibook relies on auxiliary files that provide font selection
and configuration data. Logically, the smart thing would be to convert
all of them to XML, or JSON, or whatever the structured data description
format du jour is.
Looked at from a practical perspective, by those involved in doing the
work of creating the code charts and issuing new versions of the
Standard, it's a non-starter.
There are explanations about character use that are only maintained in
the PDF of the core specification, where this information is packaged
in a way that can be understood by a human reader, but is not amenable
to extraction by machine.
While the annotations, comments, cross references etc. in NamesList.txt
appear, formally, to be machine extractable, the way they are created
and managed makes them just as much "human-accessible" only as the core
specification.
I'm afraid it's not clear to me. Let's take an example. Some time ago I
inquired about a controversial alias for U+018D:
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0014.html
Can I really find anything about "reversed Polish-hook o" in the core
specification which is not a literal copy of the information from
NamesList.txt?
This comment referred to the facts that (a) the names list is not
exhaustive and that (b) it is perfectly OK to have information that's
not intended for machine-parsing.
Information intended for machine-parsing has a certain amount of
structure and consistency, so that when a data table is built from it,
the consuming program can rely on the fact that it will cover some
aspect of character identity or behavior in a systematic way.
Well, not all possible information is systematized that way. Some
information requires being interpreted by a human reader; the fact that
the information is not buried in running text, but shows up in "fields"
in a list, doesn't make it systematized in the same way as case mapping,
decomposition or other property data.
You might as well have a tool that extracts snippets from the core
specification. All fine, if your goal is, for example, to present all
bits of text mentioning a certain code point (search engines will do
some of that extraction for you).
However, even after extraction, the data is still just as unstructured
as before, and, while useful to a human reader, doesn't constitute a
formal character property. That's the whole reason why we go to the
trouble of defining so clearly what is and isn't a character property
(see UAX #44).
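To make this concrete, here is a short Python sketch of such an
extraction, run on an abbreviated sample in the NamesList.txt layout
(the name line for U+018D plus the alias mentioned earlier in this
thread). The function name, the dict layout and the labels in KINDS are
illustrative assumptions, not part of any Unicode tooling, and only
three of the annotation markers are handled.

```python
import re

# Abbreviated sample in the NamesList.txt layout: a name line, then
# tab-indented annotation lines. Only the alias from this thread is shown.
SAMPLE = (
    "018D\tLATIN SMALL LETTER TURNED DELTA\n"
    "\t= reversed Polish-hook o\n"
)

# Leading markers of some annotation lines; the labels are illustrative.
KINDS = {"=": "alias", "*": "comment", "x": "cross_reference"}

NAME_LINE = re.compile(r"^([0-9A-F]{4,6})\t(.+)$")


def extract(text):
    """Collect (kind, text) pairs for each code point in the sample."""
    entries = {}
    current = None
    for line in text.splitlines():
        m = NAME_LINE.match(line)
        if m:
            current = {"name": m.group(2), "notes": []}
            entries[m.group(1)] = current
        elif line.startswith("\t") and current is not None:
            body = line[1:]  # drop the leading tab
            kind = KINDS.get(body[:1])
            if kind and body[1:2] == " ":
                current["notes"].append((kind, body[2:]))
        else:
            current = None  # headers, notices, blank lines end the entry
    return entries


entries = extract(SAMPLE)
```

What comes back is one tagged string per annotation. The tag records
which kind of field the text sat in; the text itself is still free-form
and has to be read by a person to mean anything.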
A./
Received on Sun Mar 13 2016 - 22:33:13 CDT