NamesList.txt as data source
asmusf at ix.netcom.com
Thu Mar 10 20:13:21 CST 2016
On 3/10/2016 5:49 PM, "J. S. Choi" wrote:
> One thing about NamesList.txt is that, as far as I have been able to tell, it’s the only machine-readable, parseable source of those annotations and cross-references.
There are explanations about character use that are only maintained in
the PDF of the core specification, where this information is packaged in
a way that can be understood by a human reader, but is not amenable to
machine extraction.
While the annotations, comments, cross references etc. in NamesList.txt
appear, formally, to be machine extractable, the way they are created
and managed makes them just as much "human-accessible" only as the core
specification.
The goal of getting a complete and machine-readable description of
character behavior is illusory.
> As part of the Unicode Standard and the UCD, the name lists’ annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification’s chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people—i.e., using screen-reader-friendly HTML rather than PDF—while making clear that the annotations are merely references to the original, normative Standard’s actual code charts and name lists.
This is a different issue. NamesList.txt is a reasonable source for
driving _formatting_ programs other than just Unibook. In fact, the
possibility of reuse in this context is probably among the unstated
rationales for making the information and syntax available in the first
place.
Let's understand this properly: using the file to translate it into a
"human-readable" output format is a proper use of this data, even if
that translation is done using a mechanical tool, as long as the format is
a) a format that benefits from the special shortcuts taken in selecting
the information present in the NamesList.txt file,
b) a format intended to be interpreted by an observant and intelligent
human reader, and not
c) a format intended as direct input to any text-processing algorithm,
or any algorithm that "understands" the contents.
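
To make the distinction concrete, here is a minimal sketch of the kind of formatting use described above: a mechanical translation of name-list entries into HTML meant for a human reader (for example via a screen reader), not for an algorithm. It assumes the commonly described line layout (a code point, a TAB, and a name on character lines; a TAB plus a one-character marker on annotation lines) and handles only three markers. The function name, the marker labels, and the sample lines are all made up for illustration and are not taken from the actual file.

```python
import html

# Assumed meanings for a few annotation markers; the real file has more.
MARKERS = {"=": "alias", "*": "note", "x": "see also"}


def names_list_to_html(text):
    """Render name-list style entries as an HTML definition list."""
    out = ["<dl>"]
    seen_entry = False
    for line in text.splitlines():
        # This sketch skips blank lines, "@" header lines and ";" comments.
        if not line or line[0] in "@;":
            continue
        if line[0] != "\t":
            # Character line: CODEPOINT <TAB> NAME
            code, _, name = line.partition("\t")
            out.append(f"<dt>U+{html.escape(code)} {html.escape(name)}</dt>")
            seen_entry = True
        elif seen_entry:
            # Annotation line: <TAB> MARKER <SPACE> free text
            body = line.strip()
            marker, _, rest = body.partition(" ")
            label = MARKERS.get(marker)
            if label is None:
                # Unknown marker: pass the text through untouched.
                label, rest = "annotation", body
            out.append(f"<dd>{label}: {html.escape(rest)}</dd>")
    out.append("</dl>")
    return "\n".join(out)


# Invented sample input, shaped like the file but not copied from it.
SAMPLE = (
    "0041\tLATIN CAPITAL LETTER A\n"
    "\t* illustrative note\n"
    "\tx (illustrative cross reference - 0061)\n"
)

if __name__ == "__main__":
    print(names_list_to_html(SAMPLE))
```

Note that the annotation text is only escaped and passed through. Nothing here tries to interpret it, which keeps the output on the "human reader" side of the line drawn in (b) and (c).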
> What are these other primary sources that maintain these other annotation data; are they publicly available? If the name list is the only place where these sources’ data have been published, then, for better or for worse, the name list is all that is available for much information on many code points’ usage.
See my first through third paragraphs.