NamesList.txt as data source (was: Re: Gaps in Mathematical Alphanumeric Symbols)
"J. S. Choi"
js_choi at icloud.com
Thu Mar 10 19:49:52 CST 2016
> On Mar 10, 2016, at 3:40 PM, Ken Whistler <kenwhistler at att.net> wrote:
> On 3/10/2016 1:00 PM, Andrew West wrote:
>> It (http://www.unicode.org/Public/UNIDATA/NamesList.txt) is
>> machine-readable, although the file specifically warns that "this file
>> should not be parsed for machine-readable information".
> NamesList.txt is just a structured text file, so of course it is "machine-readable".
> The problem is that because it is machine-readable, people tend to jump
> to the conclusion that all the information they need can simply be
> reliably parsed out of that file.
> It can't be.
> The reason is that NamesList.txt is itself the result of a complicated merge
> of code point, name, and decomposition mapping information from
> UnicodeData.txt, of listings of standardized variation sequences from
> StandardizedVariants.txt, and then a very long list of annotational
> material, including names list subhead material, etc., maintained in
> other sources.
> If people actually want to get reliably parsed data on code points, names,
> and decomposition mappings, they should get that directly from
> UnicodeData.txt. Likewise for information about standardized variation
> sequences, from StandardizedVariants.txt.
> The *reason* that NamesList.txt exists at all is to drive the tool, unibook,
> that formats the full Unicode code charts for posting. It is only
> posted in the Unicode Character Database at all as a matter of
> convenience, to give people access to a text only version of the
> names list that appears in the fully formatted pdf versions of the code charts
> that contain all the representative glyphs.
> NamesList.txt should *not* be data mined. Well, nobody can stop
> people from attempting to do so, of course, but they tend to end
> up confused and disappointed, because their assumptions going in
> don't match the editorial realities that affect the development of
> the annotational content added to the names list and the actual
> use for which NamesList.txt was created in the first place.
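As a minimal sketch of reading code points, names, and decomposition mappings from the primary source rather than from NamesList.txt, the following Python parses a few sample records in the semicolon-separated layout of UnicodeData.txt. The names `Record` and `parse_unicode_data` are hypothetical, and the three sample lines stand in for the real file; range markers such as "<CJK Ideograph, First>" are not handled here.

```python
from typing import NamedTuple, Optional, Tuple

# Three sample records in the semicolon-separated layout of UnicodeData.txt.
# Field 0 is the code point, field 1 the name, field 5 the decomposition mapping.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
00BD;VULGAR FRACTION ONE HALF;No;0;ON;<fraction> 0031 2044 0032;;;1/2;N;FRACTION ONE HALF;;;;
00C0;LATIN CAPITAL LETTER A WITH GRAVE;Lu;0;L;0041 0300;;;;N;LATIN CAPITAL LETTER A GRAVE;;;00E0;
"""


class Record(NamedTuple):  # hypothetical container, not part of any Unicode tool
    code_point: int
    name: str
    decomposition_tag: Optional[str]  # e.g. "<fraction>"; None if canonical or absent
    decomposition: Tuple[int, ...]    # mapped code points; empty if there is no mapping


def parse_unicode_data(text):
    """Parse UnicodeData.txt-style lines into a list of Record tuples."""
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        tag = None
        mapping = []
        # The decomposition field is an optional <tag> followed by hex code points.
        for token in fields[5].split():
            if token.startswith("<"):
                tag = token
            else:
                mapping.append(int(token, 16))
        records.append(Record(int(fields[0], 16), fields[1], tag, tuple(mapping)))
    return records


records = parse_unicode_data(SAMPLE)
for r in records:
    targets = " ".join(f"{c:04X}" for c in r.decomposition) or "(none)"
    print(f"U+{r.code_point:04X} {r.name}: {r.decomposition_tag or 'canonical'} {targets}")
```

To run this against the real data, the same function could be given the text of a locally downloaded UnicodeData.txt instead of `SAMPLE`.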
> On Mar 10, 2016, at 7:05 PM, Asmus Freytag (t) <asmus-inc at ix.netcom.com> wrote:
> On 3/10/2016 2:14 PM, Doug Ewell wrote:
>> Ken Whistler wrote:
>>> NamesList.txt should *not* be data mined.
>> And yet it was the only Unicode data file utilized by MSKLC.
>> There are many possible reasons for this approach, which we will
>> probably never know.
> Extracting information from NamesList.txt that was added to that file based on information from the UCD is plain folly - not least because it uses a secondary source instead of a primary source. What may not have come across from Ken's description is that the process for incorporating this data is under editorial control - and some values or entries may be suppressed for readability. There is explicitly no guarantee of completeness.
> There is some information that *only* exists in the NamesList.txt file. This includes informal aliases for character names, cross references, etc. The problem with extracting this information blindly (that is, not mediated by a human) is, again, that the level of consistency of presentation is that appropriate for a human reader, not for an extraction algorithm.
> For example, to reduce clutter, cross references are not symmetric or transitive, even though the relationship that gave rise to the cross reference in the first place (e.g. similarity) would normally be one that is symmetric and transitive. The human reader can be trusted to determine that, for example, "<" is the "main" entry and that from there all the other, same or similar characters are referenced, but by not listing the reverse direction everywhere, the level of clutter in the rest of the names list is reduced, making additional cross references stand out more.
> Those are just the intentional inconsistencies.
> There is a historical development in the annotations - over time, more characters get annotated. However, annotations are not always backported, so the level of annotations can be inconsistent for reasons of incremental development.
> Now, for the x-refs on gaps, a human reader could extract and verify the set, but relying blindly on an algorithm to extract the data is fraught with peril. (Other gaps may have slightly different origin and status, yet also carry an annotation).
> Using the mathematical data files for this is a step up, because the data there is focused on a single use case. The downside is that the information is in a comment field.
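To make the point about non-symmetric cross references concrete, here is a rough Python sketch that collects cross references from names-list-style text and reports the ones that have no reverse entry. Everything about the input is an assumption for illustration: the sample is hand-written, its cross references are invented rather than copied from NamesList.txt, and the line shape (tab, "x", space, then a parenthesized name and code point) is only my reading of the file's layout. The function names are hypothetical.

```python
import re
from collections import defaultdict

# Hand-written sample in the general shape of names-list entries.
# The cross references here are invented for illustration only.
SAMPLE = (
    "003C\tLESS-THAN SIGN\n"
    "\tx (single left-pointing angle quotation mark - 2039)\n"
    "\tx (left angle bracket - 3008)\n"
    "2039\tSINGLE LEFT-POINTING ANGLE QUOTATION MARK\n"
    "\tx (less-than sign - 003C)\n"
    "3008\tLEFT ANGLE BRACKET\n"
)

ENTRY = re.compile(r"^([0-9A-F]{4,6})\t")                   # "CODE<tab>NAME"
XREF = re.compile(r"^\tx .*?([0-9A-F]{4,6})\)?\s*$")        # "<tab>x (name - CODE)"


def cross_references(text):
    """Map each entry's code point to the set of code points it cross-references."""
    xrefs = defaultdict(set)
    current = None
    for line in text.splitlines():
        entry = ENTRY.match(line)
        if entry:
            current = int(entry.group(1), 16)
            continue
        xref = XREF.match(line)
        if xref and current is not None:
            xrefs[current].add(int(xref.group(1), 16))
    return xrefs


def one_way(xrefs):
    """Return (source, target) pairs whose target does not refer back to the source."""
    return sorted(
        (source, target)
        for source, targets in xrefs.items()
        for target in targets
        if source not in xrefs.get(target, set())
    )


for source, target in one_way(cross_references(SAMPLE)):
    print(f"U+{source:04X} -> U+{target:04X} has no reverse cross reference")
```

A check like this only shows where the listing is one-directional. It cannot tell whether an omission was an editorial choice to reduce clutter or an annotation that was never backported, which is why a human still has to review the extracted set.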
One thing about NamesList.txt is that, as far as I have been able to tell, it’s the only machine-readable, parseable source of those annotations and cross-references.
As part of the Unicode Standard and the UCD, the name lists’ annotations and cross-references contain much useful data on the intended usage of characters and code points beyond the core specification’s chapters. I have long held an interest in making the name-list data more universally accessible to the general public, especially to visually impaired people—i.e., using screen-reader-friendly HTML rather than PDF—while making clear that the annotations are merely references to the original, normative Standard’s actual code charts and name lists.
What are the other primary sources in which these annotation data are maintained, and are they publicly available? If the name list is the only place where these sources’ data have been published, then, for better or for worse, the name list is all that is available for much information on many code points’ usage.
J. S. Choi