On 3/10/2016 2:14 PM, Doug Ewell wrote:
Ken Whistler wrote:
NamesList.txt should *not* be data mined.
And yet it was the only Unicode data file utilized by MSKLC.
There are many possible reasons for this approach, which we will
probably never know.
Extracting information from NamesList.txt that was added to that file
based on information from the UCD is plain folly - not least because
it uses a secondary source instead of a primary source. What may not
have come across from Ken's description is that the process for
incorporating this data is under editorial control - and some values
or entries may be suppressed for readability. There is explicitly
no guarantee of completeness.
There is some information that exists *only* in the NamesList.txt
file. This includes informal aliases for character names, cross
references, etc. The problem with extracting this information
blindly (that is, not mediated by a human) is, again, that the level
of consistency of presentation is the one appropriate for a human
reader, not for an extraction algorithm.
For example, to reduce clutter, cross references are not symmetric
or transitive, even though the relationship that gave rise to the
cross reference in the first place (e.g. similarity) would normally
be one that is symmetric and transitive. The human reader can be
trusted to determine that, for example "<" is the "main" entry
and that from there all the other, same or similar characters are
referenced, but by not listing the reverse direction everywhere, the
level of clutter in the rest of the nameslist is reduced, making
additional cross references stand out more.
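
To make the asymmetry concrete, here is a minimal Python sketch. The sample
table is illustrative only, not copied from the file, and the helper names
(is_listed, related_groups) are hypothetical. It shows that a lookup in the
reverse direction finds nothing, and that recovering the full set of related
characters means imposing symmetry and transitivity yourself, which is an
assumption a human would still need to verify.

```python
from collections import defaultdict

# Illustrative sample: cross references as an extractor might collect them,
# listed only in the direction "main entry -> related characters".
SAMPLE_XREFS = {
    0x003C: [0x2039, 0x2329, 0x27E8, 0x3008],
}

def is_listed(xrefs, source, target):
    """True only if the cross reference is listed in this exact direction."""
    return target in xrefs.get(source, [])

def related_groups(xrefs):
    """Group code points, ASSUMING the relation is symmetric and transitive.

    This is a plain union-find over the listed pairs. The assumption is the
    extractor's, not the file's: nothing in the listing guarantees it.
    """
    parent = {}

    def find(cp):
        parent.setdefault(cp, cp)
        while parent[cp] != cp:
            parent[cp] = parent[parent[cp]]  # path compression
            cp = parent[cp]
        return cp

    for source, targets in xrefs.items():
        find(source)
        for target in targets:
            parent[find(target)] = find(source)

    groups = defaultdict(set)
    for cp in parent:
        groups[find(cp)].add(cp)
    return list(groups.values())

if __name__ == "__main__":
    print(is_listed(SAMPLE_XREFS, 0x003C, 0x2039))  # forward direction
    print(is_listed(SAMPLE_XREFS, 0x2039, 0x003C))  # reverse direction
    for group in related_groups(SAMPLE_XREFS):
        print(sorted(f"{cp:04X}" for cp in group))
```

With this sample the forward lookup succeeds and the reverse one does not,
while the grouping step puts all five code points into one set. Whether that
set is actually right is exactly the judgment the file leaves to the reader.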
Those are just the intentional inconsistencies.
There is a historical development in the annotations - over time,
more characters get annotated. However, annotations are not always
backported, so the level of annotations can be inconsistent for
reasons of incremental development.
Now, for the x-refs on gaps, a human reader could extract and verify
the set, but relying blindly on an algorithm to extract the data is
fraught with peril. (Other gaps may have a slightly different origin
and status, yet also carry an annotation.)
Using the mathematical data files for this is a step up, because the
data there is focused on a single use case. The downside is that the
information is in a comment field.
A./
Received on Thu Mar 10 2016 - 19:06:33 CST