Public Review Issues

Accumulated Feedback on PRI #486

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Sun Dec 10 13:41:57 CST 2023
ReportID: ID20231210134157
Name: Manuel Strehl
Report Type: Public Review Issue
Opt Subject: 486


Hello!

I’m the author of codepoints.net, a non-commercial Unicode explainer and
visualizer site online since 2012. I have handled Unicode data in various
formats over the years, from the text files in the Public folder to data
files from CLDR to exports from Unicode’s GitHub repository, both for
private projects and commercially for customers.

The XML data file defined in UAX #42 and published under

https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip 

is in my experience the most concise representation of the data defined in
the Unicode Standard with regard to single code point information. While
it is technically possible to parse the plain-text data files and extract
information from there on a file-by-file basis (something I have already
done for customers in the past for some of the files), this is tedious work,
partly due to subtle format differences between some of the files.

It also carries the risk of overlooking some properties when working with
the files on a regular basis. That risk comes from the mix of normative and
informative content in the Public/UCD/latest/ucd/ folder, which is great if
what one wants is to get an overview of what the data consists of.

For actually working with the data this is suboptimal, though. It leads to
lots of special cases when trying to bring the data into some structured
form: which files to ignore, which properties to search for, and so on.
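
As a rough illustration of why the XML representation is convenient, here is
a minimal Python sketch that reads per-code-point properties from a UCD XML
document. The sample string is hand-written in the style of
ucd.all.flat.xml and is not real file content; real char elements carry many
more attributes. The function name read_chars is made up for this example.

```python
import xml.etree.ElementTree as ET

# XML namespace used by UAX #42 documents.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hand-written sample in the style of ucd.all.flat.xml (assumption: heavily
# abbreviated, only four attributes per character).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn"/>
    <char cp="0391" na="GREEK CAPITAL LETTER ALPHA" gc="Lu" sc="Grek"/>
  </repertoire>
</ucd>"""


def read_chars(xml_text):
    """Return {code point: attribute dict} for every <char> with a cp attribute."""
    root = ET.fromstring(xml_text)
    chars = {}
    for elem in root.iter(UCD_NS + "char"):
        cp = elem.get("cp")
        if cp is None:
            # Ranges use first-cp/last-cp instead of cp; skipped in this sketch.
            continue
        chars[int(cp, 16)] = dict(elem.attrib)
    return chars


chars = read_chars(SAMPLE)
print(chars[0x41]["gc"], chars[0x391]["sc"])
```

Every property of a code point sits in one attribute dictionary, so no
per-file special cases are needed.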

Now, of course this means that someone had to do the structuring work
beforehand to get the data into XML shape. I am very grateful to the
Consortium for doing this all these years and providing the XML files in
various formats on its website.

Therefore it was quite a shock to read PRI 486 about sunsetting the update
of said XML files. Having them fixed at 15.1 in the future means basically
that they become instantly useless for me.

I acknowledge the non-trivial work that goes into providing them and that
someone has to do maintenance and updates. What I want to stress with
this comment is simply that solving this conundrum between additional work
and ease of use by simply ending UAX #42 might do a disservice to the
Unicode Standard, since one of the easiest formats to incorporate all the
standard’s code point data into other software will be gone.

I’d like to suggest a different solution, other than unsustainably
carrying on as before. It might be worthwhile to try to spread the work
across more shoulders. Given that Unicode already develops some of its tools
in the open on GitHub, one possibility would be to make parts of the
workflow that produces the XML files open source and place them in a
dedicated public repository.

There are two possible positive outcomes from doing so. One is that
third-party developers might issue pull requests to keep the tools up to
date with new Unicode versions, allowing the Consortium to still produce
XML files for new versions of the Unicode Standard.

Another possibility is that the tools develop into a standardized way to
convert the Standard’s plain-text files into structured formats. This work
could then possibly be expanded to more formats and incorporate other
projects. A concrete project I have in mind that might be interested
is the Node Unicode Data project under

https://github.com/node-unicode/node-unicode-data 

that provides Unicode data in a JavaScript-importable format.

I dearly hope that the important structure that UAX #42 gives to Unicode
data will not simply vanish, and that the situation can instead be turned
into an opportunity to provide even better and easier access to Unicode’s
data for more people, something that I have tried to offer non-technical
audiences with codepoints.net for more than 10 years.

—Manuel Strehl

Date/Time: Wed Dec 13 04:45:06 CST 2023
ReportID: ID20231213044506
Name: Daniel Bünzli
Report Type: Public Review Issue
Opt Subject: 486

Hello, 

I'm the author of various libraries [1] collectively implementing
pieces of the Unicode standard for the OCaml language, kept
up to date in a timely manner with its evolution since 2012.

At the root of this work is the uucd [2] library which provides an API
to extract and expose data from the Unicode character database from the 
UAX42 representation in order to generate efficient data structures for 
the data.

It goes without saying that the prospect of having to parse the bunch
of disparate and loosely specified text files provided by UAX44, with
default values embedded into comments, is not very enticing, to say the
least.
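
To make the "default values embedded into comments" point concrete, here is
a minimal Python sketch that reads a two-field UCD-style text file and has
to treat "# @missing:" comment lines as data. The sample text is a
hand-written excerpt in the style of Blocks.txt, and the function name parse
is made up for this example. The sketch assumes the simplest layout (range,
semicolon, value); it does not cover the variations found across the other
UCD files.

```python
import re

# Hand-written excerpt in the style of a UCD text file such as Blocks.txt.
SAMPLE = """\
# Blocks (illustrative excerpt)
# @missing: 0000..10FFFF; No_Block

0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
"""

# Default values live inside comments, so comments cannot simply be discarded.
MISSING_RE = re.compile(
    r"^#\s*@missing:\s*([0-9A-F]+)\.\.([0-9A-F]+)\s*;\s*(.+?)\s*$"
)


def parse(text):
    """Return (defaults, records) as lists of (first, last, value) tuples."""
    defaults, records = [], []
    for line in text.splitlines():
        m = MISSING_RE.match(line)
        if m:
            defaults.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
            continue
        data = line.split("#", 1)[0].strip()  # drop ordinary comments
        if not data:
            continue
        rng, value = [field.strip() for field in data.split(";", 1)]
        first, _, last = rng.partition("..")
        # A single code point has no "..", so last falls back to first.
        records.append((int(first, 16), int(last or first, 16), value))
    return defaults, records


defaults, records = parse(SAMPLE)
print(defaults)
print(records)
```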

But it's not only that. It should also be stressed that besides
providing a clean, uniform and simple (modulo the grouping mechanism)
access to all UCD character properties and other non per-code point
properties (e.g. named sequences), UAX42 is a huge time saver and a
unique resource for following the evolution of the Unicode character
database since:

1. It centralizes the actual *types* for each of the properties in a single 
   and readily available place [3].
2. The modification section [4] of the annex carefully chronicles the
   evolution of the types and the introduction of the new properties. 
   Something that is nowhere to be found in the horrible mess that UAX44 is.
   
The fact that I'm able to support new Unicode releases in a timely
manner on a volunteer basis (except for Unicode 15.{0,1}.0 which were
paid for by the OCaml Software Foundation) relies entirely on that.

Now I'm in no way attached to XML or UAX42's representation but I
think it would be quite an embarrassment for the Unicode Consortium to
simply drop this careful work without providing something equivalent
that allows implementers to work in an efficient manner. Something that,
it should be stressed, is in no way provided by UAX44.

This could be as simple as having simple RFC 4180 CSV files. One for
the repertoire with 1'114'112 lines (one per code point) and one property
per column, one for the blocks, one for the named sequences etc. *and* a
clear description of the type of each of the columns and their evolution
on each new version. More ambitiously, you could even contemplate an
sqlite3 file (which is recommended by the Library of Congress as a format
for long term archival) that provides different tables with all the
information that can be found in the ucdxml. I wouldn't mind moving to
something else as long as it is clean and its evolution is carefully
documented as UAX42 was.
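
A minimal Python sketch of the sqlite3 idea, under stated assumptions: the
table names (repertoire, property_types), the column layout and the type
labels are invented for illustration and are not an existing Unicode
deliverable. Only three code points and three properties are loaded.

```python
import sqlite3

# Illustrative rows: (code point, Name, General_Category, Script).
ROWS = [
    (0x0030, "DIGIT ZERO", "Nd", "Zyyy"),
    (0x0041, "LATIN CAPITAL LETTER A", "Lu", "Latn"),
    (0x0391, "GREEK CAPITAL LETTER ALPHA", "Lu", "Grek"),
]

# Hypothetical metadata table describing the type of each property column.
TYPES = [
    ("na", "miscellaneous"),
    ("gc", "enumeration"),
    ("sc", "catalog"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE repertoire (cp INTEGER PRIMARY KEY, na TEXT, gc TEXT, sc TEXT)"
)
conn.execute("CREATE TABLE property_types (name TEXT PRIMARY KEY, type TEXT)")
conn.executemany("INSERT INTO repertoire VALUES (?, ?, ?, ?)", ROWS)
conn.executemany("INSERT INTO property_types VALUES (?, ?)", TYPES)

# One property per column makes per-property queries trivial.
uppercase = [
    cp for (cp,) in conn.execute(
        "SELECT cp FROM repertoire WHERE gc = 'Lu' ORDER BY cp"
    )
]
print([f"U+{cp:04X}" for cp in uppercase])
```

Keeping the type descriptions in a table next to the data means a consumer
can discover the column types from the file itself.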

Honestly, if I had to freeze something it would rather be UAX44's text
files swamp. Perhaps making ICU development depend on UAX42's
representation could be the way forward here?

Best,

Daniel

P.S. Note that I'm unlikely to be unique in this case. While it seems
Debian never provided the ucdxml in a dedicated package (see [5], which
requests it), you will find that various packages have a copy of
it (e.g. [6]). Be careful about what you break; it's highly unlikely you
will hear about all these people on this PRI before you actually break
them. You may want to have a look at these [7] queries on that popular
code hosting platform.

[1]: https://erratique.ch/software#unicode 
[2]: https://erratique.ch/software/uucd 
[3]: https://www.unicode.org/reports/tr42/#d1e2882 
[4]: https://www.unicode.org/reports/tr42/#Modifications 
[5]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1021334 
[6]: https://sources.debian.org/src/angular.js/1.8.3-1/i18n/ucd/src/ 
[7]: https://github.com/search?q=ucd.all.flat.xml&type=code 
     https://github.com/search?q=ucd.all.grouped.xml&type=code

Date/Time: Thu Dec 14 16:11:56 CST 2023
ReportID: ID20231214161156
Name: asmus
Report Type: Public Review Issue
Opt Subject: 486

A critical use case for external specifications is the fact that UAX#42
not only chooses the "short" alias for properties and values, but does so
in a stable way, whereas the PropertyValueAliases and PropertyAliases are
subject to changes in capitalization etc. that are within the loose matching
envelope.
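
The difference between loose matching and a stable key can be sketched in
Python as follows. This is a simplified approximation of the loose matching
described in UAX #44 (case, whitespace, underscores and hyphens are
ignored); the handling of an initial "is" prefix is deliberately left out.
The ALIASES table is a tiny hand-picked subset and the function names are
made up for this example.

```python
# Short alias -> long alias, a hand-picked subset in the spirit of
# PropertyAliases.txt.
ALIASES = {
    "gc": "General_Category",
    "lb": "Line_Break",
    "sc": "Script",
}


def loose_key(name):
    """Fold case and drop whitespace, underscores and hyphens (simplified)."""
    return "".join(
        ch for ch in name.lower() if not ch.isspace() and ch not in "_-"
    )


# Map every loosely folded spelling to the one short alias used as the key.
LOOKUP = {}
for short, long_name in ALIASES.items():
    LOOKUP[loose_key(short)] = short
    LOOKUP[loose_key(long_name)] = short


def resolve(name):
    """Return the short alias for a loosely spelled property name, or None."""
    return LOOKUP.get(loose_key(name))


print(resolve("line break"), resolve("LINE-BREAK"), resolve("General_Category"))
```

A consumer that needs an exact, stable identifier (for example in a DTD)
cannot run this folding step and has to rely on one fixed spelling.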

In addition, aliases may be augmented by new aliases (sometimes because of
corrections). While the old aliases are not removed, they may be moved to a
different position on the line. It is therefore not possible to use these
files for *stable* keys as they would be needed for DTDs or similar use
cases.

There's at least one IETF specification that normatively references UAX#42
for that purpose, and like UAX#42 it is an XML data format that needs to be
able to /identify/ Unicode properties and values in a stable way (but does
not need to provide a listing of the actual property data).

Identifying a stable set of keys that do not require loose matching is one
feature that is unique to UAX#42 and cannot be replaced by accessing the
original UCD. If UAX#42 is to be retired, this functionality should be
replaced and linked from the page that documents the stabilization of
UAX#42.

Date/Time: Tue Jan 02 06:43:32 CST 2024
ReportID: ID20240102064332
Contact: duerst@it.aoyama.ac.jp
Name: Martin Dürst
Report Type: Public Review Issue
Opt Subject: 486

I have read the comments from Manuel Strehl, Daniel Bünzli, and Asmus
Freytag on this issue.

Based on my experience with regularly updating property data for the
programming language Ruby, I fully agree with what they write. My comments
are in addition based on the experience of a year-long project with a
student where we tried to automate extraction of Unicode property data and
metadata.

The various legacy file formats that are used to publish the Unicode
property data are a real pain to work with. I understand that 30 years ago,
files had to be compact, and that old, established file conventions are
better left unchanged. But in this day and age of daily video consumption on the
Internet, and taking generic file compression into account, the volume of
Unicode property data should no longer be that much of a concern. But
rather than a move to flatter file formats, there seems to be a continued
tendency by some people at the Unicode Consortium to prefer what are now
just cute shortcuts over straightforward simplicity. Shortcuts and
compression tricks should be left to library implementers, which in many
cases will use optimized binary formats anyway.

Also, when Unicode got invented, the idea of generic file formats was still
in its infancy. Now we can choose from XML, JSON, CSV, and a few others.
Having the data available in just one of these formats is a big help and
avoids a lot of the overhead of dealing with all the special cases in the
current Unicode data file 'formats', and with subtle changes from version
to version. Moving to a generic data format for all Unicode property data
should be the long-term future direction for the Unicode Consortium. This
would also reduce the dependency on knowledge that some of the Unicode
old-timers have but that will sooner or later unfortunately be lost.

In addition to providing the actual data in a streamlined form, the Unicode
Consortium should also provide metadata (property types,...) in a
streamlined form. The schema in UAX #42, the PropertyAliases.txt and
PropertyValueAliases.txt files, and Table 9 in UAX #44 currently come
closest, but are still a far cry from what would be possible. I already
proposed this about five years ago in an Internationalization and Unicode
Conference talk. I remember very well that Mark Davis welcomed this idea
then. It's unfortunate that nothing much has apparently happened in the
meantime.

Similar to other commenters, I'm ready to help, e.g. by contributing on
GitHub or somewhere.

In conclusion, rather than giving up on Unicode in XML and TR 42, Unicode
should think seriously about its long-term strategy to make its data much
more streamlined and accessible.

P.S.: Please also make the links from the PRIs
(https://www.unicode.org/review/pri486/ for this PRI) to this form fill in
the PRI number automatically.

Date/Time: Tue Jan 02 13:14:32 CST 2024
ReportID: ID20240102131432
Contact: bob_hallissy@sil.org
Name: Bob Hallissy
Report Type: Public Review Issue
Opt Subject: 486

I concur with many previous comments already posted, in particular
that "freezing" UAX 42 makes it immediately useless to us and other
developers who depend on it -- and I am confident there are more such
developers than just those who have commented on this PRI to date.

Please find an alternative approach (and a few have been suggested in these
comments).

- Bob Hallissy, SIL

Feedback above this line reviewed during UTC #178 in January 2024.