This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Sun Dec 10 13:41:57 CST 2023
ReportID: ID20231210134157
Name: Manuel Strehl
Report Type: Public Review Issue
Opt Subject: 486
Hello! I’m the author of codepoints.net, a non-commercial Unicode explainer and visualizer site on-line since 2012. I have handled Unicode data in various formats over the years, from the text files in the Public folder to data files from CLDR to exports from Unicode’s GitHub repository, both for private projects and commercially for customers. The XML data file defined in UAX #42 and published under https://www.unicode.org/Public/UCD/latest/ucdxml/ucd.all.flat.zip is in my experience the most concise representation of the data defined in the Unicode Standard with regard to single code point information. While it is technically possible to parse the plain-text data files and extract information from them on a file-by-file basis (something I have already done for customers in the past for some of the files), this is tedious work, partly due to subtle format differences between some of the files. It also carries the risk of overlooking some properties when working with the files on a regular basis. The latter comes from the mix of normative and informal content in the Public/UCD/latest/ucd/ folder, which is great if what one wants is to get an overview of what the data consists of. For actually working with the data this is suboptimal, though. It leads to lots of special cases when trying to bring the data into some structured form: which files to ignore, which properties to search for, and so on. Now, of course this means that someone had to do the structuring work beforehand to get the data into XML shape. I am very grateful to the consortium for doing this all these years and providing the XML files in various formats on their website. Therefore it was quite a shock to read PRI 486 about sunsetting the update of said XML files. Having them fixed at 15.1 in the future basically means that they instantly become useless for me. I acknowledge the non-trivial work that goes into providing them and that someone has to do maintenance and updates.
What I want to stress with this comment is simply that resolving this conflict between additional work and ease of use by ending UAX #42 might do a disservice to the Unicode Standard, since one of the easiest formats for incorporating all the standard’s code point data into other software will be gone. I’d like to suggest a different solution, other than unsustainably carrying on as before. It might be worthwhile to try to spread the work across more shoulders. Given that Unicode already develops some of its tools in the open on GitHub, it could be possible to make parts of the workflow that produces the XML files open source and place them in a dedicated public repository. There are two possible positive outcomes from doing so. One is that third-party developers might issue pull requests to keep the tools up to date with new Unicode versions, allowing the consortium to still produce XML files for new versions of the Unicode Standard. Another possibility is that the tools develop into a standardized way to convert the Standard’s plain-text files into structured formats. This work could then possibly be expanded to more formats and incorporate other projects. A concrete project I have in mind that might be interested is the Node Unicode Data project under https://github.com/node-unicode/node-unicode-data, which provides Unicode data in a JavaScript-importable format. I dearly hope that the important structure that UAX #42 gives to Unicode data will not simply vanish, but that the situation can instead be turned into providing even better and easier access to Unicode’s data for more people, something that I have tried to offer non-technical audiences with codepoints.net for more than 10 years. - Manuel Strehl
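To illustrate why the flat XML file is convenient for per-code-point lookups, here is a minimal sketch in Python. It is not part of the submitted feedback. The sample document is hand-written in the shape of ucd.all.flat.xml (real entries carry many more attributes), and the function name `load_properties` is hypothetical.

```python
import xml.etree.ElementTree as ET

# XML namespace used by UAX #42 documents.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

# Hand-written sample in the shape of ucd.all.flat.xml (assumption: heavily
# abbreviated; the published file has many more attributes per element).
SAMPLE = """<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
  <repertoire>
    <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn"/>
    <char cp="0061" na="LATIN SMALL LETTER A" gc="Ll" sc="Latn"/>
    <char first-cp="3400" last-cp="3402" na="CJK UNIFIED IDEOGRAPH-#" gc="Lo" sc="Hani"/>
  </repertoire>
</ucd>"""


def load_properties(xml_text):
    """Return {code point: {short property alias: value}} for all <char> elements."""
    root = ET.fromstring(xml_text)
    table = {}
    for char in root.iter(UCD_NS + "char"):
        attrs = dict(char.attrib)
        if "cp" in attrs:
            # Single code point.
            first = last = int(attrs.pop("cp"), 16)
        else:
            # Range of code points sharing the same property values.
            first = int(attrs.pop("first-cp"), 16)
            last = int(attrs.pop("last-cp"), 16)
        for cp in range(first, last + 1):
            table[cp] = attrs  # entries in a range share one dict
    return table


props = load_properties(SAMPLE)
print(props[0x41]["gc"])
print(props[0x3401]["sc"])
```

Every property is an attribute keyed by its short alias, so a single pass over one file yields all values for a code point, without per-file special cases.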
Date/Time: Wed Dec 13 04:45:06 CST 2023
ReportID: ID20231213044506
Name: Daniel Bünzli
Report Type: Public Review Issue
Opt Subject: 486
Hello, I'm the author of various libraries [1] collectively implementing pieces of the Unicode standard for the OCaml language, kept up to date in a timely manner with its evolution since 2012. At the root of this work is the uucd [2] library, which provides an API to extract and expose data from the Unicode character database from the UAX42 representation in order to generate efficient data structures for the data. It goes without saying that the prospect of having to parse the bunch of disparate and loosely specified text files provided by UAX44, with default values embedded into comments, is not very enticing, to say the least. But it's not only that. It should also be stressed that besides providing clean, uniform and simple (modulo the grouping mechanism) access to all UCD character properties and other non-per-code-point properties (e.g. named sequences), UAX42 is a huge time saver and a unique resource for following the evolution of the Unicode character database since: 1. It centralizes the actual *types* for each of the properties in a single and readily available place [3]. 2. The modification section [4] of the annex carefully chronicles the evolution of the types and the introduction of the new properties. Something that is nowhere to be found in the horrible mess that UAX44 is. The fact that I'm able to support new Unicode releases in a timely manner on a volunteer basis (except for Unicode 15.{0,1}.0 which were paid for by the OCaml Software Foundation) relies entirely on that. Now I'm in no way attached to XML or UAX42's representation but I think it would be quite an embarrassment for the Unicode Consortium to simply drop this careful work without providing something equivalent that allows implementers to work in an efficient manner. Something that, it should be stressed, is in no way provided by UAX44. This could be as simple as having simple RFC 4180 CSV files.
One for the repertoire with 1'114'111 lines and one property per column, one for the blocks, one for the named sequences etc. *and* a clear description of the type of each of the columns and their evolution on each new version. To be fancier, you could even contemplate an sqlite3 file (which is recommended by the Library of Congress as a format for long term archival) that provides different tables with all the information that can be found in the ucdxml. I wouldn't mind moving to something else as long as it is clean and its evolution is carefully documented as UAX42 was. Honestly, if I had to freeze something it would rather be UAX44's text files swamp. Perhaps making ICU development depend on UAX42's representation could be the way forward here? Best, Daniel
P.S. Note that I'm unlikely to be unique in this case. While it seems Debian never provided the ucdxml in a dedicated package (see [5], which requests it), you will find that various packages have a copy of it (e.g. [6]). Be careful what you break; it's highly unlikely you will hear from all these people on this PRI before you actually break them. You may want to have a look at these [7] queries on that popular code hosting platform.
[1]: https://erratique.ch/software#unicode
[2]: https://erratique.ch/software/uucd
[3]: https://www.unicode.org/reports/tr42/#d1e2882
[4]: https://www.unicode.org/reports/tr42/#Modifications
[5]: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1021334
[6]: https://sources.debian.org/src/angular.js/1.8.3-1/i18n/ucd/src/
[7]: https://github.com/search?q=ucd.all.flat.xml&type=code https://github.com/search?q=ucd.all.grouped.xml&type=code
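The CSV and sqlite3 alternatives suggested in the comment above could look roughly like the following Python sketch. It is not part of the submitted feedback. The column selection (`cp`, `na`, `gc`, `sc`), the table name `repertoire` and the function names are assumptions made for illustration; a real file would carry one column per property.

```python
import csv
import io
import sqlite3

# Illustrative columns only (assumption): code point, name, general category, script.
COLUMNS = ("cp", "na", "gc", "sc")
ROWS = [
    ("0041", "LATIN CAPITAL LETTER A", "Lu", "Latn"),
    ("0061", "LATIN SMALL LETTER A", "Ll", "Latn"),
    ("0416", "CYRILLIC CAPITAL LETTER ZHE", "Lu", "Cyrl"),
]


def to_csv(rows):
    """Serialize rows as comma-separated text with a header line."""
    buf = io.StringIO()
    writer = csv.writer(buf)  # default dialect: comma-separated, CRLF line endings
    writer.writerow(COLUMNS)
    writer.writerows(rows)
    return buf.getvalue()


def to_sqlite(rows):
    """Load the same rows into an in-memory SQLite table with typed columns."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE repertoire (cp INTEGER PRIMARY KEY, na TEXT, gc TEXT, sc TEXT)"
    )
    conn.executemany(
        "INSERT INTO repertoire VALUES (?, ?, ?, ?)",
        [(int(cp, 16), na, gc, sc) for cp, na, gc, sc in rows],
    )
    return conn


csv_text = to_csv(ROWS)
conn = to_sqlite(ROWS)
uppercase = [
    row[0]
    for row in conn.execute("SELECT cp FROM repertoire WHERE gc = 'Lu' ORDER BY cp")
]
print(csv_text)
print([f"{cp:04X}" for cp in uppercase])
```

In both layouts the column types are declared in one place, which is the kind of documented, uniform structure the comment asks for.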
Date/Time: Thu Dec 14 16:11:56 CST 2023
ReportID: ID20231214161156
Name: asmus
Report Type: Public Review Issue
Opt Subject: 486
A critical use case for external specifications is the fact that UAX#42 not only chooses the "short" alias for properties and values, but does so in a stable way, whereas PropertyValueAliases and PropertyAliases are subject to changes in capitalization etc. that are within the loose matching envelope. In addition, aliases may be augmented by new aliases (sometimes because of corrections). While the old aliases are not removed, they may be moved to a different position on the line. It is therefore not possible to use these files for *stable* keys as they would be needed for DTDs or similar use cases. There's at least one IETF specification that normatively references UAX#42 for that purpose, and like UAX#42 it is an XML data format that needs to be able to /identify/ Unicode properties and values in a stable way (but does not need to provide a listing of the actual property data). Identifying a stable set of keys that do not require loose matching is one feature that is unique to UAX#42 and cannot be replaced by accessing the original UCD. If UAX#42 is to be retired, this functionality should be replaced and linked from the page that documents the stabilization of UAX#42.
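The difference between a stable key and a loosely matched alias can be sketched in Python as follows. This is not part of the submitted feedback. `loose_key` is a simplified approximation of loose matching (it ignores only case, whitespace, underscores and hyphens; the full rule in UAX #44 has further details not modeled here), and the second alias line is a hypothetical respelling invented for illustration.

```python
def loose_key(name):
    """Simplified loose-matching key: drop case, whitespace, '_' and '-'."""
    return "".join(
        ch for ch in name.lower() if not ch.isspace() and ch not in "_-"
    )


def fields(line):
    """Split a 'property ; short alias ; long alias' line into stripped fields."""
    return [field.strip() for field in line.split(";")]


# Line in the shape used by PropertyValueAliases.txt, plus a HYPOTHETICAL
# variant that differs only in capitalization.
line_v1 = "sc ; Latn ; Latin"
line_v2 = "sc ; latn ; Latin"  # hypothetical, for illustration only

short_v1 = fields(line_v1)[1]
short_v2 = fields(line_v2)[1]

# A DTD or similar schema compares keys exactly, so the two spellings differ...
exact_match = short_v1 == short_v2
# ...although they are the same alias under loose matching.
loose_match = loose_key(short_v1) == loose_key(short_v2)

print(exact_match, loose_match)
```

A consumer that needs exact keys cannot rely on a source whose spellings are only guaranteed up to loose matching, which is why a fixed list of keys matters for this use case.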
Date/Time: Tue Jan 02 06:43:32 CST 2024
ReportID: ID20240102064332
Contact: duerst@it.aoyama.ac.jp
Name: Martin Dürst
Report Type: Public Review Issue
Opt Subject: 486
I have read the comments from Manuel Strehl, Daniel Bünzli, and Asmus Freytag on this issue. Based on my experience with regularly updating property data for the programming language Ruby, I fully agree with what they write. My comments are in addition based on the experience of a year-long project with a student where we tried to automate extraction of Unicode property data and metadata. The various legacy file formats that are used to publish the Unicode property data are a real pain to work with. I understand that 30 years ago, files had to be compact, and that old, established file conventions better not be changed. But in this day and age of daily video consumption on the Internet, and taking generic file compression into account, the volume of Unicode property data should no longer be that much of a concern. But rather than a move to flatter file formats, there seems to be a continued tendency by some people at the Unicode consortium to prefer what are now just cute shortcuts over straightforward simplicity. Shortcuts and compression tricks should be left to library implementers, which in many cases will use optimized binary formats anyway. Also, when Unicode got invented, the idea of generic file formats was still in its infancy. Now we can choose from XML, JSON, CSV, and a few others. Having the data available in just one of these formats is a big help and avoids a lot of the overhead of dealing with all the special cases in the current Unicode data file 'formats', and with subtle changes from version to version. Moving to a generic data format for all Unicode property data should be the long term future direction for the Unicode consortium. This would also reduce the dependency on knowledge that some of the Unicode old-timers have but that will sooner or later unfortunately be lost. In addition to providing the actual data in a streamlined form, the Unicode Consortium should also provide metadata (property types,...) in a streamlined form. 
The schema in UAX #42, the PropertyAliases.txt and PropertyValueAliases.txt files, and Table 9 in UAX #44 currently come closest, but are still a far cry from what would be possible. I already proposed this about five years ago in an Internationalization and Unicode Conference talk. I remember very well that Mark Davis welcomed this idea then. It's unfortunate that nothing much has apparently happened in the meantime. Similar to other commenters, I'm ready to help, e.g. by contributing on GitHub or somewhere. In conclusion, rather than giving up on Unicode in XML and TR 42, Unicode should think seriously about its long-term strategy to make its data much more streamlined and accessible. P.S.: Please also make the links from the PRIs (https://www.unicode.org/review/pri486/ for this PRI) to this form fill in the PRI number automatically.
Date/Time: Tue Jan 02 13:14:32 CST 2024
ReportID: ID20240102131432
Contact: bob_hallissy@sil.org
Name: Bob Hallissy
Report Type: Public Review Issue
Opt Subject: 486
I concur with many previous comments already posted, in particular that "freezing" UAX 42 makes it immediately useless to us and other developers who depend on it -- and I am confident there are more such developers than just those who have commented on this PRI to date. Please find an alternative approach (and a few have been suggested in these comments). - Bob Hallissy, SIL
Feedback above this line reviewed during UTC #178 in January 2024.