Unicode Common Locale Data Repository: Bug Tracking

CLDR Ticket #11239 (new)

Opened 5 months ago

Last modified 2 days ago

CLDR Data stability guideline is inappropriate and should be discarded

Reported by: Marcel Schneider <charupdate@…>
Owned by: fredrik
Component: docs-site
Data Locale: All
Phase: dsub
Review:
Weeks:
Data Xpath: http://cldr.unicode.org/translation#TOC-Data-stability


CLDR guidelines mention data stability as more desirable than optimal data:

“Data stability

Please be mindful of data stability by carefully reviewing previously Approved data. When it's clearly incorrect, it should be changed — but for data stability, don't change the field it is already acceptable (even if not optimal). When you have an evidence of a variant being much better and in customary use than the existing Approved data, use the Forum to bring up discussions and gain consensus to change Approved values.”

Found at: http://cldr.unicode.org/translation#TOC-Data-stability, as of the v34 survey, though inherited from prior surveys (old guideline).

The meaning of “acceptable” is subject to interpretation. As a rule of thumb, in user interfaces, data that is not optimal is unacceptable. From an end‐user point of view, data that is suboptimal although it could easily be corrected (as now, in a CLDR survey round) reflects badly on the software brand that uses it, and damages the corporate image of vendors that appear unable or unwilling to have obsolete or otherwise wrong data corrected. This is commonly perceived as mindlessness, carelessness, a cavalier attitude, and flippancy on the one hand, and as being old‐fashioned, out of date, belated, and backward‐minded on the other.

To encourage maintenance of suboptimal data in CLDR is ultimately to deceive the software vendors that are paying for it through their Unicode membership and through appointing delegated staff members to the project.

From that it becomes clear that this misplaced stability policy aims at nothing more than keeping unsuspecting end‐users under an illusion of accuracy, while those end‐users who are aware of wrong or suboptimal data are dismissed as a small minority (an assumption that has yet to be assessed).

In short, CLDR data that is stable but suboptimal may be considered a front for hypocrisy.

Therefore, I kindly request that this poorly designed stability policy be discarded.


Change History

comment:1 Changed 4 weeks ago by mark

  • Milestone changed from UNSCH to to-assess

comment:2 Changed 7 days ago by mark

  • Component changed from survey-backend to docs-site

Wrong component

comment:3 follow-up: ↓ 4 Changed 7 days ago by mark

Our concern is that optimal is in the eye of the beholder, resulting in back-and-forths between X and Y, and instability across versions with gratuitous differences that cause real pain for implementers.

Where X would be clearly viewed as better than Y by essentially all users of the language, that is different.

comment:4 in reply to: ↑ 3 Changed 7 days ago by Marcel Schneider <charupdate@…>

Replying to mark:

Our concern is that optimal is in the eye of the beholder, resulting in back-and-forths between X and Y, and instability across versions with gratuitous differences that cause real pain for implementers.

Where X would be clearly viewed as better than Y by essentially all users of the language, that is different.

Good point, indeed.

I was thinking of things like the need to remove the circumflex accents on dévanâgarî in fr_FR, because they were nothing but a workaround at a time of limited support for diacritics. Today in fr_FR we may write either dévanagari, devanagari, or devanāgarī. The last is scientific, the second is widely accepted, but only the first conforms to the official spelling redesigned in 1990 and backed by the administration in 2016. Only this spelling meets with a positive response among a wide variety of users, e.g. in discussions on Wikipedia, while (if I understand correctly) the unaccented form is preferred mainly by those following the remedial policy intended to repair the damage caused by long-lasting colonial overfrenchification. (See e.g. the difference in the level of adaptation to locale spelling in French vs Breton, Polish, Czech, Hungarian.) This policy demands the rejection of all French accents on foreign names. So the circumflexes have support neither from hardliners (who reject all French accents) nor from legalists (who support ordinary French spelling, which allows correct pronunciation without extra knowledge), and are therefore almost universally rejected. As for the acute accent, it is required for the sake of integrating these names into French culture. As for the circumflex accents on these names, they survive mainly in the French translation of the Unicode Code Charts, which was produced solely by a locale that does not have any issue with mainland colonial overfrenchification. I do not recommend that CLDR back these accents for use in UIs in the fr_FR locale. They may well stay in use in the fr_CA locale in CLDR, if they really are still widely preferred in that locale (this needs to be assessed). What I’ve written above fully applies to the fr_FR locale.

Now we still find new-to-Unicode names like Nandinagari translated to nandinâgarî in fr, but only French standardizers are holding that line, merely to avoid changes rather than to seek consensus with real end-users. Changes are made to French character names in places where people are not expected to look, while poorly designed names such as espace insécable étroite (literally translated from English) instead of espace fine insécable (the real French term used in practice and mirrored in the Unicode Standard) are retained even when the whole list is updated and many changes are implemented, as happened in the wake of Unicode 10.0.0.

When I say “optimal”, I usually mean optimal from the end-user point of view. E.g. using <NNBSP> is the only way of getting working data, and is industry practice in France, while replacing <NNBSP> with <NBSP> yields messy data of the kind that may be “thrown together in Word”: poor quality. If somebody advocates that, I cannot believe it is done sincerely and with the end-user’s real interest in mind. Missing font support is not considered a valid argument.

Alternatively, one could implement spacing in CLDR separately by defining several levels of quality, much as the percent sign has its placeholder in the number patterns but is rendered in a locale-tailored way, e.g. with a leading <NNBSP> in standard French text, or with a leading <NBSP> in poor typography but without a .notdef box in outdated fonts. However, such differing support levels are not found in CLDR, where the only such distinction is the coverage level (please see ticket:11524). I’d suggest that CLDR maintain correct data, and that vendors downgrade it as needed. That was the idea. CLDR should not present legacy workarounds as the standard way of writing in a locale. And indeed it no longer does, since <NNBSP> is now the group separator in French; however, this is also the rule of the International System of Units, and should therefore be implemented in all locales that use a space as a group separator.
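To illustrate the "correct data upstream, vendor downgrades as needed" idea, here is a minimal Python sketch. It does not use any CLDR or ICU API; the function names group_digits and downgrade_spacing are hypothetical, and the only assumption about the data is that the group separator is U+202F (NNBSP), as described above.

```python
# Code points discussed in this ticket.
NNBSP = "\u202f"  # NARROW NO-BREAK SPACE
NBSP = "\u00a0"   # NO-BREAK SPACE


def group_digits(n: int, sep: str = NNBSP) -> str:
    """Group the digits of an integer in threes, using `sep` as separator.

    Stands in for locale data that specifies NNBSP as the group separator.
    """
    return f"{n:,}".replace(",", sep)


def downgrade_spacing(text: str) -> str:
    """Hypothetical vendor-side fallback for fonts lacking U+202F.

    Replaces every NNBSP with NBSP; the upstream data stays untouched.
    """
    return text.replace(NNBSP, NBSP)


formatted = group_digits(1234567)        # digits separated by U+202F
fallback = downgrade_spacing(formatted)  # same digits, separated by U+00A0
```

The point of the design is that the downgrade is a one-way, mechanical step a vendor can apply locally, whereas recovering NNBSP from data that only contains NBSP is not possible without extra knowledge.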

Your caveat should be added to the documentation, shouldn’t it?

comment:5 Changed 3 days ago by Marcel Schneider <charupdate@…>

In fact, the guideline as it stands clearly distinguishes when to change from when not to change. My complaint about misplaced stability enforcement was probably triggered by some misuse.

On the other hand, experience shows that the stability policy is readily overridden when supposedly better reasons prompt it, e.g. when preformatted superscripts were replaced with baseline ordinal indicators in the RBNF rules of a number of languages (ticket:11626 and ticket:11653). With an entirely wrong rationale and possibly without seeking feedback from communities, Unicode support was downgraded for the sake of a bunch of unprofessional font designs.

One should just revert that abusive change.

Alternatively the RBNF rules can be restored, but just for English one would need to reach out to the communities wrt American vs European usage, and sort out the sublocales wrt preferences about ordinal indicators (superscript vs lining/baseline).

That seems to me a textbook example of uninformed change, and no CLDR guideline is strong enough to prevent people from messing with a number of Latin-script-using locales.

Suggested fix

The fix would probably be to make sure the CLDR guidelines are correctly applied.

comment:6 Changed 2 days ago by mark

  • Owner changed from anybody to fredrik

Setting person for initial assessment according to https://unicode.org/cldr/trac/admin/ticket/components

(IMO This would need discussion)

