Feedback on CLDR JSON and encoding crucial data only in keys

Ben Hamilton beng at
Tue Nov 3 10:36:09 CST 2015

Will do, thanks! I'll file two separate issues, since they're unrelated.

Sent from Outlook

On Mon, Nov 2, 2015 at 10:32 PM -0800, "Mark Davis ☕️" <mark at> wrote:

I suggest that you file this as a bug, and we can discuss in the meeting.

For #1, the knottiest issue is

"dayperiod": {
  "displayName": "AM/PM",
  "displayName-alt-variant": "am/pm"
}

We've wrestled with this. As I recall, we considered fleshing it out to be something like:

"dayperiod": {
  "displayName": {
    "plain": "AM/PM",
    "variant": "am/pm"
  }
}

But because 'alt' could potentially go on every leaf node, that would require adding a level (and "plain") for essentially every leaf node. (And where alt can go on non-leaf nodes, we'd have to work that in also.) But we could explore some ideas.
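
To make the trade-off concrete, here is a minimal Python sketch of the nested form described above. The helper name nest_alt_keys is hypothetical and not part of any CLDR tooling, and it only handles "-alt-" suffixes on leaf strings. Note that every leaf gains a "plain" level, which is exactly the cost being discussed.

```python
def nest_alt_keys(node):
    """Fold "name-alt-x" siblings into {"plain": ..., "x": ...} maps.

    Hypothetical helper: alt markers on non-leaf nodes are not handled.
    """
    marker = "-alt-"
    nested = {}
    for key, value in node.items():
        if isinstance(value, dict):
            # Non-leaf node: recurse and keep the key unchanged.
            nested[key] = nest_alt_keys(value)
            continue
        base, sep, alt = key.partition(marker)
        slot = nested.setdefault(base, {})
        # Keys without an alt suffix land under the assumed "plain" label.
        slot[alt if sep else "plain"] = value
    return nested


flat = {"dayperiod": {"displayName": "AM/PM",
                      "displayName-alt-variant": "am/pm"}}
nested = nest_alt_keys(flat)
```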

For #2, we could probably go to a simpler format for JSON. We could look at space-delimited strings, maybe with a special sequence for ranges, that would be easy to parse.
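
As a rough illustration of how little parsing such a format would need, here is a Python sketch. The format itself is an assumption made up for this example: items are separated by single spaces and a range is written as two single characters joined by "..". Nothing here is defined by CLDR, and escaping of literal spaces is left out.

```python
RANGE_MARK = ".."  # hypothetical range sequence, not defined by CLDR


def parse_char_list(text):
    """Expand a space-delimited character list into a list of strings."""
    items = []
    for token in text.split(" "):
        if not token:
            continue  # tolerate repeated spaces
        start, mark, end = token.partition(RANGE_MARK)
        if mark and len(start) == 1 and len(end) == 1:
            # Range such as "a..e": expand to every code point in between.
            items.extend(chr(cp) for cp in range(ord(start), ord(end) + 1))
        else:
            # Anything else is taken literally, including multi-character items.
            items.append(token)
    return items


expanded = parse_char_list("a..e ch , ; .")
```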


On Mon, Nov 2, 2015 at 10:25 AM, Ben Hamilton <beng at> wrote:
Hi folks,

I'm working on a server to allow arbitrary queries of slices of CLDR data using the GraphQL protocol.

While working with the fully resolved CLDR JSON data, I noticed a few design decisions that complicate building a structured object model (required by GraphQL) to represent it:

1) Crucial LDML data is often encoded only in JSON keys, requiring clients to parse keys to extract them

For example, number formats (e.g. from main/root/numbers.json) require parsing the keys to know the range of values to which the format should be applied:

"decimalFormat": {
  "1000-count-other": "0K",
  "10000-count-other": "00K",
  "100000-count-other": "000K",
  "1000000-count-other": "0M",
  ...
}

If I wanted to build an object model to represent this, I'd need to know that the keys of this dictionary include three pieces of data separated by "-" and write a parser which understands the meaning of each section.
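
For illustration, the kind of key parser this forces on every client might look like the Python sketch below. The names CompactPatternKey and parse_compact_key are made up for this example, and the field meanings are my reading of the sample keys above rather than anything the JSON itself states.

```python
from typing import NamedTuple


class CompactPatternKey(NamedTuple):
    number_type: int       # leading number, e.g. 1000
    attribute: str         # literal middle field, "count" in the sample
    plural_category: str   # trailing field, e.g. "other"


def parse_compact_key(key):
    """Split a key like "1000-count-other" into its three fields."""
    number_type, attribute, category = key.split("-")
    return CompactPatternKey(int(number_type), attribute, category)


parsed = parse_compact_key("1000000-count-other")
```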

This becomes much more complicated when dealing with dateFields.json, which includes keys with particularly complex encodings. From main/root/dateFields.json:

"sat-narrow": {
  "relative-type--1": "last Sa",
  "relative-type-0": "this Sa",
  "relative-type-1": "next Sa"
}

"dayperiod": {
  "displayName": "AM/PM",
  "displayName-alt-variant": "am/pm"
}

For this, I need to know that the "-" separators have multiple meanings, and might be present (or not), and could act either as a field separator, or as a negation operation in front of a number.
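
A Python sketch of what a client has to do to tell these cases apart, using only the key shapes shown in the excerpt. The function name describe_key and the returned field names are assumptions for this example; real dateFields.json may contain key shapes this does not cover.

```python
import re

# "relative-type-" followed by an optionally negative integer.
RELATIVE_KEY = re.compile(r"^relative-type-(-?\d+)$")
# "<name>-alt-<variant>", where "-" acts as a field separator.
ALT_KEY = re.compile(r"^(?P<name>.+)-alt-(?P<alt>.+)$")


def describe_key(key):
    """Classify a key and pull out the data encoded in it."""
    match = RELATIVE_KEY.match(key)
    if match:
        # Here the second "-" in "--1" is a minus sign, not a separator.
        return {"kind": "relative", "offset": int(match.group(1))}
    match = ALT_KEY.match(key)
    if match:
        return {"kind": "alt", "name": match.group("name"),
                "alt": match.group("alt")}
    return {"kind": "plain", "name": key}


last = describe_key("relative-type--1")
alt = describe_key("displayName-alt-variant")
```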

I think we can keep the keys as-is as opaque unique identifiers, but the values should be more structured. A map with separate fields for the meanings of each item in the key (plus the original value) would be great. The original XML format does this pretty well; I think we can do that in the JSON without too much trouble.
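
To show the shape I have in mind, here is a Python sketch that rewrites the decimalFormat excerpt so that the keys stay as opaque identifiers and the values become maps. The field names "type", "count", and "value" are assumptions chosen for this example, not an agreed format.

```python
import json


def structure_decimal_formats(flat):
    """Turn {"1000-count-other": "0K"} style entries into structured values."""
    structured = {}
    for key, pattern in flat.items():
        number_type, _attribute, plural_category = key.split("-")
        structured[key] = {  # the key is kept unchanged as an opaque identifier
            "type": int(number_type),
            "count": plural_category,
            "value": pattern,
        }
    return structured


sample = {"1000-count-other": "0K", "1000000-count-other": "0M"}
structured = structure_decimal_formats(sample)
print(json.dumps(structured, indent=2))
```

With values like these, a client can read the fields directly and never needs to look inside the key.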

2) Much of the LDML data is represented as serialized UTS #35 UnicodeSet objects, which requires deserializing them to understand the underlying meaning

For example, main/root/characters.json includes:

"characters": {
  "exemplarCharacters": "[]",
  "auxiliary": "[]",
  "punctuation": "[\\- , ; \\: ! ? . ( ) \\[ \\] \\{ \\}]",
  ...
}

This means every program which wants to interact with this data needs to include a UTS #35 UnicodeSet deserializer (or forward the raw patterns on to the client with the assumption that it will include a UnicodeSet deserializer).
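
To give a sense of the work involved, here is a Python sketch of a deserializer for a small subset of the syntax: a single bracketed set with ignorable spaces, backslash-escaped literals, a-z style ranges, and {abc} multi-character strings. The function name is made up for this example, and it assumes the pattern has already been JSON-decoded so that each escape is a single backslash. The full UTS #35 syntax (nested sets, properties, negation, Unicode escapes, set operators) is not handled.

```python
def parse_simple_unicode_set(pattern):
    """Parse a small subset of UnicodeSet syntax into a set of strings."""
    if not (pattern.startswith("[") and pattern.endswith("]")):
        raise ValueError("expected a pattern of the form [...]")
    body = pattern[1:-1]
    result = set()
    previous = None        # last single character, possible start of a range
    pending_range = False  # True after an unescaped "-" following a character
    i = 0
    while i < len(body):
        ch = body[i]
        i += 1
        if ch == " ":
            continue  # spaces are ignorable in this subset
        if ch == "{":
            end = body.index("}", i)  # raises ValueError if unterminated
            result.add(body[i:end].replace(" ", ""))
            i = end + 1
            previous = None
            continue
        if ch == "-" and previous is not None and not pending_range:
            pending_range = True
            continue
        if ch == "\\":
            if i >= len(body):
                raise ValueError("dangling backslash")
            ch = body[i]  # escaped character is taken literally
            i += 1
        if pending_range:
            for cp in range(ord(previous), ord(ch) + 1):
                result.add(chr(cp))
            pending_range = False
            previous = None
        else:
            result.add(ch)
            previous = ch
    if pending_range:
        result.add("-")  # trailing "-" is treated as a literal
    return result


punctuation = parse_simple_unicode_set(r"[\- , ; \: ! ? . ( ) \[ \] \{ \}]")
```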

For many languages, including JavaScript / ECMAScript, I don't think such a deserializer exists today. Please let me know if I'm wrong!

