Re: CLDR and ICU from Philippe Verdy on 2012-07-27 (Unicode Mail List Archive)

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Sat, 28 Jul 2012 00:47:16 +0200

2012/7/27 Richard Wordingham <richard.wordingham_at_ntlworld.com>:
> The restrictions improve legibility. As it is, many of the
> character-level elements in CLDR XML files tend to be unreadable. It
> would be better for them not to require genuinely complex text
> rendering. In a related matter, it was very inconvenient to have to
> treat collation test files as binary data because they could not be DOS
> text files - ctrl/Z in the comments cut the files short.

I do agree. Even if non-characters are used internally when processing
CLDR data, that is only a result of the internal conversion of the
CLDR data files, which are still interchanged as text files.

For this reason, the needed sentinels (which are perfectly valid
elements to include in the data) should be re-encoded. As these data
are in XML format, it is easy to mark them up with specific XML
elements, which allow including not just text but also indirect
references to the code points that will be used internally (or,
probably better, references that make no assumption that these
specific non-character code points will be used).

As this processing is internal, let's hide this implementation detail
and not expose it, even indirectly, with artefacts like
<char cp="0000"/>, but rather use something more semantic like
<sentinel type="xyz"/>.
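
To make this concrete, here is a minimal sketch in Python of how a
reader could turn such markup into internal items without ever naming
a code point in the file. The <rule> wrapper element, the Sentinel
class and the parse_mixed function are hypothetical names chosen for
illustration only; none of them is part of CLDR or ICU.

```python
import xml.etree.ElementTree as ET


class Sentinel:
    """Internal marker; the data file never says how it is encoded."""

    def __init__(self, kind):
        self.kind = kind

    def __repr__(self):
        return f"Sentinel({self.kind!r})"


def parse_mixed(xml_text):
    """Return a list mixing plain strings and Sentinel objects."""
    root = ET.fromstring(xml_text)
    items = []
    if root.text:
        items.append(root.text)
    for child in root:
        if child.tag != "sentinel":
            raise ValueError(f"unexpected element: {child.tag}")
        items.append(Sentinel(child.get("type")))
        # Text following the empty element belongs to the parent content.
        if child.tail:
            items.append(child.tail)
    return items


# Hypothetical sample; the wrapper element name is an assumption.
SAMPLE = '<rule>abc<sentinel type="xyz"/>def</rule>'
items = parse_mixed(SAMPLE)
```

The file stays ordinary XML text, and the choice of internal value for
each sentinel is left entirely to the implementation.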

The internal processing of these sentinels is not restricted to the
code point encoding space: it could just as well use any internal
integer type, with negative values and bit packing of various flags in
those values, or some other internal structure that needs no sentinel
at all, such as TLV-encoded structures for variable-length data mixing
text content with non-text data or meta-information.
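
As an illustration of the negative-integer option, here is a minimal
sketch in Python. The layout (sentinel id in the low 8 bits, flags in
the bits above, the whole value negated) and all function names are
assumptions made for this example, not a description of what ICU
actually does.

```python
def pack_sentinel(sentinel_id, flags=0):
    """Pack a sentinel id (1..255) and flag bits into a negative int."""
    if not 1 <= sentinel_id <= 0xFF:
        raise ValueError("sentinel id must be in 1..255")
    if flags < 0:
        raise ValueError("flags must be non-negative")
    # Negating keeps sentinels disjoint from code points, which are >= 0.
    return -((flags << 8) | sentinel_id)


def unpack_sentinel(value):
    """Inverse of pack_sentinel: return (sentinel_id, flags)."""
    if value >= 0:
        raise ValueError("not a sentinel value")
    raw = -value
    return raw & 0xFF, raw >> 8


def encode(items):
    """Flatten strings and (id, flags) tuples into one list of ints."""
    out = []
    for item in items:
        if isinstance(item, str):
            out.extend(ord(ch) for ch in item)  # code points, all >= 0
        else:
            sentinel_id, flags = item
            out.append(pack_sentinel(sentinel_id, flags))
    return out


encoded = encode(["ab", (1, 0b10), "c"])
```

Any negative element is a sentinel and any non-negative element is a
code point, so no code point (non-character or otherwise) has to be
reserved for the purpose.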

The presence of ^Z controls, other XML-restricted controls, or Unicode
non-characters in the CLDR data files is also undesirable and
unnecessary (even if they appear only in comments, those comments are
expected to be plain text and should still respect the minimum
plain-text requirements). These data files should remain fully
compliant with plain-text and XML requirements so that they can safely
be used by text-processing tools (including text editors, import tools
for databases, spreadsheet processors, and various format converters).
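
Such a requirement is easy to check mechanically. Here is a minimal
sketch in Python, assuming the XML 1.0 "Char" production as the
definition of XML-allowed characters and the 66 Unicode noncharacters
(U+FDD0..U+FDEF plus the last two code points of every plane). The
function names are hypothetical.

```python
def is_xml10_char(cp):
    """True if cp matches the XML 1.0 Char production."""
    return (cp in (0x9, 0xA, 0xD)
            or 0x20 <= cp <= 0xD7FF
            or 0xE000 <= cp <= 0xFFFD
            or 0x10000 <= cp <= 0x10FFFF)


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for U+nFFFE / U+nFFFF in any plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def find_problem_chars(text):
    """Return (offset, code point, reason) for each offending character."""
    problems = []
    for offset, ch in enumerate(text):
        cp = ord(ch)
        if not is_xml10_char(cp):
            problems.append((offset, cp, "not allowed in XML 1.0"))
        elif is_noncharacter(cp):
            problems.append((offset, cp, "Unicode noncharacter"))
    return problems


# A ^Z (U+001A) in a comment is reported like any other occurrence.
report = find_problem_chars("# comment \x1a\n")
```

Running a check like this over the whole file, comments included,
would catch the ^Z case described above before the file is published.
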
Received on Fri Jul 27 2012 - 17:52:42 CDT
