Re: U+nnnn notation and normative identifiers.

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Nov 09 2005 - 11:11:47 CST

    From: "Erkki Kolehmainen" <erkki.kolehmainen@kotus.fi>
    > Philippe Verdy wrote:
    >> Why isn't there a project in CLDR to create such supplementary data for
    >> translated character names (that won't be identifiers, the only
    >> identifiers being the normative 4-to-6-digit hexadecimal code points)?
    >
    >
    > This would be a truly major project for all the languages, and each of
    > them would require an unprecedented consensus for all the names. In
    > Finland, we have translated the names of the Multilingual European Subset
    > 2 (MES-2) into Finnish and made the list freely available as a
    > recommendation - not a standard - of the Finnish Standards Association
    > SFS. We are now starting the process to expand the list, but we are only
    > considering the addition of a few hundred character names.

    Not that huge: the "root" locale can be fed with existing standard names,
    and then what is needed is a repository to store the per-language
    corrections when they are attested for that language. For most characters,
    no translation would be needed, and the normative name would be inherited
    from "root". So those who complain about incorrect English names could
    make their corrections in the English locale, but not in the "root"
    locale.
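
    The inheritance described above can be sketched as follows. This is only
    an illustration under stated assumptions: the OVERRIDES table, the
    parent() and char_name() helpers and the sample entries are hypothetical
    and are not part of CLDR; the "root" names are taken from Python's
    unicodedata module as a stand-in for the normative names.

```python
import unicodedata

# Hypothetical per-locale tables: only attested corrections or translations
# are stored; everything else is inherited from "root".
OVERRIDES = {
    "en": {0x01A2: "LATIN CAPITAL LETTER GHA"},   # illustrative correction
    "fr": {0x0041: "LETTRE MAJUSCULE LATINE A"},  # illustrative translation
}


def parent(locale):
    """Return the next locale in the fallback chain, e.g. fr_CA -> fr -> root."""
    if locale == "root":
        return None
    return locale.rsplit("_", 1)[0] if "_" in locale else "root"


def char_name(code_point, locale):
    """Resolve a character name, falling back to the normative "root" name."""
    while locale is not None and locale != "root":
        name = OVERRIDES.get(locale, {}).get(code_point)
        if name is not None:
            return name
        locale = parent(locale)
    # "root" is fed with the existing standard names.
    return unicodedata.name(chr(code_point), None)
```

    With such a layout, char_name(0x01A2, "en") returns the corrected English
    name, while a locale without any entry for that character still gets the
    normative name from "root".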

    But I agree that once such a project is started, there will be lots of
    updates for each language, as people try to add more and more character
    name translations.

    The first thing to translate would be the basic set of letters, digits, and
    punctuation needed for the language. Then one could expand it to cover a
    significant subset of the native script, and finally cover the whole script
    and its symbols, before attempting to cover other scripts. By that time, the
    basic English collections would have been covered too, as well as the major
    languages, and this would facilitate the creation of translations for
    languages whose scripts differ from those of the existing translations.

    Such a database would also resolve the various ambiguities that some
    existing names cause: it is difficult to guess which character is actually
    meant by the normative name if you have not seen and compared the
    representative glyphs and the usage notes (when they are present in the
    Unicode names list file). Having to download a whole PDF (possibly a big
    one, for example with Han ideographs) to get this information is too
    demanding when a better name could improve the interpretation of names and
    facilitate searching for characters by name.

    For some scripts, the database should also contain additional resource keys
    to retrieve extra information (notably for Han ideographs, for which a
    search by radical and strokes would be helpful, as well as a search by
    traditional/simplified usage).
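
    As a rough sketch of such a resource key, the lookup below indexes
    ideographs by radical number and additional stroke count. The SAMPLE
    entries are written in the "radical.strokes" style of the Unihan
    kRSUnicode field, but they are only a handful of illustrative values, and
    build_index() and lookup() are hypothetical helpers. The parsing assumes
    plain "radical.strokes" values and ignores any extra markers that real
    data may carry.

```python
from collections import defaultdict

# Illustrative sample only: code point -> "radical.additional_strokes".
SAMPLE = {
    0x6C34: "85.0",  # water radical itself
    0x6CB3: "85.5",
    0x6D77: "85.7",
    0x6728: "75.0",  # tree radical itself
    0x6797: "75.4",
}


def build_index(rs_data):
    """Group code points under a (radical, additional strokes) key."""
    index = defaultdict(list)
    for code_point, value in rs_data.items():
        radical, strokes = value.split(".")
        index[(int(radical), int(strokes))].append(code_point)
    for code_points in index.values():
        code_points.sort()
    return index


def lookup(index, radical, strokes):
    """Return the matching characters in U+nnnn notation."""
    return [f"U+{cp:04X}" for cp in index.get((radical, strokes), [])]
```

    A character selector could then call lookup(build_index(SAMPLE), 85, 5) to
    list the candidates for radical 85 with 5 additional strokes.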

    Finally, the representative glyphs could also be stored as bitmaps with a
    limited resolution (or SVG?) in the "root" locale, in another extra database
    (would be helpful mostly for Han ideographs), but this should not compete
    with font implementations (so no glyph properties, no kerning pairs, etc.;
    only the rendered graphic at a single size would be stored). This would
    allow building input editors and other character selectors that present the
    characters in a grid.
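
    A character selector of this kind mostly needs a range of code points
    split into rows. The sketch below is a minimal illustration: grid() is a
    hypothetical helper, and it returns the characters themselves, where a
    real selector would show the stored representative glyph for each cell.

```python
def grid(first, last, columns=16):
    """Split the inclusive code point range first..last into rows of cells."""
    code_points = range(first, last + 1)
    return [
        [chr(cp) for cp in code_points[start:start + columns]]
        for start in range(0, len(code_points), columns)
    ]


if __name__ == "__main__":
    # Print the basic Latin capital letters in rows of eight.
    for row in grid(0x41, 0x5A, columns=8):
        print(" ".join(row))
```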


