RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Dec 11 2003 - 20:55:19 EST

    Peter Kirk wrote:
    > I am sure that some tricks could be found to
    > simplify the indexing if necessary, e.g. using PUA or non-character code
    > points indexed into a special table to replace DGCs which cannot be
    > represented as a single character. (There are plenty of non-characters
    > available as you need to use UTF-32 here to avoid exactly the same
    > problems with surrogates.)

    You're quite optimistic here: the total number of DGCs that can be encoded
    in Unicode goes far beyond the capacity of PUAs and even of the whole
    Unicode range itself.

    I did not try to count them, even for the simplest cases, but the set of
    possible DGCs is practically unbounded:
    - there's no upper limit on the number of diacritics you can combine with a
    base character
    - there's no limit on the number of base characters that can be used to
    build Hangul syllables.
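
    To put rough numbers on the capacity argument, here is a minimal Python
    sketch. It counts the private-use code points and then checks how few
    stacked diacritics are needed before the number of distinct DGCs exceeds
    that count. The helper names (pua_capacity, clusters_with_marks,
    marks_needed_to_exceed) are illustrative only, and the sketch assumes just
    two accents on a single base letter, which is far less than Unicode allows.

```python
import unicodedata
from itertools import product

# Private Use Area ranges defined by Unicode (inclusive bounds).
PUA_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

def pua_capacity():
    """Total number of private-use code points."""
    return sum(hi - lo + 1 for lo, hi in PUA_RANGES)

# Two combining marks with the same combining class (230), so different
# orderings are NOT canonically equivalent and each sequence stays distinct.
MARKS = ("\u0300", "\u0301")  # COMBINING GRAVE ACCENT, COMBINING ACUTE ACCENT
assert unicodedata.combining(MARKS[0]) == unicodedata.combining(MARKS[1]) == 230

def clusters_with_marks(base, length):
    """Yield every cluster made of `base` plus `length` marks taken from MARKS."""
    for combo in product(MARKS, repeat=length):
        yield base + "".join(combo)

def marks_needed_to_exceed(limit):
    """Smallest mark count whose cluster count (2**length) exceeds `limit`."""
    length = 0
    while len(MARKS) ** length <= limit:
        length += 1
    return length

print(pua_capacity())                          # 137468
print(marks_needed_to_exceed(pua_capacity()))  # 18
```

    With only two accents and one base letter, 18 stacked marks already give
    more distinct clusters than there are private-use code points, and nothing
    stops a text from using longer sequences or other bases.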

    So how will you allocate PUAs? With an internal lookup table, stored with
    the document that uses these PUAs, that maps only the DGCs used in that
    document to single PUAs? Then how will you implement indexing with these
    private PUAs, whose semantics change across documents? What is the
    relevant scope for these PUAs?
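
    The scope problem can be sketched as follows. This is a hypothetical
    per-document table, not an existing API: DocumentPuaTable and
    simple_clusters are names invented for the illustration, and the cluster
    split is deliberately crude (a character plus any following combining
    marks), which is an assumption and not the full default grapheme cluster
    rules.

```python
import unicodedata

PUA_START, PUA_END = 0xE000, 0xF8FF  # BMP Private Use Area

def simple_clusters(text):
    """Rough DGC split: a character plus any following combining marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

class DocumentPuaTable:
    """Hypothetical per-document mapping from multi-character DGCs to PUAs."""

    def __init__(self):
        self.to_pua = {}
        self.next_code = PUA_START

    def encode(self, text):
        out = []
        for cluster in simple_clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                out.append(cluster)  # already a single code point
                continue
            if cluster not in self.to_pua:
                if self.next_code > PUA_END:
                    raise OverflowError("no private-use code points left")
                self.to_pua[cluster] = chr(self.next_code)
                self.next_code += 1
            out.append(self.to_pua[cluster])
        return "".join(out)

doc_a, doc_b = DocumentPuaTable(), DocumentPuaTable()
encoded_a = doc_a.encode("q\u0301")  # q + COMBINING ACUTE ACCENT
encoded_b = doc_b.encode("x\u0302")  # x + COMBINING CIRCUMFLEX ACCENT
print(encoded_a == encoded_b)        # True: both are U+E000
print(doc_a.to_pua == doc_b.to_pua)  # False: the tables disagree
```

    Both documents end up containing U+E000, but it stands for a different
    cluster in each, so an index built over one document's PUAs is meaningless
    for the other unless the table travels with the text.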

    To me it seems simpler (and more interoperable, or easier to integrate
    within compound documents) to avoid PUAs wherever the text can be encoded
    using DGCs made of assigned code points.

    PUAs are a convenient tool for assigning glyph IDs within fonts that
    implement contextual forms referenced in the internal font lookup tables,
    when these tables can be processed by an external renderer to select
    contextual glyphs or to control ligation or kerning. The scope of these
    PUAs is strictly limited to the font that uses them: they let a renderer
    create Unicode strings that will finally be drawn by a basic string
    renderer.

    Other uses of PUAs have a limited scope tied to specific standards, APIs
    or protocol layers, in which they may be used to include some "markup"
    data within a stream of Unicode characters. These PUAs are internal to the
    process that uses them, and the public external interface will simply
    ignore/drop/reject external strings containing any colliding PUA whose
    semantics are not certified to match those of the protocol scope -- the
    other option being to remap external PUAs to REPLACEMENT CHARACTER if they
    collide with internal PUAs, and to signal to the external program using that
    interface that transmission or usage of these external PUAs is not supported
    or that the interface can cause data loss.
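
    A minimal sketch of such an interface boundary, assuming the process knows
    the set of PUAs it reserves internally. sanitize_external and its strict
    flag are hypothetical names for the illustration; the returned count is
    one possible way of signalling data loss to the caller.

```python
import unicodedata

REPLACEMENT = "\uFFFD"  # REPLACEMENT CHARACTER

def is_private_use(ch):
    """True if `ch` is a private-use code point (general category Co)."""
    return unicodedata.category(ch) == "Co"

def sanitize_external(text, internal_puas, strict=False):
    """Filter an external string before it enters the internal PUA scope.

    Returns (clean_text, lost), where `lost` counts the external PUAs that
    collided with internal ones and were replaced by U+FFFD. With
    strict=True the whole string is rejected instead.
    """
    out, lost = [], 0
    for ch in text:
        if is_private_use(ch) and ch in internal_puas:
            if strict:
                raise ValueError(f"external PUA U+{ord(ch):04X} not supported")
            out.append(REPLACEMENT)
            lost += 1
        else:
            out.append(ch)
    return "".join(out), lost

clean, lost = sanitize_external("a\ue000b\ue001", {"\ue000"})
print(lost)  # 1: the caller can be told that data was lost
```

    Non-colliding external PUAs are passed through untouched in this sketch;
    an interface that certifies nothing about them could just as well drop or
    replace them too.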
