Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 06:44:51 EST

    On 11/12/2003 17:55, Philippe Verdy wrote:

    >Peter Kirk wrote:
    >
    >
    >>I am sure that some tricks could be found to
    >>simplify the indexing if necessary, e.g. using PUA or non-character code
    >>points indexed into a special table to replace DGCs which cannot be
    >>represented as a single character. (There are plenty of non-characters
    >>available as you need to use UTF-32 here to avoid exactly the same
    >>problems with surrogates.)
    >>
    >>
    >
    >You're quite optimistic here: the total number of DGCs that can be encoded
    >in Unicode goes far beyond the capacity of PUAs and even of the whole
    >Unicode range itself.
    >
    >I did not try to count them for the simplest cases, but possible DGCs are
    >nearly infinite:
    >- there's no upper limit for the number of diacritics you can combine with a
    >base character
    >- there's no limit in the number of base characters that can be used to
    >build Hangul syllables.
    >
    >
    More than that, actually infinite, as any one diacritic may be repeated.

    >So how will you allocate PUAs? Using an internal lookup table stored with
    >the document that uses these PUAs, which translates only the DGCs used
    >internally into single PUAs? ...
    >
    Well, I wasn't actually thinking of storing these with the document,
    although I suppose they could be if I were to store documents in a
    private format, an approach which I don't like. (It wouldn't even be
    an efficient format if I am mostly using UTF-32.) Rather, I was
    thinking of translating complex DGCs into PUAs etc. on input of each
    document individually, and keeping in memory a table mapping these PUAs
    to character strings. Actually it is probably better in this case to use
    non-characters, as there may be PUAs in the document already, and this
    avoids some of the problems you noted. As 32-bit code units give me
    65519 whole planes of non-characters beyond the 17 Unicode planes,
    enough for more than 4 billion distinct DGCs, I think I will have
    enough space for any practical document.
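
A minimal sketch of the in-memory table described above, not a worked-out
design. It assumes the text is held as a list of 32-bit integers, that values
above U+10FFFF serve as the "non-characters", and that a DGC is crudely
approximated as a base character plus any following marks with a non-zero
combining class (real grapheme cluster rules are more involved). The names
ClusterTable and split_clusters are hypothetical.

```python
import unicodedata

UNICODE_MAX = 0x10FFFF
FIRST_SPARE = UNICODE_MAX + 1            # first 32-bit value outside Unicode

# 32-bit values span 65536 planes of 65536 values; Unicode uses 17 of them.
SPARE_PLANES = 2**32 // 2**16 - 17
SPARE_VALUES = SPARE_PLANES * 2**16


def split_clusters(text):
    """Rough approximation: attach combining marks to the preceding character."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class ClusterTable:
    """Per-document mapping between multi-character clusters and spare values."""

    def __init__(self):
        self.to_code = {}    # cluster string -> spare 32-bit value
        self.to_text = []    # index (value - FIRST_SPARE) -> cluster string

    def encode(self, text):
        """Return one integer per cluster; precomposed forms keep their code point."""
        units = []
        for cluster in split_clusters(unicodedata.normalize("NFC", text)):
            if len(cluster) == 1:
                units.append(ord(cluster))
                continue
            code = self.to_code.get(cluster)
            if code is None:
                code = FIRST_SPARE + len(self.to_text)
                self.to_code[cluster] = code
                self.to_text.append(cluster)
            units.append(code)
        return units

    def decode(self, units):
        """Expand spare values back into the character strings they stand for."""
        return "".join(
            chr(u) if u <= UNICODE_MAX else self.to_text[u - FIRST_SPARE]
            for u in units
        )
```

With this layout every element of the list is one cluster, so indexing by
position needs no scanning, and the table lives only as long as the open
document does.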

    >... Now how will you implement indexing with these
    >private PUAs whose semantics change across documents? What is the
    >relevant scope for these PUAs?
    >
    >
    The scope would be one instance of a document opened in an application.
    As for implementation details, that is for implementers to sort out.
    This was a tentative suggestion which I made in passing, not something
    which I had thought through in detail.

    In the 19th century Charles Babbage wrote, concerning his prototype
    computers:

    > Propose to an Englishman any principle, or any instrument, however
    > admirable, and you will observe that the whole effort of the English
    > mind is directed to find a difficulty, a defect, or an impossibility
    > in it.

    I regret that we English may have exported this unfortunate trait.

    -- 
    Peter Kirk
    peter@qaya.org (personal)
    peterkirk@qaya.org (work)
    http://www.qaya.org/
    

