Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Fri Dec 12 2003 - 06:44:51 EST

Next message: jon@hackcraft.net: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

Previous message: Peter Kirk: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
In reply to: Philippe Verdy: "RE: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: Philippe Verdy: "RE: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Reply: Philippe Verdy: "RE: Text Editors and Canonical Equivalence (was Coloured diacritics)"

On 11/12/2003 17:55, Philippe Verdy wrote:

>Peter Kirk wrote:
>
>
>>I am sure that some tricks could be found to
>>simplify the indexing if necessary, e.g. using PUA or non-character code
>>points indexed into a special table to replace DGCs which cannot be
>>represented as a single character. (There are plenty of non-characters
>>available as you need to use UTF-32 here to avoid exactly the same
>>problems with surrogates.)
>>
>>
>
>You're quite optimistic here: the total number of DGCs that can be encoded
>in Unicode goes far beyond the capacity of PUAs and even of the whole
>Unicode range itself.
>
>I did not try to count them for the simplest cases, but possible DGCs are
>nearly infinite:
>- there's no upper limit for the number of diacritics you can combine with a
>base character
>- there's no limit in the number of base characters that can be used to
>build Hangul syllables.
>
>
More than that, actually infinite, as any one diacritic may be repeated.
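
To make the repetition point concrete, here is a minimal sketch in Python (the function name `repeated_mark_clusters` is just an illustrative choice). It assumes only the standard `unicodedata` module, and uses "q" as the base because q with acute has no precomposed form, so normalization leaves every mark in place:

```python
import unicodedata

def repeated_mark_clusters(base, mark, n):
    """Return base+mark, base+mark+mark, ... up to n marks, each in NFC."""
    return [unicodedata.normalize("NFC", base + mark * k) for k in range(1, n + 1)]

# Five clusters built from one base and one repeated diacritic (U+0301).
clusters = repeated_mark_clusters("q", "\u0301", 5)

# Every repetition count yields a different sequence, so the set of
# possible clusters has no upper bound.
distinct = len(set(clusters))
```

Each extra repetition gives a new, distinct sequence, which is why no fixed-size repertoire can hold them all.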

>So how will you allocate PUAs? Using an internal lookup table, stored with
>the document that uses these PUAs, that translates only the DGCs used
>internally into single PUAs? ...
>
Well, I wasn't actually thinking of storing these with the document,
although I suppose they could be if I chose to store documents in a
private format, an approach which I don't like. (This wouldn't even be
an efficient format if I am mostly using UTF-32.) I was thinking rather
of translating complex DGCs into PUAs etc. on input of each document
individually, and keeping in memory a table mapping these PUAs to
character strings. Actually it is probably better in this case to use
non-characters, as there may be PUAs in the document already, and this
avoids some of the problems you noted. With 32-bit code units I have
65519 whole planes of non-characters available beyond the 17 Unicode
planes, which can support more than 4 billion distinct DGCs, so I think
I will have enough space for any practical document.
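
A rough sketch of what I have in mind, in Python, purely as an illustration and not a worked-out design. Everything named here (`ClusterTable`, `split_clusters`, `encode`, `decode`) is hypothetical. The buffer is a plain list of integers standing in for 32-bit code units, and the cluster splitting is deliberately simplified (a base followed by characters with a non-zero combining class), not the full segmentation rules:

```python
import unicodedata

FIRST_FREE = 0x110000                # first 32-bit value above the Unicode range
FREE_VALUES = 2**32 - FIRST_FREE     # 65519 planes of 65536 values each

class ClusterTable:
    """Per-document, in-memory mapping between multi-character clusters
    and out-of-range 32-bit values (hypothetical helper)."""

    def __init__(self):
        self.to_code = {}      # cluster string -> assigned value
        self.to_cluster = []   # (value - FIRST_FREE) -> cluster string

    def code_for(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)            # ordinary character, stored as-is
        code = self.to_code.get(cluster)
        if code is None:                   # first time this cluster is seen
            code = FIRST_FREE + len(self.to_cluster)
            self.to_code[cluster] = code
            self.to_cluster.append(cluster)
        return code

    def cluster_for(self, code):
        if code >= FIRST_FREE:
            return self.to_cluster[code - FIRST_FREE]
        return chr(code)

def split_clusters(text):
    """Simplified split: a character plus any following combining marks."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

def encode(text, table):
    """On input: one integer per cluster, so indexing is one unit per DGC."""
    return [table.code_for(c) for c in split_clusters(text)]

def decode(codes, table):
    """On output: expand every value back to ordinary Unicode text."""
    return "".join(table.cluster_for(c) for c in codes)

table = ClusterTable()
sample = "q\u0301x\u0323\u0301q\u0301"
codes = encode(sample, table)
```

The table lives only as long as the open document, and nothing out of range is ever written to disk, since `decode` restores ordinary text on output.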

>... Now how will you implement indexing with these
>private PUAs whose semantics change across documents? What is the
>relevant scope for these PUAs?
>
>
The scope would be one instance of a document opened in an application.
As for implementation details, that is for implementers to sort out.
This was a tentative suggestion which I made in passing, not something
which I had thought through in detail.

In the 19th century Charles Babbage wrote, concerning his prototype
computers:

> Propose to an Englishman any principle, or any instrument, however
> admirable, and you will observe that the whole effort of the English
> mind is directed to find a difficulty, a defect, or an impossibility
> in it.

I regret that we English may have exported this unfortunate trait.

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/

This archive was generated by hypermail 2.1.5 : Fri Dec 12 2003 - 07:28:15 EST