Re: Canonical ordering

From: Peter Constable
Date: Tue May 02 2000 - 09:48:45 EDT


>The ones everyone knows about are Vietnamese, Greek,
>and Hebrew.

       Not necessarily everyone knows... ;-)

>I would expect that any language that made extensive
>use of more than a single accent on top of a letter
>might have some history of horizontal accommodations
>for the accents in its typography, however. It is
>just the natural thing for typographers to do when
>trying to create typefaces that work while having to
>deal with multiple accents.

       These aren't the only cases to consider. I'm thinking of cases
       of new orthographies for previously unwritten languages. Such
       languages obviously have no such traditions, yet it's possible
       that they may horizontally position diacritics.

>You really only would need to start language tagging
>if you are faced with having to deal with aggressively
>multilingual text, for which mixed conventions
>regarding accent stacking were significant and
>required to be rendered correctly. Frankly I think that
>is a small percentage case inside a small percentage

       I don't know for sure what "aggressively multilingual" means,
       but it only takes two languages for such problems to arise. And
       it is not necessarily the case that this is unlikely to occur.

>Unicode is not intended as a generic text layout macro

       You know I know that.

>*Some* aspects of text layout need to be left to text
>markup and text description languages. :-) And it isn't
>clear that trying to include a plain text character
>mechanism for describing exactly how accents are placed
>over a letter makes sense to include in the character
>encoding per se.

       But Unicode does provide some level of support for this where
       it pertains to the meaning of text. That's why we have
       canonical ordering classes. So I'm just trying to determine how
       far we go and what can be done in situations that involve novel
       use of existing scripts.

       Let me give an example case (which I think is real):

       Thai diacritics when used for Standard Thai have strict
       co-occurrence restrictions, and of those that can co-occur above
       a base character - vowel + tone or vowel + thanthakhat - it is
       always the case that the tone or thanthakhat stacks above the
       vowel. One particular co-occurrence restriction is that mai tai
       khuu never co-occurs with any other diacritic.

       There are a number of Mon-Khmer minority languages spoken in
       Thailand. Typically, these languages have a number of
       phonological distinctions that are not found in Thai,
       particularly related to vowel articulation. As a result, when
       writing one of these languages using Thai script, a
       significantly larger number of spellings for vowels are needed.
       For most such languages, it is likely that orthographic
       innovations will include (but not necessarily be limited to)
       the use of combinations of superior diacritics that do not
       co-occur when writing Standard Thai. Such combinations could
       include (for example) mai eek and mai thoo or mai eek and mai
       tai khuu positioned side-by-side above the base character. (I
       don't recall right now exactly what combinations I've been told
       about, but I'm pretty sure there were some that involved
       side-by-side positioning.)

       The likelihood of documents containing text from such a
       language as well as text in Thai is high.

       So, in a case like this, is language-specific rendering (and
       language tagging as needed - which would be always for data
       that will be exchanged) deemed to be the appropriate solution,
       or might we want to consider some mechanism in Unicode (e.g.
       base + diacr + GJ/ZWJ + diacr)?
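
       For concreteness, the second alternative might look like the
       following as a code point sequence. This is only a sketch of
       the hypothetical mechanism: ZWJ is U+200D, GJ is left out
       because no code point is given for it here, the helper name is
       made up, and nothing in it implies that any renderer interprets
       the joiner as a request for side-by-side placement.

```python
import unicodedata

ZWJ = "\u200D"  # ZERO WIDTH JOINER


def with_joiner(base, first, second, joiner=ZWJ):
    """Hypothetical helper: build base + diacr + joiner + diacr."""
    return base + first + joiner + second


KO_KAI = "\u0E01"    # base consonant
MAI_EEK = "\u0E48"   # tone mark
MAI_THOO = "\u0E49"  # tone mark

plain = KO_KAI + MAI_EEK + MAI_THOO               # default sequence
joined = with_joiner(KO_KAI, MAI_EEK, MAI_THOO)   # hypothetical marked sequence

# The two sequences are distinct in plain text, and the joiner is not
# dropped or moved by normalization, so the distinction would survive
# interchange without any language tag.
distinct = plain != joined
stable = unicodedata.normalize("NFC", joined) == joined

if __name__ == "__main__":
    print([f"U+{ord(ch):04X}" for ch in plain])
    print([f"U+{ord(ch):04X}" for ch in joined])
    print("distinct:", distinct, "stable under NFC:", stable)
```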

       Note: the only cases I currently know of where this is
       potentially an issue involve Thai script.

