Re: Tengwar vowel signs

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Fri Jan 04 2002 - 00:17:03 EST

Previous message: Asmus Freytag: "Fwd: PDUTR #25: Unicode Support for Mathematics"
In reply to: Kenneth Whistler: "Re: Tengwar vowel signs"
Next in thread: David Hopwood: "Grapheme breaking rules (was: Tengwar vowel signs)"
Reply: David Hopwood: "Grapheme breaking rules (was: Tengwar vowel signs)"

-----BEGIN PGP SIGNED MESSAGE-----

Kenneth Whistler wrote:
> Michael Everson wrote:
> > http://www.evertype.com/standards/iso10646/pdf/tengwar-vowels.pdf
> > http://www.evertype.com/standards/iso10646/pdf/tengwar.pdf
>
> Maybe I haven't read these carefully enough, but it appears to
> me that the analysis you provide in tengwar-vowels.pdf (which I
> find myself in agreement with) doesn't match your statement about
> the vowels (tehtar) in tengwar.pdf,

Note that there are three different proposals in the two papers.

> where you claim the tehtar
> are not combining marks, that the logical order is the same for
> both Quenya and Sindarin modes, and that display is a matter of
> picking out ligatures that ligate the tehtar with preceding base
> letters in Quenya and following base letters in Sindarin.

An alternative (fourth) approach encodes the tehtar twice: as a set of
"preceding tehtar" and a set of "following tehtar". Equivalently,
the vowels can be thought of as being in three sets: Quenya-style,
Sindarin-style and Beleriand-style (the correspondence between these
sets should be reflected in the encoding).

The fact that some tehtar precede the character they are applied to is not
really a significant problem IMHO; there are already far more complicated
cases of conjoining characters that don't follow a simple "base + diacritic"
model in Unicode (e.g. in Hangul and Tamil). AFAICS, the main reason for the
"combining characters follow the base character" rule is to allow for
consistent canonicalisation, but I don't think that is a problem here:
the preceding tehtar can be given combining class 0.

Here are some of Michael Everson's examples encoded using this approach:

  +e = following tehta above
  _e = following tehta below
  e+ = preceding tehta above
  e = full vowel

  language/style      word            encoding
  -----------------------------------------------------
  Quenya              nelde           n +e ld +e
  Quenya              neltildi        n +e l t +i ld +i
  Sindarin            neled           n e+ l e+ d
  Sindarin            nelthil         n e+ l th i+ l
  Beleriand           neled           n e l e d
  Beleriand           nelthil         n e l th i l
  English/Quenya      animal          ^ +a n +i m +a l
  English/Sindarin    animal          a+ n i+ m a+ l
  English/Beleriand   animal          a n i m a l
  Old English         mihton          m i+ h ZWJ t o+ n
  Old English         <thorn><ae>re   th ae+ r _e

(A much nicer .gif illustration of this is attached.)
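
As a rough illustration of how mechanical this mapping is, here is a
minimal Python sketch that reproduces several rows of the table from the
Latin spelling. Everything in it is an assumption made for illustration:
the names encode, split_letters and STYLE_FORMAT are hypothetical, the
digraph list covers only the clusters that appear in the examples, and
tokens are written in the +e / e+ / e notation used above rather than as
real code points. It does not insert carriers or ZWJ.

```python
VOWELS = set("aeiou")

# Assumption: only the consonant clusters seen in the examples are
# treated as single letters.
DIGRAPHS = ("ld", "th")

# How each style writes a vowel, in the notation of the table above.
STYLE_FORMAT = {
    "quenya": "+{}",     # following tehta
    "sindarin": "{}+",   # preceding tehta
    "beleriand": "{}",   # full vowel
}


def split_letters(word):
    """Split a Latin word into letters, keeping known digraphs whole."""
    units = []
    i = 0
    while i < len(word):
        if word[i:i + 2] in DIGRAPHS:
            units.append(word[i:i + 2])
            i += 2
        else:
            units.append(word[i])
            i += 1
    return units


def encode(word, style):
    """Map each Latin letter to one token, in the same order."""
    vowel_format = STYLE_FORMAT[style]
    tokens = []
    for unit in split_letters(word.lower()):
        if unit in VOWELS:
            tokens.append(vowel_format.format(unit))
        else:
            tokens.append(unit)
    return " ".join(tokens)


print(encode("nelde", "quenya"))
print(encode("nelthil", "sindarin"))
print(encode("animal", "beleriand"))
```

The token order is the same in all three styles; only the vowel form
changes, so no reordering step is needed in this sketch.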

A minor modification is needed to the grapheme breaking rules. Give
preceding tehtar a new property 'Grapheme_Precede', following tehtar
'Grapheme_Extend', and add some rules to prevent breaking between
Grapheme_Precede and a following character:

Precede x Precede
Precede x Base

This is potentially useful for other scripts as well, and it wouldn't
increase the complexity of grapheme breaking much.
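
To make the effect of the two rules concrete, here is a small Python
sketch that applies them to token sequences in the notation used above.
It is only an illustration: the classify and grapheme_clusters names are
hypothetical, the property is guessed from the token spelling instead of
being looked up in a character database, the usual "do not break before
Extend" rule is assumed, and ZWJ is not modelled.

```python
def classify(token):
    """Guess the grapheme property from the example notation."""
    if len(token) > 1 and token.endswith("+"):
        return "Precede"   # preceding tehta, e.g. "e+"
    if len(token) > 1 and token[0] in "+_":
        return "Extend"    # following tehta, e.g. "+e" or "_e"
    return "Base"          # consonants, full vowels, carriers


def grapheme_clusters(tokens):
    """Group tokens into clusters, breaking everywhere except:
         x Extend            (assumed existing rule)
         Precede x Precede   (proposed)
         Precede x Base      (proposed)
    """
    clusters = []
    prev = None
    for token in tokens:
        cur = classify(token)
        no_break = (
            cur == "Extend"
            or (prev == "Precede" and cur in ("Precede", "Base"))
        )
        if clusters and no_break:
            clusters[-1].append(token)
        else:
            clusters.append([token])
        prev = cur
    return clusters


print(grapheme_clusters("n +e ld +e".split()))
print(grapheme_clusters("n e+ l e+ d".split()))
print(grapheme_clusters("th ae+ r _e".split()))
```

In this sketch a preceding tehta with nothing after it simply ends up as
a cluster of its own.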

[Actually, I've just noticed that there are no rules "Extend x Extend"
and "Extend x Link". Shouldn't there be? If there aren't, then there will
be breaks within combining sequences, and between a combining sequence
and GRAPHEME JOINER, for example.]

If a "preceding tehta" is at the end of a string, it is treated as a
spacing character. COMBINING DOT ABOVE and COMBINING ACUTE ACCENT are
used for Beleriand as in tengwar.pdf.

Advantages:
- preserves the logical structure (including the language style of each
word, and therefore the pronunciation).

- no need for two font types, unlike the proposal in tengwar.pdf.

- each logical element of the script corresponds to exactly one Unicode
character (except for use of ZWJ for true ligatures).

- straightforward one-to-one transliteration without reordering is
possible between the Quenya, Sindarin, and Beleriand styles, and Latin
script (except for adding carriers [*]).

- no problems for collation: it's easy to sort this encoding according to
pronunciation. Carriers would be ignorable.

- completely straightforward input - the language style determines which
vowel character is produced when the user types a given Latin vowel.

- more natural encoding of vowels following a consonant for Old English;
use both preceding and following tehtar as appropriate (see
"<thorn><ae>re" example above, which would have to be encoded as
"th ae r ZWJ e-below" in the tengwar.pdf proposal).

- no problems with canonicalisation or grapheme breaking, provided
preceding tehtar are given the correct properties. Grapheme breaks
reflect the syllabic language structure.
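
The collation point can be sketched in a few lines of Python. This is
not a real collation algorithm, only an illustration under assumptions:
sort_key is a hypothetical name, "^" is taken to be the carrier as in
the examples, and the key is simply the sequence of letters with the
carrier dropped and the tehta position markers stripped.

```python
CARRIER = "^"  # assumption: the carrier symbol used in the examples


def sort_key(encoded):
    """Build a pronunciation-based key from a space-separated encoding."""
    key = []
    for token in encoded.split():
        if token == CARRIER:
            continue                     # carriers are ignorable
        key.append(token.strip("+_"))    # "+e", "e+", "_e" all become "e"
    return tuple(key)


words = [
    "n e+ l th i+ l",      # Sindarin nelthil
    "^ +a n +i m +a l",    # English/Quenya animal
    "n +e ld +e",          # Quenya nelde
]
print(sorted(words, key=sort_key))
```

With this key the three spellings of "animal" compare equal, which is
the sense in which the style does not affect the sort order.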

[*] Alternatively, two consecutive tehtar, or a following tehta at the
    beginning of a word, or a preceding tehta at the end of a word, could
    be considered to imply a carrier. The advantage of that would
    be closer correspondence to the underlying language, but it requires
    more complex rendering; on the whole I think carriers should probably
    be encoded explicitly.
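
For comparison, the implied-carrier alternative in [*] might look
roughly like the following Python sketch, which is the kind of step a
renderer would need. The function names are hypothetical, and where the
carrier goes (before a following tehta, after a preceding tehta) is an
assumption of this sketch, not something stated above.

```python
CARRIER = "^"  # assumption: the carrier symbol used in the examples


def is_following(token):
    return len(token) > 1 and token[0] in "+_"


def is_preceding(token):
    return len(token) > 1 and token.endswith("+")


def is_tehta(token):
    return is_following(token) or is_preceding(token)


def add_carriers(tokens):
    """Insert a carrier wherever a tehta has no base letter to sit on."""
    out = []
    for i, token in enumerate(tokens):
        # A following tehta needs a base immediately before it.
        if is_following(token) and (not out or is_tehta(out[-1])):
            out.append(CARRIER)
        out.append(token)
        # A preceding tehta needs a base immediately after it.
        if is_preceding(token):
            nxt = tokens[i + 1] if i + 1 < len(tokens) else None
            if nxt is None or is_tehta(nxt):
                out.append(CARRIER)
    return out


print(" ".join(add_carriers("+a n +i m +a l".split())))
print(" ".join(add_carriers("n e+ l e+ d".split())))
```

Applied to the carrier-less Quenya-style "animal", this restores the
initial carrier shown in the table; sequences that already have a base
for every tehta are left unchanged.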

Disadvantage:
- requires fonts to be able to place a mark over the following character
rather than the preceding one.

I doubt that the font issue is a serious problem when using OpenType or
similar. Even a very simple font can treat the preceding tehtar as zero-
width overlays that extend to the right (as usual this doesn't take
account of character widths or heights, but it's an acceptable fallback).

I think it's also significant that one of the main original purposes of
Tengwar was as a way of exploring script and language structure (whether
of fictional languages or real ones). If the encoding doesn't reflect that
structure, then what's the point?

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPDU5vzkCAxeYt5gVAQF1LggAg68wBSSq3+p0rADtDq6vEpZgtSCvKGsZ
5z77dLn/R5JJz2KboqIZa3KVOFPUTMq/zqkL0okLhClKMIvaVywqX4go3CoKijkf
Bu2u1pMUQLImf9RIt/nolHBFpBcVSKvDnFRzquzShh/lhqOBxrYids/BwHwQpL52
s79I0zXiZ2iNzSSKXzkddrhIZrCukrVnavdbpvwrtXoQD9uqK9V7DOHnQkcoiYnY
zCEdRmzxmX9gvWWlc3qz2MJwz9qChbMTSMQMydW5e+l3yFUvcqVIYvtfk9LP6bFT
cpWJ1ja5EeWBQuD6gnBjsZHDa16sy16GvAzab6r6TM1+wuLn51/g+A==
=9soF
-----END PGP SIGNATURE-----

