Grapheme breaking rules (was: Tengwar vowel signs)

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Fri Jan 04 2002 - 22:40:02 EST


-----BEGIN PGP SIGNED MESSAGE-----

David Hopwood wrote:
> A minor modification is needed to the grapheme breaking rules. Give
> preceding tehtar a new property 'Grapheme_Precede', following tehtar
> 'Grapheme_Extend', and add some rules to prevent breaking between
> Grapheme_Precede and a following character:
>
> Precede × Precede
> Precede × Base
>
> This is potentially useful for other scripts as well, and it wouldn't
> increase the complexity of grapheme breaking much.
>
> [Actually, I've just noticed that there are no rules "Extend × Extend"
> and "Extend × Link". Shouldn't there be? If there aren't, then there will
> be breaks within combining sequences, and between a combining sequence
> and GRAPHEME JOINER, for example.]

I've looked at this more closely, and I'm now sure that there are two
mistakes in the breaking rules in PDUTR #28: the one described above, and
the fact that in the rule to prevent breaking CRLF, 'not CR' is used instead
of 'CR'.

Here is a corrected version of the existing rules:

                CR × LF

              Base × Extend
            Extend × Extend
              Link × Base
              Link × Join_Control Base
              Base × Link
            Extend × Link

                 L × (L / V / LV / LVT)
          (LV / V) × (V / T)
         (LVT / T) × T

               Any ÷

[The six main rules can alternatively be written as:

   (Base / Extend) × (Extend / Link)
              Link × [Join_Control] Base
]

and here are some proposed modifications to support preceding combining
marks, and slightly change the behaviour of join controls (see below):

  Precede = Join_Control / Preceding_Tehta
  Extend = Me / Mn / Mc / Following_Tehta / Other_Extend \ Link
  Link = GRAPHEME_JOINER / Virama
  Base = Any \ CONTROL \ Zp \ Zs \ Precede \ Extend \ Link

                CR × LF

           Precede × Precede
           Precede × Base
              Base × Extend
            Extend × Extend
              Link × Precede
              Link × Base
              Base × Link
            Extend × Link

                 L × (L / V / LV / LVT)
          (LV / V) × (V / T)
         (LVT / T) × T

               Any ÷

[The eight main rules can alternatively be written as:

   (Base / Extend) × (Extend / Link)
  (Link / Precede) × (Precede / Base)

Note that the "Link × Join_Control Base" rule is implemented instead
by "Link × Precede" and "Precede × Base". This allows more than one
Join_Control to appear between the Link and Base, but that should make
no practical difference.]
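
As a rough illustration, the proposed rules can be sketched as a pairwise
"no break" test over character classes. This is only a sketch under stated
assumptions: the class labels are plain strings, each character is given a
single label, and the Hangul classes are treated as Base (which follows from
the definition of Base above, but is a simplification). The function names
no_break and clusters are hypothetical.

```python
# Hypothetical sketch of the proposed rules; class labels are plain strings.
HANGUL = {"L", "V", "T", "LV", "LVT"}
# Assumption: Hangul classes also count as Base, since
# Base = Any \ CONTROL \ Zp \ Zs \ Precede \ Extend \ Link.
BASE = {"Base"} | HANGUL


def no_break(prev, nxt):
    """Return True if the rules forbid a break between prev and nxt."""
    # CR × LF
    if prev == "CR" and nxt == "LF":
        return True
    # (Base / Extend) × (Extend / Link)
    if (prev in BASE or prev == "Extend") and nxt in {"Extend", "Link"}:
        return True
    # (Link / Precede) × (Precede / Base)
    if prev in {"Link", "Precede"} and (nxt == "Precede" or nxt in BASE):
        return True
    # L × (L / V / LV / LVT)
    if prev == "L" and nxt in {"L", "V", "LV", "LVT"}:
        return True
    # (LV / V) × (V / T)
    if prev in {"LV", "V"} and nxt in {"V", "T"}:
        return True
    # (LVT / T) × T
    if prev in {"LVT", "T"} and nxt == "T":
        return True
    # Any ÷
    return False


def clusters(classes):
    """Group the indices of a class sequence into grapheme clusters."""
    out = []
    for i, cls in enumerate(classes):
        if out and no_break(classes[i - 1], cls):
            out[-1].append(i)
        else:
            out.append([i])
    return out
```

For example, clusters(["Base", "Link", "Precede", "Base", "Base"]) keeps the
first four items together and starts a new cluster at the final Base.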

There are two differences in behaviour as a result of the modified rules:

 a) a sequence of join controls is considered to belong to the grapheme
    cluster that follows them.
 b) scripts such as Tengwar, which have characters that combine with the
    following base character, are supported.

a) means that there are no 'invisible' grapheme clusters as a result
of join controls. This means that additional arrow keystrokes are not
needed to step over join controls, and that join controls are
deleted when the grapheme that follows them is deleted.

(Of course, an editor could have a mode that makes normally invisible
controls visible; in that case they would be treated like base characters
for grapheme breaking.)
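
The effect of a) on cursor movement and deletion can be sketched on real
strings. The following is a deliberately reduced illustration, not a full
implementation: it assumes the join controls are ZWNJ (U+200C) and ZWJ
(U+200D), treats every Mn/Mc/Me character as Extend and everything else as
Base, and ignores Link, CR LF, Hangul, controls and spaces. The helper names
are hypothetical.

```python
import unicodedata

# Assumption: Join_Control is taken to be ZWNJ and ZWJ only.
JOIN_CONTROLS = {"\u200c", "\u200d"}


def char_class(ch):
    """Reduced classification: Precede, Extend, or Base."""
    if ch in JOIN_CONTROLS:
        return "Precede"
    if unicodedata.category(ch) in {"Mn", "Mc", "Me"}:
        return "Extend"
    return "Base"


def clusters(text):
    """Split text into clusters using only the Precede and Extend rules."""
    out = []
    for ch in text:
        cls = char_class(ch)
        prev = char_class(out[-1][-1]) if out else None
        joins = (
            # (Base / Extend) × Extend
            (prev in {"Base", "Extend"} and cls == "Extend")
            # Precede × (Precede / Base)
            or (prev == "Precede" and cls in {"Precede", "Base"})
        )
        if joins:
            out[-1] += ch
        else:
            out.append(ch)
    return out


def delete_last_cluster(text):
    """Delete the final cluster, as a backspace over one grapheme might."""
    return "".join(clusters(text)[:-1])
```

With these assumptions, "a" + ZWNJ + "b" + U+0301 splits into two clusters,
so one arrow keystroke steps over the join control together with the "b",
and deleting the last cluster removes the join control along with it.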

There can still be invisible grapheme clusters as a result of other
characters in the set 'Default_Ignorable_Code_Point'; those should
probably all be looked at more closely, to see whether it would be better
to put some of them in the Extend or Precede categories.

(For example, why are the Mongolian and generic variation selectors not
in Grapheme_Extend? I'm confused, because they are category Mn, and not
in Grapheme_Link, but Grapheme_Extend is supposed to have been generated
as 'Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link'. The file
versions I'm looking at are DerivedCoreProperties-3.2.0d4.txt and
PropList-3.2.0d6.txt.)
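
The stated derivation can be written out as a set expression, which makes it
easy to test a particular code point against it. This is a sketch only: it
takes general categories from Python's unicodedata module (whose Unicode
version depends on the Python release, not the 3.2.0 draft files), and the
Other_Grapheme_Extend and Grapheme_Link sets must be supplied by the caller.
The function name and the sample sets are hypothetical placeholders, not
data taken from the files mentioned above.

```python
import unicodedata


def derive_grapheme_extend(code_points, other_grapheme_extend, grapheme_link):
    """Me + Mn + Mc + Other_Grapheme_Extend - Grapheme_Link, over code_points."""
    marks = {
        cp for cp in code_points
        if unicodedata.category(chr(cp)) in {"Me", "Mn", "Mc"}
    }
    return (marks | set(other_grapheme_extend)) - set(grapheme_link)


# Mongolian free variation selectors and the generic variation selectors.
VARIATION_SELECTORS = [0x180B, 0x180C, 0x180D] + list(range(0xFE00, 0xFE10))

# Illustrative placeholder only: a Grapheme_Link set containing one virama.
SAMPLE_LINK = {0x094D}

derived = derive_grapheme_extend(
    VARIATION_SELECTORS + [0x094D],
    other_grapheme_extend=set(),
    grapheme_link=SAMPLE_LINK,
)
missing = [cp for cp in VARIATION_SELECTORS if cp not in derived]
```

If the variation selectors are category Mn and not in Grapheme_Link, the
formula as stated puts them in the derived set, so missing comes out empty
here; that is the discrepancy the question above is asking about.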

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPDZ07TkCAxeYt5gVAQHM/ggAw8tdn/hau+/IKsQsO0ouLB+RV4gVT/1c
JzwAhsLVxcw1KaJA1Jg2eExvc8B+FrCXQw+XpGOTKaje1WoyGJm3liZNIgLrRQ3M
z8da140ahfnhOcmlk13vGdGicJOutc7gJwDeoHPMU48JUqWR7eIv8GBLsXHOQ3Yn
CoXuIoKiF7fGYTbtCTV9Ow3h4ya11+S6SmCxr/NszqMddA+vVzB8kOnYe7u5fmTE
MHivd3B4e6fMm/RE6udmFn+gseQ4cRRj3C8UDRgnIyQOFVrrd2kbeO2Xek8HNOfn
cvBJOTPP672Z+BnigDXdunNm3txeaIgBfxCOO5/yORywgIdjQANzEw==
=XW57
-----END PGP SIGNATURE-----


