Re: Grapheme cluster boundaries and left-side spacing dependent vowels

From: Mark Davis (mark.davis@jtcsv.com)
Date: Tue Apr 22 2003 - 15:55:31 EDT

    To add on to what Ken has said, what UAX #29 does is define default grapheme
    cluster boundaries. While these form a well-defined core which can be very
    useful in language-independent processing, for particular languages a
    tailored grapheme cluster may be more useful, consisting of one or more
    default grapheme clusters. Examples of this are given in UAX #29.
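
    As a rough illustration of that layering, here is a minimal Python sketch.
    It is not the UAX #29 algorithm: it approximates "non-spacing marks" by
    the general categories Mn and Me, handles only conjoining jamo in
    U+1100..U+11FF, and ignores precomposed Hangul syllables and everything
    else in the rules. The names default_clusters, tailor, and join_ch are
    made up for this example, and the "ch" digraph is just one hypothetical
    tailoring.

```python
import unicodedata


def jamo_type(ch):
    """Classify a conjoining Hangul jamo as 'L', 'V', or 'T' (else None).

    Assumption: only the basic Hangul Jamo block (U+1100..U+11FF) is covered.
    """
    cp = ord(ch)
    if 0x1100 <= cp <= 0x115F:
        return "L"  # leading consonant
    if 0x1160 <= cp <= 0x11A7:
        return "V"  # vowel
    if 0x11A8 <= cp <= 0x11FF:
        return "T"  # trailing consonant
    return None


def no_break_between(prev, cur):
    """Return True if no boundary should fall between prev and cur."""
    # Simplification: treat Mn/Me characters as extending the previous one.
    if unicodedata.category(cur) in ("Mn", "Me"):
        return True
    p, c = jamo_type(prev), jamo_type(cur)
    # Keep jamo sequences of the shape L* V* T* together.
    if p == "L" and c in ("L", "V"):
        return True
    if p == "V" and c in ("V", "T"):
        return True
    if p == "T" and c == "T":
        return True
    return False


def default_clusters(text):
    """Split text into approximate default grapheme clusters."""
    clusters = []
    for ch in text:
        if clusters and no_break_between(clusters[-1][-1], ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


def tailor(clusters, should_join):
    """Build tailored clusters by merging adjacent default clusters.

    should_join(left, right) decides whether right is merged into left,
    so every tailored cluster is one or more default clusters.
    """
    tailored = []
    for cluster in clusters:
        if tailored and should_join(tailored[-1], cluster):
            tailored[-1] += cluster
        else:
            tailored.append(cluster)
    return tailored


def join_ch(left, right):
    """Hypothetical language tailoring: treat the digraph 'ch' as one unit."""
    return left == "c" and right == "h"
```

    With these definitions, default_clusters("e\u0301x") keeps the base letter
    and its accent together, and tailor(default_clusters("chata"), join_ch)
    merges "c" and "h" without ever splitting a default cluster.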

    Mark
    (مرقص بن داود)
    ________
    mark.davis@jtcsv.com
    IBM, MS 50-2/B11, 5600 Cottle Rd, SJ CA 95193
    (408) 256-3148
    fax: (408) 256-0799

    ----- Original Message -----
    From: "Kenneth Whistler" <kenw@sybase.com>
    To: <Peter_Constable@sil.org>
    Cc: <unicode@unicode.org>; <kenw@sybase.com>
    Sent: Tuesday, April 22, 2003 11:45
    Subject: Re: Grapheme cluster boundaries and left-side spacing dependent vowels

    > Peter Constable wrote:
    >
    > > Jungshik Shin wrote on 04/21/2003 09:27:04 PM:
    > >
    > > > I think two cases are distinct. In bidi text, bouncing back and
    > > > forth is across grapheme boundaries while in what James described,
    > > > it's within a single grapheme.
    > >
    > > Well, wasn't the point of James' comments: to determine whether the
    > > Indic sequences *should* be considered a grapheme?
    >
    > It's up to implementations, applications, and graphologists to
    > decide.
    >
    > The UTC made a brief foray onto the unforgiving ground of trying
    > to determine grapheme status and grapheme boundaries, but after
    > wrestling with the issue of trying to define "unithood" inside
    > Indic orthographic syllables, backed off again.
    >
    > UAX #29 now has a very streamlined definition of "default
    > grapheme cluster boundaries" which basically amounts to
    > trying to keep boundaries from falling within sequences of
    > base letters + non-spacing marks or within sequences of
    > jamos that constitute a Korean syllable. That's it.
    > UAX #29 default grapheme cluster boundaries don't even attempt
    > to specify whether Devanagari consonant conjuncts, or
    > aksharas, or orthographic syllables, or Indic constructs involving
    > vowels behaving as chunks of conjunct forms, or whatnot constitute
    > graphemes. Such determinations are basically out-of-scope for
    > Unicode, in my opinion.
    >
    > --Ken
