Grapheme clusters and east asian width
daniel.buenzli at erratique.ch
Wed Sep 16 16:34:17 CDT 2015
On Wednesday, 16 September 2015 at 21:27, Dominikus Dittes Scherkl wrote:
> Why adding them up?
> I think every grapheme cluster of hangul syllables would have simply
> width 2 - that is the concept of CJK characters.
I don't personally know how CJK characters behave in general with respect to width; that's why I'm asking. I'm just trying to find a simple, best-effort, data-driven algorithm for the problem at hand, using standard properties and, if possible, without making built-in assumptions about scripts.
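To make the idea concrete, here is a minimal sketch of what such a best-effort, property-driven estimate might look like, written in Python with only the standard `unicodedata` module. The function name `estimated_width` is hypothetical, and the rules are assumptions rather than anything standardized: normalize to NFC first (so that conjoining Hangul jamo compose into a single syllable instead of having their widths added up), count nonspacing marks, enclosing marks and format characters as zero columns, count East Asian Width W and F as two columns, and count everything else as one. It does not perform real grapheme cluster segmentation, and the result depends on the Unicode version of the Python build.

```python
import unicodedata

# General categories assumed to take no column of their own:
# nonspacing marks, enclosing marks, and format characters (e.g. ZWJ).
ZERO_WIDTH_CATEGORIES = ("Mn", "Me", "Cf")


def estimated_width(text: str) -> int:
    """Best-effort column estimate (hypothetical helper, not a standard API)."""
    width = 0
    # NFC composes sequences such as Hangul L+V+T jamo into one syllable,
    # so their individual widths are not summed.
    for ch in unicodedata.normalize("NFC", text):
        if unicodedata.category(ch) in ZERO_WIDTH_CATEGORIES:
            continue  # assumed to render on top of the preceding character
        if unicodedata.east_asian_width(ch) in ("W", "F"):
            width += 2  # Wide and Fullwidth: two columns
        else:
            width += 1  # N, Na, H, and (by assumption) A: one column
    return width


if __name__ == "__main__":
    for sample in ("abc", "日本語", "\ud55c", "x\u0301"):
        print(repr(sample), estimated_width(sample))
```

Treating Ambiguous (A) as narrow is a choice, not a rule; a terminal running in an East Asian locale may well render those characters in two columns.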
On Wednesday, 16 September 2015 at 20:33, Richard Wordingham wrote:
> Have you addressed the issue of Indic scripts? There are
> discontiguous grapheme clusters composed of indecomposable code points
> (e.g. U+17C4 KHMER VOWEL SIGN OO) and of decomposable code points (e.g.
> U+0BCA TAMIL VOWEL SIGN OO),
Not sure I understand what you mean here.
> and whether consonant + virama + consonant is one cell or two may even depend on the font (e.g.
Well, anything related to font metrics is out of scope from the point of view of a tty, since I can't get that information. For example, it seems that U+1F400 to U+1F579 have an East Asian Width of N but actually occupy two columns in the built-in OS X terminal; of course, these scalar values are not East Asian text per se.
> How are you handling ligatures between grapheme clusters,
> e.g. English <f, i>?
Here again I'd need font information for that; I expect the tty not to make ligatures between f and i.
Of course, the best way would be to hand a string to the tty for it to measure. But then it already seems impossible to test whether a terminal is able to handle UTF-8 or not…
Maybe trying to use the East Asian Width property was not a good idea to start with.