On 2019-01-27 11:38 PM, Richard Wordingham via Unicode wrote:
> On Sun, 27 Jan 2019 19:57:37 +0000
> James Kass via Unicode <unicode_at_unicode.org> wrote:
>
>> On 2019-01-27 7:09 PM, James Tauber via Unicode wrote:
>>> In my original post, I asked if a language-specific tailoring of
>>> the text segmentation algorithm was the solution but no one here
>>> has agreed so far.
>> If there are likely to be many languages requiring exceptions to the
>> segmentation algorithm wrt U+2019, then perhaps it would be better to
>> establish conventions using ZWJ/ZWNJ and adjust the algorithm
>> accordingly so that it would be cross-language. (Rather than
>> requiring additional and open-ended language-specific tailorings.) (I
>> inserted several combinations of ZWJ/ZWNJ into James Tauber's
>> example, but couldn't improve the segmentation in LibreOffice,
>> although it was possible to make it worse.)
> If you look at TR29, you will see that ZWJ should only affect word
> boundaries for emoji. ZWNJ shall have no effect. What you want is a
> control that joins words, but we don't have that.
>
> Richard.
>
(https://unicode.org/reports/tr29/)
It’s been said that the text segmentation rules seem over-complicated
and are probably non-trivial to implement properly. I tried your
suggestion of WORD JOINER (U+2060) after the tau ( γένοιτ’ ἄν ), but it
only added yet another word break in LibreOffice.
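To make the default behavior concrete, here is a minimal Python sketch of the three UAX #29 rules that decide this case. It is a toy, not a conformant implementation: str.isalpha() stands in for Word_Break=ALetter, and only U+2019 (classified MidNumLet, per WordBreakProperty.txt) gets a special class.

```python
def wb_class(ch):
    # Toy stand-in for the Word_Break property.
    if ch == '\u2019':               # RIGHT SINGLE QUOTATION MARK
        return 'MidNumLet'
    if ch.isalpha():
        return 'ALetter'
    return 'Other'

def word_breaks(text):
    """Boundary indices under rules WB5, WB6/WB7, and WB999 only."""
    bounds = [0]
    for i in range(1, len(text)):
        prev, cur = wb_class(text[i - 1]), wb_class(text[i])
        # WB5: do not break between letters.
        if prev == 'ALetter' and cur == 'ALetter':
            continue
        # WB6: letter x apostrophe, but only if a letter follows.
        if (prev == 'ALetter' and cur == 'MidNumLet'
                and i + 1 < len(text)
                and wb_class(text[i + 1]) == 'ALetter'):
            continue
        # WB7: apostrophe x letter, but only if a letter precedes it.
        if (prev == 'MidNumLet' and cur == 'ALetter'
                and i >= 2 and wb_class(text[i - 2]) == 'ALetter'):
            continue
        bounds.append(i)             # WB999: otherwise, break everywhere.
    bounds.append(len(text))
    return bounds

def segment(text):
    b = word_breaks(text)
    return [text[b[i]:b[i + 1]] for i in range(len(b) - 1)]

# A word-internal apostrophe is kept attached (WB6/WB7)...
print(segment('don\u2019t'))        # ['don’t']
# ...but the elided Greek form splits the apostrophe off, because only
# a space follows it, so neither WB6 nor WB7 applies:
print(segment('γένοιτ\u2019 ἄν'))   # ['γένοιτ', '’', ' ', 'ἄν']
```

This is exactly the complaint in the thread: the trailing elision apostrophe falls out of the word under the default (untailored) rules.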
The problem may stem from the fact that WORD JOINER is supposed to be
treated as though it were a ZERO WIDTH NO-BREAK SPACE (U+FEFF). In other
words, it is a *space*, and as a space it indicates a word break. That
doesn’t seem right.
Instead of treating WORD JOINER as a SPACE, why not treat it as a WORD
JOINER? Doing so could avoid a lot of undesirable string segmentation,
in addition to possibly minimizing future language-specific tailorings
and easing the burden on implementers.
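For what it's worth, a quick check of the character data suggests a conformant segmenter already ignores U+2060 rather than breaking at it: WordBreakProperty.txt assigns it Word_Break=Format, and UAX #29 rule WB4 says Format characters are skipped when computing word boundaries. A rough sketch (WB4 simplified; the real rule carves out ZWNJ and ZWJ, which are also General_Category Cf):

```python
import unicodedata

# U+2060 WORD JOINER is a format character (General_Category Cf), and
# rule WB4 ignores such characters for word-boundary purposes -- so a
# conformant segmenter should treat text with and without the joiner
# identically: neither adding a break (as LibreOffice did here) nor
# removing the one at the space.
assert unicodedata.category('\u2060') == 'Cf'

def drop_format(text):
    # Simplified WB4: strip Cf characters before applying the rules.
    # (The real rule exempts ZWNJ and ZWJ, which are also Cf.)
    return ''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

print(drop_format('γένοιτ\u2060\u2019 ἄν') == 'γένοιτ\u2019 ἄν')  # True
```

Under this reading, the extra break LibreOffice produced would be a conformance bug rather than what TR29 prescribes — though, as Richard notes, even a transparent U+2060 cannot glue two space-separated words together.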
Received on Sun Jan 27 2019 - 21:49:21 CST