Richard has given some cogent arguments below.
Another counterexample is the use of ":" to form abbreviations in
Swedish. (It's inserted in the word to replace the elided part.) In that
use, this punctuation character is suddenly part of a "word".
To handle the full set of general cases, word recognition has to be
plenty smart (and context- or environment-sensitive). The basic,
untailored "default" word breaking algorithm will only ever get the
plain vanilla cases right.
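
To make the Swedish case concrete, here is a rough sketch in Python. It
is not the UAX #29 algorithm, only a toy "runs of letters" rule standing
in for an untailored default, plus a hypothetical tailoring that accepts
":" between letters. The names DEFAULT_WORD, TAILORED_WORD and words()
are made up for illustration.

```python
import re

# Toy stand-in for an untailored default: a word is a maximal run of
# letters ([^\W\d_] matches word characters minus digits and "_").
DEFAULT_WORD = re.compile(r"[^\W\d_]+")

# Hypothetical Swedish tailoring: additionally allow ":" when it sits
# directly between two runs of letters, as in "S:t" or "k:a".
TAILORED_WORD = re.compile(r"[^\W\d_]+(?::[^\W\d_]+)*")


def words(text, pattern=DEFAULT_WORD):
    """Return the list of words found in text under the given rule."""
    return pattern.findall(text)


sample = "S:t Eriksgatan, k:a"
print(words(sample))                 # the default rule splits at ":"
print(words(sample, TAILORED_WORD))  # the tailored rule keeps "S:t", "k:a"
```

Even the tailored rule is naive: it would also glue together two words
separated by an ordinary colon with no following space, which is why
real word recognition needs context.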
Basing decisions about the encoding of characters on the failings of
such simple-minded algorithms is really a non-starter. (The few existing
exceptions just prove the rule.)
A./
On 3/9/2013 6:52 PM, Richard Wordingham wrote:
> On Sat, 09 Mar 2013 16:21:17 -0700
> Karl Williamson <public_at_khwilliamson.com> wrote:
>
>> Rendering is not the only consideration. Processing textual content
>> for 0387 is broken because it is considered to be an ID_Continue
>> character, whereas its Greek usage is equivalent to the English
>> semicolon, something that would never occur in the middle of a word
>> nor an identifier.
> ID_Continue is for processing things like variable names. How does
> allowing U+0387 in variable names cause problems in the processing of
> text?
>
> How would ID_continue allow you to process English «foc’s’le» or
> «co-operate»? The default word boundary determination has been
> tailored to give you the right results, and should work for Greek unless
> you are working with scripta continua, in which case you have massive
> problems regardless.
>
> Note also that word boundary determination is intended to be
> tailorable, which would allow one to exclude U+00B7 and U+0387 from
> words or deal with miscoded accents and breathings physically at the
> start of a word beginning with a capitalised vowel. One should also be
> able to tailor it to deal with word final apostrophes - though doing
> that in the CLDR style could be computationally excessive if the text
> may contain quoting apostrophes. One might even tailor it to allow
> Greek «ὅ,τι», depending on whether one wishes to count it as a word.
>
> Richard.
>
>
>
Received on Sat Mar 09 2013 - 21:22:34 CST