Asmus Freytag (c)
asmusf at ix.netcom.com
Sun Sep 18 15:02:01 CDT 2016
On 9/18/2016 3:26 AM, Janusz S. Bien wrote:
> Quote/Cytat - Christoph Päper <christoph.paeper at crissov.de> (pią, 16
> wrz 2016, 23:51:38):
>> Janusz S. Bień <jsbien at mimuw.edu.pl>:
>>> 1. Graphemes, if I understand correctly, are language dependent, …
>> That’s true in linguistic terminology – well, at least within the
>> more popular schools of thought –, but not in technical (i.e.
>> Unicode) jargon.
> From the Unicode glossary:
> Grapheme. (1) A minimally distinctive unit of writing in the context
> of a particular writing system.[...] (2) What a user thinks of as a
> character.
"writing system" is vague enough to cover variations that might be
regional or language dependent.
> As for (2), cf.
> User-Perceived Character. What everyone thinks of as a character in
> their script.
> So we have "a user" versus "everyone...in their script" - is the
> difference intentional? Probably not. Anyway the definitions are
> language/locale dependent.
The "everyone" here aims at a shared understanding.
This becomes tricky in the case of abugidas. There's certainly a shared
understanding that the "unit of writing" is the syllable, rather than the
individual marks, but the latter do have well-understood identities, not
least for teaching. That's perhaps the reason why there's the handwaving
about "minimally distinctive".
In some scripts like that, users can enter multiple sequences of
characters that resolve (for all practical purposes) into the same
syllable. (A big part of that in some scripts is that Unicode does not
always provide a means to normalize the order of subsidiary signs and
marks, typically combining marks.)
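To make the normalization point concrete, here is a minimal Python sketch
using the standard unicodedata module. The choice of characters is mine and
only illustrative: the two Devanagari marks were picked because both have
canonical combining class 0, not because the two orders are necessarily
considered the same syllable by users of that script.

```python
import unicodedata

KA = "\u0915"        # DEVANAGARI LETTER KA
VOWEL_U = "\u0941"   # DEVANAGARI VOWEL SIGN U
ANUSVARA = "\u0902"  # DEVANAGARI SIGN ANUSVARA

# Latin marks with different non-zero classes are reordered by normalization.
acute, dot_below = "\u0301", "\u0323"  # classes 230 and 220
latin_a = unicodedata.normalize("NFD", "a" + acute + dot_below)
latin_b = unicodedata.normalize("NFD", "a" + dot_below + acute)

# Both Devanagari marks have class 0, so their typed order is preserved.
indic_a = unicodedata.normalize("NFD", KA + VOWEL_U + ANUSVARA)
indic_b = unicodedata.normalize("NFD", KA + ANUSVARA + VOWEL_U)

print(latin_a == latin_b, indic_a == indic_b)
```

The first comparison succeeds because canonical ordering sorts marks with
different non-zero classes; the second fails because class 0 marks are never
reordered.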
For some tasks it would be great to have only well-formed syllables; but
to do that, you would need to add additional interpretation on top of
the Unicode definitions of a grapheme cluster.
If you just wrap the raw combining sequences into textels, then some
tasks might not actually get simpler. Instead of a simple rule that
determines which alternate orderings of marks are equivalent (to account
for users not typing them in the preferred order) you would have to
exhaustively list all combinations and set up equivalence tables.
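A rough Python sketch of that contrast follows. The PREFERRED_RANK table and
the reorder_marks helper are hypothetical, invented for illustration; they
are not taken from any Unicode data file, and a real script would need its
own carefully researched ordering.

```python
from itertools import permutations

# Hypothetical preferred order for a few Devanagari marks (illustration
# only): a lower rank sorts first.
PREFERRED_RANK = {
    "\u093C": 0,  # DEVANAGARI SIGN NUKTA
    "\u0941": 1,  # DEVANAGARI VOWEL SIGN U
    "\u0902": 2,  # DEVANAGARI SIGN ANUSVARA
}

def reorder_marks(text):
    """Sort each run of ranked marks into the preferred order."""
    out, run = [], []
    for ch in text:
        if ch in PREFERRED_RANK:
            run.append(ch)
        else:
            out.extend(sorted(run, key=PREFERRED_RANK.__getitem__))
            run = []
            out.append(ch)
    out.extend(sorted(run, key=PREFERRED_RANK.__getitem__))
    return "".join(out)

base = "\u0915"  # DEVANAGARI LETTER KA
marks = list(PREFERRED_RANK)

# Rule-based: every typed ordering collapses to a single string.
rule_forms = {reorder_marks(base + "".join(p)) for p in permutations(marks)}

# Table-based: with opaque textels, each ordering needs its own entry.
table = {base + "".join(p): base + "".join(marks) for p in permutations(marks)}

print(len(rule_forms), len(table))
```

With three marks the table already needs six entries for one base letter,
and it grows factorially with the number of marks, while the rule stays the
same size.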