On Mon, 28 May 2018 20:03:11 +0530
SundaraRaman R via Unicode <unicode_at_unicode.org> wrote:
> Hi, thanks for your reply.
>
> > There is only one character with a canonical combining class of 9
> > that is included as other_alphabetic, namely U+0E3A THAI CHARACTER
> > PHINTHU. That last had any of the other properties of viramas back
> > in Unicode 1.0; the characters that triggered such behaviours were
> > permanently removed in Unicode 1.1.
>
> I didn't understand the second sentence here, could you clarify?
Sorry, I messed that sentence up. It should have read, "The last time
that that had any of the other properties of viramas was back in
Unicode 1.0;".
> What
> do you mean by "any of the other properties" here?
The effects of virama that spring to mind are:
(a) Causing one or both letters on either side to change or combine to
indicate combination;
(b) Appearing as a mark only if it does not affect one of the letters
on either side;
(c) Causing a left matra to appear on the left of the sequence of
consonants joined by a sequence of non-visible viramas.
> And "triggered such
> behaviours" seems to imply having them in other_alphabetic had
> negative consequences, could you give an example of what that might
> be?
Nowadays, the Thai syllable ไตร, normatively pronounced /trai/, is
only encoded <U+0E44 THAI CHARACTER SARA AI MAIMALAI, U+0E15 THAI
CHARACTER TO TAO, U+0E23 THAI CHARACTER RO RUA>, and the character
U+0E3A is always visible when used; for most routine purposes it is
little different to U+0E38 THAI CHARACTER SARA U. However, in Unicode
1.0, while <U+0E44, U+0E15, U+0E23> was rendered as at present, the same
visible string could also be encoded as <U+0E15, U+0E3A, U+0E23, U+0E74
THAI PHONETIC ORDER VOWEL SIGN SARA MAI MALAI> - no glyph would be
rendered for U+0E3A. If one wanted the official Sanskritised Pali
version, one could type ไตฺร <U+0E44, U+0E15, U+0E3A, U+0E23> as at
present. One could also encode it as <U+0E15, U+0E3A, U+200C, U+0E23,
U+0E74>.
Weirdly, I couldn't have used the phonetically ordered vowel to type a
monk's name ending in มฺโม <U+0E21 THAI CHARACTER MO MA, U+0E3A, U+0E42
THAI CHARACTER SARA O, U+0E21>, as <U+0E21, U+0E3A, U+200C, U+0E21,
U+0E72 THAI PHONETIC ORDER VOWEL SIGN O> would have been rendered as
โมฺม.
As the non-phonetic virama-like behaviours of U+0E3A are only mentioned
under the heading 'Alternate Ordering', I can only presume that they
were triggered by the phonetic order vowel signs, U+0E70 to U+0E74.
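
For anyone who wants to check the present-day situation, here is a minimal sketch using Python's standard unicodedata module. It lists the code points in the two current spellings and looks up the general category of U+0E70 to U+0E74. The helper name describe is only illustrative, and the result for U+0E70 to U+0E74 assumes they are unassigned in the Unicode data bundled with the Python build in use.

```python
import unicodedata

# Present-day spellings, stored in visual order (vowel first).
trai = "\u0E44\u0E15\u0E23"                # ไตร
trai_phinthu = "\u0E44\u0E15\u0E3A\u0E23"  # ไตฺร

def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch, "<unassigned>"))
        for ch in text
    ]

for label, text in (("trai", trai), ("trai with phinthu", trai_phinthu)):
    print(label)
    for code, name in describe(text):
        print("   ", code, name)

# General category "Cn" means that no character is assigned at that
# code point in the Unicode version this Python build ships with.
removed = [chr(cp) for cp in range(0x0E70, 0x0E75)]
removed_categories = [unicodedata.category(ch) for ch in removed]
print(removed_categories)
```

This says nothing about how Unicode 1.0 rendered anything; it only shows what a current library reports for these code points.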
It is possible that U+0E3A acquired the alphabetic property because it
ceased to behave like a virama. Alternatively, it may have acquired
the alphabetic property because of its use in the compound vowels of
minority languages.
> But in the case of Tamil, I'm curious why most other combining Tamil
> marks go in class 0, whereas pulli goes in 9. Even u0B82 Anusvara, a
> character barely used in Tamil text, has combining class 0 and is
> included in Other_Alphabetic, but the visually similar and similarly
> positioned pulli is not. In this particular case, is it a historical
> accident that these got assigned this way, or is there a rationale
> behind these? Would it at all be possible to get this changed in the
> upcoming Unicode standard?
Tamil has usually been treated as just another Indian Indic script.
U+0E3A is the only virama-like character with the property of being
'alphabetic'.
I can't see a change making it into Unicode 11.0. It requires too much
careful thought. Besides, anything that considered <pulli> as
alphabetic should also consider <pulli, ZWNJ> as alphabetic - they
should be mostly interchangeable in Tamil.
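
The combining-class difference raised in the question can be inspected with a short Python sketch. It uses only the standard unicodedata module, which exposes the general category and canonical combining class but, as far as I know, not Other_Alphabetic itself, so the derived property would have to be looked up elsewhere. The names marks and mark_info are only illustrative.

```python
import unicodedata

marks = {
    "Tamil anusvara": "\u0B82",
    "Tamil pulli": "\u0BCD",
    "Thai phinthu": "\u0E3A",
}

def mark_info(ch):
    """Return (name, general category, canonical combining class)."""
    return (
        unicodedata.name(ch),
        unicodedata.category(ch),
        unicodedata.combining(ch),
    )

for label, ch in marks.items():
    print(label, mark_info(ch))

# KA + pulli + ZWNJ: the joiner is a format character (Cf), so a test
# that accepts pulli would need to accept this sequence as well.
pulli_zwnj = "\u0B95\u0BCD\u200C"
print([unicodedata.category(ch) for ch in pulli_zwnj])
```

All three marks share the general category Mn, so the general category alone cannot separate the anusvara from the pulli; the distinction lives in the combining class and in the derived properties.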
> > I fear that the correct test for what you want is to split text into
> > words and check that each word begins with an alphabetic
> > character.
>
> Do you mean "each grapheme cluster begins with an alphabetic
> character" here? It seems to me (in my very limited Unicode knowledge)
> that such a test, going through grapheme clusters and checking the
> first codepoint in each, would also ensure the text is full
> alphabetic.
Not directly. Is the string "mark2mark" alphabetic? It constitutes a
single word. My suggested simplification would say 'no', as it
contains '2'; perhaps my simplification is wrong.
> And it has the advantage that more languages have a
> (relatively) easy way for splitting text into grapheme clusters, than
> for checking minor Unicode properties like WordBreak, so this one
> might be easier to implement. Does this test fall anywhere in the
> ballpark of being right?
Yes, it's close to being right. Note that simple approximations for SE
Asian word-breaking (e.g. treating SE Asian characters as
alphabetic) should work well for your application.
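
A rough sketch of the word-based test in Python follows, under some loud assumptions: words are split on whitespace only (not by the UAX #29 word-break rules), general category L* stands in for the Alphabetic property, and any mark (M*) or ZWNJ/ZWJ is accepted after the first character. The function name is_alphabetic_text is hypothetical.

```python
import unicodedata

JOINERS = {"\u200C", "\u200D"}  # ZWNJ, ZWJ

def is_alphabetic_text(text):
    """Approximate test: every whitespace-separated word starts with a
    letter and otherwise contains only letters, marks, and joiners."""
    words = text.split()
    if not words:
        return False
    for word in words:
        # Assumption: general category L* approximates Alphabetic.
        if not unicodedata.category(word[0]).startswith("L"):
            return False
        for ch in word[1:]:
            cat = unicodedata.category(ch)
            if not (cat[0] in "LM" or ch in JOINERS):
                return False
    return True

tamil = "\u0BA4\u0BAE\u0BBF\u0BB4\u0BCD"  # தமிழ்
print(tamil.isalpha())                  # str.isalpha() rejects the marks
print(is_alphabetic_text(tamil))
print(is_alphabetic_text("mark2mark"))  # the digit is rejected
```

Accepting every mark is deliberately lax; a stricter version would consult the real Alphabetic property and treat the pulli and <pulli, ZWNJ> cases explicitly.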
Richard.
Received on Mon May 28 2018 - 11:46:04 CDT