From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Feb 10 2007 - 09:15:00 CST
Philippe Verdy wrote on Saturday, February 10, 2007 1:27 AM
> Lokesh Joshi wrote on Thursday, February 08, 2007 4:45 AM
>> If possible, can anyone please confirm that the Thai Unicode sequence
>>
>> U+0E25 (THAI CHARACTER LO LING) U+0E37 (THAI CHARACTER SARA UEE)
>> U+0E4C (THAI CHARACTER THANTHAKHAT)
>>
>> is a valid sequence? As far as I know Thai, this seems to be an
>> invalid sequence; among the vowels written above, only SARA I (U+0E34)
>> is valid before THANTHAKHAT.
> Given that you use the U+xxxx notation, you are asking about the
> validity of a sequence of code points. As all these code points are valid
> individually, the sequence is valid, and can be successfully encoded and
> decoded with all compliant UTF-* transforms or any compliant encoding.
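The encoding half of that claim is easy to check mechanically. A minimal sketch in Python (my choice of language; the names `sequence` and `round_trips` are purely illustrative), showing that the three code points survive a round trip through the UTF-8, UTF-16 and UTF-32 codecs:

```python
# Lokesh's sequence: LO LING, SARA UEE, THANTHAKHAT.
sequence = "\u0e25\u0e37\u0e4c"


def round_trips(text, encoding):
    """Return True if text is unchanged after encoding and decoding."""
    return text.encode(encoding).decode(encoding) == text


for encoding in ("utf-8", "utf-16", "utf-32"):
    print(encoding, round_trips(sequence, encoding))
```

This only shows that the sequence is encodable; it says nothing about whether the order of the marks makes sense for Thai.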
I think Lokesh was struggling for the right term of denigration.
'Defective' is the closest I can find, but it does not quite address the
issue. In various places, the TUS states what order characters should occur
in, but doesn't define a term for sequences whose characters are in the
wrong order. For example, in the Myanmar script (TUS 5.0 Section 11.3),
vowel signs below precede vowel signs above. This can't be handled by
(standard) canonical equivalence, because both sets have combining class 0.
Similarly, in most Indic scripts, a vowel mark above should follow a
subscript consonant in the same akshara, though there are a few cases (e.g.
Khmer) where both orders occur and the difference is significant.
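The Myanmar point can be illustrated with a short sketch. It assumes Python's standard unicodedata module, and I have picked U+102D VOWEL SIGN I (above) and U+102F VOWEL SIGN U (below) on KA as a representative pair; the helper name `equivalent_under` is hypothetical. Both signs have combining class 0, so no normalization form reorders them and the two orders never compare equal:

```python
import unicodedata

KA = "\u1000"      # MYANMAR LETTER KA
SIGN_I = "\u102d"  # MYANMAR VOWEL SIGN I, rendered above
SIGN_U = "\u102f"  # MYANMAR VOWEL SIGN U, rendered below

below_then_above = KA + SIGN_U + SIGN_I
above_then_below = KA + SIGN_I + SIGN_U

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def equivalent_under(form, a, b):
    """Return True if a and b normalize to the same string in this form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


# Canonical reordering only moves marks with a non-zero combining class.
print(unicodedata.combining(SIGN_I), unicodedata.combining(SIGN_U))

for form in FORMS:
    print(form, equivalent_under(form, below_then_above, above_then_below))
```

So any check for the wrong order has to come from script-specific rules, not from normalization.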
> What may be intriguing is that the Thai TIS standard may have restricted
> the validity of those strings when they are encoded with this national
> standard (not sure about that).
It's a slight overreaction to the problem of typing the Thai word for
water -
1) <U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI
CHARACTER SARA AM>
2) <U+0E19 NO NU, U+0E49 MAI THO, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32
THAI CHARACTER SARA AA>
3) <U+0E19 NO NU, U+0E4D NIKHAHIT, U+0E49 MAI THO, U+0E32 SARA AA>
Coding 1 is the recommended method, and is in NFC and NFD. Coding 2 is the
NFKC and NFKD form of Coding 1. It often misrenders, though it would work with
a mechanical typewriter or dumb font. Windows XP (but not Word XP and
probably not Vista) rejects it as keyboard input. Coding 3 is in NFC, NFD,
NFKC and NFKD forms, renders well, can easily be entered through the
generally restrictive Windows XP input editor, and is WRONG for Siamese.
(Does anyone know the correct denigration?) Uniscribe effectively aims to
convert Coding 1 to Coding 3, though I suspect a font could maintain an
artificial distinction. WTT 2.0 only allows Coding 1 - or, to be pedantic,
its TIS 620 equivalent.
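The normalization claims for the three codings can be checked with a minimal sketch, again assuming Python's unicodedata module; the variable names and the helper `stable_forms` are illustrative only. It lists the normalization forms in which each coding is already stable, and confirms that compatibility normalization turns Coding 1 into Coding 2:

```python
import unicodedata

FORMS = ("NFC", "NFD", "NFKC", "NFKD")

coding1 = "\u0e19\u0e49\u0e33"        # NO NU, MAI THO, SARA AM
coding2 = "\u0e19\u0e49\u0e4d\u0e32"  # NO NU, MAI THO, NIKHAHIT, SARA AA
coding3 = "\u0e19\u0e4d\u0e49\u0e32"  # NO NU, NIKHAHIT, MAI THO, SARA AA


def stable_forms(text):
    """Return the normalization forms that leave text unchanged."""
    return [f for f in FORMS if unicodedata.normalize(f, text) == text]


for name, text in (("1", coding1), ("2", coding2), ("3", coding3)):
    print("Coding", name, stable_forms(text))

# SARA AM has only a compatibility decomposition, so NFKC and NFKD split it.
print(unicodedata.normalize("NFKC", coding1) == coding2)
print(unicodedata.normalize("NFKD", coding1) == coding2)
```

Nothing here turns Coding 1 or Coding 2 into Coding 3: NIKHAHIT has combining class 0, so MAI THO is never reordered across it.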
According to TUS 5.0 Section 11.1, Coding 2 should render with NIKHAHIT
above MAI THO. Dare one suggest that following SARA AA should change the
behaviour?
> Your question is similar to asking if the sequence string "qzlkqw" is
> valid using only Latin consonants;
No. The correct Thai analogue of "qzlkqw" as intended would be, say, <TO
TAO, SARA I, NIKHAHIT, TO TAO, SARA U, NIKHAHIT>. Both are potentially
interpretable, though. Are you sure no-one has used "qzlkqw" as his private
notation for a hypothetical PIE compound **kþl̥k̂ku? (I'd prefer
**tkl̥k̂ku-.)
Richard.