Re: Query for Validity of Thai Sequence

From: Richard Wordingham (richard.wordingham@ntlworld.com)
Date: Sat Feb 10 2007 - 09:15:00 CST

    Philippe Verdy wrote on Saturday, February 10, 2007 1:27 AM

    > Lokesh Joshi wrote on Thursday, February 08, 2007 4:45 AM

    >> If possible, can anyone please confirm that the Thai Unicode sequence:
    >>
    >> U+0E25 (THAI CHARACTER LO LING) U+0E37 (THAI CHARACTER SARA UEE)
    >> U+0E4C (THAI CHARACTER THANTHAKHAT)
    >>
    >> is a valid sequence? As far as I know Thai, this seems to be an
    >> invalid sequence: among the above vowels, only SARA I (U+0E34) is valid
    >> before THANTHAKHAT.

    > Given that you use the U+xxxx notation, you are asking about the
    > validity of a sequence of code points. As all these code points are valid
    > individually, the sequence is valid, and can be successfully encoded and
    > decoded with all compliant UTF-* transforms or any compliant encoding.
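
    Philippe's point about validity at the code point level can be checked
    mechanically. The following is a minimal sketch using only Python's
    standard library; the variable names are illustrative and not from the
    thread. It shows only that the three code points are assigned and that
    the sequence survives a round trip through the UTF encoding forms. It
    says nothing about whether the sequence is good Thai spelling.

```python
import unicodedata

# The sequence Lokesh asked about: LO LING, SARA UEE, THANTHAKHAT.
seq = "\u0e25\u0e37\u0e4c"

# Each code point is assigned, so each one has a character name.
names = [unicodedata.name(ch) for ch in seq]

# Encode and decode with each UTF encoding form and compare with the input.
round_trips = {
    enc: seq.encode(enc).decode(enc) == seq
    for enc in ("utf-8", "utf-16", "utf-32")
}

for ch, name in zip(seq, names):
    print(f"U+{ord(ch):04X} {name}")
print(round_trips)
```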

    I think Lokesh was struggling for the right term of denigration.
    'Defective' is the closest I can find, but it does not quite address the
    issue. In various places, the TUS states what order characters should occur
    in, but doesn't define a term for sequences whose characters are in the
    wrong order. For example, in the Myanmar script (TUS 5.0 Section 11.3),
    vowel signs below precede vowel signs above. This can't be handled by
    (standard) canonical equivalence, because both sets have combining class 0.
    Similarly, in most Indic scripts, a vowel mark above should follow a
    subscript consonant in the same akshara, though there are a few cases (e.g.
    Khmer) where both orders occur and the difference is significant.
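
    The combining class point can be illustrated with a short sketch in
    Python (standard library only). The choice of MYANMAR VOWEL SIGN I and
    MYANMAR VOWEL SIGN U as the "above" and "below" signs, and the Thai
    pair used for contrast, are my own picks for illustration. Because both
    Myanmar signs have combining class 0, normalization leaves the two
    orders distinct, whereas Thai SARA U and MAI EK have different non-zero
    classes and so are reordered into a single form.

```python
import unicodedata

KA = "\u1000"       # MYANMAR LETTER KA
SIGN_I = "\u102d"   # MYANMAR VOWEL SIGN I (written above)
SIGN_U = "\u102f"   # MYANMAR VOWEL SIGN U (written below)

# Canonical combining classes of the two Myanmar vowel signs.
myanmar_ccc = {unicodedata.name(ch): unicodedata.combining(ch)
               for ch in (SIGN_I, SIGN_U)}

below_then_above = KA + SIGN_U + SIGN_I
above_then_below = KA + SIGN_I + SIGN_U

# Do the two Myanmar orders normalize to the same string?
myanmar_orders_merge = (
    unicodedata.normalize("NFD", below_then_above)
    == unicodedata.normalize("NFD", above_then_below)
)

# Contrast: Thai SARA U and MAI EK have different non-zero classes.
KO_KAI, SARA_U, MAI_EK = "\u0e01", "\u0e38", "\u0e48"
thai_ccc = (unicodedata.combining(SARA_U), unicodedata.combining(MAI_EK))
thai_orders_merge = (
    unicodedata.normalize("NFD", KO_KAI + MAI_EK + SARA_U)
    == unicodedata.normalize("NFD", KO_KAI + SARA_U + MAI_EK)
)

print(myanmar_ccc, myanmar_orders_merge)
print(thai_ccc, thai_orders_merge)
```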

    > What may be intriguing is that the Thai TIS standard may have restricted
    > the validity of those strings when they are encoded with this national
    > standard (not sure about that).

    It's a slight overreaction to the problem of typing the Thai word for
    water:

    1) <U+0E19 THAI CHARACTER NO NU, U+0E49 THAI CHARACTER MAI THO, U+0E33 THAI
    CHARACTER SARA AM>

    2) <U+0E19 NO NU, U+0E49 MAI THO, U+0E4D THAI CHARACTER NIKHAHIT, U+0E32
    THAI CHARACTER SARA AA>

    3) <U+0E19 NO NU, U+0E4D NIKHAHIT, U+0E49 MAI THO, U+0E32 SARA AA>

    Coding 1 is the recommended method, and is in NFC and NFD. Coding 2 is the
    NFKC and NFKD form of Coding 1. It often misrenders, though it would work with
    a mechanical typewriter or dumb font. Windows XP (but not Word XP and
    probably not Vista) rejects it as keyboard input. Coding 3 is in NFC, NFD,
    NFKC and NFKD forms, renders well, can easily be entered through the
    generally restrictive Windows XP input editor, and is WRONG for Siamese.
    (Does anyone know the correct denigration?) Uniscribe effectively aims to
    convert Coding 1 to Coding 3, though I suspect a font could maintain an
    artificial distinction. WTT 2.0 only allows Coding 1 - or, to be pedantic,
    its TIS 620 equivalent.
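
    The normalization claims for the three codings can be verified with a
    minimal Python sketch (standard library only; the helper name
    stable_forms is hypothetical). It reports which normalization forms
    each coding is already in, and confirms that Coding 1 maps to Coding 2
    under compatibility normalization while Codings 2 and 3 stay distinct.

```python
import unicodedata

NO_NU, MAI_THO, SARA_AM = "\u0e19", "\u0e49", "\u0e33"
NIKHAHIT, SARA_AA = "\u0e4d", "\u0e32"

coding1 = NO_NU + MAI_THO + SARA_AM
coding2 = NO_NU + MAI_THO + NIKHAHIT + SARA_AA
coding3 = NO_NU + NIKHAHIT + MAI_THO + SARA_AA

FORMS = ("NFC", "NFD", "NFKC", "NFKD")


def stable_forms(text):
    """Return the normalization forms that leave `text` unchanged."""
    return [form for form in FORMS
            if unicodedata.normalize(form, text) == text]


report = {
    "Coding 1": stable_forms(coding1),
    "Coding 2": stable_forms(coding2),
    "Coding 3": stable_forms(coding3),
}

# Compatibility normalization expands SARA AM to NIKHAHIT + SARA AA.
coding1_to_coding2 = all(
    unicodedata.normalize(form, coding1) == coding2
    for form in ("NFKC", "NFKD")
)

# No normalization form turns Coding 2 into Coding 3.
coding2_equals_coding3 = any(
    unicodedata.normalize(form, coding2) == unicodedata.normalize(form, coding3)
    for form in FORMS
)

print(report)
print(coding1_to_coding2, coding2_equals_coding3)
```

    NIKHAHIT has combining class 0, so it blocks canonical reordering past
    MAI THO; that is why no normalization form bridges Codings 2 and 3.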

    According to TUS 5.0 Section 11.1, Coding 2 should render with NIKHAHIT
    above MAI THO. Dare one suggest that a following SARA AA should change the
    behaviour?

    > Your question is similar to asking if the sequence string "qzlkqw" is
    > valid using only Latin consonants;

    No. The correct Thai analogue of "qzlkqw" as intended would be, say, <TO
    TAO, SARA I, NIKHAHIT, TO TAO, SARA U, NIKHAHIT>. Both are potentially
    interpretable, though. Are you sure no-one has used "qzlkqw" as his private
    notation for a hypothetical PIE compound **kþl̥k̂ku? (I'd prefer
    **tkl̥k̂ku-.)

    Richard.


