Thai Word Breaking
charupdate at orange.fr
Thu Aug 27 14:49:45 CDT 2015
On 22 Aug 2015 at 15:47, Richard Wordingham wrote:
> I'm trying to work out the meaning of TUS 8.0 Section 23.2.
> To do Thai word breaking properly, one needs to do a semantic analysis
> of the text to do the equivalent of resolving
> 'humanevents' into 'human events' rather than 'humane vents'. One also
> needs to cope with unknown and misspelt words. (A lot of effort has
> been devoted to avoid going to the extreme of doing semantic analysis.)
> However, it is possible to read Section 23.2 as prohibiting the use of
> certain information, and I would like to check whether this is the
> intended meaning.
> The opening paragraph seems clear enough on first reading:
> "The effect of layout controls is specific to particular text processes.
> As much as possible, layout controls are transparent to those text
> processes for which they were not intended. In other words, their
> effects are mutually orthogonal."
> However, my first question is, "Are paragraph boundaries
> directly admissible as evidence for or against word boundaries not
> adjacent to them?". For example, most Thai word breakers would not
> regard a paragraph boundary as any more significant than a
> phrase-delimiting space. However, a paragraph boundary often indicates
> a change of topic.
> My second question is, "Are line breaks admissible as evidence for
> or against word boundaries not adjacent to them?" For example, if a
> phrase makes heavy use of U+200B ZERO WIDTH SPACE (ZWSP), one may deduce
> that it is likely that all word boundaries within it are marked
> explicitly. This example is more useful for Khmer than for Thai, for
> whereas Cambodians were once taught to mark word boundaries, Thais
> rarely use ZWSP to mark word boundaries.
> My third question is, "Is the absence of a line break opportunity
> admissible as evidence for or against a word boundary?". Here I
> see conflicting signals.
> There is a character U+2060 WORD JOINER (WJ) which *used* to be regarded
> as the counterpart of ZWSP. The understanding was that ZWSP marked a
> word boundary and provided a line-break opportunity, while WJ denied
> both. This, however, is no longer the case. To quote the TUS section
> about WJ:
> P1: (Ignored)
> P2S1: The word joiner must not be confused with the zero width joiner
> or the combining grapheme joiner, which have very different functions.
> P2S2: In particular, inserting a word joiner between two characters has
> no effect on their ligating and cursive joining behavior.
> P2S3: The word joiner should be ignored in contexts other than line
> breaking.
> P2S4: Note in particular that the word joiner is ignored for word
> segmentation.
> P2S5: (See Unicode Standard Annex #29, “Unicode Text Segmentation.”)
> Paragraph 2 Sentence 3 (P2S3) appears to rule out its use in
> word-breaking, but perhaps it does not if line-breaking is being used
> as evidence for word boundaries.
> P2S4 has three very different interpretations:
> (i) This is an assertion of fact, and may therefore be incorrect.
> (ii) The word 'is' is sloppy wording for 'should be'. Section 23.2
> contains much sloppier wording, as I have already advised members of
> the UTC (4 July 2015).
> (iii) This is a deduction from other parts of the specification. Now,
> if P2S4 said 'is normally ignored for word segmentation', that would
> have made sense, for that applies to the default word boundary
> specification in UAX#29. However, just before Section 4.1, UAX#29
> explains that it does not specify what happens for word boundary
> determination in Thai! (It does constrain what happens, though.)
> At the end of UAX#29 Section 6.2, there is the provision, "The Ignore
> rules should not be overridden by tailorings, with the possible
> exception of remapping some of the Format characters to other
> classes." To accord with the user perceptions of Unicode-aware
> people who work with SE Asian scripts, I am tempted to ask for CLDR
> to tailor the word-breaking algorithms for the corresponding languages
> so that the word-breaking classes of WJ (and ZWNBSP) are changed from
> Format to MidLetter. That would match the widespread old *perception*
> that there should be no word break in a sequence > mark,)* WJ, Thai letter>.
> However, there are several objections:
> (a) Perhaps P2S3 and P2S4 prohibit this.
> (b) If the word-break property of Thai letters falls back to Other,
> there would still be a word break between them.
> (c) If the word-break property of Thai letters fell back to ALetter,
> an old suggestion, WJ would have no effect on the presence of a word
> break.
> (d) If Thai word breaking assigns word-break classes to each letter
> (gc=Lo), then word boundaries can be suppressed by choosing the classes
> appropriately. Non-spacing Thai vowels are very relevant to Thai
> word-breaking, but formally are 'ignored'. WJ could be 'ignored' in
> exactly the same way.
Nobody has yet answered the questions Richard Wordingham raised five days ago. I'm very busy and can hardly spare any time for topics I have not been involved in so far, except when I believe there's some need, as this is a discussion list.
However, the Word Joiner topic prompted me to start a thread too, which has thankfully been answered. I now feel that even if WJ is apparently tailored to delimit words in mainstream word processors, the Standard denies it this property, and Richard agrees, if I have understood him correctly. The criticism he develops should IMHO be fed into the 9.0 workflow.
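To make objections (b) and (c) easier to discuss, here is a minimal Python sketch of how the word-break class given to WJ interacts with the class given to Thai letters. It is only an illustration under stated assumptions: it models a reduced subset of the UAX#29 rules (WB4 to WB7 and the final "break everywhere else" rule), the names make_classifier and word_breaks are made up for this example, and the two class parameters stand for hypothetical tailorings, not for what CLDR or any word processor actually does.

```python
import unicodedata

WJ = "\u2060"  # U+2060 WORD JOINER


def make_classifier(thai_class, wj_class):
    """Return a function mapping a character to a simplified word-break class.

    thai_class and wj_class are assumptions of this sketch: they let us try
    out different (hypothetical) tailorings for Thai letters and for WJ.
    """
    def classify(ch):
        if ch == WJ:
            return wj_class
        cat = unicodedata.category(ch)
        if cat == "Mn":          # non-spacing marks, e.g. Thai upper vowels
            return "Extend"
        if cat == "Cf":          # other format characters
            return "Format"
        if "\u0e00" <= ch <= "\u0e7f" and cat == "Lo":
            return thai_class    # Thai letters (gc=Lo)
        return "Other"
    return classify


def word_breaks(text, classify):
    """Return interior break offsets under the reduced rule set."""
    # WB4 (simplified): Extend and Format characters are skipped, so they
    # stay attached to whatever precedes them.
    sig = [(i, classify(ch)) for i, ch in enumerate(text)
           if classify(ch) not in ("Extend", "Format")]
    breaks = []
    for k in range(1, len(sig)):
        prev = sig[k - 1][1]
        pos, cur = sig[k]
        before = sig[k - 2][1] if k >= 2 else None
        after = sig[k + 1][1] if k + 1 < len(sig) else None
        if prev == "ALetter" and cur == "ALetter":
            continue  # WB5: no break between letters
        if prev == "ALetter" and cur == "MidLetter" and after == "ALetter":
            continue  # WB6: no break before MidLetter between letters
        if before == "ALetter" and prev == "MidLetter" and cur == "ALetter":
            continue  # WB7: no break after MidLetter between letters
        breaks.append(pos)  # otherwise: break
    return breaks


# Thai letter, Thai non-spacing mark, WJ, Thai letter
sample = "\u0e01\u0e34\u2060\u0e19"
for thai_class in ("Other", "ALetter"):
    for wj_class in ("Format", "MidLetter"):
        offsets = word_breaks(sample, make_classifier(thai_class, wj_class))
        print(thai_class, wj_class, offsets)
```

In this reduced model, with Thai letters as Other there is a break before the final letter whether WJ is Format or MidLetter, which is objection (b). With Thai letters as ALetter there is no break in either case, nor is there one when WJ is removed from the sample, which is objection (c).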
Any comments from Thai users, implementers, and scientists?