Request for Information
richard.wordingham at ntlworld.com
Thu Jul 24 01:37:42 CDT 2014
On Wed, 23 Jul 2014 20:45:48 +0100
fantasai <fantasai.lists at inkedblade.net> wrote:
> I would like to request that Unicode include, for each writing system
> it encodes, some information on how it might justify.
Unicode encodes scripts, and I suspect CLDR only really supports living
languages. Scripts can be used for multiple writing systems - the
example of the Latin script for Romaji in Japanese was given in the
> a) Text justification typically expands at word-separating
> characters, but may also expand between letters.
> b) Since this writing system does not use spaces, justification
> typically expands between letters.
Are you hoping for details on this? This style of justification, which
I've seen called 'Thai justification' in Microsoft Word, generally treats
spacing combining marks (gc=Mc) like letters in the Tai Tham script when
used for Tai Khuen.
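
To make the point about spacing marks concrete, here is a minimal sketch
in Python. It is not how Microsoft Word or any other product implements
inter-letter justification; the function name expansion_points is
hypothetical, and the rule it applies (non-spacing marks, gc=Mn or Me,
stay attached to the preceding character, while spacing marks, gc=Mc,
count as letters) is only my reading of the behaviour described above.

```python
import unicodedata

# Assumption: only non-spacing and enclosing marks are glued to the
# preceding character; spacing marks (gc=Mc) are treated like letters.
ATTACHED = {"Mn", "Me"}

def expansion_points(text):
    """Return the indices i at which extra space may be inserted
    immediately before text[i] when justifying between letters."""
    points = []
    for i in range(1, len(text)):
        if unicodedata.category(text[i]) not in ATTACHED:
            points.append(i)
    return points
```

With this rule a spacing vowel sign gets an expansion opportunity in
front of it, whereas a non-spacing one does not. A real implementation
would work on grapheme clusters and shaped glyphs rather than on code
points.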
> a) Latin typically breaks only at spaces and other punctuation.
> However, it also admits hyphenation within words.
> In some contexts (such as Japanese), it may, as a stylistic
> option, break anywhere (without hyphens).
This is also a mediaeval European style!
> c) Javanese only breaks between clauses, where punctuation is used,
> resulting in horrendously ragged lines. (Did I get that right?)
No. The text samples I could find quickly show scripta continua, but I
suspect the line breaks are occurring at word or syllable boundaries.
If I am right about the constraint on line break position, then this
can be recovered by marking the optional line breaks with ZWSP. In
addition, the consonants should be reclassified from AL to SA.
However, such a change would be incompatible with a modern writing
system in which words are separated by spaces (if such a system
exists). I don't know what happens in Indonesian schools, so I can't
report an error.
Scripta continua and non-scripta continua in the same script are
incompatible in plain text.
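
As an illustration of marking optional line breaks with ZWSP, here is a
minimal sketch in Python. It is not the UAX#14 algorithm: the function
name wrap_at_zwsp is hypothetical, the width is counted in code points
rather than in rendered width, and U+200B is assumed to be the only
place where a break is allowed.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here to mark optional breaks

def wrap_at_zwsp(text, width):
    """Greedily fill lines of at most `width` code points, breaking
    only where the text contains a ZWSP. A single segment longer than
    `width` is left unbroken on a line of its own."""
    lines, current = [], ""
    for segment in text.split(ZWSP):
        if current and len(current) + len(segment) > width:
            lines.append(current)   # line is full: break at this ZWSP
            current = segment
        else:
            current += segment      # ZWSP is invisible, so it is dropped
    if current:
        lines.append(current)
    return lines
```

Scripta continua text marked up this way breaks only at the marked
positions. The same text with no ZWSP at all would come back as a single
unbroken line, which is the plain-text incompatibility described above.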
> This information is of course encoded into UAX#14 and can be extracted
> from there (as I have done for Javanese above),
Not when writing systems in the same script differ as to whether they
delimit words by line-break-inducing marks. Some Thai-script minority
writing systems are supposed to use spaces to separate words, whereas
Thai is written using scripta continua.