Ambiguity(?) in Sinhala named sequences
mjansche at google.com
Fri Oct 14 12:07:23 CDT 2016
For Sinhala, the following named sequences are defined (for good reasons):
SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll write
Ya for 0DBA and Ra for 0DBB.
Note that these give rise to two potentially ambiguous code point strings:
0DBB 0DCA 200D 0DBA
0DBB 0DCA 200D 0DBB
I'll concentrate on the first, as all arguments apply to the second one as well.
At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
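To make the two parses concrete, here is a minimal Python sketch (an illustration only, not taken from any standard or library) that enumerates every way to segment a code point string into the three named sequences above plus the bare letters Ya and Ra. The names NAMED, SINGLE, and parses are made up for this example.

```python
# Named sequences exactly as quoted above; the short names are the
# abbreviations used in this message.
NAMED = {
    "Yansaya": (0x0DCA, 0x200D, 0x0DBA),
    "Rakaransaya": (0x0DCA, 0x200D, 0x0DBB),
    "Repaya": (0x0DBB, 0x0DCA, 0x200D),
}
# Bare letters that may appear outside a named sequence (assumption: only
# Ya and Ra matter for this illustration).
SINGLE = {0x0DBA: "Ya", 0x0DBB: "Ra"}


def parses(cps):
    """Return every segmentation of cps into named sequences and bare letters."""
    if not cps:
        return [[]]
    results = []
    # Try to consume a named sequence at the front.
    for name, seq in NAMED.items():
        if tuple(cps[:len(seq)]) == seq:
            results += [[name] + rest for rest in parses(cps[len(seq):])]
    # Try to consume a single bare letter at the front.
    if cps[0] in SINGLE:
        results += [[SINGLE[cps[0]]] + rest for rest in parses(cps[1:])]
    return results


ambiguous = [0x0DBB, 0x0DCA, 0x200D, 0x0DBA]
for p in parses(ambiguous):
    print(" + ".join(p))
```

For 0DBB 0DCA 200D 0DBA this finds exactly the two segmentations listed above, and nothing in the sequence data itself ranks one over the other.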
First question: Does the standard give any guidance as to which one is the
intended parse? The section on Sinhala in the Unicode Standard is silent
about this. Is there a general principle I'm missing?
Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not used
and is considered incorrect, suggesting that the second parse (Repaya+Ya)
should be the default interpretation of this sequence. However, SLS 1134
does not address the potential ambiguity of this sequence explicitly and
the description there could be read as informative, not normative.
Second question: Given that one parse of this sequence should be the
default, how does one represent the non-default parse?
In most cases one can guess what the intended meaning is, but I suspect
this is somewhat of a gray area. In practice, trying to render these
problematic sequences and their neighbors in HarfBuzz with a variety of
fonts results in a variety of outcomes (including occasionally unexpected
glyph choices). If the meaning of these sequences is not well defined, that
would partly explain the variation across fonts.
Am I missing something fundamental? If not, it seems this issue should be
called out explicitly in some part of the standard.