Ambiguity(?) in Sinhala named sequences
harshula at hj.id.au
Sun Oct 16 18:15:57 CDT 2016
On 15/10/16 04:07, Martin Jansche wrote:
> For Sinhala, the following named sequences are defined (for good reasons):
> SINHALA CONSONANT SIGN YANSAYA;0DCA 200D 0DBA
> SINHALA CONSONANT SIGN RAKAARAANSAYA;0DCA 200D 0DBB
> SINHALA CONSONANT SIGN REPAYA;0DBB 0DCA 200D
> I'll abbreviate these as Yansaya, Rakaransaya, and Repaya, and I'll
> write Ya for 0DBA and Ra for 0DBB.
> Note that these give rise to two potentially ambiguous codepoint
> strings, namely
> 0DBB 0DCA 200D 0DBA
> 0DBB 0DCA 200D 0DBB
> I'll concentrate on the first, as all arguments apply to the second one
> as well.
> At first glance, the sequence 0DBB 0DCA 200D 0DBA has two possible parses:
> 0DBB + 0DCA 200D 0DBA, i.e. Ra + Yansaya
> 0DBB 0DCA 200D + 0DBA, i.e. Repaya + Ya
> First question: Does the standard give any guidance as to which one is
> the intended parse? The section on Sinhala in the Unicode Standard is
> silent about this. Is there a general principle I'm missing?
> Sri Lanka Standard SLS 1134 (2004 draft) states that Ra+Yansaya is not
> used and is considered incorrect, suggesting that the second parse
> (Repaya+Ya) should be the default interpretation of this sequence.
> However, SLS 1134 does not address the potential ambiguity of this
> sequence explicitly and the description there could be read as
> informative, not normative.
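The two readings quoted above can be enumerated mechanically. A minimal Python sketch, assuming only the three named sequences listed in the quoted message; the `NAMED` table and the `parses` helper are illustrative names, not anything defined by Unicode or SLS 1134:

```python
# Named sequences exactly as quoted above, stored as tuples of code points.
NAMED = {
    "YANSAYA": (0x0DCA, 0x200D, 0x0DBA),
    "RAKAARAANSAYA": (0x0DCA, 0x200D, 0x0DBB),
    "REPAYA": (0x0DBB, 0x0DCA, 0x200D),
}


def parses(cps):
    """Yield every split of cps into named sequences and single code points."""
    if not cps:
        yield []
        return
    # Option 1: a named sequence starts at the current position.
    for name, seq in NAMED.items():
        if tuple(cps[:len(seq)]) == seq:
            for rest in parses(cps[len(seq):]):
                yield [name] + rest
    # Option 2: consume one code point on its own.
    for rest in parses(cps[1:]):
        yield ["U+%04X" % cps[0]] + rest


seq = [0x0DBB, 0x0DCA, 0x200D, 0x0DBA]
# Keep only the splits that actually use a named sequence.
ambiguous = [p for p in parses(seq) if any(t in NAMED for t in p)]
for p in ambiguous:
    print(" + ".join(p))
```

The same two-way split appears for 0DBB 0DCA 200D 0DBB, with RAKAARAANSAYA in place of YANSAYA.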
1) re: 0DBB 0DCA 200D 0DBA
SLS 1134 was updated in 2011. The latest public version I could find is
v3.41; the extract below is the same in v3.6:
"1. The yansaya is not used following the letter ර. e.g.: the spelling
කාර්ය is incorrect."
If the above is insufficient, it's best to discuss the issue with Harsha
(CC'd) and Ruvan (CC'd).
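If the SLS 1134 extract is taken as a tie-breaker, the Ra + Yansaya reading can be filtered out of a list of candidate parses. A small Python sketch under that assumption; whether the rule is normative is exactly the open question here, and the token spelling and the `violates_sls_1134` helper are hypothetical:

```python
RA = "U+0DBB"  # the letter Ra, written as a single-code-point token


def violates_sls_1134(parse):
    """Return True if a YANSAYA token directly follows the letter Ra."""
    return any(a == RA and b == "YANSAYA" for a, b in zip(parse, parse[1:]))


# The two candidate readings of 0DBB 0DCA 200D 0DBA.
candidates = [["REPAYA", "U+0DBA"], ["U+0DBB", "YANSAYA"]]
allowed = [p for p in candidates if not violates_sls_1134(p)]
print(allowed)
```

This covers only the first sequence; the extract says nothing about 0DBB 0DCA 200D 0DBB, so no rule is assumed for it.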
2) re: 0DBB 0DCA 200D 0DBB
Harsha & Ruvan can clarify this too.
> Second question: Given that one parse of this sequence should be the
> default, how does one represent the non-default parse?
> In most cases one can guess what the intended meaning is, but I suspect
> this is somewhat of a gray area. In practice, trying to render these
> problematic sequences and their neighbors in HarfBuzz with a variety of
> fonts results in a variety of outcomes (including occasionally
> unexpected glyph choices). If the meaning of these sequences is not well
> defined, that would partly explain the variation across fonts.
> Am I missing something fundamental? If not, it seems this issue should
> be called out explicitly in some part of the standard.
> -- martin