L2/20-247
Date/Time: Mon Sep 28 14:35:19 CDT 2020
Name: Charlotte Buff
Subject: Restrictions on base characters of variation sequences (L2/20‑244)
Document L2/20‑244 (Lindenberg, “Variation sequences for combining marks”) proposes that the definition of variation sequences be changed so that all characters that ⓐ have a canonical combining class of 0 and ⓑ do not canonically decompose may serve as base characters to which variation selectors can be applied. However, this new definition, like the present definition used by the Unicode Standard, is too loose: even within the specified restrictions, there remain some characters that would cause normalisation‐related problems if they supported variation selectors, namely those that are the trailing codepoints in other codepoints’ (reversible) canonical decompositions. Consider for example U+09BE ◌া BENGALI VOWEL SIGN AA, which is a spacing mark with ccc=0 that does not decompose and would therefore theoretically allow for standardised variants under both the old and the new rules. However, if U+09BE occurred directly after U+09C7 ◌ে BENGALI VOWEL SIGN E, normalisation forms C and KC would combine the two into U+09CB ◌ো BENGALI VOWEL SIGN O. If U+09BE had been followed by a variation selector, that selector would now follow the composed U+09CB instead, forming an invalid sequence.
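The Bengali example can be reproduced with a minimal sketch using Python's standard `unicodedata` module. The use of U+FE00 VARIATION SELECTOR-1 here is purely hypothetical: no such variation sequence is defined for U+09BE, and the selector only stands in for any variant that might be standardised.

```python
import unicodedata

E = "\u09C7"    # BENGALI VOWEL SIGN E
AA = "\u09BE"   # BENGALI VOWEL SIGN AA
O = "\u09CB"    # BENGALI VOWEL SIGN O (canonically decomposes to E + AA)
VS1 = "\uFE00"  # VARIATION SELECTOR-1, hypothetical variant of AA

# Without a selector, NFC composes the pair into the single codepoint O.
plain = unicodedata.normalize("NFC", E + AA)

# With a selector after AA, the pair still composes, because the selector
# comes after both characters and does not block the composition.
with_vs = unicodedata.normalize("NFC", E + AA + VS1)

# The selector's base character is whatever immediately precedes it.
base_before = (E + AA + VS1)[-2]
base_after = with_vs[-2]

print([f"U+{ord(c):04X}" for c in with_vs])
```

Before normalisation the selector follows U+09BE; afterwards it follows U+09CB, a character for which the sequence was never defined.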
As of Unicode 13.0.0, the following codepoints are affected by this issue:

U+09BE BENGALI VOWEL SIGN AA
U+09D7 BENGALI AU LENGTH MARK
U+0B3E ORIYA VOWEL SIGN AA
U+0B56 ORIYA AI LENGTH MARK
U+0B57 ORIYA AU LENGTH MARK
U+0BBE TAMIL VOWEL SIGN AA
U+0BD7 TAMIL AU LENGTH MARK
U+0CC2 KANNADA VOWEL SIGN UU
U+0CD5 KANNADA LENGTH MARK
U+0CD6 KANNADA AI LENGTH MARK
U+0D3E MALAYALAM VOWEL SIGN AA
U+0D57 MALAYALAM AU LENGTH MARK
U+0DCF SINHALA VOWEL SIGN AELA-PILLA
U+0DDF SINHALA VOWEL SIGN GAYANUKITTA
U+102E MYANMAR VOWEL SIGN II
U+1161..U+1175 HANGUL JUNGSEONG A..HANGUL JUNGSEONG I
U+11A8..U+11C2 HANGUL JONGSEONG KIYEOK..HANGUL JONGSEONG HIEUH
U+1B35 BALINESE VOWEL SIGN TEDUNG
U+11127 CHAKMA VOWEL SIGN A
U+1133E GRANTHA VOWEL SIGN AA
U+11357 GRANTHA AU LENGTH MARK
U+114B0 TIRHUTA VOWEL SIGN AA
U+114BA TIRHUTA VOWEL SIGN SHORT E
U+114BD TIRHUTA VOWEL SIGN SHORT O
U+115AF SIDDHAM VOWEL SIGN AA
U+11930 DIVES AKURU VOWEL SIGN AA

This list may expand in the future as new canonically decomposable characters are encoded. However, existing characters cannot become affected in a later version of the standard because a new character decomposing into already assigned codepoints would automatically be composition-excluded.

Regardless of whether the rules for variation sequences will be changed or not, the aforementioned characters must be forbidden from receiving standardised variants, either implicitly (by simply never defining variants for them) or explicitly by changing the wording of section 23.4 of the core standard to specifically exclude them and potential future characters like them.
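A list of this kind can be approximated mechanically. The following is a rough sketch, not the method used to compile the list above: it relies on Python's `unicodedata` module, which reflects whichever Unicode version the interpreter ships with, so its output may contain more entries than the 13.0.0 list. The function name `trailing_starters` is made up for illustration, and the test for composition exclusion (checking whether NFC recomposes the pair) is an assumption standing in for a lookup in the Unicode Character Database.

```python
import unicodedata

def trailing_starters():
    """Codepoints with ccc=0 and no canonical decomposition that are the
    trailing half of some other codepoint's reversible decomposition."""
    affected = set()
    for cp in range(0x110000):
        ch = chr(cp)
        decomp = unicodedata.decomposition(ch)
        # Skip characters without a mapping and compatibility mappings,
        # which are tagged like "<compat> ...".
        if not decomp or decomp.startswith("<"):
            continue
        parts = [chr(int(h, 16)) for h in decomp.split()]
        if len(parts) != 2:
            continue  # singleton decompositions never recompose
        # Reversible means NFC recomposes the pair, i.e. the character
        # is not composition-excluded.
        if unicodedata.normalize("NFC", "".join(parts)) != ch:
            continue
        trail = parts[1]
        if unicodedata.combining(trail) == 0 and not unicodedata.decomposition(trail):
            affected.add(ord(trail))
    # Hangul syllables decompose algorithmically, so derive their trailing
    # jamo from the full NFD form: a vowel for LV, a final consonant for LVT.
    for cp in range(0xAC00, 0xD7A4):
        affected.add(ord(unicodedata.normalize("NFD", chr(cp))[-1]))
    return affected

if __name__ == "__main__":
    for cp in sorted(trailing_starters()):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp), '<unnamed>')}")
```

Combining marks such as U+0301 or U+0DCA do not appear in the result because their canonical combining class is not 0, so they fall outside the proposed definition in any case.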