This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.
Date/Time: Thu Dec 15 00:05:11 CST 2022
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 466
The proposed draft UTS #55, Unicode Source Code Handling, lacks information on spoofing and usability issues arising from lookalike syllables in Brahmic scripts.

Most Brahmic scripts have been encoded in Unicode according to principles that differ from those used for most other scripts. For most non-Brahmic scripts, spacing characters are encoded in visual order, with nonspacing marks following the spacing characters they attach to. If multiple nonspacing marks attach to the same base, marks that interact typographically are encoded from innermost (closest to the base) to outermost, while Unicode normalization handles ambiguities caused by nonspacing marks that don’t interact typographically. For most Brahmic scripts, the intent is that characters are encoded in phonetic order, independent of visual placement relative to each other, and Unicode normalization is largely disabled by using the canonical combining class 0 for most combining marks.

To ensure interoperability between smart keyboards, predictive input systems, spelling checkers, font rendering systems, fonts, systems for searching and sorting text, optical character recognition systems, speech input and output systems, text normalization, and other text processing software, the Unicode Standard would have to define the encoding order of orthographic syllable components precisely and unambiguously for each Brahmic script. However, the Unicode Standard fails to do so. Fonts and font rendering systems to some extent try to impose order by inserting dotted circles into character sequences that their designers find inappropriate, but do so incompletely and inconsistently, with a tendency to relax rules over time.

The result is that in a number of Brahmic scripts a given orthographic syllable can be encoded in multiple ways with the same rendering. This is well documented for Khmer in particular; see Horton et al. 2017, Lindenberg 2019, Hosken 2021.
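The contrast between the two encoding models can be observed directly in normalization behavior. The following is a minimal sketch using Python's standard `unicodedata` module; the specific character pairs are chosen here for illustration only and do not come from the feedback itself. The Khmer pair is not claimed to render identically in any font; it only shows that normalization leaves marks with combining class 0 in the order they were typed.

```python
import unicodedata

# Latin base with two nonspacing marks that do not interact typographically:
# U+0323 COMBINING DOT BELOW (class 220), U+0307 COMBINING DOT ABOVE (class 230).
latin_a = "q\u0323\u0307"
latin_b = "q\u0307\u0323"

# Khmer consonant KA with two marks of combining class 0, in either order:
# U+17B6 KHMER VOWEL SIGN AA and U+17C6 KHMER SIGN NIKAHIT.
khmer_a = "\u1780\u17b6\u17c6"
khmer_b = "\u1780\u17c6\u17b6"


def same_after_nfc(s, t):
    """Return True if the two strings are canonically equivalent under NFC."""
    return unicodedata.normalize("NFC", s) == unicodedata.normalize("NFC", t)


# Normalization reorders the Latin marks, so both orders compare equal.
latin_equal = same_after_nfc(latin_a, latin_b)

# Class-0 marks are never reordered, so the two Khmer orders stay distinct.
khmer_equal = same_after_nfc(khmer_a, khmer_b)

print("Latin orders equal after NFC:", latin_equal)
print("Khmer orders equal after NFC:", khmer_equal)
print("Combining class of U+17B6:", unicodedata.combining("\u17b6"))
```

Because normalization does nothing for the second pair, any ordering discipline for such marks has to come from a script-specific rule, which is the gap this feedback describes.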
In Khmer, for example, the word ស្ត្រី (woman) can be encoded with three different character sequences that render identically in all major rendering systems: ស្ត្រី, ស្រ្តី, ស្រី្ត. This holds even after eliminating ambiguities introduced by the intentionally confusable subjoined consonants ◌្ដ (coeng da) and ◌្ត (coeng ta). See Hosken 2021, pages 34-36, for more examples.

The issues could be documented in UTS #55 as follows.

Spoofing using lookalike orthographic syllables

The Unicode Standard uses phonetic encoding order for most Brahmic scripts, but does not define the encoding order of orthographic syllable components for most such scripts. As a consequence, syllables can often be encoded as multiple character sequences that render identically. This can be used for spoofing, for instance, by constructing identifiers that look the same but are actually different.

Example: Consider the following Python program:

    ស្ត្រី = True
    ស្រ្តី = False
    if ស្ត្រី:
        print("True!")
    else:
        print("False?")

The program looks like it would print “False?”, because the second assignment appears to overwrite the first. It actually prints “True!”, because the ស្រ្តី assigned False is a different variable than the ស្ត្រី assigned True, and the ស្ត្រី tested in the if-statement is the one assigned True.

Usability issues arising from lookalike orthographic syllables

When working with Brahmic scripts, there is a common usability issue whereby one accidentally types an orthographic syllable using the wrong character sequence, with no difference in the resulting rendering. For example, the code shown in “Spoofing using lookalike orthographic syllables” may be the result of one engineer typing ស្ត្រី and another typing ស្រ្តី, which look identical but are in fact different variables.

To address these problems, the Unicode Standard would have to specify the encoding order of orthographic syllable components for all Brahmic scripts. A proposal for Khmer is currently under discussion.
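The three spellings of the Khmer word for "woman" discussed above can be told apart programmatically even though they look alike on screen. The sketch below writes each spelling as escape sequences so that the code point order is unambiguous, then checks whether the NFKC normalization that Python applies to identifiers merges them. The labels and variable names are illustrative assumptions, and the output prints code points only, to avoid depending on console font or encoding support.

```python
import unicodedata

# The three spellings as explicit code point sequences.
# U+179F SA, U+17D2 COENG, U+178F TA, U+179A RO, U+17B8 VOWEL SIGN II.
spellings = {
    "sa coeng-ta coeng-ro ii": "\u179f\u17d2\u178f\u17d2\u179a\u17b8",
    "sa coeng-ro coeng-ta ii": "\u179f\u17d2\u179a\u17d2\u178f\u17b8",
    "sa coeng-ro ii coeng-ta": "\u179f\u17d2\u179a\u17b8\u17d2\u178f",
}

for label, text in spellings.items():
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in text)
    print(f"{label}: {code_points}")

# Python compares identifiers after NFKC normalization; collect the
# normalized forms in a set to see how many distinct identifiers remain.
normalized = {unicodedata.normalize("NFKC", s) for s in spellings.values()}
print("distinct after NFKC:", len(normalized))

# Assign a different value to each spelling in a scratch namespace and
# count how many separate variables Python actually creates.
namespace = {}
source = "\n".join(f"{text} = {i}" for i, text in enumerate(spellings.values()))
exec(source, namespace)
distinct_variables = [name for name in namespace if name != "__builtins__"]
print("distinct variables:", len(distinct_variables))
```

This only demonstrates that general-purpose normalization does not help here. Detecting such lookalikes would need a script-specific canonical order for syllable components, which is what this feedback says the standard does not yet define.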
References:
Joshua Horton, Makara Sok, Marc Durdin, Rasmey Ty: Spoof-Vulnerable Rendering in Khmer Unicode Implementations. 2017. https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.35.pdf
Norbert Lindenberg: Issues in Khmer syllable validation. 2019. https://lindenbergsoftware.com/en/notes/issues-in-khmer-syllable-validation/
Martin Hosken: Khmer Encoding Structure. 2021. https://www.unicode.org/L2/L2021/21241-khmer-structure.pdf