Accumulated Feedback on PRI #466

This page is a compilation of formal public feedback received so far. See Feedback for further information on this issue, how to discuss it, and how to provide feedback.

Date/Time: Thu Dec 15 00:05:11 CST 2022
Name: Norbert Lindenberg
Report Type: Public Review Issue
Opt Subject: 466

The proposed draft UTS #55, Unicode Source Code Handling, lacks information
on spoofing and usability issues arising from lookalike syllables in
Brahmic scripts.

Most Brahmic scripts have been encoded in Unicode according to principles
that differ from those used for most other scripts. For most non-Brahmic
scripts, spacing characters are encoded in visual order, with nonspacing
marks following the spacing characters they attach to. If multiple
nonspacing marks attach to the same base, marks that interact
typographically are encoded from innermost (closest to the base) to
outermost, while Unicode normalization handles ambiguities caused by
nonspacing marks that don’t interact typographically. For most Brahmic
scripts, the intent is that characters are encoded in phonetic order,
independent of visual placement relative to each other, and Unicode
normalization is largely disabled by using the canonical combining class 0
for most combining marks.
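
The contrast can be checked with Python's standard unicodedata module. The
following is a minimal sketch; the specific characters (Latin a with an acute
accent and a dot below, Khmer ka with vowel sign ii and nikahit) are chosen
here for illustration and are not taken from the draft.

```python
import unicodedata

# Latin: the two marks have different non-zero combining classes, so
# normalization reorders them and both typing orders become identical.
latin_1 = "a\u0301\u0323"  # a, combining acute accent, combining dot below
latin_2 = "a\u0323\u0301"  # a, combining dot below, combining acute accent
latin_same = (unicodedata.normalize("NFC", latin_1)
              == unicodedata.normalize("NFC", latin_2))

# Khmer: both marks have combining class 0, so normalization leaves the
# order untouched and the two sequences stay different.
khmer_1 = "\u1780\u17B8\u17C6"  # ka, vowel sign ii, nikahit
khmer_2 = "\u1780\u17C6\u17B8"  # ka, nikahit, vowel sign ii
khmer_same = (unicodedata.normalize("NFC", khmer_1)
              == unicodedata.normalize("NFC", khmer_2))

# Show the canonical combining class of each mark used above.
for ch in "\u0301\u0323\u17B8\u17C6":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: "
          f"ccc={unicodedata.combining(ch)}")

print("Latin orders equal after NFC:", latin_same)
print("Khmer orders equal after NFC:", khmer_same)
```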

To ensure interoperability between smart keyboards, predictive input
systems, spelling checkers, font rendering systems, fonts, systems for
searching and sorting text, optical character recognition systems, speech
input and output systems, text normalization, and other text processing
software, the Unicode Standard would have to define the encoding order of
orthographic syllable components precisely and unambiguously for each
Brahmic script. However, the Unicode Standard fails to do so. Fonts and
font rendering systems to some extent try to impose order by inserting
dotted circles into character sequences that their designers find
inappropriate, but do so incompletely and inconsistently, with a tendency
to relax rules over time.

The result is that in a number of Brahmic scripts a given orthographic
syllable can be encoded in multiple ways with the same rendering. This is
well documented, for example, for Khmer – see Horton et al. 2017,
Lindenberg 2019, Hosken 2021. For example, the word ស្ត្រី (woman) can be
encoded with three different character sequences with identical rendering
in all major rendering systems: ស្ត្រី, ស្រ្តី, ស្រី្ត – even after
eliminating ambiguities introduced by the intentionally confusable subjoined
consonants ◌្ដ (coeng da) and ◌្ត (coeng ta). See Hosken 2021 pages 34-36
for more examples.
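
The difference between the three spellings is invisible in rendered text but
easy to see at the code point level. The sketch below writes them as escape
sequences, transcribed from the three strings above (the labels are informal
descriptions added here), and checks whether any Unicode normalization form
unifies them.

```python
import unicodedata

# The three spellings of the word, written as escapes so that the
# differences are visible regardless of font and rendering system.
spellings = {
    "sa coeng ta coeng ro ii": "\u179F\u17D2\u178F\u17D2\u179A\u17B8",
    "sa coeng ro coeng ta ii": "\u179F\u17D2\u179A\u17D2\u178F\u17B8",
    "sa coeng ro ii coeng ta": "\u179F\u17D2\u179A\u17B8\u17D2\u178F",
}

for label, text in spellings.items():
    print(label, "=", " ".join(f"U+{ord(ch):04X}" for ch in text))

# Count how many distinct strings remain under each normalization form.
distinct_after = {
    form: len({unicodedata.normalize(form, s) for s in spellings.values()})
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(distinct_after)
```

If normalization unified the spellings, the counts would drop below three;
comparing normalized strings is therefore not sufficient to detect this kind
of lookalike.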

The issues could be documented in UTS 55 as follows.

Spoofing using lookalike orthographic syllables

The Unicode Standard uses phonetic encoding order for most Brahmic scripts,
but does not define the encoding order of orthographic syllable components
for most such scripts. As a consequence, syllables can often be encoded in
multiple character sequences that render identically. 

This can be used for spoofing, for instance, by constructing identifiers
that look identical but are actually different.

Example: Consider the following Python program:

ស្ត្រី = True
ស្រ្តី = False
if ស្ត្រី:
    print("True!")
else:
    print("False?")

The program looks like it would print “False?”, but it actually
prints “True!” because the ស្រ្តី assigned False is a different variable
than the ស្ត្រី assigned True, and the ស្ត្រី tested in the if-statement
is the one assigned True.
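
The behavior can be confirmed by running the example and inspecting the
resulting namespace. This is only an illustrative sketch: the identifiers are
built from escape sequences transcribed from the example, and the helper
names (sta_1, sta_2, namespace) are introduced here.

```python
import contextlib
import io

sta_1 = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # sa, coeng, ta, coeng, ro, ii
sta_2 = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # sa, coeng, ro, coeng, ta, ii

# Rebuild the example program with the two spellings as identifiers.
source = (
    f"{sta_1} = True\n"
    f"{sta_2} = False\n"
    f"if {sta_1}:\n"
    "    print('True!')\n"
    "else:\n"
    "    print('False?')\n"
)

# Execute the program in a fresh namespace and capture what it prints.
namespace = {}
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    exec(source, namespace)
output = buffer.getvalue().strip()

# Collect the variables the program defined (skipping __builtins__).
names = [name for name in namespace if not name.startswith("__")]

print("program output:", output)
for name in names:
    code_points = " ".join(f"U+{ord(ch):04X}" for ch in name)
    print(f"{code_points} -> {namespace[name]}")
```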

Usability issues arising from lookalike orthographic syllables

When working with Brahmic scripts, a common usability issue is that a user
accidentally types an orthographic syllable using the wrong character
sequence, with no difference in the resulting rendering. For example, the
code shown in “Spoofing using lookalike orthographic syllables” may be the
result of one engineer typing ស្ត្រី and another typing ស្រ្តី; the two
spellings look identical but are in fact different variables.
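
Until an encoding order is specified, a tool could at least flag suspicious
pairs. The following is a crude heuristic sketch, not a method proposed in the
draft or in the references: it groups identifiers that contain combining marks
and consist of the same code points in a different order. The function name
and the grouping rule are assumptions made for illustration, and the rule can
produce false positives as well as miss lookalikes that use different code
points.

```python
import ast
import unicodedata
from collections import defaultdict


def reordered_identifier_groups(source):
    """Return sets of identifiers that use the same code points in a different order."""
    groups = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Name):
            continue
        name = node.id
        # Only consider names that contain at least one combining mark.
        if any(unicodedata.category(ch).startswith("M") for ch in name):
            # Names with the same sorted code points share a group key.
            groups[tuple(sorted(name))].add(name)
    return [names for names in groups.values() if len(names) > 1]


sta_1 = "\u179F\u17D2\u178F\u17D2\u179A\u17B8"  # sa, coeng, ta, coeng, ro, ii
sta_2 = "\u179F\u17D2\u179A\u17D2\u178F\u17B8"  # sa, coeng, ro, coeng, ta, ii
sample = f"{sta_1} = True\n{sta_2} = False\ncount = 1\n"

suspicious = reordered_identifier_groups(sample)
for group in suspicious:
    for name in sorted(group):
        print(" ".join(f"U+{ord(ch):04X}" for ch in name))
```

The sketch uses the ast module so that identifiers are read the same way the
Python compiler reads them.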

To address these problems, the Unicode Standard would have to specify the
encoding order of orthographic syllable components for all Brahmic scripts.
A proposal for Khmer is currently under discussion.

References:

Joshua Horton, Makara Sok, Marc Durdin, Rasmey Ty: Spoof-Vulnerable Rendering 
in Khmer Unicode Implementations. 2017.
https://lt4all.elra.info/proceedings/lt4all2019/pdf/2019.lt4all-1.35.pdf 

Norbert Lindenberg: Issues in Khmer syllable validation. 2019.
https://lindenbergsoftware.com/en/notes/issues-in-khmer-syllable-validation/ 

Martin Hosken: Khmer Encoding Structure. 2021.
https://www.unicode.org/L2/L2021/21241-khmer-structure.pdf