Mixed-Script confusables in prog.languages
verdy_p at wanadoo.fr
Sun Dec 4 13:07:22 CST 2016
For Japanese, Korean and Chinese, ISO 15924 already assigns some "script"
codes that you can use for mixed scripts (e.g. "Jpan" = "Hani+Hrkt").
These are standardized aliases, so you do not need to invent your own. For some languages this
can be more complex (e.g. some Berber languages may use
Latin+Tifinagh+Arabic, probably not in identifiers, but possibly in user
names if they are also used as identifiers).
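As a rough illustration of how such aliases could be used, here is a minimal Python sketch. The alias expansions for "Jpan", "Hrkt" and "Kore" follow ISO 15924, but everything else is an assumption made for the example: the function names are hypothetical, and because the Python standard library does not expose the Unicode Script property, the script of a character is guessed from its character name, which is only a heuristic and not a substitute for the real property data.

```python
import unicodedata

# ISO 15924 aliases for mixed-script writing systems, expanded to their parts.
SCRIPT_ALIASES = {
    "Jpan": {"Hani", "Hira", "Kana"},  # Han + Hiragana + Katakana
    "Hrkt": {"Hira", "Kana"},          # Hiragana + Katakana
    "Kore": {"Hang", "Hani"},          # Hangul + Han
}

# ASSUMPTION: character-name prefixes stand in for the Script property.
# This is a heuristic for the sketch only; real code should use Scripts.txt.
NAME_PREFIX_TO_SCRIPT = {
    "CJK UNIFIED IDEOGRAPH": "Hani",
    "HIRAGANA": "Hira",
    "KATAKANA": "Kana",
    "HANGUL": "Hang",
    "LATIN": "Latn",
    "GREEK": "Grek",
    "CYRILLIC": "Cyrl",
}


def guess_script(ch):
    """Return a guessed ISO 15924 code for a letter, or None for non-letters."""
    if not unicodedata.category(ch).startswith("L"):
        return None  # digits, punctuation, marks: ignored by this sketch
    name = unicodedata.name(ch, "")
    for prefix, script in NAME_PREFIX_TO_SCRIPT.items():
        if name.startswith(prefix):
            return script
    return "Zzzz"  # ISO 15924 code for "uncoded script": unknown to this table


def scripts_used(text):
    """Set of guessed script codes for all letters in the text."""
    return {s for s in map(guess_script, text) if s is not None}


def fits_alias(text, alias):
    """True if every letter in the text belongs to a script the alias allows."""
    return scripts_used(text) <= SCRIPT_ALIASES[alias]
```

With this, an identifier mixing Han, Hiragana and Katakana passes `fits_alias(..., "Jpan")`, while one that also contains Latin letters does not.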
There will still remain confusables (such as the Latin, Greek and
Cyrillic variants of the letter A), which are unavoidable in some names using
mixed scripts (notably user names, or geographic feature names or
trademarks, if they are used as identifiers for page names or similar on a
community website, forum, wiki, or similar).
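To make the Latin/Greek/Cyrillic "A" case concrete, here is a small Python sketch. The mapping table is a tiny hand-written subset chosen for this example (it is not the real confusables data published with UTS #39), and the `skeleton` and `confusable` helpers are hypothetical names only loosely modeled on the skeleton idea from that report.

```python
import unicodedata

LATIN_A = "A"          # U+0041 LATIN CAPITAL LETTER A
GREEK_ALPHA = "\u0391"  # GREEK CAPITAL LETTER ALPHA
CYRILLIC_A = "\u0410"   # CYRILLIC CAPITAL LETTER A

# ASSUMPTION: a hand-picked illustrative subset, not the full confusables table.
TO_LATIN_PROTOTYPE = {
    GREEK_ALPHA: "A",
    CYRILLIC_A: "A",
    "\u0430": "a",  # CYRILLIC SMALL LETTER A
}


def skeleton(name):
    """Map each character to its look-alike prototype after NFD normalization."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(TO_LATIN_PROTOTYPE.get(ch, ch) for ch in decomposed)


def confusable(a, b):
    """True if two different strings collapse to the same skeleton."""
    return a != b and skeleton(a) == skeleton(b)
```

Note that normalization alone does not help here: the three letters are distinct code points with no compatibility mapping between them, so "ADMIN" and a spelling that starts with the Cyrillic letter remain different strings under NFC or NFKC and are only caught by a confusables check of this kind.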
Various websites and applications will need their own limitations on usable
names (and must be aware that any limitation may cause some orthographic
problems, notably for user names).
In more technical programming languages, however, you can usually be much
more restrictive, as the identifiers used are generally abbreviated and
simplified: you can eliminate letter-case differences, for example, as well as
bidi controls, and probably some joiner/disjoiner controls and other
invisible format controls (identifiers that would otherwise collide will need
to be distinguished using some other characters). Forcing a normalization to NFC
is certainly helpful too. If you need to embed user names in these languages,
they will sometimes need to be "escaped", or included in string
literals rather than written as plain identifiers.
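The restrictions above can be sketched in a few lines of Python. This is only one possible reading, with assumptions of my own: the function name `canonical_identifier` is hypothetical, "invisible format controls" is interpreted as every character in general category Cf (which covers the bidi controls as well as ZWJ and ZWNJ), and case differences are removed with `str.casefold()`.

```python
import unicodedata


def canonical_identifier(raw):
    """Return a restricted, comparable form of an identifier.

    Steps: drop format controls (category Cf), normalize to NFC,
    case-fold, then normalize again because folding can denormalize.
    """
    # ASSUMPTION: all Cf characters are dropped, including ZWJ/ZWNJ.
    no_format = "".join(
        ch for ch in raw if unicodedata.category(ch) != "Cf"
    )
    folded = unicodedata.normalize("NFC", no_format).casefold()
    return unicodedata.normalize("NFC", folded)
```

Two spellings that differ only in case, in precomposed versus decomposed accents, or in invisible controls then compare equal. Dropping the joiner controls can merge identifiers that are visually distinct in some scripts, which is why such identifiers would have to be told apart by other characters.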
2016-12-04 12:09 GMT+01:00 Reini Urban <reini at cpanel.net>:
> Of course there exist several languages which require more than one
> script, like
> Japanese = Hiragana and Katakana and maybe Han,
> Korean = Hangul + Han, …
> or African languages, as some have other than Latin roots, e.g. Ethiopian
> from Semitic.
> Indian languages also sound problematic, and all the Old_<script>
> For these I just add aliases to allow multiple Scripts.