I’m implementing a Unicode names library. I’m confused about loose character-name matching, even after rereading The Unicode Standard § 4.8, UAX #34 § 4, and UAX #44 § 5.9.2, as well as [L2/13-142](http://www.unicode.org/L2/L2013/13142-name-match.txt), [L2/14-035](http://www.unicode.org/cgi-bin/GetMatchingDocs.pl?L2/14-035), and the [meeting in which those two items were resolved](https://www.unicode.org/L2/L2014/14026.htm).
In particular, I’m confused by this claim in The Unicode Standard § 4.8: “Because Unicode character names do not contain any underscore (“_”) characters, a common strategy is to replace any hyphen-minus or space in a character name by a single “_” when constructing a formal identifier from a character name. This strategy automatically results in a syntactically correct identifier in most formal languages. Furthermore, such identifiers are guaranteed to be unique, because of the special rules for character name matching.”
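To make the quoted strategy concrete, here is a minimal Python sketch of it as I read it: every hyphen-minus and every space is replaced by exactly one underscore. The function name `to_identifier` is hypothetical and used only for illustration.

```python
def to_identifier(name: str) -> str:
    """Replace each hyphen-minus or space with a single underscore,
    following a literal reading of the strategy quoted from TUS § 4.8."""
    return name.replace("-", "_").replace(" ", "_")


# A space followed by a hyphen becomes two underscores, so these two
# Tibetan names still map to different identifiers.
print(to_identifier("TIBETAN LETTER A"))      # TIBETAN_LETTER_A
print(to_identifier("TIBETAN LETTER -A"))     # TIBETAN_LETTER__A
print(to_identifier("HANGUL JUNGSEONG O-E"))  # HANGUL_JUNGSEONG_O_E
print(to_identifier("HANGUL JUNGSEONG OE"))   # HANGUL_JUNGSEONG_OE
```

Under this reading the identifiers are distinct as exact strings; whether they remain distinct once underscores are ignored during loose matching is the part I am unsure about.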
I’m also confused by the relationship between UAX34-R3 and UAX44-LM2.
To make these issues concrete, let’s say that my library provides a function called getCharacter that takes a name argument, tries to find a loosely matching character, and returns it (or a null value if no character currently loosely matches). What should the following expressions return?
getCharacter("HANGUL-JUNGSEONG-O-E")
getCharacter("HANGUL_JUNGSEONG_O_E")
getCharacter("HANGUL_JUNGSEONG_O_E_")
getCharacter("HANGUL_JUNGSEONG_O__E")
getCharacter("HANGUL_JUNGSEONG_O_-E")
getCharacter("HANGUL JUNGSEONGCHARACTERO E")
getCharacter("HANGUL JUNGSEONG CHARACTER OE")
getCharacter("TIBETAN_LETTER_A")
getCharacter("TIBETAN_LETTER__A")
getCharacter("TIBETAN_LETTER _A")
getCharacter("TIBETAN_LETTER_-A")
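
For comparison, here is a rough Python sketch of one literal reading of UAX44-LM2 (ignore case, whitespace, underscores, and medial hyphens, except the hyphen in U+1180 HANGUL JUNGSEONG O-E). The names `loose_key`, `get_character`, and `SAMPLE_NAMES` are hypothetical. The sketch assumes that a hyphen in the input counts as medial only when it sits directly between two letters or digits, whereas UAX #44 defines "medial" on the normative name, so the edge cases above may well come out differently under the intended rule.

```python
import re
import unicodedata

# Assumption: a "medial" hyphen is one directly between two letters/digits.
MEDIAL_HYPHEN = re.compile(r"(?<=[A-Z0-9])-(?=[A-Z0-9])")
# Whitespace and underscores are ignored by UAX44-LM2.
IGNORABLE = re.compile(r"[\s_]+")


def loose_key(name: str) -> str:
    """Return a comparison key for a character name under this reading."""
    upper = name.upper()
    key = IGNORABLE.sub("", MEDIAL_HYPHEN.sub("", upper))
    # Exception named in UAX44-LM2: keep the hyphen of U+1180 so that it
    # does not collide with U+116C HANGUL JUNGSEONG OE.
    if key == "HANGULJUNGSEONGOE" and IGNORABLE.sub("", upper).endswith("O-E"):
        return "HANGULJUNGSEONGO-E"
    return key


# Tiny illustrative table; a real library would index every character name.
SAMPLE_NAMES = [
    "HANGUL JUNGSEONG O-E",
    "HANGUL JUNGSEONG OE",
    "TIBETAN LETTER A",
    "TIBETAN LETTER -A",
]
INDEX = {loose_key(n): unicodedata.lookup(n) for n in SAMPLE_NAMES}


def get_character(name: str):
    """Return the loosely matching character, or None if there is none."""
    return INDEX.get(loose_key(name))
```

With this sketch, `get_character("HANGUL-JUNGSEONG-O-E")` finds U+1180 and `get_character("TIBETAN_LETTER_A")` finds U+0F68, but I do not know whether its answers for the underscore-plus-hyphen inputs match what the standard intends.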
Thanks,
J. S. Choi
Received on Thu Jan 17 2019 - 17:45:09 CST