---------- Forwarded message ----------
From: Joan Montané <jmontane@softcatala.org>
Date: Thu, Nov 21, 2013 at 9:01 PM
Subject: Catalan MIDDLEDOT and UAX #31
To: markdavis@google.com


Hi Mr. Davis,

I'm Joan Montané, a member of Softcatalà [1], a non-profit organization that promotes the use of the Catalan language in computing. We usually translate open-source programs (Mozilla products, GNOME, LibreOffice...) and provide Catalan resources for the translator community (translation memories and style guides) and for the speaker community (dictionaries, hyphenation rules, thesaurus, grammar checker...). Our work is done entirely by volunteers.

I'm writing to you because you are listed as the editor of Unicode UAX #31 [2], and I'd like to get your opinion first about requesting a change in UAX #31. Of course, we can also discuss this on the CLDR list or another public Unicode mailing list.

Sorry for my English, :)

Currently, I'm looking for bugs related to "·" MIDDLEDOT U+00B7. In Catalan this character is used like a diacritical mark between two L's. Because MIDDLEDOT has so many different uses, the category Unicode assigns to it [3] is a bit of a nightmare for Catalans.

For instance, we have problems with text segmentation (and therefore with spell-checking) if a program doesn't follow UAX #29. When we ask for text segmentation that follows UAX #29, all our problems go away, :)

I've reported several bugs about URL autodetection in email clients. According to RFC 5892 [4], Appendix A.3, MIDDLEDOT is allowed only between two L's, so I hope these bugs will be fixed.

With UAX #31 it is different, because including MIDDLEDOT as a valid identifier character is optional.

The default settings in UAX #31 don't allow identifiers that use MIDDLEDOT. As a result, a "weird" effect occurs. Ŀ U+013F and ŀ U+0140 are allowed in identifiers. But, as you can see in [5] and [6], their NFKC forms are L+<U+00B7> and l+<U+00B7>.
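The NFKC mappings mentioned above can be checked with a short Python sketch. It uses only the standard `unicodedata` module; the variable names are illustrative and not taken from any specification.

```python
import unicodedata

MIDDLE_DOT = "\u00B7"          # · MIDDLE DOT
L_WITH_DOT_UPPER = "\u013F"    # Ŀ LATIN CAPITAL LETTER L WITH MIDDLE DOT
L_WITH_DOT_LOWER = "\u0140"    # ŀ LATIN SMALL LETTER L WITH MIDDLE DOT

# NFKC applies the compatibility decomposition of each precomposed letter,
# which is the base letter followed by U+00B7.
nfkc_upper = unicodedata.normalize("NFKC", L_WITH_DOT_UPPER)
nfkc_lower = unicodedata.normalize("NFKC", L_WITH_DOT_LOWER)

# NFC only applies canonical mappings, so it leaves both letters unchanged.
nfc_lower = unicodedata.normalize("NFC", L_WITH_DOT_LOWER)

# The two spellings of the same hashtag become identical under NFKC.
same_after_nfkc = (
    unicodedata.normalize("NFKC", "i\u0140lusi\u00F3")
    == unicodedata.normalize("NFKC", "il\u00B7lusi\u00F3")
)
```

Here `nfkc_upper` is "L" followed by U+00B7 and `nfkc_lower` is "l" followed by U+00B7, while `nfc_lower` is still the single character U+0140. This is the compatibility (not canonical) relationship between the two encodings.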

So, with the default UAX #31 settings, you can't define an identifier if you use the preferred Unicode encoding L+<U+00B7>, but you can get a compatibility-equivalent (though not canonically equivalent) identifier if you use a non-preferred encoding (Ŀ U+013F and ŀ U+0140). Really weird!!!

The best example is Twitter. I suspect it uses UAX #31 to determine hashtags. Catalans type hashtags like #il·lusió (illusion), and it fails, but if you type #iŀlusió, it works.

I wonder if the next release of UAX #31 could add a default clause allowing <U+00B7> in identifiers when it follows an L (upper or lower case). Something similar to RFC 5892.
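A minimal sketch of how such a contextual rule might look, in Python. The helper `base_allowed` is a hypothetical stand-in for whatever identifier profile an implementation already uses (it is not the UAX #31 definition), and `is_valid_identifier` is an illustrative name. Only the MIDDLEDOT check reflects the proposal: U+00B7 is accepted when the preceding character is "l" or "L".

```python
MIDDLE_DOT = "\u00B7"


def base_allowed(ch: str) -> bool:
    """Hypothetical stand-in for an existing identifier profile:
    letters, digits and underscore. Not the UAX #31 definition."""
    return ch.isalnum() or ch == "_"


def is_valid_identifier(text: str) -> bool:
    """Accept MIDDLE DOT only when it directly follows 'l' or 'L'."""
    if not text:
        return False
    for i, ch in enumerate(text):
        if ch == MIDDLE_DOT:
            # Contextual rule: the previous character must be an L.
            if i == 0 or text[i - 1] not in "lL":
                return False
        elif not base_allowed(ch):
            return False
    return True
```

Under these assumptions, "il·lusió" and "cèl·lula" are accepted, while "a·b" and a string starting with "·" are rejected. A stricter variant in the style of RFC 5892 would also require an L after the dot.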

Last, but not least, I'm looking for a contact at Google about Google Translate. When you translate into Catalan, it always outputs spaces around MIDDLEDOT, e.g. "cell" -> "cèl · lula" instead of "cèl·lula". Can you help me?

Thanks in advance for your time.

Best regards,

Joan Montané

[1] http://en.wikipedia.org/wiki/Softcatal%C3%A0
[2] http://www.unicode.org/reports/tr31/
[3] http://unicode.org/cldr/utility/character.jsp?a=00B7
[4] http://www.rfc-editor.org/rfc/rfc5892.txt
[5] http://unicode.org/cldr/utility/character.jsp?a=013F
[6] http://unicode.org/cldr/utility/character.jsp?a=0140