Hi Mr. Davis,
I'm Joan Montané, a member of Softcatalà [1], a non-profit
organization that promotes the use of the Catalan language in
computing. We usually translate open-source programs
(Mozilla products, GNOME, LibreOffice...) and provide Catalan
resources for the translator community (translation memories and
translation guides) and for the speaker community (dictionaries,
hyphenation rules, thesaurus, grammar checker...). Our work is
done 100% by volunteers.
I'm writing to you because you are listed
as an editor of Unicode UAX #31 [2], and I would like to ask your
opinion first about requesting a change in UAX #31. Of course, we can
also discuss this on the CLDR list or another public Unicode mailing list.
Sorry for
my English, :)
Currently, I'm looking for bugs related to
"·" MIDDLEDOT U+00B7. This character is used in Catalan as a
diacritical mark between two L's. Because MIDDLEDOT has several
different uses, the category Unicode assigns to it [3] is a bit of a
nightmare for Catalans.
For instance, we have
text-segmentation problems (and, as a result, spell-checking problems)
when a program doesn't follow UAX #29. In those cases we ask for text
segmentation that follows UAX #29, and all our problems are gone, :)
I've reported several bugs about URL autodetection in email
clients. According to RFC5892 [4], appendix A.3, MIDDLEDOT is
allowed only between two L's, so I hope these bugs will be
fixed.
With UAX #31 it is different, because including MIDDLEDOT
as a valid identifier character is optional.
The default settings in UAX #31
don't allow identifiers containing MIDDLEDOT. This has a "weird"
effect. Ŀ U+013F and ŀ U+0140 are allowed in identifiers.
But, as you can see in [5] and [6], their NFKC forms are L+<U+00B7>
and l+<U+00B7>.
So, using default UAX #31, you can't define
an identifier if you use the preferred Unicode encoding L+<U+00B7>,
but you can get an identifier that is compatibility equivalent (though
not canonically equivalent) if you use the non-preferred encoding
(Ŀ U+013F and ŀ U+0140). Really weird!!!
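To make the NFKC point concrete, here is a small Python sketch using the standard unicodedata module. The helper names (codepoints, nfkc) are only for illustration. It shows nothing more than the compatibility mappings from [5] and [6]; it does not say anything about how a particular program classifies identifiers.

```python
import unicodedata


def codepoints(text: str) -> str:
    """Render a string as a space-separated list of U+XXXX code points."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


def nfkc(text: str) -> str:
    """Return the NFKC normalization form of the text."""
    return unicodedata.normalize("NFKC", text)


# Ŀ (U+013F) and ŀ (U+0140) each decompose to L/l followed by MIDDLE DOT.
for ch in ("\u013f", "\u0140"):
    print(codepoints(ch), "->", codepoints(nfkc(ch)))
# U+013F -> U+004C U+00B7
# U+0140 -> U+006C U+00B7

# The two spellings of the same word become identical after NFKC.
precomposed = "i\u0140lusi\u00f3"   # iŀlusió, non-preferred encoding
preferred = "il\u00b7lusi\u00f3"    # il·lusió, preferred encoding
print(nfkc(precomposed) == preferred)  # True
```

So a program that accepts the first spelling and then normalizes with NFKC ends up holding exactly the string it would have rejected.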
The best example is Twitter. I suspect it uses UAX #31 to determine
hashtags. Catalans type hashtags like #il·lusió (illusion), and it
fails, but if you type #iŀlusió, then it works.
I wonder if the next release of UAX #31 can add a default clause
allowing <U+00B7> in identifiers if it follows an L (upper or lower
case). Something similar to RFC5892.
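As a rough illustration of the kind of clause I mean, here is a minimal Python sketch. Everything in it is an assumption made for the example: is_base_char is only a simplified stand-in for the default identifier character classes (it is not the real UAX #31 property data), and accepts_identifier is a hypothetical name, not an existing API.

```python
MIDDLE_DOT = "\u00b7"


def is_base_char(ch: str) -> bool:
    # Assumption: a simplified stand-in for the default identifier
    # classes (letters, digits, underscore). Not the real UAX #31 data.
    return ch.isalnum() or ch == "_"


def accepts_identifier(text: str) -> bool:
    """Accept base characters, plus MIDDLE DOT only right after L or l."""
    if not text:
        return False
    for i, ch in enumerate(text):
        if is_base_char(ch):
            continue
        # Proposed contextual rule: U+00B7 is valid only if it follows an L.
        if ch == MIDDLE_DOT and i > 0 and text[i - 1] in "Ll":
            continue
        return False
    return True


print(accepts_identifier("il\u00b7lusi\u00f3"))  # il·lusió
print(accepts_identifier("COL\u00b7LEGI"))       # COL·LEGI
print(accepts_identifier("a\u00b7b"))            # a·b, dot not after L
```

With this sketch the first two calls are accepted and the third is rejected. A stricter variant in the spirit of RFC5892 would also require an L after the dot.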
Last, but not least, I'm looking for a contact at Google about Google
Translate. When you translate into Catalan it always outputs spaces
around MIDDLEDOT, e.g. "cell" -> "cèl · lula" instead of "cèl·lula".
Can you help me?