RE: Unicode or specific language charset

From: Michael Maxwell (mmaxwell@casl.umd.edu)
Date: Tue Dec 19 2006 - 15:37:22 CST


    > 1) Some people working with diverse languages (thinking here
    > of some academic linguists) who have found comfortable
    > solutions in the past involving non-Unicode fonts may be
    > reluctant to change. These are probably fewer by the day, and
    > I imagine that anyone who has been exchanging text widely in
    > languages with extended Latin or non-Latin characters will
    > have seen the advantage of working in Unicode.

    I used to be one of those persons, when I worked on minority languages in Colombia. I would say the situation was (and maybe still is) more common with field linguists (working in minority languages) than it is with academics in general.

    Still, there is considerable impetus towards using Unicode in field linguistics--an increasing number of tools for field linguists are available in Unicode versions, the IPA has virtually all the characters one would need when recording data phonetically or phonemically, sufficient characters are available in Unicode for practical orthographies(!), major organizations that deal with minority/ previously unwritten languages are encouraging or even mandating the use of Unicode, etc.

    I think the only real issue for field linguists is that in some areas with complex orthographies, the fonts to implement those Unicode characters might be too language-specific. I can imagine that someone working with a minority language in India might find that standard Devanagari (etc.) fonts might not behave the way they need.

    I don't have any real examples of that, but I can say that the font/ rendering support of Unicode for Yoruba (which of course has been written for over a century) was lacking. Specifically, the combination of a dot under a vowel ('e' or 'o') plus a tone mark (grave or acute accent) does not look "pretty". You can see examples at http://en.wikipedia.org/wiki/Yoruba_language. When I look at this page on a Windows XP machine, the tone marks over the plain vowels are "correctly" placed (presumably built-in glyphs in the font), whereas the tone marks over the dotted lower-case vowels are much too high; while either the tone marks are too far to the right over the dotted upper-case vowels, or else the dot is too far to the right under the accented upper-case vowels (depending on which is composed first and therefore uses a built-in glyph, I suspect). (Mid-tone marks are not usually written, but in the Wikipedia page you can see a few of these, and they have the same problems as the acute or grave accents on the dotted vowels, and also over the 'n' or engma.)
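
To make the "which is composed first" point concrete, here is a minimal Python sketch using only the standard `unicodedata` module. It assumes the text is stored as Unicode code points and looks at encoding only; it says nothing about how a particular font or renderer positions the marks. The variable names and the helper `describe` are illustrative, not taken from any tool mentioned above. Unicode has a precomposed "e with dot below" and a precomposed "e with acute", but no single code point carrying both marks, so one mark is always left as a combining character that the font has to position.

```python
import unicodedata

def describe(text):
    """Return a readable list of the code points in text (illustrative helper)."""
    return [f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in text]

# Two ways the same Yoruba vowel might be typed (assumed inputs):
dot_first = "e\u0323\u0301"    # e + combining dot below + combining acute
acute_first = "\u00e9\u0323"   # precomposed e-acute + combining dot below

# NFC normalization reorders the marks canonically and composes what it can.
nfc_dot_first = unicodedata.normalize("NFC", dot_first)
nfc_acute_first = unicodedata.normalize("NFC", acute_first)

for label, s in [("dot first", nfc_dot_first), ("acute first", nfc_acute_first)]:
    print(label, describe(s))

# Both inputs normalize to the same two code points:
# U+1EB9 (e with dot below) followed by U+0301 (combining acute accent).
print(nfc_dot_first == nfc_acute_first)
```

Under NFC the dot below is the mark that ends up inside the precomposed letter, and the tone mark stays a separate combining character. Text that is not normalized can keep the opposite arrangement, which may be why the two kinds of misplacement described above both show up.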

    While the font issues I'm describing are not the fault of Unicode, this is not obvious to the casual user--and the distinction may not matter to the user in any case. Such a user might very well turn to a proprietary font/ encoding for displaying Yoruba or some other language with similar issues. And as you may know, those proprietary fonts/ encodings are all too common among the Indic languages...

       Mike Maxwell
       CASL/ U MD




    This archive was generated by hypermail 2.1.5 : Tue Dec 19 2006 - 15:41:42 CST