L2/06-379
UDHR in Unicode
Table of Contents
1. Presentation
2. Notes for Abkhaz
3. Notes for Afrikaans
4. Notes for Yaneshaʼ
5. Thanks

1. Presentation
The goal of the UDHR in Unicode project is to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. The UDHR was selected because it is available in a large number of languages from the Office of the United Nations High Commissioner for Human Rights (OHCHR) at http://www.unhchr.ch/udhr/.
The deliverables of the project are, for each language:
the text of UDHR in a simple XML markup. Particular attention is paid to the use of the proper characters.
alternative presentations generated from the XML representation (plain text, HTML, PDF, etc.)
links to resources for each language, such as dictionaries and grammars (as they are identified while preparing the first deliverable)
some comments on the use of Unicode (e.g. which characters are appropriate)
As of November 4, 2006, the project has produced the XML representation for 298 different languages. There are 32 additional languages which are available in some form on the OHCHR site but have not yet been converted to a Unicode text format. There have been no submissions of new translations nor of existing translations in other writing systems (but those are welcome).
In addition to the demonstration value, the deliverables of this project may be useful to Unicode implementers, e.g. for testing purposes.
Also, experience reveals that there are sometimes multiple approaches to the representation of a language in Unicode. The project attempts to select and describe per-language Unicode usage guidelines to facilitate the interchange and processing of text (see the following sections for examples). The UTC may want to incorporate some of those in the text of the standard.
The Unicode community is encouraged to help with this project by contributing Unicode versions for additional languages (using the OHCHR source or any other source), reviews of the existing Unicode versions, usage guidelines, and suggestions for making the project more generally useful. Regarding additional languages, the OHCHR data for the 32 languages that do not yet have a Unicode version cannot be converted more or less mechanically and instead needs complete retyping. We are also particularly interested in versions using the Myanmar script, both under the current Unicode model and under the proposed and approved new model.
For more information on the project, to access the current data, and to contribute, please visit http://udhrinunicode.org.
2. Notes for Abkhaz
Bibliography
K.V. Lomtatidze, Grammatika abkhazskogo iazyka. Fonetika i morfologiia, Sukhumi, "Alashara," 1968.
Chirikba, V. A. (Viacheslav Andreevich), Abkhaz, Munich, LINCOM, 2003. Volume 119 of Languages of the world/Materials. ISBN 3895861367.
Bgazhba, Kh., Iz istorii pis'mennosti v Abkhazii [On the History of Writing in Abkhazia], Tbilisi, Metsnieveba, 1967.
[Chirikba] discusses the various orthographies which have been used for Abkhaz. In particular, page 88 reprints a page of [Bgazhba] showing a comparative table of those orthographies, some of which use Cyrillic. Unfortunately, [Chirikba] reprints only part of the table, and we have not (yet) found a copy of [Bgazhba] to see the complete table.
Hooks and descenders
[Lomtatidze] and [Chirikba] give the same inventory of letters if one considers that the addition of a middle hook or a descender to a base letter is a glyph variation. The set of such modified letters is г ghe, к ka, п pe, т te, х ha and ч che. [Chirikba] consistently uses the forms with descender, while [Lomtatidze] uses the forms with hook for г ghe and п pe and the forms with descender for the others.
Because Unicode sometimes encodes separate characters for a letter with middle hook and a letter with descender, there can be some confusion about which character is appropriate. For Abkhaz, the following characters are recommended:
г ghe U+0495 ҕ CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
к ka U+049B қ CYRILLIC SMALL LETTER KA WITH DESCENDER
п pe U+04A7 ҧ CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
т te U+04AD ҭ CYRILLIC SMALL LETTER TE WITH DESCENDER
х ha U+04B3 ҳ CYRILLIC SMALL LETTER HA WITH DESCENDER
ч che U+04B7 ҷ CYRILLIC SMALL LETTER CHE WITH DESCENDER
Rendering software can use the forms with descender for U+0495 and U+04A7 for Abkhaz text.
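The recommendation for Abkhaz can be checked against the character names in the Unicode Character Database. The following is a minimal Python sketch using the standard `unicodedata` module; the names `RECOMMENDED` and `describe` are illustrative and not part of the project.

```python
import unicodedata

# Base Cyrillic letter -> character recommended for Abkhaz in the notes above.
RECOMMENDED = {
    "\u0433": "\u0495",  # ghe -> GHE WITH MIDDLE HOOK
    "\u043A": "\u049B",  # ka  -> KA WITH DESCENDER
    "\u043F": "\u04A7",  # pe  -> PE WITH MIDDLE HOOK
    "\u0442": "\u04AD",  # te  -> TE WITH DESCENDER
    "\u0445": "\u04B3",  # ha  -> HA WITH DESCENDER
    "\u0447": "\u04B7",  # che -> CHE WITH DESCENDER
}


def describe(ch: str) -> str:
    """Return the code point and UCD name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


if __name__ == "__main__":
    for base, modified in RECOMMENDED.items():
        print(f"{describe(base)} -> {describe(modified)}")
```

Running the script lists each base letter next to its recommended modified form, which makes it easy to see that only ghe and pe use a MIDDLE HOOK character.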
3. Notes for Afrikaans
Bibliography
Afrikaanse woordelys en spelreëls, Taalkommissie van die Suid-Afrikaanse Akademie vir Wetenskap en Kuns, Pharos, Cape Town, 2002, ISBN 1868900347.
Representing ’n
Afrikaans writes the indefinite article ’n, which is clearly an elision of een. There are a priori multiple ways to represent this using Unicode:
<U+0149 ʼn LATIN SMALL LETTER N PRECEDED BY APOSTROPHE>. After all, this character is specifically annotated “Afrikaans”. However, it is also annotated “legacy compatibility character for ISO/IEC 6937”, and the corresponding character is deprecated (in that standard, not in Unicode).
using the compatibility decomposition of U+0149, which is <U+02BC ʼ MODIFIER LETTER APOSTROPHE, U+006E n LATIN SMALL LETTER N>. However, it seems that this decomposition is a remnant of the time when U+02BC was the recommended character for the representation of elision, which is no longer the case.
<U+2019 ’ RIGHT SINGLE QUOTATION MARK, U+006E n LATIN SMALL LETTER N>, on the basis of U+2019 being the appropriate character for the apostrophe used for elision, and on the basis of other similar constructs in Afrikaans. This is the solution we have retained for the representation of the UDHR.
Regardless of one’s choice, it seems that all three representations, as well as the obvious <U+0027 ' APOSTROPHE, U+006E n LATIN SMALL LETTER N>, can be found on the web without too much difficulty. It is also likely that one can find <U+2018 ‘ LEFT SINGLE QUOTATION MARK, U+006E n LATIN SMALL LETTER N>, as a result of “smart quotes” functionality.
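For processing text that mixes these representations, one possible approach is to map every variant to the sequence retained for the UDHR. The Python sketch below is an illustration only: the function name `normalize_article` and the word-boundary heuristic are assumptions, and the pattern only handles the article when it stands as a separate word.

```python
import re
import unicodedata

# Representation retained for the UDHR: U+2019 followed by U+006E.
RETAINED = "\u2019n"

# U+0149 alone, or one of U+0027, U+2018, U+2019, U+02BC followed by "n",
# when not attached to a neighbouring word character.
_VARIANT = re.compile(r"(?<!\w)(?:\u0149|[\u0027\u2018\u2019\u02BC]n)(?!\w)")


def normalize_article(text: str) -> str:
    """Rewrite the known spellings of the indefinite article to U+2019 + n."""
    return _VARIANT.sub(RETAINED, text)


# The compatibility decomposition of U+0149 is U+02BC followed by U+006E.
DECOMPOSED = unicodedata.normalize("NFKD", "\u0149")

if __name__ == "__main__":
    print([f"U+{ord(c):04X}" for c in DECOMPOSED])
    print(normalize_article("Dit is 'n boek"))
```

The lookarounds keep the pattern from touching quoted words that merely begin with n; a production tool would likely need further language-specific checks.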
Note that the UCD provides the following complex uppercasing and titlecasing for U+0149: <U+02BC ʼ MODIFIER LETTER APOSTROPHE, U+004E N LATIN CAPITAL LETTER N>. However, according to the Afrikaanse woordelys en spelreëls, when a word starting with an apostrophe is capitalized, it is the first letter of the next word which is capitalized; for example, 'n groot ... at the beginning of a sentence becomes 'n Groot ... (rule 9.3; see more examples at http://hapax.iquebec.com/typo-afrikaans.htm).
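The capitalization rule just described can be sketched as follows. This is a simplified Python illustration, not an implementation of the full spelling rules: the function name `capitalize_sentence` is hypothetical, words are assumed to be separated by single spaces, and only the first two words of the sentence are considered.

```python
# Characters that may open the article: the four apostrophe-like characters
# discussed above, plus the precomposed U+0149.
LEADING = "\u0027\u2018\u2019\u02BC\u0149"


def capitalize_sentence(sentence: str) -> str:
    """Capitalize a sentence start, skipping a first word that begins with an apostrophe."""
    words = sentence.split(" ")
    if not words[0]:
        return sentence
    if words[0][0] in LEADING:
        # The apostrophe word stays as is; the next word takes the capital.
        if len(words) > 1 and words[1]:
            words[1] = words[1][0].upper() + words[1][1:]
    else:
        words[0] = words[0][0].upper() + words[0][1:]
    return " ".join(words)


if __name__ == "__main__":
    print(capitalize_sentence("'n groot huis"))
    print(capitalize_sentence("die groot huis"))
```

The design deliberately avoids calling an uppercasing function on the article itself, so the UCD special casing for U+0149 never comes into play.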
4. Notes for Yaneshaʼ
Bibliography
SIL Peru's site for Yaneshaʼ: http://www.sil.org/americas/peru/html/pubs/pub_index_list.asp?sortby=lang&name=Yanesha%C2%B4%20(Amuesha)
Dictionary: http://www.sil.org/americas/peru/html/pubs/show_work.asp?id=601
Grammar: http://www.sil.org/americas/peru/html/pubs/show_work.asp?id=597
Orthography: http://www.sil.org/americas/peru/html/pubs/show_work.asp?id=3213
Apostrophe
Besides the usual Latin letters (some used in digraphs), the Yaneshaʼ orthography uses an apostrophe to write a glottal stop. This is best represented using U+02BC ʼ MODIFIER LETTER APOSTROPHE.
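One practical difference between U+02BC and the punctuation apostrophe U+2019 is their general category, which affects how software segments words. The Python sketch below illustrates this with the standard `unicodedata` and `re` modules; it is offered as an illustration and is not stated in the notes above as the reason for the recommendation. The helper name `tokens` is hypothetical.

```python
import re
import unicodedata

WITH_MODIFIER = "Yanesha\u02BC"  # ends with MODIFIER LETTER APOSTROPHE
WITH_QUOTE = "Yanesha\u2019"     # ends with RIGHT SINGLE QUOTATION MARK


def tokens(text: str) -> list:
    """Split text into runs of word characters, as a naive tokenizer might."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    # U+02BC is a modifier letter (Lm); U+2019 is final punctuation (Pf).
    print(unicodedata.category("\u02BC"), unicodedata.category("\u2019"))
    # The letter stays inside the word; the quotation mark is split off.
    print(tokens(WITH_MODIFIER), tokens(WITH_QUOTE))
```

With U+02BC the glottal stop remains part of the word for such a tokenizer, whereas with U+2019 it is dropped from the token.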
5. Thanks
This project could not happen without the help of volunteers (http://udhrinunicode.org/credits.html). Many thanks to them.