L2/06-379

UDHR in Unicode

Table of Contents

1.  Presentation
2.  Notes for Abkhaz
3.  Notes for Afrikaans
4.  Notes for Yaneshaʼ
5.  Thanks

1. Presentation

The goal of the UDHR in Unicode project is to demonstrate the use of Unicode for a wide variety of languages, using the Universal Declaration of Human Rights (UDHR) as a representative text. The UDHR was selected because it is available in a large number of languages from the Office of the United Nations High Commissioner for Human Rights (OHCHR) at http://www.unhchr.ch/udhr/.

The deliverables of the project are, for each language:

As of November 4, 2006, the project has produced the XML representation for 298 different languages. There are 32 additional languages which are available in some form on the OHCHR site but have not yet been converted to a Unicode text format. There have been no submissions of new translations nor of existing translations in other writing systems (but those are welcome).

In addition to the demonstration value, the deliverables of this project may be useful to Unicode implementers, e.g. for testing purposes.

Also, experience shows that there are sometimes multiple approaches to the representation of a language in Unicode. The project attempts to select and describe per-language Unicode usage guidelines to facilitate the interchange and processing of text (see the following sections for examples). The UTC may want to incorporate some of those in the text of the standard.

The Unicode community is encouraged to help with this project by contributing Unicode versions for additional languages (using the OHCHR source or any other source), reviews of the existing Unicode versions, usage guidelines, and suggestions on making the project more generally useful. For the 32 languages which do not yet have a Unicode version, the OHCHR data are not amenable to more or less mechanical conversion and instead need complete retyping. We are also particularly interested in versions using the Myanmar script, both in the current Unicode model and in the proposed and approved new model.

For more information on the project, to access the current data, and to contribute, please visit http://udhrinunicode.org.

2. Notes for Abkhaz

Bibliography

[Chirikba] discusses the various orthographies which have been used for Abkhaz. In particular, page 88 reprints a page of [Bgazhba] which shows a comparative table of those orthographies, some of them using Cyrillic. Unfortunately, [Chirikba] reprints only part of the table, and we have not (yet) found a copy of [Bgazhba] to see the complete table.

Hooks and descenders

[Lomtatidze] and [Chirikba] give the same inventory of letters if one considers that the addition of a middle hook or a descender to a base letter is a glyph variation. The set of such modified letters is г ghe, к ka, п pe, т te, х ha and ч che. [Chirikba] consistently uses the forms with descender, while [Lomtatidze] uses the form with hook for г ghe and п pe and the form with descender for the others.

Because Unicode sometimes encodes separate characters for a letter with middle hook and a letter with descender, there can be some confusion about which character is appropriate. For Abkhaz, the following characters are recommended:

г ghe U+0495 ҕ CYRILLIC SMALL LETTER GHE WITH MIDDLE HOOK
к ka U+049B қ CYRILLIC SMALL LETTER KA WITH DESCENDER
п pe U+04A7 ҧ CYRILLIC SMALL LETTER PE WITH MIDDLE HOOK
т te U+04AD ҭ CYRILLIC SMALL LETTER TE WITH DESCENDER
х ha U+04B3 ҳ CYRILLIC SMALL LETTER HA WITH DESCENDER
ч che U+04B7 ҷ CYRILLIC SMALL LETTER CHE WITH DESCENDER

Rendering software can use the forms with descender for U+0495 and U+04A7 for Abkhaz text.
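
The recommendation above can be checked mechanically. The following is a minimal sketch in Python, assuming the standard unicodedata module; the names ABKHAZ_RECOMMENDED and non_recommended are hypothetical, and the name-based heuristic for spotting hook and descender letters is an assumption of this sketch, not part of the project.

```python
import unicodedata

# Recommended Abkhaz characters, copied from the table above
# (base letter -> recommended small letter).
ABKHAZ_RECOMMENDED = {
    "\u0433": "\u0495",  # ghe -> GHE WITH MIDDLE HOOK
    "\u043a": "\u049b",  # ka  -> KA WITH DESCENDER
    "\u043f": "\u04a7",  # pe  -> PE WITH MIDDLE HOOK
    "\u0442": "\u04ad",  # te  -> TE WITH DESCENDER
    "\u0445": "\u04b3",  # ha  -> HA WITH DESCENDER
    "\u0447": "\u04b7",  # che -> CHE WITH DESCENDER
}

_RECOMMENDED_SET = set(ABKHAZ_RECOMMENDED.values())


def non_recommended(text):
    """Return the hook/descender Cyrillic letters in text that are not
    in the recommended set (heuristic: matched by character name)."""
    found = []
    for ch in text:
        name = unicodedata.name(ch, "")
        modified = "HOOK" in name or "DESCENDER" in name
        # Compare on the lowercase form so capital letters are accepted too.
        if name.startswith("CYRILLIC") and modified and ch.lower() not in _RECOMMENDED_SET:
            found.append(ch)
    return found


if __name__ == "__main__":
    for base, letter in ABKHAZ_RECOMMENDED.items():
        print(base, "U+%04X" % ord(letter), unicodedata.name(letter))
```

A text that uses, for example, U+04F7 CYRILLIC SMALL LETTER GHE WITH DESCENDER would be reported by non_recommended, while text restricted to the six characters of the table would not.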

3. Notes for Afrikaans

Bibliography

Representing ’n

Afrikaans writes the indefinite article ’n, which is clearly an elision of een. There are a priori multiple ways to represent this using Unicode:

Regardless of one’s choice, it seems that all three representations, as well as the obvious <U+0027 ' APOSTROPHE, U+006E n LATIN SMALL LETTER N>, can be found on the web without too much difficulty. It is also likely that one can find <U+2018 ‘ LEFT SINGLE QUOTATION MARK, U+006E n LATIN SMALL LETTER N>, as a result of “smart quotes” functionality.

Note that the UCD provides the following complex uppercasing and titlecasing for U+0149: <U+02BC ʼ MODIFIER LETTER APOSTROPHE, U+004E N LATIN CAPITAL LETTER N>. However, according to the Afrikaanse woordelys en spelreëls, when a word starting with an apostrophe is to be capitalized, it is the first letter of the next word which is capitalized instead; for example, 'n groot ... at the beginning of a sentence becomes 'n Groot ... (rule 9.3; see more examples at http://hapax.iquebec.com/typo-afrikaans.htm).
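
The difference between the UCD case mapping and the Afrikaans rule can be illustrated with a short sketch in Python, whose str.upper() applies the full case mappings of the UCD. The helper names is_article and capitalize_sentence are hypothetical, the sample sentence is made up, and the set of apostrophe-like characters accepted before n is an assumption of this sketch (only U+0027 and U+2018 are named in the text above). The helper handles only the article, not every word starting with an apostrophe.

```python
# Apostrophe-like characters accepted before "n" (an assumption of this sketch).
APOSTROPHES = "\u0027\u2018\u2019\u02bc"


def is_article(word):
    """True if word is the indefinite article, as U+0149 or as <apostrophe, n>."""
    if word == "\u0149":
        return True
    return len(word) == 2 and word[0] in APOSTROPHES and word[1] == "n"


def capitalize_sentence(sentence):
    """Capitalize the start of a sentence; if it begins with the article,
    leave the article alone and capitalize the next word instead."""
    first, sep, rest = sentence.partition(" ")
    if is_article(first) and rest:
        return first + sep + rest[:1].upper() + rest[1:]
    return sentence[:1].upper() + sentence[1:]


if __name__ == "__main__":
    # Default UCD uppercasing of U+0149 gives <U+02BC, U+004E>.
    print(["U+%04X" % ord(c) for c in "\u0149".upper()])
    # Capitalization following the Afrikaans rule.
    print(capitalize_sentence("'n groot huis"))
```

A process that relies on the default uppercasing or titlecasing alone would produce ʼN at the start of a sentence, which is not what the rule asks for; a language-specific step such as the one sketched here is needed.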

4. Notes for Yaneshaʼ

Bibliography

Apostrophe

Besides the usual Latin letters (some used in digraphs), the Yaneshaʼ orthography uses an apostrophe to write a glottal stop. This is best represented using U+02BC ʼ MODIFIER LETTER APOSTROPHE.
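
One practical consequence of this choice is sketched below in Python: U+02BC has general category Lm (a letter), so it stays inside a word during simple tokenization, whereas punctuation apostrophes do not. The comparison with U+0027 and U+2019 and the helper name tokens are assumptions made for illustration; the sketch uses only the standard re and unicodedata modules.

```python
import re
import unicodedata

# Apostrophe candidates compared in this sketch (the choice of candidates
# is an assumption; the text above recommends U+02BC).
CANDIDATES = ["\u0027", "\u2019", "\u02bc"]


def tokens(text):
    """Split text into words using the regex class \\w (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


if __name__ == "__main__":
    for ch in CANDIDATES:
        word = "Yanesha" + ch
        # Print the code point, its general category, and the resulting tokens.
        print("U+%04X" % ord(ch), unicodedata.category(ch), tokens(word))
```

With U+02BC the language name is kept as a single token including its final glottal stop; with either punctuation character the apostrophe is dropped from the token.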

5. Thanks

This project could not happen without the help of volunteers (http://udhrinunicode.org/credits.html). Many thanks to them.