
Unicode Technical Note #19

Recommendations for Creating New Orthographies

Version: 1
Authors: Deborah Anderson, with Rick McGowan and Ken Whistler (and incorporating comments by Lorna Priest)
Date: January 5, 2005
This Version: http://www.unicode.org/notes/tn19/tn19-1.html
Previous Version: n/a
Latest Version: http://www.unicode.org/notes/tn19/


Summary

This document provides a set of recommendations to linguists who are developing new orthographies.

Status

This document is a Unicode Technical Note. It is supplied purely for informational purposes and publication does not imply any endorsement by the Unicode Consortium. For general information on Unicode Technical Notes, see http://www.unicode.org/notes/.

Introduction

When linguists and user communities are devising a new orthography for a language, or even choosing a single new character for a phoneme, it is particularly helpful to select symbols (or letters) that can be used easily on a computer. Written materials can then be made available quickly to others in electronic form, such as on web pages or in word-processing documents, without users needing to download special fonts or install specialized software. Saddling users with an orthography that does not work on today’s computers should be considered a long-term disservice to them.

The following suggestions are intended to provide advice on which characters will be easier—or more difficult—to implement on computer systems that support Unicode, the widely supported international character encoding standard (www.unicode.org).

To get the easiest and quickest access to characters on computers, it is best to select from one of the over 96,000 characters already in the Unicode Standard. By using the Unicode characters as they were intended, chances are good that the devised orthography will be supported in a wide variety of off-the-shelf software and displayable with widely available fonts. (However, those creating orthographies need to test them out on their computer platforms and software before making such decisions; orthographies based on Latin will be more generally supported than those that rely on obscure historical scripts.)

An orthography that includes a completely new character, one not in Unicode, has effectively zero chance of being fully supported in readily available software. New characters must go through a years-long standardization process involving meetings with standards committees, followed by further delays while the characters are incorporated into software and fonts. Devising a new character means you will not be able to use modern technology for it in the meantime, and you will need to rely on non-standard fonts and software while waiting for the international character encoding bureaucrats to get the character into the international standard. There is also a chance that a de novo character will not be approved for standardization. Hence it is advisable, whenever possible, to use a character that is already encoded, but care must be taken to select one that is appropriate.

Recommendations for creating an orthography

Begin by looking through the code charts in the latest version of the Unicode Standard, or in the PDF version on the Unicode Consortium website. Characters in Unicode are arranged in blocks of similar characters (code blocks) and are listed by block on the Unicode Consortium website (http://www.unicode.org/charts/). Be sure also to read the documentation on the particular characters you choose, to confirm that they have appropriate semantics. Having the correct shape is not, by itself, a sufficient basis for a decision.

If you are using the Latin alphabet, rely on characters from Latin blocks (Basic Latin, Latin-1 Supplement, Latin Extended-A, Latin Extended-B, and Latin Extended Additional). IPA symbols can also be used (IPA Extensions, the "Phonetic Extensions" block, and Spacing Modifier Letters), although these do not necessarily have upper case equivalents.
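
As a rough illustration of the point about upper case equivalents, the following sketch tests whether a candidate letter has a case pair. It assumes Python and its built-in string case mappings, neither of which is part of the original note; the helper name has_case_pair and the list of candidates are hypothetical. The results depend on the version of the Unicode data bundled with the Python interpreter in use, so no output is claimed for letters that might lack a pair.

```python
def has_case_pair(ch):
    """Return True if the character changes under upper() or lower()."""
    return ch.upper() != ch or ch.lower() != ch


# Hypothetical candidates: open e, open o, esh, and c with curl.
candidates = ["\u025b", "\u0254", "\u0283", "\u0255"]

for ch in candidates:
    status = "has a case pair" if has_case_pair(ch) else "no case pair found"
    print(f"U+{ord(ch):04X} {ch}: {status}")
```

A letter reported as having no case pair can still be used, but the orthography then needs a convention for capitalized words and all-capital titles.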

If ASCII characters are used (those accessible on most keyboards, including A-Z, a-z, and 0-9 with some common punctuation), abide by the character properties assigned to those characters. For example, "$" is a currency symbol, and using it as a letter will cause problems such as failure in word selection, parsing difficulties for automated processing, and inappropriate font changes. Similarly, selecting a digit to be used as a letter will cause problems.
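
The word-selection problem can be seen with a minimal sketch, assuming Python's standard re module as a stand-in for the word-finding logic of ordinary software. The sample "words" and the helper name split_words are invented for illustration.

```python
import re

# \w+ approximates how much software finds "words": runs of letters,
# digits, and underscore.
WORD = re.compile(r"\w+")


def split_words(text):
    """Return the word-like runs found in the text."""
    return WORD.findall(text)


print(split_words("ka$a"))       # ['ka', 'a']  ("$" breaks the word in two)
print(split_words("ka\u025ba"))  # one word: the open e is a letter
print("k6ta".isalpha())          # False: a digit is not alphabetic
```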

Characters based on combinations of a base character and combining marks (listed in the Combining Diacritical Marks block, http://www.unicode.org/charts/PDF/U0300.pdf) can be used, but it is best to select a combination built from commonly available characters. Hundreds of pre-composed combinations are available for the Latin script. Test these, or any other, combinations first in widely available fonts (such as Doulos SIL, Gentium, Lucida Grande, Arial Unicode MS, and Code2000), and move your experimental texts between different platforms. Note that correct placement of a combining mark may be limited by the font or text-rendering technology in use.
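
One way to find out whether a base-plus-mark combination has a pre-composed form is to normalize it to NFC and count the resulting code points. The sketch below assumes Python's standard unicodedata module (not mentioned in the original note); nfc_length is a hypothetical helper name.

```python
import unicodedata

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT


def nfc_length(base, mark):
    """Number of code points left after composing base + mark (NFC)."""
    return len(unicodedata.normalize("NFC", base + mark))


print(nfc_length("e", ACUTE))       # 1: composes to U+00E9
print(nfc_length("\u025b", ACUTE))  # 2: open e has no pre-composed form
```

A result of 2 does not rule the combination out. It only means that correct display depends on the font and rendering engine positioning the mark, which is why testing across fonts and platforms matters.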

Always select a Unicode character with the appropriate character properties and semantics, not just the right shape. Character properties identify characteristics, such as whether a particular character is a letter or a number. The right semantics ensure that your character will be treated appropriately as an orthographic character, not as a punctuation mark or non-letter symbol.

To determine a character's properties, one of several methods can be used. You can check the on-line database (http://www.unicode.org/Public/UNIDATA/UnicodeData.txt) or use a utility program to look up the data for characters of interest, such as the free UniBook Character Browser (http://www.unicode.org/unibook/index.html). UniBook has a facility for viewing character properties and displaying charts color-coded by categories such as alphabetic, upper-case, etc. Another tool is the Unicode Character Properties Excel Workbook by Peter Constable, which is available for download at: http://scripts.sil.org/ExcelUnicodeData.
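
For readers who prefer to read the database file directly, the following sketch parses lines in the semicolon-separated format of UnicodeData.txt. It assumes Python; the names CharRecord and parse_line are hypothetical, and the two sample lines are embedded in the code rather than downloaded. Only four of the fields on each line are extracted.

```python
from typing import NamedTuple


class CharRecord(NamedTuple):
    code_point: int
    name: str
    general_category: str
    bidi_class: str


def parse_line(line):
    """Extract a few fields from one UnicodeData.txt line."""
    fields = line.strip().split(";")
    # Field 0: code point (hex), 1: name, 2: general category, 4: bidi class.
    return CharRecord(int(fields[0], 16), fields[1], fields[2], fields[4])


SAMPLE = """\
0024;DOLLAR SIGN;Sc;0;ET;;;;;N;;;;;
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
"""

for line in SAMPLE.splitlines():
    print(parse_line(line))
```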

A description of how to read the list of properties is contained in the Unicode Character Database documentation, under the "Property Values" section. For example, lowercase letters are “Ll” and uppercase letters are “Lu”. For further information, please consult Chapter 4 of The Unicode Standard 4.0.

Suggestion: Find another character in Unicode that is used in a similar way to the one you are working with and look at the encoded character’s properties. When in doubt, you can always ask someone, for example by joining the Unicode mail list.
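
The comparison suggested here can be sketched as follows, assuming Python's standard unicodedata module. The helper names profile and differing_properties are hypothetical, and only three properties are compared, so an empty result is a good sign rather than a guarantee.

```python
import unicodedata


def profile(ch):
    """Collect a few basic properties of one character."""
    return {
        "category": unicodedata.category(ch),
        "bidi": unicodedata.bidirectional(ch),
        "combining": unicodedata.combining(ch),
    }


def differing_properties(candidate, reference):
    """Map each differing property to (candidate value, reference value)."""
    a, b = profile(candidate), profile(reference)
    return {key: (a[key], b[key]) for key in a if a[key] != b[key]}


# Open e behaves like the ordinary letter "e"; the digit "6" does not.
print(differing_properties("\u025b", "e"))  # {}
print(differing_properties("6", "e"))
```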

If the script you are working with is left-to-right, be sure to select a character that is similarly left-to-right or, if a letter, neutral. Likewise, an orthography would be very awkward, perhaps even impossible to use, if it included letters with different basic directionality, such as a mixture of Hebrew and Cyrillic letters.
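
A quick check for mixed directionality, assuming Python's standard unicodedata module, is to collect the bidirectional classes of all letters in the proposed inventory. The function name letter_directions and the sample inventories are invented for illustration.

```python
import unicodedata

# Bidirectional classes with strong direction; "AL" is used for Arabic letters.
STRONG = {"L": "left-to-right", "R": "right-to-left", "AL": "right-to-left"}


def letter_directions(letters):
    """Return the set of directions found among the given characters."""
    return {
        STRONG.get(unicodedata.bidirectional(ch), "not strong")
        for ch in letters
    }


cyrillic_only = "\u0430\u0431\u0432"        # Cyrillic a, be, ve
cyrillic_and_hebrew = "\u0430\u0431\u05d0"  # adds Hebrew alef

print(letter_directions(cyrillic_only))             # {'left-to-right'}
print(len(letter_directions(cyrillic_and_hebrew)))  # 2: mixed directions
```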

Strongly avoid characters identified as a "presentation form" (i.e., from the Alphabetic Presentation Forms and Arabic Presentation Forms blocks) or as a "letterlike symbol", as well as Roman numerals (from the Number Forms block), mathematical operators, and any other characters that are discouraged or deprecated.
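
Many of the characters in these groups can be screened for mechanically. The sketch below is a rough heuristic only, assuming Python's standard unicodedata module: it flags characters that carry a tagged compatibility decomposition, mathematical symbols, and letter-like numbers. The function name warning_for is hypothetical. The heuristic misses some discouraged characters and flags some that may be acceptable, so it does not replace reading the code chart annotations.

```python
import unicodedata


def warning_for(ch):
    """Return a warning string for a doubtful character, else None."""
    decomposition = unicodedata.decomposition(ch)
    # Tagged decompositions such as "<compat> 0066 0069" mark
    # compatibility characters, including presentation forms.
    if decomposition.startswith("<"):
        return "compatibility character: " + decomposition
    category = unicodedata.category(ch)
    if category == "Sm":
        return "mathematical symbol (Sm)"
    if category == "Nl":
        return "letter-like number (Nl)"
    return None


for ch in ["\ufb01", "\ufe8d", "\u2160", "\u2211", "\u00e9"]:
    print(f"U+{ord(ch):04X}", warning_for(ch))
```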

For a punctuation mark, select a character from the punctuation block. Using CJK punctuation with a Latin script is possible, but CJK fonts may make the marks appear extra-wide or extra-narrow.

Avoid selecting a symbol for use as an orthographic letter that is part of a pair (such as an opening bracket or closing bracket).
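
Paired symbols can usually be recognized from their properties. This minimal sketch assumes Python's standard unicodedata module and treats a character as paired if its general category is opening or closing punctuation, or if it is marked as mirrored in bidirectional text. The function name is_paired_symbol is hypothetical.

```python
import unicodedata


def is_paired_symbol(ch):
    """True for opening/closing punctuation and mirrored characters."""
    if unicodedata.category(ch) in {"Ps", "Pe"}:
        return True
    return bool(unicodedata.mirrored(ch))


for ch in ["(", ")", "[", "a"]:
    print(repr(ch), is_paired_symbol(ch))
```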

For orthographies based on the Latin script, before selecting a character from a script other than Latin, carefully read:

(a) the description for the script contained in The Unicode Standard 4.0 or later (available online in PDF format from http://www.unicode.org/versions/). The Unicode Standard provides script descriptions by groups: chapter 7 discusses the European alphabetic scripts (including the Latin script and its extension, the IPA, as well as the Greek, Cyrillic, Armenian and Georgian scripts), Middle Eastern scripts are covered in chapter 8, etc. The text will describe any special issues that may influence the selection of a given script.

(b) any annotation to the particular character given in the names list (i.e., the notes contained below the name of a character and preceded by a bullet in the code charts). Particularly avoid a character whose use is discouraged. The annotation describing the use of a character does not limit it from being employed in other ways: for example, U+0195 LATIN SMALL LETTER HV appears in Gothic transliteration, but it could be adopted in a Native American language orthography.

Note that if a character not yet in Unicode is employed only occasionally, users may want to consider switching to a symbol already in Unicode rather than waiting 2-5 years for a character to be approved (if it is approved at all). For example, if a special character is occasionally written to represent a glottal stop, alternative characters already in the Unicode Standard could be used instead. While this may require a period of adjustment, the user community will be able to use such an alternative immediately.

Avoid the Private Use Area, absolutely. This is an area set aside for the private testing of characters, and is never to be used for any data that is to be interchanged widely or stored permanently in an archive.
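
Existing text can be checked for private-use characters before it is shared or archived. The sketch below assumes Python's standard unicodedata module, in which private-use code points have the general category "Co"; the function names are hypothetical.

```python
import unicodedata


def is_private_use(ch):
    """True if the character is a private-use code point (category Co)."""
    return unicodedata.category(ch) == "Co"


def private_use_positions(text):
    """Return the indexes of private-use characters in the text."""
    return [i for i, ch in enumerate(text) if is_private_use(ch)]


print(private_use_positions("ab\ue000c"))  # [2]
```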

Characters from the Basic Multilingual Plane (BMP; i.e., those in any block whose characters are numbered below U+10000) will be easier to use on most currently available computer platforms and software. Those scripts listed above "Linear B Syllabary" on the code chart page (http://www.unicode.org/charts/) are in the BMP. (A list of the scripts contained in the BMP is available at: http://www.unicode.org/roadmaps/bmp/.)
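
Whether a character is in the BMP follows directly from its code point, as this small Python sketch shows. The function names are hypothetical. The second function reports how many 16-bit units the character occupies in UTF-16, which is one reason (an observation added here, not stated in the note) that characters outside the BMP have been harder for some software to handle.

```python
def in_bmp(ch):
    """True if the character's code point is below U+10000."""
    return ord(ch) < 0x10000


def utf16_units(ch):
    """Number of 16-bit code units the character needs in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2


open_e = "\u025b"         # LATIN SMALL LETTER OPEN E
linear_b = chr(0x10000)   # first code point beyond the BMP

print(in_bmp(open_e), utf16_units(open_e))      # True 1
print(in_bmp(linear_b), utf16_units(linear_b))  # False 2
```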

Suggestion: Select a character that is already widely incorporated in available fonts. If you select characters from a wide variety of scripts, it may be more difficult to find a readily available font with all the needed glyphs. A listing of fonts by Unicode character range is found on Alan Wood’s website (http://www.alanwood.net/unicode/fontsbyrange.html#u1e00).

Avoid using all upper-case in a Latin-based orthography. All lowercase would be better (because it's more readable and typeable on standard machines and keyboards).

If you devise an orthography based on the Arabic script, it is best to use existing characters whenever possible. Because the Arabic script model in Unicode contains pre-composed characters with various dotting patterns, the invention of a new letter with a unique dotting pattern will not be usable with existing software until or unless a new character is standardized. Consider choosing a letter that is a fair match for the rest of the orthography, and is already encoded, even if that usage is for another language or phonetic value.

Examples of Good and Bad Practice

Example 1: Language X has 7 vowels. The creators of the new orthography want to stay away from diacritical marks as much as possible, so they use A, E, I, O, U, but need two more "vowels". Bad practice would be picking the numerals "6" and "7" as the other two vowels, giving: A, E, I, O, U, 6, 7 as "letters" in the orthography. Better practice would be to use an epsilon, open o, rounded letters, or other base vowels from the extended Latin sets.

Example 2: You have a language with a glottal stop, and you're using a Latin-based orthography. Choosing Arabic "ain" as the character for glottal stop would not be advisable. Instead, use one of the various half-rings or other letters for glottal stop.

If no character in Unicode is found

If a user community is convinced that a particular character is needed and missing, or if a particular character has been used widely in print and has been clearly typographically differentiated through time, a character proposal may be appropriate. A proposal can be written and put forward to the two standards committees (the Unicode Technical Committee and ISO/IEC JTC 1/SC 2/WG 2). To pursue this strategy, first float the proposal on a forum such as the Unicode mail list, and then consult the Submitting New Characters web page for detailed information on process and current proposals.

For questions

For questions on the appropriateness of a character, or to determine whether a script proposal should be written (or has already been proposed), submit a query through the Feedback Form (http://www.unicode.org/reporting.html) on the Unicode website, or write to the Script Encoding Initiative (http://www.linguistics.berkeley.edu/sei), a project at UC Berkeley that aims to assist user communities in getting their scripts and characters proposed. Another useful resource is the “Orthography Development in Relation to Unicode” web page at http://scripts.sil.org/OrthographyDev.

For further information on available Unicode fonts and software

For information on Unicode-compliant fonts and software, consult Alan Wood's website and the listing of Unicode-enabled products on the Unicode Consortium web page http://www.unicode.org/onlinedat/products.html. Also check the SIL web pages for additional information on Unicode, encoding, and writing systems (http://scripts.sil.org/).



Modifications

The following summarizes modifications from the previous version of this document.

1 Initial version