Revision | 03 Nov 1992; rev 92/11/25 |
Authors | Burmese proposal was written by Andy Daniels, with contributions by Lloyd Anderson,
Glenn Adams, and Lee Collins. Khmer proposal was written by Andy Daniels. Ethiopian proposal was written by Joe Becker. |
Date | 1992 |
This Version | http://www.unicode.org/unicode/reports/tr1.html |
Previous Version | |
Latest Version | http://www.unicode.org/unicode/reports/tr1.html |
Technical Reports contain material that has been approved by the Unicode Consortium for publication, but that is not necessarily considered part of the Unicode Standard. Often, technical reports are superseded by later standardization, or the informative material that they contain is incorporated into explanatory chapters in subsequent editions of The Unicode Standard. Sometimes minor updates to the Unicode Standard itself are published as Technical Reports. Some Technical Reports are not available for downloading.
Technical Report #1 - Burmese, Khmer, and Ethiopian
This Technical Report is comprised of three concrete proposals to which the Unicode
Technical Committee is strongly committed in their current form.
Status of this document
This document has been considered and approved by the Unicode Technical Committee for publication as a Technical Report. At the current time, the specifications in this technical report are provided as information and guidance to implementers of the Unicode Standard, but do not form part of the standard itself. The Unicode Technical Committee may decide to incorporate all or part of the material of this technical report into a future version of the Unicode Standard, either as informative or as normative specification. Please mail corrigenda and other comments to errata@unicode.org.
The three proposals are Ethiopian, Burmese, and Khmer. They have been reviewed internally and have been relatively stable over a period of time. The committee believes they represent good technical solutions for the proposed scripts, and therefore also recommends that specific code points within the body of Unicode be allocated for them, as follows:
Burmese U+0F00 to U+0F7F
Khmer U+0F80 to U+0FFF
Ethiopian U+1200 to U+125F
Specific open issues for each of these are addressed in the respective draft block introductions. These open issues do not detract substantially from the solidity of the proposals.
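As a rough illustration of the allocation above, the following Python sketch maps a code point to the block proposed for it. The ranges are copied from the table; the names PROPOSED_BLOCKS and proposed_block are invented for this example, and the values are those proposed in this report, which need not match any published version of the standard.

```python
# Block allocation proposed in this report (inclusive ranges).
PROPOSED_BLOCKS = {
    "Burmese": (0x0F00, 0x0F7F),
    "Khmer": (0x0F80, 0x0FFF),
    "Ethiopian": (0x1200, 0x125F),
}


def proposed_block(code_point):
    """Return the proposed block name for code_point, or None."""
    for name, (first, last) in PROPOSED_BLOCKS.items():
        if first <= code_point <= last:
            return name
    return None
```

For instance, proposed_block(0x0F4D) falls in the Burmese range, while a value outside all three ranges yields None.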
Burmese U+0F00 -> 0F7F
The Burmese script is used to write Burmese, the majority language of Myanmar (formerly Burma), and Pali. Variations and extensions of the script are used to write other languages of the region, such as Shan and Mon, and also to write Sanskrit.
The Burmese script derives from 11th century Mon. The Mon script itself is probably borrowed directly from South India. The earliest Mon inscription, found at Lopburi in Thailand, dates from the eighth century and is written in the Pallava script used at the Hinayana Buddhist center of Conjeeveram in the area of Madras on the east coast of India. In A.D. 1057 one of the first Burmese kings, Aniruddha, conquered Thaton, a major Mon center, and brought back with him to Pagan the most learned monks, artists and artisans of the Mon. The first inscription in Burmese dates from the following year and is written in an alphabet almost identical with that of the Mon inscriptions. Aside from rounding of the originally square characters, this script has remained largely unchanged to the present.
The Burmese script therefore ultimately derives from Brahmi, and so shares the structural features of its relatives: consonant symbols include an inherent vowel; various signs are placed before, above, below and after a consonant to indicate a vowel other than the inherent one; ligatures and conjuncts are used to indicate consonant clusters.
In the course of its adaptation to non-Indo-Aryan languages, the
Burmese script has acquired some features that distinguish it from
other Indic scripts. The killer, or virama, participates in some
common constructions that would be clumsy to handle the way they
would be in the other Indic scripts, so the control function of
the virama is separated from the diacritic function of the killer.
The virama, 0F4D, is used to form conjunct consonants, while the
killer, 0F52, is a simple diacritic and has no effect on character
shaping. The killer is also combined with the VOWEL SIGN O (0F4B)
to form the low level tone vowel "o." When used this way, this
symbol is known as hyei htou, or "thrust forward."
Burmese distinguishes a set of "medial" consonants. Originally
conjunct forms of YA PALE, YA GAU, WA and HA, they are used in
modern Burmese to form new letters and to spell certain vowel and
consonant combinations. They are treated here as no different from
any other conjunct and should be coded using the virama.
ISSUE: There's no reason from the point of view of the rendering
engine to have separate codes for the medials. Some implementors
feel that the medials should nevertheless have separate codes.
Including them introduces alternate spellings for the same syllables,
something that should be avoided. If there are compelling reasons
for including the medials, there is certainly room to add them.
When a syllable has more than one medial, it is recommended that
they appear in the order that such syllables are traditionally
spelled. That is, HA HTOU, before YA PIN or YA YI, before WA HSWE.
Note that YA PIN and YA YI cannot appear in the same syllable in
Burmese. For example, "cwei" ("to drop off") is coded as
0F15+0F5C+0F5E+0F47. "Hmyu" ("to delight, allure") is coded as
0F2E+0F5F+0F5D+0F42. This differs from the order in which medials
are normally written.
ISSUE: This rule is not strictly necessary, but regularizes the
spelling and simplifies rendering, string comparison and other
functions.
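The ordering rule for medials can be sketched in Python as follows. This is only an illustration: the medials are identified by name rather than by code point, and MEDIAL_RANK and order_medials are names invented for the example.

```python
# Traditional spelling order given in the text: HA HTOU, then YA PIN or
# YA YI, then WA HSWE. Names are used instead of code points on purpose.
MEDIAL_RANK = {"HA HTOU": 0, "YA PIN": 1, "YA YI": 1, "WA HSWE": 2}


def order_medials(medials):
    """Return the medials of one syllable in the recommended coded order."""
    if "YA PIN" in medials and "YA YI" in medials:
        raise ValueError("YA PIN and YA YI cannot share a syllable")
    return sorted(medials, key=MEDIAL_RANK.__getitem__)
```

Normalizing every syllable this way before comparison is one way to obtain the regular spelling that the rule is meant to provide; an implementation could equally well reject out-of-order input instead of reordering it.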
Burmese has several glyphs that are used with varying semantics
which are here given separate code points for each different usage.
The following pairs of letters look the same, but must be distinguished
in the text stream:
EHKAYA U (0F09) and NYA GALEI (0F5B)
GA NGE (0F17) and DIGIT HYI (0F6E)
WA (0F35) and DIGIT THOUN NYA (0F66)
YA GAU (0F30) and DIGIT HKUN NI (0F6D)
DIGIT LEI (0F6A) and SYMBOL LAGAUN (0F73)
The last two pairs are distinguished in some fonts but not in
others.
Also, the LETTER O (0F13) is distinguished from the sequence
0F48+0F5D, and the ZA MYINZWE is distinguished from 0F1A+0F5C.
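Because these look-alikes are distinct in the text stream, software that needs to know whether two strings could appear the same on screen has to fold them explicitly. The Python sketch below shows one possible approach, with code points handled as plain integers. The names LOOKALIKES, visual_skeleton and looks_alike are invented for the example. The EHKAYA U / NYA GALEI pair is left out because the draft cites NYA GALEI at two different code points.

```python
# Look-alike pairs listed above. The EHKAYA U / NYA GALEI pair is
# omitted: the draft cites NYA GALEI at two different code points.
LOOKALIKES = [
    (0x0F17, 0x0F6E),  # GA NGE / DIGIT HYI
    (0x0F35, 0x0F66),  # WA / DIGIT THOUN NYA
    (0x0F30, 0x0F6D),  # YA GAU / DIGIT HKUN NI
    (0x0F6A, 0x0F73),  # DIGIT LEI / SYMBOL LAGAUN
]

# Map the second member of each pair onto the first.
_FOLD = {b: a for a, b in LOOKALIKES}


def visual_skeleton(code_points):
    """Fold look-alikes together; equal skeletons may render identically."""
    return [_FOLD.get(cp, cp) for cp in code_points]


def looks_alike(a, b):
    """True if two distinct sequences differ only in look-alike characters."""
    return a != b and visual_skeleton(a) == visual_skeleton(b)
```

The fold is for detection only. Stored text should keep the original code points, since the distinction between letter, digit and symbol is exactly what the separate codes preserve.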
Symbols not found as single characters are formed from sequences
of the basic characters given here. For example, tha ji ("great
tha") is coded by the sequence 0F38+0F4D+0F38, i.e., it is a conjunct
formed from two THAs. Kinzi is a conjunct formed from LETTER NGA
followed by some other consonant, that is, the sequence
0F19+0F4D+Consonant. Low level tone "o" has already been noted.
Level tone "ou" is to be coded as 0F41+0F3F. Other combinations
follow similarly.
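The conjunct sequences described above can be built mechanically. The following Python sketch is a minimal illustration using the code points of this proposal as integers; conjunct and kinzi are helper names invented for the example.

```python
VIRAMA = 0x0F4D      # BURMESE VIRAMA, forms conjuncts
LETTER_NGA = 0x0F19  # BURMESE LETTER NGA
LETTER_THA = 0x0F38  # BURMESE LETTER THA


def conjunct(*consonants):
    """Join consonants with VIRAMA: C1, VIRAMA, C2, VIRAMA, C3, ..."""
    sequence = []
    for i, consonant in enumerate(consonants):
        if i:
            sequence.append(VIRAMA)
        sequence.append(consonant)
    return sequence


def kinzi(consonant):
    """Kinzi: LETTER NGA conjoined with the following consonant."""
    return conjunct(LETTER_NGA, consonant)


# tha ji ("great tha") is a conjunct of two THAs.
tha_ji = conjunct(LETTER_THA, LETTER_THA)
```

Here tha_ji is the sequence 0F38+0F4D+0F38, and kinzi(0x0F15) gives LETTER NGA, VIRAMA, KA JI.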
The LETTER A, though classified here as a vowel, is actually a
consonant. Thus it can combine with any of the vowel symbols.
The tone mark AUKA MYI is often written to the left of a subscript
vowel sign or medial consonant. It should nevertheless come after
the vowel or medial in the text stream. It is also used with killed
consonants in writing closed syllables. In this case, too, the AUKA
MYI should come after the ATHA in the text stream. For example,
the word /hyun./ (short, high falling tone) should be represented
as 0F30+0F5F+0F5E+0F02+0F51.
The SYMBOL HNAI is only used in the literary combination
0F73+0F19+0F52+0F03, meaning "the aforementioned."
Burmese does not use any whitespace between words. If word boundary
indications are desired, for example for the use of automatic line
layout algorithms, U+200B, ZERO WIDTH SPACE, is to be used.
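A minimal Python sketch of this use of ZERO WIDTH SPACE follows. It assumes the text has already been segmented into words by some other means; mark_word_boundaries and break_opportunities are names invented for the example.

```python
ZWSP = "\u200b"  # U+200B ZERO WIDTH SPACE


def mark_word_boundaries(words):
    """Join already-segmented words with ZERO WIDTH SPACE."""
    return ZWSP.join(words)


def break_opportunities(text):
    """Return the indices in text at which a line may be broken."""
    return [i for i, ch in enumerate(text) if ch == ZWSP]
```

A line layout routine could then break at any index returned by break_opportunities without needing a dictionary of its own.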
Block Structure: Burmese characters are mapped to their corresponding
ISCII slots whenever possible. Gaps in the block result mainly from
this mapping. Several ranges of code points are reserved for future
expansion. A notable exception is the pair NYA GALEI and NYA JI.
Historically, NYA GALEI is a simple palatal nasal, while NYA JI is
a ligature representing a double NYA GALEI. NYA JI, however, has
come to be regarded as the primary form of the letter in Burmese,
so it is assigned to the "preferred" ISCII slot for the palatal
nasal (U+0F1E), and NYA GALEI is placed at U+0F5F.
U+0F00 to U+0F01 Unassigned
U+0F02 to U+0F03 Various signs
U+0F04 Unassigned
U+0F05 to U+0F14 Independent vowels
U+0F15 to U+0F39 Consonants
U+0F3A to U+0F3E Unassigned
U+0F3F to U+0F4C Dependent vowel signs
U+0F4D Virama
U+0F4E to U+0F50 Unassigned
U+0F51 to U+0F52 Tone marks
U+0F53 to U+0F5F Unassigned, reserved for extensions
U+0F60 to U+0F63 Additional dependent vowel signs
U+0F64 to U+0F65 Unassigned
U+0F66 to U+0F6F Digits
U+0F70 to U+0F73 Special symbols
U+0F74 to U+0F77 Unassigned, reserved for additional symbols
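The ISCII-slot principle can be illustrated against the Devanagari block of the Unicode Standard, which starts at U+0900 and is itself laid out after ISCII. Under that assumption, a Burmese character placed in its ISCII slot sits at a constant offset from its Devanagari counterpart. The Python sketch below uses the standard unicodedata module only to name the Devanagari side; devanagari_counterpart is a name invented for the example, and the correspondence is meaningful only for characters that actually occupy their ISCII slot.

```python
import unicodedata

BURMESE_BASE = 0x0F00     # start of the block proposed in this report
DEVANAGARI_BASE = 0x0900  # Devanagari block, which follows the ISCII layout


def devanagari_counterpart(burmese_cp):
    """Return the Devanagari code point occupying the same ISCII slot."""
    return burmese_cp - BURMESE_BASE + DEVANAGARI_BASE


# KA JI, LETTER NGA, VIRAMA and DIGIT THOUN NYA from the names list.
for cp in (0x0F15, 0x0F19, 0x0F4D, 0x0F66):
    counterpart = devanagari_counterpart(cp)
    print(f"{cp:04X} -> {counterpart:04X} {unicodedata.name(chr(counterpart))}")
```

Characters without an Indic counterpart, and exceptions such as NYA GALEI, fall outside this simple offset rule.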
Note: The transliteration used here follows D. Haigh Roop, An Introduction to the Burmese Writing System (1972). Tone indications are left out of the character names.
ISSUE: As with Khmer, if there is a more standard transliteration,
it should be used.
ISSUE: Old Burmese has a small subscript LETTER A, which is the
precursor of the tone mark AUKA MYI and appears exactly where modern
Burmese would use the latter. This can probably be treated as a
font difference. There is also a superscript form of YA GAU, similar
in use to the Indic repha. This can probably be accommodated in
the shaping rules. This is not a major issue as there is plenty of
room to add these characters. Further investigation is required.
DRAFT 03 Nov 1992; rev 92/11/25
DRAFT BURMESE CHARACTER NAMES
0F00
0F01
@ Various Signs
0F02 BURMESE THEIDHEI TIN
= little thing put on
anusvara, niggahita
0F03 BURMESE HYEIGA PAU
= dots ahead
visarga
0F04
@ Independent Vowels
0F05 BURMESE LETTER A
0F06
0F07 BURMESE PALI EHKAYA I
= letter pali I
0F08 BURMESE EHKAYA I
= letter I
0F09 BURMESE EHKAYA U
= letter U
x Burmese nya galei -> 0F5B
0F0A
0F0B BURMESE LETTER VOCALIC R
Sanskrit
0F0C BURMESE LETTER VOCALIC L
Sanskrit
0F0D
0F0E
0F0F BURMESE EHKAYA EI
= letter EI
0F10
0F11
0F12
0F13 BURMESE LETTER O
x sra
0F14
@ Consonants
0F15 BURMESE KA JI
= great ka
0F16 BURMESE HKA GWEI
= curved hka
0F17 BURMESE GA NGE
= small ga
x Burmese digit hyi -> 0F6E
0F18 BURMESE GA JI
= great ga
0F19 BURMESE LETTER NGA
0F1A BURMESE SA LOUN
= round sa
0F1B BURMESE HSA LEIN
= twisted hsa
0F1C BURMESE ZA GWE
= split za
0F1D BURMESE ZA MYINZWE
= bridle za
x cya
0F1E BURMESE NYA JI
= great nya
0F1F BURMESE TA TALINJEI
= bier-hook ta
0F20 BURMESE HTA WUNBE
= duck hta
0F21 BURMESE DA YINGAU
= crooked-breasted da
0F22 BURMESE DA YEIHMOU
= water-dipper da
0F23 BURMESE NA JI
= great na
0F24 BURMESE TA WUNBU
= pot-bellied ta
0F25 BURMESE HTA HSINDU
= elephant-fetter hta
0F26 BURMESE DA DWEI
= twisted da
0F27 BURMESE DA AUHCAI
= bottom-indented da
0F28 BURMESE NA NGE
= small na
0F29
0F2A BURMESE PA ZAU
= steep-sided pa
0F2B BURMESE HPA OUHTOU
= capped hpa
0F2C BURMESE BA LAHCAI
= top-indented ba
0F2D BURMESE BA GOUN
= hump-backed ba
0F2E BURMESE LETTER MA
0F2F BURMESE YA PALE
= supine ya
0F30 BURMESE YA GAU
= crooked ya
x Burmese digit hkun ni -> 0F6D
0F31
0F32 BURMESE LETTER LA
0F33 BURMESE LA JI
= great la
0F34
0F35 BURMESE LETTER WA
x Burmese digit thoun nya -> 0F66
0F36 BURMESE LETTER SANSKRIT SHA
Sanskrit
0F37 BURMESE LETTER SANSKRIT SSA
Sanskrit
0F38 BURMESE LETTER THA
0F39 BURMESE LETTER HA
0F3A
0F3B
0F3C
0F3D
@ Vowel Signs
0F3E BURMESE YEI HCA
= line drawn down
0F3F BURMESE LOUNJI TIN
= big circle put on
0F40 BURMESE LOUNJI TIN HSAN HKA
= big circle put on with a grain of rice
0F41 BURMESE TAHCAUN NGIN
= one stroke drawn out
0F42 BURMESE HNAHCAUN NGIN
= two strokes drawn out
0F43 BURMESE VOWEL SIGN VOCALIC R
Sanskrit
0F44 BURMESE VOWEL SIGN VOCALIC RR
Sanskrit
0F45
0F46
0F47 BURMESE THAWEI HTOU
= thrust in front
0F48 BURMESE NAU PYI
= thrown backwards
0F49
0F4A
0F4B BURMESE VOWEL SIGN O
0F4C
@ Virama
0F4D BURMESE VIRAMA
x Burmese atha -> 0F52
0F4E
0F4F
0F50
@ Tone Marks
0F51 BURMESE AUKA MYI
= stopped below
0F52 BURMESE ATHA
= killer
= hyei htou, "thrust forward"
x Burmese virama -> 0F4D
0F53
0F54
0F55
0F56
0F57
0F58
0F59
0F5A
0F5B
0F5C
0F5D
0F5E
@ Consonants
0F5F BURMESE NYA GALEI
= little nya
x Burmese ehkaya u -> 0F09
@ Vowel Signs
0F60 BURMESE LETTER VOCALIC RR
Sanskrit
0F61 BURMESE LETTER VOCALIC LL
Sanskrit
0F62 BURMESE VOWEL SIGN VOCALIC L
Sanskrit
0F63 BURMESE VOWEL SIGN VOCALIC LL
Sanskrit
0F64
0F65
@ Digits
0F66 BURMESE DIGIT THOUN NYA
= digit zero
x Burmese wa -> 0F35
0F67 BURMESE DIGIT TI
= digit one
0F68 BURMESE DIGIT HNI
= digit two
0F69 BURMESE DIGIT THOUN
= digit three
0F6A BURMESE DIGIT LEI
= digit four
x Burmese symbol lagaun -> 0F73
0F6B BURMESE DIGIT NGA
= digit five
0F6C BURMESE DIGIT HCAU
= digit six
0F6D BURMESE DIGIT HKUN NI
= digit seven
x Burmese ya gau -> 0F30
0F6E BURMESE DIGIT HYI
= digit eight
x Burmese ga nge -> 0F17
0F6F BURMESE DIGIT KOU
= digit nine
@ Various symbols
0F70 BURMESE SYMBOL YWEI
0F71 BURMESE SYMBOL EHKAYA I
0F72 BURMESE SYMBOL HNAI
0F73 BURMESE SYMBOL LAGAUN
x Burmese digit lei -> 0F6A
0F74
0F75
0F76
0F77
0F78
0F79
0F7A
0F7B
0F7C
0F7D
0F7E
0F7F
Khmer Proposal Description
Khmer U+0F80 -> 0FDF
Cambodian, also known as Khmer, is the official language of Cambodia.
Mutually intelligible dialects are also spoken in northeastern
Thailand and the Mekong Delta region of Vietnam. While not itself
an Indo-European language, much of the administrative, military
and literary vocabulary of Khmer is borrowed from Sanskrit. With
the advent of Theravada Buddhism at the beginning of the fifteenth
century, Khmer began to borrow Pali words, and continues to use
Pali as a major source of neologisms today. There is also much
cross-borrowing between Thai and Khmer, as well as a relatively
recent infusion of French words and a smattering of Chinese and
Vietnamese loanwords in colloquial speech.
The Khmer script, called a'saa kmae ("Khmer letters"), like Thai,
Lao, Burmese, Old Mon and others, is descended from the
Brahmi script of South India. The exact geographical source, or
possibly sources, has not been determined, but there is a great
similarity between the earliest inscriptions in the region and the
Pallava script of the Coromandel coast of India.
Structurally, the Khmer script stays very close to its southern
Brahmi origins. There is a set of 35 consonants, each with an
inherent vowel sound. Additional signs are placed before, above,
below and after the consonants to indicate vowels other than the
inherent one. Consonant clusters are represented by conjunct
consonants, where the first consonant of the cluster maintains its
full form and succeeding consonants are written as subscripts.
The Khmer language has a much richer set of vowels than the Indo-Aryan
languages for which the ancestral script was used. By the same
token, there is a much smaller set of consonant sounds. The Khmer
script is adapted to the language by adding extra vowel signs and
various diacritic marks, and by using the choice of consonant as
well as of vowel signs to determine the particular vowel sound
represented. Thus most vowel signs do not have a single value but
must be interpreted in the context of the associated consonant.
This is very similar to the situation in Thai and Lao, where
different consonant symbols have the same sound but encode different
tones.
There are two basic styles of script in modern Khmer, each with
two major variations. They are the a'saa criang ("slanted script")
and the a'saa muul ("round script"). There is no fundamental
structural difference between them, however, so the "standing"
variant of the slanted script is chosen here as representative.
Representation:
The Khmer script follows the model of Devanagari and other Indic
scripts. The basic unit is the syllabic cluster consisting of a
series of consonants separated by WIRIAM (0FC5), followed by one
or both of the pronunciation shifters MUSEKATOAN (0FCA) and TRUYSAP
(0FCB), followed by an optional vowel, followed by diacritics and
quality marks. For example, the word /knyom/, "I," is coded as the
string 0F81+0FC5+0F89+0FB5+0FC2.
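The cluster structure just described can be written down as a regular expression. The Python sketch below is one possible reading, not a definitive grammar: it treats the shifters as optional (the /knyom/ example contains none), takes the character classes from the block structure of this proposal, and uses chr() only to hand the integer code points to the regular expression engine. The name is_syllabic_cluster is invented for the example.

```python
import re

# Character classes, using the code points proposed in this report.
CONSONANT = "[\u0f80-\u0fa2]"
WIRIAM = "\u0fc5"
SHIFTER = "[\u0fca\u0fcb]"                         # MUSEKATOAN, TRUYSAP
VOWEL = "[\u0fb2-\u0fc1]"                          # dependent vowel signs
MARK = "[\u0fc2-\u0fc4\u0fc8\u0fc9\u0fcc-\u0fcf]"  # quality marks, diacritics

# Assumption: shifters are optional, since the /knyom/ example has none.
CLUSTER = re.compile(
    f"{CONSONANT}(?:{WIRIAM}{CONSONANT})*{SHIFTER}{{0,2}}{VOWEL}?{MARK}*"
)


def is_syllabic_cluster(code_points):
    """True if the code points form exactly one syllabic cluster."""
    return CLUSTER.fullmatch("".join(map(chr, code_points))) is not None
```

With this pattern the /knyom/ sequence 0F81+0FC5+0F89+0FB5+0FC2 is accepted, while a sequence that starts with a vowel sign, or places a shifter after the vowel, is rejected.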
In cases where there is already some other superscript in the
cluster, the two pronunciation shifters are written as the subscript
symbol kbiah kraom, which looks much like VOWEL SIGN O. This vowel
sign is not to be used for this purpose. It is the responsibility
of the presentation software to select the correct appearance of
the shifter. For example, /sii/, "to eat," should be coded as
0F9F+0FCB+0FB4, not as 0F9F+0FB7+0FB4.
RAWBAT (0FCC) historically corresponds to the Devanagari repha,
that is, to an initial /r/. It has lost this function in Khmer and
instead is considered a simple diacritic similar to TOANDAKHIAT in
both reading and sorting. There are also many cases of consonant
clusters with initial /r/ that should be written with a full RAW
and not a RAWBAT, so a separate character is provided for it.
Khmer writing does not normally separate words with white space as
European languages do. If it is desirable to represent word boundaries
in the text stream, for example, for use by automatic line layout
algorithms, U+200B, ZERO WIDTH SPACE, should be used.
Two relatively rare symbols in modern usage are not included here.
They are pnek moan, the "cock's eye," and "komout." They are
identical in form and function to the Thai characters FONGMAN and
KHOMUT, respectively, so the latter two should be used when these
symbols are needed.
Block Structure:
U+0F80 to U+0FA2 Consonants
U+0FA3 to U+0FB1 Independent Vowels
U+0FB2 to U+0FC1 Vowel Signs
U+0FC2 to U+0FC4 Quality Marks
U+0FC5 Virama
U+0FC6 to U+0FC7 Unassigned
U+0FC8 to U+0FCF Diacritics
U+0FD0 to U+0FD9 Digits
U+0FDA to U+0FDE Symbols and Punctuation
U+0FDF Unassigned
ISSUES: The independent vowels LETTER AO TYPE 2 and LETTER AW TYPE
2 are variant forms of LETTER AO TYPE 1 and LETTER AW TYPE 1,
respectively. It is not believed that they are in free variation:
LETTER AO TYPE 2 occurs only in the combination "aoy," while LETTER
AW TYPE 2 is only cited in a few references, but not used. There
is an opportunity to unify these pairs. Note that LETTER UW and
LETTER OU are also listed as variants, but they are actually not
in free variation, so both must be provided.
It may be desirable to add the vowel sign AM instead of using the
combination AA+NIKAHAT. This would simplify a common special case
in sorting.
The punctuation marks KHAN and BARIYAOSAN may be unified with some
other characters, just as Indic dandas have been. A likely candidate
for the former is Thai PAI YAN NOI. Such a unification, as well as
that of the "cock's eye" and "cow piss" characters, presents an
interesting challenge to the font mechanism of a Unicode rendering
engine: Different glyphs may be required for the same character
when used in conjunction with different scripts. This seems like
a needless complication for what are otherwise simple, non-combining
characters.
It may be more desirable from a political standpoint to follow
either the Thai or the ISCII coding schemes. Sample charts have
been produced showing how this may be done. If this is indeed the
path taken, those charts should be expanded to include all characters
in this proposal.
The vowel encoding takes an ISCII-like approach, coding as single
characters vowels that consist of two or more disjoint glyphs. If
vowel symbols are instead decomposed into their constituent glyphs
and those coded separately, there is then no advantage to the code
point assignments made here. In such a case, the assignments should
be made according to the Thai pattern.
The romanization scheme here is rather ad hoc. If a more commonly
accepted one exists, the character names should be changed accordingly.
Draft 03 October 1992; rev 92/11/25
DRAFT KHMER CHARACTER NAMES
@ Consonants
0F80 KHMER LETTER KAA
0F81 KHMER LETTER KHAA
0F82 KHMER LETTER KAW
0F83 KHMER LETTER KHAW
0F84 KHMER LETTER NGAW
0F85 KHMER LETTER CAA
0F86 KHMER LETTER CHAA
0F87 KHMER LETTER CAW
0F88 KHMER LETTER CHAW
0F89 KHMER LETTER NYAW
0F8A KHMER LETTER DAA
0F8B KHMER LETTER RETROFLEX THAA
0F8C KHMER LETTER DAW
0F8D KHMER LETTER RETROFLEX THAW
0F8E KHMER LETTER NAA
0F8F KHMER LETTER TAA
0F90 KHMER LETTER THAA
0F91 KHMER LETTER TAW
0F92 KHMER LETTER THAW
0F93 KHMER LETTER NAW
0F94 KHMER LETTER BAA
0F95 KHMER LETTER PHAA
0F96 KHMER LETTER PAW
0F97 KHMER LETTER PHAW
0F98 KHMER LETTER MAW
0F99 KHMER LETTER YAW
0F9A KHMER LETTER RAW
0F9B KHMER LETTER LAW
0F9C KHMER LETTER WAW
0F9D KHMER LETTER SHAA
Sanskrit
0F9E KHMER LETTER SSAA
Sanskrit
0F9F KHMER LETTER SAA
0FA0 KHMER LETTER HAA
0FA1 KHMER LETTER LAA
0FA2 KHMER LETTER QAA
glottal stop
@ Independent Vowels
0FA3 KHMER LETTER E
0FA4 KHMER LETTER EY
0FA5 KHMER LETTER O
0FA6 KHMER LETTER UW
0FA7 KHMER LETTER OU
0FA8 KHMER LETTER AE
0FA9 KHMER LETTER AY
0FAA KHMER LETTER AO TYPE 1
0FAB KHMER LETTER AO TYPE 2
0FAC KHMER LETTER AW TYPE 1
0FAD KHMER LETTER AW TYPE 2
0FAE KHMER LETTER RIK
0FAF KHMER LETTER RII
0FB0 KHMER LETTER LIK
0FB1 KHMER LETTER LII
@ Vowel Signs
0FB2 KHMER VOWEL SIGN AA
0FB3 KHMER VOWEL SIGN E
0FB4 KHMER VOWEL SIGN EY
0FB5 KHMER VOWEL SIGN U
0FB6 KHMER VOWEL SIGN UI
0FB7 KHMER VOWEL SIGN O
x kbiah kraom
0FB8 KHMER VOWEL SIGN OU
0FB9 KHMER VOWEL SIGN UA
0FBA KHMER VOWEL SIGN AU
0FBB KHMER VOWEL SIGN IE
0FBC KHMER VOWEL SIGN IU
0FBD KHMER VOWEL SIGN EI
0FBE KHMER VOWEL SIGN AE
0FBF KHMER VOWEL SIGN AY
0FC0 KHMER VOWEL SIGN AO
0FC1 KHMER VOWEL SIGN AW
@ Quality Marks
0FC2 KHMER SIGN NIKAHAT
= sra am
= damla
0FC3 KHMER SIGN REAHMUK
= wihsakea
= wihsancani
0FC4 KHMER SIGN YUKALEAPINTU
= coc pi
@ Virama
0FC5 KHMER SIGN WIRIAM
virama
0FC6
0FC7
@ Diacritics
0FC8 KHMER VOWEL SIGN BANTA
= sangkat
= reahsannya
0FC9 KHMER VOWEL SIGN SANYOK SANNYA
0FCA KHMER SIGN MUSEKATOAN
= tmin kandao
vowel pronunciation shifter
0FCB KHMER SIGN TRUYSAP
vowel pronunciation shifter
0FCC KHMER SIGN RAWBAT
= rephea
0FCD KHMER SIGN TOANDAKHIAT
= samlap
= patdesaet
0FCE KHMER SIGN KAKABAT
= caung kaek
0FCF KHMER SIGN AHSDA
= leik prabuy
@ Digits
0FD0 KHMER DIGIT ZERO
0FD1 KHMER DIGIT ONE
0FD2 KHMER DIGIT TWO
0FD3 KHMER DIGIT THREE
0FD4 KHMER DIGIT FOUR
0FD5 KHMER DIGIT FIVE
0FD6 KHMER DIGIT SIX
0FD7 KHMER DIGIT SEVEN
0FD8 KHMER DIGIT EIGHT
0FD9 KHMER DIGIT NINE
@ Symbols and Punctuation
0FDA KHMER CURRENCY SYMBOL RIAL
0FDB KHMER LEIK TO
= amendit sannya
repetition sign
0FDC KHMER CAMNOC PI KUH
x (division sign -> 00F7)
x (tibetan comma -> 1038)
colon, semicolon
0FDD KHMER KHAN
full stop, ellipsis, abbreviation
0FDE KHMER BARIYAOSAN
end of section
0FDF
Proposal for Ethiopian Encoding
The Ethiopian proposal consists of a list of questions/issues, a
chart, a character names list, and a block introduction. The
content is based on UTC/1991-026 On the Extended Ethiopic Alphabet
of February 26, 1991 and its later adjustments by Lloyd Anderson,
unioned with features of the Xerox Amharic implementation by Joe
Becker. The character names are based on those in DP 10646, which
came from WG2/N459 "Ethiopian character sets" by Michael Mann.
QUESTIONS FOR REVIEWERS:
1. Is this collection missing any important, well-established
"extension" letters for writing less-common languages?
2. Are the glyphs in the charts appropriate?
3. Can you supply documentation to support the specification of
the following two characters?
121D ETHIOPIAN CONSONANT GG
1237 ETHIOPIAN VOWEL PHONETIC AE
In particular, does U+1237 occur (as a vowel, not as a mark of "w"
rounding) on any consonant other than U+1211? Should the combination
of U+1237 with U+1211 simply be encoded as a distinct consonant (to
be added between current U+1211 and U+1212)?
4. Are the following characters specified correctly?
1256 ETHIOPIAN COMMA
modern usage like colon
1257 ETHIOPIAN COLON
modern usage like semicolon
1259 ETHIOPIAN NEW COMMA
modern usage
5. Do syllable glyph variants ever occur distinctively within the
same text, or are they merely font design choices like the glyph
variants of Latin "a" or "g"?
ISSUES:
* In this design, no provision is made for coding the syllable
glyphs; it is intended that they be excluded from Unicode/10646
BMP. If we learn that glyph variants may occur distinctively, then
we may need to define some additional means for specifying glyph
variants within plain text.
* Should we define an Ethiopian White Space character which can be
easily guaranteed to have the same (minimum) width as U+1255
ETHIOPIAN WORDSPACE? Current opinion is that this is unnecessary.
Ethiopian (U+1200 -> U+125F)
The Ethiopian script, which originally evolved for the archaic
language Ge'ez, is currently used to write several languages of
Eastern Africa, including Amharic, Tigre, and Oromo. The script
continues to be extended for writing languages that have little
tradition of printed typography; new characters to cover such
extensions may be added to the standard later as definitive information
about them becomes available.
Encoding Principles. The visible glyphs of the Ethiopian script
are not the objects shown in the encoding chart. The elements of
the encoding are the alphabet underlying the script; thus the
encoding is (roughly) phonetic rather than glyphic. These alphabetic
letters are expected to be the units of keyboard input and all text
representation short of rendering.
Rendering. Each visible glyph of the Ethiopian script represents
a syllable rather than a single letter. The syllables can all be
treated as simple (consonant + vowel) pairs, so that each glyph
can be thought of as a ligature of two underlying letters. Thus
the syllable "MA" would be represented in the encoding as U+1203
ETHIOPIAN CONSONANT M plus U+1233 ETHIOPIAN VOWEL A. The syllable
glyphs themselves are not intended to be incorporated in this
encoding. The individual consonant or vowel codes should not be
isolated (i.e. unpaired) in normal final text, and their rendering
in such circumstances is an option of the implementation. One
possibility is to use special symbols for the individual letters,
as is done in the code charts here.
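The pairing model can be sketched in Python as follows. This is a simplified illustration that handles only consonant and vowel letters (no numbers, punctuation or diacritical marks), with the letter ranges taken from the Encoding Structure of this proposal; the name syllables is invented for the example.

```python
CONSONANTS = range(0x1200, 0x1225)              # U+1200..U+1224
VOWELS = set(range(0x1230, 0x123E)) - {0x1239}  # U+1239 is a gap


def syllables(code_points):
    """Group a letter stream into (consonant, vowel) pairs."""
    if len(code_points) % 2:
        raise ValueError("unpaired letter at end of text")
    pairs = []
    for consonant, vowel in zip(code_points[0::2], code_points[1::2]):
        if consonant not in CONSONANTS or vowel not in VOWELS:
            raise ValueError(
                f"not a consonant + vowel pair: {consonant:04X} {vowel:04X}"
            )
        pairs.append((consonant, vowel))
    return pairs
```

A renderer built on this model would look up one syllable glyph per pair. Raising an error for an unpaired letter is a choice made for the sketch; the proposal leaves the rendering of isolated letters to the implementation.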
Chart Symbols Representing Individual Letters. Since the Ethiopian
glyphs are normally syllabic, the script provides no unambiguous
way of representing the underlying individual letters. Therefore
in the code charts and names list, a convention has been adopted
in which consonant letters are represented by their "first" form
surrounded by a dotted circle, and vowel letters are represented
by a typical glyph fragment attached to a dotted circle. This is
not intended to imply direct glyphic composition of those forms,
but merely to signify the underlying letters.
Encoding/Rendering of "First Form" Syllables. The circled consonants
in the charts U+1200 -> U+1224 are underlying letters; they should
not be confused with rendered full first form syllable glyphs. As
with all glyphs in the script, the first form syllables are encoded
as simple (consonant + vowel) pairs. Thus the glyph "MAE" would
be represented in the encoding as U+1203 ETHIOPIAN CONSONANT M plus
U+1230 ETHIOPIAN VOWEL AE. This pair would then be rendered via
a "ligature" MAE whose appearance would resemble the chart symbol
for U+1203 ETHIOPIAN CONSONANT M without the circle.
Encoding/Rendering of Lone Consonants ("Sixth Form" Syllables).
The sixth form syllable glyphs are sometimes pronounced as though
they were lone consonants (i.e. the vowel is dropped in speech),
but this does not change their encoding. As with all glyphs in the
script, the sixth form syllables are encoded as simple (consonant
+ vowel) pairs. Thus the spoken lone consonant "M" would be
represented in the encoding as U+1203 ETHIOPIAN CONSONANT M plus
U+1235 ETHIOPIAN VOWEL SCHWA.
Variant Glyph Forms. The script sometimes provides different glyph
forms to represent the same syllables. It is assumed that these
alternatives do not vary freely, in other words that it is appropriate
for a given font to contain only one selected glyph form for each
syllable. Therefore no mechanism is provided for specifying glyph
variants within a plain text stream of characters. The situation
is analogous to that of the glyph variants of Latin "a" or "g".
Letter Names. The Ethiopian script often has multiple letters
corresponding to the same Latin letter, making it difficult to
assign unique Latin names. Therefore the names list makes use of
certain devices (such as doubling a Latin letter in the name) merely
to create uniqueness; this has no relation to the phonetics of the
Ethiopian letters.
Encoding Order and Sorting. The order of the letters in the encoding
is based on the traditional alphabetical order. This order differs
from the sort order used for one or another language, if only
because in many languages various pairs or triplets of letters are
treated as equivalent in the first sorting pass. For example, an
Amharic dictionary is likely to start out with a section headed by
three letters:
U+1200 ETHIOPIAN CONSONANT H
U+1202 ETHIOPIAN CONSONANT HH
U+120E ETHIOPIAN CONSONANT X
Thus the encoding order cannot and does not implement a collation
procedure for any particular language using this script.
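A language-specific collation can be layered on top of the encoding order. The Python sketch below illustrates a first sorting pass in which the three letters named above are treated as equivalent. Only that triplet comes from the text; falling back to encoding order for everything else, and the names PRIMARY_CLASS, primary_key and sort_words, are assumptions made for the example.

```python
# First-pass equivalence: HH (U+1202) and X (U+120E) sort with H (U+1200).
# Only this triplet comes from the text; other languages need other tables.
PRIMARY_CLASS = {0x1202: 0x1200, 0x120E: 0x1200}


def primary_key(code_points):
    """First-pass sort key: fold equivalent letters, keep the rest."""
    return [PRIMARY_CLASS.get(cp, cp) for cp in code_points]


def sort_words(words):
    """Sort by primary key, breaking ties by the raw code points."""
    return sorted(words, key=lambda w: (primary_key(w), list(w)))
```

With this key, a word beginning with CONSONANT X sorts among the words beginning with CONSONANT H rather than after CONSONANT T, as a plain comparison of code points would place it.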
Space Characters. The traditional word separator is U+1255 ETHIOPIAN
WORDSPACE ( : ), but in modern usage a plain white wordspace is
becoming common. The ASCII character U+0020 SPACE is suitable for
the latter usage, although its (minimum) width is not guaranteed
to be the same as that of the traditional wordspace.
Diacritical Marks. The mark U+030E NON-SPACING DOUBLE VERTICAL
LINE ABOVE may occasionally be used to indicate emphasis or
gemination. If this or other diacritical marks are used, they
follow the vowel letter of the syllable to which they apply.
Encoding Structure. The Unicode block for the Ethiopian script is
divided into the following ranges:
U+1200 to U+1224 Consonant phonetic letters
U+1225 to U+122F Currently unassigned
U+1230 to U+123D Vowel phonetic letters (U+1239 is an intentional gap)
U+123E to U+123F Currently unassigned
U+1240 to U+1254 Numbers (U+1240 is an intentional gap)
U+1255 to U+125B Punctuation
U+125C to U+125F Currently unassigned
Draft October 30, 1992; rev 93/01/08
ETHIOPIAN CHARACTER NAMES LIST
@ Consonant phonetic letters
1200 ETHIOPIAN CONSONANT H
1201 ETHIOPIAN CONSONANT L
1202 ETHIOPIAN CONSONANT HH
1203 ETHIOPIAN CONSONANT M
1204 ETHIOPIAN CONSONANT SZ
1205 ETHIOPIAN CONSONANT R
1206 ETHIOPIAN CONSONANT S
1207 ETHIOPIAN CONSONANT SH
1208 ETHIOPIAN CONSONANT Q
1209 ETHIOPIAN CONSONANT QH
120A ETHIOPIAN CONSONANT B
120B ETHIOPIAN CONSONANT V
120C ETHIOPIAN CONSONANT T
120D ETHIOPIAN CONSONANT C
120E ETHIOPIAN CONSONANT X
120F ETHIOPIAN CONSONANT N
1210 ETHIOPIAN CONSONANT NY
1211 ETHIOPIAN CONSONANT GLOTTAL
1212 ETHIOPIAN CONSONANT K
1213 ETHIOPIAN CONSONANT XX
1214 ETHIOPIAN CONSONANT W
1215 ETHIOPIAN CONSONANT NULL
1216 ETHIOPIAN CONSONANT Z
1217 ETHIOPIAN CONSONANT ZH
1218 ETHIOPIAN CONSONANT Y
1219 ETHIOPIAN CONSONANT D
121A ETHIOPIAN CONSONANT DD
Oromo
121B ETHIOPIAN CONSONANT J
121C ETHIOPIAN CONSONANT G
121D ETHIOPIAN CONSONANT GG
Bilen
121E ETHIOPIAN CONSONANT TH
121F ETHIOPIAN CONSONANT CH
1220 ETHIOPIAN CONSONANT PH
1221 ETHIOPIAN CONSONANT TS
1222 ETHIOPIAN CONSONANT TZ
1223 ETHIOPIAN CONSONANT F
1224 ETHIOPIAN CONSONANT P
1225
1226
1227
1228
1229
122A
122B
122C
122D
122E
122F
@ Vowel phonetic letters
1230 ETHIOPIAN VOWEL AE
1231 ETHIOPIAN VOWEL U
1232 ETHIOPIAN VOWEL I
1233 ETHIOPIAN VOWEL A
1234 ETHIOPIAN VOWEL E
1235 ETHIOPIAN VOWEL SCHWA
1236 ETHIOPIAN VOWEL O
1237 ETHIOPIAN VOWEL PHONETIC AE
used primarily with U+1211 ETHIOPIAN CONSONANT GLOTTAL
1238 ETHIOPIAN VOWEL WAE
1239
123A ETHIOPIAN VOWEL WI
123B ETHIOPIAN VOWEL WA
123C ETHIOPIAN VOWEL WE
123D ETHIOPIAN VOWEL W
123E
123F
@ Numbers
1240
1241 ETHIOPIAN NUMBER ONE
1242 ETHIOPIAN NUMBER TWO
1243 ETHIOPIAN NUMBER THREE
1244 ETHIOPIAN NUMBER FOUR
1245 ETHIOPIAN NUMBER FIVE
1246 ETHIOPIAN NUMBER SIX
1247 ETHIOPIAN NUMBER SEVEN
1248 ETHIOPIAN NUMBER EIGHT
1249 ETHIOPIAN NUMBER NINE
124A ETHIOPIAN NUMBER TEN
124B ETHIOPIAN NUMBER TWENTY
124C ETHIOPIAN NUMBER THIRTY
124D ETHIOPIAN NUMBER FORTY
124E ETHIOPIAN NUMBER FIFTY
124F ETHIOPIAN NUMBER SIXTY
1250 ETHIOPIAN NUMBER SEVENTY
1251 ETHIOPIAN NUMBER EIGHTY
1252 ETHIOPIAN NUMBER NINETY
1253 ETHIOPIAN NUMBER HUNDRED
1254 ETHIOPIAN NUMBER TEN THOUSAND
@ Punctuation
1255 ETHIOPIAN WORDSPACE
1256 ETHIOPIAN COMMA
modern usage like colon
1257 ETHIOPIAN COLON
modern usage like semicolon
1258 ETHIOPIAN PERIOD
1259 ETHIOPIAN NEW COMMA
modern usage
125A ETHIOPIAN QUESTION MARK
archaic
125B ETHIOPIAN PARAGRAPH SEPARATOR
archaic
Copyright © 1992-1998 Unicode, Inc. All Rights Reserved. The Unicode Consortium makes no expressed or implied warranty of any kind, and assumes no liability for errors or omissions. No liability is assumed for incidental and consequential damages in connection with or arising out of the use of the information or programs contained or accompanying this technical report.
Unicode and the Unicode logo are trademarks of Unicode, Inc., and are registered in some jurisdictions.