In message <9801011949.AA07010@unicode.org> Werner Lemberg writes via
unicode@unicode.org:
> Is there an algorithm how to convert long Unicode names like 'LATIN
> CAPITAL LETTER A WITH ACUTE' into short Adobe-ish names like 'Aacute'?
>
> With `short' I mean a name not longer than about 32 characters and no
> spaces in it.
>
> Or are there already short Unicode names defined? U+00C1 is not very
> descriptive...
The following short IDs, which match your specifications, have been
used in email documents by those active in ISO/TC46/SC2 and its
working groups. They provide readable text that uses only 7-bit
characters, are less than 32 characters long, and can survive any
distortion that may arise when they pass through 7-bit character
mechanisms along the way.
Because they are directly related to character names in UCS (ISO/IEC
10646 and Unicode) it is possible to generate these by algorithm, and
also to produce short IDs that can be reversed to their authentic
character name in ISO/IEC 10646 and Unicode.
Examples in more detail are given below.
John Clews
* * * * * * * *
Cyrillic transliteration tables: practical examples of short IDs.
This table shows lower-case letters in ISO 9:1995(E), Table 1, in
a Pan-Cyrillic order. In the published standard, the columns Cyrillic ID
and Latin ID will be replaced by specific Cyrillic or Latin characters.
+--------------------------------------------------------------------------+
| Source  Target  Cyrillic        Latin       Examples/Comments            |
| ID      ID      ID              ID                                       |
+--------------------------------------------------------------------------+
  +0430   +0061   Cy_a            a
  +0431   +0062   Cy_be           b
  +0432   +0076   Cy_ve           v
  +0433   +0067   Cy_ghe          g
  +0434   +0064   Cy_de           d
  +0452   +0111   Cy_dje          d_stro
  +0453   +01F5   Cy_gje          g_acut
  +0435   +0065   Cy_ie           e
  +0451   +00EB   Cy_io           e_diae
  +0454   +00EA   Cy_uk-ie        e_circ
  +0436   +017E   Cy_zhe          z_caro
  +0437   +007A   Cy_ze           z
  +0455   +1E91   Cy_dze          z_circ
  +0438   +0069   Cy_i            i
  +0456   +00EC   Cy_be-uk-i      i_grav
  +0457   +00EF   Cy_yi           i_diae
  +0458   +01F0 * Cy_je           j_caro
  +0439   +006A   Cy_short_i      j
  +043A   +006B   Cy_ka           k
  +043B   +006C   Cy_el           l
  +0459   +XX     Cy_lje          l_circ
  +043C   +006D   Cy_em           m
  +043D   +006E   Cy_en           n
  +045A   +XX     Cy_nje          n_circ
  +043E   +006F   Cy_o            o
  +043F   +0070   Cy_pe           p
  +0440   +0072   Cy_er           r
  +0441   +0073   Cy_es           s
  +0442   +0074   Cy_te           t
  +045B   +0107   Cy_tshe         c_acut
  +045C   +1E31   Cy_kje          k_acut
  +0443   +0075   Cy_u            u
  +045E   +01D4   Cy_shor_u       u_caro
  +0444   +0066   Cy_ef           f
  +0445   +0068   Cy_ha           h
  +0446   +0063   Cy_tse          c
  +0447   +010D   Cy_che          c_caro
  +045F   +XX     Cy_dzhe         d_circ
  +0448   +0161   Cy_sha          s_caro
  +0449   +015D   Cy_shcha        s_circ
  +044A   +0022   Cy_hard_sign    quot_mark
  +044B   +0079   Cy_yeru         y
  +044C   +0027   Cy_soft_sign    apos
  +044D   +00E8   Cy_e            e_grav
  +044E   +00FB   Cy_yu           u_circ
  +044F   +00E2   Cy_ya           a_circ
+XX = Not in ISO/IEC 10646
* Capital J_caro only available as level 3 characters of ISO/IEC 10646 as
[J] + [caro]
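
To illustrate how the Source ID and Target ID columns above might be used, here is a minimal Python sketch. It copies only a handful of rows from the table; the names ISO9_SAMPLE and transliterate are made up for this example and are not part of ISO 9 or of any library.

```python
# Source -> target code points copied from a few rows of the table above.
# This is a deliberately partial sample, not the full ISO 9:1995 table.
ISO9_SAMPLE = {
    0x0430: 0x0061,  # Cy_a     -> a
    0x0432: 0x0076,  # Cy_ve    -> v
    0x043A: 0x006B,  # Cy_ka    -> k
    0x043C: 0x006D,  # Cy_em    -> m
    0x043E: 0x006F,  # Cy_o     -> o
    0x0441: 0x0073,  # Cy_es    -> s
    0x0447: 0x010D,  # Cy_che   -> c_caro
    0x0449: 0x015D,  # Cy_shcha -> s_circ
}


def transliterate(text: str) -> str:
    """Replace each character found in ISO9_SAMPLE; leave all others unchanged."""
    return text.translate(ISO9_SAMPLE)


if __name__ == "__main__":
    print(transliterate("\u043c\u043e\u0441\u043a\u0432\u0430"))
```

Because every row maps one source character to one target character, a plain code point dictionary is enough here; rows marked +XX have no single target code point and would need separate handling.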
Method for deriving short IDs from the letter names in ISO/IEC 10646:
CAPITAL LETTER retains its letter element in capitals and the term
CAPITAL LETTER is dropped;
SMALL LETTER changes its letter element to small equivalents and the term
SMALL LETTER is dropped.
In all other occurrences the words SMALL, CAPITAL, LETTER, ACCENT, WITH, AND
and BY are dropped.
All elements except the actual letter element (e.g. AE above) are in small
letters.
An underline character ( _ ) is used to separate elements in place of spaces;
it may be possible to drop this in databases etc. (but with less
readability).
Four letters is the normal length for elements in IDs, except for 1-, 2- and
3-letter words, which are kept whole. 1-, 2- and 3-letter abbreviations are
also used: these take a hyphen.
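
The rules above can be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name short_id and the lookup tables are invented here, only the positional words ABOVE, BELOW, MIDDLE and INVERTED are handled, and irregular IDs such as nbsp, language codes, or ligature names are not covered.

```python
# Hypothetical helper tables; the script codes are taken from this message.
SCRIPT_PREFIX = {"LATIN": "", "CYRILLIC": "Cy_", "GREEK": "Gr_",
                 "ARMENIAN": "Am_", "GEORGIAN": "Ge_", "HEBREW": "He_"}
DROPPED = {"SMALL", "CAPITAL", "LETTER", "ACCENT", "WITH", "AND", "BY"}
POSITION_SUFFIX = {"ABOVE": "-a", "BELOW": "-b"}      # attached to previous element
POSITION_PREFIX = {"MIDDLE": "m-", "INVERTED": "i-"}  # attached to next element


def short_id(name: str) -> str:
    """Derive a short ID from a full character name (partial sketch)."""
    words = name.replace("-", " ").split()
    prefix = ""
    if words and words[0] in SCRIPT_PREFIX:
        prefix = SCRIPT_PREFIX[words.pop(0)]
    elements = []
    pending = ""  # positional prefix waiting for the next element
    i = 0
    while i < len(words):
        word = words[i]
        # "CAPITAL LETTER X" keeps X in capitals, "SMALL LETTER X" lowers it.
        if word in ("CAPITAL", "SMALL") and i + 2 < len(words) and words[i + 1] == "LETTER":
            letter = words[i + 2]
            elements.append(letter if word == "CAPITAL" else letter.lower())
            i += 3
            continue
        i += 1
        if word in DROPPED:
            continue
        if word in POSITION_SUFFIX and elements:
            elements[-1] += POSITION_SUFFIX[word]
            continue
        if word in POSITION_PREFIX:
            pending = POSITION_PREFIX[word]
            continue
        # Every other element is lower-cased and cut to four letters.
        elements.append(pending + word[:4].lower())
        pending = ""
    return prefix + "_".join(elements)


if __name__ == "__main__":
    print(short_id("LATIN CAPITAL LETTER A WITH ACUTE ACCENT"))
```

Names that the tables in this message shorten irregularly (for example the VULGAR FRACTION entries) would need an exception list on top of this sketch.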
Notes:
1. Single-letter abbreviations (with hyphen) are mainly positional.
-a for above (e.g. dot-a for DOT ABOVE)
-b for below
m- for middle (e.g. m-dot for MIDDLE DOT)
v- for vertical
i- inverted
l- left
r- right
s- small
This is the complete list of single-letter abbreviations.
2. Two-letter codes are only used for Script codes or Language codes
(e.g. 'Cy_be-uk-I' for "CYRILLIC CAPITAL LETTER BYELORUSSIAN-UKRAINIAN I").
(a) Script codes (start of string; always 1 capital, 1 small, 1 underline)
Gr_ for Greek; Cy_ for Cyrillic; Am_ for Armenian;
Ge_ for Georgian; He_ for Hebrew; etc. Latin: left blank
(following usage in ISO/TC46/SC2 email survey (June/July 1996))
(b) Language codes (always middle of string; always 2 smalls, 1 hyphen)
Language codes are taken from ISO 639: e.g.
be- Byelorussian
uk- Ukrainian
3. Three-letter codes (without hyphen) are for 3-letter words, e.g.
dot for DOT, leg for LEG, bar for BAR or eth for ETH.
Three-letter codes (with hyphen) are for 3-letter abbreviations (mainly
phonetic descriptions), which most users will rarely use, e.g.
den- DENTAL
lat- LATERAL
alv- ALVEOLAR
ret- RETROFLEX
glo- GLOTTAL
bil- BILABIAL
pha- PHARYNGEAL
voi- VOICED
fri- FRICATIVE
pal- PALATAL
None of these are used in this table.
4. Four-letter codes are for 4-letter words
e.g. left, half, ring, stop, curl, tail, sign, open, baby, long
and abbreviations of 5-letter words
e.g.
lowe lower
brev breve
fina final
acut acute
grav grave
and abbreviations of larger words
e.g.
desc for descender
dotl for DOTLESS
digr for DIGRAPH
liga for ligature
reve for REVERSED
apos for APOSTROPHE
scri for SCRIPT
clos for CLOSED
diae for DIAERESIS
stro for STROKE
symb for SYMBOL
circ for CIRCUMFLEX
cedi for CEDILLA
macr for MACRON
modi for MODIFIER
ogon for OGONEK
prec for PRECEDED
ques for QUESTION
excl for EXCLAMATION
abbr for ABBREVIATION
punc for PUNCTUATION
turn for TURNED
cros for CROSSED
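
The abbreviation lists in the notes above can also be read as a lookup table for expanding a short ID back towards its full name. The following Python sketch shows the idea for simple accented letters only. The function name expand and the choice of WITH/AND as joining words are assumptions made for this example, the abbreviation table is a small sample from Note 4, and dropped words such as ACCENT cannot be recovered this way.

```python
# Sample lookup tables built from the notes above (not exhaustive).
SCRIPTS = {"Gr": "GREEK", "Cy": "CYRILLIC", "Am": "ARMENIAN",
           "Ge": "GEORGIAN", "He": "HEBREW"}
ABBREVIATIONS = {"acut": "ACUTE", "grav": "GRAVE", "brev": "BREVE",
                 "circ": "CIRCUMFLEX", "diae": "DIAERESIS", "stro": "STROKE",
                 "cedi": "CEDILLA", "macr": "MACRON", "ogon": "OGONEK"}
POSITION_SUFFIX = {"a": "ABOVE", "b": "BELOW"}


def expand(short_id: str) -> str:
    """Expand a short ID of the form [Script_]letter[_modifier[-a|-b]]..."""
    parts = short_id.split("_")
    script = "LATIN"  # Latin IDs carry no script code
    if len(parts) > 1 and parts[0] in SCRIPTS:
        script = SCRIPTS[parts.pop(0)]
    letter = parts.pop(0)
    case = "CAPITAL" if letter.isupper() else "SMALL"
    words = [script, case, "LETTER", letter.upper()]
    for n, part in enumerate(parts):
        stem, _, position = part.partition("-")
        words.append("WITH" if n == 0 else "AND")
        # Words of up to four letters (e.g. "ring") are stored unabbreviated.
        words.append(ABBREVIATIONS.get(stem, stem.upper()))
        if position:
            words.append(POSITION_SUFFIX[position])
    return " ".join(words)


if __name__ == "__main__":
    print(expand("a_ring-a"))
```

Multi-word letter names (such as Cy_soft_sign) and IDs containing language codes (such as Cy_be-uk-i) are outside the scope of this sketch.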
* * * * * * * *
UCS: UNOFFICIAL SHORT IDS (used in ISO/TC46/SC2 draft documents)
This section lists the most commonly used accented and modified letters, as
used in the ISO 8859-1 character set standard, from hexadecimal A0 through
hex FF, showing Hex value, Decimal value, Short ID * (as used in earlier
postings of the tc46sc2@elot.gr list) and the Name in ISO/IEC 10646-1:1993.
* Note: the short IDs used in most transliteration tables tend to use a much
more simply-named repertoire than many of the characters in this table, and
so any transliteration tables using these conventions will be much simpler to
read than this table of ISO 8859.
Short IDs are readable, and mostly systematically constructed from
the full name in ISO/IEC 10646:
- a Script Code such as Cy for Cyrillic replaces CYRILLIC CAPITAL
LETTER, etc., (omitted for Latin letters);
- the letter name is changed to A or a etc, accordingly;
- WITH is omitted;
- other name elements use only the first four letters;
- RING ABOVE or DOT BELOW become ring-a or dot-b etc;
- spaces are changed to _ (LOW LINE)
+--------+------------------+-------------------------------------
| UCS ID | Short ID | Name in ISO/IEC 10646-1:1993(E)
+--------+------------------+-------------------------------------
| | |
| 00A0 | nbsp | NO-BREAK SPACE
| 00A1 | i-excl_mark | INVERTED EXCLAMATION MARK
| 00A2 | cent_sign | CENT SIGN
| 00A3 | poun_sign | POUND SIGN
| 00A4 | curr_sign | CURRENCY SIGN
| 00A5 | yen_sign | YEN SIGN
| 00A6 | brok_bar | BROKEN BAR
| 00A7 | sect_sign | SECTION SIGN
| 00A8 | diae | DIAERESIS
| 00A9 | copy_sign | COPYRIGHT SIGN
| 00AA | femi_ordi_indi | FEMININE ORDINAL INDICATOR
| 00AB | << | LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
| 00AC | not_sign | NOT SIGN
| 00AD | soft_hyph | SOFT HYPHEN
| 00AE | regi_sign | REGISTERED SIGN
| 00AF | macr | MACRON
| | |
| 00B0 | degr_sign | DEGREE SIGN
| 00B1 | plus_minu_sign | PLUS-MINUS SIGN
| 00B2 | supe_2 | SUPERSCRIPT TWO
| 00B3 | supe_3 | SUPERSCRIPT THREE
| 00B4 | acut | ACUTE ACCENT
| 00B5 | micr_sign | MICRO SIGN
| 00B6 | pilc_sign | PILCROW SIGN
| 00B7 | m-dot | MIDDLE DOT
| 00B8 | cedi | CEDILLA
| 00B9 | supe_1 | SUPERSCRIPT ONE
| 00BA | masc_ordi_indi | MASCULINE ORDINAL INDICATOR
| 00BB | >> | RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK
| 00BC | one_quar | VULGAR FRACTION ONE QUARTER
| 00BD | one_half | VULGAR FRACTION ONE HALF
| 00BE | thre_quar | VULGAR FRACTION THREE QUARTERS
| 00BF | i-ques_mark | INVERTED QUESTION MARK
| | |
| 00C0 | A_grav | LATIN CAPITAL LETTER A WITH GRAVE ACCENT
| 00C1 | A_acut | LATIN CAPITAL LETTER A WITH ACUTE ACCENT
| 00C2 | A_circ | LATIN CAPITAL LETTER A WITH CIRCUMFLEX ACCENT
| 00C3 | A_tild | LATIN CAPITAL LETTER A WITH TILDE
| 00C4 | A_diae | LATIN CAPITAL LETTER A WITH DIAERESIS
| 00C5 | A_ring-a | LATIN CAPITAL LETTER A WITH RING ABOVE
| 00C6 | AE | LATIN CAPITAL LIGATURE AE
| 00C7 | C_cedi | LATIN CAPITAL LETTER C WITH CEDILLA
| 00C8 | E_grav | LATIN CAPITAL LETTER E WITH GRAVE ACCENT
| 00C9 | E_acut | LATIN CAPITAL LETTER E WITH ACUTE ACCENT
| 00CA | E_circ | LATIN CAPITAL LETTER E WITH CIRCUMFLEX ACCENT
| 00CB | E_diae | LATIN CAPITAL LETTER E WITH DIAERESIS
| 00CC | I_grav | LATIN CAPITAL LETTER I WITH GRAVE ACCENT
| 00CD | I_acut | LATIN CAPITAL LETTER I WITH ACUTE ACCENT
| 00CE | I_circ | LATIN CAPITAL LETTER I WITH CIRCUMFLEX ACCENT
| 00CF | I_diae | LATIN CAPITAL LETTER I WITH DIAERESIS
| | |
| 00D0 | ETH | LATIN CAPITAL LETTER ETH
| 00D1 | N_tild | LATIN CAPITAL LETTER N WITH TILDE
| 00D2 | O_grav | LATIN CAPITAL LETTER O WITH GRAVE ACCENT
| 00D3 | O_acut | LATIN CAPITAL LETTER O WITH ACUTE ACCENT
| 00D4 | O_circ | LATIN CAPITAL LETTER O WITH CIRCUMFLEX ACCENT
| 00D5 | O_tild | LATIN CAPITAL LETTER O WITH TILDE
| 00D6 | O_diae | LATIN CAPITAL LETTER O WITH DIAERESIS
| 00D7 | mult_sign | MULTIPLICATION SIGN
| 00D8 | O_stro | LATIN CAPITAL LETTER O WITH STROKE
| 00D9 | U_grav | LATIN CAPITAL LETTER U WITH GRAVE ACCENT
| 00DA | U_acut | LATIN CAPITAL LETTER U WITH ACUTE ACCENT
| 00DB | U_circ | LATIN CAPITAL LETTER U WITH CIRCUMFLEX ACCENT
| 00DC | U_diae | LATIN CAPITAL LETTER U WITH DIAERESIS
| 00DD | Y_acut | LATIN CAPITAL LETTER Y WITH ACUTE ACCENT
| 00DE | THORN | LATIN CAPITAL LETTER THORN
| 00DF | sharp_s | LATIN SMALL LETTER SHARP S
| | |
| 00E0 | a_grav | LATIN SMALL LETTER A WITH GRAVE ACCENT
| 00E1 | a_acut | LATIN SMALL LETTER A WITH ACUTE ACCENT
| 00E2 | a_circ | LATIN SMALL LETTER A WITH CIRCUMFLEX ACCENT
| 00E3 | a_tild | LATIN SMALL LETTER A WITH TILDE
| 00E4 | a_diae | LATIN SMALL LETTER A WITH DIAERESIS
| 00E5 | a_ring-a | LATIN SMALL LETTER A WITH RING ABOVE
| 00E6 | ae | LATIN SMALL LIGATURE AE
| 00E7 | c_cedi | LATIN SMALL LETTER C WITH CEDILLA
| 00E8 | e_grav | LATIN SMALL LETTER E WITH GRAVE ACCENT
| 00E9 | e_acut | LATIN SMALL LETTER E WITH ACUTE ACCENT
| 00EA | e_circ | LATIN SMALL LETTER E WITH CIRCUMFLEX ACCENT
| 00EB | e_diae | LATIN SMALL LETTER E WITH DIAERESIS
| 00EC | i_grav | LATIN SMALL LETTER I WITH GRAVE ACCENT
| 00ED | i_acut | LATIN SMALL LETTER I WITH ACUTE ACCENT
| 00EE | i_circ | LATIN SMALL LETTER I WITH CIRCUMFLEX ACCENT
| 00EF | i_diae | LATIN SMALL LETTER I WITH DIAERESIS
| | |
| 00F0 | eth | LATIN SMALL LETTER ETH
| 00F1 | n_tild | LATIN SMALL LETTER N WITH TILDE
| 00F2 | o_grav | LATIN SMALL LETTER O WITH GRAVE ACCENT
| 00F3 | o_acut | LATIN SMALL LETTER O WITH ACUTE ACCENT
| 00F4 | o_circ | LATIN SMALL LETTER O WITH CIRCUMFLEX ACCENT
| 00F5 | o_tild | LATIN SMALL LETTER O WITH TILDE
| 00F6 | o_diae | LATIN SMALL LETTER O WITH DIAERESIS
| 00F7 | divi_sign | DIVISION SIGN
| 00F8 | o_obli_bar | LATIN SMALL LETTER O WITH OBLIQUE BAR
| 00F9 | u_grav | LATIN SMALL LETTER U WITH GRAVE ACCENT
| 00FA | u_acut | LATIN SMALL LETTER U WITH ACUTE ACCENT
| 00FB | u_circ | LATIN SMALL LETTER U WITH CIRCUMFLEX ACCENT
| 00FC | u_diae | LATIN SMALL LETTER U WITH DIAERESIS
| 00FD | y_acut | LATIN SMALL LETTER Y WITH ACUTE ACCENT
| 00FE | thorn | LATIN SMALL LETTER THORN
| 00FF | y_diae | LATIN SMALL LETTER Y WITH DIAERESIS
+--------+------------------+-------------------------------------
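
The constraints stated at the start of this message (7-bit characters, no spaces, fewer than 32 characters) can be checked mechanically for any short ID in the table above. The function name is_valid_short_id is made up for this sketch, and reading "7-bit" as printable ASCII without the space character is an assumption.

```python
def is_valid_short_id(short_id: str, max_len: int = 32) -> bool:
    """Check that an ID is non-empty, shorter than max_len, and consists
    only of printable 7-bit ASCII characters other than space."""
    return 0 < len(short_id) < max_len and all(
        0x21 <= ord(ch) <= 0x7E for ch in short_id
    )


if __name__ == "__main__":
    for candidate in ("femi_ordi_indi", "<<", "A_ring-a", "A acute"):
        print(candidate, is_valid_short_id(candidate))
```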
Yours sincerely
John Clews
--
John Clews (Chair of ISO/TC46/SC2: Conversion of Written Languages)
SESAME Computer Projects, 8 Avenue Road, Harrogate, HG2 7PG, England
Email: Converse@sesame.demon.co.uk; tel: +44 (0) 1423 888 432