MARBI Character Set Subcommittee-- Interim Report

From: Jeffrey B. Bishop (bishop@blue.weeg.uiowa.edu)
Date: Tue Jul 18 1995 - 13:09:57 EDT


MARBI CHARACTER SET SUBCOMMITTEE

Interim Report to MARBI

July 15, 1995

The Character Set Subcommittee was appointed in June 1994 (following
MARBI discussion of Discussion Paper #73) with the following charge:

        * To review the character set issues related to mapping between
                USMARC and Unicode;

        * To formulate a proposal for review and comment by LC, MARBI,
                and the USMARC Advisory Group;

        * To identify other issues related to character sets which should
                be addressed by MARBI and/or the library community.

Members of the Subcommittee are:

        Joan Aliprand - RLG
        Randy Barry - LC
        Candy Bogar - DRA
        John Espley - VTLS
        Robyn Greenlund - Microlif
        Sally McCallum - LC
        Gary Smith - OCLC
        Paul Weiss - University of New Mexico
        Larry Woods - University of Iowa, Chair

The Subcommittee established five working principles to guide the
mapping:

        1. Round-trip mapping will be provided between USMARC characters
                and Unicode characters in every possible case.

        2. Transliteration tables will remain unchanged unless there is no
                Unicode equivalent for a diacritical mark, in which case
                a change to the transliteration table may be considered
                by the Library of Congress.

        3. Accented letters (and vocalized consonants in Hebrew and
                Arabic) will continue to be encoded as a base letter
                and non-spacing marks. Use of precomposed accented
                letters is not sanctioned at this stage.

        4. Punctuation in the USMARC Hebrew, Cyrillic, and Arabic
                character sets, and digits in the Hebrew and Cyrillic
                sets, will be "unified" by being mapped to the characters in
                the ASCII block of the Unicode standard (under further
                consideration).

        5. Codes in the Private Use Area will be used only if necessary to
                facilitate round-trip mapping.
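
Principle 1 can be read as a requirement that the mapping table be
invertible. The following is a minimal Python sketch of such a check, using
a handful of rows copied from Appendix 1. The function names are
illustrative only and are not part of any USMARC or Unicode specification.

```python
# Excerpt of the Appendix 1 table: USMARC (ANSEL) code -> Unicode value.
USMARC_TO_UNICODE = {
    0xA1: 0x0141,  # UPPERCASE POLISH L
    0xA3: 0x0110,  # UPPERCASE D WITH CROSSBAR
    0xB3: 0x0111,  # LOWERCASE D WITH CROSSBAR
    0xBA: 0x00F0,  # LOWERCASE ETH
    0xE2: 0x0301,  # ACUTE
    0xF7: 0x0326,  # LEFT HOOK (COMMA BELOW)
}


def invert(mapping):
    """Build the reverse table, refusing any many-to-one entry."""
    reverse = {}
    for src, dst in mapping.items():
        if dst in reverse:
            raise ValueError(
                f"U+{dst:04X} is the target of both "
                f"{reverse[dst]:02X} and {src:02X}"
            )
        reverse[dst] = src
    return reverse


def round_trips(mapping):
    """True if every code survives USMARC -> Unicode -> USMARC unchanged."""
    reverse = invert(mapping)
    return all(reverse[mapping[code]] == code for code in mapping)


UNICODE_TO_USMARC = invert(USMARC_TO_UNICODE)
print(round_trips(USMARC_TO_UNICODE))
```

A check of this kind would reject a table in which two USMARC codes share
one Unicode value, which is one reason the unification described in
principle 4 needs separate treatment.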

The Subcommittee has completed mappings for the following USMARC
character sets:

        * Basic Latin (ASCII) and Extended Latin (ANSEL) except for one
                character (the Right cedilla which is used in the
                transliteration of Thai);

        * Greek Symbols (the Greek lowercase letters Alpha, Beta and
                Gamma);

        * Subscript Characters; and

        * Superscript Characters.

The agreed-upon mappings are listed in Appendix 1.

For the most part the mappings were straightforward and non-controversial.
A few engendered discussion, and some recommendations were not unanimous.
Those mappings are listed here along with a summary of the discussion:

        A3 D with crossbar uppercase
        to
        0110 Latin capital D with stroke

The USMARC Latin character A3 (Uppercase D with crossbar) is used to
encode Croatian and Vietnamese letters and transliterated Macedonian and
Serbian, and is also considered to be the uppercase form of the Eth.
The Unicode standard includes three "crossed D" characters.

Because the Eth is generally regarded as a lowercase letter, the
Subcommittee chose to map A3 to U+0110, on the basis of the most common
usage (Croatian, Vietnamese, etc.).
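
As an illustration, the three capital "crossed D" characters and their
lowercase partners can be listed with Python's unicodedata module. The
report does not name the three code points; U+00D0, U+0110 and U+0189 are
assumed here, and the names printed come from the Unicode database shipped
with the interpreter rather than from Version 1.1.

```python
import unicodedata

# Assumed candidates: capital Eth, D with stroke, and African D.
CROSSED_D = [0x00D0, 0x0110, 0x0189]

# Pair each capital with the lowercase letter Python's case mapping gives.
pairs = [(chr(cp), chr(cp).lower()) for cp in CROSSED_D]

for upper, lower in pairs:
    print(
        f"U+{ord(upper):04X} {unicodedata.name(upper)}"
        f" -> U+{ord(lower):04X} {unicodedata.name(lower)}"
    )
```

The capitals look alike but have three distinct lowercase forms, so the
choice of U+0110 for A3 keeps it paired with U+0111 (the target of B3),
while the lowercase Eth at BA maps separately to U+00F0.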

        AA Subscript patent mark
        to
        00AE Registered trademark sign

It was felt that the loss of subscriptedness (U+00AE is not a subscripted
character) was not crucial for this character.

        EB Ligature first half
        to
        FE20 Combining ligature, left half

        EC Ligature, second half
        to
        FE21 Combining ligature, right half

        FA Double tilde, first half
        to
        FE22 Combining double tilde, left half

        FB Double tilde, second half
        to
        FE23 Combining double tilde, right half

There were two possible mappings for these four characters: to a single
character (which extends over two letters) or to a pair of characters
corresponding to the "halves". Mapping to the "halves" was chosen.
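
A small Python sketch of the "halves" approach follows. It assumes the
Unicode convention that a combining mark follows its base letter, and the
helper name ligature_pair is hypothetical.

```python
import unicodedata

# The four "half" mappings listed above: USMARC code -> Unicode value.
HALVES = {0xEB: 0xFE20, 0xEC: 0xFE21, 0xFA: 0xFE22, 0xFB: 0xFE23}


def ligature_pair(first, second):
    """Tie two letters by attaching one ligature half to each of them."""
    return first + chr(HALVES[0xEB]) + second + chr(HALVES[0xEC])


tied = ligature_pair("t", "s")
print([f"U+{ord(ch):04X}" for ch in tied])

# Each half is a non-spacing mark (general category Mn).
for cp in HALVES.values():
    print(f"U+{cp:04X}", unicodedata.name(chr(cp)), unicodedata.category(chr(cp)))
```

Because each USMARC half has its own Unicode counterpart, the mapping
stays one-to-one in both directions.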

        F7 Left hook with tail
        to
        0326 Combining comma below

This character is used in Latvian, Romanian, and Polish. The issue was
whether the mapping should be based on the appearance of the character or
on its function. A majority of the Subcommittee accepted a mapping based
on function, supported by a reference to the use of a comma-like
descender in Romanian typography. Other members felt that the graphic
appearance was important.
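
The practical difference between the comma below (U+0326) and the cedilla
(U+0327) can be seen by composing each with a base letter. This sketch uses
the current Unicode database in Python; the precomposed letters it prints
are shown only to make the distinction visible, since working principle 3
does not sanction precomposed letters in the mapping itself.

```python
import unicodedata

# Base letter followed by the combining mark, then composed with NFC.
s_comma = unicodedata.normalize("NFC", "s\u0326")
s_cedilla = unicodedata.normalize("NFC", "s\u0327")

print(f"U+{ord(s_comma):04X}", unicodedata.name(s_comma))
print(f"U+{ord(s_cedilla):04X}", unicodedata.name(s_cedilla))
```

The two sequences compose to different letters, so the choice of U+0326
for F7 determines which letter a reader of the converted data will see.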

        F8 Right Cedilla
        to
        ?

This is still being investigated with assistance from Thailand. It is
used only in Thai romanization.

The Subcommittee recommended mapping the three Greek letters in USMARC to
the corresponding Greek script characters in Unicode rather than try to
retain the "latinness" of those characters by some other mapping (e.g. to
values in the Private Use Area).

A Proposal on the mapping outlined in Appendix 1 will be brought to MARBI
at Midwinter 1996.

Work on Basic and Extended Cyrillic, Hebrew, and Basic and Extended
Arabic is continuing and will be followed by work on the East Asian
Character Code (EACC).

For Cyrillic, Hebrew and Arabic USMARC characters, the Subcommittee
plans to address mapping issues in three phases:

        1. Mapping of Cyrillic, Hebrew and Arabic letters and Arabic
                (traditional "Hindi") digits, all of which are non-
                controversial;

        2. ASCII "clones" in each character set (punctuation and digits in
                Cyrillic and Hebrew, punctuation in Arabic);

        3. Other items:

                a. Hebrew Holam, which serves in USMARC as both the vowel
                        point holam and the sin dot. The holam and sin
                        dot are discrete Unicode characters.
                b. Several Arabic letters which are in the USMARC Extended
                        Arabic character set but not in the Unicode
                        standard.

The items in (1) should be straightforward. The items in (2) and (3a)
will require research by the Subcommittee during the Fall of 1995. The
Arabic letters in (3b) should be proposed as additions to the Unicode
Standard and to ISO/IEC 10646. Documentation to support their addition
needs to be gathered.
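
The Holam case in (3a) can be sketched as a one-to-many table. The USMARC
code for Holam is not given in this report, so a placeholder label stands
in for it below; the two Unicode values are U+05B9 and U+05C2. This is an
illustration of the problem, not a proposed resolution.

```python
import unicodedata

# Placeholder label: the actual USMARC Hebrew code is not stated here.
USMARC_HOLAM = "USMARC Hebrew Holam"

# One USMARC character with two possible Unicode targets.
USMARC_TO_UNICODE_CANDIDATES = {
    USMARC_HOLAM: (0x05B9, 0x05C2),
}


def unicode_to_usmarc(cp):
    """Unicode -> USMARC is unambiguous: both points fold to Holam."""
    for usmarc, candidates in USMARC_TO_UNICODE_CANDIDATES.items():
        if cp in candidates:
            return usmarc
    raise KeyError(f"U+{cp:04X} is not in this excerpt")


def ambiguous_sources(table):
    """USMARC characters that cannot be mapped without extra context."""
    return [src for src, candidates in table.items() if len(candidates) > 1]


for cp in USMARC_TO_UNICODE_CANDIDATES[USMARC_HOLAM]:
    print(f"U+{cp:04X} {unicodedata.name(chr(cp))} -> {unicode_to_usmarc(cp)}")
print("needs context:", ambiguous_sources(USMARC_TO_UNICODE_CANDIDATES))
```

Going from USMARC to Unicode therefore requires a decision that a simple
table lookup cannot make, which is why this item calls for research.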

Glossary and Conventions:

UCS = Universal Character Set (the proper title of International
Standard ISO/IEC 10646).

U+nnnn = An individual Unicode value, where nnnn is a four-digit number
expressed in hexadecimal notation.

Private Use Area = Unicode values in the range U+E000 through U+F8FF.
Codes in this range are for the use of software developers and end users
who need a special set of characters for their applications. The code
points in this area do not have defined, interpretable semantics except
by private agreement.
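
The two conventions above translate directly into code. A brief Python
sketch, with hypothetical helper names:

```python
import unicodedata

PUA_FIRST, PUA_LAST = 0xE000, 0xF8FF


def u_plus(ch):
    """Format a character in the U+nnnn notation (four hex digits)."""
    return f"U+{ord(ch):04X}"


def in_private_use_area(cp):
    """True if the value lies in the range U+E000 through U+F8FF."""
    return PUA_FIRST <= cp <= PUA_LAST


print(u_plus("\u0110"))
print(in_private_use_area(0xE000), in_private_use_area(0x0110))
# Python's Unicode database reports these values as category "Co".
print(unicodedata.category(chr(PUA_FIRST)))
```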

Appendix 1

========================================================================
        Author: Joan Aliprand
        Revised: 9/12/92
        Revised: 12/17/93
        Revised: 5/26/94
        Revised: 6/25/95
        Revised: 6/29/95
                                        
 
 
Mapping of USMARC Characters to Unicode/UCS Values
 
Sources:
 
USMARC sources:
        USMARC Specifications for Record Structure, Character Sets, and
        Exchange Media. 1994 edition. Washington, D.C., Library of Congress,
        1994.
        
        MARBI Proposal No. 93-10, as approved in February 1994.
 
Unicode sources:
        The Unicode Standard, Version 1.0. Vol. 1, 1991.
        The Unicode Standard, Version 1.1. Prepublication edition. 1993.
 
The Unicode Standard, Version 1.1 and the Basic Multilingual Plane (BMP)
of International Standard ISO/IEC 10646-1:1993 are identical in character
repertoire and code-point assignment. The Unicode standard is a profile
of UCS-2, the two-octet form of the Universal Character Set.
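
To make "two-octet form" concrete, the sketch below writes a Unicode value
from the table as two octets. Big-endian octet order is an assumption of
the example, and the function name is hypothetical. Python has no codec
named UCS-2; its utf-16-be codec yields the same octets for the values
shown.

```python
def ucs2_octets(ch):
    """Return the two-octet (big-endian) form of a single character."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("value does not fit in two octets")
    return cp.to_bytes(2, "big")


d_stroke = ucs2_octets("\u0110")   # target of USMARC A3
flat = ucs2_octets("\u266d")       # target of USMARC A9
print(d_stroke.hex(), flat.hex())
print(d_stroke == "\u0110".encode("utf-16-be"))
```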
 
Previous versions of this mapping used this UCS source:
ISO DIS 10646-1.2.

Both USMARC and Unicode/UCS names should properly be in uppercase
letters. Upper and lowercase have been used in the following table for
ease of reading. Any amendments to UCS names after publication of
ISO/IEC 10646-1 have not been included.
 
ASCII (BASIC LATIN) AND ANSEL (EXTENDED LATIN) CHARACTER SETS
 
USMARC Character Unicode/UCS Character
Code Name Code Name
 
 1B ESCAPE 001B ESCAPE
 1D RECORD TERMINATOR 001D GROUP SEPARATOR
 1E FIELD TERMINATOR 001E RECORD SEPARATOR
 1F SUBFIELD DELIMITER 001F UNIT SEPARATOR
 
 20 SPACE (BLANK) 0020 SPACE
 21 EXCLAMATION MARK 0021 EXCLAMATION MARK
 22 QUOTATION MARK 0022 QUOTATION MARK
 23 NUMBER SIGN 0023 NUMBER SIGN
 24 DOLLAR SIGN 0024 DOLLAR SIGN
 25 PERCENT SIGN 0025 PERCENT SIGN
 26 AMPERSAND 0026 AMPERSAND
 27 APOSTROPHE 0027 APOSTROPHE
 28 OPENING PARENTHESIS 0028 LEFT PARENTHESIS
 29 CLOSING PARENTHESIS 0029 RIGHT PARENTHESIS
 2A ASTERISK 002A ASTERISK
 2B PLUS SIGN 002B PLUS SIGN
 2C COMMA 002C COMMA
 2D HYPHEN-MINUS 002D HYPHEN-MINUS
 2E PERIOD (DECIMAL POINT) 002E FULL STOP
 2F SLASH 002F SOLIDUS
 
 30 DIGIT ZERO 0030 DIGIT ZERO
   THROUGH THROUGH
 39 DIGIT NINE 0039 DIGIT NINE

 3A COLON 003A COLON
 3B SEMICOLON 003B SEMICOLON
 3C LESS-THAN SIGN 003C LESS-THAN SIGN
        (OPENING ANGLE BRACKET)
 3D EQUALS SIGN 003D EQUALS SIGN
 3E GREATER-THAN SIGN 003E GREATER-THAN SIGN
        (CLOSING ANGLE BRACKET)
 3F QUESTION MARK 003F QUESTION MARK
 40 COMMERCIAL AT 0040 COMMERCIAL AT

 41 CAPITAL A 0041 LATIN CAPITAL A
   THROUGH THROUGH
 5A CAPITAL Z 005A LATIN CAPITAL Z

 5B OPENING SQUARE BRACKET 005B LEFT SQUARE BRACKET
 5C REVERSE SLASH 005C REVERSE SOLIDUS
 5D CLOSING SQUARE BRACKET 005D RIGHT SQUARE BRACKET
 5E SPACING CIRCUMFLEX 005E CIRCUMFLEX ACCENT
 5F SPACING UNDERSCORE 005F SPACING LOW LINE
 60 SPACING GRAVE 0060 GRAVE ACCENT

 61 SMALL A 0061 LATIN SMALL A
   THROUGH THROUGH
 7A SMALL Z 007A LATIN SMALL Z

 7B OPENING CURLY BRACKET 007B LEFT CURLY BRACKET
 7C VERTICAL BAR (FILL) 007C VERTICAL LINE
 7D CLOSING CURLY BRACKET 007D RIGHT CURLY BRACKET
 7E SPACING TILDE 007E TILDE
 

 A1 UPPERCASE POLISH L 0141 LATIN CAPITAL LETTER L WITH STROKE
 A2 UPPERCASE SCANDINAVIAN O 00D8 LATIN CAPITAL LETTER O WITH STROKE
 A3 UPPERCASE D WITH CROSSBAR 0110 LATIN CAPITAL LETTER D WITH STROKE
 A4 UPPERCASE ICELANDIC THORN 00DE LATIN CAPITAL LETTER THORN
                                              (Icelandic)
 A5 UPPERCASE DIGRAPH AE 00C6 LATIN CAPITAL LIGATURE AE
 A6 UPPERCASE DIGRAPH OE 0152 LATIN CAPITAL LIGATURE OE
 A7 SOFT SIGN (PRIME) 02B9 MODIFIER LETTER PRIME
 A8 DOT IN MIDDLE OF LINE 00B7 MIDDLE DOT
 A9 MUSICAL FLAT 266D MUSIC FLAT SIGN
 AA SUBSCRIPT PATENT MARK 00AE REGISTERED SIGN
 AB PLUS OR MINUS 00B1 PLUS-MINUS SIGN
 AC UPPERCASE O-HOOK 01A0 LATIN CAPITAL LETTER O WITH HORN
 AD UPPERCASE U-HOOK 01AF LATIN CAPITAL LETTER U WITH HORN
 AE ALIF 02BE MODIFIER LETTER RIGHT HALF RING
 

 B0 AYN 02BF MODIFIER LETTER LEFT HALF RING
 B1 LOWERCASE POLISH L 0142 LATIN SMALL LETTER L WITH STROKE
 B2 LOWERCASE SCANDINAVIAN O 00F8 LATIN SMALL LETTER O WITH STROKE
 B3 LOWERCASE D WITH CROSSBAR 0111 LATIN SMALL LETTER D WITH STROKE
 B4 LOWERCASE ICELANDIC THORN 00FE LATIN SMALL LETTER THORN
                                              (Icelandic)
 B5 LOWERCASE DIGRAPH AE 00E6 LATIN SMALL LIGATURE AE
 B6 LOWERCASE DIGRAPH OE 0153 LATIN SMALL LIGATURE OE
 B7 HARD SIGN (DOUBLE PRIME) 02BA MODIFIER LETTER DOUBLE PRIME
 B8 LOWERCASE TURKISH I 0131 LATIN SMALL LETTER DOTLESS I
 B9 BRITISH POUND 00A3 POUND SIGN
 BA LOWERCASE ETH 00F0 LATIN SMALL LETTER ETH (Icelandic)
 BC LOWERCASE O-HOOK 01A1 LATIN SMALL LETTER O WITH HORN
 BD LOWERCASE U-HOOK 01B0 LATIN SMALL LETTER U WITH HORN
 

 C0 DEGREE SIGN 00B0 DEGREE SIGN
 C1 LOWERCASE SCRIPT L 2113 SCRIPT SMALL L
 C2 PHONO COPYRIGHT MARK 2117 SOUND RECORDING COPYRIGHT
 C3 COPYRIGHT MARK 00A9 COPYRIGHT SIGN
 C4 SHARP 266F MUSIC SHARP SIGN
 C5 INVERTED QUESTION MARK 00BF INVERTED QUESTION MARK
 C6 INVERTED EXCLAMATION MARK 00A1 INVERTED EXCLAMATION MARK
 

 E0 PSEUDO QUESTION MARK 0309 COMBINING HOOK ABOVE
 E1 GRAVE 0300 COMBINING GRAVE ACCENT (Varia)
 E2 ACUTE 0301 COMBINING ACUTE ACCENT (Oxia)
 E3 CIRCUMFLEX 0302 COMBINING CIRCUMFLEX ACCENT
 E4 TILDE 0303 COMBINING TILDE
 E5 MACRON 0304 COMBINING MACRON
 E6 BREVE 0306 COMBINING BREVE (Vrachy)
 E7 SUPERIOR DOT 0307 COMBINING DOT ABOVE
 E8 UMLAUT (DIAERESIS) 0308 COMBINING DIAERESIS (Dialytika)
 E9 HACEK 030C COMBINING CARON
 EA CIRCLE ABOVE (ANGSTROM) 030A COMBINING RING ABOVE
 EB LIGATURE, FIRST HALF FE20 COMBINING LIGATURE LEFT HALF
 EC LIGATURE, SECOND HALF FE21 COMBINING LIGATURE RIGHT HALF
 ED HIGH COMMA, OFF CENTER 0315 COMBINING COMMA ABOVE RIGHT
 EE DOUBLE ACUTE 030B COMBINING DOUBLE ACUTE ACCENT
 EF CANDRABINDU 0310 COMBINING CANDRABINDU
 

 F0 CEDILLA 0327 COMBINING CEDILLA
 F1 RIGHT HOOK (OGONEK) 0328 COMBINING OGONEK
 F2 DOT BELOW 0323 COMBINING DOT BELOW
 F3 DOUBLE DOT BELOW 0324 COMBINING DIAERESIS BELOW
 F4 CIRCLE BELOW 0325 COMBINING RING BELOW
 F5 DOUBLE UNDERSCORE 0333 COMBINING DOUBLE LOW LINE
 F6 UNDERSCORE 0332 COMBINING LOW LINE
 F7 LEFT HOOK (COMMA BELOW) 0326 COMBINING COMMA BELOW
 F8 RIGHT CEDILLA (No recommendation yet)
 F9 UPADHMANIYA 032E COMBINING BREVE BELOW
 FA DOUBLE TILDE, FIRST HALF FE22 COMBINING DOUBLE TILDE LEFT HALF
 FB DOUBLE TILDE, SECOND HALF FE23 COMBINING DOUBLE TILDE RIGHT HALF
 
 FE HIGH COMMA, CENTERED 0313 COMBINING COMMA ABOVE (Psili)
 
 

GREEK LETTERS
 
USMARC Character Unicode/UCS Character
Code Name Code Name
 
 61 ALPHA 03B1 GREEK SMALL LETTER ALPHA
 62 BETA 03B2 GREEK SMALL LETTER BETA
 63 GAMMA 03B3 GREEK SMALL LETTER GAMMA
 
 
SUBSCRIPTS
 
USMARC Character Unicode/UCS Character
Code Name Code Name
 
 28 SUBSCRIPT OPENING PARENTHESIS 208D SUBSCRIPT LEFT PARENTHESIS
 29 SUBSCRIPT CLOSING PARENTHESIS 208E SUBSCRIPT RIGHT PARENTHESIS
 2B SUBSCRIPT PLUS 208A SUBSCRIPT PLUS SIGN
 2D SUBSCRIPT MINUS 208B SUBSCRIPT HYPHEN-MINUS
 30 SUBSCRIPT 0 2080 SUBSCRIPT 0
 31 SUBSCRIPT 1 2081 SUBSCRIPT 1
 32 SUBSCRIPT 2 2082 SUBSCRIPT 2
 33 SUBSCRIPT 3 2083 SUBSCRIPT 3
 34 SUBSCRIPT 4 2084 SUBSCRIPT 4
 35 SUBSCRIPT 5 2085 SUBSCRIPT 5
 36 SUBSCRIPT 6 2086 SUBSCRIPT 6
 37 SUBSCRIPT 7 2087 SUBSCRIPT 7
 38 SUBSCRIPT 8 2088 SUBSCRIPT 8
 39 SUBSCRIPT 9 2089 SUBSCRIPT 9
 
 
SUPERSCRIPTS
 
USMARC Character Unicode/UCS Character
Code Name Code Name
 
 28 SUPERSCRIPT OPENING PARENTHESIS 207D SUPERSCRIPT LEFT PARENTHESIS
 29 SUPERSCRIPT CLOSING PARENTHESIS 207E SUPERSCRIPT RIGHT PARENTHESIS
 2B SUPERSCRIPT PLUS 207A SUPERSCRIPT PLUS SIGN
 2D SUPERSCRIPT MINUS 207B SUPERSCRIPT HYPHEN-MINUS
 30 SUPERSCRIPT 0 2070 SUPERSCRIPT 0
 31 SUPERSCRIPT 1 00B9 SUPERSCRIPT 1
 32 SUPERSCRIPT 2 00B2 SUPERSCRIPT 2
 33 SUPERSCRIPT 3 00B3 SUPERSCRIPT 3
 34 SUPERSCRIPT 4 2074 SUPERSCRIPT 4
 35 SUPERSCRIPT 5 2075 SUPERSCRIPT 5
 36 SUPERSCRIPT 6 2076 SUPERSCRIPT 6
 37 SUPERSCRIPT 7 2077 SUPERSCRIPT 7
 38 SUPERSCRIPT 8 2078 SUPERSCRIPT 8
 39 SUPERSCRIPT 9 2079 SUPERSCRIPT 9
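
The tables above reuse the same USMARC codes for different characters
(61 is SMALL A in ASCII and ALPHA in the Greek set), so a lookup has to be
keyed by character set as well as by code. A minimal Python sketch follows;
the set labels are hypothetical, and how a record selects the active set is
outside the scope of this example.

```python
TABLES = {
    "ascii": {0x61: 0x0061, 0x62: 0x0062, 0x63: 0x0063},
    "greek": {0x61: 0x03B1, 0x62: 0x03B2, 0x63: 0x03B3},
    "subscript": {
        0x28: 0x208D, 0x29: 0x208E, 0x2B: 0x208A, 0x2D: 0x208B,
        # Subscript digits 0-9 are contiguous at U+2080 through U+2089.
        **{0x30 + d: 0x2080 + d for d in range(10)},
    },
    "superscript": {
        0x28: 0x207D, 0x29: 0x207E, 0x2B: 0x207A, 0x2D: 0x207B,
        0x30: 0x2070,
        # Superscript 1, 2 and 3 are not contiguous with the others.
        0x31: 0x00B9, 0x32: 0x00B2, 0x33: 0x00B3,
        **{0x30 + d: 0x2070 + d for d in range(4, 10)},
    },
}


def to_unicode(charset, code):
    """Look up one USMARC code within the named character set."""
    return chr(TABLES[charset][code])


print(to_unicode("ascii", 0x61), to_unicode("greek", 0x61))
print(to_unicode("subscript", 0x32), to_unicode("superscript", 0x32))
```

Note that the superscript digits cannot be generated by a single offset,
which is why the table lists each one individually.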


