MARBI Character Set Subcommittee-- Interim Report

From: Jeffrey B. Bishop (bishop@blue.weeg.uiowa.edu)
Date: Tue Jul 18 1995 - 13:09:57 EDT

MARBI CHARACTER SET SUBCOMMITTEE

Interim Report to MARBI

July 15, 1995

The Character Set Subcommittee was appointed in June 1994 (following
MARBI discussion of Discussion Paper #73) with the following charge:

* To review the character set issues related to mapping between
USMARC and Unicode;

* To formulate a proposal for review and comment by LC, MARBI,
and the USMARC Advisory Group;

* To identify other issues related to character sets which should
be addressed by MARBI and/or the library community.

Members of the Subcommittee are:

        Joan Aliprand - RLG
        Randy Barry - LC
        Candy Bogar - DRA
        John Espley - VTLS
        Robyn Greenlund - Microlif
        Sally McCallum - LC
        Gary Smith - OCLC
        Paul Weiss - University of New Mexico
        Larry Woods - University of Iowa, Chair

The Subcommittee established five working principles to guide the
mapping:

        1. Round-trip mapping will be provided between USMARC characters
                and Unicode characters in every possible case.

        2. Transliteration tables will remain unchanged unless there is no
                Unicode equivalent for a diacritical mark, in which case
                a change to the transliteration table may be considered
                by the Library of Congress.

        3. Accented letters (and vocalized consonants in Hebrew and
                Arabic) will continue to be encoded as a base letter
                and non-spacing marks. Use of precomposed accented
                letters is not sanctioned at this stage.

        4. Punctuation in the USMARC Hebrew, Cyrillic, and Arabic
                character sets, and digits in the Hebrew and Cyrillic
                sets, will be "unified" by being mapped to the characters in
                the ASCII block of the Unicode standard (under further
                consideration).

        5. Codes in the Private Use Area will be used only if necessary to
                facilitate round-trip mapping.
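
Principle 3 can be illustrated with a minimal Python sketch. It relies on
the standard `unicodedata` module, which postdates this report and is used
here only as an assumption for illustration; the helper name
`uses_precomposed` is hypothetical. A character is treated as precomposed
when Unicode records a canonical decomposition for it.

```python
import unicodedata


def uses_precomposed(text: str) -> bool:
    """Return True if any character has a canonical decomposition,
    i.e. could be written as a base letter plus non-spacing marks."""
    for ch in text:
        decomp = unicodedata.decomposition(ch)
        # Compatibility decompositions start with a tag such as "<super>";
        # only untagged (canonical) decompositions matter here.
        if decomp and not decomp.startswith("<"):
            return True
    return False


# Base letter followed by a non-spacing mark (USMARC E2 maps to U+0301).
sanctioned = "e\u0301"
# Precomposed LATIN SMALL LETTER E WITH ACUTE, not sanctioned at this stage.
precomposed = "\u00e9"
```

Under these assumptions, `uses_precomposed(sanctioned)` is false and
`uses_precomposed(precomposed)` is true.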

The Subcommittee has completed mappings for the following USMARC
character sets:

        * Basic Latin (ASCII) and Extended Latin (ANSEL) except for one
                character (the Right cedilla which is used in the
                transliteration of Thai);

        * Greek Symbols (the Greek lowercase letters Alpha, Beta and
                Gamma);

        * Subscript Characters; and

        * Superscript Characters.

The agreed-upon mappings are listed in Appendix 1.

For the most part the mappings were straightforward and non-controversial.
A few engendered discussion, and some recommendations were not unanimous.
Those mappings are listed here along with a summary of the discussion:

        A3 D with crossbar uppercase
        to
        0110 Latin capital D with stroke

The USMARC Latin character A3 (Uppercase D with crossbar) is used to
encode Croatian and Vietnamese letters and transliterated Macedonian
and Serbian, and is also considered to be the uppercase form of the Eth.
The Unicode standard includes three "crossed D" characters.

Because the Eth is generally regarded as a lowercase letter, the
Subcommittee chose to map A3 to U+0110, on the basis of the most common
usage (Croatian, Vietnamese, etc.).
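
The three "crossed D" characters are not named in this report. Assuming
they are U+00D0, U+0110 and U+0189, the following Python sketch lists each
one with its Unicode name and the code point of its lowercase form. The
`unicodedata` module and the helper name `describe` are illustrative
assumptions, not part of the Subcommittee's work.

```python
import unicodedata

# Assumed candidates: the three capital "crossed D" code points.
CROSSED_D = ["\u00d0", "\u0110", "\u0189"]


def describe(ch: str) -> tuple:
    """Return (U+nnnn, Unicode name, U+nnnn of the lowercase form)."""
    lower = ch.lower()
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), f"U+{ord(lower):04X}")


for ch in CROSSED_D:
    print(*describe(ch))
```

U+0110 lowercases to U+0111, the target of USMARC B3 in Appendix 1, so the
A3/B3 pair remains a case pair after mapping. U+00F0, the target of USMARC
BA, is instead the lowercase partner of U+00D0.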

        AA Subscript patent mark
        to
        00AE Registered trademark sign

It was felt that the loss of subscriptedness (U+00AE is not a subscripted
character) was not crucial for this character.

        EB Ligature, first half
        to
        FE20 Combining ligature, left half

        EC Ligature, second half
        to
        FE21 Combining ligature, right half

        FA Double tilde, first half
        to
        FE22 Combining double tilde, left half

        FB Double tilde, second half
        to
        FE23 Combining double tilde, right half

There were two possible mappings for these four characters: to a single
character (which extends over two letters) or to a pair of characters
corresponding to the "halves". Mapping to the "halves" was chosen.
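
One way to see why the "halves" suit round-trip mapping is sketched below
in Python. The single-character alternatives are not named in this report;
U+0361 and U+0360 are assumed here, and the alternative is modelled, as a
simplification, by sending both halves to the same code point. The table
and function names are hypothetical.

```python
# Chosen mapping: each USMARC half has its own Unicode code point.
HALVES = {0xEB: 0xFE20, 0xEC: 0xFE21, 0xFA: 0xFE22, 0xFB: 0xFE23}

# Assumed alternative: both halves collapse into one double-width mark
# (U+0361 for the ligature, U+0360 for the double tilde).
SINGLE = {0xEB: 0x0361, 0xEC: 0x0361, 0xFA: 0x0360, 0xFB: 0x0360}


def is_invertible(table: dict) -> bool:
    """A table can be inverted code for code only if no two USMARC
    codes share the same Unicode target."""
    return len(set(table.values())) == len(table)
```

With these tables, `is_invertible(HALVES)` holds while
`is_invertible(SINGLE)` does not, so the single-character option would need
extra context to restore the original USMARC codes.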

        F7 Left hook with tail
        to
        0326 Combining comma below

This character is used in Latvian, Romanian, and Polish. The issue was
whether mapping should be based on the appearance of the character or on
its function. A majority of the Subcommittee accepted the recommendation
to map by function, supported by a reference to the use of a comma-like
descender in Romanian typography. Other members felt that the graphic
appearance was important.

        F8 Right cedilla
        to
        ?

This is still being investigated with assistance from Thailand. It is
used only in Thai romanization.

The Subcommittee recommended mapping the three Greek letters in USMARC to
the corresponding Greek script characters in Unicode rather than try to
retain the "latinness" of those characters by some other mapping (e.g. to
values in the Private Use Area).

A Proposal on the mapping outlined in Appendix 1 will be brought to MARBI
at Midwinter 1996.

Work on Basic and Extended Cyrillic, Hebrew and Basic and Extended Arabic
is continuing and will be followed by work on the East Asian Character
Code (EACC).

For Cyrillic, Hebrew and Arabic USMARC characters, the Subcommittee
plans to address mapping issues in three phases:

        1. Mapping of Cyrillic, Hebrew and Arabic letters and Arabic
                (traditional "Hindi") digits, all of which are
                non-controversial;

        2. ASCII "clones" in each character set (punctuation and digits in
                Cyrillic and Hebrew, punctuation in Arabic);

        3. Other items:

                a. Hebrew Holam which serves in USMARC as both the vowel
                        point holam and the sin dot. The holam and sin
                        dot are both discrete Unicode characters.
                b. Several Arabic letters which are in the USMARC
                        Extended Arabic character set but not in the
                        Unicode standard.

The items in (1) should be straightforward. The items in (2) and (3a)
will require research by the Subcommittee during the Fall of 1995. The
Arabic letters in (3b) should be proposed as additions to the Unicode
Standard and to ISO/IEC 10646. Documentation to support their addition
needs to be gathered.

Glossary and Conventions:

UCS = Universal Character Set (the proper title of International
Standard ISO/IEC 10646).

U+nnnn = An individual Unicode value, where nnnn is a four-digit number
expressed in hexadecimal notation.

Private Use Area = Unicode values in the range U+E000 through U+F8FF.
Codes in this range are for the use of software developers and end users
who need a special set of characters for their applications. The code
points in this area do not have defined, interpretable semantics except
by private agreement.
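
The two conventions above can be sketched in Python as follows. The
function names `u_notation` and `in_private_use_area` are hypothetical and
serve only as an illustration.

```python
# Private Use Area bounds as given in the glossary.
PUA_FIRST = 0xE000
PUA_LAST = 0xF8FF


def u_notation(code_point: int) -> str:
    """Format a value as U+nnnn with four uppercase hexadecimal digits."""
    return f"U+{code_point:04X}"


def in_private_use_area(code_point: int) -> bool:
    """Return True if the value lies in the range U+E000 through U+F8FF."""
    return PUA_FIRST <= code_point <= PUA_LAST
```

For example, `u_notation(0x110)` gives `U+0110`, and U+FE20 (the target of
USMARC EB) lies outside the Private Use Area.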

Appendix 1

========================================================================
        Author: Joan Aliprand
        Revised: 9/12/92
        Revised: 12/17/93
        Revised: 5/26/94
        Revised: 6/25/95
        Revised: 6/29/95


Mapping of USMARC Characters to Unicode/UCS Values

Sources:

USMARC sources:
        USMARC Specifications for Record Structure, Character Sets, and
        Exchange Media. 1994 edition. Washington, D.C., Library of Congress,
        1994.

        MARBI Proposal No. 93-10, as approved in February 1994.

Unicode sources:
        The Unicode Standard, Version 1.0. Vol. 1, 1991.
        The Unicode Standard, Version 1.1. Prepublication edition. 1993.

The Unicode Standard, Version 1.1 and the Basic Multilingual Plane (BMP)
of International Standard ISO/IEC 10646-1:1993 are identical in character
repertoire and code-point assignment. The Unicode standard is a profile
of UCS-2, the two-octet form of the Universal Character Set.

Previous versions of this mapping used this UCS source: ISO DIS
10646-1.2.

Both USMARC and Unicode/UCS names should properly be in uppercase
letters. Upper and lowercase have been used in the following table for
ease of reading. Any amendments to UCS names after publication of
ISO/IEC 10646-1 have not been included.

ASCII (BASIC LATIN) AND ANSEL (EXTENDED LATIN) CHARACTER SETS

USMARC Character                  Unicode/UCS Character
Code  Name                        Code  Name

1B    ESCAPE                      001B  ESCAPE
1D    RECORD TERMINATOR           001D  GROUP SEPARATOR
1E    FIELD TERMINATOR            001E  RECORD SEPARATOR
1F    SUBFIELD DELIMITER          001F  UNIT SEPARATOR

20    SPACE (BLANK)               0020  SPACE
21    EXCLAMATION MARK            0021  EXCLAMATION MARK
22    QUOTATION MARK              0022  QUOTATION MARK
23    NUMBER SIGN                 0023  NUMBER SIGN
24    DOLLAR SIGN                 0024  DOLLAR SIGN
25    PERCENT SIGN                0025  PERCENT SIGN
26    AMPERSAND                   0026  AMPERSAND
27    APOSTROPHE                  0027  APOSTROPHE
28    OPENING PARENTHESIS         0028  LEFT PARENTHESIS
29    CLOSING PARENTHESIS         0029  RIGHT PARENTHESIS
2A    ASTERISK                    002A  ASTERISK
2B    PLUS SIGN                   002B  PLUS SIGN
2C    COMMA                       002C  COMMA
2D    HYPHEN-MINUS                002D  HYPHEN-MINUS
2E    PERIOD (DECIMAL POINT)      002E  FULL STOP
2F    SLASH                       002F  SOLIDUS

30    DIGIT ZERO                  0030  DIGIT ZERO
      THROUGH                           THROUGH
39    DIGIT NINE                  0039  DIGIT NINE

3A    COLON                       003A  COLON
3B    SEMICOLON                   003B  SEMICOLON
3C    LESS-THAN SIGN              003C  LESS-THAN SIGN
      (OPENING ANGLE BRACKET)
3D    EQUALS SIGN                 003D  EQUALS SIGN
3E    GREATER-THAN SIGN           003E  GREATER-THAN SIGN
      (CLOSING ANGLE BRACKET)
3F    QUESTION MARK               003F  QUESTION MARK
40    COMMERCIAL AT               0040  COMMERCIAL AT

41    CAPITAL A                   0041  LATIN CAPITAL A
      THROUGH                           THROUGH
5A    CAPITAL Z                   005A  LATIN CAPITAL Z

5B    OPENING SQUARE BRACKET      005B  LEFT SQUARE BRACKET
5C    REVERSE SLASH               005C  REVERSE SOLIDUS
5D    CLOSING SQUARE BRACKET      005D  RIGHT SQUARE BRACKET
5E    SPACING CIRCUMFLEX          005E  CIRCUMFLEX ACCENT
5F    SPACING UNDERSCORE          005F  LOW LINE
60    SPACING GRAVE               0060  GRAVE ACCENT

61    SMALL A                     0061  LATIN SMALL A
      THROUGH                           THROUGH
7A    SMALL Z                     007A  LATIN SMALL Z

7B    OPENING CURLY BRACKET       007B  LEFT CURLY BRACKET
7C    VERTICAL BAR (FILL)         007C  VERTICAL LINE
7D    CLOSING CURLY BRACKET       007D  RIGHT CURLY BRACKET
7E    SPACING TILDE               007E  TILDE

A1    UPPERCASE POLISH L          0141  LATIN CAPITAL LETTER L WITH STROKE
A2    UPPERCASE SCANDINAVIAN O    00D8  LATIN CAPITAL LETTER O WITH STROKE
A3    UPPERCASE D WITH CROSSBAR   0110  LATIN CAPITAL LETTER D WITH STROKE
A4    UPPERCASE ICELANDIC THORN   00DE  LATIN CAPITAL LETTER THORN (Icelandic)
A5    UPPERCASE DIGRAPH AE        00C6  LATIN CAPITAL LIGATURE AE
A6    UPPERCASE DIGRAPH OE        0152  LATIN CAPITAL LIGATURE OE
A7    SOFT SIGN (PRIME)           02B9  MODIFIER LETTER PRIME
A8    DOT IN MIDDLE OF LINE       00B7  MIDDLE DOT
A9    MUSICAL FLAT                266D  MUSIC FLAT SIGN
AA    SUBSCRIPT PATENT MARK       00AE  REGISTERED SIGN
AB    PLUS OR MINUS               00B1  PLUS-MINUS SIGN
AC    UPPERCASE O-HOOK            01A0  LATIN CAPITAL LETTER O WITH HORN
AD    UPPERCASE U-HOOK            01AF  LATIN CAPITAL LETTER U WITH HORN
AE    ALIF                        02BE  MODIFIER LETTER RIGHT HALF RING

B0    AYN                         02BF  MODIFIER LETTER LEFT HALF RING
B1    LOWERCASE POLISH L          0142  LATIN SMALL LETTER L WITH STROKE
B2    LOWERCASE SCANDINAVIAN O    00F8  LATIN SMALL LETTER O WITH STROKE
B3    LOWERCASE D WITH CROSSBAR   0111  LATIN SMALL LETTER D WITH STROKE
B4    LOWERCASE ICELANDIC THORN   00FE  LATIN SMALL LETTER THORN (Icelandic)
B5    LOWERCASE DIGRAPH AE        00E6  LATIN SMALL LIGATURE AE
B6    LOWERCASE DIGRAPH OE        0153  LATIN SMALL LIGATURE OE
B7    HARD SIGN (DOUBLE PRIME)    02BA  MODIFIER LETTER DOUBLE PRIME
B8    LOWERCASE TURKISH I         0131  LATIN SMALL LETTER DOTLESS I
B9    BRITISH POUND               00A3  POUND SIGN
BA    LOWERCASE ETH               00F0  LATIN SMALL LETTER ETH (Icelandic)
BC    LOWERCASE O-HOOK            01A1  LATIN SMALL LETTER O WITH HORN
BD    LOWERCASE U-HOOK            01B0  LATIN SMALL LETTER U WITH HORN

C0    DEGREE SIGN                 00B0  DEGREE SIGN
C1    LOWERCASE SCRIPT L          2113  SCRIPT SMALL L
C2    PHONO COPYRIGHT MARK        2117  SOUND RECORDING COPYRIGHT
C3    COPYRIGHT MARK              00A9  COPYRIGHT SIGN
C4    SHARP                       266F  MUSIC SHARP SIGN
C5    INVERTED QUESTION MARK      00BF  INVERTED QUESTION MARK
C6    INVERTED EXCLAMATION MARK   00A1  INVERTED EXCLAMATION MARK

E0    PSEUDO QUESTION MARK        0309  COMBINING HOOK ABOVE
E1    GRAVE                       0300  COMBINING GRAVE ACCENT (Varia)
E2    ACUTE                       0301  COMBINING ACUTE ACCENT (Oxia)
E3    CIRCUMFLEX                  0302  COMBINING CIRCUMFLEX ACCENT
E4    TILDE                       0303  COMBINING TILDE
E5    MACRON                      0304  COMBINING MACRON
E6    BREVE                       0306  COMBINING BREVE (Vrachy)
E7    SUPERIOR DOT                0307  COMBINING DOT ABOVE
E8    UMLAUT (DIAERESIS)          0308  COMBINING DIAERESIS (Dialytika)
E9    HACEK                       030C  COMBINING CARON
EA    CIRCLE ABOVE (ANGSTROM)     030A  COMBINING RING ABOVE
EB    LIGATURE, FIRST HALF        FE20  COMBINING LIGATURE LEFT HALF
EC    LIGATURE, SECOND HALF       FE21  COMBINING LIGATURE RIGHT HALF
ED    HIGH COMMA, OFF CENTER      0315  COMBINING COMMA ABOVE RIGHT
EE    DOUBLE ACUTE                030B  COMBINING DOUBLE ACUTE ACCENT
EF    CANDRABINDU                 0310  COMBINING CANDRABINDU

F0    CEDILLA                     0327  COMBINING CEDILLA
F1    RIGHT HOOK (OGONEK)         0328  COMBINING OGONEK
F2    DOT BELOW                   0323  COMBINING DOT BELOW
F3    DOUBLE DOT BELOW            0324  COMBINING DIAERESIS BELOW
F4    CIRCLE BELOW                0325  COMBINING RING BELOW
F5    DOUBLE UNDERSCORE           0333  COMBINING DOUBLE LOW LINE
F6    UNDERSCORE                  0332  COMBINING LOW LINE
F7    LEFT HOOK (COMMA BELOW)     0326  COMBINING COMMA BELOW
F8    RIGHT CEDILLA               (No recommendation yet)
F9    UPADHMANIYA                 032E  COMBINING BREVE BELOW
FA    DOUBLE TILDE, FIRST HALF    FE22  COMBINING DOUBLE TILDE LEFT HALF
FB    DOUBLE TILDE, SECOND HALF   FE23  COMBINING DOUBLE TILDE RIGHT HALF

FE    HIGH COMMA, CENTERED        0313  COMBINING COMMA ABOVE (Psili)
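
As an illustration of how a table like the one above might be used, the
following Python sketch holds a small subset of the ANSEL rows in a
dictionary and checks that each entry survives a round trip. The names
`ANSEL_TO_UNICODE`, `to_unicode` and `round_trips` are hypothetical. The
sketch covers single-code lookups only and does not address how marks are
ordered relative to base letters.

```python
# Subset of the table above: USMARC (ANSEL) code -> Unicode code point.
ANSEL_TO_UNICODE = {
    0xA1: 0x0141,  # uppercase Polish L
    0xA3: 0x0110,  # uppercase D with crossbar
    0xAA: 0x00AE,  # subscript patent mark
    0xB3: 0x0111,  # lowercase D with crossbar
    0xBA: 0x00F0,  # lowercase eth
    0xE2: 0x0301,  # acute
    0xEB: 0xFE20,  # ligature, first half
    0xEC: 0xFE21,  # ligature, second half
    0xF7: 0x0326,  # left hook (comma below)
}

# Reverse table, built by swapping keys and values.
UNICODE_TO_ANSEL = {ucs: code for code, ucs in ANSEL_TO_UNICODE.items()}


def to_unicode(code: int) -> int:
    """Look up one USMARC code; raises KeyError if it has no mapping,
    as is the case for F8 (right cedilla) at the time of this report."""
    return ANSEL_TO_UNICODE[code]


def round_trips(code: int) -> bool:
    """Return True if USMARC -> Unicode -> USMARC restores the code."""
    return UNICODE_TO_ANSEL[to_unicode(code)] == code
```

Every code in this subset round-trips because no two entries share a
Unicode target.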

GREEK LETTERS

USMARC Character                  Unicode/UCS Character
Code  Name                        Code  Name

61    ALPHA                       03B1  GREEK SMALL LETTER ALPHA
62    BETA                        03B2  GREEK SMALL LETTER BETA
63    GAMMA                       03B3  GREEK SMALL LETTER GAMMA
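
Codes 61 through 63 also appear in the Basic Latin table, so a lookup has
to know which USMARC character set is in effect. A minimal Python sketch,
assuming hypothetical set labels `basic_latin` and `greek_symbols` and
hypothetical function names:

```python
# One table per USMARC character set; the same code can appear in each.
TABLES = {
    "basic_latin": {0x61: 0x0061, 0x62: 0x0062, 0x63: 0x0063},
    "greek_symbols": {0x61: 0x03B1, 0x62: 0x03B2, 0x63: 0x03B3},
}

# Reverse lookup: Unicode code point -> (character set, USMARC code).
INVERSE = {
    ucs: (charset, code)
    for charset, table in TABLES.items()
    for code, ucs in table.items()
}


def to_unicode(charset: str, code: int) -> str:
    """Map a USMARC code to a Unicode character within the given set."""
    return chr(TABLES[charset][code])


def to_usmarc(ch: str) -> tuple:
    """Map a Unicode character back to (character set, USMARC code)."""
    return INVERSE[ord(ch)]
```

Because the Greek letters map to Greek script code points rather than to
Latin ones, the reverse lookup can recover the character set as well as
the code.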

SUBSCRIPTS

USMARC Character                       Unicode/UCS Character
Code  Name                             Code  Name

28    SUBSCRIPT OPENING PARENTHESIS    208D  SUBSCRIPT LEFT PARENTHESIS
29    SUBSCRIPT CLOSING PARENTHESIS    208E  SUBSCRIPT RIGHT PARENTHESIS
2B    SUBSCRIPT PLUS                   208A  SUBSCRIPT PLUS SIGN
2D    SUBSCRIPT MINUS                  208B  SUBSCRIPT HYPHEN-MINUS
30    SUBSCRIPT 0                      2080  SUBSCRIPT 0
31    SUBSCRIPT 1                      2081  SUBSCRIPT 1
32    SUBSCRIPT 2                      2082  SUBSCRIPT 2
33    SUBSCRIPT 3                      2083  SUBSCRIPT 3
34    SUBSCRIPT 4                      2084  SUBSCRIPT 4
35    SUBSCRIPT 5                      2085  SUBSCRIPT 5
36    SUBSCRIPT 6                      2086  SUBSCRIPT 6
37    SUBSCRIPT 7                      2087  SUBSCRIPT 7
38    SUBSCRIPT 8                      2088  SUBSCRIPT 8
39    SUBSCRIPT 9                      2089  SUBSCRIPT 9

SUPERSCRIPTS

USMARC Character                       Unicode/UCS Character
Code  Name                             Code  Name

28    SUPERSCRIPT OPENING PARENTHESIS  207D  SUPERSCRIPT LEFT PARENTHESIS
29    SUPERSCRIPT CLOSING PARENTHESIS  207E  SUPERSCRIPT RIGHT PARENTHESIS
2B    SUPERSCRIPT PLUS                 207A  SUPERSCRIPT PLUS SIGN
2D    SUPERSCRIPT MINUS                207B  SUPERSCRIPT HYPHEN-MINUS
30    SUPERSCRIPT 0                    2070  SUPERSCRIPT 0
31    SUPERSCRIPT 1                    00B9  SUPERSCRIPT 1
32    SUPERSCRIPT 2                    00B2  SUPERSCRIPT 2
33    SUPERSCRIPT 3                    00B3  SUPERSCRIPT 3
34    SUPERSCRIPT 4                    2074  SUPERSCRIPT 4
35    SUPERSCRIPT 5                    2075  SUPERSCRIPT 5
36    SUPERSCRIPT 6                    2076  SUPERSCRIPT 6
37    SUPERSCRIPT 7                    2077  SUPERSCRIPT 7
38    SUPERSCRIPT 8                    2078  SUPERSCRIPT 8
39    SUPERSCRIPT 9                    2079  SUPERSCRIPT 9
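
The digit rows of the subscript and superscript tables follow a simple
pattern with three exceptions: superscript 1, 2 and 3 map to 00B9, 00B2
and 00B3 rather than to the 207x range. A minimal Python sketch of that
pattern, with hypothetical function names:

```python
# Superscript digits 1-3 sit outside the 2070 range in the table above.
SUPERSCRIPT_EXCEPTIONS = {0x31: 0x00B9, 0x32: 0x00B2, 0x33: 0x00B3}


def _check_digit(code: int) -> None:
    """Reject USMARC codes outside the digit range 30 through 39."""
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"not a USMARC digit code: {code:02X}")


def subscript_digit(code: int) -> int:
    """Map a USMARC subscript digit code (30-39) to U+2080..U+2089."""
    _check_digit(code)
    return 0x2080 + (code - 0x30)


def superscript_digit(code: int) -> int:
    """Map a USMARC superscript digit code (30-39) to its Unicode value,
    using the three exceptions before falling back to the 2070 range."""
    _check_digit(code)
    return SUPERSCRIPT_EXCEPTIONS.get(code, 0x2070 + (code - 0x30))
```

The parentheses, plus and minus rows do not follow this arithmetic and
would still need explicit table entries.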
