L2/99-195

DATE: 1999-09-02

DOC TYPE:

Expert contribution

TITLE:

Proposal to encode mathematical alphanumeric symbols

SOURCE:

Murray Sargent III, Barbara Beeton

PROJECT:

 

STATUS:

Proposal

ACTION ID:

FYI

DUE DATE:

--

DISTRIBUTION:

Worldwide

MEDIUM:

Paper and html

NO. OF PAGES:

5

 

A. Administrative

1. Title

Proposal to encode mathematical alphanumeric symbols

2. Requester's name

Murray Sargent III, Barbara Beeton

3. Requester type

Expert request.

4. Submission date

1999-09-09

5. Requester’s reference

Scientific and Technical Information Exchange (STIX)

6a. Completion

Complete proposal

6b. More information to be provided?

If requested

 

B. Technical -- General

1a. New script? Name?

No.

1b. Addition of characters to existing block? Name?

No.

2. Number of characters

991 new alphanumeric symbols

3. Proposed category

 

4. Proposed level of implementation and rationale

Level 1

5a. Character names included in proposal?

Yes

5b. Character names in accordance with guidelines?

Yes

5c. Character shapes reviewable?

 

6a. Who will provide computerized font?

None needed (they already exist)

6b. Font currently available?

None needed (standard fonts are adequate)

6c. Font format?

na

7a. Are references (to other character sets, dictionaries, descriptive texts, etc.) provided?

Yes.

7b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached?

Not attached, but available.

8. Does the proposal address other aspects of character data processing?

No

 

C. Technical -- Justification

1. Contact with the user community?

Yes. Patrick Ion, Barbara Beeton, Murray Sargent III, MathML W3C Math Working Group

2. Information on the user community?

Professional mathematicians, physicists, astronomers, engineers, and other scientific and technical researchers.

3a. The context of use for the proposed characters?

Used in publication of research mathematics and other hard sciences.

3b. Reference

 

4a. Proposed characters in current use?

Yes

4b. Where?

Worldwide, by scientific and technical publishers and other users of mathematics

5a. Characters should be encoded entirely in BMP?

No, entirely in plane 1.

5b. Rationale

Accurate publication of mathematical and scientific research on the Web is impossible without a comprehensive and accurate collection of symbols including various alphabetic variants in common use.

6. Should characters be kept in a continuous range?

Yes, in order to fit in one 1024-character surrogate block.

7a. Can the characters be considered a presentation form of an existing character or character sequence?

No. A given alphabetic symbol has different semantics when its style is changed and should not be found by the same plain-text search string.

7b. Where?

 

7c. Reference

 

8a. Can any of the characters be considered to be similar (in appearance or function) to an existing character?

Some letterlike symbols look similar to corresponding characters in some alphabets, e.g., some capital script letters. These are left as holes in the proposed code assignments.

8b. Where?

Letterlike symbols

8c. Reference

 

9a. Combining characters or use of composite sequences included?

No

9b. List of composite sequences and their corresponding glyph images provided?

No

10. Characters with any special properties such as control function, etc. included?

No

 

D. SC2/WG2 Administrative

To be completed by SC2/WG2

1. Relevant SC 2/WG 2 document numbers:

 

2. Status (list of meeting number and corresponding action or disposition)

 

3. Additional contact to user communities, liaison organizations etc.

 

4. Assigned category and assigned priority/time frame

 

Other Comments

 

 


Mathematics needs a number of Latin and Greek alphabets that at first thought appear to be font variations of one another, e.g., normal, bold, italic, and script H. However, in any given document these characters have distinct mathematical semantics. For example, a normal H represents a different variable from a bold H, and so on. If one drops these distinctions in plain text, one gets gibberish. Instead of the well-known Hamiltonian formula

 

            ℋ = ∫dτ(εE² + μH²),

you’d get the integral equation (!)

            H = ∫dτ(εE² + μH²).

 

Accordingly, the STIX project requests adding normal, bold, italic, script, etc., Latin and Greek alphabets. Straight encoding leads to 991 characters and loses some useful common information; for example, the variants of H might not all be trivially recognizable as H’s. But it does allow plain text to retain the proper character semantics, and it allows simple (nonrich) search methods to work.
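
The effect on simple search can be illustrated with a minimal Python sketch. It assumes the script H is stored as the existing Letterlike Symbol U+210B (SCRIPT CAPITAL H); the variable names are illustrative only and are not part of the proposal.

```python
# Hamiltonian formula with the script H stored as its own code point (U+210B).
SCRIPT_H = "\u210B"
styled = SCRIPT_H + " = \u222Bd\u03C4(\u03B5E\u00B2 + \u03BCH\u00B2)"

# The same formula after the style distinction has been dropped.
flattened = styled.replace(SCRIPT_H, "H")

# A plain-text search for the field H finds only the field in the styled text,
# but also hits the Hamiltonian once the distinction is lost.
hits_styled = styled.count("H")
hits_flattened = flattened.count("H")
print(hits_styled, hits_flattened)
```

With separate code points, a search for the plain H matches only the magnetic field term; after flattening, it matches both symbols.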

 

The alphabetic symbols encountered in mathematics are given in the following table (exclusions for 25 Letterlike Symbols already encoded are not subtracted from the counts in this table, but are listed in a subsequent table):

 

Math style                  Characters                       Count  Proposed Name Prefix
bold                        a-z, A-Z, 0-9, α-ω, Α-Ω (Greek)  120    MATH BOLD
italic                      a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH ITALIC
bold italic                 a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH BOLD ITALIC
script (calligraphic)       a-z, A-Z                         52     MATH SCRIPT
bold script (calligraphic)  a-z, A-Z                         52     MATH SCRIPT BOLD
fraktur                     a-z, A-Z                         52     MATH FRAKTUR
bold fraktur                a-z, A-Z                         52     MATH BOLD FRAKTUR
open-face                   a-z, A-Z, 0-9                    62     MATH OPEN-FACE
sans-serif                  a-z, A-Z, 0-9                    62     MATH SANS
sans-serif bold             a-z, A-Z, 0-9, α-ω, Α-Ω (Greek)  120    MATH SANS BOLD
sans-serif italic           a-z, A-Z                         52     MATH SANS ITALIC
sans-serif bold italic      a-z, A-Z, α-ω, Α-Ω (Greek)       110    MATH SANS BOLD ITALIC
monospace                   a-z, A-Z, 0-9                    62     MATH MONOSPACE


Note that the choice of particular normal, script, fraktur, open-face, sans-serif, or monospace fonts is beyond the scope of plain text. The upper-case Greek letters Α-Ω are defined by the Unicode Greek character range U+0391 through U+03A9 plus the nabla ∇ (U+2207). The lower-case Greek letters α-ω are defined by the Unicode Greek character range U+03B1 through U+03C9 plus the partial differential sign ∂ (U+2202) and the six glyph variants of ε, θ, κ, φ, ρ, and π, given respectively by a new BMP code (resembling U+220A), U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6. The variants are needed because both glyphs for each of these letters can appear in the same document with different semantics. The upper-case position U+03A2, which corresponds to the final sigma ς, is used for the upper-case Θ variant, which looks like the usual Θ except that the “H” in the middle is replaced by a “-”. This gives 25+1 = 26 upper-case Greek characters and 25+7 = 32 lower-case characters, or 58 characters per Greek alphabet. In addition, corresponding characters in the BMP are used for upright serifed characters when they occur in mathematical expressions.
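
As a cross-check of the counts in the table of math styles, the tally can be reproduced with a short Python sketch. The per-repertoire sizes come from the text above (52 Latin letters, 10 digits, 26 + 32 Greek characters); the data layout and the helper name `style_count` are illustrative assumptions only.

```python
# Repertoire sizes as stated in the text.
LATIN = 26 + 26                # a-z, A-Z
DIGITS = 10                    # 0-9
GREEK = (25 + 1) + (25 + 7)    # upper case incl. nabla; lower case incl. partial and six variants

# (style, has_digits, has_greek), following the table of math styles.
STYLES = [
    ("bold", True, True),
    ("italic", False, True),
    ("bold italic", False, True),
    ("script", False, False),
    ("bold script", False, False),
    ("fraktur", False, False),
    ("bold fraktur", False, False),
    ("open-face", True, False),
    ("sans-serif", True, False),
    ("sans-serif bold", True, True),
    ("sans-serif italic", False, False),
    ("sans-serif bold italic", False, True),
    ("monospace", True, False),
]

def style_count(has_digits, has_greek):
    """Number of characters in one math style."""
    return LATIN + (DIGITS if has_digits else 0) + (GREEK if has_greek else 0)

counts = {name: style_count(d, g) for name, d, g in STYLES}
total = sum(counts.values())
EXCLUSIONS = 25                # Letterlike Symbols already encoded
print(total, total - EXCLUSIONS)
```

The sum over the 13 styles is 1016 code positions, and subtracting the 25 exclusions gives the 991 new characters requested.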

 

 

Proposed Encoding Approach

Outright encoding stores the alphabets in plane 1 for a total of 991 characters as currently entered.  No accented characters are included.  Accented mathematical symbols are always represented by combining character sequences.  These characters fit into 1024 code positions (addressable using one high-surrogate value) by using the following scheme (one column is one hexadecade):

 

Character group               Total number  Layout                                                   # of Columns
13 Latin alphabets            676           42 full columns + 4 chars                                42+ columns
5 Greek alphabets             290           18 full columns + 2 chars                                18+ columns
5 sets of digits              50            3 full columns + 2 chars; ends at end of 1024-char block  3+ columns
Exclusions (below)            25
3 groups minus 25 exclusions  991                                                                    64 columns

 

This is a total of 64 columns, which is a block of 1024 (one surrogate block).  The suggested block is D400…D7FF on plane 1.  Please see proposed explicit code assignments at the end of this document.
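
The column arithmetic can be checked with a minimal Python sketch. It assumes that the three groups are packed so that partially filled columns are shared (which is what the 64-column total implies) and that the block starts at D400 on plane 1, i.e. U+1D400; the variable names are illustrative only.

```python
COLUMN = 16  # code positions per column

# Group sizes from the layout table (reserved holes are still counted here).
groups = {"Latin": 13 * 52, "Greek": 5 * 58, "digits": 5 * 10}

for name, size in groups.items():
    full, extra = divmod(size, COLUMN)
    print(f"{name}: {size} = {full} full columns + {extra}")

positions = sum(groups.values())      # 1016, including the 25 reserved positions
columns = -(-positions // COLUMN)     # ceiling division
block_size = columns * COLUMN

first = 0x1D400                       # D400 on plane 1
last = first + block_size - 1
print(columns, block_size, f"{first:04X}..{last:04X}")
```

Under these assumptions the 1016 positions need 64 columns, which is exactly one 1024-position block ending at D7FF on plane 1.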

 

An alternative to separately encoding each style of mathematical alphanumeric alphabet was given serious consideration (see L2/99-188). It consisted of encoding a math style variant tag for each significant difference in alphabets. However, while technically feasible for representing the required mathematical alphabets, such math style variant tags raised unanswerable questions about what would happen if they were applied outside their intended mathematical domain, for example to accented Latin letters or even to other scripts such as Han characters. This approach was also too close to introducing explicit stylistic markup into the character encoding, a step that the character encoding committees generally agree would be undesirable.

 

Block Title

The proposed block title name is “Mathematical Alphabets”.

 

Proposed ISO Naming Conventions

The character names are those used for the corresponding characters in the BMP with the proposed prefixes given in the table above, but simplified as in the Letterlike Symbols block. For example, the script character ℋ in the Hamiltonian formula above has the proposed name “MATH SCRIPT CAPITAL H”. The code position for this particular character is marked as reserved, since the character already exists in the Letterlike Symbols block with the name “SCRIPT CAPITAL H”. This and other such reserved code positions are listed in the next section.
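
The naming rule can be sketched in Python as follows. The simplification step (dropping the script name and the word LETTER from the BMP name) is an assumption inferred from names such as SCRIPT CAPITAL H; the function `math_name` is hypothetical and not part of the proposal.

```python
import unicodedata

def math_name(prefix, base):
    """Build a proposed-style name from a table prefix and a BMP base character."""
    words = unicodedata.name(base).split()
    # Assumed simplification: drop the script name and the word LETTER,
    # mirroring names such as SCRIPT CAPITAL H in the Letterlike Symbols block.
    kept = [w for w in words if w not in ("LATIN", "GREEK", "LETTER")]
    return prefix + " " + " ".join(kept)

# Names that also appear in the proposal text.
print(math_name("MATH SCRIPT", "H"))
print(math_name("MATH OPEN-FACE", "C"))
print(math_name("MATH ITALIC", "h"))

# A Greek example; the resulting form is only what this assumed rule yields.
print(math_name("MATH BOLD", "\u03B1"))
```

For the three Latin examples the sketch reproduces names that appear in this proposal; the Greek result follows from the assumed rule alone.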

 

Exclusions

The code positions for the following characters should be left <reserved> (unassigned), since these characters already appear in the Letterlike-Symbols block:

 

Math character            Letterlike Symbol character  Code
MATH OPEN-FACE CAPITAL C  DOUBLE-STRUCK CAPITAL C      2102
MATH SCRIPT SMALL G       SCRIPT SMALL G               210A
MATH SCRIPT CAPITAL H     SCRIPT CAPITAL H             210B
MATH FRAKTUR CAPITAL H    BLACK-LETTER CAPITAL H       210C
MATH OPEN-FACE CAPITAL H  DOUBLE-STRUCK CAPITAL H      210D
MATH ITALIC SMALL H       PLANCK CONSTANT              210E
MATH SCRIPT CAPITAL I     SCRIPT CAPITAL I             2110
MATH FRAKTUR CAPITAL I    BLACK-LETTER CAPITAL I       2111
MATH SCRIPT CAPITAL L     SCRIPT CAPITAL L             2112
MATH SCRIPT SMALL L       SCRIPT SMALL L               2113
MATH OPEN-FACE CAPITAL N  DOUBLE-STRUCK CAPITAL N      2115
MATH OPEN-FACE CAPITAL P  DOUBLE-STRUCK CAPITAL P      2119
MATH OPEN-FACE CAPITAL Q  DOUBLE-STRUCK CAPITAL Q      211A
MATH SCRIPT CAPITAL R     SCRIPT CAPITAL R             211B
MATH FRAKTUR CAPITAL R    BLACK-LETTER CAPITAL R       211C
MATH OPEN-FACE CAPITAL R  DOUBLE-STRUCK CAPITAL R      211D
MATH OPEN-FACE CAPITAL Z  DOUBLE-STRUCK CAPITAL Z      2124
MATH FRAKTUR CAPITAL Z    BLACK-LETTER CAPITAL Z       2128
MATH SCRIPT CAPITAL B     SCRIPT CAPITAL B             212C
MATH FRAKTUR CAPITAL C    BLACK-LETTER CAPITAL C       212D
MATH SCRIPT SMALL E       SCRIPT SMALL E               212F
MATH SCRIPT CAPITAL E     SCRIPT CAPITAL E             2130
MATH SCRIPT CAPITAL F     SCRIPT CAPITAL F             2131
MATH SCRIPT CAPITAL M     SCRIPT CAPITAL M             2133
MATH SCRIPT SMALL O       SCRIPT SMALL O               2134
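
The exclusion list can be cross-checked against the existing Letterlike Symbols names with a short Python sketch. It relies on the standard `unicodedata` module; the list structure and variable names are illustrative assumptions.

```python
import unicodedata

# (proposed math name, existing Letterlike Symbols code point), from the exclusions table.
EXCLUSIONS = [
    ("MATH OPEN-FACE CAPITAL C", 0x2102), ("MATH SCRIPT SMALL G", 0x210A),
    ("MATH SCRIPT CAPITAL H", 0x210B), ("MATH FRAKTUR CAPITAL H", 0x210C),
    ("MATH OPEN-FACE CAPITAL H", 0x210D), ("MATH ITALIC SMALL H", 0x210E),
    ("MATH SCRIPT CAPITAL I", 0x2110), ("MATH FRAKTUR CAPITAL I", 0x2111),
    ("MATH SCRIPT CAPITAL L", 0x2112), ("MATH SCRIPT SMALL L", 0x2113),
    ("MATH OPEN-FACE CAPITAL N", 0x2115), ("MATH OPEN-FACE CAPITAL P", 0x2119),
    ("MATH OPEN-FACE CAPITAL Q", 0x211A), ("MATH SCRIPT CAPITAL R", 0x211B),
    ("MATH FRAKTUR CAPITAL R", 0x211C), ("MATH OPEN-FACE CAPITAL R", 0x211D),
    ("MATH OPEN-FACE CAPITAL Z", 0x2124), ("MATH FRAKTUR CAPITAL Z", 0x2128),
    ("MATH SCRIPT CAPITAL B", 0x212C), ("MATH FRAKTUR CAPITAL C", 0x212D),
    ("MATH SCRIPT SMALL E", 0x212F), ("MATH SCRIPT CAPITAL E", 0x2130),
    ("MATH SCRIPT CAPITAL F", 0x2131), ("MATH SCRIPT CAPITAL M", 0x2133),
    ("MATH SCRIPT SMALL O", 0x2134),
]

# Look up the name of each already-encoded character.
existing = {code: unicodedata.name(chr(code)) for _, code in EXCLUSIONS}

for math, code in EXCLUSIONS:
    print(f"U+{code:04X}  {existing[code]:<26} reserves {math}")
```

Each entry pairs a reserved position in the proposed block with the Letterlike Symbols character that already carries that meaning.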

 

 

Discussion

Rendering mathematics requires a fairly sophisticated 2D layout engine. Compared to the complexity of such an engine, handling surrogate pairs is straightforward. Furthermore, surrogate pairs are expected to be handled by computer operating systems, thanks to the strong business case for supporting East Asian characters in plane 2.
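
To show how little work surrogate handling involves, here is a minimal Python sketch of the standard UTF-16 surrogate arithmetic. It assumes the suggested block occupies D400 through D7FF on plane 1 (U+1D400 through U+1D7FF); the function names are illustrative only.

```python
def to_surrogates(cp):
    """Split a supplementary-plane code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)

def from_surrogates(high, low):
    """Recombine a surrogate pair into a code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)

# Assumed block location: D400..D7FF on plane 1.
first, last = 0x1D400, 0x1D7FF

# Collect every high-surrogate value used by the block.
high_values = {to_surrogates(cp)[0] for cp in range(first, last + 1)}
print([hex(h) for h in sorted(high_values)])
print([hex(v) for v in to_surrogates(first)], [hex(v) for v in to_surrogates(last)])
```

Under that assumption the whole block shares a single high-surrogate value, so only the low surrogate varies from character to character.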

 

 

Proposed Code Assignments in Plane 1

 

The proposed code assignments will be given in Unicode layout format shortly. A layout in the form of ISO/IEC 10646-2 can be provided to the editor.