L2/99-195
DATE: 1999-06-11
DOC TYPE: Expert contribution
TITLE: Proposal to encode mathematical alphanumeric symbols
SOURCE: Murray Sargent III, Barbara Beeton
PROJECT:
STATUS: Proposal
ACTION ID: FYI
DUE DATE: --
DISTRIBUTION: Worldwide
MEDIUM: Paper and HTML
NO. OF PAGES: 5
A. Administrative |
|
1. Title |
Proposal to encode mathematical alphanumeric symbols |
2. Requester's name |
Murray Sargent III, Barbara Beeton |
3. Requester type |
Expert request. |
4. Submission date |
1999-06-11 |
5. Requester’s reference |
Scientific and Technical Information Exchange (STIX) |
6a. Completion |
Complete proposal |
6b. More information to be provided? |
If requested |
B. Technical -- General |
|
1a. New script? Name? |
No. |
1b. Addition of characters to existing block? Name? |
No. |
2. Number of characters |
976 new alphanumeric symbols |
3. Proposed category |
|
4. Proposed level of implementation and rationale |
Level 1 |
5a. Character names included in proposal? |
Yes |
5b. Character names in accordance with guidelines? |
Yes |
5c. Character shapes reviewable? |
|
6a. Who will provide computerized font? |
None needed (they already exist) |
6b. Font currently available? |
None needed (standard fonts are adequate) |
6c. Font format? |
N/A |
7a. Are references (to other character sets, dictionaries, descriptive texts, etc.) provided? |
Yes. |
7b. Are published examples (such as samples from newspapers, magazines, or other sources) of use of proposed characters attached? |
Not attached, but available. |
8. Does the proposal address other aspects of character data processing? |
No |
C. Technical -- Justification |
|
1. Contact with the user community? |
Yes. Patrick Ion, Barbara Beeton, Murray Sargent III |
2. Information on the user community? |
Professional mathematicians, physicists, astronomers, engineers, and other scientific and technical researchers. |
3a. The context of use for the proposed characters? |
Used in publication of research mathematics and other hard sciences. |
3b. Reference |
|
4a. Proposed characters in current use? |
Yes |
4b. Where? |
Worldwide, by scientific and technical publishers and other users of mathematics |
5a. Characters should be encoded entirely in BMP? |
No, entirely in plane 1. |
5b. Rationale |
Accurate publication of mathematical and scientific research on the Web is impossible without a comprehensive and accurate collection of symbols including various alphabetic variants in common use. |
6. Should characters be kept in a continuous range? |
Yes, so that they fit in one 1024-character surrogate block. |
7a. Can the characters be considered a presentation form of an existing character or character sequence? |
No. A given alphabetic symbol has different semantics when its style is changed and should not be found by the same plain-text search string. |
7b. Where? |
|
7c. Reference |
|
8a. Can any of the characters be considered to be similar (in appearance or function) to an existing character? |
Some existing letterlike symbols coincide with characters in the proposed alphabets, e.g., some capital script letters. The corresponding positions are left as holes in the proposed code assignments. |
8b. Where? |
Letterlike symbols |
8c. Reference |
|
9a. Combining characters or use of composite sequences included? |
No |
9b. List of composite sequences and their corresponding glyph images provided? |
No |
10. Characters with any special properties such as control function, etc. included? |
No |
D. SC2/WG2 Administrative
To be completed by SC2/WG2 |
|
1. Relevant SC 2/WG 2 document numbers: |
|
2. Status (list of meeting number and corresponding action or disposition) |
|
3. Additional contact to user communities, liaison organizations etc. |
|
4. Assigned category and assigned priority/time frame |
|
Other Comments |
|
Mathematics needs a number of Latin and Greek alphabets that at first glance appear to be font variations of one another, e.g., normal, bold, italic, and script H. Within any given document, however, these characters have distinct mathematical semantics: a normal H represents a different variable from a bold H, and so on. If these distinctions are dropped in plain text, the result is gibberish. Instead of the well-known Hamiltonian formula
ℋ = ∫dτ(εE² + μH²),
you’d get the integral equation (!)
H = ∫dτ(εE² + μH²).
Accordingly, the STIX project requests the addition of normal, bold, italic, script, etc., Latin and Greek alphabets. Straight encoding leads to 976 characters and loses some useful common information; for example, the variants of H might not all be trivially recognizable as H’s. But it does allow plain text to retain the proper character semantics, and it allows simple (nonrich) search methods to work.
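The search argument can be illustrated with a minimal Python sketch. It relies only on the script H that already exists in the Letterlike Symbols block (U+210B); the formula strings and the helper name `find_variable` are illustrative assumptions and not part of the proposal.

```python
# Illustrative sketch: a styled letter encoded as its own character is not
# matched by a plain-text search for the ordinary letter.
SCRIPT_H = "\u210b"  # SCRIPT CAPITAL H, already in the Letterlike Symbols block

# Hamiltonian formula with distinct characters for the two meanings of H.
with_styles = SCRIPT_H + " = \u222bd\u03c4(\u03b5E\u00b2 + \u03bcH\u00b2)"
# The same text after the style distinction has been dropped.
without_styles = with_styles.replace(SCRIPT_H, "H")


def find_variable(text, symbol):
    """Return every index at which `symbol` occurs in `text` (hypothetical helper)."""
    return [i for i, ch in enumerate(text) if ch == symbol]


print(find_variable(with_styles, "H"))     # [15]: only the field H
print(find_variable(without_styles, "H"))  # [0, 15]: Hamiltonian and field conflated
```

With separately encoded characters, a simple search for the field variable H finds one occurrence; once the distinction is dropped, the Hamiltonian is reported as well.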
Math style | Characters | Count | Proposed Name Prefix |
bold | a-z, A-Z, 0-9, α-ω, Α-Ω (Greek) | 117 | MATH BOLD |
italic | a-z, A-Z, α-ω, Α-Ω (Greek) | 107 | MATH ITALIC |
bold italic | a-z, A-Z, α-ω, Α-Ω (Greek) | 107 | MATH BOLD ITALIC |
calligraphic (script) | a-z, A-Z | 52 | MATH SCRIPT |
bold calligraphic (script) | a-z, A-Z | 52 | MATH SCRIPT BOLD |
fraktur | a-z, A-Z | 52 | MATH FRAKTUR |
open-face | a-z, A-Z, 0-9 | 62 | MATH OPEN-FACE |
open-face italic | a-z, A-Z | 52 | MATH OPEN-FACE ITALIC |
sans-serif | a-z, A-Z, 0-9 | 62 | MATH SANS |
sans-serif bold | a-z, A-Z, 0-9, α-ω, Α-Ω (Greek) | 117 | MATH SANS BOLD |
sans-serif italic | a-z, A-Z | 52 | MATH SANS ITALIC |
sans-serif bold italic | a-z, A-Z, α-ω, Α-Ω (Greek) | 107 | MATH SANS BOLD ITALIC |
monospace | a-z, A-Z, 0-9 | 62 | MATH MONOSPACE |
The choice of particular normal, script, fraktur, open-face, sans-serif, or monospace fonts is beyond the scope of plain text. The upper-case Greek letters Α-Ω are defined by the Unicode Greek character range U+0391 through U+03A9 plus the nabla ∇ (U+2207). The lowercase letters α-ω are defined by the Unicode Greek character range U+03B1 through U+03C9 plus the partial differential sign ∂ (U+2202) and the glyph variants of ε, θ, φ, and π, given by U+220A, U+03D1, U+03D5, and U+03D6 (since both glyphs for each of these can appear in the same document with different semantics). This gives 24+1 upper-case Greek characters (when allocating, include the <reserved> character at U+03A2 to facilitate case changes) and 25+5 lowercase characters. In addition, the corresponding characters in the BMP are used for upright serifed characters when they occur in mathematical expressions.
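The counts in the style table can be cross-checked with a short Python sketch. The numbers come from the text above; the dictionary layout, the helper name `style_count`, and the decision not to count the <reserved> position at U+03A2 are assumptions made for illustration.

```python
# Sketch of the character-count arithmetic behind the style table.
LATIN = 26 + 26        # a-z, A-Z
DIGITS = 10            # 0-9
GREEK_UPPER = 24 + 1   # U+0391..U+03A9 without reserved U+03A2, plus nabla
GREEK_LOWER = 25 + 5   # U+03B1..U+03C9, plus partial differential and 4 glyph variants
GREEK = GREEK_UPPER + GREEK_LOWER

# proposed name prefix -> (includes digits, includes Greek)
STYLES = {
    "MATH BOLD": (True, True),
    "MATH ITALIC": (False, True),
    "MATH BOLD ITALIC": (False, True),
    "MATH SCRIPT": (False, False),
    "MATH SCRIPT BOLD": (False, False),
    "MATH FRAKTUR": (False, False),
    "MATH OPEN-FACE": (True, False),
    "MATH OPEN-FACE ITALIC": (False, False),
    "MATH SANS": (True, False),
    "MATH SANS BOLD": (True, True),
    "MATH SANS ITALIC": (False, False),
    "MATH SANS BOLD ITALIC": (False, True),
    "MATH MONOSPACE": (True, False),
}


def style_count(has_digits, has_greek):
    """Number of characters in one style (hypothetical helper)."""
    return LATIN + (DIGITS if has_digits else 0) + (GREEK if has_greek else 0)


counts = {prefix: style_count(*flags) for prefix, flags in STYLES.items()}
total = sum(counts.values())
RESERVED = 25  # positions left unassigned; characters exist in Letterlike Symbols

print(counts["MATH BOLD"], counts["MATH ITALIC"], counts["MATH OPEN-FACE"])  # 117 107 62
print(total, total - RESERVED)  # 1001 976
```

Under these assumptions the thirteen styles add up to 1001 code positions, and removing the 25 reserved positions gives the 976 characters requested.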
Outright encoding stores the alphabets in plane 1 for a total of 976 characters as currently entered. No accented characters are included. Accented mathematical symbols are always represented by combining character sequences. These characters fit into 1024 code positions (addressable using one high-surrogate value) by using the following scheme (one column is one hexadecade, i.e., 16 code positions):
Character group | Total number | Layout | # of Columns |
13 Latin alphabets | 676 | 42 full columns + 4 chars | 43 columns |
5 Greek alphabets | 275 | 17 full columns + 3 chars | 17+ columns |
5 sets of digits | 50 | 3 full columns + 2 chars; ends at end of 1024-char block | 3+ columns |
Exclusions (below) | 25 | | |
3 groups minus 25 exclusions | 976 | | 64 columns |
This is a total of 64 columns, i.e., a block of 1024 code positions (one surrogate block). The suggested block is D400…D7FF on plane 1 (U+1D400 through U+1D7FF). See the proposed explicit code assignments at the end of this document.
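The layout and surrogate arithmetic can be sketched in Python as follows. The surrogate computation is the standard UTF-16 mapping; the function name `to_surrogates` and the variable names are illustrative assumptions.

```python
COLUMN = 16                                 # one column (hexadecade) = 16 code positions
BLOCK_FIRST, BLOCK_LAST = 0x1D400, 0x1D7FF  # D400..D7FF on plane 1


def to_surrogates(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


# Full columns and leftover characters for each group in the layout table.
layout = {name: divmod(total, COLUMN)
          for name, total in [("Latin", 676), ("Greek", 275), ("digits", 50)]}
print(layout)  # {'Latin': (42, 4), 'Greek': (17, 3), 'digits': (3, 2)}

block_size = BLOCK_LAST - BLOCK_FIRST + 1
print(block_size, block_size // COLUMN)  # 1024 64

# Both ends of the block share one high surrogate.
print([hex(u) for u in to_surrogates(BLOCK_FIRST)])  # ['0xd835', '0xdc00']
print([hex(u) for u in to_surrogates(BLOCK_LAST)])   # ['0xd835', '0xdfff']
```

Every position in the suggested block therefore uses the same high surrogate, and only the low surrogate varies across its 1024 values.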
We also considered using math variant tags for each of the categories above (see L2/99-188). A major problem with such variants is that they could be applied in principle to all Unicode characters, increasing the complexity of layout algorithms and offering the possibility of misusing the math characters for ordinary rich-text formatting. As defined, the characters are to be used only in mathematical expressions.
The proposed block name is “Mathematical Alphabets”. The character names are those used for the corresponding characters in the BMP with the proposed prefixes given in the table above, but simplified as in the Letterlike Symbols block. For example, the script H in the Hamiltonian formula above has the proposed name “MATH SCRIPT CAPITAL H”. The code position for this particular character is marked as reserved, since the character already exists in the Letterlike Symbols block with the name “SCRIPT CAPITAL H”. This and other such reserved code positions are listed in the next section.
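One possible reading of this naming rule for Latin letters is sketched below in Python. The exact simplification (dropping the words LATIN and LETTER from the BMP name) is an assumption inferred from the example name; the function name `proposed_name` is hypothetical, and Greek names are not covered.

```python
import unicodedata


def proposed_name(prefix, base):
    """Build a proposed name from a style prefix and a BMP base character.

    Assumed simplification: drop the words LATIN and LETTER from the BMP
    name, as the Letterlike Symbols names do.
    """
    words = unicodedata.name(base).split()
    kept = [w for w in words if w not in ("LATIN", "LETTER")]
    return " ".join([prefix] + kept)


print(proposed_name("MATH SCRIPT", "H"))  # MATH SCRIPT CAPITAL H
print(proposed_name("MATH BOLD", "x"))    # MATH BOLD SMALL X
```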
The code positions for the following characters should be left <reserved> (unassigned), since these characters already appear in the Letterlike-Symbols block:
Math character | Letterlike Symbol character | Code |
MATH OPEN-FACE CAPITAL C | DOUBLE-STRUCK CAPITAL C | 2102 |
MATH SCRIPT SMALL G | SCRIPT SMALL G | 210A |
MATH SCRIPT CAPITAL H | SCRIPT CAPITAL H | 210B |
MATH FRAKTUR CAPITAL H | BLACK-LETTER CAPITAL H | 210C |
MATH OPEN-FACE CAPITAL H | DOUBLE-STRUCK CAPITAL H | 210D |
MATH ITALIC SMALL H | PLANCK CONSTANT | 210E |
MATH SCRIPT CAPITAL I | SCRIPT CAPITAL I | 2110 |
MATH FRAKTUR CAPITAL I | BLACK-LETTER CAPITAL I | 2111 |
MATH SCRIPT CAPITAL L | SCRIPT CAPITAL L | 2112 |
MATH SCRIPT SMALL L | SCRIPT SMALL L | 2113 |
MATH OPEN-FACE CAPITAL N | DOUBLE-STRUCK CAPITAL N | 2115 |
MATH OPEN-FACE CAPITAL P | DOUBLE-STRUCK CAPITAL P | 2119 |
MATH OPEN-FACE CAPITAL Q | DOUBLE-STRUCK CAPITAL Q | 211A |
MATH SCRIPT CAPITAL R | SCRIPT CAPITAL R | 211B |
MATH FRAKTUR CAPITAL R | BLACK-LETTER CAPITAL R | 211C |
MATH OPEN-FACE CAPITAL R | DOUBLE-STRUCK CAPITAL R | 211D |
MATH OPEN-FACE CAPITAL Z | DOUBLE-STRUCK CAPITAL Z | 2124 |
MATH FRAKTUR CAPITAL Z | BLACK-LETTER CAPITAL Z | 2128 |
MATH SCRIPT CAPITAL B | SCRIPT CAPITAL B | 212C |
MATH FRAKTUR CAPITAL C | BLACK-LETTER CAPITAL C | 212D |
MATH SCRIPT SMALL E | SCRIPT SMALL E | 212F |
MATH SCRIPT CAPITAL E | SCRIPT CAPITAL E | 2130 |
MATH SCRIPT CAPITAL F | SCRIPT CAPITAL F | 2131 |
MATH SCRIPT CAPITAL M | SCRIPT CAPITAL M | 2133 |
MATH SCRIPT SMALL O | SCRIPT SMALL O | 2134 |
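An implementation that builds these alphabets could redirect the reserved positions to the existing Letterlike Symbols characters. The following Python sketch encodes the table above as a lookup; the dictionary name `RESERVED` and the function `existing_character` are hypothetical, and the proposed names are those of this proposal rather than names known to any character database.

```python
import unicodedata

# Proposed name -> code point of the existing Letterlike Symbols character.
RESERVED = {
    "MATH OPEN-FACE CAPITAL C": 0x2102, "MATH SCRIPT SMALL G": 0x210A,
    "MATH SCRIPT CAPITAL H": 0x210B, "MATH FRAKTUR CAPITAL H": 0x210C,
    "MATH OPEN-FACE CAPITAL H": 0x210D, "MATH ITALIC SMALL H": 0x210E,
    "MATH SCRIPT CAPITAL I": 0x2110, "MATH FRAKTUR CAPITAL I": 0x2111,
    "MATH SCRIPT CAPITAL L": 0x2112, "MATH SCRIPT SMALL L": 0x2113,
    "MATH OPEN-FACE CAPITAL N": 0x2115, "MATH OPEN-FACE CAPITAL P": 0x2119,
    "MATH OPEN-FACE CAPITAL Q": 0x211A, "MATH SCRIPT CAPITAL R": 0x211B,
    "MATH FRAKTUR CAPITAL R": 0x211C, "MATH OPEN-FACE CAPITAL R": 0x211D,
    "MATH OPEN-FACE CAPITAL Z": 0x2124, "MATH FRAKTUR CAPITAL Z": 0x2128,
    "MATH SCRIPT CAPITAL B": 0x212C, "MATH FRAKTUR CAPITAL C": 0x212D,
    "MATH SCRIPT SMALL E": 0x212F, "MATH SCRIPT CAPITAL E": 0x2130,
    "MATH SCRIPT CAPITAL F": 0x2131, "MATH SCRIPT CAPITAL M": 0x2133,
    "MATH SCRIPT SMALL O": 0x2134,
}


def existing_character(proposed_name):
    """Return the existing character for a reserved position, or None."""
    code = RESERVED.get(proposed_name)
    return chr(code) if code is not None else None


h = existing_character("MATH SCRIPT CAPITAL H")
print(f"U+{ord(h):04X}", unicodedata.name(h))       # U+210B SCRIPT CAPITAL H
print(existing_character("MATH SCRIPT CAPITAL A"))  # None: not a reserved position
```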
Rendering mathematics requires a fairly sophisticated 2D layout engine. Compared to the complexity of such an engine, handling surrogate pairs is straightforward. Furthermore, surrogate-pair handling is expected to be provided by computer operating systems, thanks to the strong business case for supporting East Asian characters in plane 2.
The proposed code assignments are given in Unicode layout format in mathalph.pdf. A layout in the form of ISO/IEC 10646-2 can be provided to the editor.