Re: Script Names

From: Mark Davis (markdavis@ispchannel.com)
Date: Mon May 22 2000 - 09:48:53 EDT


This is certainly a good place for discussion, and I was hoping that having a visual chart rather than just a list of names would help to spark that.

A little bit about the background: non-letters are specifically excluded from the list because they are so often shared. By treating them as "Common", the script of a sequence of text can be determined by the characters with "strong" script. It also means that implementations only have to bother looking at / storing script values for letters, not for all characters.
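To make the idea concrete, here is a minimal sketch in Python of how "Common" characters might be ignored when deciding the script of a run of text. The table `SCRIPT_RANGES` and the helpers `script_of` and `strong_scripts` are hypothetical names invented for this illustration, and the ranges are a tiny hand-picked sample, not the proposed TR24 assignments. It also assumes, per the policy above, that anything whose general category is not a letter counts as "Common".

```python
import unicodedata

# Hypothetical, heavily abbreviated script table: (first, last, script name).
# Illustrative only; the actual proposed assignments are in the TR24 charts.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0391, 0x03A9, "Greek"),
    (0x03B1, 0x03C9, "Greek"),
    (0x0410, 0x044F, "Cyrillic"),
]


def script_of(ch):
    """Return the script name for one character, or "Common"."""
    # Assumption: non-letters (digits, punctuation, spaces, marks, ...)
    # are all treated as "Common" and never consult the table.
    if not unicodedata.category(ch).startswith("L"):
        return "Common"
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    # Letters missing from this toy table also fall back to "Common".
    return "Common"


def strong_scripts(text):
    """Return the set of "strong" scripts that occur in text."""
    return {script_of(ch) for ch in text} - {"Common"}
```

With this sketch, a string such as "Hello, world! 123" yields only Latin, because the comma, spaces, exclamation mark and digits contribute nothing; a string with no letters at all yields an empty set.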

That being said, this is at the proposal stage. If there are good reasons for changing this policy we can
certainly do it.

As to the generation, the first version of the list was generated automatically. It was then modified to
break up the blocks that have mixed scripts. It was further refined in the UTC after some discussion. More
discussion will be needed. In particular, the question of whether the new duplicate mathematical
alphabetic characters should be symbols or letters would also affect some of the items in
http://www.unicode.org/unicode/reports/tr24/charts/ScriptChart0.html.

As to your specifics:

The letter modifiers are a bit tricky. Suggestions for changing particular ones are welcome.

You make some good points about some of the Indic characters -- any that are really shared across scripts should go into "Common". Why do you think that U+0B83 should be an L*?

Antoine Leca wrote:

> Mark Davis <mark.davis@us.ibm.com> wrote:
> >
> > There is a new proposed technical report on the Unicode site.
> >
> > document: http://www.unicode.org/unicode/reports/tr24/
>
> Interesting stuff. I believe this is good work, but as always there is
> certainly room for improvement (I think Unicode is an endless work).
>
> As Mark doesn't give any address for discussion, I assume this is
> the correct forum. Please tell if I am wrong.
>
> I am not completely comfortable with the assertions that only "letters",
> but OTOH all "letters", have to be classified. Certainly this is an
> easy-to-grasp barrier, but there are some border cases that look funny
> to me:
> - U+02E4;MODIFIER LETTER SMALL REVERSED GLOTTAL STOP is in the Latin
> "script", while U+02C0;MODIFIER LETTER GLOTTAL STOP and U+02C1;
> MODIFIER LETTER REVERSED GLOTTAL STOP are not...
>
> - Indic vowel marks are discarded: certainly, they cannot occur at
> the beginning of a piece of text (be it a word or a paragraph); but the
> same can be said for some other codes; the first that struck me
> are the Thai and Lao vowel marks that are *not* ordered in the front of
> a syllable (i.e., I speak about sara a, sara aa, sara am, etc.)
>
> - Indic (not Arabic-Indic) digits are not included, although they are
> used only in the context of the relevant script.
>
> - Devanagari OM is the only coded OM sign, while there exist variations
> in other scripts (Gurmukhi is clear in this respect). I was assuming
> the U+0950 should be used for the latter as well, the difference being
> made by the surrounding information ("higher-level protocol"). It
> appears from Mark's tables that I was wrong, because U+0950 seems to
> be reserved for the Devanagari script; so I wonder if we do not need
> some new characters, such as *U+0A50 as Gurmukhi OM ?
> (it now strikes me that U+0AD0 Gujarati OM is already included)
>
> - the same problem, although certainly much more minor, occurs with avagraha:
> it is encoded in the Devanagari block, and variants (which look
> a bit different) are encoded in the Gujarati and Oriya scripts. Fine,
> but what about when Sanskrit is to be written in Bengali, or in any of the
> South Indian scripts? Do we need a bunch of new codes?
>
> - this also reminds me of the status of Tamil aytam U+0B83 "TAMIL SIGN
> VISARGA", which is tagged "Mc", while it appears it may be a real letter
> instead (but it cannot begin a word)
>
> Antoine


