>If IPA and Greek are to be mixed, but remain distinguishable, you will have to
>use markup, just as if you had mixed Greek and Coptic. Then you can sort them
>any way you like. If you want uniform, portable sorting methods for text with
>markup, you have to consider whether XML can do what you want.
>Unicode cannot carry the burden of all possible semantics for a particular
>character. We cannot do a correct linguistic sort on Unicode plain text with no
>language markers, and the proposed set of language marker characters cannot
>cover the requirements for 6,700 languages (current Ethnologue count) in more
>than 200 writing systems plus IPA.
Ah! Now, Edward, here's a point you and I can agree on. While I stand with
Michael in not favouring your phonetics-using-XML-markup proposal, I also
thought that his argument based on collation didn't hold up, for the same
reasons. I don't support markup for phonetics, but I *thoroughly* agree that, in
a multilingual context, documents must contain markup to indicate the language
of given strings, and that there is a need for a standardised set of tags that
covers *all* of the world's languages, not just the small set currently covered
by certain ISO standards. I expect that not long from now we will see Hmong, New
Tai Lue, Tai Dam, Silheti and other scripts included in Unicode, and there are
speakers of languages that use these scripts who would like to be able to
communicate with others in those languages using email, the web, etc. (Forget
about the future; this is a reality now for various languages of Ethiopia, such
as Amharic, and quite possibly many languages from India and also from the
Americas.) They will want to have tags for their languages. Granted, adding a
tag for a given variety of Hmong to a standard does not mean that browsers will
suddenly display the appropriate script. I do know, however, that if the tag
isn't added, browsers will very likely not display that script.
I'm particularly concerned about two things related to language tags:
- internet documents (particularly HTML) typically use two-letter codes to
indicate a language; assuming these are case-insensitive ASCII letters only,
that allows for 26^2 = 676 languages out of 6700+.
- MS Windows uses LANGIDs in various contexts, and these are 16 bits long,
divided as follows: primary lang id is 10 bits, secondary lang id is 6 bits; in
each case, one bit is reserved for user-definable IDs. Assuming that the
user-defined values will never be standardised, this allows for 2^9 = 512
primary lang ids and 2^5 = 32 secondary lang ids. The number of secondary ids is
ample (as far as I'm aware), but 512 out of 6700+ leaves a big gap.
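The arithmetic in the two points above can be checked with a short Python sketch. The function names are made up for illustration and are not part of any API. The bit layout (primary id in the low 10 bits, secondary id in the high 6 bits) follows the convention of the Windows MAKELANGID macro, and the "one bit reserved for user-defined values" rule is taken as stated above rather than verified against every Windows version.

```python
# Sketch of the capacity arithmetic above; all helper names are illustrative.

PRIMARY_BITS = 10    # low bits of a LANGID hold the primary language id
SECONDARY_BITS = 6   # high bits hold the secondary (sub-language) id


def two_letter_code_capacity(alphabet_size: int = 26) -> int:
    """Number of distinct case-insensitive two-letter codes."""
    return alphabet_size ** 2


def standard_capacity(bits: int) -> int:
    """Ids left in a field once one bit is reserved for user-defined values."""
    return 1 << (bits - 1)


def make_langid(primary: int, secondary: int) -> int:
    """Pack both fields into one 16-bit value (same layout as MAKELANGID)."""
    if not 0 <= primary < (1 << PRIMARY_BITS):
        raise ValueError("primary id does not fit in 10 bits")
    if not 0 <= secondary < (1 << SECONDARY_BITS):
        raise ValueError("secondary id does not fit in 6 bits")
    return (secondary << PRIMARY_BITS) | primary


def primary_langid(langid: int) -> int:
    """Extract the low 10 bits."""
    return langid & ((1 << PRIMARY_BITS) - 1)


def secondary_langid(langid: int) -> int:
    """Extract the high 6 bits."""
    return langid >> PRIMARY_BITS


if __name__ == "__main__":
    print(two_letter_code_capacity())          # 676
    print(standard_capacity(PRIMARY_BITS))     # 512
    print(standard_capacity(SECONDARY_BITS))   # 32
    print(hex(make_langid(0x09, 0x01)))        # 0x409
```

Either way the ceiling is far below 6700+: 676 two-letter codes, or 512 standardisable primary ids.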
I would really favour seeing both of these situations change, though I know that
may be almost as hard as selling the Unicode and ISO people on adding a dotless
j.  :-)
Peter
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:48 EDT