Hannu, you have brought up a number of good points.
Hannu> I don't think any language numbering scheme will work well,
Hannu> because it makes it too hard / impossible to add new
Hannu> languages or variants of languages quickly.
This depends on the numbering scheme. For example, if languages
are simply numbered in the order they appear on the list, a new
language is just given the next available number.
Approaches based on "frequently occurring" languages would make it
somewhat more difficult to assign numbers without conflicts.
Hannu> No list of languages will contain everything the users will
Hannu> need, so the only working solution is string-based,
Hannu> i.e. something like the locale names (e.g. en_us, en_uk,
Hannu> ...) used in many places: - The basic language (first 2 or
Hannu> 3 letters, 3 need to be allowed too) are standardized, so
Hannu> you can always get to the major language - They're easy to
Hannu> extend, as the remaining part can be anything, including
Hannu> user-defined stuff. There are many standards about
Hannu> language identification in this way (e.g. HTML 3) where it
Hannu> has been found a very good and working way to mark
Hannu> language.
As was mentioned in an earlier message on this list, country names are
essentially political creations, and they change too frequently to
serve as the basis for a stable standard.
I happen to agree with that assessment.
Hannu> The only practical way to add custom languages to a
Hannu> enumerated system is to reserve a range of codes for custom
Hannu> languages, but a unknown number does not say anything.
Hannu> A string approach, for example, en_fi still tells you it's
Hannu> english with probably a Finnish specialty (e.g. currencies
Hannu> or numbers might be in Finnish format), but you still know
Hannu> it's basically english. When you get an unknown number,
Hannu> there is nothing you can do.
Again, depending on the numbering system chosen, even an unknown
number may contain enough information to determine what language to
fall back on.
For example, if you interpret the 16-bit value in our proposal as a
Win32 language id, then given any such id, you at least know which
language family it belongs to, because the low 10 bits carry the
primary language and only the upper 6 bits carry the variant.
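As a rough illustration of that fallback, here is a minimal Python
sketch. It assumes the Win32 LANGID layout (primary language in the
low 10 bits, sublanguage in the upper 6 bits). The function names and
the tiny KNOWN_PRIMARY table are made up for this example and are not
part of our proposal.

```python
# Sketch only: split a 16-bit Win32-style language id into its fields.
# Assumed layout: low 10 bits = primary language, high 6 bits = sublanguage.

PRIMARY_MASK = 0x03FF
SUBLANG_SHIFT = 10

# Deliberately incomplete table of primary language ids (illustrative).
KNOWN_PRIMARY = {0x07: "German", 0x09: "English", 0x0B: "Finnish"}


def primary_lang(langid):
    """Return the primary-language field of a 16-bit language id."""
    return langid & PRIMARY_MASK


def sub_lang(langid):
    """Return the sublanguage (variant) field of a 16-bit language id."""
    return (langid >> SUBLANG_SHIFT) & 0x3F


def fallback_language(langid):
    """Name the language to fall back on, or None if even that is unknown."""
    return KNOWN_PRIMARY.get(primary_lang(langid))
```

With this layout, an id such as 0x3C09, whose sublanguage may not be
in anyone's table, still resolves to English as the fallback.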
Hannu> Every application/system will need a mapping table from the
Hannu> language numbers to whatever they use internally, which is
Hannu> probably something locale-style. Having one more mapping
Hannu> table to keep up-to-date is a added burden.
This may be true. Any language id proposal that gets accepted by a
standardization organization will cause someone problems.
But if the approach is designed well, then this kind of maintenance
may be minimal or even unnecessary. The primary cost would be the
initial conversion to the adopted approach.
Hannu> Also, your proposal would introduce a strange set of
Hannu> combining characters to Unicode, whose parsing is different
Hannu> from everything else and thus would complicate the
Hannu> standard.
We are currently interpreting these codepoints as control code types
with "other neutral" bidirectional behavior. We are still trying to
decide if this is a good idea or not.
The only change needed in our code was to check for the existence of
these codepoints in a similar fashion to the pseudocode in the
original mailing.
We hadn't considered viewing them as combining characters, but that
seems to make a certain amount of sense as they must effectively be
"combined." More to think about!
Hannu> Anyway, we have learned from the working and simple
Hannu> internet RFC standard conventions (and many other places),
Hannu> that it's much better to use strings to describe things
Hannu> instead of magic numbers, especially size-limited numbers.
Hannu> Strings are about easy to handle, anyway, and offer
Hannu> infinite extensibility and can be human readable too.
I happen to like strings myself and have a great deal of respect for
the reasons strings are chosen in RFCs, but when processing extremely
large corpora (multi-gigabyte or terabyte), looking up language support
from a string would seem to add a noticeable amount of overhead
compared to a numeric approach.
Hannu> Even if you wanted to do a language enumeration system, it
Hannu> would be better to do it using the extension to about 1
Hannu> million codepoints and reserve from there a range of
Hannu> codepoints for language IDs.
This is the ideal situation. The obvious question is: where will we
get those million codepoints? If we extend our approach to 32
codepoints, then we can construct identifiers that will allow 2^32 - 1
possible language ids (somewhat excessive, but over a million
possibilities :-).
Hannu> I agree that we might need some mechanism to indicate
Hannu> language in Unicode *plain text* files, as everywhere else
Hannu> you already have some form of tagging on top of plain
Hannu> Unicode, e.g. SGML-based, so you can use that for language
Hannu> identification too.
Higher-level language identification works quite well. But when you
get documents with different higher-level markup from different
systems, it makes processing that text annoyingly complicated.
Hannu> But better than some binary enumeration, would be e.g. to
Hannu> define 2 additional characters, LANG_ID_START and
Hannu> LANG_ID_END, and define that language would be indicated
Hannu> with LANG_ID_START <2 or 3 character standard language
Hannu> code, ASCII values only _ or . (optional) <1*N character
Hannu> detail code> (optional) LANG_ID_END This is not really good
Hannu> either, but it would be a little better than the proposed
Hannu> numbering approach, I think. It's about as easy to parse,
Hannu> and doesn't suffer from the limitations of numeric
Hannu> enumeration.
This is a reasonable approach. It may even be nearly as efficient
when scanning text, but there is still the cost of determining the
language from some combination of the values between LANG_ID_START and
LANG_ID_END, particularly if those values represent a string.
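To make the forward-scanning case concrete, here is a small Python
sketch of the bracketed scheme Hannu describes. The two private-use
codepoints standing in for LANG_ID_START and LANG_ID_END, and the
names split_tag and iter_language_runs, are assumptions made for this
example only. Nothing here has been assigned by anyone.

```python
# Sketch only: forward scan of text tagged with
#   LANG_ID_START <2-3 letter code> [("_" | ".") <detail>] LANG_ID_END
# The two codepoints below are hypothetical placeholders.

LANG_ID_START = "\ue000"
LANG_ID_END = "\ue001"


def split_tag(tag):
    """Split 'en_fi' into ('en', 'fi'); detail is None when absent."""
    for i, ch in enumerate(tag):
        if ch in "_.":
            code, detail = tag[:i], tag[i + 1:]
            break
    else:
        code, detail = tag, None
    if not (2 <= len(code) <= 3 and code.isascii() and code.isalpha()):
        raise ValueError("bad language code: %r" % code)
    return code.lower(), detail or None


def iter_language_runs(text):
    """Yield (code, detail, run_text) for each run of tagged text."""
    current = (None, None)  # no language known before the first tag
    run = []
    i = 0
    while i < len(text):
        if text[i] == LANG_ID_START:
            end = text.find(LANG_ID_END, i + 1)
            if end == -1:
                raise ValueError("unterminated language tag")
            if run:
                yield current + ("".join(run),)
                run = []
            current = split_tag(text[i + 1:end])
            i = end + 1
        else:
            run.append(text[i])
            i += 1
    if run:
        yield current + ("".join(run),)
```

Finding the tags is cheap here. The remaining cost is whatever the
application does with the (code, detail) strings afterward.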
To reiterate, using a string to determine the language is noticeably
slower than a numeric approach. In addition, the process is
often implemented by converting the string to a number before the
lookup happens anyway.
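The string-to-number conversion mentioned above might look something
like the following Python sketch. The LanguageRegistry class and its
method names are hypothetical, invented for this example, and do not
describe any particular implementation.

```python
# Sketch only: a string tag is turned into a small integer once, and
# the language support is then found by that integer.

class LanguageRegistry:
    def __init__(self):
        self._ids = {}      # normalized tag string -> small integer
        self._support = []  # small integer -> language support object

    def register(self, tag, support):
        """Associate a tag with its support object; return its number."""
        key = tag.lower()
        if key in self._ids:
            self._support[self._ids[key]] = support
        else:
            self._ids[key] = len(self._support)
            self._support.append(support)
        return self._ids[key]

    def intern(self, tag):
        """String step: normalize and look up the tag. Returns -1 if unknown."""
        return self._ids.get(tag.lower(), -1)

    def support_for(self, lang_num):
        """Numeric step: a bounds check and an index."""
        if 0 <= lang_num < len(self._support):
            return self._support[lang_num]
        return None
```

A numeric scheme would let the text carry the result of intern()
directly, so only support_for() is paid for each tag encountered.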
Consider the case of looking for one of these sequences while scanning
backward through the text. You find LANG_ID_END, you go backward till
you find LANG_ID_START, then you have to look at everything between
LANG_ID_START and LANG_ID_END *again* to determine the language. Add
that to a (possibly) string representation of a language id, and you
have even more overhead looking up the language support.
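The backward scan described above can be sketched in Python as
follows. As before, the two private-use codepoints are hypothetical
stand-ins for LANG_ID_START and LANG_ID_END, and language_before is a
name made up for this example.

```python
# Sketch only: find the language tag in effect before a given position
# by scanning backward. The codepoints are hypothetical placeholders.

LANG_ID_START = "\ue000"
LANG_ID_END = "\ue001"


def language_before(text, pos):
    """Return the tag string of the nearest complete tag before pos, or None."""
    # Pass 1: walk backward until the end marker is found.
    i = pos - 1
    while i >= 0 and text[i] != LANG_ID_END:
        i -= 1
    if i < 0:
        return None
    end = i

    # Pass 2: keep walking backward over the tag body to the start marker.
    while i >= 0 and text[i] != LANG_ID_START:
        i -= 1
    if i < 0:
        return None
    start = i

    # Pass 3: the tag body is visited again to build the string, and the
    # caller still has to map that string to actual language support.
    return text[start + 1:end]
```

With a single numeric value there would be nothing to re-read once
the marker is found.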
The obvious argument against this is that in general, the number of
times a given program needs to scan backward is small compared to
the number of times it needs to scan forward.
Hannu> We should not trade a little of programmer convenience for
Hannu> major long-term limitations in any standard. The
Hannu> programming work will (should be) done once in the
Hannu> framework or OS libraries, anyway, so it won't trouble most
Hannu> programmers, anyway.
I completely agree. Limitations introduced now will come back to hurt
us later. Ideally, we need an approach that is flexible enough to
meet sophisticated research or scholarly needs and is convenient for
implementers.
Hannu> Most applications will have their own higher-level tagging
Hannu> (SGML-style markup will probably dominate), so having a
Hannu> different tagging mechanism for indicating language would
Hannu> require handling two different kinds of markup.
Hannu> The really good and desirable approach, I think, would be
Hannu> to raise the abstraction level of "plain unicode text
Hannu> file", to include e.g. SGML/HTML style tagging as a
Hannu> standardized part of "plain text unicode files". Then we
Hannu> could really easily indicate language, and whatever else
Hannu> you might want, in a standard and easily parsable (and
Hannu> easily ignorable or removable markup) format.
In my opinion, this would be the same as changing the SGML standard to
make the default character set some form of Unicode or 10646. It may
be a good idea, but it would take a *long* time to determine the
impact, and a *longer* time for the changes to propagate.
Hannu> ASCII-style plain text files deserve to die.
"Plain text" certainly makes my life difficult at times, so I guess I
would have to agree :-)
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher                    "The trick is not gaining the knowledge,
Computing Research Lab           but surviving the lessons."
New Mexico State University        -- "Svaha," Charles de Lint
Box 30001, Dept. 3CRL
Las Cruces, NM 88003