From: Jukka K. Korpela (jkorpela@cs.tut.fi)
Date: Thu Apr 06 2006 - 15:04:10 CST
On Thu, 6 Apr 2006, Peter Edberg wrote:
> All of this hinges on the definition of what the exemplar set is supposed to
> cover.
Indeed. And this in turn should depend on the intended _use_ of this
definition. How will the "exemplar character sets" be used in text
processing and other applications?
> From UTS #35 (LDML): "The exemplar character set contains the commonly
> used letters for a given modern form of a language...
It says that the "letter" concept is to be interpreted broadly, but I
don't think we can count ZWJ, for example, as a letter without losing the
whole idea of a "letter" as opposed to a "character". On the other hand,
I have not seen any rationale for defining the set as a set of letters,
or, for that matter, for the odd-looking name "exemplar character set".
I think we can all imagine many possible uses for information about the
use of characters in a language. The one mentioned first in the LDML
specification, namely the choice of encoding, does not sound like a
particularly important one in the Unicode context. The "charset
conversion" usage looks odd ("'Character set' considered harmful"), and it
probably means conversions between encodings. Then there's collation
mentioned, but I fail to see the relevance.
A considerably more informative definition is needed, and it should be
something that different people around the globe can understand in a
reasonably similar manner. I'm afraid the definitions of "exemplar
character sets" will become rather useless if they are set up according
to greatly varying criteria.
The concept "collection of characters used in a language" is vague and
fuzzy. The multitude of possible interpretations needs to be squeezed down
to a small set of manageable definitions, though I'm afraid just two (the
basic set and the auxiliary set) isn't quite enough.
I hope the discussions can start before people waste far too much time
considering the sets and debating them, without knowing what they
are actually trying to define. I'd like to suggest a concrete starting
point, namely to consider whether the following tentative definitions
would be a suitable basis:
- The set of characters that is regarded as the absolute minimum
for writing a language, including punctuation and controls.
Any application should support this set before it can be said
to support the language.
- The set of characters that are considered the basic repertoire
for use in orthographically correct writing of the language,
without any ASCII-era compromises like ambiguous semantics for "-".
This means roughly speaking the characters you can expect to find
in books in the language for a general audience.
I am mainly thinking of these as definitions to be used when selecting
fonts, or when deciding whether some software can be characterized as
supporting typing of the language, or when defining parameters for
OCR scanning, or designing general-purpose input data checking for
data in the language.
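To make the last of these concrete: a sketch of input data checking
against such a set might look like the following (Java with ICU4J again;
the added punctuation list and the sample input are invented for the
example, and deciding what belongs in that list is exactly the kind of
thing the definitions should settle).

    import com.ibm.icu.text.UnicodeSet;
    import com.ibm.icu.util.LocaleData;
    import com.ibm.icu.util.ULocale;

    public class InputCheck {
        public static void main(String[] args) {
            // Start from the letters in the locale's standard exemplar set...
            UnicodeSet allowed = new UnicodeSet();
            allowed.addAll(LocaleData.getInstance(new ULocale("fi"))
                                     .getExemplarSet(0, LocaleData.ES_STANDARD));

            // ...and add the space and punctuation that orthographically
            // correct text also needs: space, basic punctuation, hyphen,
            // en dash, right double quotation mark. This list is made up
            // for the example.
            allowed.addAll(new UnicodeSet(
                "[\\u0020.,;:!?\\u2010\\u2013\\u201D]"));

            String input = "Hyvää päivää!";   // illustrative input only

            // Accept the input only if every character belongs to the set.
            if (allowed.containsAll(input)) {
                System.out.println("accepted");
            } else {
                // Report the characters that fall outside the allowed set.
                UnicodeSet bad = new UnicodeSet().addAll(input).removeAll(allowed);
                System.out.println("rejected; outside the set: "
                                   + bad.toPattern(false));
            }
        }
    }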
-- Jukka "Yucca" Korpela, http://www.cs.tut.fi/~jkorpela/