From: Peter Kirk (peterkirk@qaya.org)
Date: Mon Nov 08 2004 - 09:00:01 CST
On 08/11/2004 12:47, Michael Everson wrote:
> ... Perhaps Ken Whistler and I, in our abundant spare time, might try
> to wordsmith the standard with regard to this issue. But your
> insistence that some legalistic interpretation of that text will
> determine what is and what is not a script is tiresome.
>
As my spare time may be more abundant than Ken's and yours, I have
drafted the following and submitted it to
http://www.unicode.org/reporting.html as an Error Report:
Subject: Characters, Scripts and Semantic Distinctions
According to the Unicode Standard 4.0, section 2.2, sub-section
"Characters, Not Glyphs", p.15, "Characters are the abstract
representations of the smallest components of written language that have
semantic value." However (as Michael Everson agrees with me) the
distinction between corresponding letters in different scripts is not
properly described as "semantic". It is therefore possible to understand
this sub-section as implying that this distinction between letters
should be treated in Unicode as a glyph distinction rather than a
character distinction. This is of course a misunderstanding, because
Unicode does in fact encode corresponding letters in different scripts
as distinct characters. But this misunderstanding has become widespread
and has fuelled a long and acrimonious debate about the proposed
Phoenician script. Therefore, to ensure consistency and minimise
misunderstandings, the text of this sub-section should be amended to
make it clear that corresponding letters in different scripts are
considered distinct characters.
I note that the issue is mentioned in passing in a different context on
p.19, relating only to cases where there is no graphical distinction
between scripts. But a clearer statement in the correct context would be
much preferable.
I propose the following text to be added to p.15, after the sentence
"They represent primarily, but not exclusively, the letters,
punctuation, and other signs that constitute natural language text and
technical notation.":
"The letters used in natural language text are grouped into scripts,
sets of letters which are used together in writing any one language.
Letters in different scripts, even when they correspond either
semantically or graphically, are represented in Unicode by distinct
characters."
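To illustrate the principle stated in this proposed text, here is a minimal sketch using Python's standard unicodedata module. The names LOOKALIKES and describe are chosen here purely for illustration and do not come from the Standard. The capital letter A in Latin, Alpha in Greek and A in Cyrillic are graphically almost identical and correspond to one another, yet each is a distinct character:

```python
import unicodedata

# Three letters that look alike and correspond to one another,
# one each from the Latin, Greek and Cyrillic scripts.
LOOKALIKES = ["\u0041", "\u0391", "\u0410"]


def describe(ch):
    """Return the code point (as U+XXXX) and the character name of ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch))


rows = [describe(ch) for ch in LOOKALIKES]

for code_point, name in rows:
    print(code_point, name)

# Despite the visual similarity, no two of them compare equal.
all_distinct = len(set(LOOKALIKES)) == len(LOOKALIKES)
print("all distinct:", all_distinct)
```

The first word of each character name (LATIN, GREEK, CYRILLIC) is only a rough hint of the script; it is not a formal script property.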
I note that this change also impacts a few special cases such as the use
of the Latin letters Q and W in Cyrillic script for the Kurdish
language. According to the principle clarified here, distinct Cyrillic Q
and W characters should be encoded for Kurdish.
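Whether a particular copy of the Unicode character database contains such characters can be checked by name. The sketch below assumes Python's standard unicodedata module, whose data depends on the Unicode version bundled with the interpreter. The helper find_by_name is hypothetical, and the Cyrillic names passed to it are assumptions made for illustration, not names taken from the Standard 4.0:

```python
import unicodedata


def find_by_name(name):
    """Return the character with this name, or None if the bundled
    Unicode database has no character so named."""
    try:
        return unicodedata.lookup(name)
    except KeyError:
        return None


# The Latin letters certainly exist; the Cyrillic names are assumed
# here for illustration and may or may not be present.
candidates = [
    "LATIN CAPITAL LETTER Q",
    "LATIN CAPITAL LETTER W",
    "CYRILLIC CAPITAL LETTER QA",
    "CYRILLIC CAPITAL LETTER WE",
]

results = {name: find_by_name(name) for name in candidates}

for name, ch in results.items():
    if ch is None:
        print(f"{name}: not in this database")
    else:
        print(f"{name}: U+{ord(ch):04X}")
```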
I would also suggest a separate definition of "script", a concept which
is much used in Chapter 2 of the Standard but nowhere clearly defined.
This definition should include a statement of the criteria by which
Unicode distinguishes script differences, e.g. between Indic scripts,
from graphical differences, e.g. between regular Latin, italic style and
Fraktur. The lack of stated criteria for this has also contributed to
serious misunderstandings concerning Phoenician.
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/
This archive was generated by hypermail 2.1.5 : Mon Nov 08 2004 - 09:14:39 CST