From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Jan 05 2005 - 16:24:20 CST
Antoine continued:
> On Wednesday, January 5th, 2005 19:17Z Kenneth Whistler wrote:
>
> >> The Tibetan characters are _never_ encoded using Unicode in this
> >> process, are they?
> >> Looks like a clear case of nonconformance to me.
> >
> > Not at all.
>
> Indeed, it seems there is no necessity to use Unicode defined code points to
> represent anything.
Not quite. They represent neither more nor less than what
they are supposed to. An assigned Unicode code point associates
that code point with a particular abstract character, to create
an encoded character.
U+0062 is the encoded character LATIN SMALL LETTER B, neither
more nor less.
What people choose to *do* with that "b" is their own business,
including any and all weird semiotic usages they may choose
to put it to.
As I said, somebody may decide that the letter "b" is then
used to represent a chocolate chip cookie recipe, if they
want. Who's to stop them? Who's to stop them from doing so
now, *regardless* of the encoding? That's the *point*.
> > The Unicode *conformance* issue there is whether the Latin
> > letter "b" used in the Wylie transliteration is correctly
> > represented as U+0062, and whether, if using UTF-16, that
> > shows up in stored data and strings as a 16-bit code unit,
> > 0x0062, or if using UTF-8, that shows up in stored data
> > and strings as an 8-bit code unit, 0x62, and so on.
>
> O
> - O
> O
>
> But there is _no_ Latin letter "b" here; we are dealing with Tibetan
> letters, aren't we?
No, we are dealing with the encoded Latin letter "b" that someone is then
using to represent a Tibetan letter.
In some other context, they might be using it to represent the second
element of the English alphabet, or they might be using it to represent
a bilabial voiced stop, or ..., or...
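
The two layers can be kept apart in a short sketch. The first half checks only what the standard fixes: the identity of U+0062 and its code units in UTF-16 and UTF-8. The second half is a higher-level protocol; the one-entry table `WYLIE_TO_TIBETAN` is a hypothetical illustration, not a real Wylie converter.

```python
import unicodedata

# Conformance layer: U+0062 is LATIN SMALL LETTER B, whatever it is used for.
b = "\u0062"
b_name = unicodedata.name(b)
utf16_unit = int.from_bytes(b.encode("utf-16-be"), "big")  # one 16-bit code unit
utf8_units = list(b.encode("utf-8"))                        # one 8-bit code unit

# Higher-level protocol (assumption: a one-entry table for illustration only).
# Nothing in the encoding of "b" changes when this table is applied.
WYLIE_TO_TIBETAN = {"b": "\u0f56"}  # U+0F56 TIBETAN LETTER BA
tibetan = WYLIE_TO_TIBETAN[b]

print(b_name, hex(utf16_unit), [hex(u) for u in utf8_units])
print(unicodedata.name(tibetan))
```

Swapping the table for one that maps "b" to a phoneme, or to a cookie recipe, leaves the first half untouched.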
I think you may be confused simply because transliteration involves
the symbolic use of characters from one script to represent
characters from another script, and then people may invent creative
ways of displaying transliterations that involve protocols other
than simple plain text.
> Or did you switch one level lower, disregarding the semantic meaning of the
> transliteration text, to consider only the graphemes used in the
> transliteration,
Yes. Which is the appropriate level to consider here.
> which happens to be English letters in ASCII/UTF-8
> encoding?
Latin letters. In Unicode. (But it doesn't really matter, because
the argument would be exactly the same for *any* character encoding
that includes characters from more than one script, being used
this way.)
> To make a more extreme (and dumb) example, let's assume I have an
> ISCII-based rendering system, using Roman (reversed for you)
> transliterations but not plain English (that is, both A and a would be
> written \xA4 if we speak about the grapheme, or \xAC if we speak about the
> English letter).
This is mixing a couple things -- writing "A" or "a" with \xA4
(= U+0905 DEVANAGARI LETTER A) would be a transliteration system;
writing the English phoneme /ey/ (the pronunciation of the
letter "A") with \xAC (= U+090F DEVANAGARI LETTER E) would be
a transcription system. But never mind, since it doesn't impact
the answer to your question below.
> Furthermore it exchanges them by adding a signaling 0xEC00
> to the ISCII codepoints, while not adding anything to the ASCII codepoints,
> resulting in using the ranges 0x000A-0x0040, 0x005B-0x0060, 0x007B-0x007E,
> and 0xECA1-0xECFA.
>
> Can I claim conformance to Unicode/10646 on the basis I am using codepoints
> 0020 for SPACE, 002C for COMMA etc., that I do not destroy surrogates, I do
> not emit FFFF etc. etc.?
Yes.
What you do with U+ECA1..U+ECFA is your own private business. And
if you want to define those code points as being an EC00-shift
ISCII transliteration (or transcription) system for English, more
power to you.
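
A minimal sketch of such an EC00-shift scheme, assuming the input is a byte string mixing ASCII and ISCII bytes in the range 0xA1..0xFA. The function name `exchange` and the error handling are illustrative choices, not part of any real library.

```python
import unicodedata

ISCII_SHIFT = 0xEC00  # the "signaling" offset from the question


def exchange(data: bytes) -> str:
    """Pass ASCII bytes through; shift ISCII bytes 0xA1..0xFA into the PUA."""
    out = []
    for byte in data:
        if byte < 0x80:
            out.append(chr(byte))                # ASCII: same code point
        elif 0xA1 <= byte <= 0xFA:
            out.append(chr(ISCII_SHIFT + byte))  # U+ECA1..U+ECFA
        else:
            raise ValueError(f"byte 0x{byte:02X} is outside the scheme")
    return "".join(out)


text = exchange(b"\xa4, \xac")
wire = text.encode("utf-16-be")
categories = [unicodedata.category(ch) for ch in text]
print(wire.hex(), categories)
```

Every shifted code point falls in the Private Use Area (general category `Co`), so the standard assigns it no meaning and the scheme conflicts with nothing.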
>
> [ Or is there a special case for the Latin letters that disallow this? ]
No.
> Second question, if the above is "Yes I can claim conformance", what is the
> point of claiming conformance to Unicode/10646 (in such a case)?
The point is that you would be guaranteeing to a recipient of your
data that (assuming you were using UTF-16), 0x0020 was SPACE and
0x002C was COMMA and 0xECA1 was PRIVATE USE CHARACTER-ECA1 and so on.
And you would be guaranteeing to a recipient that such data was
not jpeg or mim or GB2312 plain text or any other conceivable thing
that some bag of binary coming down a wire could be.
What conformance to the Unicode Standard won't buy you is any
comprehension by your recipient of what your strange use of
PUA code points and the particulars of your Devanagari transliteration
of English are, nor how to convert it to display on an ISCII system.
For that, you need to convey your higher-level protocol to your
recipient.
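
From the recipient's side, this guarantee might look like the sketch below, assuming the data arrives as big-endian UTF-16. The helper `describe` is hypothetical. Python's `unicodedata` has no name for private-use code points, so the fallback string reports only the general category.

```python
import unicodedata

# Three 16-bit code units as they might arrive on the wire (UTF-16BE assumed).
wire = bytes.fromhex("0020002ceca1")
text = wire.decode("utf-16-be")


def describe(ch: str) -> str:
    """Return the character name, or its general category if it has none."""
    fallback = "<no name; category %s>" % unicodedata.category(ch)
    return unicodedata.name(ch, fallback)


described = [describe(ch) for ch in text]
print(described)
```

The recipient learns that the third code unit is a private-use character and nothing more; what it stands for has to come from the sender's own documentation.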
> I remember Peter Constable remarking once that a process that rings the bell
> when submitted the code 7 is Unicode-conformant.
And he's right.
For that matter, a process that dispenses a cup of hot
tea when submitted the code U+2615 is Unicode-conformant.
In either case, the conformance issue comes down to some
pattern of binary bits in a data stream being interpreted
as a character, according to the assignments and code charts
of the standard.
What happens as a result of that interpretation, or what
protocol might be layered on top of that interpretation, is
up to the creative minds of everybody using characters
to do whatever they want.
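
As a toy sketch of that split: the decoding step below is where conformance lives, and the `ACTIONS` table is an invented protocol layered on top of it. The action strings and the function name `run` are made up for illustration.

```python
import unicodedata

# Hypothetical protocol: what this process does with two particular characters.
ACTIONS = {
    "\u0007": "ring bell",         # U+0007, the BEL control code
    "\u2615": "dispense hot tea",  # U+2615 HOT BEVERAGE
}


def run(data: bytes) -> list:
    """Interpret the bytes as UTF-8, then apply the private protocol."""
    text = data.decode("utf-8")  # conformance: bits -> encoded characters
    return [ACTIONS[ch] for ch in text if ch in ACTIONS]


performed = run("\u2615 and then \u0007".encode("utf-8"))
print(unicodedata.name("\u2615"), performed)
```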
I think perhaps the difficulty you are expressing comes from
the assumption that "X conforms to the Unicode Standard" should
imply something about a coverage of some particular repertoire
with some minimum standards of input and rendering, and so
on. But I think that constitutes a different class of claims
about software.
Consider it this way. Suppose I have some software that
purports to be an editor that "supports Greek". Now a claim
like that would reasonably be interpreted as being able to
input, edit, display, and print Greek text, and also to
perform other typical tasks, perhaps including spellchecking,
and so on. I would expect such things *regardless* of
whether the implementation internally was using 8859-7 or
Unicode or something else to represent the characters.
People might expect more of an editor that claims to be
"a Unicode implementation that supports Greek", simply because
Unicode contains more Greek characters than 8859-7, and because
you might then expect it to support Greek *and* one or more
other scripts as well. But that is really orthogonal to
the fundamental conformance issues of ensuring that
inside, deep under the covers, 0x039C is being interpreted
as GREEK CAPITAL LETTER MU and not some other random thing,
and that 0x03AC is treated equivalently to <0x03B1, 0x0301>,
and so on.
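
Both of those checks can be written down directly with Python's standard `unicodedata` module. This is only a sketch of the two facts named above, not a conformance test suite.

```python
import unicodedata

# 0x039C must be interpreted as GREEK CAPITAL LETTER MU.
mu_name = unicodedata.name("\u039c")

# U+03AC is canonically equivalent to <U+03B1, U+0301>.
precomposed = "\u03ac"       # GREEK SMALL LETTER ALPHA WITH TONOS
decomposed = "\u03b1\u0301"  # alpha followed by COMBINING ACUTE ACCENT

nfd_matches = unicodedata.normalize("NFD", precomposed) == decomposed
nfc_matches = unicodedata.normalize("NFC", decomposed) == precomposed
print(mu_name, nfd_matches, nfc_matches)
```

A process that compares the two forms byte for byte will see them as different; treating them equivalently means comparing after normalization.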
--Ken