From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 07:29:44 EST
On 10/12/2003 18:42, Kenneth Whistler wrote:
> ...
>
>>And even then the word "interpretation" needs to be clearly
>>defined, see below.
>>
>>
>
>"Interpretation" has been *deliberately* left undefined. It falls
>back to its general English usage, because attempting a
>technical definition of "interpretation" in the context of
>the Unicode Standard runs too far afield from the intended
>area of standardization. The UTC would end up bogged down
>in linguistic and semiotic theory attempting to nail this
>one down.
>
>What *is* clear is that a "distinction in interpretation of
>a character or character sequence" cannot be confused, by
>any careful reader of the standard, with "difference in
>code point or code point sequence". The latter *is* defined
>and totally unambiguous in the standard.
>
>
Thanks for the clarification. We are again talking at different levels.
I am still looking from the point of view of an application programmer
interested in a string as an abstract entity (an object or an abstract
data type) with a meaning or interpretation, but with no interest in the
exact encoding. You are looking at this at a lower level, either that of a
systems programmer or that of an application programmer who is forced to get
into this lower-level detail because of inadequate system support at the
more abstract level.
> ...
>
>Well, then please correct your interpretation of interpretation.
>
><U+00E9> has one code point in it. It has one encoded character in it.
>
><U+0065, U+0301> has two code points in it. It has two encoded
> characters in it.
>
>The two sequences are distinct and distinguished and
>distinguishable -- in terms of their code point or character
>sequences.
>
>The two sequences are canonically equivalent. They are not
>*interpreted* differently, since they both *mean* the same
>thing -- they are both interpreted as referring to the letter of
>various Latin alphabets known as "e-acute".
>
>*That* is what the Unicode Standard "means" by canonical equivalence.
>
>
>
Thanks again for the clarification. Again, I am not interested in code
point sequences but in meaning. I have been forced to get involved in
code point issues when I have found that they have not made the
necessary meaning distinctions. But my interest is essentially at a higher
level, which is why I am trying to push all of these non-meaningful
distinctions down into a low level hidden from my view.
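To illustrate the distinction Ken draws above, here is a minimal sketch
(using Python's standard unicodedata module; any comparable library would
do) showing two code point sequences that differ at the encoding level but
are canonically equivalent, i.e. have the same interpretation:

    import unicodedata

    precomposed = "\u00E9"        # <U+00E9>  LATIN SMALL LETTER E WITH ACUTE
    decomposed = "\u0065\u0301"   # <U+0065, U+0301>  e + COMBINING ACUTE ACCENT

    # Distinct as code point sequences:
    print(precomposed == decomposed)            # False
    print(len(precomposed), len(decomposed))    # 1 2

    # Canonically equivalent: both normalise to the same sequence
    nfc_a = unicodedata.normalize("NFC", precomposed)
    nfc_b = unicodedata.normalize("NFC", decomposed)
    print(nfc_a == nfc_b)                       # True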
>...
>
>If you are operating at a level where the question "is this string
>normalised" is meaningless, then you are talking about text
>content and not about the level where the conformance requirements
>of the Unicode Standard are relevant. No wonder you and others
>are confused.
>
>Of course, if I look on a printed page of text and see the word
>"café" rendered there as a token, it is meaningless to talk about
>whether the é is normalized or not. It just is a manifest token
>of the letter é, rendered on the page. The whole concept of
>Unicode normalization is irrelevant to a user at that level. But
>you cannot infer from that that normalization distinctions cannot
>be made conformantly in the encoded character stores for
>digital representation of text -- which is the relevant field
>where Unicode conformance issues apply.
>
>
>
Ken, now you seem to be trying to define out of existence a level at
which C7-C9 and probably also C10 (at least the part about
canonical-equivalent sequences) are relevant. I accept, because of your
explanation above, that there is a lower level at which they are not
relevant, because it is concerned with encoded character sequences and
not with interpretation. But above that level there is surely a separate
level at which interpretation is relevant, and that is not just the
level of printed texts outside a computer system. If there isn't such a
level, C7-C10 are redundant and meaningless.
At the level I have in mind, all kinds of important processes take place
within a computer system. Some of these are defined by Unicode, e.g.
collation, which is independent of which canonically equivalent form is
used because it starts with normalisation. Others, e.g. automatic
translation, are not defined by Unicode. For all processing at this level,
"Ideally, an implementation would always interpret two canonical-equivalent
character sequences identically" (quote from C9). Rendering is also
effectively at this level. And at this level the question "is this
string normalised?" is meaningless, because we are looking at the text
content and its interpretation, and not at the encoded form. There is of
course an encoded form lying behind that text content, but it should
be no more the concern of the end user than the UTF encoding form or the
pattern of on and off transistors or magnetic particles in the
computer's memory, and it should be hidden from the end user by an API.
> ...
>
>Standards are not adjudicated by case law. They are not
>interpreted by judges. ...
>
Surely in principle they could be, if there were, for example, a dispute
over fulfilment of a contract which specified that a product must
conform to Unicode. But this is a red herring here, I realise.
> ...
>
>>Well, I had stated such things more tentatively to start with, asking
>>for contrary views and interpretations, but received none until now
>>except for Mark's very generalised implication that I had said something
>>wrong (and, incorrectly, that I hadn't read the relevant part of the
>>standard). Please, those of you who do know what is correct, keep us on
>>the right path. Otherwise the confusion will spread.
>>
>>
>
>I'll try. :-)
>
>
Thank you, and thank you for giving your time to this issue.
>--Ken
--
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/