Re: ISO 10646 compliance and EU law

From: Antoine Leca (Antoine10646@leca-marti.org)
Date: Sat Jan 08 2005 - 13:24:04 CST


    Kenneth Whistler answered:

    > Antoine continued:
    >
    <discussion about a Unicode system of rendering Tibetan where the underlying
    characters are English transliterations.>
    >>> Not at all.
    >>
    >> Indeed, it seems there is no necessity to use Unicode defined code
    >> points to represent anything.
    >
    > Not quite.

    Not quite what? Was I still wrong? Where?

    > They

    What is "They" here? Tibetan characters? Unicode codepoints?

    Sorry if I am dense, but I thought I understood the point, and now you are
    writing that I did not catch it, so you have really lost me.

    > [They] represent neither more nor less than what
    > they are supposed to. An assigned Unicode code point associates
    > that code point with a particular abstract character, to create
    > an encoded character.

    Assuming you are speaking about Tibetan characters just as I did, I
    understood your earlier point as saying that while there was a possible
    association, a conforming application was not required to use this
    association; and I agreed with that.
    So what did I miss?

    > U+0062 is the encoded character LATIN SMALL LETTER B, neither
    > more nor less.

    Either I am really stupid, or this is a tautology.

    > As I said, somebody may decide that the letter "b" is then
    > used to represent a chocolate chip cookie recipe, if they
    > want. Who's to stop them? Who's to stop them from doing so
    > now, *regardless* of the encoding? That's the *point*.

    First, I do not want to stop anybody. I do not know about you, but I firmly
    think people should be free to use whatever encoding they want; conforming
    to established standards will give them interoperability, which is the point
    of these standards. The more widely a standard is used, the more people may
    benefit from using it, and this is the road that Unicode (and 10646) intends
    to follow. Still, these standards are not always the most useful in every
    particular case, and it may happen that other choices are made.

    In the case of the Tibetan system, it seems that the editors chose to use a
    Latin transliteration system over the Tibetan codepoints as they are defined
    in Unicode; that is fine, of course. I just guess they will not expect to
    interoperate automatically with many systems.

    >> But there are _no_ Latin letter "b" here; we are dealing with Tibetan
    >> letters, ain't we?
    >
    > No, we are dealing with the encoded Latin letter "b" that someone is
    > then using to represent a Tibetan letter.

    I am sorry, I did not see the problem this way (or at this level). I see a
    use of a character (encoded \x62) to represent part of a Tibetan letter,
    yes. I agree with you that if you consider a stream of these characters, it
    could then be called LATIN SMALL LETTER B and hence conform to The Unicode
    Standard (and be stored in a database that is ignorant of what is happening,
    etc. etc.). I fail to see how this could be interpreted as a Tibetan
    codepoint on another computer, unless there is a dedicated interface.
    I was just pointing out that I expected (Tibetan) Unicode-conforming
    applications to be able to exchange data without those pesky dedicated
    interfaces, which proved to be a wrong assumption.
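
    To show the kind of "dedicated interface" I have in mind, here is a minimal
    sketch in Python. The transliteration table is purely hypothetical (it is
    not the system under discussion); only the Tibetan codepoints are real.

        # A purely illustrative converter: without something like this on the
        # receiving side, the stream stays LATIN SMALL LETTER B and friends.
        TRANSLIT_TO_TIBETAN = {
            "kha": "\u0F41",  # TIBETAN LETTER KHA
            "ka":  "\u0F40",  # TIBETAN LETTER KA
            "ba":  "\u0F56",  # TIBETAN LETTER BA
        }

        def to_tibetan(translit: str) -> str:
            out, i = [], 0
            keys = sorted(TRANSLIT_TO_TIBETAN, key=len, reverse=True)
            while i < len(translit):
                for key in keys:
                    if translit.startswith(key, i):
                        out.append(TRANSLIT_TO_TIBETAN[key])
                        i += len(key)
                        break
                else:
                    out.append(translit[i])  # leave unmapped characters as-is
                    i += 1
            return "".join(out)

        print(to_tibetan("ba kha"))  # Tibetan codepoints appear only here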

    > I think you may be confused simply because transliteration involves
    > the symbolic use of characters from one script to represent
    > characters from another script, and then people may invent creative
    > ways of displaying transliterations that involve protocols other
    > than simple plain text.

    No. I am confused because I am speaking about Tibetan, and you switched the
    discussion toward Latin (or whatever), in a way that may lead an inattentive
    reader to believe that Tibetan should be stored using U+0062.

    >> Or did you switch one level lower, disregarding the semantic meaning
    >> of the translitteration text, to only attach yourself to grapheme
    >> used in the translitteration,
    >
    > Yes. Which is the appropriate level to consider here.
    >
    >> which happens to be English letters in ASCII/UTF-8
    >> encoding?
    >
    > Latin letters.

    I wrote English because it seems to use an English-based scheme. Everybody
    agrees that English is written in the Latin script unless told otherwise;
    but not all Latin letters are used in an English context (here, using an
    English keyboard layout).

    >> To make a more extreme (and dumb) example, let's assume I have an
    >> ISCII-based rendering system, using Roman (reversed for you)
    >> translitterations but not plain English (that is, both A and a would
    >> be written \xA4 if we speak about the grapheme, or \xAC if we speak
    >> about the English letter).
    >
    > This is mixing a couple things -- writing "A" or "a" with \xA4
    > (= U+0905 DEVANAGARI LETTER A)

    I do not have the same copy of the ISCII standard as you have.
    In mine, the codepoint \xA4 is named "Vowel A". And it is not tied to the
    Devanagari (or whatever) script. It is true that Indians usually do
    represent it with the same glyph as the one shown for U+0905, but I do not
    feel non-conformant using the glyph "a" instead, following the first column
    of the table (annex-A) instead of the second.
    And yes, using "a" here is a transliteration, as is using \u0905.

    > would be a transliteration system;
    > writing the English phoneme /ey/ (the pronunciation of the
    > letter "A") with \xAC (= U+090F DEVANAGARI LETTER E)

    In the version of ISCII I consult, \xAC is "Vowel EY".

    > would be a transcription system.

    Yes; what I would highlight is that the difference between transliteration
    and transcription could affect the Latin letters by giving them multiple
    representations. Which was exactly the point you were making about the use
    of "b" for a Tibetan letter or a chocolate chip cookie recipe. Or at least,
    that is how I grasped it.
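
    As a minimal sketch of that distinction, taking the codepoints quoted in
    this thread (\xA4 as U+0905, \xAC as U+090F), the same Latin capital "A"
    ends up encoded differently depending on the mode; the function name is
    mine, purely for illustration.

        # Transliteration maps the grapheme; transcription maps the sound.
        def encode_latin_capital_a(mode: str) -> str:
            if mode == "transliteration":
                return "\u0905"  # DEVANAGARI LETTER A (ISCII \xA4, "Vowel A")
            if mode == "transcription":
                return "\u090F"  # DEVANAGARI LETTER E (ISCII \xAC, "Vowel EY"),
                                 # for the English phoneme /ey/
            raise ValueError(mode)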

    >> Furthermore it exchanges them by adding a signaling 0xEC00
    >> to the ISCII codepoints, while not suming anything to the ASCII
    >> codepoints, resulting in using the ranges 0x000A-0x0040,
    >> 0x005B-0x0060, 0x007B-0x007E, and 0xECA1-0xECFA.
    >>
    >> Can I claim conformance to Unicode/10646 on the basis I am using
    >> codepoints 0020 for SPACE, 002C for COMMA etc., that I do not
    >> destroy surrogates, I do not emit FFFF etc. etc.?
    >
    > Yes.
    >
    > What you do with U+ECA1..U+ECFA is your own private business.

    You are missing my point.
    Internally, the single bytes 0xA1 to 0xFA would be used. Since using them
    directly in interchange would be a violation of Unicode, I explained that to
    circumvent this I shifted them into the PUA.
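
    A minimal sketch of that interchange step, assuming the made-up ISCII-based
    system of my example (the function name is mine, only for illustration):

        # ISCII bytes 0xA1-0xFA are shifted into the Private Use Area by adding
        # 0xEC00, giving U+ECA1..U+ECFA; everything else passes through as-is.
        def to_interchange(internal: bytes) -> str:
            out = []
            for b in internal:
                if 0xA1 <= b <= 0xFA:
                    out.append(chr(0xEC00 + b))  # PUA codepoint, by private agreement
                else:
                    out.append(chr(b))           # ASCII part stays where it is
            return "".join(out)

        print(hex(ord(to_interchange(b"\xA4"))))  # -> 0xeca4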

    > I think perhaps the difficulty you are expressing comes from
    > the assumption that "X conforms to the Unicode Standard" should
    > imply something about a coverage of some particular repertoire
    > with some minimum standards of input and rendering, and so
    > on. But I think that constitutes a different class of claims
    > about software.

    Of course I was considering something along these lines, because it is the
    way everybody understands "Unicode xxx" (xxx=application, encoding, font,
    ...) when they are neither a lawyer nor a regular here.
    I am not qualified to decide whether it is good that way or not. I just
    consider it is not good that Average Joe is mistaken every time he thinks
    about Unicode.

    > Consider it this way. Suppose I have some software that
    > purports to be an editor that "supports Greek". Now a claim
    > like that would reasonably be interpreted as being able to
    > input, edit, display, and print Greek text, and also to
    > perform other typical tasks, perhaps including spellchecking,
    > and so on.

    The mere fact you wrote "perhaps including" above shows that there are no
    standard expectations regarding "Greek support." I know it is not Unicode's
    business to define this. But I also know that a large number of IT
    professionals here and there expect this from Unicode, when it comes to the
    scripts that are not theirs. That is because of slogans like "i18n is
    Unicode" and "Unicode is i18n", when really it is just a brick in the
    foundation (granted, it is an important brick; on this brick we could build
    a church.)

    > I would expect such things *regardless* of
    > whether the implementation internally was using 8859-7 or
    > Unicode or something else to represent the characters.

    Of course.
    Yet, if you do not know the encoding used, you have no clue about the
    exchange possibilities of such software. On the other hand, if someone
    sells it to you as "Unicode", the normal buyer will expect such software to
    interoperate with the other Unicode Greek editors.
    And I guess he will be disgusted when he understands that the "Unicode
    Greek editor" really uses ISCII shifted to 0xECA1-0xECFA.

    And we are back to the initial topic (I guess the choice of Greek was no
    accident): what the EU Commission could probably seek is to enforce the use
    of a common repertoire, to allow the maximum fluidity of texts throughout
    the Union. We have seen that Unicode (or 10646, since it is obtained for
    free from the former) conformance is NOT the goal. Also, I have explained to
    Ms. Keown that coercion here would not be the definitive answer, as we
    agreed above. Still, I guess there is an objective to pursue.

    > But that is really orthogonal to
    > the fundamental conformance issues of ensuring that
    > inside, deep under the covers, 0x039C is being interpreted
    > as GREEK CAPITAL LETTER MU and not some other random thing,

    The fact that it is U+039C, hence GREEK CAPITAL LETTER MU, does not prevent
    the same software from using it to represent, in fact, an uppercase Omega
    (if it is using ROT-12), does it? So without additional information, what
    0x039C or even U+039C means is somewhat random, isn't it?
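
    A minimal sketch of that ROT-12 idea (the function is mine, purely for
    illustration): every codepoint stays a perfectly conformant Greek capital
    letter, yet what the software "means" by it is something else entirely.

        # Rotate within the 24 Greek capital letters (skipping the unassigned
        # U+03A2), so U+039C MU is stored while an uppercase Omega is meant.
        GREEK_CAPS = [chr(c) for c in range(0x0391, 0x03AA) if c != 0x03A2]

        def rot12(text: str) -> str:
            def rot(ch: str) -> str:
                if ch in GREEK_CAPS:
                    return GREEK_CAPS[(GREEK_CAPS.index(ch) + 12) % 24]
                return ch
            return "".join(rot(ch) for ch in text)

        print(rot12("\u039C"))  # prints U+03A9 GREEK CAPITAL LETTER OMEGA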

    Antoine


