Re: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Peter Kirk (peterkirk@qaya.org)
Date: Thu Dec 11 2003 - 11:32:55 EST

Next message: Arcane Jill: "RE: Text Editors and Canonical Equivalence (was Coloured diacritics)"

Previous message: Peter Kirk: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
In reply to: Mark Davis: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Next in thread: Michael (michka) Kaplan: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Reply: Michael (michka) Kaplan: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Reply: jon@hackcraft.net: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"
Reply: Mark Davis: "Re: Text Editors and Canonical Equivalence (was Coloured diacritics)"

On 11/12/2003 07:40, Mark Davis wrote:

>Peter, here is your original remark. Ken has gracefully filled the gap in
>explaining the higher-level issues, but let's return to that for a minute.
>
>
>
>>>No, surely not. If the wcslen() function is fully Unicode conformant, it
>>>should give the same output whatever the canonically equivalent form of
>>>its input. That more or less implies that it should normalise its input.
>>>
>>>
>
>Talking about looking at the problem "at levels" really obscures the issues.
>Programmers call functions. Those functions don't magically change when one
>achieves a new Level of Enlightenment.
>
>
Mark, don't patronise me. I'm not talking about levels of enlightenment.
I'm not talking about levels in the sense you just used when you
mentioned "higher-level issues". I'm talking about the well-known
concept of levels or layers of programming and of communication protocols.

>The function wcslen is defined as "Determines the number of characters in a
>wide-character string." In C, those are not even defined to be Unicode
>characters. IF Unicode is used, wide-characters (wchar_t) may be codepoints or
>code units, depending on the implementation. The function is not defined -- and
>could never be redefined, without huge breakage -- to return the number of NFC
>codepoints.
>
>
>
I understand that. There is no problem if wcslen is used for the
function it is defined for, in terms of counting storage locations or
units in one of the UTFs. The problem arose when someone else on the
list suggested that this same function could be used to count Unicode
characters. At the time I suggested that this should not be done and
would have problems with Unicode conformance, and that the only
meaningful counting that should be done was of something like default
grapheme clusters. Ken has since convinced me that it is sensible and
conformant to count the number of Unicode code units or code points in a
string as long as one is working with the string as an entity to be
manipulated programmatically. But this should not be done when one is
dealing with any kind of "interpretation" as that must respect canonical
equivalence.
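
To make the distinction concrete, here is a minimal sketch in Python (my
choice for illustration only; nothing on the list used it). It assumes the
standard unicodedata module, and the helper name canonically_equal is
hypothetical. Raw counting and raw comparison treat the two spellings of
e acute as different strings, while a comparison that normalises first
respects canonical equivalence.

```python
import unicodedata

# Two canonically equivalent spellings of "e acute".
precomposed = "\u00e9"    # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"    # U+0065 followed by U+0301 COMBINING ACUTE ACCENT


def canonically_equal(a: str, b: str) -> bool:
    """Hypothetical helper: compare two strings after normalising both to NFD."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Treating the string as an entity to be manipulated: the counts differ.
raw_lengths = (len(precomposed), len(decomposed))

# Binary comparison sees two different strings.
raw_equal = precomposed == decomposed

# "Interpretation" should treat them as the same text.
interpreted_equal = canonically_equal(precomposed, decomposed)
```

Either normalisation form would do for the comparison, as long as both
sides get the same one.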

>Part of the problem is that "character" can be interpreted in a wide variety of
>ways, which is why we were forced into developing more precise terms like code
>units. So in general:
>
>1. If you want a function that returns the number of code units in X, you need
>to call one that is defined to do so.
>2. If you want a function that returns the number of code points in X, you need
>to call one that is defined to do so.
>3. If you want a function that returns the number of code points in toNFC(x),
>you need to call one that is defined to do so.
>4. If you want a function that returns the number of grapheme clusters in X, you
>need to call one that is defined to do so.
>5. If you want a function that returns the number of glyphs in X using font F
>and parameters P, you need to call one that is defined to do so.
>- And so on.
>
>There is a pattern here.
>
>
>
Of course. The original problem was that someone was trying to use a
function defined for one thing to do something different.
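
As a rough illustration of the first four items in the list above, here
is a sketch in Python using only the standard library. All four function
names are made up for this example. The last one is only a crude stand-in
for grapheme cluster counting: it counts code points whose canonical
combining class is zero, which is not the default grapheme cluster
algorithm of UAX #29 and will be wrong for many scripts.

```python
import unicodedata


def count_utf16_code_units(s: str) -> int:
    # Each UTF-16 code unit is two bytes; "utf-16-le" emits no byte order mark.
    return len(s.encode("utf-16-le")) // 2


def count_code_points(s: str) -> int:
    # A Python 3 str is a sequence of code points.
    return len(s)


def count_nfc_code_points(s: str) -> int:
    # Number of code points in toNFC(s).
    return len(unicodedata.normalize("NFC", s))


def count_base_characters(s: str) -> int:
    # Crude approximation (an assumption of this sketch, NOT UAX #29):
    # count only code points with canonical combining class 0.
    return sum(1 for ch in s if unicodedata.combining(ch) == 0)


# "e" + COMBINING ACUTE ACCENT + U+1D11E MUSICAL SYMBOL G CLEF
# (the last needs a surrogate pair in UTF-16).
sample = "e\u0301\U0001D11E"

counts = {
    "code units (UTF-16)": count_utf16_code_units(sample),
    "code points": count_code_points(sample),
    "code points in NFC": count_nfc_code_points(sample),
    "base characters (approx.)": count_base_characters(sample),
}
```

For this sample the four functions give three different answers, which
is exactly why each question needs a function defined to answer it.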

>Of course in reality, there might not be individual functions for these. The
>most commonly used of these functions will always be #1, no matter what one's
>Level of Enlightenment is. That's because people typically need to know how much
>actual storage a string takes. ...
>
Here I disagree. As an application programmer writing for example some
kind of linguistic application, it is totally irrelevant to me how much
actual storage a string takes. Such things should be hidden away from me
by several levels of system software and compilers. An application
programmer doesn't even need to know what this concept means! Seriously!
Beginners, even young children, can be taught simple programming and
string handling without knowing anything about bits and bytes, certainly
without having to know whether the e acute they just typed is stored as
one byte or two. Just as people can and do learn to drive cars without
knowing anything about the nuts and bolts or how the engine works.
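
To illustrate what I mean by storage being hidden, here is a small sketch
in Python (chosen only for illustration; the helper name storage_bytes is
hypothetical). The same e acute occupies a different number of bytes
depending on the encoding and on whether it is precomposed, yet the
application-level string is one character as far as the user is concerned.

```python
def storage_bytes(s: str, encoding: str) -> int:
    """Hypothetical helper: number of bytes s occupies in the given encoding."""
    return len(s.encode(encoding))


precomposed = "\u00e9"    # single code point U+00E9
decomposed = "e\u0301"    # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Bytes needed for the precomposed form in several encodings.
sizes = {
    enc: storage_bytes(precomposed, enc)
    for enc in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}

# The decomposed form takes more room again, even in the same encoding.
decomposed_utf8 = storage_bytes(decomposed, "utf-8")
```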

-- 
Peter Kirk
peter@qaya.org (personal)
peterkirk@qaya.org (work)
http://www.qaya.org/


This archive was generated by hypermail 2.1.5 : Thu Dec 11 2003 - 12:24:22 EST