From: Mark Davis ☕ (mark@macchiato.com)
Date: Wed Dec 22 2010 - 15:58:19 CST
Mark
*— Il meglio è l’inimico del bene —*
On Wed, Dec 22, 2010 at 09:21, spir <denis.spir@gmail.com> wrote:
> Hello,
>
>
> -1- code validity
> I have long thought only values corresponding to surrogates were invalid
> codes. But I recently discovered both in D's builtin unicode-aware chars &
> strings, and on the site 'fileformat' (
> http://www.fileformat.info/info/unicode/char/ffff/index.htm) that some
> other codes are invalid, like fffe & ffff.
> I'm a bit lost. What's true, then? And where can I find actual and *clear*
> definitions of code validity?
>
See Chapter 3 of http://www.unicode.org/versions/Unicode6.0.0/, especially
the definitions of "well-formed" vs. "ill-formed" code unit sequences.
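As a rough illustration of that distinction (a sketch, not text from the
standard): surrogate code points (D800..DFFF) are not Unicode scalar values,
so they cannot be encoded in well-formed UTF-8. Noncharacters such as FFFE and
FFFF are scalar values, so they can be encoded, but they are reserved for
internal use. The function names below are made up for this example; only
chr() and str.encode() are Python built-ins.

```python
def is_surrogate(cp: int) -> bool:
    """True for code points in the surrogate range D800..DFFF."""
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    """True for FDD0..FDEF and for the last two code points of every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def encodes_as_utf8(cp: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this single code point."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


for cp in (0x0041, 0xD800, 0xFFFE, 0xFFFF):
    print(f"U+{cp:04X}",
          "surrogate" if is_surrogate(cp) else "-",
          "noncharacter" if is_noncharacter(cp) else "-",
          "encodable" if encodes_as_utf8(cp) else "not encodable")
```

So "invalid" can mean two different things: a lone surrogate makes a UTF
ill-formed, while a noncharacter is well-formed but not meant for open
interchange. A library may choose to reject either or both.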
> I also discovered in ICU docs that it does not reject unpaired surrogates (
> http://userguide.icu-project.org/strings#TOC-ICU:-16-bit-Unicode-strings).
> D instead rejects unpaired surrogates and more. ???
>
ICU treats unpaired surrogates as if they were unassigned characters when
manipulating them as strings. That is a common technique; see the discussions
of "UnicodeString" in Chapter 3.
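A minimal sketch of that technique, assuming the string is held as a plain
Python list of 16-bit code units. This is not ICU's API; the function names
are hypothetical. Well-formed pairs are combined into one supplementary code
point, and an unpaired surrogate is simply passed through as its own code
point instead of raising an error.

```python
def code_points(units):
    """Lenient walk over 16-bit code units: pairs combine, lone surrogates pass."""
    out = []
    i, n = 0, len(units)
    while i < n:
        u = units[i]
        is_lead = 0xD800 <= u <= 0xDBFF
        if is_lead and i + 1 < n and 0xDC00 <= units[i + 1] <= 0xDFFF:
            # Well-formed surrogate pair: one supplementary code point.
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        else:
            # BMP code unit, or an unpaired surrogate kept as-is.
            out.append(u)
            i += 1
    return out


def is_well_formed_utf16(units):
    """True if no unpaired surrogate is left after pairing."""
    return not any(0xD800 <= cp <= 0xDFFF for cp in code_points(units))


units = [0x0041, 0xD83D, 0xDE00, 0xD800, 0x0042]
print([f"U+{cp:04X}" for cp in code_points(units)])
print(is_well_formed_utf16(units))
```

A strict implementation (as D appears to be, from your description) would
fail at the point where is_well_formed_utf16() returns False, rather than
carrying the lone surrogate along.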
> -2- grapheme meaningfulness
> I take the opportunity to ask about grapheme (in the unicode sense *)
> validity as well: the "grapheme cluster boundary" algorithm seems to quietly
> allow building meaningless "graphemes" such as base-less (sequences of)
> combining codes. What are we expected to do with them?
>
It depends on what you are trying to do. You can filter out degenerate cases
or keep them. For more information, see http://unicode.org/reports/tr29/
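For example, here is one way to detect and filter the base-less case at the
start of a text. It is only a sketch under a simplifying assumption: "combining
code" is approximated by the General_Category values Mn, Mc, and Me, which is
not exactly the property set that UAX #29 uses. The function names are invented
for the example.

```python
import unicodedata


def starts_with_mark(text: str) -> bool:
    """True if the text begins with a combining mark that has no base."""
    return bool(text) and unicodedata.category(text[0]).startswith("M")


def drop_leading_marks(text: str) -> str:
    """Filter option: remove base-less combining marks at the start of the text."""
    i = 0
    while i < len(text) and unicodedata.category(text[i]).startswith("M"):
        i += 1
    return text[i:]


sample = "\u0301\u0308abc"      # acute + diaeresis with no base, then "abc"
print(starts_with_mark(sample))
print(drop_leading_marks(sample))
```

Keeping them is equally legitimate: a text editor, for instance, usually has
to preserve whatever the user typed. Note that this only looks at the start of
the text; the same situation can arise after a control character or line
break.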
>
> -3- _unique_ ordering
> The "canonical" ordering algorithm does not provide a unique
> representation: codes with the same ordering class (ccc) are not ordered.
> For instance, most (all?) diacritics placed above have the same class (230).
> Thus, <A><dot above><tilde> and <A><tilde><dot above> can both be output of
> ordering, while they represent the same piece of text.
>
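That is incorrect. These *do not* represent the same text: marks with the
same combining class interact typographically (here they stack outward from
the base), so their relative order is significant and normalization keeps it.
There are some cases, especially with combining characters with ccc=0, where
the canonical ordering is not sufficient. Moreover, in general the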
normalization algorithms do not and cannot always give a unique output for
"the same text", since that phrase is so vague. "A" and "a" are the same
word in English, but are not merged by normalization; moreover, it may vary
by language: "aa" and "å" in Danish.
So you have to be much more precise about which sense of "the same" you are
looking for.
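You can check the behavior with any normalization implementation. The sketch
below uses Python's standard unicodedata module (my choice for illustration)
and adds a dot below (U+0323, ccc=220) as a contrasting case that is not in
your message.

```python
import unicodedata

DOT_ABOVE = "\u0307"   # ccc = 230
TILDE = "\u0303"       # ccc = 230
DOT_BELOW = "\u0323"   # ccc = 220


def nfd(s: str) -> str:
    return unicodedata.normalize("NFD", s)


# Same combining class: canonical ordering leaves both sequences as they are.
print(unicodedata.combining(DOT_ABOVE), unicodedata.combining(TILDE))  # 230 230
print(nfd("A" + DOT_ABOVE + TILDE) == nfd("A" + TILDE + DOT_ABOVE))    # False

# Different classes: the marks do not interact, so both orders normalize
# to the same sequence (lower class first).
print(nfd("A" + DOT_ABOVE + DOT_BELOW) == nfd("A" + DOT_BELOW + DOT_ABOVE))  # True

# Case is outside the scope of normalization altogether.
print(nfd("A") == nfd("a"))                                            # False
```

So a search on normalized text will treat the first two sequences as
different strings, which is the intended result; looser notions of "the same"
need a different tool than normalization.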
> I thought the core point of normalisation was precisely to provide a
> _unique_ form for each text --so that, for instance, one can safely and
> efficiently search/count/replace... But if I search the first form in a text
> that holds the second, I'll miss it.
>
It may help to look at the section on matching in the UCA (the Unicode
Collation Algorithm, http://unicode.org/reports/tr10/).
>
>
> Denis
>
> (*) I mean here "grapheme" not in the common sense of graphical form of a
> phoneme, but in the Unicode sense of character in the common sense ;-)
> -- -- -- -- -- -- --
> vit esse estrany ☣
>
> spir.wikidot.com
>
>
>
>
This archive was generated by hypermail 2.1.5 : Wed Dec 22 2010 - 16:02:56 CST