From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Sep 25 2007 - 16:48:17 CDT
> On behalf of Marion Gunn
> Sent: Tuesday, 25 September 2007 14:11
> To: Unicode Discussion
> Cc: Mike; Unicode Mailing List
> Subject: Re: New Public Review Issue: Proposed Update UTS #18
>
> Tricky? Perhaps so, Mark, but solutions are the name of the game. In
> any case, we need to add CH as a single letter in both Welsh and
> Breton, C'H as a single letter in Breton, FF as a single letter in
> Welsh, NG as a single letter in Welsh, etc., in all implementations.
> mg
It should be noted that "C'H" is a single collation element in Breton, but
that this spelling is not its only representation: the collation element
also covers other forms of the apostrophe, notably "C’H", which collates
equivalently (the only difference is the "last chance" one, at the final
level that compares code points; there is no primary or secondary
difference).
One question remains here for use in regexps: if the input universe is
defined in such a way that "." matches a single collation element in the
input locale or in the current locale context, then it will match several
distinct strings that are not necessarily canonically equivalent (in
addition to possibly distinct but canonically equivalent strings).
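To make the question concrete, here is a minimal sketch in Python of a "."
that consumes one collation element rather than one code point. The
contraction list (just "c'h" and "ch") and the apostrophe set are assumptions
made for illustration, not data taken from an actual Breton tailoring, and
elements() is a hypothetical helper name.

```python
import re

# Assumed set of apostrophes treated as equivalent (illustration only).
APOS = "\u0027\u2019"

# One "collation element": a known contraction first, else any single character.
# The contraction list here is hypothetical and deliberately tiny.
ELEMENT = re.compile("c[" + APOS + "]h|ch|.", re.IGNORECASE | re.DOTALL)

def elements(text):
    """Split text into collation elements, longest contraction first."""
    return ELEMENT.findall(text)

plain = elements("c\u0027hoar")   # spelled with the ASCII apostrophe
curly = elements("c\u2019hoar")   # spelled with U+2019

# Both spellings yield the same number of elements, but the first element
# is a different string in each case, and the two are not canonically
# equivalent.
print(len(plain), len(curly), plain[0] == curly[0])
```

With such a tokenizer, a "." defined over elements would accept both
spellings of the trigraph, which is exactly the ambiguity discussed here.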
So how can a regexp specify that only one specific form of this collation
element match? Suppose that a regexp user wants to find all occurrences in a
text where "c'h" is used in a Breton text instead of the recommended "c’h"
form (which is equivalent linguistically).
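With today's regexp engines this search has to be spelled out by hand. The
following Python sketch shows both the loose search and the exact one; the
apostrophe set and the sample text are assumptions for illustration only.

```python
import re

# Apostrophe variants treated as equivalent here; an assumed set, not one
# taken from any published Breton collation tailoring.
APOSTROPHES = "\u0027\u2019"

# Loose search: the trigraph written with any of the apostrophe variants.
any_form = re.compile("c[" + APOSTROPHES + "]h", re.IGNORECASE)

# Exact search: only the spelling with the ASCII apostrophe U+0027.
ascii_form = re.compile("c\u0027h", re.IGNORECASE)

def find_all(pattern, text):
    """Return every substring of text matched by pattern."""
    return [m.group(0) for m in pattern.finditer(text)]

sample = "Ur c\u0027hi hag ur c\u2019hazh"
print(find_all(any_form, sample))    # both spellings
print(find_all(ascii_form, sample))  # only the ASCII-apostrophe spelling
```

The exact search is the one a proofreader would use to locate spellings to
be replaced by the recommended form.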
This is not restricted to Breton: one could do the same with "'" and
"’" in English or French, where they are also perceived as equivalent
instances of the same collation element and should be treated
identically. If we want regexps to be usable in linguistic contexts, then we
must be able to collate strings correctly according to language, and
make distinctions only when this is *explicitly* specified in the
regexp.
For this, we'll need a special escaping mechanism that disables the
interpretation as collation element classes but still maintains
the interpretation as unbreakable elements (not part of the input universe).
One means is the use of \uxxxx or \Uxxxxxxxx, but this is not the easiest
way to refer to them; if the collation elements treated identically are not
canonically equivalent, like the various apostrophes, we should be able to
escape them by using them directly within the regexp, without any
tricky hexadecimal notation. If \q{c'h} is used to refer to the Breton
collation element, does it match \q{c’h}? If it does not, then how
can we simply search for either "c'h" or "c’h" in the Breton locale context
(which should still be the normal way to look for them)?
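As a small check of the two points above, this Python sketch verifies that
the two apostrophes are not canonically equivalent (normalization leaves
them distinct), and that a \uxxxx escape and the literal character denote
the same pattern in Python's re module. The helper name
canonically_equivalent() is made up for this example.

```python
import re
import unicodedata

def canonically_equivalent(a, b):
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

ascii_form = "c\u0027h"   # c + APOSTROPHE + h
curly_form = "c\u2019h"   # c + RIGHT SINGLE QUOTATION MARK + h

# The apostrophes differ even after normalization...
print(canonically_equivalent(ascii_form, curly_form))
# ...whereas precomposed and decomposed e-acute are canonically equivalent.
print(canonically_equivalent("\u00e9", "e\u0301"))

# Hexadecimal escape versus the literal character: same matches.
by_escape = re.compile(r"c\u2019h")
by_literal = re.compile("c’h")
print(bool(by_escape.fullmatch(curly_form)), bool(by_literal.fullmatch(curly_form)))
```

So normalization alone cannot fold the apostrophes together; any such
folding has to come from the locale's collation rules.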
So, maybe \q{c'h} will match both "c'h" and "c’h" in the Breton locale
context, and then one of:
* \Q{c'h} will match only "c'h" (using \Q instead of \q means that it won't
use the current locale context where equivalent collation elements are
recognized, but will only refer to the default collation elements that are
canonically equivalent).
* or maybe \Q{C!c'h}, where we specify a simpler locale, here the C locale,
where no collation ever occurs and where not even canonical
equivalences are recognized, unlike \Q{POSIX!c'h}, where canonical
equivalences are possible.
* or \Q{U!c'h} in a "U" locale referencing the default Unicode DUCET, where
canonical equivalences should be recognized, but without impact here on the
apostrophes.
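To illustrate the first option, here is a rough Python sketch that rewrites
the proposed \q{...} and \Q{...} notations into ordinary re syntax. Nothing
here is an existing API: expand() is a hypothetical helper, the apostrophe
set is an assumption, \Q is simplified to a purely literal match (no
canonical equivalence handling), and the locale-prefixed forms such as
\Q{C!c'h} are not handled.

```python
import re

# Assumed set of apostrophes that the Breton locale would treat as equivalent.
APOSTROPHES = "\u0027\u2019"

def expand(pattern):
    """Rewrite hypothetical \\q{...} and \\Q{...} into plain `re` syntax."""
    def repl(m):
        kind, body = m.group(1), m.group(2)
        if kind == "Q":
            # Strict form: match the body exactly as written.
            return "(?:" + re.escape(body) + ")"
        # Loose form: every apostrophe stands for the whole assumed set.
        parts = []
        for ch in body:
            if ch in APOSTROPHES:
                parts.append("[" + APOSTROPHES + "]")
            else:
                parts.append(re.escape(ch))
        return "(?:" + "".join(parts) + ")"
    return re.sub(r"\\([qQ])\{([^}]*)\}", repl, pattern)

loose = re.compile(expand(r"\q{c'h}"))
strict = re.compile(expand(r"\Q{c'h}"))

sample = "c\u0027hi c\u2019hazh"
print(loose.findall(sample))   # both spellings
print(strict.findall(sample))  # only the ASCII-apostrophe spelling
```

In this sketch \q{c'h} and \q{c’h} expand to the same pattern, which is one
possible answer to the question of whether they should match each other.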