From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Dec 29 2003 - 11:14:29 EST
From: "Christopher John Fynn" <cfynn@gmx.net>
> Anyone have a list of other standards, protocols, RFC's etc which specify
> Unicode (in any of it's encoding formats) as the base, default or preferred
> character set to be used?
For RFCs it's not difficult to get this list using the RFCeditor.org built-in
search engine.
However, a more interesting list would cover standards that were built on
non-Unicode, non-ISO/IEC 10646 charsets registered with IANA and that have
since been mapped onto Unicode, where these standards may perform some
string processing that does not conform to Unicode processing rules.
For example, these other standards may specify canonical equivalences
which do not exist in Unicode:
- I am thinking of some ETSI standards for Teletext, which may
contain more combining marks than those currently encoded in Unicode,
and may define some canonical or compatibility equivalences.
- Or of Asian string processing algorithms, notably for Hangul, Han
and Hiragana/Katakana.
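
To make the two kinds of equivalence concrete, here is a minimal Python
sketch that only uses the standard unicodedata module. The helper names
(canonically_equivalent, compatibility_equivalent) are made up for
illustration; they only reflect the equivalences Unicode itself defines,
not those of any legacy standard.

```python
import unicodedata


def canonically_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


def compatibility_equivalent(a: str, b: str) -> bool:
    """True if a and b have the same compatibility decomposition (NFKD)."""
    return unicodedata.normalize("NFKD", a) == unicodedata.normalize("NFKD", b)


if __name__ == "__main__":
    # Precomposed e-acute vs. "e" followed by COMBINING ACUTE ACCENT.
    print(canonically_equivalent("\u00e9", "e\u0301"))
    # The "fi" ligature is only compatibility equivalent to "fi".
    print(canonically_equivalent("\ufb01", "fi"))
    print(compatibility_equivalent("\ufb01", "fi"))
    # Halfwidth KA vs. regular Katakana KA: compatibility equivalent only.
    print(compatibility_equivalent("\uff76", "\u30ab"))
    # A precomposed Hangul syllable decomposes canonically into jamos.
    print([hex(ord(c)) for c in unicodedata.normalize("NFD", "\uac00")])
```

Any equivalence that a legacy standard defines beyond these would not be
detected by such a comparison.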
These standards may be supported by documenting the additional
equivalences as Unicode folding rules. So far Unicode and ISO/IEC
have focused on preserving the distinctions made in supported character
sets, but I think that there is some work to do on grapheme clusters that
are now distinct in Unicode but canonically or compatibility equivalent in
other standards.
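
A folding rule of this kind could be sketched as an extra mapping applied
on top of a Unicode normalization form. The Python below is only an
illustration under stated assumptions: the table KANA_FOLD (Hiragana
folded to Katakana by a fixed code point offset) and the functions fold
and fold_equivalent are hypothetical and are not taken from any of the
standards mentioned above.

```python
import unicodedata

# Hypothetical folding table (an assumption, not from any standard):
# Hiragana U+3041..U+3096 map to the Katakana letters 0x60 code points
# higher (U+30A1..U+30F6).
KANA_FOLD = {cp: cp + 0x60 for cp in range(0x3041, 0x3097)}


def fold(text: str, table: dict = KANA_FOLD) -> str:
    """Normalize to NFC, then apply the additional folding table."""
    return unicodedata.normalize("NFC", text).translate(table)


def fold_equivalent(a: str, b: str) -> bool:
    """True if a and b are identical after normalization and folding."""
    return fold(a) == fold(b)


if __name__ == "__main__":
    hiragana = "\u3072\u3089\u304c\u306a"  # hi-ra-ga-na in Hiragana
    katakana = "\u30d2\u30e9\u30ac\u30ca"  # hi-ra-ga-na in Katakana
    print(hiragana == katakana)
    print(fold_equivalent(hiragana, katakana))
```

Normalizing before folding means that a base letter followed by a
combining voiced sound mark is composed first, so both spellings of a
voiced kana fold to the same result.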
Documenting folding algorithms that may be used with Unicode is probably
a huge task, as complex as the unification of repertoires in the
ISO/IEC 10646 code point assignments, or the definition of Unicode canonical
equivalences. Knowing them would certainly help in safely handling
Unicode text that was initially encoded with legacy charsets.