Re: Is it true that Unicode is insufficient for Oriental languages?

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu May 22 2003 - 11:08:40 EDT


    From: "Stefan Persson" <alsjebegrijptwatikbedoel@yahoo.se>
    > John Cowan wrote:
    > > There are no less than 70K Han characters in Unicode 4.0.
    >
    > And if you are quoting some old source involving any character NOT in Unicode, you may map that character to any preferred PUA code point.
    >

    This makes me think of articles that try to describe some other encoding method and claim that it is what Unicode SHOULD have been, ignoring the fact that before Unicode there was no common processing model for the many incompatible encodings found in many places. Those encodings were hard to make compatible with each other because they lacked a formal description (see the way encodings were defined in the old charsets registry, which is now being migrated to a more exact description by mapping each of them through a common intermediate Unicode mapping).
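
    To make concrete what such a formal description looks like, here is a minimal Python sketch: a toy "legacy" charset (the three table entries are invented for illustration) is described purely as a byte-to-code-point table, and both conversion directions follow mechanically from that one table.

        # A toy "legacy" charset described formally as a byte -> Unicode
        # code point table (the entries are invented for this example).
        TOY_CHARSET = {
            0x41: 0x0041,  # 'A'
            0x8E: 0x00E9,  # 'é'
            0xF0: 0x2260,  # '≠'
        }
        TOY_REVERSE = {cp: b for b, cp in TOY_CHARSET.items()}

        def toy_decode(data: bytes) -> str:
            # Bytes -> Unicode string; an unmapped byte raises KeyError.
            return ''.join(chr(TOY_CHARSET[b]) for b in data)

        def toy_encode(text: str) -> bytes:
            # Unicode string -> bytes, via the inverse table.
            return bytes(TOY_REVERSE[ord(c)] for c in text)

        assert toy_decode(b'\x41\x8e\xf0') == 'Aé≠'
        assert toy_encode('Aé≠') == b'\x41\x8e\xf0'

    Once two charsets are both described this way, converting between them is just a composition of the two tables through Unicode, with no pairwise table needed.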

    Unicode will not stop or forbid the definition of other encodings or encoding models, but at least there will always be a well-defined mapping between any new "character set" or "encoding" and Unicode, because Unicode defines many more character properties and semantics than all previous ("legacy") encodings. Even the string-handling algorithms are not mandatory, and applications are free to use their own internal encodings if that facilitates the implementation of the string-handling algorithms they need.
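
    As an illustration of that freedom, the Python sketch below keeps text internally as an array of 32-bit code points (one possible design choice, not anything Unicode mandates), which makes indexed access and in-place edits trivial, and converts to standard UTF-8 only at the interchange boundary.

        import array

        # Internal form: one 32-bit code point per character ('I' = unsigned
        # int, 4 bytes on common platforms), giving O(1) indexed access.
        def to_internal(text: str) -> array.array:
            return array.array('I', (ord(c) for c in text))

        def to_interchange(buf: array.array) -> bytes:
            # Back to a well-defined interchange form: UTF-8.
            return ''.join(map(chr, buf)).encode('utf-8')

        buf = to_internal('Ünicode')
        buf[0] = ord('U')  # cheap in-place edit on fixed-width units
        assert to_interchange(buf) == b'Unicode'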

    Unicode/ISO/IEC 10646 has really helped IBM migrate its previously very large set of conversion tables through a single intermediate representation (and then donate them to the community by publishing the mappings in its ICU tables). In fact, Unicode was extended each time there was enough legacy usage of a "codepage", "charset" or "encoding" in applications, and this allowed much better interchange of information between otherwise incompatible systems. It will remain for a long time a "pivot" representation that works quite well for textual information interchange, even if it is not the most efficient encoding for some *particular* usages on *some* systems.
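
    This pivot role is easy to demonstrate with any two legacy encodings. The Python sketch below uses the codec tables shipped with Python (which derive from exactly this kind of published mapping data) to convert CP437 bytes to Latin-1 bytes, with a Unicode string as the intermediate form and no direct CP437-to-Latin-1 table anywhere:

        # Pivot conversion: legacy encoding A -> Unicode -> legacy encoding B.
        cp437_bytes = b'caf\x82'            # 0x82 is 'é' in CP437
        text = cp437_bytes.decode('cp437')  # the Unicode pivot: 'café'
        latin1_bytes = text.encode('latin-1')
        assert latin1_bytes == b'caf\xe9'   # 0xE9 is 'é' in Latin-1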

    I'm pleased to see that Unicode is not a fully closed standard (with some exceptions, such as the frozen "compatibility" decompositions) and that, as this "pivot" representation, it adapts remarkably well to many scripts and to almost all languages and usages. So any algorithm defined only for a "legacy" or "custom" encoding can now be mapped to an equivalent algorithm based on Unicode code points.

    There are still some characters that Unicode does not define well, for example characters whose usage is restricted by copyright or trademark (such as the "Apple logo" used in various Mac encodings). Perhaps some of these widely used characters should have a Unicode mapping, published not as a mandatory specification but as an informative recommendation, and possibly allocated in a "semi-private" plane that the standard would explicitly reserve for such "proprietary" usage where a consensus is possible. Unicode would not publish the logos themselves, but only a descriptive list of these allocations and pointers to the vendors that requested them. This could be used to map the "Apple" logo character, or the Windows logo found on keyboards. Such characters would be better allocated in a specific block in a separate plane, which applications could flag as proprietary and possibly not interchangeable.
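
    The Mac encodings already hint at this pattern today: Mac OS Roman places the Apple logo at byte 0xF0, and the published mapping tables (followed by Python's mac_roman codec, used in the sketch below) send it to the Private Use Area code point U+F8FF rather than to any standardized character.

        # The Apple logo has no standard Unicode character; by vendor
        # convention Mac OS Roman maps byte 0xF0 to the PUA point U+F8FF.
        logo = b'\xf0'.decode('mac_roman')
        assert logo == '\uf8ff'
        # Interchange is only safe between parties sharing this convention,
        # which is exactly what a registered "semi-private" plane would fix.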

    Or there could be another registry, managed by some ISO bureau, through which vendors request such allocations; the characters would be mapped in a separate specification, outside ISO/IEC 10646 and Unicode, but with a Unicode-defined algorithm to map them into a preallocated plane. Registration would require payment by the vendors into this separate registry of reserved logos, and the process could be part of an international agreement on the usage of trademarks and logos. Once these trademarks or logos lose their reserved rights and become public, they could be standardized by ISO/IEC 10646 in a "standard" plane (probably plane 1, or plane 0 if the logo is by then in recognized general usage, like the dingbats).
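
    To make the proposal concrete, the allocation algorithm could be as trivial as handing out sequential code points in the reserved plane. The Python sketch below is purely hypothetical: the registry and the names are invented, and the base 0xF0000 (which happens to be the start of the existing supplementary Private Use plane) is chosen here only for illustration.

        # Hypothetical logo registry allocating sequential code points in a
        # reserved plane; nothing here is part of any actual standard.
        REGISTRY_BASE = 0xF0000
        _allocations: dict[str, int] = {}

        def register_logo(vendor_and_logo: str) -> int:
            # First come, first served; fees and trademark checks elided.
            if vendor_and_logo not in _allocations:
                _allocations[vendor_and_logo] = REGISTRY_BASE + len(_allocations)
            return _allocations[vendor_and_logo]

        apple = register_logo('Apple Inc. / Apple logo')
        assert apple == 0xF0000
        assert register_logo('Apple Inc. / Apple logo') == apple  # stable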

    There are plenty of such logo/symbol characters in various publications, notably in road maps, technical drawings, safety symbols, and symbols made mandatory by national laws or international conventions in domains such as road and waterway signage. At some point, if these logos/symbols are royalty-free, they may deserve such allocations, after consulting the specialists of those conventions or national laws, or once publishers relax the usage of these symbols to facilitate data interchange.

    Another missing script is the "stenographic" (shorthand) script, widely taught in the 1970s and still used today by secretaries wherever a computer is not usable, available or practical (notably during "brainstorming" meetings where someone asks for earlier sentences to be searched and repeated). Defining a Unicode mapping for stenographic systems (there certainly exist many variants for distinct languages) would facilitate the publication of educational books about these notational/phonetic systems.


