Re: Portuguese (Brazil) and Portuguese (Portugal)

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Fri Aug 12 2005 - 14:56:06 CDT


    From: "Michael (michka) Kaplan" <michka@trigeminal.com>
    > Markets and software products and keyboards and fonts work to define
    > characters to use. Unicode does not really do this.

    If you mean the joint Unicode/ISO 10646 standard here, you're right: there's
    only one encoding.

    However the Unicode Consortium hosts the CLDR registry which tries to define
    such minimal subsets for supporting each language. This registry is not a
    standard for now, but a joint effort to harmonize locale data across
    systems/platforms, something needed to build such portable keyboards, fonts,
    applications and so on. So the CLDR project will help increase
    interoperability of systems designed to support well-categorized families of
    languages.

    What is important is not to mix up the various loose definitions of the
    term "charset". In legacy applications, the term is an abbreviation that
    refers to the three-way association of
    - a set of abstract characters (preferably mapped one-to-one into the
    Unicode/ISO 10646 standard repertoire, but this is not an obligation
    observed by many legacy or application-specific new charsets),
    - with a binary encoding to represent them with code positions,
    - and with a serialization scheme to build and interpret encoded streams of
    bytes as code positions.
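
    The three layers can be sketched with Python's standard codecs. This is
    only an illustration, not part of any standard: the variable names and the
    helper in_repertoire() are made up for the example, and the codec names
    are the ones Python happens to ship.

```python
ch = "é"  # abstract character: LATIN SMALL LETTER E WITH ACUTE

# Encoding: the abstract character is assigned a numeric code position.
code_point = ord(ch)  # 0xE9 in Unicode/ISO 10646

# Serialization: the same code point yields different byte streams.
utf8 = ch.encode("utf-8")          # two bytes
utf16_be = ch.encode("utf-16-be")  # one 16-bit unit, high byte first
utf16_le = ch.encode("utf-16-le")  # one 16-bit unit, low byte first

# A legacy "charset" label bundles all three layers under a single name.
latin1 = ch.encode("iso-8859-1")   # one byte, equal to the code position


def in_repertoire(char, charset):
    """Hypothetical helper: True if the charset's repertoire contains char."""
    try:
        char.encode(charset)
        return True
    except UnicodeEncodeError:
        return False


print(hex(code_point), utf8, utf16_be, utf16_le, latin1)
print(in_repertoire("€", "iso-8859-1"), in_repertoire("€", "cp1252"))
```

    The last line shows the repertoire layer on its own: the euro sign is
    outside the ISO 8859-1 repertoire but inside that of Windows-1252.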

    In Unicode/ISO 10646 the code positions are preferably called "code points",
    because Unicode/ISO 10646 is now used as the common reference to which
    almost all other charsets are mapped (so when studying these charsets, we
    need two terms to distinguish their intrinsic "code positions" from the
    Unicode "code points" to which they are mapped).
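
    As a rough illustration of the distinction, the sketch below looks up the
    intrinsic code position of one character in three legacy charsets and
    checks that each maps back to the same Unicode code point. It assumes the
    mapping tables bundled with Python; the names code_positions and
    code_point are chosen for the example only.

```python
ch = "é"
code_point = ord(ch)  # the Unicode code point, U+00E9

# Intrinsic code position of the same abstract character in each charset.
# All three are single-byte charsets, so the first byte is the position.
code_positions = {
    name: ch.encode(name)[0]
    for name in ("iso-8859-1", "mac_roman", "cp437")
}

for name, position in code_positions.items():
    # Decoding the legacy code position yields the shared code point.
    decoded = bytes([position]).decode(name)
    print(f"{name}: code position 0x{position:02X} "
          f"-> code point U+{ord(decoded):04X}")
```

    The code positions differ from charset to charset, while the code point
    they are mapped to stays the same.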

    "charset" must not be confused with "character set", which refers only to a
    set of abstract characters (this set is called a "repertoire"),
    independently of its encoding, and independently of the fact that this
    repertoire *may* contain abstract characters absent from the standard
    Unicode/ISO 10646 repertoire.

    The Unicode/ISO 10646 repertoire has the vocation of containing almost all
    other repertoires, provided that these repertoires refer to abstract
    characters that are not specific to a private application (for example the
    legacy MacOS Roman repertoire contains an abstract character which represents
    the Apple logo, an abstract character which is absent from the Unicode/ISO
    10646 repertoire).

    For these last "missing" characters, Unicode/ISO 10646 offers a way to
    "map" them to code points, using a private agreement (which can be
    formulated as a character mapping table) that maps these characters to
    code points in the "Private Use Area", where Unicode/ISO 10646 normally
    defines no semantics and where no standard abstract character will ever
    be encoded. This way, ISO 10646 effectively offers a way to map all other
    legacy repertoires, including those that contain abstract characters absent
    from the standard ISO 10646 repertoire.
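
    The Apple logo case can be inspected with Python's bundled mac_roman
    codec. The sketch assumes that codec follows Apple's published mapping
    table, which acts here as the "private agreement"; APPLE_LOGO_BYTE and
    the other names are labels chosen for the example.

```python
import unicodedata

APPLE_LOGO_BYTE = 0xF0  # code position of the Apple logo in MacOS Roman

# Decode the legacy code position through the codec's mapping table.
mapped = bytes([APPLE_LOGO_BYTE]).decode("mac_roman")

# General category "Co" marks a private-use code point.
category = unicodedata.category(mapped)

# Private-use code points carry no standard character name.
name = unicodedata.name(mapped, None)

print(f"U+{ord(mapped):04X}", category, name)

# The mapping is reversible for anyone who shares the same agreement.
round_trip = mapped.encode("mac_roman")
```

    A receiver that does not know the agreement still gets a valid code
    point, but has no standard way to tell what it stands for.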

    The use of non-standard characters is not recommended in applications built
    and tested to work with the ISO 10646 repertoire only. But subject to this
    limitation, the legacy charsets that contain these characters can be used
    and interchanged safely (for example it's safe to interchange text data
    encoded with MacOS Roman, provided that it does not contain the Apple logo
    character).
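
    One possible way to check this condition before interchange is to decode
    the data and look for private-use code points. This is a minimal sketch:
    both function names are hypothetical, and it treats general category "Co"
    as the only sign of a non-standard character.

```python
import unicodedata


def private_use_chars(text):
    """Return (index, code point label) for each private-use character."""
    return [
        (i, f"U+{ord(c):04X}")
        for i, c in enumerate(text)
        if unicodedata.category(c) == "Co"
    ]


def safe_to_interchange(data, charset):
    """True if the decoded bytes contain no private-use code points."""
    return not private_use_chars(data.decode(charset))


plain = "café".encode("mac_roman")
with_logo = b"\xf0 Mac"  # starts with the MacOS Roman Apple logo byte

print(safe_to_interchange(plain, "mac_roman"))
print(safe_to_interchange(with_logo, "mac_roman"))
print(private_use_chars(with_logo.decode("mac_roman")))
```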


