Re: Proposal to change the script allocation rules for the BMP and SMP

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 29 2008 - 13:42:46 CST


    Karl Pentzlin asked:

    > I am considering submitting the following proposal.
    > Any opinions?

    Yes. I don't think it is necessary.

    > Proposal to change the script allocation rules for the BMP and SMP
    >
    > It is proposed to close the BMP for new scripts which are not
    > included in PDAM7 as of Oct. 2008 or earlier ISO/IEC 10646 amendments.
    > The SMP should explicitly be devised for "contemporary and historical scripts"
    > in a way that shows no discrimination for scripts in current use which
    > are assigned within the SMP.

    While it may be the perception that there are rules of this sort,
    no normative material in either the Unicode Standard or ISO/IEC 10646
    requires the particular placement of newly encoded scripts in
    either the BMP or the SMP, based on any determination of
    contemporary versus historic usage.

    What the Unicode Standard and 10646 actually say is simply a
    descriptive reflection of what the cumulative committee decisions
    regarding encoding have resulted in over the last two decades.
    From 10646:

    "The Basic Multilingual Plane includes characters in general
    use in alphabetic, syllabic, and ideographic scripts
    together with various symbols and digits."

    "... the SMP is not used to date for encoding CJK Ideographs.
    Instead, the SMP is used for encoding graphic characters used in
    other scripts of the world that are not encoded in the BMP.
    Most, but not all, of the scripts encoded to date in the SMP are
    not in use as living scripts by modern user communities."

    I expect that such descriptions may need minor updates in the
    future as more characters are encoded in the SMP -- but it
    is already the case that some contemporary use scripts are
    encoded in the SMP, and certainly everyone knows that some
    historic scripts are already encoded in the BMP -- as well
    as many historic and obsolete characters from contemporary
    use scripts.

    What may be more pertinent are the WG2 Principles and Procedures.
    That document:

    http://std.dkuug.dk/JTC1/SC2/WG2/docs/n3452.pdf

    does have a specific set of guidelines regarding what should
    be on the BMP or the SMP, but those guidelines do not, in fact,
    tie the committee's hands regarding issues such as the placement
    of Varang Kshiti versus Miao, for example. From the P&P document's
    section on "Goals for encoding new characters into the BMP":

    "Generally, the Basic Multilingual Plane (BMP) should be devoted to
    high-utility characters that are widely implemented in information
    technology and communication systems. ... Characters of more
    limited use should be considered for encoding in supplementary
    planes, for example, obscure archaic characters."

    Essentially that goal was already met long ago, as the
    high-utility characters that are widely implemented in IT were
    encoded in the BMP already in early versions of the standard.
    Everything we are talking about at this point consists of
    characters that effectively have *no* implementation yet
    anywhere in IT systems.

    The P&P document goes on to spell out specific criteria for
    encoding on a supplementary plane:

    "a) If the proposed character is used infrequently, or
     b) If it is part of a set of characters for which insufficient
        space is available in the Basic Multilingual Plane, or
     c) If the proposed character is part of a small number of
        characters to be added to a script already encoded in
        one of the supplementary planes..."
        
    Since the BMP is almost full now, as you noted, criterion b) is
    going to be cited more and more often by the committees, and
    is already fully sufficient to deal with cases like Miao and
    Hungarian Rovas, in my opinion.

    As for the additional point you make about the potential for
    future allocations of Latin characters and Hiragana characters
    requiring more space on the BMP, I also have a few comments.

    First, re Hiragana. The issue here is for hentaigana, the
    early and highly variable forms of historic kana, before
    Japan standardized on the specific lists of Hiragana (and
    Katakana) which have long been in Unicode and 10646. The relatively
    common-use historic kana (both for Hiragana and Katakana) are
    also already in the standard -- those are used, for example,
    in editions of Japanese classic literature. For hentaigana
    there is no consensus yet on how to proceed -- but in any case
    it is not clear that the script identity of hentaigana with
    modern Hiragana would be a given, nor would there be any
    obvious requirement that such an additional set, if sorted
    out and proposed for encoding, would be required to reside
    on the BMP.

    For Latin as well, everything obvious was encoded long ago.
    The latest Latin additions dug deeply into the medievalist
    Latin tradition to collect more obscure historic letters.
    And a "quick look" at a site like languagegeek.com, for
    example, does *not* lead to the impression that there are
    hundreds of Latin letters in use that aren't already
    representable in the standard. But even *if* such candidates
    turn up -- and I do know of a few examples of further
    proposals coming in for the committees to consider -- there
    is nothing in the standard that normatively requires that
    such extensions to the Latin script be contained solely
    within the BMP.

    What some people fail to note when claiming that scripts
    need to be kept together on a plane in 10646 is that one
    of the highest-use contemporary scripts long ago broke that
    mold and is spread across two planes: Han. If the
    scattering of Latin blocks (there are already 7 of them
    in the standards) eventually requires encoding another
    one on the SMP, c'est la vie at this point.
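
To make the point about Han concrete, here is a minimal sketch of how a code point maps to its plane. The function name `plane_of` and the constants are illustrative only, not taken from either standard; the two sample code points are the first CJK Unified Ideograph in the BMP (U+4E00) and the first one in CJK Extension B (U+20000).

```python
# Plane numbers as commonly named; the constant names are illustrative.
BMP, SMP, SIP = 0, 1, 2


def plane_of(code_point: int) -> int:
    """Return the plane (0..16) that a Unicode code point belongs to."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError(f"not a Unicode code point: {code_point:#x}")
    # Each plane holds 0x10000 code points, so the plane is the
    # value of the bits above the low 16.
    return code_point >> 16


if __name__ == "__main__":
    for cp in (0x4E00, 0x20000):
        print(f"U+{cp:04X} is on plane {plane_of(cp)}")
```

Both code points are Han ideographs, yet the first lands on plane 0 and the second on plane 2, so a single script already spans more than one plane.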

    These issues of plane allocation and the privileged status
    of the BMP had more relevance back ca. 2000 when people
    were struggling to update to Unicode 3.0, which still had
    everything assigned in the BMP, and when BMP-only
    implementations of Unicode were still quite common.
    I really don't see them as being of much technical relevance
    in 2008.
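
As a rough illustration of why BMP-only implementations mattered, the sketch below counts UTF-16 code units per code point. The helper name `utf16_code_units` is hypothetical, and the example assumes the implementation in question stores text as UTF-16, which is one common case and not the only one.

```python
def utf16_code_units(code_point: int) -> int:
    """Count the 16-bit code units needed to encode one code point in UTF-16."""
    # utf-16-be emits no byte order mark, so the byte length is exactly
    # two bytes per code unit.
    return len(chr(code_point).encode("utf-16-be")) // 2


if __name__ == "__main__":
    for cp in (0x0041, 0x4E00, 0x10000, 0x20000):
        print(f"U+{cp:04X}: {utf16_code_units(cp)} code unit(s)")
```

A BMP code point fits in one code unit, while anything on a supplementary plane needs a surrogate pair. Software that assumed one code unit per character was the kind of BMP-only implementation that made plane placement a practical concern.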

    And if anything, "closing the BMP" and reserving it for
    additional allocations of truly obscure and mostly useless
    Latin additions might at this point have an ironically contrary
    impact of further entrenching mistaken perceptions that the
    BMP is a realm of European privilege, with the gates closed
    against any further intrusions by lesser-known scripts.

    --Ken


