Re: Proposal to change the script allocation rules for the BMP and SMP

From: Kenneth Whistler (kenw@sybase.com)
Date: Wed Oct 29 2008 - 13:42:46 CST

Next message: Karl Pentzlin: "Re: Proposal to change the script allocation rules for the BMP and SMP"

Previous message: Karl Pentzlin: "Proposal to change the script allocation rules for the BMP and SMP"
Maybe in reply to: Karl Pentzlin: "Proposal to change the script allocation rules for the BMP and SMP"
Next in thread: Karl Pentzlin: "Re: Proposal to change the script allocation rules for the BMP and SMP"
Reply: Karl Pentzlin: "Re: Proposal to change the script allocation rules for the BMP and SMP"
Reply: Karl Pentzlin: "Re: Proposal to change the script allocation rules for the BMP and SMP"

Karl Pentzlin asked:

> I consider to submit the following proposal.
> Any opinions?

Yes. I don't think it is necessary.

> Proposal to change the script allocation rules for the BMP and SMP
>
> It is proposed to close the BMP for new scripts which are not
> included in PDAM7 as of Oct. 2008 or earlier ISO/IEC 10646 amendments.
> The SMP should explicitly be devised for "contemporary and historical scripts"
> in a way that shows no discrimination for scripts in current use which
> are assigned within the SMP.

While it may be the perception that there are rules of this sort,
no normative material in either the Unicode Standard or ISO/IEC 10646
requires the particular placement of newly encoded scripts in
either the BMP or the SMP, based on any determination of
contemporary versus historic usage.

What the Unicode Standard and 10646 actually say is simply a
descriptive reflection of what the cumulative committee decisions
regarding encoding have resulted in over the last two decades.
From 10646:

"The Basic Multilingual Plane includes characters in general
use in alphabetic, syllabic, and ideographic scripts
together with various symbols and digits."

"... the SMP is not used to date for encoding CJK Ideographs.
Instead, the SMP is used for encoding graphic characters used in
other scripts of the world that are not encoded in the BMP.
Most, but not all, of the scripts encoded to date in the SMP are
not in use as living scripts by modern user communities."

I expect that such descriptions may need minor updates in the
future as more characters are encoded in the SMP -- but it
is already the case that some contemporary use scripts are
encoded in the SMP, and certainly everyone knows that some
historic scripts are already encoded in the BMP -- as well
as many historic and obsolete characters from contemporary
use scripts.

What may be more pertinent are the WG2 Principles and Procedures.
That document:

http://std.dkuug.dk/JTC1/SC2/WG2/docs/n3452.pdf

does have a specific set of guidelines regarding what should
be on the BMP or the SMP, but those guidelines do not, in fact,
tie the committee's hands regarding issues such as the placement
of Varang Kshiti versus Miao, for example. From the P&P document's
section on "Goals for encoding new characters into the BMP":

"Generally, the Basic Multilingual Plane (BMP) should be devoted to
high-utility characters that are widely implemented in information
technology and communication systems. ... Characters of more
limited use should be considered for encoding in supplementary
planes, for example, obscure archaic characters."

Essentially that goal was already met long ago, as the
high-utility characters that are widely implemented in IT were
encoded in the BMP already in early versions of the standard.
Everything we are talking about at this point consists of
characters that effectively have *no* implementation yet
anywhere in IT systems.

The P&P document goes on to spell out specific criteria for
encoding on a supplementary plane:

"a) If the proposed character is used infrequently, or
b) If it is part of a set of characters for which insufficient
    space is available in the Basic Multilingual Plane, or
c) If the proposed character is part of a small number of
    characters to be added to a script already encoded in
    one of the supplementary planes..."

Since the BMP is almost full now, as you noted, criterion b) is
going to be cited more and more often by the committees, and
is already fully sufficient to deal with cases like Miao and
Hungarian Rovas, in my opinion.

As for the additional point you make about the potential for
future allocations of Latin characters and Hiragana characters
requiring more space on the BMP, I also have a few comments.

First, re Hiragana. The issue here is for hentaigana, the
early and highly variable forms of historic kana, before
Japan standardized on the specific lists of Hiragana (and
Katakana) which have long been in Unicode and 10646. The relatively
common-use historic kana (both for Hiragana and Katakana) are
also already in the standard -- those are used, for example,
in editions of Japanese classic literature. For hentaigana
there is no consensus yet on how to proceed -- but in any case
it is not clear that the script identity of hentaigana with
modern Hiragana would be a given, nor would there be any
obvious requirement that such an additional set, if sorted
out and proposed for encoding, would be required to reside
on the BMP.

For Latin as well, everything obvious was encoded long ago.
The latest Latin additions dug deeply into the medievalist
Latin tradition to collect more obscure historic letters.
And a "quick look" at a site like languagegeek.com, for
example, does *not* lead to the impression that there are
hundreds of Latin letters in use that aren't already
representable in the standard. But even *if* such candidates
turn up -- and I do know of a few examples of further
proposals coming in for the committees to consider -- there
is nothing in the standard that normatively requires that
such extensions to the Latin script be contained solely
within the BMP.

What some people fail to note when claiming that scripts
need to be kept together on a plane in 10646 is that one
of the highest-use contemporary scripts long ago broke that
mold and is spread across two planes: Han. If the
scattering of Latin blocks (there are already 7 of them
in the standards) eventually requires encoding another
one on the SMP, c'est la vie at this point.
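
To show how little machinery the plane distinction involves, here
is a minimal Python sketch. The function name plane_of is
hypothetical, made up for this illustration. The sample code points
are U+4E00 (start of the CJK Unified Ideographs block on the BMP)
and U+20000 (start of CJK Unified Ideographs Extension B on plane 2).

```python
def plane_of(code_point):
    """Return the plane number of a code point (0 = BMP, 1 = SMP, 2 = SIP, ...).

    Hypothetical helper for illustration: each plane holds 0x10000
    code points, so the plane is the code point shifted right by 16 bits.
    """
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    return code_point >> 16


# Han ideographs live on more than one plane.
print(plane_of(0x4E00))   # 0
print(plane_of(0x20000))  # 2
```

Nothing in that arithmetic treats plane 0 differently from any
other plane; a script's plane is just the high bits of its code
points.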

These issues of plane allocation and the privileged status
of the BMP had more relevance back ca. 2000 when people
were struggling to update to Unicode 3.0, which still had
everything assigned in the BMP, and when BMP-only
implementations of Unicode were still quite common.
I really don't see them as being of much technical relevance
in 2008.
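
For reference, what made BMP-only implementations a sticking point
is that in UTF-16 a code point beyond U+FFFF takes two code units
(a surrogate pair) rather than one. A minimal Python sketch of that
difference follows; the helper name utf16_code_units is hypothetical,
and the only library call used is the standard str.encode with the
"utf-16-be" codec.

```python
def utf16_code_units(ch):
    """Return the UTF-16 code units of a string as a list of integers.

    Hypothetical helper for illustration: encode as big-endian UTF-16
    (no byte order mark) and read the result back two bytes at a time.
    """
    data = ch.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


# A BMP character is one code unit; a supplementary-plane character is two.
print([hex(u) for u in utf16_code_units("A")])           # ['0x41']
print([hex(u) for u in utf16_code_units("\U00010000")])  # ['0xd800', '0xdc00']
```

An implementation that handles surrogate pairs correctly has no
technical reason to care which plane a script is encoded on.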

And if anything, "closing the BMP" and reserving it for
additional allocations of truly obscure and mostly useless
Latin additions might at this point have an ironically contrary
impact of further entrenching mistaken perceptions that the
BMP is a realm of European privilege, with the gates closed
against any further intrusions by lesser-known scripts.

--Ken


This archive was generated by hypermail 2.1.5 : Wed Oct 29 2008 - 13:46:45 CST