conflicting property value aliases for scripts (Qaac, Qaai)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Fri Jan 23 2009 - 19:36:09 CST

Next message: Doug Ewell: "Re: Großes Eszett"

Previous message: Tom: "Re: Großes Eszett"
In reply to: Michael Everson: "Re: Großes Eszett"
Next in thread: Dominikus Scherkl: "Re: Großes Eszett"

I'm not writing about ISO 15924 (which is fine as it is), but about Unicode's "PropertyValueAliases.txt" file,
which continues to suggest non-conforming preferred values and accepts some aliases that are not meant to represent
those scripts.

Two entries are problematic:

sc ; Copt ; Coptic ; Qaac
sc ; Qaai ; Inherited

The first one still accepts "Qaac" (as a secondary alias, though) even though it is not the preferred form. I can't
see any place in the rest of the UCD where this code (meant for private use only) is used, so why is it kept there?
I see absolutely no value in keeping this alias. It may date from preliminary encodings, but ISO 15924 did not even
exist then, so "Qaac" could not have been referenced for Coptic, and the preferred value is now "Copt" from the
ISO 15924 standard (only "Coptic" could exist from past Unicode versions).

The second one is problematic, because TUS says that this is the "preferred" form. However, I can't see the
rationale that leads us to prefer a non-interoperable private-use code here, even if "Inherited" script
properties are needed for some Unicode algorithms. My opinion is that this "Qaai" code should really be removed
there as well, and replaced by a stable (interoperable) code, like "Ziii", to be allocated in ISO 15924 (in a way
similar to the line "sc ; Zyyy ; Common"). If no new ISO 15924 code is allocated, I think that "Inherited" should
still be the preferred value, and "Qaai" should be removed from this list of aliases.

Note that the current use in UCD conflicts with ISO 15924. For me "Coptic" is not the same as "Private-Use", and
"Inherited" is certainly not "Private-Use" but an effective script.

I've just seen a bug caused by these two undesirable aliases: valid text was rejected because it supposedly used
private-use scripts. To fix it, I had to drop these aliases from the implementation. This has no major effect on
Coptic, but it meant several changes in various locations to use "Inherited" instead of "Qaai", or to make sure
that "Qaai" does not get propagated into the rest of the application. The aliases also produced rendering problems
elsewhere:
* For Coptic, the text was rendered using incorrect fallbacks, even though there was a matching font for it.
* For Inherited, the effect was similar, with some diacritics being rendered separately from the base character.
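
The workaround described above (dropping the private-use aliases when loading the file) could look roughly like the
following Python sketch. It is only an illustration, not the implementation I actually patched: the function names
`is_private_use` and `parse_script_aliases` are hypothetical, the sample data contains only the two entries quoted
above plus the "Common" line, and I assume the usual "sc ; short ; long ; extra aliases" field layout with "#"
starting a comment.

```python
# Sample lines in the PropertyValueAliases.txt layout (assumed: fields separated by ';').
SAMPLE = """\
sc ; Copt ; Coptic ; Qaac
sc ; Qaai ; Inherited
sc ; Zyyy ; Common
"""

def is_private_use(code):
    """True for four-letter codes in the ISO 15924 private-use range Qaaa..Qabx."""
    return len(code) == 4 and code.isalpha() and "Qaaa" <= code <= "Qabx"

def parse_script_aliases(text):
    """Map each accepted alias (lowercased) to the long script name,
    skipping every alias that is a private-use code."""
    table = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        fields = [f.strip() for f in line.split(";")]
        if fields[0] != "sc" or len(fields) < 3:
            continue                          # not a script alias entry
        long_name = fields[2]
        for alias in fields[1:]:
            if not is_private_use(alias):
                table[alias.lower()] = long_name
    return table

aliases = parse_script_aliases(SAMPLE)
```

With this table, "Copt" and "Coptic" both resolve to "Coptic", "Inherited" resolves to itself, and neither "Qaac"
nor "Qaai" is recognized as a Unicode script alias, so both remain free for private use.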

If no action is taken in the UCD to remove these aliases from PropertyValueAliases.txt, then I suggest that a note
be added to the Annex of TUS describing the UCD, to warn users about these values:
* "Qaac" should be strongly NOT recommended and marked DEPRECATED/OBSOLETE, and the note should suggest that the
preliminary applications still using it be updated to remove it. New applications should ignore this alias
completely (this includes applications like Unicode Regular Expressions).
* "Qaai" should not be used in the rest of the database, but if it is, applications should make sure that this code
is never output from interfaces that query character properties. These interfaces should return "Inherited"
instead, or any other convenient numeric mapping that applications may be using internally or in their interface,
such as an enumeration datatype. In such enumerations, the identifier used should not be Qaai, as application users
will still need it for their own private properties. This identifier should also disappear from the public
interfaces of libraries that are distributed with development tools. "Qaai" should then be completely "blackboxed"
within the library implementing Unicode using the Unicode data files as its source (but my opinion is that their
internal database should better remove it as well).
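
To make the second recommendation more concrete, here is a minimal Python sketch of such a "blackboxed" mapping. The
names `Script`, `from_ucd` and `public_name` are hypothetical and do not come from any existing library; only three
scripts are listed, and the lookup table stands in for whatever a real library would build from the UCD data files.

```python
from enum import Enum

class Script(Enum):
    """Public enumeration: no member is named or valued "Qaai"."""
    COMMON = "Common"
    COPTIC = "Coptic"
    INHERITED = "Inherited"

# Internal table only: this is the single place where the UCD value "Qaai" is known.
_UCD_TO_SCRIPT = {
    "Zyyy": Script.COMMON, "Common": Script.COMMON,
    "Copt": Script.COPTIC, "Coptic": Script.COPTIC,
    "Qaai": Script.INHERITED, "Inherited": Script.INHERITED,
}

def from_ucd(value):
    """Translate a script value read from the UCD files into the public enum."""
    return _UCD_TO_SCRIPT[value]

def public_name(script):
    """Name returned by property-query interfaces: never a private-use code."""
    return script.value
```

Here `public_name(from_ucd("Qaai"))` gives "Inherited", so the private-use code is accepted when reading the data
files but never comes back out of the interface.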

I also don't see the rationale that would forbid a specific formal allocation of "Ziii" in ISO 15924 for
"Inherited", given that it accepts other values not used in Unicode. These represent differences that have been
unified in Unicode into single scripts, or represent multiple scripts ("Latf" and "Latg" are good examples), and
there are other examples for scripts that have been rejected from encoding in Unicode, like Tengwar. The ISO 15924
standard is more tolerant than Unicode and ISO 10646, because it encodes scripts independently of the characters
used to encode them, and the technical need for a standard code in ISO 15924 for use in Unicode's Inherited
property value aliases is enough for me to justify this addition. Note that "Common" makes no sense in ISO 15924 in
the domain of bibliographic applications, and is present precisely for technical reasons (ISO 15924 is not meant to
be reserved for librarians only).

--- unrelated topic: defining scopes in ISO 15924 ---

Also, I think that ISO 15924 should contain an additional "scope" field, similar to ISO 639-3:
* "A" for single generic alphabetic scripts (including all alphabets, abjads, abugidas, ideo-phonographic and
ideographic scripts) : this "scope" should be set in most ISO 15924 codes.
* "V" for single script with variant forms ("Hans", "Hant", "Latf", "Latg", "Syre", "Syrj", "Syrn") used in some
languages with specific orthographic conventions that are not applicable to all texts and all languages using the
generic alphabetic script. Such "multiple" scripts exist only because parts of their repertoires are shared and
partly unified. My opinion is that they are not really scripts, but represent orthographic conventions added on top
of the language+script tuple, and that, in the context of BCP 47 locales, they should be remapped to variant
subtags for these orthographic conventions. However, this still represents a challenge for existing renderers and
librarians' standards, which still don't fully and correctly implement BCP 47, so these "variant" scripts are just
legacy script codes used for technical reasons. (They work in a way quite similar to ISO 639-5 language families
and ISO 639-3 languages with "collection" scope.)
* "M" for codes representing multiple scripts that can be used simultaneously in the same text with the same
language and orthographic convention ("Jpan", "Kore"). It is expected that more codes in this scope will be needed
for bibliographic classification or localization purposes (unless the scripts that borrow subsets of other scripts
are extended to include their own local copy of the borrowed characters, as has been done in Latin for characters
borrowed from Greek or Cyrillic and reencoded as "Latin"). These codes are not very useful for renderers except as
legacy technical codes, but may still be needed by librarians for classification purposes (they work in a way quite
similar to ISO 639-3 languages with "macrolanguage" scope). In fact, they should be replaced by the list of script
codes they actually represent, if possible.
* "N" for other notational scripts that require other information than just the encoded text to produce meaningful
content, or that can't be converted simply into normal text for a language without the help of complex conversion
rules and possibly according to user preferences in their locale ("Brai", "Zmth", "Zsym"). These codes can become
significant as "scripts" if they are used along with other codes like a language code (in the subtag of a locale
code), but become mapped to other scripts according to the convention.
* "P" for all private-use codes ("Qaaa".."Qabx") : they are not meant for global interchange but specific to local
implementations within well-defined and restricted domains for only some users. They should not be part of,
referenced directly by, or needed by any international standard, and those standards could include restrictions
forbidding their use completely in some or all of the defined interoperable interfaces.
* "S" for special codes needed for some technical applications where none of the scopes above can be significant or
when no other codes can be precisely determined (like "Zxxx", "Zyyy", "Zzzz", and... "Ziii": "code for characters
with contextually inherited script", "codet pour caractères à écriture héritée du contexte"). They may be used and
referenced in international technical standards (like Unicode). The ISO 15924 standard should register and exhibit
the standards needing these special codes and justifying their existence, by referencing or linking to the
appropriate documents defining them precisely and defining their meaning and usage policy.
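
As an illustration only, the proposed "scope" field could be modelled by a small Python function like the one below.
The function name `scope_of` is hypothetical, the sets contain just the example codes cited in the list above,
"Ziii" is the code I am proposing (it is not allocated in ISO 15924), and treating every code not listed as scope
"A" is a simplifying assumption of this sketch.

```python
# Example codes taken from the list above; a real registry would carry the
# scope as an explicit field for every code instead of these hard-coded sets.
VARIANT = {"Hans", "Hant", "Latf", "Latg", "Syre", "Syrj", "Syrn"}
MULTIPLE = {"Jpan", "Kore"}
NOTATIONAL = {"Brai", "Zmth", "Zsym"}
SPECIAL = {"Zxxx", "Zyyy", "Zzzz", "Ziii"}  # "Ziii" is only proposed here

def scope_of(code):
    """Return the proposed scope letter for a four-letter script code."""
    if code in VARIANT:
        return "V"
    if code in MULTIPLE:
        return "M"
    if code in NOTATIONAL:
        return "N"
    if code in SPECIAL:
        return "S"
    if len(code) == 4 and code.isalpha() and "Qaaa" <= code <= "Qabx":
        return "P"                          # private-use range
    return "A"                              # assumed default: ordinary script
```

An interface meant for interchange could then refuse any code whose scope is "P", which is exactly the restriction
that "Qaac" and "Qaai" would fail today.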

In this context, the "Zxxx" code for unwritten documents (with scope "S") needs a more formal definition, because
its existence cannot be justified by external standards but only by the ISO 15924 standard itself. It may not be
precise enough for effective applications that would need more precise codes: a code for "aural"/"vocal"
documents, a code for "photographic" documents, a code for drawings and diagrams, a code for artistic "graphic"
documents which have no reading at all but just interpretations, a code for other contents that can't even be
reproduced correctly on printed paper (such as architectural designs, artistic objects, ...).

The absence of any content (textual or not) would also merit its own ISO 15924 code (there's ambiguity in this case
between "Zxxx", "Zyyy", and "Zzzz") to encode that NO script is actually present because there's simply no content
at all; it could be anything if the content is added later and such addition is permitted (in that case this
content will have another code):
* Using "Zxxx" just encodes that if there's content, it cannot be text, but it does not really specify that other
contents do not exist at all;
* Using "Zyyy" for undetermined scripts is also inappropriate (look at the definition of the "Common" script type
in Unicode, which is aliased to it) because it indicates that there can exist some encoded text;
* Using "Zzzz" (to which "Unknown" is mapped) does not really encode the total absence of text, just the fact that
no text could be encoded correctly (for example the text could contain characters still not encoded in Unicode, and
can be represented only using private-use characters or other means like graphics or facsimiles, those characters
being possible candidates for encoding in a future script with its own code in ISO 15924 and then with its own
characters in Unicode);
* Some applications (or other users than me) may have their own interpretation and may opt to define their own
policy about using one of the three codes above.
* For this case a code like "Zero" would be convenient to encode such emptiness and total absence of content
(textual or not).


This archive was generated by hypermail 2.1.5 : Fri Jan 23 2009 - 19:37:54 CST