US comments to the ballot of the 3rd FCD 14651 – International string ordering
October 10, 1999
The US votes NO on the 3rd
FCD 14651, but will gladly change the vote to YES, if the comments below are
accommodated.
p. 1, NOTE 2. This note
references the Unicode Standard Version 2.1, but
the appropriate reference occurs neither in the Normative
References
nor in the Bibliography. We suggest that the appropriate
reference for
the Unicode Standard, Version 2.1, be added to the
Bibliography.
p. 4, definition 4.16. This
definition is incomplete in the text and must
be fixed.
p. 5, NOTE 1. This note
refers to Unicode normalization, but the appropriate
reference occurs neither in the Normative References nor in
the
Bibliography. We suggest that the appropriate reference for
Unicode Technical Report #15, Unicode Normalization, be added
to
the Bibliography, and
a more complete reference be added at this
note.
p. 9, BNF syntax. The
"line_completion" tokens in the production rules
for order_start, order_end, reorder_section_after,
reorder_after,
and reorder_end should be removed. They are redundant with
the
line_completion token in the production rule for
tailoring_line.
p. 14, NOTE. This note
refers to the Unicode collation algorithm, but the
reference occurs neither in the Normative References nor in
the
Bibliography. We
suggest that the appropriate reference for
Unicode Technical Report #10, Unicode Collation Algorithm, be
added to
the Bibliography, and a more complete reference be added at
this
note.
To match cultural
expectations for a correct Thai sort, the
following changes should be
made to the Thai entries in the
Common Template Table.
Incidentally, these changes will bring
the Common Template Table into
sync with the principles explained
in Annex B.4.
a. The Thai vowel indicator
U+0E47 THAI CHARACTER MAITAIKHU
should be treated exactly
like the Thai tone marks, rather than
being given a primary weight
as for other Thai vowels. This implies
that:
i. collating symbol <D0E47> for THAI CHARACTER MAITAIKHU be
added just before the collating symbol <D0E46>.
ii. a weight entry for THAI CHARACTER MAITAIKHU be added:
<U0E47> IGNORE;<D0E47>;<MIN>;<U0E47>
just before <U0E46>.
iii. the current weight entry for THAI CHARACTER MAITAIKHU be
removed from the table.
b. U+0E33 THAI CHARACTER
SARA AM and U+0EB3 LAO VOWEL SIGN AM should
be treated as units, rather
than as combinations of the weights for
the NIKHAHIT and the vowel
SARA AA. This implies that:
i. the current weight entry for THAI CHARACTER SARA AM be
changed to
<U0E33> <SE20>;<BASE>;<MIN>;<U0E33> % THAI CHARACTER SARA AM
ii. the current weight entry for LAO VOWEL SIGN AM be changed to
<U0EB3> <SE4F>;<BASE>;<MIN>;<U0EB3> % LAO VOWEL SIGN AM
c. The change for MAITAIKHU
impacts the autogenerated primary weight
symbols, so the table should
be regenerated to correct the resulting
sequence of primary weight
symbols.
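As an illustration of the entries proposed above, here is a minimal Python sketch that splits a weight entry into its four levels. It assumes the layout seen in the quoted lines (a character symbol, whitespace, four weights separated by ";", and an optional "%" comment); the function name parse_weight_entry is hypothetical and not part of 14651.

```python
def parse_weight_entry(line):
    """Split '<Uxxxx> w1;w2;w3;w4 % comment' into (symbol, [w1, w2, w3, w4]).

    Assumes '%' starts a comment and ';' separates levels, as in the
    entries quoted in this comment. Hypothetical helper, for illustration.
    """
    body = line.split("%", 1)[0].strip()      # drop the trailing comment, if any
    symbol, weights = body.split(None, 1)     # character symbol, then the weights
    return symbol, [w.strip() for w in weights.split(";")]


# Proposed entry for MAITAIKHU: ignorable at the first level,
# distinguished only from the second level on, like the tone marks.
maitaikhu = "<U0E47> IGNORE;<D0E47>;<MIN>;<U0E47>"
# Proposed entry for SARA AM: a single primary weight, treated as a unit.
sara_am = "<U0E33> <SE20>;<BASE>;<MIN>;<U0E33> % THAI CHARACTER SARA AM"

m_symbol, m_levels = parse_weight_entry(maitaikhu)
s_symbol, s_levels = parse_weight_entry(sara_am)

print(m_symbol, "primary-ignorable:", m_levels[0] == "IGNORE")
print(s_symbol, "primary weight:", s_levels[0])
```

The point of the sketch is only that the first field of the MAITAIKHU entry becomes IGNORE, whereas SARA AM keeps a single primary weight symbol.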
The third-level weights for several archaic Greek letters
that have no case pairs in the Unicode 2.1 repertoire were
misassigned to <MIN> instead of <CAP>. Those should be corrected.
(Note that the lowercase counterparts of those letters were added
by Amendment 30 to 10646 and will appear, appropriately weighted,
in future revisions to the 14651 Common Template Table, so the
uppercase forms currently in the table should be correctly
weighted.)
Affected characters are:
<U03DC> GREEK LETTER DIGAMMA
<U03DA> GREEK LETTER STIGMA
<U03DE> GREEK LETTER KOPPA
<U03E0> GREEK LETTER SAMPI
As with the four Greek characters, one Cyrillic character with
no case pair should have its third-level weight corrected from
<MIN> to <CAP>:
<U04C0> CYRILLIC LETTER PALOCHKA
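A check of this kind could be automated when the table is generated. The following Python sketch is one possible approach, not the method used to build the table: it flags any character that the Unicode character database classifies as an uppercase letter (general category Lu) but whose third-level weight is not <CAP>. The dictionary records only the third-level weights as described in this comment; the lowercase alpha entry is an assumed control case, and the function name is hypothetical.

```python
import unicodedata

# Third-level weights as this comment describes them in the 3rd FCD table.
# U+03B1 is an assumed control: a lowercase letter weighted <MIN>.
third_level = {
    0x03DC: "<MIN>",  # GREEK LETTER DIGAMMA
    0x03DA: "<MIN>",  # GREEK LETTER STIGMA
    0x03DE: "<MIN>",  # GREEK LETTER KOPPA
    0x03E0: "<MIN>",  # GREEK LETTER SAMPI
    0x04C0: "<MIN>",  # CYRILLIC LETTER PALOCHKA
    0x03B1: "<MIN>",  # GREEK SMALL LETTER ALPHA (control, assumed)
}


def misweighted_capitals(weights):
    """Return code points of category Lu whose third-level weight is not <CAP>."""
    return sorted(
        cp for cp, weight in weights.items()
        if unicodedata.category(chr(cp)) == "Lu" and weight != "<CAP>"
    )


flagged = misweighted_capitals(third_level)
for cp in flagged:
    print(f"<U{cp:04X}> {unicodedata.name(chr(cp))}: expected <CAP>")
```

The sketch relies on the general category rather than on the existence of a case pair, which is why it catches uppercase letters whose lowercase forms are absent from the repertoire.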
The following two lines at the end of the table:
<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BLANK>;<MIN>;<U4E00>..<U9FA5> % Han
% <UAC00>..<UD7A3> <SAC00>..<SD7A3>;<BLANK>;<MIN>;<UAC00>..<UD7A3> % Hangul
have an undefined symbol <BLANK> in them. That should be
corrected to use the symbol <BASE>, which is otherwise used in
that position in the table:
<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BASE>;<MIN>;<U4E00>..<U9FA5> % Han
% <UAC00>..<UD7A3> <SAC00>..<SD7A3>;<BASE>;<MIN>;<UAC00>..<UD7A3> % Hangul
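An undefined symbol such as <BLANK> is easy to detect mechanically. The Python sketch below is a minimal illustration under stated assumptions: the set of defined second- and third-level symbols is a small assumed subset (<BASE>, <MIN>, <CAP>) rather than the full list declared in the table, "%" is taken to start a comment, and the function name undefined_symbols is hypothetical.

```python
# Assumed subset of the second- and third-level symbols declared in the table.
DEFINED = {"<BASE>", "<MIN>", "<CAP>"}


def undefined_symbols(entry, defined=DEFINED):
    """Return second/third-level weights in a table line that are not defined.

    Lines that are empty or entirely commented out with '%' yield [].
    """
    body = entry.split("%", 1)[0].strip()     # drop the comment part
    if not body:
        return []
    _symbol, weights = body.split(None, 1)    # character (range), then weights
    levels = [w.strip() for w in weights.split(";")]
    return [w for w in levels[1:3] if w not in defined]


before = "<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BLANK>;<MIN>;<U4E00>..<U9FA5> % Han"
after = "<U4E00>..<U9FA5> <S4E00>..<S9FA5>;<BASE>;<MIN>;<U4E00>..<U9FA5> % Han"

print(undefined_symbols(before))  # the line as balloted
print(undefined_symbols(after))   # the line as corrected
```

Note that a line commented out with a leading "%" is skipped by this sketch, so the Hangul line would need to be checked separately if it is ever activated.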
The U.S. would strongly
object to the inclusion of the B.5 tailorings
for Cyrillic into the Common
Template Table for the following
reasons:
1. To do so would very significantly complicate the
autogeneration
of the Common Template Table, which will be a maintenance and
quality problem for future editions of 14651 that add more
characters.
2. Adding this material to the Common Template Table would
introduce baseform + combining mark weightings into the
CTT, something that is currently not required, but which
would significantly increase the complexity of implementations
of the
table before tailorings. (That would be an additional
implementation penalty to be carried around by all
implementations,
including those which are not primarily concerned with
Cyrillic.)
3. The actual tailorings required for Russian are considerably
fewer than those indicated in Annex B.5. Common
Cyrillic requires only slightly more. Only a full tailoring
for all Cyrillic extensions requires addition of all
the information of Annex B.5.
Our preferred solution for this issue is to retain B.5 as an
annex describing Cyrillic tailoring, but to divide it into three
parts, showing the Russian tailoring, the Common Cyrillic
(i.e. Serbian, Macedonian, Bulgarian, Byelorussian, Ukrainian)
tailoring, and the extended Cyrillic tailoring. This will make
it clear that the tailoring required for Russian, for example,
is no more formidable than the Canadian tailoring of Annex B.1.
The U.S. objects to the inclusion of Annex E, which is an
attempt to reinject a dependency between 14651 and PDTR 14652,
from which most of the text for Annex E derives.
The inappropriateness of the
addition of this material here is
illustrated by the fact that
it includes a number of editorial
and other errors that the
U.S. committee has commented on in
the context of ballot
comments on PDTR 14652. By replicating
that material into an Annex
in 14651, those errors would need
to be corrected once again
in this text, with allowances
for the edited-down version
of the text that appears in Annex E.
Furthermore, the suggestions
made in Annex E change the
syntax of at least one
keyword in ways incompatible with
that described in the
normative BNF of Section 6.3 of 14651
(viz. order_start). This
might be appropriate in PDTR 14652, but
is not appropriate in an
informative annex to 14651 itself, since
it is more likely to
confuse than to elucidate there.
This problem is not fixed
simply by labelling Annex E
"informative".
Annex E should be removed entirely, with the
focus being on the
correction of its corresponding content in
PDTR 14652, rather than on
trying once again to hitch 14652's
wagon to 14651.
If WG20 cannot reach
consensus regarding the removal of
Annex E, the U.S. delegation
will provide a long list of
suggested editorial changes
to make its inclusion less
objectionable in the context
of 14651.
p. iv, 2nd paragraph. result
==> resultant
p. 1, 2nd paragraph.
"two characters strings" ==> "two strings"
p. 4, definition 4.8. remove
extraneous "-" in definition
p. 4, section 5, first
paragraph. "(followed by exact location of
syntax)" is apparently incomplete. This should
presumably
constitute a reference to Amendment 9, which should then also
be included in the normative references for 14651.
p. 5, 1st paragraph. Remove
extra quotation mark at end of the
paragraph.
p. 7, section 6.2.2.1.
Correct the line break and style for this
section header.
p. 13, NOTE to I6. I1 and I2
should be corrected to I4 and I5,
respectively.
p. 15, NOTE. "too long
comments" ==> "long line lengths"