L2/03-389

Comments on Public Review Issues

The sections below contain comments received on the open Public Review Issues as of October 27, 2003.

9 Bengali Reph and Ya-Phalaa

Only document L2/03-388 was received during this period.

13 Unicode 4.0.1 Beta

Two long reports of errors and suggestions for Unihan.txt were received and passed directly on to John Jenkins for evaluation and correction as applicable. On the reporting form nothing else was received.

One of the Unihan documents is here, containing a bunch of new references to kCowles.

The other is here, containing some HKSCS references.

15 Changing General Category of Braille Patterns to "Letter Other"

** See also document L2/03-335 from Jack Maartman **

Date/Time: Mon Oct 6 18:30:42 EDT 2003
Contact: markus.scherer -at- us.ibm.com

Changing Braille to Lo appears to be in conflict with what PDUTR #31 says about UTC commitments on identifiers. Quote from PDUTR #31:

"(In particular, the consortium committed to not allocating characters suitable for identifiers in the range 2190..2BFF, which is being used by XML 1.1.)"

The current Candidate Recommendation for XML 1.1 excludes this range from NameStartChar, see http://www.w3.org/TR/xml11/#sec2.3

I suppose that Braille being Lo would simply mean that Braille would still not be allowed in XML 1.1 names - i.e., XML 1.1 should be ok because it defines name chars excluding the range around Braille instead of defining syntax chars including this range. However, PDUTR #31 suggests that there was a stronger commitment between UTC and W3C.
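For reference, the exclusion can be checked against the NameStartChar ranges in the XML 1.1 Candidate Recommendation. The following sketch transcribes those ranges (the helper name is illustrative); note the gap between U+218F and U+2C00, which covers U+2190..U+2BFF and therefore the Braille block U+2800..U+28FF:

```python
# NameStartChar ranges from the XML 1.1 Candidate Recommendation's
# Name production. The gap between U+218F and U+2C00 excludes
# U+2190..U+2BFF, including the Braille Patterns block.
NAME_START_RANGES = [
    (0x3A, 0x3A),            # ':'
    (0x41, 0x5A),            # A-Z
    (0x5F, 0x5F),            # '_'
    (0x61, 0x7A),            # a-z
    (0xC0, 0xD6), (0xD8, 0xF6), (0xF8, 0x2FF),
    (0x370, 0x37D), (0x37F, 0x1FFF),
    (0x200C, 0x200D), (0x2070, 0x218F),
    (0x2C00, 0x2FEF), (0x3001, 0xD7FF),
    (0xF900, 0xFDCF), (0xFDF0, 0xFFFD),
    (0x10000, 0xEFFFF),
]

def is_name_start(ch):
    """True if ch may start an XML 1.1 Name."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in NAME_START_RANGES)

print(is_name_start('\u2800'))  # Braille pattern: not a NameStartChar
```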

Best regards, markus

Date/Time: Tue Oct 7 00:39:08 EDT 2003
Contact: asmus -at- unicode.org

Re Issue # 15 Braille

I think the fact that braille codes can stand for punctuation, as well as letters and digits, in an unpredictable way (unpredictable unless one has the language- and domain-based mapping information) makes the assignment of "Lo" for Braille very problematic.

When Braille appears inside regular text, it's not a letter of that text, but a symbol. In an all-Braille data stream, one must use parsing rules based on the particular mapping in effect, in other words, apply a higher level protocol anyway.

From: Marco Cimarosti Date: 2003-10-07 06:02:39 -0700
Subject: Braille is not bidi neutral!

I (Marco Cimarosti) wrote:

> Jony Rosenne wrote:
> > I don't remember whether Hebrew Braille is written RTL or LTR.
> > Braille is always LTR, even for Hebrew and Arabic.

Hwæt! I noticed only now that the Bidirectional Category of braille characters is "ON - Other neutrals"!

AFAIK, that is completely broken!

A run of braille text *must* remain LTR in a bidi context, because braille is never RTL. This "ON" bidi category makes it impossible to correctly encode, e.g., a manual of Braille in Hebrew or Arabic, because the braille runs would get swapped by the Bidirectional Algorithm.

_ Marco

Date/Time: Tue Oct 7 13:56:43 EDT 2003
Contact: asmus -at- unicode.org

At 10:32 AM 10/7/03 +0530, jon -at- spin.ie wrote: The only justification mentioned so far for changing Braille from So to Lo is to be able to use Braille in identifiers. I'm not sure why someone would want to use Braille in this way; for a start, how would these identifiers be translated into Braille?

Braille identifiers only make sense when the whole source file has been translated to Braille. However, the parsing semantics applied to it should then be determined by the properties of the original characters (before applying the Braille mapping). If one does want to work directly with a Braille-transcoded stream, then such a system must support *dynamic* property assignments. That's something that's outside the scope of the Unicode Standard.

In conclusion, it seems that the correct set of *default* properties for Braille would be determined by the needs of inserting Braille strings into other text (for educational manuals and similar specifications).

As Marco has pointed out, that means BIDI = L, and I believe it also means GC = So, with other properties assigned as they are for other characters that share BIDI=L and GC=So.
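These recommended defaults can be checked against a character database. A minimal sketch using Python's unicodedata module, whose bundled data reflects a later Unicode version in which GC=So was retained and the Braille bidi class was set to L:

```python
import unicodedata

# Default properties of a Braille pattern (U+2800 BRAILLE PATTERN BLANK),
# per Python's bundled Unicode Character Database.
print(unicodedata.category('\u2800'))       # general category: 'So'
print(unicodedata.bidirectional('\u2800'))  # bidi class: 'L'
```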

Date/Time: Wed Oct 8 07:59:42 EDT 2003
Contact: jcowan -at- reutershealth.com

On Issue #15, I recommend leaving the general category alone, but changing the bidi type to L, since Braille is strongly LTR.

Date/Time: Wed Oct 15 13:48:51 EDT 2003
Contact: fyergeau -at- alis.com

Because of its explicitly pointing to XML 1.1, this issue has been examined by both the i18n WG and the XML Core WG of the W3C.

The i18n WG did not think this issue is bound by the promise to not allocate non-symbols in a block in the U+2xxx range.

The XML Core WG (owner of XML 1.1) will not change the identifier syntax of XML 1.1, should the Braille change be effected. The WG notes that the Braille characters would then be the only characters that Unicode calls letters that XML 1.1 wouldn't allow in identifiers. Consequently, the WG would prefer the change not to occur.

Regards,

François Yergeau
for the i18n and XML Core WGs

Date/Time: Fri Oct 17 11:46:32 EDT 2003
Contact: easjolly -at- ix.netcom.com

Issue 15 Changing General Category of Braille Patterns to "Letter Other"

It is my opinion that the Braille Patterns are symbols.

In the first place, the names of the various Patterns describe their appearance and do not imply any particular use.

Moreover, there are arbitrarily many systems for mapping the Braille Patterns, more commonly called braille cells, to print. Just within English Braille American Edition--the most widely-used system in the US--we find that some cells are mapped to letters, some to other kinds of characters such as punctuation marks, and some to strings, e.g. "the" or "ch", that don't correspond to any single Unicode character.

There are two other unusual aspects of these systems. First, some cells are mapped to markup indicators that don't correspond to any Unicode character (or even string of Unicode characters) at all. Second, the same braille cell often has several unrelated, context-dependent uses in the same system such as representing a string of letters in one context and a punctuation mark in another.

Another thing to be aware of is that there is often switching from one system to another within a single braille document--say from a system for literary text to one for mathematics or music.

At the moment there don't seem to be any official systems which utilize all 256 Unicode Braille Patterns. Most common are the systems for the six-dot braille subset, which comprises 64 cells; eight-dot braille as currently used only adds mappings for 30 or so of the additional Patterns.

FYI, existing output devices for raised braille--embossers and braille displays--don't support Unicode but, rather, a mapping from ASCII codes to braille cells. So the first likely use of Unicode for Braille Patterns, as others have pointed out, is for embedding examples of braille in texts written in one of the natural languages. Even this use is uncommon at the moment because most braille fonts are based on the so-called North American ASCII Braille "codepage".
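The ASCII-to-cell mapping mentioned lines up conveniently with the Unicode Braille Patterns block: in the common North American ASCII Braille order, the table index equals the 6-dot bit pattern, so transcoding is a single offset. A sketch (the function name is illustrative, and input is assumed to contain only the 64 table characters):

```python
# North American ASCII Braille order: the index of each character equals
# its 6-dot bit pattern, so the Unicode code point is U+2800 + index.
ASCII_BRAILLE = " A1B'K2L@CIF/MSP\"E3H9O6R^DJG>NTQ,*5<-U8V.%[$+X!&;:4\\0Z7(_?W]#Y)="

def ascii_to_unicode_braille(s):
    """Transcode ASCII braille to Unicode Braille Patterns (sketch)."""
    return ''.join(chr(0x2800 + ASCII_BRAILLE.index(c)) for c in s.upper())

print(ascii_to_unicode_braille("A"))  # U+2801, dots-1
```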

Susan Jolly www.dotlessbraille.org

Date/Time: Thu Oct 23 07:26:46 EDT 2003
Contact: kentk -at- cs.chalmers.se

PRI 15, Braille

Since it appears that the suggestion to change their GC to Lo derives from my suggestion to let them have a default collation with weights at level 1:

I still stand by that suggestion, but that does not imply that I support the change of GC for Braille characters. Indeed, I do not support the latter.

16 Update to UAX #29 Text Boundaries

Date/Time: Fri Oct 24 06:08:06 EDT 2003
Contact: jshin -at- mailaps.org

Re : UAX #29 : Text boundaries

Section 3 has the following:

-------------quote--------------
As far as a user is concerned, the underlying representation of text is not important, but it is important that an editing interface present a uniform implementation of what the user thinks of as characters. Grapheme clusters commonly behave as units in terms of mouse selection, arrow key movement, backspacing, and so on. When this is done, for example, and an accented character is represented by a combining character sequence, then using the right arrow key would skip from the start of the base character to the end of the last combining character.

However, in some cases editing a grapheme cluster element by element may be the preferred way. For example, a system might have backspace delete by code point, while the delete key may delete an entire cluster. Moreover, there is not a one-to-one relationship between grapheme clusters and keys on a keyboard. A single key on a keyboard may correspond to: a whole grapheme cluster, a part of a grapheme cluster, or a sequence of more than one grapheme cluster.
---------quote------

For some scripts, depending on the context, some input methods would overload a single key (e.g., backspace) instead of using two key bindings. For instance, in most Korean input methods, backspace during syllable formation deletes a Jamo (a part of the grapheme cluster being formed), while once a syllable is committed, backspace works per syllable. In addition, most Korean input method editors (on Win32, Unix, and Mac OS) have an option to configure the behavior of the arrow keys and the delete/backspace keys. In a recent thread on the Unicode mailing list, a few people wrote that similar behavior is expected of 'input method editors' for Biblical Hebrew (or vowelized Hebrew), and I believe the same is true of Indic scripts.

Besides, in some applications such as language analyzers (for search engines, natural language processing, etc.), the grapheme cluster shouldn't be considered atomic. For example, a Korean lexical analyzer has to decompose even Jamos into simpler Jamos (that is, NFD is not sufficient).
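A minimal illustration of the Jamo point, using Python's unicodedata: NFD decomposes a precomposed Hangul syllable into conjoining Jamo, but goes no further for a complex Jamo such as U+1101 HANGUL CHOSEONG SSANGKIYEOK:

```python
import unicodedata

# NFD splits a precomposed syllable into its conjoining Jamo...
decomposed = unicodedata.normalize('NFD', '\ud55c')  # HANGUL SYLLABLE HAN
print([hex(ord(c)) for c in decomposed])             # three Jamo code points

# ...but a complex Jamo has no canonical decomposition, so a lexical
# analyzer needing simpler Jamos must go beyond NFD.
print(unicodedata.normalize('NFD', '\u1101') == '\u1101')  # unchanged
```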

In summary, I think a bit more emphasis should be put on the fact that there are various application needs and user expectations when it comes to defining what end-users regard as grapheme clusters.

Date/Time: Mon Oct 27 16:32:18 EST 2003
Contact: verdy_p -at- wanadoo.fr

The proposed update to UAX#29 contains this text:

Apostrophe is another tricky case. Usually considered part of one word ("can't", "aujourd'hui") it may also be considered two ("l'objectif"). Also, one cannot easily distinguish the cases where it is used as a quotation mark from those where it is used as an apostrophe, so one should not include leading or trailing apostrophes. In some languages, such as French and Italian, tailoring it to break words when the character after the apostrophe is a vowel may yield better results in more cases. This can be done by adding a rule 5a: Break between hyphens and vowels (French, Italian): hyphens ÷ vowels (5a)

However, in French the situation is a bit more complex, as there is the case of a leading h which may or may not be "aspiré" (never pronounced, and admitting a vocal link with the previous consonant). When it is not, the article can/must be elided with an apostrophe. These examples all contain word breaks after the apostrophe:

l'habit; d'habit
singular "un habit" (the "n" is pronounced [n])
plural "les habits" (the "s" is pronounced [z])
m'habiller; t'habiller; s'habiller
l'helvète; d'helvète
singular "un helvète" (the "n" is pronounced [n])
plural "les helvètes" (the "s" is pronounced [z])
l'heur; d'heur
singular "un heur" (the "n" is pronounced [n])
plural "les heurs" (the "s" is pronounced [z])
l'hiatus;
singular "un hiatus" (the "n" is pronounced [n])
plural "les hiatus" (the "s" is pronounced [z])
l'hier; d'hier;
l'honneur; d'honneur;

etc...

This does not affect the cases where the leading h is not pronounced; in that case there is no elision of the previous article (or pronoun, if the word is a verb).

So in French we also have the additional word break rule:

hyphens ÷ LatinLetterH

This case is not documented...
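The tailored break rule under discussion (rule 5a plus the additional French h rule) can be sketched as a regular expression. This only illustrates the break positions; it is not a UAX #29 tailoring mechanism, and the helper name and vowel list are illustrative:

```python
import re

# Break after an apostrophe when the next letter is a vowel or an "h"
# (rule 5a plus the proposed French addition). Zero-width lookaround
# marks the break position without consuming characters.
BREAK_AFTER_APOSTROPHE = re.compile(
    r"(?<=')(?=[aeiouyh\u00e0\u00e2\u00e9\u00e8\u00ea\u00eb\u00ee\u00ef\u00f4\u00f9\u00fb])",
    re.IGNORECASE)

def word_segments(text):
    """Split text at the tailored break positions (sketch)."""
    return BREAK_AFTER_APOSTROPHE.split(text)

print(word_segments("l'habit"))  # break after the apostrophe
print(word_segments("can't"))    # no break: 't' is not a vowel or 'h'
```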

17 UTS #18 Unicode Regular Expressions

On the reporting form nothing was received.

18 Draft UTR #23 The Unicode Character Property Model

On the reporting form nothing was received.

19 Proposed Draft UTR #30 Character Foldings

** See also document L2/03-341 from Rick McGowan **

Date/Time: Sun Oct 12 11:48:59 EDT 2003
Contact: locales -at- geez.org

These are comments for TR 30:

As per the scope and introduction sections, which describe the utility of folding for "loose" and "fuzzy" searches, basic syllabic folding should be supported. By "syllabary" I refer to an "open alphasyllabary" such as Cherokee, Canadian Aboriginal Syllabics, Ethiopic, etc., where a single symbol represents a CV (consonant + vowel) pattern.

Very briefly, the fold is conceptually against the vowel component. Vowels often appear in the wrong form in electronic text due largely to typographical errors (from a myriad of incompatible input methods), different spelling conventions, and grammatical word inflections.

Basic syllabic folding converts the target syllables in a string into a reference form. While a reference form is generally considered to be the first form (the graphical base) in a syllabary's matrix for the syllabic series of a consonant "C", a vowel-less form has been found to be more practical in folding. For Ethiopic this form is consistently in the 6th position; it may move around in other syllabaries.

This note is intended to give just enough detail to convey the concept. If you are open to including basic syllabic folding, I can certainly go on at length about it and provide tables, papers, software, etc.
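For Ethiopic, the fold to the 6th (vowel-less) position can be computed arithmetically, since each syllabic series occupies eight consecutive code points in the main block. A simplified sketch that ignores irregular and extended series, and leaves punctuation and digits (U+1360 and above) alone:

```python
# Fold Ethiopic syllables to the 6th ("sadis", vowel-less) form, which sits
# at offset 5 within each series of 8 code points. Simplified sketch only:
# irregular series and the extended blocks are not handled.
def fold_ethiopic(text):
    out = []
    for ch in text:
        cp = ord(ch)
        if 0x1200 <= cp < 0x1360:  # syllables only; punctuation untouched
            cp = 0x1200 + ((cp - 0x1200) // 8) * 8 + 5
        out.append(chr(cp))
    return ''.join(out)

print(fold_ethiopic('\u1218'))  # U+1218 ETHIOPIC SYLLABLE MA -> U+121D ME
```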

20 Proposed Draft UTR #31 Identifier and Pattern Syntax

Date/Time: Sun Oct 12 15:54:10 EDT 2003
Contact: locales -at- geez.org

Concerning UTR 31: Section 4.0 / 4.1

I'm a little confused about context here. U+1361 should be matched by the regex metacharacter \s as a white space element. But U+1361 appearing in a pattern, as part of the pattern, should only match U+1361 and not other white space characters. Hence in a pattern context, U+1361 should be a Pattern_White_Space.
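Current regular-expression engines illustrate the distinction being raised here. In Python's re, for instance, \s does not treat U+1361 as white space today, while a literal U+1361 in a pattern matches only itself:

```python
import re

# U+1361 ETHIOPIC WORDSPACE is not currently matched by \s ...
print(re.match(r'\s', '\u1361'))              # None

# ... but as a literal in a pattern it matches itself.
print(re.search('\u1361', 'word\u1361word'))  # a match object
```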

21 Changing U+200B Zero Width Space from Zs to Cf

Date/Time: Mon Oct 6 18:36:39 EDT 2003
Contact: markus.scherer -at- us.ibm.com

I personally welcome this proposed change. In working with the ECMAScript working group and reviewing Perl and other API's behavior for "white space", I have seen several times that "white space" is usually based on Z or Zs characters, sometimes subtracting what the author saw as inappropriate. Although there is a real White_Space property now, I believe that it would serve to reduce confusion among implementers to make this change. (Even after pointing out the White_Space property to the ECMAScript committee, I still had to explain why it is not a superset of Z, and had to use the difference for motivation to adopt White_Space.)
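The effect of the proposed change can be observed in Python, whose bundled Unicode data postdates it: U+200B is Cf and not White_Space, while an ordinary space character such as U+00A0 remains Zs:

```python
import unicodedata

# U+200B ZERO WIDTH SPACE: general category Cf and not White_Space,
# so str.isspace() is False for it ...
print(unicodedata.category('\u200b'), '\u200b'.isspace())

# ... while U+00A0 NO-BREAK SPACE stays Zs and white space.
print(unicodedata.category('\u00a0'), '\u00a0'.isspace())
```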

Best regards, markus

22 Collation Mechanism for Syllabic Scripts

Date/Time: Sun Oct 12 12:50:49 EDT 2003
Contact: locales -at- geez.org

Concerning PR #22

The focus seems to be on Hangul which, if I'm not mistaken, is not a syllabary in the same sense as other syllabaries. Hangul syllables use vowels as separate code points, as I understand it.

So, some clarification is needed as to what type of syllabary the collation rules apply to. Also, "long" and "short" syllables are referred to but not defined; this should also be clarified.

23 Terminal Punctuation Characters

Date/Time: Sun Oct 12 15:44:48 EDT 2003
Contact: locales -at- geez.org

Concerning PR #23:

I was surprised to see the SYRIAC END OF PARAGRAPH and ETHIOPIC PARAGRAPH SEPARATOR as Sentence_Terminal characters. Is it implied that these characters will terminate a sentence in lieu of a FULL STOP? I will review whether this occurs in Ethiopic.

U+1361 ETHIOPIC WORDSPACE is listed as a Terminal_Character. To what degree "Terminal"? The character will terminate a word in Ethiopia but not a subsection of a sentence. In Eritrea the modern practice is to use U+1361 in place of a comma. Hence there is a locale sensitivity, but the Eritrean use should be seen as the exception and not the rule.

Date/Time: Thu Oct 23 17:13:29 EDT 2003
Contact: cowan -at- ccil.org

#23 Terminal Punctuation Characters lists U+037E as terminal but not sentence-terminal. Even though it is canonically equivalent to semicolon, it is *functionally* a sentence terminator. I think this should be changed.
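The canonical equivalence mentioned can be verified directly: U+037E GREEK QUESTION MARK has a singleton canonical decomposition to U+003B SEMICOLON and is excluded from recomposition, so both NFD and NFC replace it with the plain semicolon:

```python
import unicodedata

# U+037E GREEK QUESTION MARK normalizes to U+003B SEMICOLON under
# both NFD and NFC (singleton decomposition, excluded from composition).
print(unicodedata.normalize('NFC', '\u037e') == ';')
print(unicodedata.normalize('NFD', '\u037e') == ';')
```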

 

[end of document]