L2/04-391R
The sections below contain comments received on the open Public Review Issues as of November 12, 2004, since the previous cumulative document was issued prior to UTC #100 (August 2004).
Contact: From: Christopher Fynn
Date/Time: 2004.10.09 18:01
Subject: Tibetan Terminal Punctuation
Ken
I see the following Tibetan characters have been defined as "Terminal Punctuation":
U+0F08 TIBETAN MARK SBRUL SHAD
U+0F0D TIBETAN MARK SHAD
U+0F0E TIBETAN MARK NYIS SHAD
U+0F0F TIBETAN MARK TSHEG SHAD
U+0F10 TIBETAN MARK NYIS TSHEG SHAD
U+0F11 TIBETAN MARK RIN CHEN SPUNGS SHAD
U+0F12 TIBETAN MARK RGYA GRAM SHAD
I don't quite know what this "Terminal Punctuation" property implies, but Shad characters in Tibetan are "pauses", placed where a person would pause when reading a text. They *do not* indicate the end of a sentence or paragraph (i.e. they are more like a comma than a full stop). Shad characters only provide a line break or wrap opportunity according to very specific rules.
U+0F14 should probably be classified with the SHAD characters since it is used in place of U+0F0D...U+0F11 in certain types of text and has the same properties as those characters. It has no other function.
Please also note that
U+0F08 TIBETAN MARK SBRUL SHAD
U+0F12 TIBETAN MARK RGYA GRAM SHAD
specifically mark the *beginning* of a new topic or section, not the end of something.
regards
- Chris
Date/Time: Tue Oct 19 19:11:06 CST 2004
Contact: Ken Whistler
New text in UAX #31 stating a specification for caseless identifier matching treats case folding *normatively* as a part of the specification. Conformance to the UAX #31 conformance clauses requires use of the full case folding apparatus.
This calls into question a number of statements in the standard and in the UCD.html documentation regarding the declared status of case mappings and case foldings as *informative*, not *normative*.
I believe that introduction of a normative specification for caseless identifier matching now requires the UTC to go back and revisit its decisions regarding the status of case mappings and case foldings, to formally state their normative status. And the editorial committee needs to be tasked to comb the relevant portions of the text of the standard, to update claims made about case mappings accordingly.
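To make the dependency on case folding concrete, here is a minimal Python sketch of caseless identifier comparison. It is an illustration under stated assumptions, not the UAX #31 specification: `caseless_key` and `caseless_equal` are hypothetical helper names, and the sketch relies on Python's built-in `str.casefold()` for full case folding, wrapped in NFD normalization so that canonically equivalent spellings compare equal.

```python
import unicodedata


def caseless_key(identifier: str) -> str:
    """Build a comparison key: decompose, apply full case folding, decompose again.

    ASSUMPTION: this mirrors the general shape of canonical caseless
    matching; it is a sketch, not a conformance-tested implementation.
    """
    decomposed = unicodedata.normalize("NFD", identifier)
    folded = decomposed.casefold()  # full case folding, e.g. "ß" becomes "ss"
    return unicodedata.normalize("NFD", folded)


def caseless_equal(a: str, b: str) -> bool:
    """Return True if two identifiers match under the sketched folding."""
    return caseless_key(a) == caseless_key(b)
```

With this key, `caseless_equal("Straße", "STRASSE")` holds only because the full case folding data is consulted, which is the sense in which the matching specification depends on that data.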
Date/Time: Fri Aug 27 13:14:13 CDT 2004
Contact: Joerg Knappen
I have read the short proposal and my answer is YES, of course, the UTC should accept these two characters. Keeping LATIN SMALL LETTER AT and COMMERCIAL AT separate will keep internet protocols sane. Unifying the two will cause potential damage depending on the locale (guess of @ being capitalized and mapped to something strange ...)
--Joerg Knappen
Date/Time: Thu Sep 2 09:19:03 CDT 2004
Contact: Kent Karlsson
Hmm. The "COMMERCIAL AT" is in origin a ligature of a and d. (Just like & originally is a ligature of e and t.) It appears that the use of the "letter at" is not such a ligature.
The proposal document does not say what the phonetic value for that letter is (as used in Koalib). Perhaps it would be better to name it after that phonetic value (if a suitable "ASCII" approximation can be found) instead of (almost) referring in the name to the ad ligature (COMMERCIAL AT).
Date/Time: Thu Oct 28 05:12:16 CST 2004
Contact: Bev Cope
I, Bev Cope, am working on the computing aspect of some Sudanese languages. I know a linguist who has recently lived among the Koalib people in Khartoum in order to learn and study their language. He has the following observations:
@ (arabic 'ayn) is not part of the traditional set of Koalib consonants. Old people do not pronounce it, even in borrowings from Arabic. Of course, if you analyse the speech of younger speakers, you will find some @, because they pronounce the arabic words without integrating them into the "traditional" Koalib phonologic system. My main observations to this respect are:
* @ appears only in borrowings from Arabic (to my knowledge)
** the Koalib speakers who pronounce it are all quite fluent in Arabic, and in Khartoum, they are sometimes more fluent in Arabic than in Koalib: so, when @ appears in Arabic borrowings, is it really one of the Koalib consonants or just a case of code switching / mixing?
*** if you take into account the arabicized pronunciation of Koalib (why not?), then you should include the diverse laryngeal and glottal sounds which can be pronounced too in borrowings from Arabic by bilingual Koalib speakers (to put it more clearly: if you include @, why do you not also include /h/, /x/, etc.)
**** from a socio-linguistical point of view, from what I could see, the @ is far from being consensual among educated Koalibs and some of them even reject it openly, because they feel that this phoneme is not part of their language: of course, I am unable to give you accurate statistics about that.
Dr. Nicolas Quint
Chargé de Recherches
LLACAN-UMR8135 (CNRS-INALCO-Université Paris VII)
FRANCE
Date/Time: Thu Aug 26 17:37:38 CDT 2004
Contact: Peter Kirk
I would like to express my support for the proposed INVISIBLE LETTER character. The current situation in which non-spacing versions of many combining marks have to be represented by SPACE or NBSP followed by the combining mark causes serious problems, especially for representation of combining marks in isolation e.g. as part of a list of marks. Implementers are likely to be confused by the need to give these combinations line breaking characteristics different from those of their base characters alone. It makes a lot of sense to clearly define a proper new practice, and to deprecate the confusing existing one.
I would like to see it clearly recommended that the width of INVISIBLE LETTER (in a variable width font) should be as required to carry the combining mark. Indeed, when the combining mark is actually a spacing character itself the width of INVISIBLE LETTER should collapse to zero. This is required so that spacing combining marks can be positioned properly e.g. in aligned columns relative to spacing letters. Alternatively, there may be a need for a separate ZERO WIDTH INVISIBLE LETTER.
In certain situations in biblical Hebrew (Qere without Ketiv) there is a need to display two or three combining marks together without a base character. (See the figure in section 3.1 of http://www.qaya.org/academic/hebrew/Issues-Hebrew-Unicode.html, which would include three INVISIBLE CHARACTERs in place of NBSPs.) The proposal should not rule out this possibility.
Date/Time: Thu Sep 2 09:07:15 CDT 2004
Contact: Kent Karlsson
The INVISIBLE LETTER should have width if an Mn is attached to it. If only Mc characters are attached to it, it should be zero-width. Especially if the Mc is a reordrant one. This way tables of "letters", where the letters may be Mc, look better.
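The width rule suggested in this comment can be sketched with the General_Category values exposed by Python's standard `unicodedata` module. INVISIBLE LETTER is only a proposal here, so the function name `base_needs_width` and the idea of passing just the attached marks are illustrative assumptions.

```python
import unicodedata


def base_needs_width(marks: str) -> bool:
    """Decide whether an invisible base carrying `marks` should have width.

    Rule sketched from the comment above (an assumption, not a spec):
    any nonspacing mark (Mn) means the base needs width; if only spacing
    marks (Mc) are attached, the base can collapse to zero width.
    """
    categories = [unicodedata.category(mark) for mark in marks]
    return "Mn" in categories
```

For example, U+0301 COMBINING ACUTE ACCENT is Mn and would give the base width, while U+093F DEVANAGARI VOWEL SIGN I is Mc and would not.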
Date/Time: Wed Sep 15 13:43:54 CDT 2004
Contact: Peter Kirk
In addition to my previous comments on this issue, I wish to point out an additional reason why this Invisible Letter would be beneficial. As currently specified, a spacing diacritic which forms part of a word (something which happens at least in Hebrew) should be represented by SPACE or NBSP followed by the combining form of the diacritic. Because there is no remaining visible space, a word break is not generally intended. But, according to UAX #29, by default there is a word boundary before and after the space-diacritic combination - and it would probably be very difficult to tailor the boundary rules to avoid this issue without disabling SPACE itself as a word boundary character. The easiest way to resolve this issue seems to be to encode INVISIBLE LETTER as proposed, defined as not a word boundary character, and to deprecate use of SPACE or NBSP for formation of spacing diacritics.
Date/Time: Fri Nov 5 21:37:15 CST 2004
Contact: Gihan Dias (Sri Lanka)
The ICT Agency of Sri Lanka supports this proposal.
This issue is of importance in Sinhala, as well as other Indic languages, in which modifiers and modifying sequences need to be depicted in isolation for pedagogic and other purposes.
The base character used for this purpose should not be used for any other purpose, and thus we do not recommend the use of any form of space character for this purpose.
We do not recommend that this character be used for "unknown letter", but specifically for depicting isolated modifiers and modifying sequences.
This character may be encoded in a suitable block, not necessarily "control pictures".
Gihan Dias
Advisor, Technical Architecture and Standards, ICT Agency,
160/24, Kirimandala Mawatha, Colombo 5, Sri Lanka.
Tel: +94 11 236 9096 http://www.icta.lk
Date/Time: Tue Nov 9 19:33:48 CST 2004
Contact: François Yergeau
This proposal should be rejected. The UCS already offers two glyph-less characters to be used as a base for displaying combining marks in isolation: U+0020 SPACE and U+00A0 NON-BREAKING SPACE (NBSP).
N2822 points out real problems with SPACE in some important environments (XML and HTML) and then offers a solution, completely ignoring the already existing solution: NBSP. NBSP doesn't get collapsed with other whitespace in XML or HTML, in fact it is widely used precisely for that reason. This fact simply removes the "problem" that N2822 purports to solve; the proposed INV is therefore uncalled for.
The properties of NBSP admittedly do not match exactly those proposed for INV, but the proposal does not make a case that those exact properties are necessary to obtain the desired effect of displaying combining marks in isolation. The properties of NBSP are certainly adequate to support the "typical representation between two words:
Finally, the rationale offered for encoding INV in the BMP ("General usefullness for most BMP scripts.") is completely bogus. Placing it in the Control Pictures block seems misguided, since it is not a picture of a control character.
-- François Yergeau
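The whitespace-collapsing argument in the comment above can be illustrated with a small Python sketch. It assumes that the collapsible characters are the ASCII whitespace set (space, tab, LF, FF, CR); `collapse_whitespace` is a hypothetical helper that imitates that behaviour and is not taken from any HTML or XML library.

```python
import re

# ASSUMPTION: ASCII whitespace is what gets collapsed; NBSP (U+00A0) is not in the set.
COLLAPSIBLE = re.compile(r"[ \t\n\f\r]+")


def collapse_whitespace(text: str) -> str:
    """Replace each run of collapsible whitespace with a single SPACE."""
    return COLLAPSIBLE.sub(" ", text)


ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# SPACE used as the base: the separator space and the base space merge into one.
space_based = collapse_whitespace("A " + " " + ACUTE)

# NBSP used as the base: the base survives next to the separator space.
nbsp_based = collapse_whitespace("A " + "\u00A0" + ACUTE)
```

In the first case the mark ends up attached to the only remaining space, which is also the word separator; in the second case the NBSP base is untouched.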
Date/Time: Wed Sep 8 03:25:23 CDT 2004
Contact: Peter Kirk
It seems to me that sequence names should be kept clearly distinct from character names, e.g. by using names like LATIN CAPITAL LETTER SEQUENCE A B C. This will avoid confusion when people see a sequence name and try to find it in a list of character names. It may also avoid naming difficulties if at a later stage it is decided to encode a single character for what has been represented as a sequence.
But then I wonder about the usefulness of sequence names, since they are intended as an addition to a list of character names which has already lost its usefulness. I say this because a list of names is not useful which includes many names known to be misleading and several acknowledged errors, which cannot be corrected. In my opinion, for this reason use of official character names in user interfaces should be deprecated, and extensions to the name space are unhelpful.
Date/Time: Thu Nov 11 08:11:32 CST 2004
Contact: Andrew West
I have noticed that NamedCompositeEntities.txt does not currently include any entries for Tibetan entities, when Resolution M44.20 suggests that precomposed BrdaRten Tibetan stacks may be potential candidates for inclusion in such a list.
China is now moving towards standardizing the precomposed BrdaRten stacks by mappings to the PUA. Set A (in the BMP PUA) has already been defined, and Set B (in the PUA of Plane 16) is pending. I don't know whether the UTC thinks it is appropriate to add the precomposed Tibetan stacks defined by China to the list of Named Composite Entities at present (it may be best to wait for the dust to settle), but I am appending a list of the 1,536 Set A stacks, in case it does wish to consider these for inclusion in NamedCompositeEntities.txt. Note that the names have been generated algorithmically by myself, and are only provisional suggestions.
In addition to the 1,536 conjuncts defined in Set A by China, I suggest that the following single conjunct (which the Chinese standard unfortunately maps to U+0F00) also be included in any list of Tibetan Named Composite Entities:
TIBETAN CONJUNCT A WITH VOWEL SIGN O AND RJES SU NGA RO;0F68 0F7C 0F7E
(~1500 lines omitted for brevity)
TIBETAN CONJUNCT HA PLUS SUBJOINED WA WITH VOWEL SIGN O;0F67 0FAD 0F7C
Date/Time: Wed Sep 8 22:44:48 CDT 2004
Contact: Jony Rosenne
FB1D, HEBREW LETTER YOD WITH HIRIQ, should be assigned to the unknown group. It is not a Hebrew character, notwithstanding the misleading name.
Date/Time: Tue Oct 26 22:32:46 CST 2004
Contact: Doug Ewell
The proposed update to UAX #24, "Script Names" (PRI #43) looks good and adds a good deal of useful background information. There is no explicit mention of doing away with the Katakana_Or_Hiragana script value, which I thought was a goal of this revision, but perhaps that is only planned to be reflected in the updated Scripts.txt.
Date/Time: Thu Sep 9 03:00:42 CDT 2004
Contact: Matitiahu Allouche
I want to give my position on Public Review 44 "Bidi Category of Fullwidth Solidus".
For Hebrew, the change from ES to CS does not make any difference. For Arabic, it does, but I estimate that the mass of existing Unicode text including Arabic in a context where Fullwidth Solidus is also used must be infinitesimal, so there should be no data integrity issue.
As for the future, it makes sense that U+002F SOLIDUS and U+FF0F FULLWIDTH SOLIDUS be assigned the same Bidi category.
Matitiahu Allouche
Date/Time: Fri Oct 15 01:23:19 CST 2004
Contact: smontagu AT smontagu.org
Also, Unicode 4.0.1 changes the bidi category of U+002B PLUS SIGN and U+002D HYPHEN-MINUS from ET to ES without changing their fullwidth equivalents. There are several other characters which have compatibility mappings to U+002B and U+002D and I suggest the UTC considers changing the Bidi category of all of them to ES. Here is the full list:
207A;SUPERSCRIPT PLUS SIGN
208A;SUBSCRIPT PLUS SIGN
FB29;HEBREW LETTER ALTERNATIVE PLUS SIGN
FE62;SMALL PLUS SIGN
FE63;SMALL HYPHEN-MINUS
FF0B;FULLWIDTH PLUS SIGN
FF0D;FULLWIDTH HYPHEN-MINUS
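A list like the one above can be regenerated from the character database. The following Python sketch scans for characters whose compatibility decomposition is exactly U+002B or U+002D and reports their Bidi category. It uses the standard `unicodedata` module, so the categories it reports reflect whatever Unicode version ships with the interpreter, not Unicode 4.0.1; `compat_variants` is a hypothetical helper name.

```python
import unicodedata

TARGETS = {"002B", "002D"}  # PLUS SIGN and HYPHEN-MINUS


def compat_variants() -> dict:
    """Map each code point with a compatibility mapping to + or - to its Bidi category."""
    found = {}
    for cp in range(0x110000):
        decomp = unicodedata.decomposition(chr(cp))
        # Compatibility decompositions carry a tag such as <wide> or <super>.
        if not decomp.startswith("<"):
            continue
        _tag, *mapping = decomp.split()
        if len(mapping) == 1 and mapping[0] in TARGETS:
            found[cp] = unicodedata.bidirectional(chr(cp))
    return found


if __name__ == "__main__":
    for cp, bidi in sorted(compat_variants().items()):
        print(f"{cp:04X};{unicodedata.name(chr(cp))};{bidi}")
```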
Date/Time: Wed Oct 13 03:05:12 CST 2004
Contact: Behdad Esfabod
I second that NNBS should have the same category (in almost all respects) as NBSP. What I'm not sure about right now is whether they really are Common Number Separator (CS). IMO they are not CS, because there's nothing "weak" about them. And they don't go with other CS characters (mostly comma, colon, solidus, and full stop characters). These two are simply ON.
But then I can see the reason for making them CS: you can use them to write something like "123 456" in an RTL context without getting reordered as "456 123". But then:
* I'm sure 99.9999% of RTL users would not learn/use that.
* It IS confusing. NBSP is supposed to be a non-breakable space. So, I like to put it between my "123" and "456", to prevent line-breaking. What I don't want is to get them reordered differently.
So, I propose changing bidi category of both No-Break Space and Narrow No-Break Space to Other Neutral (ON).
Date/Time: Sat Nov 6 03:09:47 CST 2004
Contact: Tim Partridge
Following brief discussion with other members of the Mongolian shaping working party, we are agreed that this change is unlikely to impact Mongolian text.
NNBSP is not normally used between Mongolian digits, and even if it was would have no impact as Mongolian digits do not change shape.
In the normal context of separating a suffix from the stem of a word, there would be no impact in the context of the bidi algorithm.
NNBSP would not normally occur at the start or end of a line, or next to a tab in Mongolian text. (In the event that a hyphenation algorithm was permitted to break at a NNBSP, the renderer would have to implement special actions to preserve shaping - it is presumed that these would cater for any bidi difficulties.)
In conclusion I have no objection to the change.
I would note that in the past U+180E MONGOLIAN VOWEL SEPARATOR has had its properties kept in step with NNBSP as in Mongolian text they have similar but not identical uses. I can't imagine anyone looking at MVS as an obvious candidate for putting between digits though, so in this case it probably doesn't matter.
Tim Partridge
Date/Time: Tue Sep 14 06:55:24 CDT 2004
Contact: Peter Kirk
I would like to congratulate Peter Constable on his clear and accurate statement of the issues, and indicate my continuing support for the Meteg representations proposed in his document.
Date/Time: Tue Oct 19 12:49:46 CST 2004
Contact: François Yergeau
I interpret the issue as being how to optimize the base table, in the case of characters that can be perceived as modifications of some other base character(s), such as an additional diacritic or an historical ligature. BTW the same issue may apply also to Cyrillic etc., not only to Latin.
One way to optimize is to use weights appropriate for languages that actually use the character. Another is to assign weights that will make sense for the majority of users of the *script*. If these two match (e.g. é, always considered a secondary difference from e), there is no problem. When they mismatch (e.g. æ, either a secondary difference from ae or a distinct letter coming after z), then we have to choose.
I think the second strategy is the better one. My motivation:
a) Everybody has to tailor according to 14651.
b) Users of a certain character, say Ł, are very much aware of the existence and collation requirements of this character, and will make sure to check the base table and tailor it as required to sort Ł properly. Their requirements will therefore be met.
c) Other users may not even be aware that Ł exists, will not check what the base table does with it and will not account for it in their tailoring, in order to get reasonable behaviour (which would be here to interfile it with the Ls). Their requirements will not be met, unless those who do the tailoring check all Latin (and similarly for Cyrillic etc.) characters and tailor them. This is unlikely and leads to a lot of tailoring.
Case in point: I recently had the opportunity to look at a delta recently adopted (or close to being adopted?) by the Québec government. Unsurprisingly, it doesn't tailor Ł. Therefore, if I were to use this delta to look up a database of people for names starting with L, I would miss Mr Łukasiewicz, of Reverse Polish Notation fame.
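The lookup problem described here can be reproduced with a toy Python sketch. It is not an implementation of ISO/IEC 14651 or of any real collation table: `PRIMARY_FOLD` is a hypothetical two-entry table standing in for a base table that treats Ł as a variant of L, and the name list is made up apart from Łukasiewicz.

```python
# ASSUMPTION: a tiny stand-in for a base table that interfiles Ł with L.
PRIMARY_FOLD = {"Ł": "L", "ł": "l"}


def interfiled_key(name: str):
    """Sort key: folded primary form first, original spelling as tie-breaker."""
    primary = "".join(PRIMARY_FOLD.get(ch, ch) for ch in name).casefold()
    return (primary, name)


names = ["Martin", "Łukasiewicz", "Zed", "Lukas"]

# Plain code point order pushes Ł (U+0141) after every ASCII letter.
by_code_point = sorted(names)

# With the folding, Ł sorts among the Ls.
interfiled = sorted(names, key=interfiled_key)

# A naive "starts with L" lookup misses the name; the folded lookup finds it.
naive_hits = [n for n in names if n.startswith("L")]
folded_hits = [n for n in names if interfiled_key(n)[0].startswith("l")]
```

Note that Ł has no canonical decomposition, so stripping combining marks after normalization would not interfile it; some explicit table entry of this kind is needed.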
Date/Time: Wed Nov 3 10:07:10 CST 2004
Contact: Peter Kirk
I support the changes proposed in PR #47, which make for a much more logical and practical default collation.
The same principles should be extended to similar letters in the Greek and Cyrillic alphabets. I already proposed one change along these lines to sort Greek koppa and archaic koppa together. There are some other Greek examples and a large number of Cyrillic examples which ought to be treated similarly.
Date/Time: Wed Oct 13 02:54:03 CST 2004
Contact: Matitiahu Allouche
1) The definition in BD3a does not clarify whether the boundaries are determined between characters adjacent in the original sequence (logical order) or in the new sequence (visual order) computed by the Bidi Algorithm. I believe that the author's intent is the former, but since we are addressing presentation and glyphs issues, the casual reader could understand the latter. There is a difference between the two, as in the following example:
my address is ABC STREET 123
which displays as
my address is 123 TEERTS CBA
According to the definition, the location after the space following "is" is a directional boundary on the logical sequence but not on the visual sequence.
2) Because of the anomaly in the example above, I think that "level runs" are more significant than directional runs, and shaping should be limited to characters with the same level. This definition of runs gives the same result whether we consider the sequence of characters in logical order or in visual order.
3) I suggest that the text of UAX#9 about shaping (section 3.5) should give some directives about if and how to shape Arabic characters with LTR direction (possible only after LRO or equivalent markup).
4) I suggest that the sample in section 3.5 of UAX#9 would be a little less contrived if the 2 last characters (LAM and MEEM) were preceded by RLE (and not RLO), which would also better match the text which says "the next two are embedded, but with the normal RTL direction" (Note: "embedded" and not "overridden"). In fact, the example would lose nothing if the last 2 characters were neither overridden nor embedded. The example would be more understandable and the final result would be exactly the same.
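Points 1 and 2 above can be checked with a short Python sketch. It does not run the Bidi Algorithm: the embedding levels are assigned by hand (an assumption) for the example string, with uppercase standing for right-to-left letters, and `reorder_visual` applies only a simplified form of the reordering rule (reverse runs from the highest level down to level 1). The helper names are hypothetical.

```python
from itertools import groupby


def reorder_visual(text, levels):
    """Simplified reordering: for each level from the highest down to 1,
    reverse every maximal run of characters at that level or higher."""
    pairs = list(zip(text, levels))
    for level in range(max(levels), 0, -1):
        i = 0
        while i < len(pairs):
            if pairs[i][1] < level:
                i += 1
                continue
            j = i
            while j < len(pairs) and pairs[j][1] >= level:
                j += 1
            pairs[i:j] = pairs[i:j][::-1]
            i = j
    return pairs


def split_runs(pairs, key):
    """Group (char, level) pairs into maximal runs sharing key(level)."""
    return ["".join(ch for ch, _ in grp)
            for _, grp in groupby(pairs, key=lambda p: key(p[1]))]


text = "my address is ABC STREET 123"
# ASSUMPTION: hand-assigned levels; 0 for the Latin text, 1 for the
# right-to-left letters and the space before the digits, 2 for the digits.
levels = [0] * 14 + [1] * 11 + [2] * 3

logical = list(zip(text, levels))
visual = reorder_visual(text, levels)
visual_text = "".join(ch for ch, _ in visual)

dir_logical = split_runs(logical, lambda lvl: lvl % 2)  # directional runs
dir_visual = split_runs(visual, lambda lvl: lvl % 2)
lvl_logical = split_runs(logical, lambda lvl: lvl)      # level runs
lvl_visual = split_runs(visual, lambda lvl: lvl)
```

Under these assumed levels the directional runs differ between logical order (three runs) and visual order (two runs, because the digits join the preceding Latin text), while the level runs are the same three pieces in both orders.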
From: Kent Karlsson
Date: 2004-10-13 04:49:12 -0700
> IIRC the consensus in Toronto UTC meeting was to replace
> "directional run" with "level run". Mark, is there any reason
> for reverting back to this definition? It has considerable
> impacts on reordering joining and bidi. I will file my stand,
> but I think we can at least put that as another option in the PRI
> itself.
I agree that the definition and term should be changed to refer to and define "level run" (which is already defined in UAX 9) rather than "directional run". The latter is a rather strange concept. [Nit: can we please get rid of the # when referring to UAX/UTR/... numbers? Please!]
In addition, shaping is actually NOT limited to each level run (nor even directional run), due to ZWJ.
Thus: "Shaping is logically applied to each level run (of the original string) ***extended with any ZWJ at either end of the run***."
I'm not sure how that really differs from "Shaping is logically applied to the original string, with PDF, LRO, LRE, RLO, and RLE given shaping class U."... But BiDi still confuses me; any hint?
/Kent K
From: Kenneth Whistler
Date: 2004-10-12 13:25:42 -0700
> But a "directional run" is not defined in the document. The proposal is
> to add a definition of directional run as:
This is still confused and confusing. And BTW, "have" -- not "has" -- was correct.
> directional run: A maximal contiguous sequence of characters that has the
> same embedding direction after applying the BIDI algorithm.
> - The boundaries of the directional runs within a string are between
> characters
> that have different embedding directions, and at the start and end of the
> string.
> - That is, there is a boundary between two characters where one is
> right-to-left (an odd
> embedding level) and the other is left-to-right (an even embedding level).
I suggest the following reformulation, which I think will be *much* easier to comprehend.
BD3a Directional Boundary: The start or end of a sequence of characters, or any point between two characters in the sequence for which the embedding level for the first character differs in even/odd status (and thus L versus R direction) from the second character.
* In other words, there is a directional boundary between any two characters where the embedding direction changes.
BD3b Directional Run: A maximal contiguous sequence of characters, all of which have the same embedding direction after application of the BIDI algorithm.
* Note that a directional run is terminated at both ends by a directional boundary, and that boundary may consist of the start or end of the sequence of characters.
By the way, I don't think the definition should introduce the notion of a "string", because bidi is really defined on sequences of characters. Once you have strings, you have the problem of boundaries between code units as well as between characters that the code units represent.
--Ken
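The reformulated BD3a and BD3b definitions translate fairly directly into code. The Python sketch below takes a list of already-resolved embedding levels (computing them is outside its scope) and returns boundary positions and runs. The function names and the convention of expressing boundaries as indices between characters are assumptions made for illustration.

```python
def directional_boundaries(levels):
    """Positions of directional boundaries, as indices between characters.

    Position 0 is the start of the sequence and len(levels) is the end;
    position i lies between characters i-1 and i.
    """
    n = len(levels)
    boundaries = {0, n}  # start and end of the sequence always count
    for i in range(1, n):
        # Even/odd status of the level gives the L versus R direction.
        if levels[i - 1] % 2 != levels[i] % 2:
            boundaries.add(i)
    return sorted(boundaries)


def directional_runs(levels):
    """Maximal runs as (start, end) index pairs, end exclusive."""
    b = directional_boundaries(levels)
    return list(zip(b, b[1:]))
```

For the levels `[0, 0, 1, 1, 2, 0]` the level-2 character joins the following level-0 character in one left-to-right run, since both levels are even.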
Date/Time: Tue Oct 26 22:32:46 CST 2004
Contact: Doug Ewell
The proposed update to UTS #6 (SCSU; PRI #49) also looks good, except for one sentence (described below). The update consists almost entirely of uncontroversial editorial changes; I kept waiting for the "kicker" technical change that would have justified sending it out for public review. I am particularly glad to see the details on the Japanese sample moved into the main document. I have not reviewed the sample scsumini.c code; I will try to do that before the November 8 deadline.
The one sentence in the SCSU revision which gave me problems is in Section 8.5:
"When there is a string of characters that fit into the same new dynamic window, then one should be defined so that text is compressed for which there is not a predefined window."
I can't parse this.
Date/Time: Wed Oct 20 12:42:18 CST 2004
Contact: Peter Linsley
UTS#18: 0.1 Notation "\n as used within regular expressions, expands to the text matching the nth parenthesized group in regular expression"
PETER: Most engines limit n to be [1-9] where \456 would be the backreference to the 4th group followed by the literal '56'. May want to enhance the description to indicate this.
UTS#18: 1.1 Hex Notation
PETER: Should this be limited to hex notation or should the requirement be "any base notation" so as to allow decimal or octal too?
UTS#18: 1.6 Line Boundaries, 4. Arbitrary character pattern "Note that ^.*$ (an empty line pattern) should not.."
PETER: '^.*$' is not an empty line pattern, that would be '^$'.
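The distinction is easy to confirm in an existing engine. This small sketch uses Python's `re` module, applied line by line, as one example engine; other engines may differ in details such as how `$` treats a trailing newline.

```python
import re

ANY_LINE = re.compile(r"^.*$")   # matches any line, empty or not
EMPTY_LINE = re.compile(r"^$")   # matches only an empty line

lines = ["abc", "", "xyz"]

any_hits = [line for line in lines if ANY_LINE.search(line)]
empty_hits = [line for line in lines if EMPTY_LINE.search(line)]
```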
UTS#18: 2.1 Canonical Equivalents "For example, the expression [a-z ä] can be internally turned into [a-z ä] | (a \u0308)."
PETER: This should be "For example, the expression [a-z ä] can be internally turned into ([a-z ä] | (a \u0308))." otherwise an expression such as 'a [a-z ä]' would be incorrectly turned into 'a[a-z ä] | (a \u0308)' placing the first 'a' on one side of the alternation.
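The precedence problem can be demonstrated with Python's `re` module. The sketch writes the expansion without the ignorable spaces used in the UTS #18 notation and uses a non-capturing group for the parenthesized form; the variable names are illustrative only.

```python
import re

CLASS = "[a-z\u00E4]"    # the class [a-z ä], with ä precomposed
DECOMPOSED = "a\u0308"   # ä as a + COMBINING DIAERESIS

# 'a' followed by the expansion, without and with grouping.
ungrouped = re.compile("a" + CLASS + "|" + DECOMPOSED)
grouped = re.compile("a(?:" + CLASS + "|" + DECOMPOSED + ")")

target = "a" + DECOMPOSED  # 'a' followed by decomposed ä

ungrouped_ok = ungrouped.fullmatch(target) is not None
grouped_ok = grouped.fullmatch(target) is not None

# The ungrouped form also wrongly accepts a bare decomposed ä,
# because the leading 'a' belongs only to the first alternative.
ungrouped_bare = ungrouped.fullmatch(DECOMPOSED) is not None
grouped_bare = grouped.fullmatch(DECOMPOSED) is not None
```

Only the grouped form matches `a` followed by a decomposed ä and rejects the bare decomposed ä.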
UTS#18: 2.6 Wildcard Properties "Examples, where ".*" matches any character,"
PETER: should be "Examples, where ".*" matches any number of characters,". Also, in the table, it should be stated that the expression is implicitly anchored.
UTS#18: 3.5 Tailored Ranges "in traditional Spanish, for example, [b-d] would match against 'ch'."
PETER: Would [^d-f] match 'ch' or just 'c'? Perhaps we should document this either way. I don't have a good answer for how it should behave.
UTS#18: 3.6 Context Matching
PETER: I'm a bit confused by the justification for this section in that I don't see how it is directly related to Unicode. Script transliteration is mentioned as one usage case but I'm sure there are many linguistic string manipulations that require other features of regexp such as non-greedy matching or DFA longest leftmost matches which are obviously not the scope of this document. I don't know the details of script transliteration but I'm sure it may be possible with use of backreferences rather than look ahead/behinds.
UTS#18: Unicode Set Sharing "If these sets separately stored, "
PETER: "If these sets are separately stored, ". It seems that the function of 'script transliteration' has driven several of these conformance requirements. This section seems to be a means to generate a more performant engine allowing the user to cache lookup tables; should it not be up to the implementation to decide how to make their engine performant? What happens when a method of lookup is developed that is faster than unicode set sharing? I may be misunderstanding but I feel this entry is out of place in the standard.
=GENERAL= PETER: For substringing operations such as replace should we recommend minimal match? Should we also allow Unicode related operations (such as normalization) on the replace string and backreferences?
Date/Time: Thu Oct 21 10:57:20 CST 2004
Contact: Markus Scherer
http://www.unicode.org/reports/tr18/tr18-10.html
"and also to maintain (as much as possible compatibility) with the usage in practice." -> move the closing paren from after compatibility to before it
" there is a mismatch between what what would be natural" -> remove one "what" (Is this just for reviewers?)
"While they could be applied to non-alphabetics, their principle use is on alphabetics." -> should this not be "principal"?
I personally find most troubling the name collision (alpha/upper/lower) together with the POSIX constraint - you mention this in the review notes.
markus
Date/Time: Thu Oct 21 17:07:27 CST 2004
Contact: Weiran Zhang
One comment on UTS#18 line boundary handling (1.6)
"To meet this requirement, if an implementation provides for line-boundary testing, it shall recognize not only CRLF, LF, CR, but also NEL (U+0085), PS (U+2029) and LS (U+2028)."
"Arbitrary character pattern (often ".") ... In 'multiline mode', these would match, and \u000D\u000A matches as if it were a single character."
The treatment of multicharacter newline sequence as a single character could introduce ambiguities for certain regular expression operations. As some examples:
User would expect the pattern ".{n}" to return n characters but it could now return n+m characters where m is the number of multicharacter newline sequences encountered in the match under multiline mode.
How about a pattern like "[^a]" (not 'a')? Should it also match a multicharacter newline sequence?
Also, for match and replace functions in general, when a match cannot be found at a particular position, the normal practice is to advance by one character position and retry the match. It's not clear how multicharacter newline sequence should be handled in this case. Should they be skipped over as one character (which is expensive because it requires checking for such sequences at every character position even though they don't occur very often)?
The same kind of overhead will apply to interpreting the arbitrary character "." and the anchor operators "^" and "$".
Most engines (Perl5, Java) do not support this. It would be more consistent and efficient to externalize the newline logic and request the input data be normalized to transform all newline sequences into a single newline character before regular expression processing.
Regards, -Weiran
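The externalized newline handling suggested in this comment might look like the following Python sketch. It assumes the newline sequences are the ones listed in the quoted requirement (CRLF, LF, CR, NEL, LS, PS) and that LF is the chosen single newline character; `normalize_newlines` is a hypothetical helper name.

```python
import re

# CRLF must come first so that it is replaced as a unit, not as CR then LF.
NEWLINES = re.compile(r"\r\n|[\n\r\x85\u2028\u2029]")


def normalize_newlines(text: str) -> str:
    """Replace every recognized newline sequence with a single LF."""
    return NEWLINES.sub("\n", text)


raw = "a\r\nb"
# With DOTALL, ".{3}" counts CR and LF as two characters in the raw text.
before = re.match(r".{3}", raw, re.S).group()
# After normalization, the same pattern counts the line break once.
after = re.match(r".{3}", normalize_newlines(raw), re.S).group()
```

After this preprocessing the engine only ever sees one newline character, so the `.{n}` counting ambiguity described above does not arise, at the cost of no longer preserving the original newline form in the matched text.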
No feedback received this period.
Date/Time: Wed Nov 3 10:32:24 CST 2004
Contact: Peter Kirk
In the draft UTR #36 on security issues, section "Cross-Script Spoofing", a recommendation is made that mixed script text should be specially formatted. However, there are some languages which as currently defined should be written with mixed scripts, e.g. Kurdish Cyrillic which uses Latin Q and W, and Syriac which uses some Arabic combining marks and punctuation. If these languages are to be displayed in a way acceptable to users, either these cases must be made special exceptions to mixed script detection, or the problem could be solved by encoding e.g. distinct Cyrillic Q and W characters for use in Kurdish.
I would suggest that a special anti-spoofing folding should be developed which is designed to fold together all characters and strings which can be easily confused. This can be used to detect spoofing attempts.
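A crude mixed-script check shows both the idea and the false-positive problem raised above. Python's standard library does not expose the Unicode Script property, so this sketch approximates a character's script by the first word of its character name; that shortcut is an assumption made purely for illustration and is wrong for many characters. The helper names are hypothetical.

```python
import unicodedata


def script_guess(ch: str) -> str:
    """ASSUMPTION: approximate the script by the first word of the character name."""
    return unicodedata.name(ch, "UNKNOWN").split()[0]


def scripts_in(text: str) -> set:
    """Collect the guessed scripts of all alphabetic characters in the text."""
    return {script_guess(ch) for ch in text if ch.isalpha()}


def is_mixed_script(text: str) -> bool:
    return len(scripts_in(text)) > 1


spoof = "p\u0430ypal"       # Cyrillic а (U+0430) in place of Latin a
legit_mix = "\u043A\u0430w"  # Cyrillic letters followed by Latin w
```

Both strings are flagged as mixed, which is the point of the comment: a Cyrillic-script orthography that legitimately uses Latin Q and W is indistinguishable from a spoof under a purely script-based test.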
No feedback received this period.
No feedback received this period.
No feedback received this period.
Date/Time: Fri Nov 5 19:30:48 CST 2004
Contact: Peter Kirk
First, I would like to protest at the grossly inadequate time given for public review of this issue. A notice was sent to the public Unicode list on a Friday, after working hours almost anywhere except the US West Coast, with a deadline for the following Monday. Only those of us working over the weekend have a chance to respond.
Then, on a technical issue: I applaud the deprecation of use of SPACE as a base character for combining marks in isolation. But the text recommending use of NBSP conflicts with the INVISIBLE LETTER proposal which is still under public review, issue #41. These conflicting proposals need to be harmonised. I continue to prefer INVISIBLE LETTER as this has the clear letter-like properties needed for spacing combining marks occurring in text, and avoids the danger that this use of NBSP may lead to some of the same complications in text processing which have led to the deprecation of this use of SPACE.
Date/Time: Fri Nov 5 21:51:21 CST 2004
Contact: Gihan Dias (Sri Lanka)
In section 5.1, under CM — Attached Characters and Combining Marks (XB) — (normative)
The report says:
"The preferred base character for showing combining marks in isolation is U+00A0 No-Break SPACE. If a line break before or after the combining sequence is desired, U+200B ZERO WIDTH SPACE can be used."
The ICT Agency of Sri Lanka does not support the use of a space (00A0 or 200B) as a base for showing combining marks in isolation. We support the use of a separate character for this purpose, as proposed by Everson et al. in PR-41 (L2/04-268).
Gihan Dias
Advisor, Technical Architecture and Standards, ICT Agency,
160/24, Kirimandala Mawatha, Colombo 5, Sri Lanka.
Tel: +94 11 236 9096 http://www.icta.lk