Marc-Andre Lemburg wrote:
> Do you have references which we could look at
> to determine which of these boundary kinds would actually be
> useful in daily programming ?
There are two things utterly useful in daily programming... One is to get
a "character", whether it's a surrogate or not; another is to get a base
character and all associated combining marks.
It's useful to find the range covered by a "character" at some given
index. That allows the programmer to easily write an increment loop:
    while (index i is valid) {
        c = next_char_at_Index [i] of string s;
        i += lengthOfChar_at_Index [i] of string s;
        // do something with c...
    }
or similar...
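
To make that loop concrete, here is a rough Python sketch. It assumes the string is held as a list of UTF-16 code units (plain integers); the names char_length_at, char_at and iter_chars are made up for illustration and are not an existing API. A lone surrogate is simply treated as a one-unit "character" here, which is only one of several possible policies.

```python
def char_length_at(units, i):
    """Number of code units (1 or 2) covered by the character at index i."""
    if (0xD800 <= units[i] <= 0xDBFF            # high (lead) surrogate
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF):  # low (trail) surrogate
        return 2
    return 1


def char_at(units, i):
    """Code point of the character starting at index i."""
    if char_length_at(units, i) == 2:
        # Combine the surrogate pair into one supplementary code point.
        return 0x10000 + ((units[i] - 0xD800) << 10) + (units[i + 1] - 0xDC00)
    return units[i]


def iter_chars(units):
    """The increment loop: step by character, not by code unit."""
    i = 0
    while i < len(units):
        c = char_at(units, i)
        i += char_length_at(units, i)
        yield c  # do something with c...
```

For example, list(iter_chars([0x61, 0xD834, 0xDD1E, 0x62])) yields three characters, the middle one being U+1D11E built from the surrogate pair.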
Also, in a similar vein, it's useful to find the "range" covered by the
combining character sequence or "locale-independent grapheme" at the given
index.
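
A minimal Python sketch of such a range lookup follows. It assumes a string indexed by code point and approximates "combining mark" as any character with a nonzero canonical combining class, as reported by unicodedata.combining. That is a simplification: some marks have class 0 and would be missed. The function name combining_sequence_range is hypothetical.

```python
import unicodedata


def combining_sequence_range(s, i):
    """Return (start, end) of the base-plus-marks sequence covering index i.

    Approximation: a character counts as a combining mark when its
    canonical combining class is nonzero.
    """
    # Walk back from i to the base character of the sequence.
    start = i
    while start > 0 and unicodedata.combining(s[start]) != 0:
        start -= 1
    # Walk forward over any combining marks that follow the base.
    end = start + 1
    while end < len(s) and unicodedata.combining(s[end]) != 0:
        end += 1
    return start, end
```

With s = "e\u0301a\u0300\u0323b", asking for index 0 or index 1 gives the same range (0, 2), and any index from 2 to 4 gives (2, 5), so s[start:end] is always the whole sequence.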
Please also see the FAQ pages on combining marks and Unicode Technical
Report #18, section 3.3:
http://www.unicode.org/unicode/reports/tr18/#Locale-Independent Graphemes
There is now some work going on with regard to more precise definition of
such useful chunking units.
I would also take a look at the specifications for NSString and NSText in
Apple's Cocoa environment. Python has some of these operations already
built-in of course.
Rick