Re: Fwd: Wired 4.09 p. 130: Lost in Translation

From: Mark Davis (mark_davis@taligent.com)
Date: Wed Aug 28 1996 - 08:57:31 EDT


This is not rocket science.

Here is a quick & dirty Java method to do it.

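// Returns the index just past the character sequence that begins at start:
// the base character plus any combining characters that follow it.
// isCombining is not defined here; it is assumed to test for a combining character.
static int getEndOfCharacterSequence (String s, int start) {
 for (int i = start + 1; i < s.length(); ++i)
  if (!isCombining(s.charAt(i)))
   return i;
 return s.length();
}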
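
The method relies on an isCombining helper that is not shown. As a minimal runnable sketch, one could assume (this is an assumption, not something stated in the post) that "combining" means the Unicode general categories Mn, Mc, and Me as reported by java.lang.Character.getType. The class name CharSeqDemo and the helper body are illustrative only:

```java
class CharSeqDemo {
    // Hypothetical helper: treats general categories Mn, Mc, and Me as combining.
    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // Same logic as the method in the post: scan forward until the
    // first character that is not a combining character.
    static int getEndOfCharacterSequence(String s, int start) {
        for (int i = start + 1; i < s.length(); ++i) {
            if (!isCombining(s.charAt(i))) {
                return i;
            }
        }
        return s.length();
    }

    public static void main(String[] args) {
        // 'A' + COMBINING ACUTE ACCENT + COMBINING DOT BELOW, then 'B'
        String s = "A\u0301\u0323B";
        System.out.println(getEndOfCharacterSequence(s, 0)); // prints 3
        System.out.println(getEndOfCharacterSequence(s, 3)); // prints 4
    }
}
```

Like the original, this sketch works on 16-bit char values, so it does not look at combining characters outside the Basic Multilingual Plane.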

It really depends on when you need to do this processing. In our experience,
for most of the string manipulations that programmers do, it is immaterial
where the internal character sequence boundaries are. On the occasions when
you do need to know where the boundaries are, finding them is pretty simple,
and your code will then also work with non-Latin languages. If you have more
questions, see my Text Boundaries paper from the last Unicode conference. (Or,
if you have a Java-enabled browser, you can see a live demonstration if you go
through www.taligent.com.)
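
To illustrate the point about stacked combining characters and non-Latin text, here is a hedged sketch that walks a whole string and splits it into character sequences by calling the method repeatedly. The class name SequenceSplitter, the split method, and the isCombining body (general categories Mn, Mc, Me via Character.getType) are assumptions made for this example, not part of the original post:

```java
import java.util.ArrayList;
import java.util.List;

class SequenceSplitter {
    // Hypothetical helper: treats general categories Mn, Mc, and Me as combining.
    static boolean isCombining(char c) {
        int type = Character.getType(c);
        return type == Character.NON_SPACING_MARK
            || type == Character.COMBINING_SPACING_MARK
            || type == Character.ENCLOSING_MARK;
    }

    // The boundary method from the post, unchanged in behavior.
    static int getEndOfCharacterSequence(String s, int start) {
        for (int i = start + 1; i < s.length(); ++i) {
            if (!isCombining(s.charAt(i))) {
                return i;
            }
        }
        return s.length();
    }

    // Splits s into base-plus-combining sequences, one boundary call per sequence.
    static List<String> split(String s) {
        List<String> sequences = new ArrayList<>();
        int start = 0;
        while (start < s.length()) {
            int end = getEndOfCharacterSequence(s, start);
            sequences.add(s.substring(start, end));
            start = end;
        }
        return sequences;
    }

    public static void main(String[] args) {
        // Latin A with two stacked marks, Hebrew bet with dagesh and qamats, then z.
        String s = "A\u0301\u0323\u05D1\u05BC\u05B8z";
        for (String seq : split(s)) {
            System.out.println(seq.length()); // prints 3, then 3, then 1
        }
    }
}
```

Stacking more combining characters only makes the inner loop run a little longer; each character in the string is still examined once overall.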

Mark

Stringunicode@Unicode.ORG wrote:
>
> At 16:30 1996-08-28, Martin J Duerst wrote:
> >
> >Please be careful. To know whether an A is just only an A, you only have
> >to check the next position. If that next position is not a combining
> >character, you know it is an A, if it is a combining character, you
> >know it is "something else".
>
> Yes, but it's not a once-off look, is it? Because you can stack combining
> characters. So you know it's not an A, but you have to keep looking and
> looking and looking, don't you? Doesn't this make processing much more
> complex than Level 1 processing?
>
> Michael


