Re: Internationalised Computer Science Exercises

From: Richard Wordingham via Unicode <unicode_at_unicode.org>
Date: Mon, 29 Jan 2018 08:57:41 +0000

On Mon, 29 Jan 2018 07:16:04 +0100
Philippe Verdy via Unicode <unicode_at_unicode.org> wrote:

> 2018-01-28 23:44 GMT+01:00 Richard Wordingham via Unicode <unicode_at_unicode.org>:

> > In the search you have in mind, the converted regex for use with NFD
> > strings is actually intelligible and simple:
> >
> > <LATIN SMALL LETTER A>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING DOT BELOW>
> > [[ [^[[:cc=0:]]] - [[:cc=above:][:cc=below:]] ]] *
> > <COMBINING CIRCUMFLEX>
> >
> > Informal notation can simplify the regex still further.
> >
> > There is no upper bound to the length of a string matching that
> > regex,
>
> Wrong, you've not read what followed immediately that commented it
> already: it IS bound exactly because you cannot duplicate the same
> combining class, and there's a known finite number of them for
> acceptable cases: if there's any repetition, it will always be within
> that bound.

Are you talking about regular expressions or strings that match them?
Natural language text can very easily contain adjacent combining
characters of the same combining class - look no further than the
full decomposition of U+01D6 LATIN SMALL LETTER U WITH DIAERESIS AND
MACRON. For a few combining characters, such as U+1A7F TAI THAM
COMBINING CRYPTOGRAMMIC DOT, repetition is of their very essence.
One can find pairs of combining circumflexes in plain text maths.
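
The point about adjacent marks of the same combining class can be checked
with a short Python sketch. It assumes only the standard unicodedata
module; the helper name combining_classes is made up for this
illustration and is not part of the original message.

```python
import unicodedata

def combining_classes(text):
    """Return (code point, canonical combining class) pairs for the NFD form of text."""
    nfd = unicodedata.normalize("NFD", text)
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in nfd]

# U+01D6 decomposes fully to u + COMBINING DIAERESIS + COMBINING MACRON;
# both marks have combining class 230 (above), so they sit side by side.
print(combining_classes("\u01D6"))

# A doubled U+1A7F (class 220, below) is already in NFD and is left unchanged.
print(combining_classes("\u1A7F\u1A7F"))
```

In both cases the NFD string contains two consecutive characters with the
same non-zero combining class.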

Incidentally, I was talking about regular expressions, which imply
*finite* state machines, albeit huge, rather than 'regexes', which are
similar but may formally require unbounded memory.
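
As a rough illustration, the quoted pattern for NFD strings can be run as
a four-state matcher. This is a minimal sketch, not a general regex
engine: the names matches and is_filler are hypothetical, and
cc=above / cc=below are taken to mean canonical combining classes 230
and 220.

```python
import unicodedata

DOT_BELOW = "\u0323"    # COMBINING DOT BELOW, class 220
CIRCUMFLEX = "\u0302"   # COMBINING CIRCUMFLEX ACCENT, class 230
ABOVE, BELOW = 230, 220

def is_filler(ch):
    """True for marks with a non-zero class other than above or below."""
    ccc = unicodedata.combining(ch)
    return ccc != 0 and ccc not in (ABOVE, BELOW)

def matches(text):
    """Whole-string match of: a filler* DOT_BELOW filler* CIRCUMFLEX."""
    state = 0  # the only memory used is this small integer
    for ch in text:
        if state == 0 and ch == "a":
            state = 1
        elif state == 1 and ch == DOT_BELOW:
            state = 2
        elif state == 2 and ch == CIRCUMFLEX:
            state = 3
        elif state in (1, 2) and is_filler(ch):
            pass  # stay in the same state while skipping other marks
        else:
            return False
    return state == 3

# U+0334 COMBINING TILDE OVERLAY has class 1, so any number of copies
# may precede the dot below while the string stays in NFD.
for n in (0, 1, 1000):
    candidate = "a" + "\u0334" * n + DOT_BELOW + CIRCUMFLEX
    print(n, matches(candidate))
```

The matcher keeps a fixed, small amount of state however long the input
is, yet it accepts strings of any length, which is the distinction being
drawn between the size of the machine and the length of its matches.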

Richard.
Received on Mon Jan 29 2018 - 02:58:11 CST
