Re: Specification of Encoding of Plain Text from Richard Wordingham on 2017-01-12 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Thu, 12 Jan 2017 18:42:42 +0000

On Thu, 12 Jan 2017 14:12:09 +0100
Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> I agree that comprehension is a goal. I'd imagine using a BNF regex,
> like the following. This is simple, since I'm just doing Latin, but
> you can see what I mean.

> word = base* ;
> base = (latinLetter latinMn*) ;
> latinLetter = [[:scx=Latn:]&[:L:]] ;
> latinMn = [[:scx=Latn:][:scx=Common:]&[:Mn:]] ;
>
> which turns into the single regex expression:
>
> ([[:scx=Latn:]&[:L:]][[:scx=Latn:][:scx=Common:]&[:Mn:]]*)*

Ouch! That's alarmingly wrong. You've excluded the likes of
English 'Ca‍esar' with ZWJ, Welsh 'Llan͏gollen' with CGJ (the word
doesn't contain the letter 'ng') and the ISO-sanctioned transliteration
of Thai SO SUEA as 's̄'. Fixinɡ it isn't easy. At least, I assume
Arabic harakat don't attach to Latin letters in your conception of
Latin script text, so replacing 'scx=Common' by 'sc=Inherited' doesn't
work well.
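
As a rough illustration of the three counterexamples above, here is a
minimal Python sketch (not part of the original exchange) that lists the
non-letter characters in each string along with their General_Category.
The helper name non_letters is hypothetical, and the standard library
does not expose Script or Script_Extensions, so only the [:L:] / [:Mn:]
half of the proposed classes is examined. ZWJ turns out to be Cf, so it
cannot match any class intersected with [:Mn:]; CGJ and the combining
macron are Mn, so their exclusion would have to come from the script
half of the test.

```python
import unicodedata

# The three examples from the message, with the invisible or combining
# characters written as explicit escapes.
EXAMPLES = {
    "Caesar with ZWJ": "Ca\u200Desar",
    "Llangollen with CGJ": "Llan\u034Fgollen",
    "s with combining macron": "s\u0304",
}


def non_letters(text):
    """Return (code point, name, General_Category) for each non-letter."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
        if not unicodedata.category(ch).startswith("L")
    ]


if __name__ == "__main__":
    for label, text in EXAMPLES.items():
        print(label, non_letters(text))
```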

The problem may be conflicting requirements on the Script_Extensions
property.

Richard.
Received on Thu Jan 12 2017 - 12:44:02 CST
