Re: Deleting Lone Surrogates from Richard Wordingham on 2015-10-04 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sun, 4 Oct 2015 22:35:56 +0100

On Sun, 4 Oct 2015 21:48:12 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> 2015-10-04 21:30 GMT+02:00 Richard Wordingham <richard.wordingham_at_ntlworld.com>:

> > On Sun, 4 Oct 2015 15:44:32 +0200
> > Mark Davis ☕️ <mark_at_macchiato.com> wrote:

> > > When I use http://unicode.org/cldr/utility/breaks.jsp, it does
> > > show the sequence 𑒏�𑒺 as just two grapheme clusters.

> > But that's the sequence <U+1148F, U+FFFD, U+114BA>, which has no
> > lone surrogates at all!

> Mark just said that it was what was shown, i.e. the lone surrogate got
> treated as U+FFFD.

That's not what the English says, and I'm surprised if that's what a
literal translation into French means. I do half suspect that he
actually tried to post a lone surrogate.
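
For illustration only, here is a minimal Python sketch of how a lone
surrogate could turn into U+FFFD somewhere along the way. The input code
units are an assumption (I am guessing at a lone high surrogate D805
between the two Tirhuta characters), and decode_utf16_units is a
hypothetical helper, not anything the breaks.jsp tool is known to use.

```python
REPLACEMENT = "\uFFFD"

def decode_utf16_units(units):
    """Decode 16-bit code units, replacing each lone surrogate with U+FFFD."""
    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Well-formed surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
        elif 0xD800 <= u <= 0xDFFF:
            # Lone surrogate (high or low): substitute U+FFFD.
            out.append(REPLACEMENT)
            i += 1
        else:
            out.append(chr(u))
            i += 1
    return "".join(out)

# Assumed input: U+1148F as a pair, a lone high surrogate, U+114BA as a pair.
units = [0xD805, 0xDC8F, 0xD805, 0xD805, 0xDCBA]
decoded = decode_utf16_units(units)
code_points = [f"U+{ord(c):04X}" for c in decoded]
print(code_points)
```

After such a substitution the text holds <U+1148F, U+FFFD, U+114BA>, and
nothing downstream can tell that a lone surrogate was ever there.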

> However, in my opinion 𑒏�𑒺 (using U+FFFD substitution) gives 2
> grapheme clusters. I would prefer a solution that gives 3 grapheme
> clusters, as if the lone surrogate were a line-break control, so that
> the third character (combining, but just after the lone surrogate)
> will not combine with it but will be handled as a defective combining
> sequence with no starter at all before it.
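
The 2-versus-3 difference can be sketched with a toy segmenter in
Python. This is not the UAX #29 algorithm: EXTEND is a hard-coded,
assumed set containing only U+114BA, toy_clusters is a hypothetical
helper, and the "breaking" parameter merely models the proposal to
treat a lone surrogate like a control.

```python
# Assumption: only U+114BA is treated as a combining (extending) character here.
EXTEND = {0x114BA}

def toy_clusters(code_points, breaking=frozenset()):
    """Group code points into clusters: an extending character joins the
    previous cluster unless that cluster ends in a 'breaking' code point."""
    clusters = []
    for cp in code_points:
        attach = cp in EXTEND and clusters and clusters[-1][-1] not in breaking
        if attach:
            clusters[-1].append(cp)
        else:
            clusters.append([cp])
    return clusters

# With U+FFFD substituted, the combining mark attaches to U+FFFD.
with_fffd = toy_clusters([0x1148F, 0xFFFD, 0x114BA])

# With the lone surrogate kept and treated as breaking, the mark stands alone.
surrogates = range(0xD800, 0xE000)
with_break = toy_clusters([0x1148F, 0xD805, 0x114BA], breaking=surrogates)

print(len(with_fffd), len(with_break))
```

Under these assumptions the first call yields two clusters and the
second yields three, with the final mark left as a defective combining
sequence.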

I'd much prefer to be able to delete the first character of a grapheme
cluster. It's annoying to have to retype 4 characters because one's
mistyped the first of the 4 characters in a grapheme cluster. Removing
the restriction would be much more useful.

Richard.
Received on Sun Oct 04 2015 - 16:37:08 CDT
