Re: 8-bit text which is supposed to be UTF-8 but isn't

From: Mark E. Davis (markdavis@ispchannel.com)
Date: Sat Jan 29 2000 - 20:30:03 EST


For the precise limitations, take a look at
"http://www.unicode.org/unicode/reports/tr22/#Full Validity Checks". It is
not in regular expression form, but could easily be mapped to it. That
definition also explicitly removes the "non-shortest forms" by breaking down
the equivalent of UTF8-xtra-tail into several categories.
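
As a rough illustration of that mapping, here is a minimal Python sketch of
such a regular expression. The byte ranges are an assumption on my part: they
follow the well-formed UTF-8 table in the current Unicode Standard, which may
differ in detail from the TR22 text, and the names WELL_FORMED_UTF8 and
is_well_formed_utf8 are made up for the example.

```python
import re

# Assumption: ranges taken from the well-formed UTF-8 byte sequence table in
# the current Unicode Standard, not copied from TR22 itself.
WELL_FORMED_UTF8 = re.compile(
    rb"""(?:
        [\x00-\x7F]                          # ASCII
      | [\xC2-\xDF][\x80-\xBF]               # 2 bytes; C0 and C1 are non-shortest
      | \xE0[\xA0-\xBF][\x80-\xBF]           # 3 bytes; E0 80..9F is non-shortest
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}    # 3 bytes, unrestricted first tail
      | \xED[\x80-\x9F][\x80-\xBF]           # 3 bytes; ED A0..BF would be surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}        # 4 bytes; F0 80..8F is non-shortest
      | [\xF1-\xF3][\x80-\xBF]{3}            # 4 bytes, unrestricted first tail
      | \xF4[\x80-\x8F][\x80-\xBF]{2}        # 4 bytes; F4 90..BF exceeds U+10FFFF
    )*""",
    re.VERBOSE,
)


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if every byte of data belongs to a well-formed sequence."""
    return WELL_FORMED_UTF8.fullmatch(data) is not None
```

The first tail byte is the only one that needs a restricted range, which is
why the single tail rule gets split into categories per head byte.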

Mark

John Cowan wrote:

> Erland Sommarskog scripsit:
>
> > [A]n agent which discovers this should barf, or at least
> > replace the illegal characters with block characters or similar. Am I
> > right?
>
> Yes, in general, but circumstances alter cases. Plan9, for example,
> maps ill-formed UTF-8 byte sequences to the defined-but-unused
> character U+0080.
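
A minimal Python sketch of that kind of substitution, for illustration only.
This is not Plan 9's code; it just mimics the behaviour described above by
registering a decode error handler. The handler name "replace-u0080" and the
function names are invented for the example.

```python
import codecs


def replace_with_u0080(err):
    # Substitute U+0080 for each ill-formed span, then resume decoding
    # right after it. Anything other than a decode error is re-raised.
    if isinstance(err, UnicodeDecodeError):
        return ("\u0080", err.end)
    raise err


# Hypothetical handler name, chosen for this sketch.
codecs.register_error("replace-u0080", replace_with_u0080)


def decode_lenient(data: bytes) -> str:
    """Decode UTF-8, mapping ill-formed bytes to U+0080 instead of failing."""
    return data.decode("utf-8", errors="replace-u0080")
```

An agent that prefers to barf can simply decode with the default strict
error handling and let the UnicodeDecodeError propagate.
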
>
> > (Basically, we say: "just pass on, and what
> > happens at presentation is undefined. If the agent settles on to
> > present it as, say, Latin-1, that is fine with me.)
>
> Reasonable for Usenet news, where most content is read by human
> beings.
>
> > This is the BNF which is the draft right now:
> >
> > UTF8-xtra-head = %d192-255
> > UTF8-xtra-tail = %d128-191
> > UTF8-xtra-char = UTF8-xtra-head 1*UTF8-xtra-tail
>
> Here's a more precise version:
>
> UTF8-xtra-head-2 = %d192-223
> UTF8-xtra-head-3 = %d224-239
> UTF8-xtra-head-4 = %d240-247
> UTF8-xtra-tail = %d128-191
> UTF8-xtra-char = UTF8-xtra-head-2 UTF8-xtra-tail
> | UTF8-xtra-head-3 2*UTF8-xtra-tail
> | UTF8-xtra-head-4 3*UTF8-xtra-tail
>
> Bytes %d248-253 are technically legal but will never be needed,
> as Unicode/ISO 10646 will never grow beyond hex 0010FFFF except for
> deprecated additional private-use zones that predate Unicode,
> and bytes %d254-255 are outright illegal.
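
The more precise grammar can be sketched as a small structural check in
Python. This is only an illustration: the function names are invented,
treating bytes below 128 as one-byte characters is my assumption (the grammar
covers only the non-ASCII part), and each head is taken to require exactly
1, 2 or 3 tails. Like the grammar, it checks structure only and does not
reject non-shortest forms.

```python
def expected_length(head: int) -> int:
    """Sequence length implied by a lead byte, or 0 if it cannot start one."""
    if head < 128:
        return 1  # assumption: plain ASCII passes through as one byte
    if 192 <= head <= 223:
        return 2  # head-2
    if 224 <= head <= 239:
        return 3  # head-3
    if 240 <= head <= 247:
        return 4  # head-4
    return 0      # a stray tail byte, or a byte above the head-4 range


def matches_grammar(data: bytes) -> bool:
    """Check that data is a run of head bytes, each with the right tail count."""
    i = 0
    while i < len(data):
        n = expected_length(data[i])
        if n == 0 or i + n > len(data):
            return False
        # Every byte after the head must be a tail byte (128-191).
        if any(not 128 <= b <= 191 for b in data[i + 1:i + n]):
            return False
        i += n
    return True
```
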
>
> > Generally, we have been careful not to get into too much detail on how
> > UTF-8 will work when it comes to case equivalence and such, as there
> > is no one on the list with deeper knowledge in the area. We're just
> > assuming that there will be good libraries that programmers can use.
>
> IMHO (and other i18n types will probably agree), case folding is a bad
> idea in general. For backward compatibility, only case-fold the
> ASCII characters, and leave the others alone.
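
One way to fold only the ASCII letters, sketched in Python. The name
ascii_casefold is made up for the example; the point is simply that the
translation table covers A-Z and nothing else, so every other character
passes through untouched.

```python
import string

# Translation table that maps only A-Z to a-z.
_ASCII_FOLD = str.maketrans(string.ascii_uppercase, string.ascii_lowercase)


def ascii_casefold(text: str) -> str:
    """Lower-case ASCII letters and leave all other characters alone."""
    return text.translate(_ASCII_FOLD)
```

For raw bytes, Python's bytes.lower() already behaves this way, since it
only touches ASCII letters.
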
>
> --
> John Cowan cowan@ccil.org
> I am a member of a civilization. --David Brin


