Re: UTF-8 syntax

From: Peter_Constable@sil.org
Date: Sat Jun 09 2001 - 01:39:27 EDT


On 06/08/2001 07:26:54 PM Jianping Yang wrote:

>The issue comes from unpaired surrogates as <ED A0 80> and <ED B0 80> can be
>in UTF-8 and your search for <F0 90 80 80> (which is Unicode scalar value
>U-00010000) cannot find it. But however, when the UTF-8 string converted into
>UTF-16, <ED A0 80> and <ED B0 80> will become <D800 DC00>, and you can find
>the same character by searching <D800 DC00> in UTF-16.
>
>Unless this unpaired surrogate will be totally eliminated from UTF forms, this
>issue could be hit.
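
The byte sequences in the quoted text can be checked with a short Python
sketch. It is only an illustration: the variable names are made up here, and
Python's "surrogatepass" error handler is used solely to construct the
irregular six-byte form, which a conformant encoder would not produce.

```python
# U+10000 in well-formed UTF-8: a single four-byte sequence.
regular = "\U00010000".encode("utf-8")

# The irregular form: each UTF-16 surrogate (D800, DC00) encoded on its own
# as a three-byte sequence. "surrogatepass" lets Python emit these bytes.
irregular = "\ud800\udc00".encode("utf-8", "surrogatepass")

# A byte-level search for the four-byte form does not match the six-byte form.
found_in_utf8 = regular in irregular

# After converting both to UTF-16 (big-endian, no BOM) the code units agree.
regular_utf16 = regular.decode("utf-8").encode("utf-16-be")
irregular_utf16 = (
    irregular.decode("utf-8", "surrogatepass")
    .encode("utf-16-be", "surrogatepass")
)

print(regular.hex(" "))          # f0 90 80 80
print(irregular.hex(" "))        # ed a0 80 ed b0 80
print(found_in_utf8)             # False
print(regular_utf16 == irregular_utf16)  # True
```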

Because this sequence of "unpaired surrogates" is irregular and is not
something that processes are allowed to generate, it is not normally
expected in data and is unlikely to be encountered. Your proposal, on the
other hand, would make it frequent. That exacerbates the problem, which
makes it a case against your proposal. You also suggested that a solution
is to transcode into a different encoding form in which the differences
are neutralised. The same argument could be made about differences in
binary collation order: "if the UTF-8 data isn't sorting in the same order
as the UTF-16 data, then you can fix that by first converting the UTF-8
data into UTF-16." Again, this strikes me as feeding the case against your
proposal.
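
The difference in binary collation order mentioned above can be seen with two
characters, one in the BMP above the surrogate range and one supplementary.
This is a minimal sketch under stated assumptions: the helper names
utf8_key and utf16_key are hypothetical, and big-endian UTF-16 is used so
that comparing bytes is the same as comparing 16-bit code units.

```python
def utf8_key(s: str) -> bytes:
    """Sort key: the UTF-8 bytes of the string."""
    return s.encode("utf-8")


def utf16_key(s: str) -> bytes:
    """Sort key: the UTF-16 code units (big-endian, so byte order matches)."""
    return s.encode("utf-16-be")


bmp_char = "\uff5e"        # U+FF5E: UTF-8 EF BD 9E, UTF-16 FF5E
supp_char = "\U00010000"   # U+10000: UTF-8 F0 90 80 80, UTF-16 D800 DC00

by_utf8 = sorted([supp_char, bmp_char], key=utf8_key)
by_utf16 = sorted([supp_char, bmp_char], key=utf16_key)

# UTF-8 binary order follows code point order: U+FF5E sorts first.
print([hex(ord(c)) for c in by_utf8])    # ['0xff5e', '0x10000']
# UTF-16 binary order puts the surrogate pair (D800...) before FF5E.
print([hex(ord(c)) for c in by_utf16])   # ['0x10000', '0xff5e']
```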

Peter


