Re: UCD 3.2.0

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Apr 05 2002 - 15:11:40 EST


Theo Venker wrote:

> I'd like to make a few remarks about the UCD files.

First of all, while I'd like to thank Theo for going to the
trouble of checking the data files so carefully, and coming
up with some genuine errors in the data, I have a couple of
comments for people who are checking and reporting errors.

1. The preferred mechanism for reporting errors in data files
   or other errors in the standard is to make use of the
   reporting form on the Unicode website, rather than broadcasting
   email to the open list in the hope that someone will notice and
   take action. Please use:

   http://www.unicode.org/unicode/reporting.html

   (which you can also find by following the "Contact Us" link
   on the home page)

2. There is a reason why the UTC announces an extended BETA period
   before the release of a Unicode version, and encourages people
   to report errors in the data files during that period, *before*
   the actual release is finalized. Errors reported then can be
   fixed before the release. But at this point, the Unicode 3.2.0
   data files are finalized, warts and all. Reporting an error
   immediately *after* a release is actually one of the worst times
   to do so, since that is the maximal time before the next release,
   meaning that the chance of an error report being lost or forgotten
   before the next opportunity to fix it is greatest. So in the
   future, please do take the BETA period as your best opportunity
   for getting errors in the data files fixed in a timely manner.

>
> The following things I ran into when checking out the 3.2.0 release:
>
> o In PropertyValueAliases-3.2.0.txt line 79:
> ccc; 202; ATBL ; Attached_Below_Left
> whereas in UnicodeData-3.2.0.html I read:
> 200: Below left attached
> 202: Below attached
> What is the correct value for "attached below left", 200 or 202?

200. The error is in PropertyValueAliases-3.2.0.txt, where the entry
should be for Attached_Below, rather than Attached_Below_Left:

ccc; 202; ATB ; Attached_Below
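
For anyone who wants to check a local copy for this kind of mismatch, here
is a minimal sketch in Python. It is an illustration only, not a tool the
UTC uses. The function name check_ccc_aliases and the table
EXPECTED_CCC_NAMES are hypothetical, and the table holds just the two
values discussed above. It assumes the four-field line layout shown in the
quoted entry.

```python
# Hypothetical reference table: only the two values discussed in this thread.
EXPECTED_CCC_NAMES = {200: "Attached_Below_Left", 202: "Attached_Below"}


def check_ccc_aliases(lines, expected=EXPECTED_CCC_NAMES):
    """Return (line_no, value, found, wanted) for each ccc line whose
    long alias disagrees with the expected table."""
    problems = []
    for line_no, raw in enumerate(lines, start=1):
        # Drop any trailing comment, then split on the field separator.
        fields = [f.strip() for f in raw.split("#", 1)[0].split(";")]
        if len(fields) < 4 or fields[0] != "ccc":
            continue  # not a ccc value-alias line
        value, long_name = int(fields[1]), fields[3]
        wanted = expected.get(value)
        if wanted is not None and long_name != wanted:
            problems.append((line_no, value, long_name, wanted))
    return problems
```

Run against the faulty entry quoted above, this reports one problem for
value 202; run against the corrected entry, it reports none.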

>
> o In SpecialCasing-3.2.0.txt lines 234 and 235 are missing the closing
> semicolon. This problem also appeared in 3.1.1.

Noted. To be fixed.

>
> o Typo in UnicodeCharacterDatabase-3.2.0.html:
> "DerivedNormalizationProperties", should be "DerivedNormalizationProps".

Noted. To be fixed.

>
> Minor points that I find a bit annoying:
>
> o Many of the UCD files have a comment header with lines longer than 80
> characters. Viewing these files using the page utility on an 80 column
> terminal window gives ugly output due to the forced line wrapping.

Noted. This could be corrected, but is not a high priority. There are
many other lines which exceed 80 characters in the data, too.

>
> o All UCD files except CaseFolding-3.2.0.txt and SpecialCasing-3.2.0.txt
> *separate* columns by semicolons. For the two exceptions the semicolon
> *terminates* a column, why not keep it the same for all UCD files?

This is an issue for the UTC to decide.
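
In the meantime, a parser can cope with both conventions if it is told
which one a file uses. The following is a minimal sketch in Python, not a
reference implementation. The function name split_ucd_fields and its
terminated parameter are hypothetical, and it assumes that "#" starts a
trailing comment, as it does in CaseFolding and SpecialCasing. In
terminator mode it also raises on a missing closing semicolon, which is
the SpecialCasing problem reported above.

```python
def split_ucd_fields(line, terminated=False):
    """Split one UCD data line into whitespace-stripped fields.

    terminated=False: semicolons separate fields (most UCD files).
    terminated=True:  every field ends with a semicolon (the convention
                      described above for CaseFolding and SpecialCasing).
    Returns [] for blank or comment-only lines.
    """
    data = line.split("#", 1)[0].strip()  # drop trailing comment
    if not data:
        return []
    if terminated:
        if not data.endswith(";"):
            raise ValueError("missing closing semicolon: %r" % line)
        data = data[:-1]  # remove the terminator of the last field
    return [field.strip() for field in data.split(";")]
```

For example, split_ucd_fields("ccc; 202; ATB ; Attached_Below") returns
four fields, and passing terminated=True for a SpecialCasing line returns
the fields without a spurious empty one at the end.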

>
> o UnicodeData-3.2.0.txt still uses this notation:
> 1234;<Blah, First>;Lo;0;L;;;;;N;;;;;
> 5678;<Blah, Last>;Lo;0;L;;;;;N;;;;;
> instead of
> 1234..5678;<Blah, First>..<Blah, Last>;Lo;0;L;;;;;N;;;;;
> Since all other UCD files use the latter notation why not change this
> one too? IMHO backward compatibility with existing UCD file parsers
> shouldn't be an issue in this particular case.

It is an issue for some parsers. (And a burden on me, personally,
to fix them, since some of them are used in utilities which maintain
other parts of the Unicode Standard, or the Unicode Collation Algorithm.)
And we don't know how many other old parsers would blow up if we
just changed it. The UTC decided to leave it alone for now -- although
it might modify it in the future.
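
For people writing new parsers, the existing First/Last convention is not
hard to handle. Here is a minimal sketch in Python that folds each pair
into a single range. It is an illustration under stated assumptions, not
one of the utilities mentioned above: the name iter_unicode_data_ranges is
hypothetical, and it assumes a First line is always immediately followed
by its matching Last line.

```python
import re

# Matches range-marker names such as "<Blah, First>" or "<Blah, Last>".
_RANGE_NAME = re.compile(r"^<(?P<label>.+), (?P<kind>First|Last)>$")


def iter_unicode_data_ranges(lines):
    """Yield (start, end, name, other_fields) for UnicodeData-style lines.

    Ordinary entries are yielded with start == end. A First/Last pair is
    merged into one range named by the label the two lines share.
    """
    pending = None  # (start, label, other_fields) from an open First line
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        match = _RANGE_NAME.match(fields[1])
        if match is None:
            if pending is not None:
                raise ValueError("First without Last before %04X" % code)
            yield code, code, fields[1], fields[2:]
        elif match.group("kind") == "First":
            if pending is not None:
                raise ValueError("nested First at %04X" % code)
            pending = (code, match.group("label"), fields[2:])
        else:
            if pending is None or pending[1] != match.group("label"):
                raise ValueError("unmatched Last at %04X" % code)
            start, label, other_fields = pending
            pending = None
            yield start, code, label, other_fields
    if pending is not None:
        raise ValueError("First without Last at end of input")
```

Given the two sample lines quoted above, this yields a single entry
covering 1234..5678 with the name "Blah" and the properties taken from the
First line.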

--Ken


