From: Ken Whistler (kenw@sybase.com)
Date: Fri May 27 2011 - 13:47:23 CDT
On 5/27/2011 10:09 AM, Chris Clark wrote:
> I've been looking at the version 6.0 UnicodeData.txt data file at
> http://www.unicode.org/Public/UNIDATA/ and I can't find a
> UnicodeData.html to go with it. For older versions there is an HTML
> explanation file, e.g.
> http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
>
> Is UnicodeData.txt described elsewhere now?
You're a couple of generations behind. UnicodeData.html was replaced by
UCD.html for several versions.
Now, the documentation about UnicodeData.txt (and the rest of the data
files of the Unicode Character Database (UCD)) is gathered in UAX #44:
http://www.unicode.org/reports/tr44/
When looking for the documentation about any particular version of the UCD,
whether current or earlier, always start from the component listing for
that version. The component listings give explicit links to the
documentation for each version. Start from:
http://www.unicode.org/versions/enumeratedversions.html
which is also accessible from the home page on the link "Archive of Unicode
Versions" in the menus.
>
> I'm finding the notation for ranges in UnicodeData.txt a little
> non-intuitive, e.g. the omitted Hangul Syllables have 2 entries:
>
> AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
> D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
>
> Would it make more sense to have a single entry? Something along the
> lines of:
>
> AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;
>
> A single line would be easier to detect and deal with when parsing the
> file. No need to maintain processing state between each line.
That existing notation is a bit awkward to parse, but is left that way
in part because it has *always* been that way. Changing it to accommodate
some new parsers would just break old parsers.
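
For what it's worth, the state that needs to be carried between lines is
small. The following is only a minimal sketch in Python, not an official
parser: the function name iter_ranges and the SAMPLE string are hypothetical,
and it assumes every "<..., First>" line is immediately followed by its
matching "<..., Last>" line, as in the two Hangul entries quoted above.

```python
def iter_ranges(lines):
    """Yield (first_cp, last_cp, name, props) for each UnicodeData.txt record.

    Ordinary lines yield a one-code-point range. A "<X, First>" line is held
    as pending state until the matching "<X, Last>" line closes the range.
    """
    pending = None  # (first code point, range name) while inside a First/Last pair
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)   # field 0: code point in hexadecimal
        name = fields[1]          # field 1: character name or range marker
        props = fields[2:]        # remaining property fields, left unparsed
        if name.startswith("<") and name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            if pending is None:
                raise ValueError("Last without First at %04X" % cp)
            first_cp, range_name = pending
            pending = None
            yield (first_cp, cp, range_name, props)
        else:
            yield (cp, cp, name, props)


# The two Hangul Syllable lines quoted in this thread.
SAMPLE = """\
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

for first, last, name, props in iter_ranges(SAMPLE.splitlines()):
    # props[0] is the General_Category field
    print("%04X..%04X %s %s" % (first, last, name, props[0]))
```

Run against the two sample lines, this prints a single range record,
AC00..D7A3 Hangul Syllable Lo, which is roughly the one-line form
suggested above, derived without changing the file format.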
>
> http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html does
> explicitly list the ranges of characters (which I find REALLY useful
> and clear); it also mentions that CJK Ideographs and Hangul Syllables
> are omitted as they can be easily derived. It then links to Unicode
> Standard and Unicode Standard Annex #15 (i.e.
> http://unicode.org/reports/tr15/). I can find the Hangul algorithm at
> http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
> covered. I know this is a pretty obvious algorithm but I was expecting
> to see it explicitly detailed.
See UAX #44 for current information.
The explicit ranges of characters defined by range entries in
UnicodeData.txt are not listed in UAX #44, but they are trivially
derivable from UnicodeData.txt itself:
% grep First UnicodeData.txt
% grep Last UnicodeData.txt
will get you all of them for any particular version of UnicodeData.txt.
--Ken