Re: UnicodeData.txt questions

From: Ken Whistler (kenw@sybase.com)
Date: Fri May 27 2011 - 13:47:23 CDT

Next message: Vinodh Rajan: "Lao Script Block - Missing Letters"

Previous message: Chris Clark: "UnicodeData.txt questions"
In reply to: Chris Clark: "UnicodeData.txt questions"

On 5/27/2011 10:09 AM, Chris Clark wrote:
> I've been looking at the version 6.0 UnicodeData.txt data file at
> http://www.unicode.org/Public/UNIDATA/ and I can't find a
> UnicodeData.html to go with it. For older versions there is a html
> explanation file, e.g.
> http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
>
> Is UnicodeData.txt described elsewhere now?

You're a couple of generations behind. UnicodeData.html was replaced by
UCD.html for several versions.

Now, the documentation about UnicodeData.txt (and the rest of the data
files of the Unicode Character Database (UCD)) is gathered in UAX #44:

http://www.unicode.org/reports/tr44/

When looking for the documentation about any particular version of the UCD,
whether current or earlier, always start from the component listing for
that version. The component listings give explicit links to the
documentation for each version. Start from:

http://www.unicode.org/versions/enumeratedversions.html

which is also accessible from the home page on the link "Archive of Unicode
Versions" in the menus.

>
> I'm finding the notation for ranges in UnicodeData.txt a little
> non-intuitive, e.g. the omitted Hangul Syllables has 2 entries:
>
> AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
> D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
>
> Would it make more sense to have a single entry? Something along the
> lines of:
>
> AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;
>
> A single line would be easier to detect and deal with when parsing the
> file. No need to maintain processing state between each line.

That existing notation is a bit awkward to parse, but is left that way
in part because it has *always* been that way. Changing it to accommodate
some new parsers would just break old parsers.
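
For what it's worth, the state a parser has to carry is small. Here is a
minimal sketch in Python of one way to collapse the First/Last pairs into
single range entries. The names iter_entries, RANGE_NAME and SAMPLE are
made up for illustration and are not part of any Unicode tooling; the sample
lines are the two Hangul lines quoted above plus the entry for U+0041.

```python
import re

# Matches range-marker names such as "<Hangul Syllable, First>".
# Ordinary names and "<control>" do not match.
RANGE_NAME = re.compile(r"^<(.+), (First|Last)>$")

def iter_entries(lines):
    """Yield (start, end, name, remaining_fields) for each entry.

    A First/Last pair is collapsed into one entry covering the range;
    every other line yields an entry with start == end.
    """
    pending = None  # (start code point, range label) of an open First line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp = int(fields[0], 16)
        m = RANGE_NAME.match(fields[1])
        if m and m.group(2) == "First":
            pending = (cp, m.group(1))
        elif m and m.group(2) == "Last":
            if pending is None or pending[1] != m.group(1):
                raise ValueError("Last without matching First at " + fields[0])
            yield pending[0], cp, m.group(1), fields[2:]
            pending = None
        else:
            if pending is not None:
                raise ValueError("First without Last before " + fields[0])
            yield cp, cp, fields[1], fields[2:]
    if pending is not None:
        raise ValueError("file ended inside a range")

SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]

entries = list(iter_entries(SAMPLE))
```

The only state kept between lines is the pending First entry, and the
sketch raises an error if a pair is ever unbalanced rather than guessing.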

>
> http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html does
> explicitly list the ranges of characters (which I find REALLY useful
> and clear), it also mentions that CJK Ideographs and Hangul Syllables
> are omitted as they can be easily derived. It then links to Unicode
> Standard and Unicode Standard Annex #15 (i.e.
> http://unicode.org/reports/tr15/). I can find the Hangul algorithm at
> http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
> covered. I know this is a pretty obvious algorithm but I was expecting
> to see it explicitly detailed.

See UAX #44 for current information.
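
As a rough illustration of how the omitted names can be derived, here is a
sketch in Python, assuming the usual rules: CJK unified ideographs are
named by appending the code point in hex, and Hangul syllable names are
built arithmetically from jamo short names. The function derived_name and
its range_label argument (the text inside the <..., First> marker) are
hypothetical names for this example; check the rules against the
documentation for the version you are actually using.

```python
# Constants for Hangul syllable decomposition.
S_BASE, V_COUNT, T_COUNT = 0xAC00, 21, 28

# Jamo short names for leading consonants, vowels and trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]

def derived_name(cp, range_label):
    """Return the derived character name, or None if the range has none."""
    if range_label.startswith("CJK Ideograph"):
        # Covers the main block and the extension ranges alike.
        return "CJK UNIFIED IDEOGRAPH-%04X" % cp
    if range_label == "Hangul Syllable":
        s_index = cp - S_BASE
        l_index = s_index // (V_COUNT * T_COUNT)
        v_index = (s_index % (V_COUNT * T_COUNT)) // T_COUNT
        t_index = s_index % T_COUNT
        return ("HANGUL SYLLABLE "
                + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])
    # Surrogate and private use ranges carry no character names.
    return None
```

For example, derived_name(0xAC00, "Hangul Syllable") gives
"HANGUL SYLLABLE GA" and derived_name(0x4E00, "CJK Ideograph") gives
"CJK UNIFIED IDEOGRAPH-4E00".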

The explicit ranges of characters defined by ranges in UnicodeData.txt
are not listed in UAX #44, but they are trivially derivable from
UnicodeData.txt itself:

% grep First UnicodeData.txt
% grep Last UnicodeData.txt

will get you all of them for any particular version of UnicodeData.txt.

--Ken


This archive was generated by hypermail 2.1.5 : Fri May 27 2011 - 13:49:52 CDT