From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Sep 06 2007 - 15:35:44 CDT
> This leads to another issue in the database format, which I prefer to
> discuss here first: why are there ranges in UnicodeData.txt rather than
> explicit records for every character?
Several reasons.
UnicodeData.txt was the very first data file. Its format was
created ad hoc, somewhere in the 1993 timeframe, in support of
the publication and implementation of Unicode 1.1, by developers
at Taligent.
It was released publicly with some other new data files for
Unicode 2.0 in 1996.
One thing to note is that when the Hangul
syllables were recoded between Unicode 1.1 and Unicode 2.0
(and expanded from 6,656 to 11,172 in number), the UTC (and WG2,
for that matter) made an explicit decision to create Hangul
syllable names for the new set algorithmically. For the
UTC the decompositions could also be done algorithmically.
So it was a desired and *mandated* decision by the UTC to
*remove* the explicit Hangul records from UnicodeData.txt and
to note that those values were all algorithmically derived
instead.
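
For readers who have not seen the algorithmic derivation, here is a
minimal sketch of the arithmetic Hangul decomposition, assuming the
constants given in the Unicode Standard's section on conjoining jamo
behavior. The function name decompose_hangul is made up for this
illustration; it is not part of any Unicode data file or library.

```python
# Constants for the 11,172 precomposed Hangul syllables (U+AC00..U+D7A3).
S_BASE, L_BASE, V_BASE, T_BASE = 0xAC00, 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588 syllables per leading consonant
S_COUNT = L_COUNT * N_COUNT   # 11172 syllables in total


def decompose_hangul(cp):
    """Return the jamo code points for one precomposed syllable.

    Hypothetical helper for illustration only.
    """
    s_index = cp - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"U+{cp:04X} is not a precomposed Hangul syllable")
    lead = L_BASE + s_index // N_COUNT
    vowel = V_BASE + (s_index % N_COUNT) // T_COUNT
    trail = T_BASE + s_index % T_COUNT
    # A trailing index of zero means "no trailing consonant".
    return [lead, vowel] if trail == T_BASE else [lead, vowel, trail]
```

Because every value falls out of the code point itself, no
per-character record is needed to carry the decomposition.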
UnicodeData-1.1.5.txt didn't have the First/Last convention.
Instead, it had a single entry for Han characters, to wit:
4E00;<CJK IDEOGRAPH REPRESENTATIVE>;Lo;0;L;;;;;N;;;;;
UnicodeData-2.0.14.txt, for Unicode 2.0, innovated the
First/Last convention for CJK and other ranges, because it
became apparent that there were other ranges to document
besides the initial CJK URO, and because having the start
and stop values for the range was important.
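
As a rough illustration of how a consumer might read the First/Last
convention, the sketch below pairs each "First" record with its "Last"
record and yields one (first, last, label, fields) tuple per entry.
The sample lines are modeled on the UnicodeData.txt format, and the
name iter_records is an assumption of this example, not an official
parser.

```python
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""


def iter_records(lines):
    """Yield (first_cp, last_cp, label, other_fields) for each entry.

    Ordinary records come back with first_cp == last_cp; a First/Last
    pair is collapsed into a single range tuple.
    """
    pending = None  # (first_cp, label) of an open "First" record
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        cp, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            pending = (cp, name[1:-len(", First>")])
        elif name.endswith(", Last>"):
            if pending is None:
                raise ValueError(f"Last without First at {fields[0]}")
            first, label = pending
            pending = None
            yield first, cp, label, fields[2:]
        else:
            yield cp, cp, name, fields[2:]


for first, last, label, _ in iter_records(SAMPLE.splitlines()):
    print(f"{first:04X}..{last:04X} {label} ({last - first + 1} code points)")
```

Four lines of range data stand in for more than thirty thousand
identical records here.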
As Eric hinted, back in the mid-90's data size was more of
an issue. With 70% of Unicode consisting of Han characters,
and with the UnicodeData.txt values redundant across all
of them, a simple database normalization decision that
reduces all those records to a few range points was an
obvious way to go.
Furthermore, UnicodeData.txt all along has been maintained
with fairly simple tools and diffs. Unlike Unihan.txt, which
is actually a report generated from a relational database,
UnicodeData.txt *is* the data source itself. It has had to
be meticulously maintained in many, many deltas going back
over a decade now, and having all of those versions bloated
with massive amounts of redundant CJK and Hangul records that
never change would simply have been inefficient and useless.
> Being explicit would avoid
> generating names for the implicit records (something which is not
> obvious and not well documented, IMHO).
Well, if you go to ISO/IEC 10646, clause 28 is "Character names
and annotations", and in that clause, subclause 28.2 "Character
names for CJK Ideographs" gives the rules for naming of
CJK unified and compatibility ideographs, and subclause 28.3
"Character names and annotations for Hangul syllables" does
the same for Hangul syllables. It is not as if anyone who
reads the standard could miss it.
The Unicode Standard (by necessity) follows the same rules,
and documents them in Chapter 17 "Code Charts", with Section 17.2
"CJK Unified Ideographs" spelling out the CJK rule and
Section 17.3 "Hangul Syllables" noting the Hangul name rule
and pointing to Section 3.12 "Conjoining Jamo Behavior" for
the details of the algorithm.
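
To make those rules concrete, here is a hedged sketch of deriving
names for the implicit records. The jamo short names are those
published in the standard; the function name derived_name and the
hard-coded 4E00..9FA5 range are assumptions for this example (a real
implementation would take its ranges from the First/Last records of
the version in use).

```python
S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT   # 588
S_COUNT = 19 * N_COUNT        # 11172

# Jamo short names: leading consonants, vowels, trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]

# Assumed range for this sketch: the original CJK unified ideograph block.
CJK_FIRST, CJK_LAST = 0x4E00, 0x9FA5


def derived_name(cp):
    """Return the algorithmic name for cp, or None if none applies."""
    s_index = cp - S_BASE
    if 0 <= s_index < S_COUNT:
        lead, rest = divmod(s_index, N_COUNT)
        vowel, trail = divmod(rest, T_COUNT)
        return ("HANGUL SYLLABLE "
                + JAMO_L[lead] + JAMO_V[vowel] + JAMO_T[trail])
    if CJK_FIRST <= cp <= CJK_LAST:
        return f"CJK UNIFIED IDEOGRAPH-{cp:04X}"
    return None


print(derived_name(0xD4DB))
print(derived_name(0x4E00))
```

The CJK rule is just the code point appended to a fixed prefix; the
Hangul rule concatenates three short names selected by integer
division.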
But I grant that perhaps this is not obvious to the
casual observer of Unicode, as opposed to folks who have
been working on the standard for years.
Perhaps sticking something in the FAQ on Chinese, Japanese
and Korean issues would help:
http://www.unicode.org/faq/han_cjk.html
Or perhaps a FAQ just dedicated to Unicode character names
might be in order. In any case, make specific suggestions,
and perhaps the situation can be improved.
>
> Or, a variant, why not a DerivedUnicodeData.txt file with
> all the characters?
Eric Muller pointed out the ultimate answer to this, which
is a fully rationalized and complete XML representation of
*all* of the Unicode Character Database.
Note, however, as regards names in particular, that some
Unicode characters (e.g., noncharacters, private-use characters) don't
have character names, so any notion of simply expanding the
conventions of UnicodeData to all assigned code points gets
you into the same kind of trouble that including control
codes in the current UnicodeData.txt does -- you may give
up one set of arbitrary conventions (First/Last range
compression), but end up having to invent other arbitrary
conventions for special cases.
--Ken