L2/01-054

From: Mark Davis [mark.davis@us.ibm.com]
Sent: Tuesday, January 23, 2001 8:21 PM
Subject: Agenda Item: Ranges in UnicodeData

At the last UTC meeting, we decided to change the format for files such as
Blocks.txt so that any ranges would be expressed with "..". Thus we have
the following format in the new Blocks.txt (currently
http://www.unicode.org/Public/3.1-Update/Blocks-4d3.beta.txt):

# Start Code..End Code; Block Name
0000..007F; Basic Latin
0080..00FF; Latin-1 Supplement
0100..017F; Latin Extended-A
...

This simplifies parsing and unifies our notation. When the first field in
any of our files is parsed, ".." always indicates that there is a range.
(Although at first I was against this change when Asmus proposed it, as I
have upgraded my tools for supplementary code points, I have come to really
see the value of it.)
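
As a rough illustration of that simplification (a sketch only, in Python;
the names parse_code_range and parse_line are made up for this example and
do not come from any existing Unicode tool), the first field can be handled
by one routine whether or not it holds a range:

```python
def parse_code_range(field):
    """Parse a first field such as '0041' or '0000..007F' into (start, end)."""
    field = field.strip()
    if ".." in field:
        # ".." in the first field always marks a range
        first, last = field.split("..", 1)
    else:
        # A single code point is treated as a one-element range
        first = last = field
    return int(first, 16), int(last, 16)


def parse_line(line):
    """Return (start, end, remaining_fields), or None for comment/blank lines."""
    # Assumption: '#' starts a comment, as in the Blocks.txt header line
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [f.strip() for f in line.split(";")]
    start, end = parse_code_range(fields[0])
    return start, end, fields[1:]
```

For example, parse_line("0000..007F; Basic Latin") yields the start and end
code points plus the list of remaining fields, and the same call works for a
line whose first field is a single code point.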

I was discussing this with Markus Scherer today, and he mentioned that it
would also be much cleaner to apply this approach to the main file,
UnicodeData.txt (currently
http://www.unicode.org/Public/3.1-Update/UnicodeData-3.1.0d5.beta.txt). We
have quite a number of ranges there, with a very clumsy mechanism for
indicating them. Example:

3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Parsers would be much cleaner and simpler if they could handle all ranges
in all the Unicode data files the same way. So, the proposal is to change
ranges like the above into:

3400..4DB5;<CJK Ideograph Extension A>;Lo;0;L;;;;;N;;;;;
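
To illustrate the change, here is a minimal sketch (Python; collapse_ranges
is a hypothetical helper, and it assumes that the First and Last lines of a
pair carry identical values in the remaining fields, as in the example
above) of rewriting existing First/Last pairs into the proposed form:

```python
def collapse_ranges(lines):
    """Merge '<..., First>' / '<..., Last>' line pairs into 'XXXX..YYYY' lines."""
    out = []
    pending = None  # fields of a "First" line still waiting for its "Last"
    for line in lines:
        line = line.rstrip("\n")
        fields = line.split(";")
        if len(fields) < 2:
            # Not a data line (e.g. blank); pass it through unchanged
            out.append(line)
            continue
        name = fields[1]
        if name.endswith(", First>"):
            pending = fields
        elif name.endswith(", Last>") and pending is not None:
            merged = list(pending)  # assumption: other fields match the Last line
            merged[0] = pending[0] + ".." + fields[0]
            merged[1] = pending[1][: -len(", First>")] + ">"
            out.append(";".join(merged))
            pending = None
        else:
            out.append(line)
    return out
```

Ordinary single-code-point lines pass through untouched, so only the paired
range entries change.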

Mark
___
Mark Davis, IBM GCoC, Cupertino
(408) 777-5850 [fax: 5891], mark.davis@us.ibm.com, president@unicode.org
http://maps.yahoo.com/py/maps.py?Pyt=Tmap&addr=10275+N.+De+Anza&csz=95014