From: Chris Clark (Chris.Clark@ingres.com)
Date: Fri May 27 2011 - 12:09:16 CDT
I've been looking at the version 6.0 UnicodeData.txt data file at
http://www.unicode.org/Public/UNIDATA/ and I can't find a
UnicodeData.html to go with it. For older versions there is an HTML
explanation file, e.g.
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html
Is UnicodeData.txt described elsewhere now?
I'm finding the notation for ranges in UnicodeData.txt a little
non-intuitive, e.g. the omitted Hangul Syllables have two entries:
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
Would it make more sense to have a single entry? Something along the
lines of:
AC00..D7A3;<RANGE: Hangul Syllables>;Lo;0;L;;;;;N;;;;;
A single line would be easier to detect and deal with when parsing the
file. No need to maintain processing state between each line.
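To make the state-keeping concrete, here is a minimal sketch of how the current two-line First/Last notation can be folded into ranges while reading the file. The function name iter_ranges and the tuple layout it yields are assumptions made for illustration only; nothing here comes from the Unicode documentation.

```python
def iter_ranges(lines):
    """Yield (first, last, name, other_fields) for each UnicodeData.txt entry.

    Hypothetical helper: ordinary lines yield first == last, while a
    "<..., First>" / "<..., Last>" pair is folded into a single range.
    """
    pending = None  # (code point, label) of a First line awaiting its Last
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code = int(fields[0], 16)
        name = fields[1]
        if name.startswith("<") and name.endswith(", First>"):
            # remember the start of the range until the Last line arrives
            pending = (code, name[1:-len(", First>")])
        elif name.startswith("<") and name.endswith(", Last>"):
            label = name[1:-len(", Last>")]
            if pending is None or pending[1] != label:
                raise ValueError("unmatched range end: " + line)
            yield pending[0], code, label, fields[2:]
            pending = None
        else:
            yield code, code, name, fields[2:]
    if pending is not None:
        raise ValueError("range start without end: U+%04X" % pending[0])


sample = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
for first, last, name, _ in iter_ranges(sample):
    print("%04X..%04X %s" % (first, last, name))
```

The pending variable is exactly the per-line state that a single AC00..D7A3 entry would make unnecessary.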
http://www.unicode.org/Public/3.2-Update/UnicodeData-3.2.0.html does
explicitly list the ranges of characters (which I find REALLY useful and
clear). It also mentions that CJK Ideographs and Hangul Syllables are
omitted as they can be easily derived. It then links to the Unicode Standard
and Unicode Standard Annex #15 (i.e. http://unicode.org/reports/tr15/).
I can find the Hangul algorithm at
http://unicode.org/reports/tr15/#Hangul but CJK Ideographs are not
covered. I know this is a pretty obvious algorithm but I was expecting
to see it explicitly detailed.
I went ahead and implemented Python versions of both, i.e. java->python
for Hangul and a new CJK name function. I'm not sure if they are any use
to anyone but me but I thought I'd share them just in case, see end of
mail for inline version. It was tested with Python 2.x and Jython 2.5.2
(and it will probably work with 3.x too)
Chris
class MyBaseException(Exception):
    '''Base exception'''


class IllegalArgumentException(MyBaseException):
    '''Java IllegalArgumentException'''

# Hangul constants
SBase = 0xAC00
#LBase = 0x1100, VBase = 0x1161, TBase = 0x11A7,
LCount = 19
VCount = 21
TCount = 28
NCount = VCount * TCount # 588
SCount = LCount * NCount # 11172
JAMO_L_TABLE = [
    "G", "GG", "N", "D", "DD", "R", "M", "B", "BB",
    "S", "SS", "", "J", "JJ", "C", "K", "T", "P", "H"
]
JAMO_V_TABLE = [
    "A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O",
    "WA", "WAE", "OE", "YO", "U", "WEO", "WE", "WI",
    "YU", "EU", "YI", "I"
]
JAMO_T_TABLE = [
    "", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG", "LM",
    "LB", "LS", "LT", "LP", "LH", "M", "B", "BS",
    "S", "SS", "NG", "J", "C", "K", "T", "P", "H"
]

def getHangulName(single_unicode_character):
    """Python straight conversion of Java reference
    implementation getHangulName() from
    http://unicode.org/reports/tr15/#Hangul

    Non-pythonic, no change to names/code unless for syntax reasons

    Parameter:
        single_unicode_character - single Unicode character
    """
    # add assert unicode
    s = ord(single_unicode_character)
    SIndex = s - SBase
    if (0 > SIndex or SIndex >= SCount):
        # format the code point explicitly; str + int raises TypeError
        raise IllegalArgumentException("Not a Hangul Syllable: U+%04X" % s)
    # floor division keeps the indices integral on Python 3 as well
    LIndex = SIndex // NCount
    VIndex = (SIndex % NCount) // TCount
    TIndex = SIndex % TCount
    return "HANGUL SYLLABLE " + JAMO_L_TABLE[LIndex] \
        + JAMO_V_TABLE[VIndex] + JAMO_T_TABLE[TIndex]

def getCJKName(single_unicode_character):
    """names and functionality based on
    implementation of getHangulName() from
    http://unicode.org/reports/tr15/#Hangul

    Parameter:
        single_unicode_character - single Unicode character
    """
    # add assert unicode
    s = ord(single_unicode_character)
    # U+4E00 .. U+9FA5
    if (s < 0x4E00 or s > 0x9FA5):
        raise IllegalArgumentException("Not a CJK Unified Ideograph: U+%04X" % s)
    # character names use uppercase hex digits, e.g. CJK UNIFIED IDEOGRAPH-4E00
    return "CJK UNIFIED IDEOGRAPH-%04X" % s
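
As a sanity check on the index arithmetic, the following sketch works through one code point, U+D4DB, and compares the derived names with what the standard library's unicodedata module reports. It assumes Python 3 (for chr() on any code point); the helper names hangul_indices and cjk_name are made up for this example and are not part of the code above.

```python
import unicodedata

S_BASE = 0xAC00
V_COUNT, T_COUNT = 21, 28
N_COUNT = V_COUNT * T_COUNT  # 588


def hangul_indices(code_point):
    """Return (LIndex, VIndex, TIndex) for a Hangul syllable code point."""
    s_index = code_point - S_BASE
    l_index, rest = divmod(s_index, N_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return l_index, v_index, t_index


def cjk_name(code_point):
    """Derived name: fixed prefix plus the code point in uppercase hex."""
    return "CJK UNIFIED IDEOGRAPH-%04X" % code_point


# 0xD4DB - 0xAC00 = 10459 = 17 * 588 + 16 * 28 + 15
print(hangul_indices(0xD4DB))         # (17, 16, 15), i.e. "P", "WI", "LH"
print(unicodedata.name(chr(0xD4DB)))  # HANGUL SYLLABLE PWILH
print(cjk_name(0x4E00) == unicodedata.name(chr(0x4E00)))  # True
```

Looking up positions 17, 16 and 15 in JAMO_L_TABLE, JAMO_V_TABLE and JAMO_T_TABLE gives "P", "WI" and "LH", which matches the name unicodedata returns.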