Page 1 of 1
|
[ 7 posts ] |
thesun
|
Post subject: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 11:07 am |
|
Joined: Fri Apr 15, 2011 12:28 am Posts: 16
|
|
I've been lucky to get some hugely helpful tips about getting regular (non-Unihan) characters in &#xABCD; format... but now that I'm trying to go through the Unihan documents, I'm finding them very opaque and so different from the UnicodeData.txt file that I can't figure out what to do next. What I'm looking for is the Unihan equivalent of UnicodeData.txt: a single list of all the currently assigned values, each mapped to an actual character. No duplicates, and no unused values that don't resolve to a character.
I have unzipped Unihan.zip and am not finding anything that's close to the UnicodeData.txt file, where each and every char mapping is set out sequentially in a long delimited list. So my question is whether such a list already exists, or how I might construct one (if it doesn't!) from the various Unihan files.
Hopefully this list already exists somewhere, but if not:
Can I safely generate values with a script from the ranges? For example, is each and every value in the following CJK range an actual ideograph?
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
Or are there unused or <control> values mixed in?
I don't need any of the readings... just the "&#xABCD;" numbers of actual valid ideographs in all the various CJK and Extension ranges (or anything not included in UnicodeData.txt).
Thank you for the help!
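On the range question: in UnicodeData.txt, a `<..., First>` / `<..., Last>` pair stands for every code point between the two endpoints, inclusive, so such pairs can be expanded mechanically. Below is a minimal sketch in POSIX shell, assuming UnicodeData.txt is supplied on standard input. The function names `cjk_ranges` and `expand_range` are made up for illustration and do not come from any Unicode tool.

```shell
# Assumption: UnicodeData.txt arrives on stdin; function names are illustrative only.

# Print "FIRST LAST" (hex, no prefix) for each CJK ideograph range.
cjk_ranges() {
    awk -F';' '
        $2 ~ /^<CJK Ideograph.*, First>$/ { first = $1 }
        $2 ~ /^<CJK Ideograph.*, Last>$/  { print first, $1 }
    '
}

# Print every code point from $1 to $2 (hex, no prefix) as U+XXXX, one per line.
expand_range() {
    cp=$((0x$1))
    last=$((0x$2))
    while [ "$cp" -le "$last" ]; do
        printf 'U+%04X\n' "$cp"
        cp=$((cp + 1))
    done
}

# Usage (the file location is hypothetical):
#   cjk_ranges < UnicodeData.txt |
#       while read -r first last; do expand_range "$first" "$last"; done
```

The shell loop is slow for the larger ranges but keeps the sketch dependency-free. The same expansion could be done in any scripting language.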
|
asmus
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 3:57 pm |
|
 |
| Unicode Guru |
Joined: Tue Dec 01, 2009 2:49 pm Posts: 172
|
You've asked essentially the same question in the How To Forum. Here's a link to the reply I gave you there: viewtopic.php?p=625#p625 I believe that reply gives you the best approach for your task. If for some other reason you had to work from the Unihan database, you would process the first line for each character code and ignore all other lines that have the same code. A simple Perl script should be able to do that for you. It's not hard, but it's unnecessarily complicated, so follow my link above.
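The "first line for each character code" idea can be sketched with awk rather than Perl. This assumes the Unihan data files are tab-separated, with the code point (U+XXXX) in the first field and comment lines starting with #. The function name `unique_codepoints` is hypothetical.

```shell
# Assumption: Unihan-style text on stdin, e.g. "U+3400<TAB>kRSUnicode<TAB>1.4".
# Prints each code point once, at its first occurrence; comment and blank
# lines are skipped because they do not start with "U+".
unique_codepoints() {
    awk -F'\t' '/^U[+]/ && !seen[$1]++ { print $1 }'
}

# Usage (file names are an assumption about what Unihan.zip unpacks to):
#   cat Unihan_*.txt | unique_codepoints > codepoints.txt
```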
|
thesun
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 6:44 pm |
|
Joined: Fri Apr 15, 2011 12:28 am Posts: 16
|
|
I know it sounds like a similar question but I was not able to see a document similar to UnicodeData.txt in the Unihan.zip -- I see a bunch of files, none of which seem to be a unified list of ALL the Unihan chars. Thus, it's hard to know which documents to start from.
In UnicodeData.txt, I'm in good shape (thanks to your help!!!) but here I'm back at square one. I've found that the character charts are on Wikipedia too, but many of the more esoteric characters aren't rendering in my browser, making it hard to know whether they're mapped to actual characters or not.
Again, I appreciate any and all help. And I agree -- it won't be hard to strip the chars' additional info from UnicodeData.txt and leave just the chars, remove the lf/crs, and have exactly the file I need. But for Unihan, it seems more of a challenge. :-(
|
Tseng
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 7:10 pm |
|
Joined: Wed Feb 10, 2010 2:51 pm Posts: 16 Location: Salt Lake City
|
A complete listing of the characters considered part of Unihan can be found in the proposed update to UAX #38 (http://www.unicode.org/reports/tr38/proposed.html), although that applies to the upcoming Unicode 6.1.0. You can also get it out of Unihan.zip, from Unihan_RadicalStrokeCounts.txt. You can do something really simple like
Code:
egrep kRSUnicode Unihan_RadicalStrokeCounts.txt | sed -e 's/^\(U.[0-9A-F]\{4,5\}\)*/\1/g'
and you'll get a complete list; every character in Unihan is guaranteed to have a kRSUnicode value.
|
thesun
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 8:51 pm |
|
Joined: Fri Apr 15, 2011 12:28 am Posts: 16
|
|
Thank you, John. That was exactly what was unclear to me: which file would contain all the Unihan chars, and how to remove duplicates. The code you provided worked wonders, though each line still has " kRSUnicode (number)" appended to it. Is there a way to trim it so that only the code point at the start of the line is left? (In PHP I'd use the "substr" function... but I don't know sed very well.)
I'd like just the number:
U+3400;
U+3401;
U+3402;
(etc.)
Or even better, if the function is almost the same, would be without line breaks:
U+3400;U+3401;U+3402; etc.
And then I can search and replace on the "U+" to end up with "&#x3400;&#x3401;&#x3402;" etc. That's what I'm eventually aiming for.
I figure I can use a text editor to remove each line break, but there are so many that a command-line method would be ideal.
Thank you again for the help. I'm almost there! :-)
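One way to get there, sketched under the assumption that each input line starts with a code point such as U+3400 followed by a tab and further fields (as in the Unihan files): keep only the leading code point, swap "U+" for "&#x", append the semicolon, and drop the line breaks. In the command quoted above, the `*` applies to the parenthesised group, so only the code point itself is matched and the rest of the line is left in place. A `.*` after the group is what consumes the remainder. The function name `to_ncr` is made up for illustration.

```shell
# Assumption: stdin holds lines like "U+3400<TAB>kRSUnicode<TAB>1.4".
# Output: &#x3400;&#x3401;... on a single line, with no separators.
# Lines that do not start with a code point (comments, blanks) are dropped.
to_ncr() {
    sed -n 's/^U+\([0-9A-F]\{4,6\}\).*/\&#x\1;/p' | tr -d '\n'
}

# Usage (file name as mentioned earlier in the thread):
#   grep kRSUnicode Unihan_RadicalStrokeCounts.txt | to_ncr > ideographs.txt
```

Leaving out the final `tr -d '\n'` gives the one-value-per-line variant instead.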
|
thesun
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Tue Jan 24, 2012 9:30 pm |
|
Joined: Fri Apr 15, 2011 12:28 am Posts: 16
|
|
I found some sed and awk commands that should do the trick! :-) Thanks!
|
thesun
|
Post subject: Re: Getting Unihan chars in "&#xABCD;" format? Posted: Thu Feb 02, 2012 11:25 am |
|
Joined: Fri Apr 15, 2011 12:28 am Posts: 16
|
|
I've discovered that there's something "odd" (sorry, I don't know exactly what) about the CJK Extension B and C ranges that is causing failures when I try to include them.
I'm guessing, but am not sure, that this might be due to gaps within the range. Either that or I've incorrectly generated the values.
Am I correct that the Unihan.txt file does NOT include the Ext A, B, C, D ranges?
If so, is there a similar ".txt" file for each of the Ext ranges? Currently I've been generating values by inputting a start and end to the range and the script creates each and every value within the range. But this won't work if some of those generated chars match a char value that's not actually used. A and D seem to work, but B and C do not.
Lastly, is there a way to know what "kind" of Chinese/Han the B and C ranges are used for? Will I need those to support, for example, Mandarin versus Cantonese? Old Chinese versus Modern? Mainland versus Taiwan?
Again, thank you for any help or suggestions!!!!
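One way to test the "gaps" theory is to compare the generated values against the code points that actually appear in the Unihan files. This is a rough sketch, assuming two text files with one U+XXXX value per line. The file names in the usage comment and the function name `not_in_unihan` are placeholders.

```shell
# Assumption: $1 = file of generated code points, $2 = file of code points
# taken from Unihan, both one "U+XXXX" per line.
# Prints the generated values that are absent from the Unihan list.
not_in_unihan() {
    # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
    grep -Fxv -f "$2" "$1"
}

# Usage (placeholder file names):
#   not_in_unihan generated.txt unihan_codepoints.txt
```

Empty output would suggest that the generated range contains no unassigned values, so the failures come from somewhere else.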
|