The Unicode Consortium Discussion Forum


All times are UTC - 6 hours [ DST ]




 Post subject: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 11:07 am 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
I've been lucky to get some hugely helpful tips and help about getting regular (non-Unihan) characters in "&#x0abcd;" format...but now that I'm trying to go through the Unihan documents I'm finding it very opaque and so different from the UnicodeData.txt file that I'm not able to figure out what I need to do next. What I am looking for is the Unihan equivalent of UnicodeData.txt, a single list of all the currently active, currently mapped-to-an-actual-character values. No duplicates, no unused values that don't resolve to a character.

I have unzipped Unihan.zip and am not finding anything that's close to the UnicodeData.txt file, where each and every char mapping is set out sequentially in a long delimited list. So my question is whether such a list already exists, or how I might construct one (if it doesn't!) from the various Unihan files.

Hopefully this list already exists somewhere, but if not:

Can I safely generate values using a script in the ranges? Is each and every value in (for example) the CJK range of the following an actual ideograph?
3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;

Or are there unused or <control> values mixed in?

I don't need any of the readings...just the "&#x0abcd;" numbers of actual valid ideographs in all the various CJK and Extendeds (or anything not included in UnicodeData.txt).

Thank you for the help!


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 3:57 pm 
Unicode Guru

Joined: Tue Dec 01, 2009 2:49 pm
Posts: 189
You've asked essentially the same question in the How To Forum.
Here's a link to the reply I gave you there: viewtopic.php?p=625#p625
I believe that reply gives you the best approach for your task.

If for some other reason you simply had to work from the Unihan database, you would simply process the first line for each character code and ignore all other lines having the same code. A simple Perl script should be able to do that for you.

But even though that's not hard, it's unnecessarily complicated, so follow my link above.
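
The first-line-per-code approach described above can be sketched roughly as follows. This is only an illustration, written in Python rather than Perl, and it assumes the Unihan data files use tab-separated lines of the form `U+XXXX<TAB>property<TAB>value` with `#` marking comment lines. The function name `unique_code_points` and the sample values are hypothetical.

```python
def unique_code_points(lines):
    """Return each code point once, in the order it first appears."""
    seen = set()
    ordered = []
    for line in lines:
        line = line.strip()
        # Skip blank lines and '#' comment lines.
        if not line or line.startswith("#"):
            continue
        # The code point is the first tab-separated field.
        code = line.split("\t", 1)[0]
        if code not in seen:
            seen.add(code)
            ordered.append(code)
    return ordered


# Hypothetical sample lines; the values are placeholders, not real data.
sample = [
    "# comment line",
    "U+3400\tkCantonese\tplaceholder",
    "U+3400\tkMandarin\tplaceholder",
    "U+3401\tkCantonese\tplaceholder",
]
print(unique_code_points(sample))
```

Any iterable of lines works as input, so an open file object for one of the unzipped Unihan files could be passed in place of `sample`.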


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 6:44 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
I know it sounds like a similar question but I was not able to see a document similar to UnicodeData.txt in the Unihan.zip -- I see a bunch of files, none of which seem to be a unified list of ALL the Unihan chars. Thus, it's hard to know which documents to start from.

In UnicodeData.txt, I'm in good shape (thanks to your help!!!) but here I'm back at square one. I've found that the character charts are on Wikipedia too, but many of the more esoteric characters aren't rendering in my browser, making it hard to know whether they're mapped to actual characters or not.

Again, I appreciate any and all help. And I agree -- it won't be hard to strip the chars' additional info from UnicodeData.txt and leave just the chars, remove the LF/CRs, and have exactly the file I need. But for Unihan, it seems more of a challenge. :-(


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 7:10 pm 

Joined: Wed Feb 10, 2010 2:51 pm
Posts: 16
Location: Salt Lake City
A complete listing of the characters considered part of Unihan can be found in the proposed update to UAX #38 (http://www.unicode.org/reports/tr38/proposed.html), although that applies to the upcoming Unicode 6.1.0. You can also get it out of Unihan.zip, from the file Unihan_RadicalStrokeCounts.txt. You can do something really simple like

Code:
egrep kRSUnicode Unihan_RadicalStrokeCounts.txt | sed -e 's/^\(U.[0-9A-F]\{4,5\}\)*/\1/g'


and you'll get a complete list; every character in Unihan is guaranteed to have a kRSUnicode value.
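
For anyone who prefers not to use egrep and sed, the same idea (keep only the kRSUnicode lines and take the code point from each) might look like this in Python. It is a sketch under the assumption that each data line is `U+XXXX<TAB>property<TAB>value`; the function name `krs_code_points` and the sample values are hypothetical.

```python
def krs_code_points(lines):
    """Collect the code point from every line whose property is kRSUnicode."""
    codes = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # Expect: code point, property name, value.
        if len(fields) >= 3 and fields[1] == "kRSUnicode":
            codes.append(fields[0])
    return codes


# Hypothetical sample lines; the values are placeholders, not real data.
sample = [
    "# comment line\n",
    "U+3400\tkRSKangXi\tplaceholder\n",
    "U+3400\tkRSUnicode\tplaceholder\n",
    "U+3401\tkRSUnicode\tplaceholder\n",
]
print(krs_code_points(sample))
```

To run it on the real data, an open file object for Unihan_RadicalStrokeCounts.txt can be passed as `lines`.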


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 8:51 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
Thank you, John, as that was what was very unclear to me -- what file would contain all the Unihan chars and how to remove duplicates. That code you provided worked wonders, though each line of the output still has " kRSUnicode (number)" after the code point. Is there a way to trim it so that only the code point at the start of the line is left? (In PHP I'd use the "substr" function...but I don't know sed very well).

I'd like just the number:

U+3400;
U+3401;
U+3402;
(etc.)

Or even better, if the function is almost the same, would be without line breaks:

U+3400;U+3401;U+3402; etc.

And then I can search and replace on the "U+" to end up with "&#x03400;&#x03401;&#x03402;" etc. That's what I'm eventually aiming for.

I figure I can use a text editor to remove each line but gosh, there are so many that a command line method would be ideal.

Thank you again for the help. I'm almost there! :-)
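
The conversion described in this post (take lines such as `U+3400;`, drop the line breaks, and rewrite each as `&#x03400;`) could be sketched like this. It is only one possible approach, in Python rather than sed or awk, and the names `to_ncr` and `join_ncrs` are hypothetical. The five-digit zero padding follows the "&#x0abcd;" pattern used in this thread.

```python
def to_ncr(code_point, width=5):
    """Turn 'U+3400' into '&#x03400;' (hex, zero-padded to `width` digits)."""
    hex_part = code_point[2:] if code_point.startswith("U+") else code_point
    value = int(hex_part, 16)
    return f"&#x{value:0{width}x};"


def join_ncrs(lines):
    """Convert lines like 'U+3400;' into one string with no line breaks."""
    # Drop surrounding whitespace and the trailing ';' from each line.
    codes = [line.strip().rstrip(";") for line in lines if line.strip()]
    return "".join(to_ncr(code) for code in codes)


print(join_ncrs(["U+3400;", "U+3401;", "U+3402;"]))
# prints &#x03400;&#x03401;&#x03402;
```

Going through `int(..., 16)` instead of a plain search and replace on "U+" means five-digit code points such as U+20000 are not given an extra leading zero.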


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Tue Jan 24, 2012 9:30 pm 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
I found some sed and awk commands that should do the trick! :-) Thanks!


 Post subject: Re: Getting Unihan chars in "&#x0abcd;" format?
PostPosted: Thu Feb 02, 2012 11:25 am 

Joined: Fri Apr 15, 2011 12:28 am
Posts: 16
I've discovered that there's something "odd" (sorry, I don't know exactly what) about the CJK Extended B and C ranges that is causing failures when I try to include them.

I'm guessing, but am not sure, that this might be due to gaps within the range. Either that or I've incorrectly generated the values.

Am I correct that the Unihan.txt file does NOT include the Ext A, B, C, D ranges?

If so, is there a similar ".txt" file for each of the Ext ranges? Currently I've been generating values by inputting a start and end to the range and the script creates each and every value within the range. But this won't work if some of those generated chars match a char value that's not actually used. A and D seem to work, but B and C do not.
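
The range-generation step described above can be sketched as follows, reading the start and end of each range from the `<..., First>` and `<..., Last>` lines in UnicodeData.txt instead of typing them in by hand. In that file, a First/Last pair stands for a contiguous range in which every code point is assigned. The function name `expand_ranges` is hypothetical, and the sample lines are the Extension A pair quoted earlier in this thread.

```python
def expand_ranges(lines):
    """Yield every code point covered by <..., First>/<..., Last> pairs."""
    start = None
    for line in lines:
        fields = line.strip().split(";")
        if len(fields) < 2:
            continue  # blank or malformed line
        code, name = int(fields[0], 16), fields[1]
        if name.endswith(", First>"):
            start = code
        elif name.endswith(", Last>") and start is not None:
            # Both ends of the range are included.
            yield from range(start, code + 1)
            start = None


sample = [
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]
points = list(expand_ranges(sample))
print(len(points), hex(points[0]), hex(points[-1]))
# prints 6582 0x3400 0x4db5
```

Taking the endpoints from the file keeps the generated values in step with whichever Unicode version that copy of UnicodeData.txt describes.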

Lastly, is there a way to know what "kind" of Chinese/Han the B and C ranges are used for? Will I need those to support, for example, Mandarin versus Cantonese? Old Chinese versus Modern? Mainland versus Taiwan?

Again, thank you for any help or suggestions!!!!

