Re: CJK question

From: Allen Haaheim (haaheima@interchange.ubc.ca)
Date: Sun Mar 23 2003 - 17:20:24 EST

    > I tried what you suggested with Unipad, but for some reason it went to a
    > location on a PUA character map, rather than CJK Unified Ideographs
    > Extension B, where they are in fact located. I guess it is because Unipad
    > doesn't support Extension B yet, or else I am doing something wrong. But
    > thanks for directing me to the Unipad website, I'm sure it will be useful.

    In UTF-16, code points above U+FFFF are represented by pairs of code
    values in the Surrogates Area (U+D800-U+DFFF), not in the Private Use Area.
    Unipad should be able to show those Surrogates Area values.
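
    To make the surrogate arithmetic concrete, here is a minimal Python
    sketch that computes the UTF-16 surrogate pair for a code point above
    U+FFFF. The function name to_surrogate_pair is made up for illustration;
    it is not part of Unipad or any library.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Return the (high, low) UTF-16 surrogates for a supplementary code point.

    Hypothetical helper for illustration only.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF use surrogate pairs")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits -> low surrogate
    return high, low


# The two Extension B characters discussed in this thread.
for cp in (0x2835C, 0x283B9):
    high, low = to_surrogate_pair(cp)
    print(f"U+{cp:05X} -> {high:04X} {low:04X}")
# prints:
# U+2835C -> D860 DF5C
# U+283B9 -> D860 DFB9
```

    So a tool that showed U+2835C in UTF-16 terms would report values in the
    D800-DFFF range, not values such as E596 in the Private Use Area.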

    The code points of the blanks on the website are really in the PUA.

    I don't completely follow you here, but I can see the code points are as you
    say.
    However, I have (correct) hard copies of the text, and there is no doubt
    that the chars that should be there are U+2835C and U+283B9, in the Ext. B
    chart. The website is unlikely to have the wrong chars. But
    there's definitely something wrong somewhere--maybe it is their
    fonts. As with you, only two of their fonts seem to work for me.

    > Here is a sample line of text with the two graphs as blanks (on my
    > machine),
    > second and third from the left. They are No. 2835C and 283B9 respectively,
    > on p.152 of the Extension B pdf:
    >
    > 心而鮮歡。望天涯而佇念,擢雄劍而長歎。

    The second and third chars are U+E596 and U+E58E, both of which fall in
    the Private Use Area (U+E000-U+F8FF).
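
    For anyone without Unipad, the same check can be sketched in Python:
    list the code point of each character in a copied string and flag the
    private-use ones. The helper name dump_code_points is hypothetical, and
    the sample string is only a reconstruction that places the two reported
    PUA code points after the first character, as described above.

```python
import unicodedata
from typing import List, Tuple


def dump_code_points(text: str) -> List[Tuple[str, bool]]:
    """Return (U+XXXX label, is_private_use) for each character in text.

    Hypothetical helper; general category "Co" marks private-use characters.
    """
    return [
        (f"U+{ord(ch):04X}", unicodedata.category(ch) == "Co")
        for ch in text
    ]


# Assumed reconstruction of the start of the sample line: the first
# character followed by the two PUA code points reported in this thread.
sample = "心" + "\ue596\ue58e"
for label, private in dump_code_points(sample):
    print(label, "private use" if private else "standard")
```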

    > The page this text is from is
    > http://www.chant.org/scripts/frame.asp?t=b&id=000675 I don't think
    > you'll get into the site unless you or your university is a member.

    I can access the fonts at:
    http://www.chant.org/info/download_font.asp

    Only two of the fonts work on my Windows 98 SE machine:
    ICS3 and ICS4.

    If I copy the Chinese text to WordPad and change the font, the second and
    third chars become Chinese chars.
    But in ICS3 they look very different from ICS4.

    Neither is right. ICS3 and ICS4 are both for the Oracle Bone Script
    database. With our sample text, ICS3 displays OBS graphs (i.e. not standard
    Chinese), and ICS4 displays Chinese gibberish. Our text is from the Pre-Han
    & Han and Six Dynasties databases, which use their ICS1, ICS2, and ICS6
    fonts. As you say, it seems these fonts don't work. If I run a Windows
    search, the results show that they are all in the
    Windows/fonts folder. But when I look in the folder itself, or in the font
    box in Word 2000, they aren't there. I guess this has something to do with
    them not working.

    The following case might confirm it's a problem with the website's fonts:
    http://www.chant.org/scripts/zj/scripts/frame.asp?t=b&id=000869
    text (Shijing #57, last line):
    庶姜,庶士有朅!

    Unipad shows the code point of the third and fourth chars from the left (the
    same character) to be U+E053. But the character that belongs there is
    U+5B7D, as another website, http://210.69.170.100/s25/index.htm (Han Quan),
    shows in the same line of text: 庶姜孽孽,庶士有朅。And this character is not even in
    Ext. A or B, but in the regular Unicode CJK Unified Ideographs block.
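
    The block membership of the code points mentioned so far can be checked
    with a small lookup table. This is only a sketch: the table lists just
    the four blocks relevant to this thread, and the name block_of is made up
    for illustration.

```python
# (first, last, name) for the blocks relevant to this discussion only.
BLOCKS = [
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0xE000, 0xF8FF, "Private Use Area"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def block_of(code_point: int) -> str:
    """Return the block name for code_point, or "other" if not in the table.

    Hypothetical helper for illustration only.
    """
    for first, last, name in BLOCKS:
        if first <= code_point <= last:
            return name
    return "other"


for cp in (0xE053, 0x5B7D, 0xE596, 0x2835C):
    print(f"U+{cp:04X}: {block_of(cp)}")
# prints:
# U+E053: Private Use Area
# U+5B7D: CJK Unified Ideographs
# U+E596: Private Use Area
# U+2835C: CJK Unified Ideographs Extension B
```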

    There are other cases where both these sites do not display the character
    (that is, if the problem is not at my end) (Shijing #40, line 11):

    1) 室人交我。
    http://www.chant.org/scripts/zj/scripts/frame.asp?t=b&id=000869
    The fourth and fifth characters should be 徧 U+5FA7 and 讁 U+8B81, but Unipad
    shows they are U+E052 and U+E536.

    2) 室人交遍謫我。
    http://210.69.170.100/s25/index.htm (Han Quan)
    One would also expect Han Quan, like CHANT, to be rigorous and precise. Here
    there are substitutions for the two characters in question, followed by a
    blank that Unipad indicates is U+F6B1. Such substitutions should only be
    necessary when the actual characters are unavailable. What is behind the
    blank I'm not sure, but it may be a note explaining the substitutions. But
    again, all four of the characters can be found in the basic CJK Unified
    Ideographs block, not even Ext. A or B. I suppose the websites are not
    using Unicode charsets?
    Thanks again for your remarks and suggestions.--Allen


