Re: CJK Ideograph Fragments

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Fri May 14 2010 - 10:13:13 CDT

    That is true from my experience as well. Many "missing" components can be
    found among CJK Unified Ideographs - perhaps the lists were created with
    another encoding (e.g. GB2312 or Big5) which did not encode these
    characters. Other components I have found in Extensions A and B, as
    mentioned. It takes some investigation to conclude that a fragment is not
    encoded as an ideograph anywhere (just like it takes some investigation to
    make sure a suggested new ideograph isn't encoded already).
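
A first pass of that investigation can be mechanized by asking which CJK block, if any, a candidate component falls in. The following is a minimal Python sketch under stated assumptions: the function name `cjk_block` is hypothetical, and the ranges are the block boundaries as of Unicode 5.2. Block membership alone does not prove that a code point is assigned or that its glyph matches the fragment, so a visual check is still needed.

```python
from typing import Optional

# Block boundaries (inclusive) as of Unicode 5.2; later versions add more blocks.
CJK_BLOCKS = [
    (0x2E80, 0x2EFF, "CJK Radicals Supplement"),
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x31C0, 0x31EF, "CJK Strokes"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
    (0x2A700, 0x2B73F, "CJK Unified Ideographs Extension C"),
]


def cjk_block(ch: str) -> Optional[str]:
    """Return the name of the CJK block containing ch, or None if it is in none of them."""
    cp = ord(ch)
    for first, last, name in CJK_BLOCKS:
        if first <= cp <= last:
            return name
    return None


if __name__ == "__main__":
    # U+6F22, U+3400, U+20000, U+2E80, and a non-CJK character
    for ch in ["\u6f22", "\u3400", "\U00020000", "\u2e80", "A"]:
        print(f"U+{ord(ch):04X}", cjk_block(ch))
```

A component list built against GB2312 or Big5 could be run through such a check to separate components that are already in the main block or an extension from those that need closer inspection.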

    Regards,
    Uriah

    On Mon, May 10, 2010 at 5:07 PM, Mark Davis ☕ <mark@macchiato.com> wrote:

    > As I said, I would not rely too heavily on the accuracy of that data. Where
    > there are ?, or NCRs, or truncated IDS sequences, it looks like the missing
    > character can often be supplied by examining the character.
    >
    > Mark
    >
    >
    > — Il meglio è l’inimico del bene —
    >
    >
    > On Mon, May 10, 2010 at 05:43, Uriah Eisenstein <uriaheisenstein@gmail.com>
    > wrote:
    >
    >> Hello Mr. Davis and thanks for the lists,
    >> I've found several different sources for character compositions (though
    >> none of them seem to include Extension characters except for the generated
    >> files you have posted!). While they all have missing information and
    >> occasional mistakes, it is quite easy to find unencoded fragments in them;
    >> these are usually marked with ? or something similar. I've been making a
    >> list of fragments while working with cjklib, mentioned by Christoph; some
    >> I've found later in Extension A or B, while for others I have character
    >> examples that could be used for an initial proposal. I don't expect the set
    >> of necessary components to be complete anytime soon, or at all, any more
    >> than the entire set of Ideographs :)
    >>
    >> Regards,
    >> Uriah
    >>
    >>
    >> On Sun, May 9, 2010 at 2:11 AM, Mark Davis ☕ <mark@macchiato.com> wrote:
    >>
    >>> FYI, I have a table of radicals at
    >>> https://spreadsheets.google.com/pub?key=0AqRLrRqNEKv-dHlVMzY0RFZ3MTFLZ0RldS1RNXN4Z3c&hl=en&output=html
    >>> mapping them to Unified ideographs. Not yet complete (the X values are
    >>> tentative, and I don't know if there are values for the ones marked
    >>> "#VALUE!").
    >>>
    >>> I had also tried taking a look at the data at
    >>> http://cvs.m17n.org/viewcvs/chise/ids/?sortdir=down&pathrev=kawabata#dirlist
    >>> which Richard and John said was the best publicly available IDS
    >>> data (although it has a GPL license, which prevents many people from using
    >>> it). While clearly a lot of work went into it, it is very flawed.
    >>>
    >>> - There are over 400 ill-formed IDS sequences.
    >>> - There are 666 (coincidence?) characters that map to themselves
    >>> (where you'd only expect that of "base" radicals).
    >>> - About 5K characters are missing data.
    >>> - There appears to be free variation between using CJK radicals and
    >>> using the corresponding Unified CJK characters.
    >>> - It uses many NCR components with cryptic IDs, instead of radicals
    >>> or Unified CJK.
    >>> - A cursory look shows a significant proportion of clear mistakes in
    >>> the data (characters stacked vertically in the wrong order, for example).
    >>> - Many characters cannot be recursively decomposed down to radicals.
    >>>
    >>> So I'm not sure how much use the available IDS data would be in terms of
    >>> looking at necessary components.
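
Two of the checks listed above, ill-formed sequences and characters that map to themselves, can be sketched in a few lines of Python. This is a minimal illustration, not the tool used to produce the counts in the message: the function names are hypothetical, only the twelve Ideographic Description Characters U+2FF0..U+2FFB are treated as operators, any other character is accepted as a component, and no length limit is enforced.

```python
# Operators that take two components: U+2FF0, U+2FF1, U+2FF4..U+2FFB
BINARY = {chr(c) for c in (0x2FF0, 0x2FF1, *range(0x2FF4, 0x2FFC))}
# Operators that take three components: U+2FF2, U+2FF3
TERNARY = {chr(0x2FF2), chr(0x2FF3)}


def is_well_formed(ids: str) -> bool:
    """Check the prefix-notation structure of an IDS string.

    `needed` counts components still expected. Every character fills one
    slot; an operator then opens as many new slots as it has operands.
    """
    needed = 1
    for ch in ids:
        if needed == 0:
            return False          # extra characters after a complete sequence
        if ch in TERNARY:
            needed += 2           # fills 1 slot, opens 3
        elif ch in BINARY:
            needed += 1           # fills 1 slot, opens 2
        else:
            needed -= 1           # plain component fills 1 slot
    return needed == 0            # False for empty or truncated input


def maps_to_itself(char: str, ids: str) -> bool:
    """True when a character's decomposition is just the character itself."""
    return ids == char


if __name__ == "__main__":
    samples = {
        "\u6c49": "\u2ff0\u6c35\u53c8",   # U+6C49 as left-right of U+6C35 and U+53C8
        "\u6f22": "\u2ff0\u6c35",         # truncated: second operand missing
        "\u4e00": "\u4e00",               # maps to itself
    }
    for char, ids in samples.items():
        print(char, ids, is_well_formed(ids), maps_to_itself(char, ids))
```

A single character counts as well formed here, so self-mapping entries have to be flagged separately, as the second function does.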
    >>>
    >>> Mark
    >>>
    >>> FYI, I posted some generated files at http://macchiato.com/ids/, in case
    >>> anyone is curious as to details.
    >>>
    >>> — Il meglio è l’inimico del bene —
    >>>
    >>>
    >>> On Sat, May 8, 2010 at 13:40, Asmus Freytag <asmusf@ix.netcom.com> wrote:
    >>>
    >>>> On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
    >>>>
    >>>>> Well,
    >>>>> I've gone through the policies of submitting new characters and scripts
    >>>>> and they don't look encouraging :) But neither do they seem to reject the
    >>>>> idea of character fragments out of hand, as opposed to the reverse case -
    >>>>> characters which can be expressed using existing characters and combining
    >>>>> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
    >>>>> contain character fragments, in a way. But I suppose these already existed
    >>>>> in national standards rather than introduced by Unicode.
    >>>>>
    >>>>> In any case, examples I've seen of proposals cite experts and provide
    >>>>> font makers, neither of whom I have contact with. So I guess I'll drop it
    >>>>> for now, and hope that if someone takes it up I'll see it on the mailing
    >>>>> list.
    >>>>>
    >>>> While a font is ultimately required for a proposal to become adopted, it
    >>>> shouldn't be a bar to formally raising the issue for initial consideration.
    >>>> Once something is considered potentially acceptable, there's enough time to
    >>>> come up with fonts (for the purpose of printing charts) before the
    >>>> committees need to vote on final approval. Proposals can take years from
    >>>> initial consideration to publication....
    >>>>
    >>>> Your suggestion was that these fragments need to be enumerated for
    >>>> various purposes in software and that having a standard enumeration is
    >>>> beneficial. If you can document and support that assertion, I would
    >>>> encourage you to put it on record.
    >>>>
    >>>> Doing so would allow a discussion of whether a standard enumeration is
    >>>> indeed useful enough to incur the cost of standardization.
    >>>>
    >>>> In some ways, this would not be a run-of-the-mill character encoding
    >>>> proposal, because you are not asserting that these fragments need encoding
    >>>> for the purpose of directly expressing text. While that is the primary
    >>>> purpose of character encoding, there are purposes that are ancillary to
    >>>> this, that a universal character encoding such as Unicode must encompass.
    >>>>
    >>>> There is certainly some precedent for character codes that aren't
    >>>> limited to the primary purpose I mentioned, but, because they don't
    >>>> represent a standard situation, one needs to carefully argue why such uses
    >>>> need to be covered by standardization and if so, why doing that as character
    >>>> codes is appropriate.
    >>>>
    >>>> That is different from the more usual task of documenting that an entity
    >>>> occurs in written or printed documents.
    >>>>
    >>>> The problem is, unless you actually put down all the details in a
    >>>> coherent proposal it's hard to judge correctly what the situation is. When
    >>>> you raise the question informally, all anyone can tell you is that an
    >>>> exceptional request is one that needs exceptional justification, which,
    >>>> while certainly correct, doesn't exactly help you or anyone to evaluate
    >>>> whether your proposal would meet the required level and type of
    >>>> justification.
    >>>>
    >>>> A./
    >>>>
    >>>>>
    >>>>> Thanks,
    >>>>> Uriah
    >>>>>
    >>>>>
    >>>>> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein
    >>>>> <uriaheisenstein@gmail.com> wrote:
    >>>>>
    >>>>> Not exactly, but I suppose such Hanzi fragments could be used for
    >>>>> similar purposes - e.g. looking up characters by components, where
    >>>>> the available components may include non-character fragments. Some
    >>>>> fragments may be useful for IME purposes, but probably not all.
    >>>>>
    >>>>>
    >>>>> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin <echerlin@gmail.com>
    >>>>> wrote:
    >>>>>
    >>>>> 2010/4/28 John H. Jenkins <jenkins@apple.com>:
    >>>>>
    >>>>> > No. You could certainly write up a proposal and submit it to the UTC.
    >>>>> > Should the UTC feel the idea has merit, it would then move it on to WG2
    >>>>> > and/or the IRG.
    >>>>> > The main problem here is that there is a very strong desire to limit
    >>>>> > ideograph encoding to attested and documentable forms. Anything which does
    >>>>> > not exist in actual texts is not likely to be well-regarded.
    >>>>>
    >>>>> I had the idea some years ago of writing up a proposal to encode the
    >>>>> hanzi fragments used in Cangjie Shurufa IMEs. These fragments are used
    >>>>> extensively in dozens of howto books on keyboarding in Cangjie. This
    >>>>> includes the pieces (mostly real characters, with some radicals) used
    >>>>> on keyboard labels, and the common forms they stand for. I didn't get
    >>>>> any interest from the Cangjie development community or the authors of
    >>>>> a book on Cangjie that I have, so I abandoned the idea.
    >>>>>
    >>>>> Uriah, is this the sort of thing you have in mind?
    >>>>>
    >>>>> > Similarly, the
    >>>>> > UTC has a strong preference not to encode anything which isn't in actual
    >>>>> > use. Proposals to encode characters because they would be useful if encoded
    >>>>> > even though they aren't actually being used right now are generally looked
    >>>>> > on with disfavor.
    >>>>> >
    >>>>> > On Apr 28, 2010, at 12:03 PM, Uriah Eisenstein wrote:
    >>>>> >
    >>>>> > Hello,
    >>>>> > My question is about common components of CJK Ideographs which are not
    >>>>> > encoded as independent Han characters (and perhaps indeed aren't). A good
    >>>>> > example is the right-hand part of the character 漢 itself: it is a distinct
    >>>>> > component appearing in multiple other characters, but is not encoded to the
    >>>>> > best of my knowledge. The same goes for the top part of 鳥 and 島, the
    >>>>> > surrounding part of 與 and 興 and several others. My question is whether there
    >>>>> > are any plans or discussions for encoding these fragments in Unicode.
    >>>>> >
    >>>>> > (I haven't found anything about this in mailing list archives; I did find
    >>>>> > statements that Unicode does not intend to provide any decomposition data of
    >>>>> > Han characters :) And for good reasons. However, such fragments may well be
    >>>>> > useful for third-party software dealing with 漢字 glyph generation, lookup by
    >>>>> > components etc.)
    >>>>> >
    >>>>> > Thanks,
    >>>>> > Uriah Eisenstein
    >>>>> >
    >>>>> >
    >>>>>
    >>>>>
    >>>>>
    >>>>> --
    >>>>> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
    >>>>> Silent Thunder is my name, and Children are my nation.
    >>>>> The Cosmos is my dwelling place, the Truth my destination.
    >>>>> http://www.earthtreasury.org/
    >>>>>
    >>>>>
    >>>>>
    >>>>>
    >>>>
    >>>>
    >>>
    >>
    >



    This archive was generated by hypermail 2.1.5 : Fri May 14 2010 - 10:15:57 CDT