Re: CJK Ideograph Fragments

From: Uriah Eisenstein (uriaheisenstein@gmail.com)
Date: Fri May 14 2010 - 10:13:13 CDT

Next message: Uriah Eisenstein: "Re: CJK Ideograph Fragments"

Previous message: Andreas Prilop: "Re: Language info for U+1E1C and U+1E1D"
In reply to: Mark Davis ☕: "Re: CJK Ideograph Fragments"
Next in thread: John H. Jenkins: "Re: CJK Ideograph Fragments"

That is true in my experience as well. Many "missing" components can be
found among CJK Unified Ideographs - perhaps the lists were created with
another encoding (e.g. GB2312 or Big5) which did not encode these
characters. Other components I have found in Extensions A and B, as
mentioned. It takes some investigation to conclude that a fragment is not
encoded as an ideograph anywhere (just like it takes some investigation to
make sure a suggested new ideograph isn't encoded already).
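
As a rough illustration of the first step of that investigation, here is a
minimal Python 3 sketch that reports which block a candidate component falls
in. The block ranges come from the Unicode code charts. The table is
deliberately partial: it covers only the blocks discussed in this thread, plus
Kangxi Radicals, which is an addition here. The names BLOCKS, block_of and
report are hypothetical, made up for the example. A result of None only means
"not in any listed block", so it is a cue to look further, not proof that a
fragment is unencoded.

```python
# Partial table of Unicode blocks relevant to Han components.
# Each entry is (first code point, last code point, block name), inclusive.
# Assumption: only the blocks named in this thread, plus Kangxi Radicals.
BLOCKS = [
    (0x2E80, 0x2EFF, "CJK Radicals Supplement"),
    (0x2F00, 0x2FDF, "Kangxi Radicals"),
    (0x3400, 0x4DBF, "CJK Unified Ideographs Extension A"),
    (0x4E00, 0x9FFF, "CJK Unified Ideographs"),
    (0x20000, 0x2A6DF, "CJK Unified Ideographs Extension B"),
]


def block_of(ch):
    """Return the name of the listed block containing ch, or None."""
    cp = ord(ch)
    for first, last, name in BLOCKS:
        if first <= cp <= last:
            return name
    return None


def report(components):
    """Map each candidate component (one character each) to its block name."""
    return {ch: block_of(ch) for ch in components}


if __name__ == "__main__":
    # U+3400 and U+20000 are the first code points of Extensions A and B.
    for ch, name in report("漢\u3400\U00020000?").items():
        print("U+%04X" % ord(ch), name)
```

Block membership is not the same as being assigned: a code point can sit
inside a block and still be unallocated, so a hit from block_of still needs
checking against the code charts.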

Regards,
Uriah

On Mon, May 10, 2010 at 5:07 PM, Mark Davis ☕ <mark@macchiato.com> wrote:

> As I said, I would not rely too heavily on the accuracy of that data. Where
> there are ?, or NCRs, or truncated IDS sequences, it looks like the missing
> character can often be supplied by examining the character.
>
> Mark
>
>
> — Il meglio è l’inimico del bene —
>
>
> On Mon, May 10, 2010 at 05:43, Uriah Eisenstein <uriaheisenstein@gmail.com> wrote:
>
>> Hello Mr. Davis, and thanks for the lists.
>> I've found several different sources for character compositions (though
>> none of them seems to include Extension characters, except for the generated
>> files you have posted!). While they all have missing information and
>> occasional mistakes, it is quite easy to find unencoded fragments in them;
>> these are usually marked with ? or something similar. I've been making a
>> list of fragments while working with cjklib, mentioned by Christoph; some
>> I've later found in Extension A or B, while for others I have character
>> examples, and these could be used for an initial proposal. I don't expect the
>> set of necessary components to be complete anytime soon, or at all, any more
>> than the entire set of Ideographs :)
>>
>> Regards,
>> Uriah
>>
>>
>> On Sun, May 9, 2010 at 2:11 AM, Mark Davis ☕ <mark@macchiato.com> wrote:
>>
>>> FYI, I have a table of radicals at
>>> https://spreadsheets.google.com/pub?key=0AqRLrRqNEKv-dHlVMzY0RFZ3MTFLZ0RldS1RNXN4Z3c&hl=en&output=html
>>> mapping them to Unified ideographs. Not yet complete (the X values are
>>> tentative, and I don't know if there are values for the ones marked
>>> "#VALUE!").
>>>
>>> I had also tried taking a look at the data at
>>> http://cvs.m17n.org/viewcvs/chise/ids/?sortdir=down&pathrev=kawabata#dirlist
>>> (IDS), which Richard and John said was the best publicly available IDS
>>> data (although it has a GPL licence, which prevents many people from using
>>> it). While clearly a lot of work went into it, it is very flawed.
>>>
>>> - There are over 400 ill-formed IDS sequences.
>>> - There are 666 (coincidence?) characters that map to themselves
>>>   (where you'd only expect that of "base" radicals).
>>> - About 5K characters are missing data.
>>> - There appears to be free variation between using CJK radicals and
>>>   using the corresponding Unified CJK characters.
>>> - It uses many NCR components with cryptic IDs, instead of radicals
>>>   or Unified CJK.
>>> - A cursory look shows a significant proportion of clear mistakes in
>>>   the data (characters stacked vertically in the wrong order, for example).
>>> - Many characters cannot be recursively decomposed down to radicals.
>>>
>>> So I'm not sure how much use the available IDS data would be in terms of
>>> looking at necessary components.
>>>
>>> Mark
>>>
>>> FYI, I posted some generated files at http://macchiato.com/ids/, in case
>>> anyone is curious as to details.
>>>
>>> — Il meglio è l’inimico del bene —
>>>
>>>
>>> On Sat, May 8, 2010 at 13:40, Asmus Freytag <asmusf@ix.netcom.com> wrote:
>>>
>>>> On 5/8/2010 11:44 AM, Uriah Eisenstein wrote:
>>>>
>>>>> Well,
>>>>> I've gone through the policies of submitting new characters and scripts
>>>>> and they don't look encouraging :) But neither do they seem to reject the
>>>>> idea of character fragments out of hand, as opposed to the reverse case -
>>>>> characters which can be expressed using existing characters and combining
>>>>> marks. In fact, the CJK Radicals Supplement block and the Hangul Jamo both
>>>>> contain character fragments, in a way. But I suppose these already existed
>>>>> in national standards rather than introduced by Unicode.
>>>>>
>>>>> In any case, examples I've seen of proposals cite experts and provide
>>>>> font makers, neither of whom I have contact with. So I guess I'll drop it
>>>>> for now, and hope that if someone takes it up I'll see it on the mailing
>>>>> list.
>>>>>
>>>> While a font is ultimately required for a proposal to become adopted, it
>>>> shouldn't be a bar to formally raising the issue for initial consideration.
>>>> Once something is considered potentially acceptable, there's enough time to
>>>> come up with fonts (for the purpose of printing charts) before the
>>>> committees need to vote on final approval. Proposals can take years from
>>>> initial consideration to publication....
>>>>
>>>> Your suggestion was that these fragments need to be enumerated for
>>>> various purposes in software and that having a standard enumeration is
>>>> beneficial. If you can document and support that assertion, I would
>>>> encourage you to put it on record.
>>>>
>>>> Doing so would allow a discussion of whether a standard enumeration is
>>>> indeed useful enough to incur the cost of standardization.
>>>>
>>>> In some ways, this would not be a run-of-the-mill character encoding
>>>> proposal, because you are not asserting that these fragments need encoding
>>>> for the purpose of directly expressing text. While that is the primary
>>>> purpose of character encoding, there are purposes that are ancillary to
>>>> this, that a universal character encoding such as Unicode must encompass.
>>>>
>>>> There is certainly some precedent for character codes that aren't
>>>> limited to the primary purpose I mentioned, but, because they don't
>>>> represent a standard situation, one needs to carefully argue why such uses
>>>> need to be covered by standardization and if so, why doing that as character
>>>> codes is appropriate.
>>>>
>>>> That is different from the more usual task of documenting that an entity
>>>> occurs in written or printed documents.
>>>>
>>>> The problem is, unless you actually put down all the details in a
>>>> coherent proposal it's hard to judge correctly what the situation is. When
>>>> you raise the question informally, all anyone can tell you is that an
>>>> exceptional request is one that needs exceptional justification, which,
>>>> while certainly correct, doesn't exactly help you or anyone to evaluate
>>>> whether your proposal would meet the required level and type of
>>>> justification.
>>>>
>>>> A./
>>>>
>>>>>
>>>>> Thanks,
>>>>> Uriah
>>>>>
>>>>>
>>>>> On Sun, May 2, 2010 at 3:06 PM, Uriah Eisenstein <uriaheisenstein@gmail.com> wrote:
>>>>>
>>>>> Not exactly, but I suppose such Hanzi fragments could be used for
>>>>> similar purposes - e.g. looking up characters by components, where
>>>>> the available components may include non-character fragments. Some
>>>>> fragments may be useful for IME purposes, but probably not all.
>>>>>
>>>>>
>>>>> On Sat, May 1, 2010 at 8:57 PM, Edward Cherlin <echerlin@gmail.com> wrote:
>>>>>
>>>>> 2010/4/28 John H. Jenkins <jenkins@apple.com>:
>>>>>
>>>>> > No. You could certainly write up a proposal and submit it to the UTC.
>>>>> > Should the UTC feel the idea has merit, it would then move it on to WG2
>>>>> > and/or the IRG.
>>>>> > The main problem here is that there is a very strong desire to limit
>>>>> > ideograph encoding to attested and documentable forms. Anything which does
>>>>> > not exist in actual texts is not likely to be well-regarded.
>>>>>
>>>>> I had the idea some years ago of writing up a proposal to encode the
>>>>> hanzi fragments used in Cangjie Shurufa IMEs. These fragments are used
>>>>> extensively in dozens of howto books on keyboarding in Cangjie. This
>>>>> includes the pieces (mostly real characters, with some radicals) used
>>>>> on keyboard labels, and the common forms they stand for. I didn't get
>>>>> any interest from the Cangjie development community or the authors of
>>>>> a book on Cangjie that I have, so I abandoned the idea.
>>>>>
>>>>> Uriah, is this the sort of thing you have in mind?
>>>>>
>>>>> > Similarly, the
>>>>> > UTC has a strong preference not to encode anything which isn't in actual
>>>>> > use. Proposals to encode characters because they would be useful if encoded
>>>>> > even though they aren't actually being used right now are generally looked
>>>>> > on with disfavor.
>>>>> >
>>>>> > On Apr 28, 2010, at 12:03 PM, Uriah Eisenstein wrote:
>>>>> >
>>>>> > Hello,
>>>>> > My question is about common components of CJK Ideographs which are not
>>>>> > encoded as independent Han characters (and perhaps indeed aren't). A good
>>>>> > example is the right-hand part of the character 漢 itself: it is a distinct
>>>>> > component appearing in multiple other characters, but is not encoded to the
>>>>> > best of my knowledge. The same goes for the top part of 鳥 and 島, the
>>>>> > surrounding part of 與 and 興 and several others. My question is whether there
>>>>> > are any plans or discussions for encoding these fragments in Unicode.
>>>>> >
>>>>> > (I haven't found anything about this in mailing list archives; I did find
>>>>> > statements that Unicode does not intend to provide any decomposition data of
>>>>> > Han characters :) And for good reasons. However, such fragments may well be
>>>>> > useful for third-party software dealing with 漢字 glyph generation, lookup by
>>>>> > components etc.)
>>>>> >
>>>>> > Thanks,
>>>>> > Uriah Eisenstein
>>>>> >
>>>>> >
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Edward Mokurai (默雷/धर्ममेघशब्दगर्ज/ دھرممیگھشبدگر ج) Cherlin
>>>>> Silent Thunder is my name, and Children are my nation.
>>>>> The Cosmos is my dwelling place, the Truth my destination.
>>>>> http://www.earthtreasury.org/
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>

This archive was generated by hypermail 2.1.5 : Fri May 14 2010 - 10:15:57 CDT