> I hereby eat crow... Ken's explanation is 100% convincing.
> 
> The main lesson I have learned here is that thinking about industry
> standards requires much more attention to detail than is customary for
> an average application programmer like me.
> 
> Among other things, I hadn't even imagined that IDS could have been taken
> from a pre-existing standard, and that therefore compatibility issues had
> to be considered.
> 
> I also had not properly considered (even though I had been warned) the
> possibility of such a huge number of variants. But, after Ken's mail, I
> tried listing all the possible IDSs for a few very common components,
> multiplied the number of variants for the ideographs that contain these
> components, and obtained a number more appropriate for astronomy than
> for writing systems... and this considering just a very, very small
> subset of the whole...
> 
> OK: I hope that other people had thoughts similar to mine, so that this
> short discussion could have at least some didactic value.
> 
> Regards.
> 	Marco Cimarosti
> 
> P.S. I still think that a "Unified Ideographs to IDS dictionary" could be
> a useful thing, and a starting point for some interesting *application*
> development, but I realize that it would by no means be a useful part of
> Unicode itself.
> 
> 	-----Original Message-----
> 	From:	kenw@sybase.com [SMTP:kenw@sybase.com]
> 	Sent:	1999 September 10, Friday 00.39
> 	To:	Unicode List
> 	Cc:	unicode@unicode.org; kenw@sybase.com
> 	Subject:	RE: Ideographic Description
> 
> 	Marco Cimarosti continued this discussion:
> 
> 	> I cannot help thinking that IDS could be useful not only to provide
> 	> a human-readable "description" of rare ideographs, but also as a
> 	> *machine-readable* alternative spelling (or "decomposition") of any
> 	> CJK character, including the ones that are already coded in the
> 	> Unified Ideographs section.
> 	> 
> 	> I have an impression that, at an early stage of design, this
> 	> orientation could have been the original idea behind the proposal
> 	> of IDS.
> 
> 	Some might have thought that, but there are numerous Han character
> 	decomposition schemes that have been proposed, most of which are
> 	better designed for decomposition than the set of 12 IDC's currently
> 	in 10646/Unicode. Prof. Hsieh in Taiwan has a particularly
> 	well-researched and sound proposal, for example, that makes use of
> 	many fewer operators.
> 
> 	> 
> 	> One of John Jenkins' statements contributes (willingly or not :-)
> 	> to this impression:
> 	> 
> 	> "If there had been any requirement that Unicode conformance would
> 	> imply parsing and dealing with IDSs, they never would have made it
> 	> into the standard."
> 	> 
> 	> As I read these words, I can nearly hear an echo of the clashes at
> 	> some committee meeting!
> 
> 	There were in fact such discussions at the WG2 meeting. Basically
> 	they were the background for clarifying what the proposed IDC
> 	characters (from GBK) were, so that no one would be confused about
> 	their intended usage once they became part of the 10646/Unicode
> 	standards. The UTC, in particular, was adamantly opposed to having
> 	these characters brought in for compatibility with GBK and *then*
> 	having them turned to use for decomposition rather than just
> 	ideograph description. So the direction you are trying to head with
> 	this is directly contrary to the unanimously expressed intent of the
> 	UTC in assenting to the encoding of the IDC's.
> 
> 	> 
> 	> But there are several other aspects that make me think that IDS was
> 	> designed primarily for rendering.
> 	> 
> 	> First of all, why a prefix notation? I think that many people would
> 	> agree that expressions using infix operators are much more
> 	> human-readable.
> 
> 	You answer your own question below. Recursive infix operators are
> 	inherently ambiguous as to the scope of their operands, requiring
> 	bracketing conventions for clarification:
> 
> 	> 
> 	> For our human brain, "infix" expressions (like everyday arithmetic)
> 	> are quite intuitive:
> 	> 	3 + 2 - 1
> 	> 	NOT this AND that
> 	> 	eye beside dog over dog beside dog
> 
> 	This could be:
> 
> 	eye beside (dog over (dog beside dog))
> 
> 	or
> 
> 	(eye beside dog) over (dog beside dog)
> 
> 	or even
> 
> 	((eye beside dog) over dog) beside dog
> 
> 	It would be a bad design to use infix operators for plain text
> 	description of ideographs, because of this ambiguity.
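
The ambiguity can be made concrete by enumerating every way to bracket the infix chain. The following is a minimal sketch, not part of the original exchange; the function name `bracketings` is made up for illustration, and both operators are assumed to be binary with no precedence rule. It produces the three readings listed above plus two more.

```python
def bracketings(operands, operators):
    """Yield every fully parenthesized reading of an infix chain.

    operands has one more element than operators; operators[i] sits
    between operands[i] and operands[i + 1].
    """
    if len(operands) == 1:
        yield operands[0]
        return
    # Pick which operator is applied last (the root of the tree),
    # then bracket the left and right sides independently.
    for i, op in enumerate(operators):
        for left in bracketings(operands[: i + 1], operators[:i]):
            for right in bracketings(operands[i + 1 :], operators[i + 1 :]):
                yield f"({left} {op} {right})"


readings = list(bracketings(
    ["eye", "dog", "dog", "dog"],
    ["beside", "over", "beside"],
))
for reading in readings:
    print(reading)
```

With three operators there are five readings in all, and the count grows quickly with every operator added, which is why an infix scheme would need explicit brackets or a precedence convention.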
> 
> 	> 
> 	> and even more intuitive than "postfix" expressions (as used in
> 	> PostScript or some old calculators):
> 	> 	3 2 + 1 -
> 	> 	this NOT that AND
> 	> 	eye dog dog dog beside over beside
> 
> 	The problem with a postfix expression for plain text is the
> 	backtracking issue. Most text processes try to operate in the
> 	logically forward direction, while limiting backtracking as much as
> 	possible because of its implications for efficiency. Postfix
> 	expressions work when you have stack-oriented processing (as in
> 	PostScript), but in plain text, as you can see in your example, this
> 	just results in garden-path processing:
> 
> 	eye
> 	eye dog
> 	eye dog dog
> 	eye dog dog dog
> 	eye dog dog dog beside <== backup two & rebracket
> 	eye dog (dog dog beside)
> 	eye dog (dog dog beside) over <== backup four & rebracket
> 	eye (dog (dog dog beside) over)
> 	eye (dog (dog dog beside) over) beside <== backup six & rebracket
> 	(eye (dog (dog dog beside) over) beside)
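
The trace above is what a stack machine does with the same input. Here is a small sketch under the assumption that "beside" and "over" are both binary; `parse_postfix` and `OPERATORS` are illustrative names, not anything defined by the standard. It records the stack depth after each token to show how much has to be held pending before any structure is known.

```python
OPERATORS = {"beside", "over"}  # both assumed binary for this sketch


def parse_postfix(tokens):
    """Reduce a postfix token list with an explicit stack.

    Returns the bracketed result and the stack depth after each token.
    """
    stack, depths = [], []
    for tok in tokens:
        if tok in OPERATORS:
            # An operator groups the two most recent items on the stack.
            right = stack.pop()
            left = stack.pop()
            stack.append(f"({left} {right} {tok})")
        else:
            stack.append(tok)
        depths.append(len(stack))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0], depths


result, depths = parse_postfix("eye dog dog dog beside over beside".split())
print(result)
print(depths)
```

All four operands pile up before the first operator arrives, and the role of "eye" is settled only by the very last token. A process that sees the text without such a stack has to back up and rebracket, as in the trace.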
> 
> 	> 
> 	> On the other hand, postfix and prefix notations are much more easily
> 	> parsed by a computer (they are very performant syntaxes, especially
> 	> the prefix notation, and by no means an "enormous overhead").
> 
> 	The prefix notation is the logical choice for plain text
> 	description, all things considered.
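
To illustrate why prefix order suits forward-only processing, here is a minimal recursive-descent sketch. It assumes the 12 IDCs sit at U+2FF0..U+2FFB, with U+2FF2 and U+2FF3 taking three operands and the others two; the function name `parse_ids` is made up for this example, and the sketch ignores any length limits the standard may impose on a sequence.

```python
# Arity of the 12 IDCs U+2FF0..U+2FFB: U+2FF2 and U+2FF3 take three
# operands, the rest take two.
ARITY = {chr(cp): 2 for cp in range(0x2FF0, 0x2FFC)}
ARITY["\u2FF2"] = ARITY["\u2FF3"] = 3


def parse_ids(text, pos=0):
    """Parse one description starting at text[pos]; return (tree, next_pos)."""
    if pos >= len(text):
        raise ValueError("unexpected end of sequence")
    ch = text[pos]
    pos += 1
    if ch not in ARITY:
        return ch, pos  # a plain component, no operands to read
    operands = []
    for _ in range(ARITY[ch]):
        node, pos = parse_ids(text, pos)
        operands.append(node)
    return (ch, *operands), pos


# "eye beside (dog over (dog beside dog))", with U+2FF0 as "beside" and
# U+2FF1 as "over"; 目 and 犬 stand in for "eye" and "dog".
tree, end = parse_ids("\u2FF0目\u2FF1犬\u2FF0犬犬")
print(tree, end)
```

Each operator announces how many operands follow before they arrive, so the parser reads strictly left to right, never rebrackets, and needs no brackets to disambiguate.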
> 
> 	The "enormous overhead" that John Jenkins is talking about is not the
> 	processing time required to deal with the trivial prefix BNF syntax
> 	for IDS's, but the huge equivalence tables that would have to be
> 	carried around to try to interpret all the equivalence relations for
> 	all the possible alternative representations of the "same" character,
> 	once you start allowing these descriptions to behave as
> 	decompositions. And as John also pointed out, the devil is in the
> 	details. Once you start decomposing the characters, you get into a
> 	tremendous mess dealing with variants that may or may not be the
> 	"same" thing. The combinatorics and heuristics quickly start to
> 	spiral out of control.
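
A rough way to see the combinatorics: if each component admits several alternative descriptions, and those descriptions are built from components that have alternatives of their own, the counts multiply at every level. The table below is entirely hypothetical (the names "A", "B", and so on are placeholders, not real components), and `count_spellings` is an illustrative helper, not an established algorithm.

```python
from functools import lru_cache
from math import prod

# Hypothetical table: each name maps to its alternative descriptions,
# and each description is a tuple of sub-components.
ALTERNATIVES = {
    "A": [("B", "C"), ("D", "E")],
    "B": [("x", "y"), ("x", "z"), ("w",)],
    "C": [("p", "q"), ("r",)],
    "D": [("x", "p")],
    "E": [("y", "q"), ("z", "r")],
}


@lru_cache(maxsize=None)
def count_spellings(name):
    """Count the distinct ways to write `name`, undecomposed form included."""
    total = 1  # the component left as a single unit
    for parts in ALTERNATIVES.get(name, []):
        # Every spelling of one part combines with every spelling of the rest.
        total += prod(count_spellings(p) for p in parts)
    return total


print(count_spellings("A"))
```

Even this toy table, with five entries and at most three alternatives each, gives 19 spellings for "A". An equivalence table would need to relate all of them to each other, for every character in the repertoire.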
> 
> 	There are good reasons why no Han ideographic decomposition system
> 	has ever had any success as an *encoding* for text. Such systems
> 	make sense only as adjuncts to the character-oriented encoding,
> 	particularly for keyboards and input methods.
> 
> 	> 
> 	> So, why has a hard-to-understand-but-easy-to-parse representation
> 	> been preferred to a hard-to-parse-but-very-intuitive-to-read
> 	> representation?
> 
> 	Explained above.
> 
> 	> 
> 	> Another fact that makes me think that IDS is more suited for
> 	> rendering than for enjoying is the great level of graphical detail
> 	> provided by the IDCs.
> 	> 
> 	> Consider this example: provided that humans are intelligent, and
> 	> that Far-Eastern humans can read their own languages, what is the
> 	> need for having 7 different IDCs for basically the same "surround"
> 	> relation? That is:
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER FULL SURROUND
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM ABOVE
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM BELOW
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LEFT
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER LEFT
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM UPPER RIGHT
> 	> 	IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND FROM LOWER LEFT
> 	> 
> 	> A single "* IDEOGRAPHIC DESCRIPTION CHARACTER SURROUND" would have
> 	> been sufficient, as any reader is intelligent enough to infer the 7
> 	> slightly different cases by the very shape of the surrounding
> 	> component.
> 
> 	I agree completely with you on this. The UTC pointed this out, but
> 	what it came down to is that the 12 characters *already* existed as
> 	they were in GBK, and were used as defined in GBK implementations.
> 	Their encoding in 10646/Unicode is a compatibility issue with GBK,
> 	and there was no good reason not to make the full set available to
> 	allow transparent transcoding with GBK implementations.
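
For reference, the full set of 12 can be listed from a Unicode character database. This sketch assumes the characters ended up at U+2FF0..U+2FFB (a detail not stated in this thread) and uses Python's standard `unicodedata` module; the variable names are arbitrary.

```python
import unicodedata

# The 12 ideographic description characters, assumed at U+2FF0..U+2FFB.
idcs = [chr(cp) for cp in range(0x2FF0, 0x2FFC)]
names = [unicodedata.name(c) for c in idcs]

# The "surround" operators discussed above.
surround = [n for n in names if "SURROUND" in n]

for c, n in zip(idcs, names):
    print(f"U+{ord(c):04X} {n}")
print(len(surround), "of", len(names), "are surround operators")
```

Seven of the twelve names contain SURROUND, which is the redundancy under discussion; the remaining five cover left-to-right, above-to-below, their three-part forms, and overlay.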
> 
> 	> 
> 	> I hope it is clear what I am trying to say: I do not mean in any way
> 	> that IDS is poorly designed or that it should be changed, but rather
> 	> that it looks as if it was designed primarily for a different
> 	> purpose.
> 	> 
> 	> Or, putting it even more positively, that it is designed in such a
> 	> way that it is *also* well fit for a different (and very
> 	> interesting) purpose, besides the one for which it is currently
> 	> intended.
> 
> 	Nope. Here is where I get off the bus.
> 
> 	> 
> 	> I am just thinking that, once IDS is in place, it could be used
> 	> (maybe as a *higher-level protocol*) to achieve new and possibly
> 	> useful applications:
> 	>  - IDS-based input methods,
> 
> 	There are already dozens of Chinese input methods. I see no
> 	particular advantage in adding another based on IDS, which would not
> 	even be particularly effective.
> 
> 	>  - "modular" CJK font schemes (containing more rules but far fewer
> 	>    glyphs),
> 
> 	This won't work with IDS, for reasons others have described.
> 
> 	>  - component-based searching ("find that word that contained the
> 	>    <dragon> radical somewhere").
> 
> 	This should be done by database lookup. This kind of information is
> 	available in Unihan.txt (or could be extended to include other
> 	component information). Such an approach is *far* *far* more
> 	efficient than cluttering up the text representation itself with
> 	mostly useless information that would make other text processes
> 	extraordinarily inefficient and that would itself be difficult to
> 	search correctly. ('dragon' itself can be broken apart and described
> 	in pieces; how do I ensure that I have normalized the text into the
> 	relevant chunks that I am searching for??)
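
The database-lookup approach might look roughly like this. The component table is a tiny hand-made sample, not data taken from Unihan.txt, and `build_index` and `find_words` are hypothetical helper names. The point is that the text stays as ordinary encoded characters while the component information lives in a side table.

```python
from collections import defaultdict

# Hand-made sample, not taken from Unihan.txt: each character mapped to
# the components an indexer chose to record for it.
COMPONENTS = {
    "聾": ["龍", "耳"],
    "籠": ["竹", "龍"],
    "襲": ["龍", "衣"],
    "明": ["日", "月"],
}


def build_index(components):
    """Invert the table: component -> set of characters containing it."""
    index = defaultdict(set)
    for char, parts in components.items():
        for part in parts:
            index[part].add(char)
    return index


def find_words(words, component, index):
    """Return the words that contain any character built on `component`."""
    hits = index.get(component, set())
    return [w for w in words if any(ch in hits for ch in w)]


INDEX = build_index(COMPONENTS)
matches = find_words(["聾人", "明日", "鳥籠"], "龍", INDEX)
print(matches)
```

The search never parses or normalizes a description sequence. Whether a character counts as containing the dragon component is decided once, when the table is built, rather than every time text is searched.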
> 
> 	> 
> 	> Or even, exiting the computing kingdom:
> 	>  - for a Braille representation of ideographs(!),
> 
> 	Chinese Braille already exists. There are books published in it.
> 
> 	>  - as a new way to sort dictionaries.
> 
> 	Why? There are already multiple ways to sort Han character
> 	dictionaries. Adding more abstruse methods of graphical sorting on
> 	top of the traditional methods would serve what purpose?
> 
> 	> 
> 	> However, for such experiments to be possible, one more piece should
> 	> be added to the game: a ***Unified Ideographs to IDS dictionary***.
> 	> 
> 
> 	Good luck. I'm not volunteering!
> 
> 	--Ken