Re: FW: 6 questions

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Tue Sep 18 2001 - 18:22:45 EDT


At 12:26 PM 9/18/01 -0700, Kenneth Whistler wrote:
> > 3. Why don't "noBreak" formatted Unicode characters
> > have a canonical decomposition (the compatibility
> > decomposition surrounded by glue)?
>
>A long story. But the short answer is that such a decomposition
>would cause problems for implementations.

The first part of this long story is that these characters never
should have had a compatibility decomposition. In the early days,
before UAX#14, there was no convenient way to provide machine-readable
information on intended line breaking behavior, so dummy decompositions
with a <noBreak> tag were added.

Like all compatibility decompositions, they were roundly ignored in
practice at first, and never really reviewed later.
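
For readers who want to see these mappings, here is a minimal sketch,
assuming Python's standard unicodedata module (not something the original
discussion used). The sample list and the helper name no_break_mappings
are illustrative choices only.

```python
import unicodedata

# A few characters that carry a <noBreak> compatibility decomposition.
NO_BREAK_SAMPLES = ["\u00A0", "\u2007", "\u2011", "\u202F"]

def no_break_mappings(chars):
    """Return {character name: raw decomposition field} for each character."""
    return {unicodedata.name(c): unicodedata.decomposition(c) for c in chars}

mappings = no_break_mappings(NO_BREAK_SAMPLES)
for name, mapping in mappings.items():
    # Each mapping is the tag followed by the code point(s) it maps to.
    print(f"{name}: {mapping}")
```

The tag in the decomposition field is what marks the mapping as a
compatibility mapping rather than a canonical one.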

The other piece of information contained in the mappings is a useful
folding for spaces and hyphens which is appropriate for fuzzy searching.
Such a folding is now implicit in the Unicode Collation Algorithm,
which is perhaps a better place for this information.
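
A rough sketch of the folding contained in the mappings, assuming Python's
unicodedata module and using NFKC as a stand-in. This is not the Unicode
Collation Algorithm, and the helper names fold_for_search and fuzzy_equal
are hypothetical.

```python
import unicodedata

def fold_for_search(text):
    """Apply compatibility mappings, which fold no-break spaces and hyphens."""
    return unicodedata.normalize("NFKC", text)

def fuzzy_equal(a, b):
    """Compare two strings after compatibility folding."""
    return fold_for_search(a) == fold_for_search(b)

# NO-BREAK SPACE folds to SPACE.
space_match = fuzzy_equal("10\u00A0km", "10 km")
# NON-BREAKING HYPHEN folds to HYPHEN (U+2010), not to HYPHEN-MINUS (U+002D).
hyphen_match = fuzzy_equal("well\u2011known", "well\u2010known")
ascii_hyphen_match = fuzzy_equal("well\u2011known", "well-known")
```

Note that the mapping alone does not equate U+2011 with the ASCII
hyphen-minus; a collation-based comparison can be tailored to go further
than this.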

Canonical decompositions are in principle substitutions that are equally
acceptable as the original string, so that the fiction can be maintained
that the data has not been changed. For (de)composition of accented
characters this is true in principle. In practice, one will prefer either
the composed or the decomposed form; that is why UAX#15 defines two
canonical normalization forms, NFC and NFD, of equal standing but
different preference.
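
To make the composed and decomposed preference concrete, a small sketch
assuming Python's unicodedata module; the helper canonically_equivalent
is a hypothetical name, not part of any standard API.

```python
import unicodedata

composed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # e followed by COMBINING ACUTE ACCENT

# Each form converts to the other without loss.
nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

def canonically_equivalent(a, b):
    """Two strings are canonically equivalent if their NFD forms match."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)
```

The two strings differ code point by code point, yet compare equal once
both are brought to the same normalization form.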

In order to support interworking with legacy data and systems, form
NFC explicitly excludes compositions that are theoretically possible
but not found in nature (not found in legacy practice).
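
One way to observe an excluded composition, sketched with Python's
unicodedata module. U+0958 DEVANAGARI LETTER QA is on the composition
exclusion list; whether it is the kind of case meant above is an
assumption on my part.

```python
import unicodedata

qa = "\u0958"  # DEVANAGARI LETTER QA, canonically U+0915 U+093C

nfd_qa = unicodedata.normalize("NFD", qa)
# NFC decomposes and then recomposes, but excluded characters stay decomposed.
nfc_qa = unicodedata.normalize("NFC", qa)

# For contrast, an ordinary accented letter is recomposed by NFC.
nfc_e_acute = unicodedata.normalize("NFC", "e\u0301")
```

So NFC is not simply "compose everything that can be composed"; the
precomposed QA does not survive a round trip through NFC.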

Replacing singleton no-break characters with long strings of glue etc.
serves no such purpose and will roundly violate the expectations of
existing implementations, whether adapted from legacy code bases or
written for Unicode.
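
A purely hypothetical sketch of what such a replacement would do to a
string, in Python. The mapping below is NOT a real Unicode decomposition,
and using U+2060 WORD JOINER as the "glue" is an assumption made only for
illustration.

```python
import unicodedata

NBSP = "\u00A0"
GLUE = "\u2060"  # WORD JOINER, a stand-in for "glue" (assumption)

def hypothetical_canonical_decomposition(text):
    """NOT a real Unicode mapping: expand NBSP to GLUE + SPACE + GLUE."""
    return text.replace(NBSP, GLUE + " " + GLUE)

original = "10" + NBSP + "km"
expanded = hypothetical_canonical_decomposition(original)

# The real canonical forms leave the singleton character untouched.
real_nfd = unicodedata.normalize("NFD", original)
real_nfc = unicodedata.normalize("NFC", original)
```

Under the hypothetical mapping one character becomes three, and code that
looks for U+00A0 no longer finds it; the actual canonical forms avoid both
effects.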

> > 4. Greek final sigma is not considered a compatibility
> > decomposition (word position variant) because its
> > usage could also be dependent on spelling convention?
>
>No. Greek implementations have traditionally not made use
>of joiner/non-joiner mechanisms.

Breaking this tradition serves no purpose other than making interworking
with legacy data complicated and error prone.
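
A quick check of how final sigma is treated, assuming Python's unicodedata
module: it is a separate character with no decomposition, so normalization
leaves it alone.

```python
import unicodedata

FINAL_SIGMA = "\u03C2"  # GREEK SMALL LETTER FINAL SIGMA
SIGMA = "\u03C3"        # GREEK SMALL LETTER SIGMA

# An empty decomposition field means no canonical or compatibility mapping.
final_sigma_decomposition = unicodedata.decomposition(FINAL_SIGMA)

# Even the compatibility forms keep final sigma distinct from sigma.
nfkc_final_sigma = unicodedata.normalize("NFKC", FINAL_SIGMA)
nfkd_final_sigma = unicodedata.normalize("NFKD", FINAL_SIGMA)
```

The distinction therefore lives in the stored text itself, as in legacy
Greek data, rather than in a joiner or a mapping.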

> > 5. How come east asian width type W and H are non
> > starters for line breaking?
>
>I'll let somebody else tackle that one.

The question is wrongly stated. It is possible to ask:
"How come there are *some* characters with
both EAW H and W and linebreak class NS?"

If linebreak class could have been predicted completely from the EAW,
it would have been done. As you read UAX#14, you'll see many characters
with EAW class W that have many different linebreak classes
in order to match their actual behavior in line breaking.
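
A small sketch of that point in Python. The standard unicodedata module
exposes East Asian Width but not the line break class, so the line break
values below are a hand-copied sample (an assumption to be checked against
LineBreak.txt), and classes_by_width is a hypothetical helper.

```python
import unicodedata
from collections import defaultdict

# Hand-copied sample of line break classes; verify against LineBreak.txt.
SAMPLE_LINE_BREAK = {
    "\u3001": "CL",  # IDEOGRAPHIC COMMA
    "\u300C": "OP",  # LEFT CORNER BRACKET
    "\u3005": "NS",  # IDEOGRAPHIC ITERATION MARK
    "\u30FB": "NS",  # KATAKANA MIDDLE DOT
    "\u3042": "ID",  # HIRAGANA LETTER A
    "\uFF61": "CL",  # HALFWIDTH IDEOGRAPHIC FULL STOP
    "\uFF62": "OP",  # HALFWIDTH LEFT CORNER BRACKET
    "\uFF65": "NS",  # HALFWIDTH KATAKANA MIDDLE DOT
    "\uFF9E": "NS",  # HALFWIDTH KATAKANA VOICED SOUND MARK
    "\uFF71": "ID",  # HALFWIDTH KATAKANA LETTER A
}

def classes_by_width(sample):
    """Group the sampled line break classes by East Asian Width value."""
    grouped = defaultdict(set)
    for ch, line_break in sample.items():
        grouped[unicodedata.east_asian_width(ch)].add(line_break)
    return dict(grouped)

by_width = classes_by_width(SAMPLE_LINE_BREAK)
```

In this sample both W and H map to several line break classes, NS being
only one of them, so the width alone does not determine the class.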

A./


