Re: Digraphs as Distinct Logical Units

From: Roozbeh Pournader (roozbeh@sharif.edu)
Date: Thu Aug 08 2002 - 19:07:32 EDT


Ken,

On Thu, 8 Aug 2002, Kenneth Whistler wrote:

> Expecting the compatibility decompositions to serve this purpose
> effectively is overvaluing what they can actually do.

I would love to hear your opinion about what compatibility decompositions
*are* for, then. I feel a little confused here.

> > providing backup rendering when they lack the glyph,
>
> This seems unlikely to be particularly helpful in this *particular*
> case.

Believe me, it really is. I'm implementing char-cell rendering for Arabic
terminals, and when it comes to Arabic ligatures, since I don't want to
get into a mess of double-width things, I just decompose the ligature
and render the equivalent string. It's not as genuine as it could be, but
it's automatic, simple, clean, and conformant.
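
A minimal sketch of that fallback, for illustration only: it is not the
terminal code itself, and the function name is hypothetical. It assumes
Python's standard unicodedata module and replaces each character in the
Arabic Presentation Forms blocks with its compatibility decomposition
before the string goes to the cell renderer.

```python
import unicodedata

def decompose_presentation_forms(text):
    """Replace Arabic presentation forms with their NFKD equivalents.

    Hypothetical helper: only characters in the two Arabic Presentation
    Forms blocks are touched; everything else passes through unchanged.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        # Forms-A is U+FB50..U+FDFF; Forms-B letters end at U+FEFC.
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFC:
            out.append(unicodedata.normalize("NFKD", ch))
        else:
            out.append(ch)
    return "".join(out)

# U+FEFB (lam with alef, isolated form) becomes plain lam + alef.
fallback = decompose_presentation_forms("\uFEFB")
```

The renderer then only has to shape ordinary Arabic letters, one cell per
character, at the cost of losing the ligature's original appearance.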

Another point: We like to discourage the usage of Arabic Presentation
Forms, don't we? That is mentioned in TUS 3.0 at the end of the chapter
about Arabic. All the characters in the Arabic Presentation Forms blocks
have these decompositions, exactly for this. Only three lack one: U+FD3E
and U+FD3F, the Ornate Parentheses, which got there by mistake (and are
mentioned in the text), and U+FE73, which is a half-character (and could
not have one in any way).
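
One way to check this kind of claim is sketched below, assuming Python's
unicodedata module (the function name is hypothetical). It lists the
assigned characters in the two blocks that have no decomposition. The
result depends on the Unicode version bundled with the interpreter, so a
newer database may report more than the three named above.

```python
import unicodedata

# Arabic Presentation Forms-A, and Forms-B up to its last letter
# (stopping before U+FEFF, the byte order mark).
BLOCKS = [(0xFB50, 0xFDFF), (0xFE70, 0xFEFC)]

def without_decomposition():
    """Return assigned code points in BLOCKS lacking a decomposition."""
    missing = []
    for first, last in BLOCKS:
        for cp in range(first, last + 1):
            ch = chr(cp)
            if unicodedata.name(ch, None) is None:
                continue  # unassigned code point or noncharacter
            if not unicodedata.decomposition(ch):
                missing.append(f"U+{cp:04X}")
    return missing

print(unicodedata.unidata_version, without_decomposition())
```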

By not providing a compatibility decomposition, we are making the proposed
character a healthy and normal character, just like Arabic letters or
symbols. It won't be a compatibility character like Chinese and Japanese
ones, or other Arabic ligatures, but a new beast encouraged to be used.
Why don't we encode it in the 06xx block then?

> > reading a text stream aloud, and things like that,
>
> And this requires much more than just some raw access to an
> NFKD normalization of the text stream to make any sense, for
> any real application.

Of course, but look how nice it is now: In whatever encoding it is, just
pass it through a converter to Unicode NFKC, and then you will have
something very clean and consistent to work with. Why bother with
difficulties of the various character encodings? This applies to almost
every similar application which is not rendering-oriented.
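
As a rough sketch of that pipeline, assuming Python's unicodedata module
and its built-in codecs (the function name is hypothetical): decode
whatever encoding the input is in, then normalize to NFKC, so that later
stages see one consistent form.

```python
import unicodedata

def to_clean_unicode(raw, encoding):
    """Decode bytes from the given encoding and normalize to NFKC."""
    return unicodedata.normalize("NFKC", raw.decode(encoding))

# The same word from two sources: a UTF-8 stream that uses the
# presentation-form ligature U+FEFB, and a Windows-1256 stream that
# uses the plain letters lam + alef.
from_utf8 = to_clean_unicode("\uFEFB".encode("utf-8"), "utf-8")
from_cp1256 = to_clean_unicode("\u0644\u0627".encode("cp1256"), "cp1256")
same = from_utf8 == from_cp1256
```

After this step a text-to-speech or search component no longer needs to
know which encoding, or which presentation forms, the source used.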

> Implementation practice since then has suggested that compatibility
> decompositions for these Arabic word ligatures used symbolically are not
> much help -- and if any thing just provoke edge case failures for
> implementations.

I don't get you. They have definitely been a help to me. What are the
other difficulties (other than the decomposition buffer size you just
mentioned)?
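
For a sense of scale on the buffer-size point, here is a small sketch,
assuming Python's unicodedata module (the function name is hypothetical),
that finds the longest compatibility decomposition in a code point range:

```python
import unicodedata

def longest_decomposition(first, last):
    """Return (code point, NFKD length) of the longest expansion."""
    best_cp, best_len = first, 0
    for cp in range(first, last + 1):
        n = len(unicodedata.normalize("NFKD", chr(cp)))
        if n > best_len:
            best_cp, best_len = cp, n
    return best_cp, best_len

# Scan Arabic Presentation Forms-A, where the word ligatures live.
cp, length = longest_decomposition(0xFB50, 0xFDFF)
print(f"U+{cp:04X} expands to {length} characters")
```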

Also, please note that we cannot remove all those decompositions; we can
only make the implementer's job a little harder by breaking the model (and
encouraging the use of the Arabic Presentation Forms block), or we can
help him a little.

> Nope. The UTC wouldn't do it, nor have the Pakistani delegates
> working with the UTC and WG2 asked for it.

It was in the first proposals, IIRC. They were not formal, of course.

> Not everything that gets into a national standard gets into Unicode.

Undoubtedly, but being in a national standard helps a lot. So OK, are you
telling me that it is not just for compatibility, it is a legitimate
character that could have got accepted by WG2 even if it was not in UZT?

> > UZT is there also to make a point: that Urdu computing is different (from
> > whatever you are thinking about)! National pride, I'll call it.
>
> Which doesn't change the fact that Pakistan has brought forward
> some characters whose justification seems sufficient for inclusion
> in Unicode.

I agree. Sorry for messing things.

BTW, whose was the suggestion to not provide a compatibility decomposition
for the character?

> But I think you may be overestimating the caving in going on here.
> The UTC is still pushing back on another proposal to disunify Urdu
> digits, for example -- those did *not* get accepted by WG2, nor do I
> expect they will pass muster in future UTC meetings.

Yes, I'm doing that on purpose. I was talking with a Pakistani expert
about UZT before the Dublin meeting. He told me that his colleagues will
propose the character to WG2, and I told him it's impossible, WG2 has
already passed something about not encoding more Arabic ligatures, unless
they are in a pre-90s standard. He told me: "You only need to push hard. Just
insist enough, they will surrender".

I was not present at the Dublin meeting, and neither was that guy, so I
don't know exactly what was discussed. I'm not even against encoding the
character, I just can't understand why Unicode is making a first exception
here, encoding what is a compatibility character in all senses as a normal
character.

> Ask some of the proposers just how much effort (extended over how long a
> sustained period) has to be devoted to actually getting characters added
> to the standards.

I have been involved in that, and I know it. I didn't mean that it's an
easy job, just that I hate to see people encoding an ideograph in a
national standard first, instead of suggesting it to IRG.

roozbeh

PS: I'm sorry if I've been offensive, or have talked a little politically.
I'm just a geek who prefers technical excellence to political reasons. I
do try my best to implement this in our national standard committees,
and I somehow expect this from UTC. I just like to see more of that
resistance against grass radicals.

-- 
Roozbeh Pournader               | Sometimes I forget to reply to emails.
Sharif University of Technology | Some other times I don't find the time.
roozbeh <at> sharif <dot> edu   | So kindly remind me if it's important,
http://sina.sharif.edu/~roozbeh | and use other methods if it's urgent.


