Re: Visarga, ardhavisarga and anusvara -- combining marks or not?

From: verdy_p (verdy_p@wanadoo.fr)
Date: Mon Sep 07 2009 - 17:07:27 CDT

Next message: verdy_p: "Re: Run-time checking of fonts for Sinhala support"

Previous message: Doug Ewell: "Re: Run-time checking of fonts for Sinhala support"
Maybe in reply to: verdy_p: "Re: Visarga, ardhavisarga and anusvara -- combining marks or not?"
Next in thread: Asmus Freytag: "Re: Visarga, ardhavisarga and anusvara -- combining marks or not?"
Reply: Asmus Freytag: "Re: Visarga, ardhavisarga and anusvara -- combining marks or not?"

"Asmus Freytag" wrote:
> The second is the radical solution: reclassify every single
> character from Mc to Lo where there isn't any compelling
> reason (in rendering or processing) to consider that
> character actually "combining" in function, not just in name.
> The advantage of this approach is that it would be very
> visible and direct. Treating an "Lo" character by using the
> support for graphically combining characters in a
> renderer is obviously wrong, so you might expect a
> pressure on *all* implementations to get that corrected.
>
> The downside, of course, is that it's impossible to predict
> what uses the gc=Mc classification has been put to by
> actual implementations, outside of simple rendering issues.
> You are correct in calling such an approach destabilizing,
> no matter how appealing it would be, otherwise. For
> the same reason, UTC is correct to continue to be
> consistent with past practice in assigning Mc to any new
> characters that are analogues to existing Mc characters.

This solution would be much too radical. Effectively, if you are speaking about rendering Mc characters, they should
be rendered like other gc=Lo characters and handled with the simpler model (which does not have to deal with
combining marks and the ill-named "non-spacing" versus "spacing" dichotomy between combining marks, but would focus
only on possible ligatures and/or conjuncts, i.e. the preferred ligated forms).

But the main problem is that it would change how many other uses, outside of rendering, are implemented, notably
full-text search. The gc=Mc classification effectively makes a clear split in the order of importance from gc=Lo
letters, which are considered much more important and are needed for every search at the primary level as soon as
you try to cope with variable orthographies. The gc=Mc marks are not always present in all texts, nor presented to
users in all styles: there are cases where they are not rendered at all, even if they are encoded, to accommodate
presentation styles such as titling and monumental inscriptions, summaries and book indexes, and even dictionaries
or diaries for the general classification of words.
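
As a rough illustration of the search argument above, here is a minimal Python sketch of a "primary level" match
that ignores gc=Mc characters. The helper names strip_categories and loose_match are hypothetical, and this is not
the Unicode Collation Algorithm: it only shows how code keyed on General_Category would behave. Note that dropping
every Mc character also drops dependent vowel signs such as U+093E, which a real search would probably want to keep.

```python
import unicodedata

def strip_categories(text, categories=("Mc",)):
    """Return text without characters whose General_Category is listed."""
    return "".join(
        ch for ch in text if unicodedata.category(ch) not in categories
    )

def loose_match(needle, haystack, categories=("Mc",)):
    """Substring test that ignores the listed categories on both sides."""
    return strip_categories(needle, categories) in strip_categories(
        haystack, categories
    )

# DEVANAGARI: DA, VOWEL SIGN U (Mn), SIGN VISARGA (Mc), KHA
with_visarga = "\u0926\u0941\u0903\u0916"
# Same word written without the visarga
without_visarga = "\u0926\u0941\u0916"

print(unicodedata.category("\u0903"))              # Mc
print(loose_match(without_visarga, with_visarga))  # True
```

If U+0903 were reclassified as Lo, a filter like this would stop ignoring it and the two spellings would no longer
match, which is the kind of silent behaviour change described above.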

You may argue that a well-behaved collation algorithm should not depend on the gc classification, and that collation
still needs to be tailored for a lot of languages. But the reality is that even the default collation table, used as
the root for all tailorings, has to be maintained, and it is built up from the ground by first looking at the gc
classification. If you change the gc massively, you will break a lot of existing collation algorithms, unless they
are built on top of a full copy of the DUCET. You will also have difficulties, at Unicode, in maintaining the DUCET
in the future, because the primary or secondary level of "importance" of characters is not tracked anywhere else in
the UCD.

My opinion is that this radical change is absolutely not needed. The standard just needs to say that gc=Mc
characters should be treated like gc=Lo for rendering, and ONLY for rendering purposes, and ONLY if those characters
are effectively rendered, because there do exist contexts in which they will not be rendered with the rest of
the text when some styles are applied. I'm not saying that the gc=Mc characters are not useful orthographically,
just that they have a secondary role, and that they can be used as optional, notation-like additions on top of a
simpler text, in the same way that you can analyse, in a multi-level approach, fully pointed Hebrew or Arabic
texts, or epigraphic Greek, where a lot of additional marks were written, sometimes with very strict orthographic
or stylistic rules, to complement the primary level of text.

It also happens that some gc=Mc marks have changed their role over history, between being considered plain
letters and being just additional optional marks. This role may also vary between distinct languages, including in
modern use, where some marks have disappeared from the usual orthography and some others have been promoted to
being used almost systematically to disambiguate some words or their pronunciation.

Reclassifying gc=Mc as gc=Lo would just SUPPOSEDLY simplify the rendering. But in my opinion it is not
needed: designers of fonts and renderers just have to be prompted to treat these characters like base letters when
they have to render them, so they must not, for example, use the dotted circle for the sequences they don't
recognize with their linguistic rules. It's not up to the renderer or font to deal with linguistic issues, unless it
is otherwise impossible to get the correct rendering needed and expected for specific languages.

Anyway, I don't think that existing implementations exhibit the major rendering problems that you propose to
solve with such a radical change. The problems do exist, but they are at another level that does not involve just
the gc=Mc characters, but clusters of letters (such as Indic consonants, with multiple forms: full, half, subjoined,
post-joined, halant-below, and ligatures/conjuncts, when some of the letters are used with virama and sometimes with
additional (dis)joiner controls). In fact, if you change gc=Mc to gc=Lo, you will add even more complication to the
algorithms that have to handle the Indic variable forms, because the first thing they have to manage is the
identification of base letters and of the boundaries of the letter clusters...
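
To make the last point concrete, here is a deliberately naive Python sketch of cluster identification driven only
by General_Category: every non-mark character starts a cluster, and every mark (gc=M*) attaches to the preceding
one. The function name clusters and its overrides parameter are hypothetical, added only to simulate a
reclassification. It ignores virama conjuncts and joiner controls, so it is neither UAX #29 segmentation nor what
a real shaping engine does.

```python
import unicodedata

def clusters(text, overrides=None):
    """Split text into base-plus-marks clusters using General_Category.

    overrides maps a character to a pretend General_Category, to
    simulate a reclassification (an assumption for illustration only).
    """
    overrides = overrides or {}
    result = []
    for ch in text:
        gc = overrides.get(ch, unicodedata.category(ch))
        if gc.startswith("M") and result:
            result[-1] += ch   # a mark attaches to the previous cluster
        else:
            result.append(ch)  # anything else starts a new cluster
    return result

# DEVANAGARI: KA (Lo), VOWEL SIGN AA (Mc), SIGN VISARGA (Mc)
text = "\u0915\u093E\u0903"

print(len(clusters(text)))                     # 1
print(len(clusters(text, {"\u0903": "Lo"})))   # 2
```

With the current classification the visarga stays in the syllable's cluster; treated as Lo, it becomes a cluster
of its own, so any code that locates base letters this way would see different boundaries.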

This archive was generated by hypermail 2.1.5 : Mon Sep 07 2009 - 17:10:51 CDT