Re: Dotted Circle plus Combining Mark as Text from Richard Wordingham on 2013-10-26 (Unicode Mail List Archive)

From: Richard Wordingham <richard.wordingham_at_ntlworld.com>
Date: Sat, 26 Oct 2013 16:48:00 +0100

On Sat, 26 Oct 2013 00:41:55 +0200
Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:

> That's exactly why I asked about how to encode unambiguously in text that
> we really want to represent a semantically defective combining
> sequence (which should be then rendered depending on cultural
> environment like language tagging in text, or the scripts for which
> the diacritic is encoded).

I don't believe a defective combining sequence has any semantics
which depend on its being defective. This is not to say that such a
sequence as a sequence of codepoints is not sometimes useful; it may
be a useful representation of a suffix in a spell-checker. What you
want is a non-breaking space that is only as wide as is necessary for
the combining marks to be placed on it. The nearest I can find to this
in a hypothetical standard-conforming renderer is <U+200A HAIR SPACE,
U+2060 WORD JOINER>. It will only work if the space expands to match the
combining marks placed upon it. It fails to meet your requirements
because it doesn't vanish when it has no combining marks attached.

I didn't much like your earlier use case, but I can offer a use case
that applies to spacing combining marks. If I were transcribing some
text in the Tai Tham script and wished to preserve the line breaks, I
would have problems when default clusters were split between lines.
That does happen to U+1A63 TAI THAM VOWEL SIGN AA. I would not be
the least bit surprised to see it happen to U+1A6E TAI THAM VOWEL SIGN
E; as a culturally relevant comparison, I believe I have seen an
orphaned U+0E40 THAI CHARACTER SARA E in a Thai book. To avoid
defective Tai Tham text, I would have to provide the isolated vowels
with a base character such as U+200A. Using U+00A0 NO-BREAK SPACE would
insert a space before the isolated SIGN E and inappropriately indent an
isolated SIGN AA.
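
As a rough illustration of the transcription case above, here is a minimal
Python sketch that gives a base character to any line starting with a
combining mark. The function names and the sample lines are hypothetical; the
check relies only on the General Category reported by the standard
unicodedata module, and U+200A is used simply because it is the base suggested
above. It says nothing about how a given renderer will display the result.

```python
import unicodedata

HAIR_SPACE = "\u200A"  # base suggested in the message; other bases are possible


def starts_with_mark(line: str) -> bool:
    """Return True if the line begins with a combining mark (category Mn, Mc or Me)."""
    return bool(line) and unicodedata.category(line[0]).startswith("M")


def add_base_to_orphans(lines, base=HAIR_SPACE):
    """Prefix every line that starts with a combining mark with `base`."""
    return [base + line if starts_with_mark(line) else line for line in lines]


# Hypothetical transcription: the line break fell before U+1A63
# TAI THAM VOWEL SIGN AA, leaving it at the start of the second line.
lines = ["\u1A20", "\u1A63\u1A20"]
fixed = add_base_to_orphans(lines)
```

Only the line-initial mark is touched, so clusters that were not split by a
line break are left exactly as transcribed.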

> I suggested WJ for this usage (technically, appending a combining mark
> after WJ makes it no longer defective, but WJ effectively
> blocks reorderings by normalisations, and remains neutral for plain text
> searches, without also introducing any new break opportunity). My
> opinion is that it is the best "replacement" for the missing base
> letter or symbol.

WJ fails for one very good reason. TUS 6.2 Section 16.2 'Layout
Controls' says,

"The effect of layout controls is specific to particular text processes.
As much as possible, layout controls are transparent to those text
processes for which they were not intended. In other words, their
effects are mutually orthogonal."

Thus, WJ between a base character and a combining character shall not
affect their relative rendering. Some renderers respect this
requirement; others don't.
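
The quoted point that WJ blocks reordering under normalisation is a separate
matter from rendering, and it can be checked independently. The sketch below
assumes Python's standard unicodedata module; the choice of 'q' with U+0301
and U+0323 is only an illustrative example, not one taken from the thread.

```python
import unicodedata

WJ = "\u2060"         # WORD JOINER, canonical combining class 0
ACUTE = "\u0301"      # COMBINING ACUTE ACCENT, combining class 230
DOT_BELOW = "\u0323"  # COMBINING DOT BELOW, combining class 220

plain = "q" + ACUTE + DOT_BELOW
with_wj = "q" + ACUTE + WJ + DOT_BELOW


def code_points(s: str):
    """Format a string as a list of U+XXXX labels for inspection."""
    return [f"U+{ord(c):04X}" for c in s]


# Canonical ordering moves the lower class (220) ahead of the higher one (230)...
nfd_plain = unicodedata.normalize("NFD", plain)
# ...but a character of class 0 between the marks stops the reordering.
nfd_with_wj = unicodedata.normalize("NFD", with_wj)

print(code_points(nfd_plain))
print(code_points(nfd_with_wj))
```

This shows only what normalisation does to the code point sequence; it does
not show that a renderer will treat the mark after WJ any differently.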

> WJ will not be needed at beginning of paragraphs, but should cause no
> other problem. It will never be rendered by itself (except in a
> "visible controls edit mode" where it would show by itself, followed
> by each separately encoded diacritic rendered in their "ill form" as
> below, and not combined together, i.e. without ligatures or
> substitutions of pairs by one glyph, or contextual substitutions of
> isolates).
>
> That OpenType feature could be designed in two ways in fonts:
> - it could specify mappings contextual for ranges of combining marks
> to substitute them with an *inserted* appropriate base glyph (not
> necessarily the same as the one mapped for U+25CC, it could be a dotted
> arabic sharadah for example, or a dotted "x" in Thai)
> - it could also just specify the single glyph to use for U+25CC
> rendered with this feature enabled (easier for many font authors that
> don't need fonts covering multiple scripts or cultures): the glyph
> will be then different from the default glyph used for U+25CC in
> encoded texts.

A feature ilfm replacing combining marks by a suitable warning
sequence would have one advantage. It has been pointed out that the
proper glyph of U+25CC is not really suitable for use as a character
base; a smaller glyph is generally required. A feature ilfm could
specify the use of this smaller glyph. It could also choose different
bases for different combining marks. I would be inclined to implement
it as a single lookup for all the scripts supported by the font.

> Document authors that still want to use a specific "base" character
> can still use it *instead* of WJ (to create non defective sequences):
>
> - whitespaces and cursive joiners
> - dashes and hyphens
> - U+25CC or other geometric
> - multiplication sign, or other (maths) symbols
> - Latin letter x (not recommended due to its strong LTR property, and
> effects of word breakers working by runs of the same script), etc.
>
> But such a technique may not work in many fonts that will provide
> mappings for (base,diacritic) pairs only for a few well-known "bases":
>
> - NBSP, U+25CC, and
> - WJ (preferably via the OpenType "ilfm" feature)

Many glyphs for non-spacing combining marks are defined so that they
will give an intelligible result with most characters without the need
for any contextual positioning.

Aside to Philippe: As you're having trouble typing 'a' (injured finger?
sticking key on the keyboard?), if you have an English spell-checker,
please use it.

Richard.
Received on Sat Oct 26 2013 - 10:51:25 CDT