Standard fallback characters (was: Draft Proposal to add Variation A Sequences for Latin and Cyrillic letters)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Wed Aug 04 2010 - 15:30:22 CDT

Next message: Karl Pentzlin: "Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters"

Previous message: Karl Pentzlin: "Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters"
In reply to: Asmus Freytag: "Re: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters"
Next in thread: Asmus Freytag: "Re: Standard fallback characters (was: Draft Proposal to add Variation A Sequences for Latin and Cyrillic letters)"
Reply: Asmus Freytag: "Re: Standard fallback characters (was: Draft Proposal to add Variation A Sequences for Latin and Cyrillic letters)"

"Asmus Freytag" wrote:
> The Fraktur problem is one where one typestyle requires additional
> information (e.g. when to select long s) that is not required for
> rendering the same text in another typestyle. If it is indeed desirable
> (and possible) to create a correctly encoded string that can be rendered
> without further change automatically in both typestyles, then adding any
> necessary variation sequences to ensure that ability might be useful.
> However, that needs to be addressed in the context of a precise
> specification of how to encode texts so that they are dual renderable.
> Only addressing some isolated variation sequences makes no sense.

I don't think so.

If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this
case, the conversion to "long s" will be inappropriate. So use the Fraktur "round s" directly.

If a text in Fraktur absolutely requires the "long s", it is only because the original text was already using this
"long s". In that case, encode the "long s": the text will render with a "long s" both in "modern" Latin font styles
like Bodoni (with a possible fallback to the modern "round s" if that font does not have a "long s") and in "classic"
Fraktur font styles (here also with a possible fallback to the Fraktur "round s" if the Fraktur font omits the long s
from its repertoire of supported glyphs).

In other words, you don't need any variation sequence: "s+VS1" would be strictly encoding the same thing as the
existing encoded "long s". Adding this variation selector would just be pollution (an unjustified disunification).
The two existing characters are already clearly stating their semantic differences, so we should continue to use
them.

This does not mean that fonts should not continue to be enhanced, and that font renderers and text-layout engines
should not be corrected to support more fallbacks (in fact it will be simpler to implement these fallbacks within
text-renderers, instead of requiring a new font version).

You can apply the same policy to the French narrow non-breaking space NNBSP (aka "fine" in French), which fonts do not
need to map, provided that the font renderers or text layout engines correctly infer its best fallback as
"THIN SPACE", before retrying with the "FOUR-PER-EM SPACE" or "SIX-PER-EM SPACE" characters, then with a standard
SPACE with a reduced metric...

That's because fonts never care about line-breaking properties, which are implemented only in text layout engines.
The same should apply to NBSP if a font does not map it (the text renderer just has to use the fallback
to SPACE to find the glyph in the selected font), and to the NON-BREAKING HYPHEN (just infer the fallback to the
standard HYPHEN, then to HYPHEN-MINUS).

In fact, it would be more elegant if Unicode provided a new property file, suggesting the best fallbacks (ordered by
preference) for each character (these fallbacks possibly having their own fallbacks that will be retried if all the
suggested ordered fallbacks are already failing). In most cases, only one fallback will be needed (in very few
cases, several ordered fallbacks should be listed if the implied sub-fallbacks are not in the correct order of
resolution).

It would avoid selecting glyphs from other fallback fonts with very different metrics. Some of these fallbacks are
already listed in the main UCD file, but they are too generic (because the compatibility mappings must resolve ONLY
to characters that are not themselves compatibility-decomposable). For example NNBSP has a compatibility
decomposition to 0020, just like many other whitespace characters, so it completely loses the width information.
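
As a small illustration of this loss of width information, the compatibility decompositions can be inspected with
Python's standard unicodedata module. This is only a sketch: the helper name compat_targets is made up for the
example, and the data comes from whichever UCD version the Python build ships. U+202F is the code point of NARROW
NO-BREAK SPACE in the UCD.

```python
import unicodedata

# Print the decomposition field recorded in the UCD for a few space characters.
# The field is an optional <tag> followed by the code point(s) it maps to.
for cp in (0x202F, 0x2009, 0x200A, 0x00A0):
    ch = chr(cp)
    print(f"U+{cp:04X} {unicodedata.name(ch)}: {unicodedata.decomposition(ch)}")


def compat_targets(cp):
    """Return the decomposition code points of cp, dropping the <tag> if any."""
    fields = unicodedata.decomposition(chr(cp)).split()
    return [int(f, 16) for f in fields if not f.startswith("<")]
```

All four characters map to the single code point 0020, so nothing in the decomposition distinguishes a thin space
from a hair space or an ordinary space.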

If we had standardized fallback resolution sequences implemented in text renderers, we would not need to update
complex fonts, and the job for font designers would be much simpler, and users of existing fonts could continue to
use them, even if new characters are encoded.

I took the example of NNBSP because it is a character that has been encoded for a long time now, yet vendors are
still forgetting to provide a glyph mapping for it (for example in core fonts of Windows 7 such as the new "Segoe UI"
font, even though Microsoft included an explicit mapping for NNBSP in Times New Roman). It's one of the frequent
cases where this can be solved very simply by the text renderer itself.

The same should be done for providing a correct fallback to "round s" if ever any font does not map the "long s".

I also suggest that the lists of standard character fallbacks are scanned within the first selected font, without
trying with other fallback fonts (including multiple font families specified in a stylesheet or generic CSS fonts),
unless the list of fallback characters includes a specifier in the middle of the list that would indicate
that all the characters (the original or the fallback characters already specified before) should be
searched (this will be useful mostly for symbol and pictogram characters).

As the ordered list of suggested fallback characters will be then rescanned for other fonts, when it reaches the end
without finding any fallback in the current font, it is not necessary to include it at end of the list.

Example of standardized fallback data for "long s" and symbols:

> 00A0 ; 0020 # NON-BREAKING SPACE
> 00B2 ; 0032 # EXPONENT DIGIT TWO
> 2009 ; 200A, 2008, 2005, 2006, 00A0 # NARROW NON-BREAKING SPACE
> 0283 ; 0073 # LATIN SMALL LETTER ESH
> 02A6 ; 0074 200C 0073 # LATIN SMALL LIGATURE T-S
> 02A7 ; 0074 200C 0283 # LATIN SMALL LIGATURE T-ESH
> 20A7 ; , 0050 200C 0074 200C 0073 # PESETA SYMBOL

Here the Peseta symbol ("Pts" ligature) will first be searched for in other fonts, before trying to infer a ligature
of the three letters. Because all other fonts will have been scanned for the precomposed "Pts" symbol ONLY, the
processing will continue by trying to represent the ligature of the three letters: the renderer will attempt to
locate such a ligature in the primary font and, as this will likely fail, it will immediately reprocess the sequence
ignoring the ZWJ characters, so it will show the three letters "Pts", which are very likely to be found in many fonts
(in fact almost all Latin fonts).

If it still fails at this point (because the primary font was not designed for Latin), it reaches the end of the
list of standard fallbacks, so it will rescan the other fonts for all suggested fallbacks after the last (no
need to rescan for the symbol), so other fonts will then be scanned successively for the three-letter sequence with
ZWJ and immediately afterwards without it, before trying the next fallback fonts in the specified stylesheet, and
then the system-specific fallback fonts.
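
The scanning order described above could be sketched roughly as follows. This is not a real renderer: fonts are
modelled as plain sets of supported code points (a stand-in for actual glyph-mapping lookups), the function name
resolve and the JOINERS set are invented for the example, and the specifier that forces an early search in other
fonts is not modelled at all. The sketch simply exhausts the list in one font before moving to the next.

```python
JOINERS = {0x200C, 0x200D}  # ZWNJ and ZWJ, dropped on the second attempt


def resolve(cp, fonts, fallbacks):
    """Return (font_index, sequence) for the first renderable candidate, or None.

    fonts:     list of sets of supported code points, in priority order
               (hypothetical stand-in for real font lookups).
    fallbacks: dict mapping a code point to its ordered fallback sequences.
    """
    candidates = [(cp,)] + list(fallbacks.get(cp, []))
    for index, font in enumerate(fonts):
        for seq in candidates:
            # Try the sequence as listed, then again without joiners.
            stripped = tuple(c for c in seq if c not in JOINERS)
            for variant in (seq, stripped):
                if variant and all(c in font for c in variant):
                    return index, variant
    return None
```

With a Latin-only primary font and the three-letter fallback listed for 20A7, this sketch settles on the plain
letters in the primary font, because the joiners are not mapped there and are dropped on the second attempt.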

Each fallback listed starts with a qualifier which is intended to be processed by the text renderer when the listed
characters are successfully resolved in the current font: it will provide synthetic information. The
qualifier in fact will not alter the rendering, it just specifies that no change is necessary to the rendered glyph
or its metrics and position.

The , , , , , , , , specifiers alter the
rendering appropriately, by synthetic style modifications or metric modifications (font size, advance width...).
They may be combined for the same specified fallback...

If the renderer finds the characters listed in the fallback with a mapping in the currently scanned font, it will
render the mapped glyph, using the style modifications indicated by the specifiers (such as the equivalent ones
available in CSS).

The and specifiers could be used as defined aliases for and
respectively...

We could have such data for many of the proposed emojis for emotional faces (most probably using ).

Note that the fully expanded list (after recursion) should contain somewhere the compatibility mappings listed in
the main UCD file. For example:

> 2009 ; 200A, 2008, 2005, 2006, 00A0 # NARROW NON-BREAKING SPACE

complies with this, because it lists a fallback using U+00A0, which itself already falls back with:

> 00A0 ; 0020 # NON-BREAKING SPACE

So U+2009 will effectively fall back (at least in the end) to the existing compatibility decomposition in the main
UCD file. The data above also includes the compatibility fallback of long s to round s already specified in the main
UCD file.
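
The recursive expansion can be sketched as follows, assuming the fallback data is already held in a dictionary
(the function name expand and that data layout are assumptions made for this example). Direct fallbacks are listed
first and the fallbacks of those entries afterwards, which is one reading of the ordering described earlier; only
single-character fallbacks are expanded further, as a simplification.

```python
from collections import deque


def expand(cp, fallbacks):
    """List every fallback sequence reachable from cp, without repeats.

    Direct entries come first, in their listed order, followed by the
    fallbacks of those entries. Only one-character sequences are expanded.
    """
    result = []
    seen = {(cp,)}          # never offer the original character again
    queue = deque([cp])
    while queue:
        current = queue.popleft()
        for seq in fallbacks.get(current, []):
            if seq in seen:
                continue
            seen.add(seq)
            result.append(seq)
            if len(seq) == 1:
                queue.append(seq[0])
    return result


fallbacks = {
    0x2009: [(0x200A,), (0x2008,), (0x2005,), (0x2006,), (0x00A0,)],
    0x00A0: [(0x0020,)],
}
expanded = expand(0x2009, fallbacks)
```

With the two data lines quoted above, the expanded list for 2009 ends with 0020, which is the compatibility mapping
from the main UCD file.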

Compliant renderers will have to support the list in the specified order, at least in its claimed version (this list
of standard fallbacks should not be subject to the encoding stability principles like the compatibility mappings in
the UCD, but should still assert that it lists the compatibility mappings). But we could get a smaller data file if
we dropped this requirement, by also dropping fallbacks that are already specified in the UCD, such as:

> # 00A0 ; 0020 # NON-BREAKING SPACE

which would be commented out in a complete version of the file (for clarity), as it is inferable from the UCD
compatibility decomposition mappings.
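
A reader for such a data file could look roughly like this. The syntax is an assumption inferred from the sample
lines in this message (a code point, a semicolon, comma-separated alternatives made of space-separated hexadecimal
code points, and an optional # comment); the function name is invented, and specifiers are not handled.

```python
def parse_fallback_line(line):
    """Parse 'XXXX ; alt, alt ... # comment' into (code_point, [sequence, ...]).

    Returns None for blank lines and for lines that are entirely commented out.
    Each alternative is returned as a tuple of code points.
    """
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    key, _, rest = data.partition(";")
    alternatives = []
    for alt in rest.split(","):
        seq = tuple(int(tok, 16) for tok in alt.split())
        if seq:                            # skip empty alternatives
            alternatives.append(seq)
    return int(key, 16), alternatives
```

A commented-out entry such as the 00A0 line above is simply skipped, so the renderer would then rely on the UCD
compatibility decomposition for that character.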

Likewise, the canonical decompositions need not be specified in this new data file for fallbacks: ALL of them are
implied, and there should be NO attempt to override them, so if a canonically decomposable character is not found in
the current font, the renderer will IMMEDIATELY look for the canonical equivalent in the same font, as if the rule
were present, before retrying at the end with the list of fallback fonts.

A font renderer that finds a mapping for a precomposed character but not for its NFD equivalent should still use
the glyph mapped to the precomposed character for any canonically equivalent string.

Fonts may also be built with mappings only for decomposed sequences: the renderer should be able to locate the
mapped glyphs in the same way. This will simplify the development of fonts, because the other mappings will only be
needed for legacy systems that still don't use this fallback mechanism.
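
The implied canonical fallback can be approximated with the normalization functions in Python's unicodedata module.
This is a sketch under simplifying assumptions: the font is again modelled as a set of mapped code points, the
function name is invented, and only the string as given plus its full NFC and NFD forms are tried, not every
canonically equivalent mixture.

```python
import unicodedata


def find_canonical_form(text, font):
    """Return the first canonically equivalent form of text that the font
    covers entirely, or None if neither the text, its NFC form, nor its
    NFD form is fully mapped.

    font: set of supported code points (stand-in for real glyph mappings).
    """
    forms = (
        text,
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    )
    for form in forms:
        if all(ord(ch) in font for ch in form):
            return form
    return None
```

A font mapping only the precomposed U+00E9 would then still serve the decomposed string "e" followed by U+0301, and
a font mapping only the base letter and the combining accent would still serve the precomposed character.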

Philippe.


This archive was generated by hypermail 2.1.5 : Wed Aug 04 2010 - 15:32:53 CDT