Standard fallback characters (was: Draft Proposal to add Variation Sequences for Latin and Cyrillic letters)

From: verdy_p (verdy_p@wanadoo.fr)
Date: Wed Aug 04 2010 - 15:30:22 CDT

    "Asmus Freytag" wrote:
    > The Fraktur problem is one where one typestyle requires additional
    > information (e.g. when to select long s) that is not required for
    > rendering the same text in another typestyle. If it is indeed desirable
    > (and possible) to create a correctly encoded string that can be rendered
    > without further change automatically in both typestyles, then adding any
    > necessary variation sequences to ensure that ability might be useful.
    > However, that needs to be addressed in the context of a precise
    > specification of how to encode texts so that they are dual renderable.
    > Only addressing some isolated variation sequences makes no sense.

    I don't think so.

    If a text was initially using a round s, nothing prohibits it being rendered in Fraktur style, but even in this
    case, the conversion to "long s" will be inappropriate. So use the Fraktur "round s" directly.

    If a text in Fraktur absolutely requires the "long s", it is only because the original text was already using this
    "long s". In that case, encode the "long s": the text will render with a "long s" both in "modern" Latin font styles
    like Bodoni (with a possible fallback to the modern "round s" if that font does not have a "long s") and in "classic"
    Fraktur font styles (again with a possible fallback to the Fraktur "round s" if the Fraktur font omits the long s
    from its repertoire of supported glyphs).

    In other words, you don't need any variation sequence: "s+VS1" would be strictly encoding the same thing as the
    existing encoded "long s". Adding this variation selector would just be a pollution (an unjustified disunification).
    The two existing characters are already clearly stating their semantic differences, so we should continue to use
    them.

    This does not mean that fonts should not continue to be enhanced, and that font renderers and text-layout engines
    should not be corrected to support more fallbacks (in fact it will be simpler to implement these fallbacks within
    text-renderers, instead of requiring a new font version).

    You can apply the same policy to the French narrow no-break space NNBSP (U+202F, aka "fine" in French), which fonts
    do not need to map, provided that the font renderers or text layout engines correctly infer its best fallback as
    "THIN SPACE", before retrying with the "FOUR-PER-EM SPACE" or "SIX-PER-EM SPACE" characters, then with a standard
    SPACE with a reduced metric...

    That's because fonts never care about line-breaking properties, which are implemented only in text layout engines.
    The same should apply to NBSP if a font does not map it (the text renderer just has to use the fallback to SPACE
    to find the glyph in the selected font), and to the NON-BREAKING HYPHEN (just infer the fallback to the standard
    HYPHEN, then to HYPHEN-MINUS).

    In fact, it would be more elegant if Unicode provided a new property file, suggesting the best fallbacks (ordered by
    preference) for each character (these fallbacks possibly having their own fallbacks that will be retried if all the
    suggested ordered fallbacks are already failing). In most cases, only one fallback will be needed (in very few
    cases, several ordered fallbacks should be listed if the implied sub-fallbacks are not in the correct order of
    resolution).
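
    A minimal sketch of such a lookup within a single font, assuming the property file has already been loaded into a
    table. The table name FALLBACKS, its entries and the function resolve() are illustrative assumptions, not data or
    API from any Unicode file. The font is reduced to the set of code points it maps. Direct fallbacks are tried in
    order first; the fallbacks' own fallbacks are tried only once the whole listed level has failed.

```python
from typing import Dict, List, Optional, Set

# Hypothetical fallback table: code point -> ordered list of fallback code points.
# Entries are illustrative only; no such Unicode property file exists.
FALLBACKS: Dict[int, List[int]] = {
    0x202F: [0x2009, 0x2005, 0x2006, 0x00A0],  # NNBSP -> thin, four-per-em, six-per-em, NBSP
    0x00A0: [0x0020],                          # NBSP -> SPACE
    0x2011: [0x2010, 0x002D],                  # non-breaking hyphen -> hyphen, hyphen-minus
}


def resolve(cp: int, font_cmap: Set[int]) -> Optional[int]:
    """Return the first code point the font maps: cp itself, then its listed
    fallbacks in order, then (level by level) the fallbacks of those fallbacks."""
    if cp in font_cmap:
        return cp
    seen = {cp}
    level = [cp]
    while level:
        next_level: List[int] = []
        for current in level:
            for fallback in FALLBACKS.get(current, []):
                if fallback in seen:
                    continue  # guards against cycles in the table
                seen.add(fallback)
                if fallback in font_cmap:
                    return fallback
                next_level.append(fallback)
        level = next_level
    return None  # nothing in this font; the caller moves on to other fonts


if __name__ == "__main__":
    basic_font = {0x0020, 0x002D}  # a font that maps only SPACE and HYPHEN-MINUS
    print(hex(resolve(0x202F, basic_font)))  # reached through NBSP's own fallback
    print(hex(resolve(0x2011, basic_font)))
```

    The renderer would still have to apply the line-breaking behaviour and the reduced width of the original
    character; the sketch only selects which glyph to borrow.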

    It would avoid selecting glyphs from other fallback fonts with very different metrics. Some of these fallbacks are
    already listed in the main UCD file, but they are too generic (because the compatibility mappings must resolve ONLY
    to non-compatibility decomposable characters). For example NNBSP has a compatibility decomposition as 0020,
    just like many other whitespace characters, so it completely loses the width information.
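
    This loss can be observed with Python's standard unicodedata module. The helper name compat_info() is only
    chosen for this illustration; the two library calls are the module's documented ones.

```python
import unicodedata


def compat_info(cp: int):
    """Return (raw decomposition field, NFKD result) for one code point."""
    ch = chr(cp)
    return unicodedata.decomposition(ch), unicodedata.normalize("NFKD", ch)


if __name__ == "__main__":
    # NBSP, THIN SPACE and NNBSP all decompose to a plain U+0020.
    for cp in (0x00A0, 0x2009, 0x202F):
        field, nfkd = compat_info(cp)
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}: {field!r} -> NFKD {nfkd!r}")
```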

    If we had standardized fallback resolution sequences implemented in text renderers, we would not need to update
    complex fonts, and the job for font designers would be much simpler, and users of existing fonts could continue to
    use them, even if new characters are encoded.

    I took the example of NNBSP because it is a character that was encoded long ago, yet vendors are still
    forgetting to provide a glyph mapping for it (for example in core fonts of Windows 7 such as the new "Segoe UI"
    font, even though Microsoft included an explicit mapping for NNBSP in Times New Roman). It's one of the frequent
    cases where this can be solved very simply by the text-renderer itself.

    The same should be done for providing a correct fallback to "round s" if ever any font does not map the "long s".

    I also suggest that the lists of standard character fallbacks are scanned within the first selected font, without
    trying with other fallback fonts (including multiple font families specified in a stylesheet or generic CSS fonts),
    unless the list of fallback characters includes a specifier in the middle of the list that would indicate
    that all the characters (the original or the fallback characters already specified before ) should be
    searched (this will be useful mostly for symbol/pictograms characters).

    As the ordered list of suggested fallback characters will be then rescanned for other fonts, when it reaches the end
    without finding any fallback in the current font, it is not necessary to include it at end of the list.

    Example of standardized fallback data for "long s" and symbols:

    > 00A0 ; 0020 # NON-BREAKING SPACE
    > 00B2 ; 0032 # EXPONENT DIGIT TWO
    > 2009 ; 200A, 2008, 2005, 2006, 00A0 # NARROW NON-BREAKING SPACE
    > 0283 ; 0073 # LATIN SMALL LETTER ESH
    > 02A6 ; 0074 200D 0073 # LATIN SMALL LIGATURE T-S
    > 02A7 ; 0074 200D 0283 # LATIN SMALL LIGATURE T-ESH
    > 20A7 ; , 0050 200D 0074 200D 0073 # PESETA SYMBOL

    Here the Peseta symbol ("Pts" ligature) will first be searched for in other fonts, before trying to infer a ligature
    of the three letters. Because all other fonts will have been scanned for the precomposed "Pts" symbol ONLY, the
    processing will continue by trying to represent the ligature of the three letters: the renderer will attempt to
    locate such a ligature in the primary font and, as this will likely fail, it will immediately reprocess the
    sequence ignoring the ZWJ characters, so it will show the three letters "Pts", which are very likely to succeed in
    many fonts (in fact almost all Latin fonts).

    If it still fails at this point (because the primary font was not designed for Latin), it reaches the end of the
    list of standard fallbacks, so it will rescan the other fonts for all suggested fallbacks after the last (no
    need to rescan for the symbol), so other fonts will then be scanned successively for the three letters with ZWJ and
    immediately without it, before trying with the next fallback fonts in the specified stylesheet, and then in
    system-specific fallback fonts.
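
    The scanning order described for the Peseta symbol can be sketched as follows. Everything here is an assumption
    made for illustration: OTHER_FONTS is a hypothetical stand-in for the "search the other fonts now" specifier, a
    font is modelled as a name plus the set of strings it can render as a single glyph, and a sequence containing ZWJ
    counts as renderable only if the font has that exact ligature.

```python
from typing import Iterator, List, Optional, Set, Tuple

ZWJ = "\u200d"
OTHER_FONTS = object()  # hypothetical marker: "scan the other fonts for what was tried so far"

Font = Tuple[str, Set[str]]  # (font name, strings rendered as one glyph)


def can_render(text: str, glyphs: Set[str]) -> bool:
    if ZWJ in text:
        return text in glyphs  # a ligature was requested: needs a dedicated glyph
    return all(ch in glyphs for ch in text)


def candidates(text: str) -> Iterator[str]:
    """A sequence is tried as written, then immediately without its ZWJs."""
    yield text
    if ZWJ in text:
        yield text.replace(ZWJ, "")


def resolve_across_fonts(char: str, fallbacks: List[object],
                         fonts: List[Font]) -> Optional[Tuple[str, str]]:
    primary, others = fonts[0], fonts[1:]
    pending: List[str] = []  # tried in the primary font, not yet in the others
    # The trailing marker models the implicit rescan at the end of the list.
    for item in [char, *fallbacks, OTHER_FONTS]:
        if item is OTHER_FONTS:
            for name, glyphs in others:
                for text in pending:
                    for cand in candidates(text):
                        if can_render(cand, glyphs):
                            return name, cand
            pending = []  # no need to rescan these again later
            continue
        for cand in candidates(item):
            if can_render(cand, primary[1]):
                return primary[0], cand
        pending.append(item)
    return None


if __name__ == "__main__":
    peseta = "\u20a7"
    rule = [OTHER_FONTS, "P" + ZWJ + "t" + ZWJ + "s"]
    latin = ("Latin", set("Pts"))
    symbols = ("Symbols", {peseta})
    print(resolve_across_fonts(peseta, rule, [latin, symbols]))
    print(resolve_across_fonts(peseta, rule, [latin]))
```

    With a symbol font available, the precomposed glyph is taken from it; without one, the primary font ends up
    showing the three separate letters.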

    Each listed fallback starts with a qualifier intended for the text renderer: when the listed characters are
    successfully resolved in the current font, the qualifier supplies synthetic rendering information. The
    qualifier in fact will not alter the rendering, it just specifies that no change is necessary to the rendered glyph
    or its metrics and position.

    The , , , , , , , , specifiers are altering the
    rendering appropriately, by synthetic style modifications, or metric modifications (font size, advance width...).
    They may be combined for the same specified fallback...

    If the renderer finds the characters listed in the fallback with a mapping in the currently scanned font, it will
    render the mapped glyph, using the style modifications indicated by the specifiers (such as the equivalent ones
    available in CSS).

    The and specifiers could be used as defined aliases for and
    respectively...

    We could have such data for many of the proposed emojis for emotional faces (most probably using ).

    Note that the fully expanded list (after recursion) should contain somewhere the compatibility mappings listed in
    the main UCD file. For example:

    > 2009 ; 200A, 2008, 2005, 2006, 00A0 # NARROW NON-BREAKING SPACE

    complies with this, because it lists a fallback using U+00A0, which itself falls back with:

    > 00A0 ; 0020 # NON-BREAKING SPACE

    So U+2009 will effectively fall back (at least in the end) to the existing compatibility decomposition in the main
    UCD file. The data above also includes the compatibility fallback of long s to round s already specified in the
    main UCD file.
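
    This requirement can be checked mechanically. The sketch below assumes a simplified reading of the line format
    (code point, semicolon, comma-separated fallback sequences of hex code points, optional "#" comment) and ignores
    specifiers; parse(), expand() and reaches_compat() are names invented for this illustration. The sample lines are
    copied from this message as posted.

```python
import unicodedata
from typing import Dict, List

SAMPLE = """\
00A0 ; 0020 # NON-BREAKING SPACE
00B2 ; 0032 # EXPONENT DIGIT TWO
2009 ; 200A, 2008, 2005, 2006, 00A0 # NARROW NON-BREAKING SPACE
"""


def parse(text: str) -> Dict[str, List[str]]:
    """Map each character to its ordered list of fallback strings."""
    table: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        key, _, rest = line.partition(";")
        table[chr(int(key, 16))] = [
            "".join(chr(int(h, 16)) for h in alt.split())
            for alt in rest.split(",") if alt.strip()
        ]
    return table


def expand(ch: str, table: Dict[str, List[str]]) -> List[str]:
    """Every fallback reachable from ch, direct ones first, then level by level."""
    out: List[str] = []
    queue, seen = [ch], {ch}
    while queue:
        current = queue.pop(0)
        for fallback in table.get(current, []):
            if fallback not in seen:
                seen.add(fallback)
                out.append(fallback)
                queue.append(fallback)
    return out


def reaches_compat(ch: str, table: Dict[str, List[str]]) -> bool:
    """True if the expanded list contains the character's NFKD form."""
    return unicodedata.normalize("NFKD", ch) in expand(ch, table)


if __name__ == "__main__":
    table = parse(SAMPLE)
    for ch in table:
        print(f"U+{ord(ch):04X}", reaches_compat(ch, table))
```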

    Compliant renderers will have to support the list in the specified order, at least in its claimed version. This
    list of standard fallbacks should not be subject to the encoding stability principles like the compatibility
    mappings in the UCD, but should still assert that it lists the compatibility mappings. But we could get a smaller
    data file if we dropped this requirement, by also dropping fallbacks that are already specified in the UCD, such as:

    > # 00A0 ; 0020 # NON-BREAKING SPACE

    which would be commented out in a complete version of the file (for clarity), as it is inferable from the UCD
    compatibility decomposition mappings.

    Likewise, the canonical decompositions need not be specified in this new data file for fallbacks: ALL of them are
    implied, and there should be NO attempt to override them, so if a canonically decomposable character is not found in
    the current font, the renderer will IMMEDIATELY look for the canonical equivalent in the same font, as if the rule
    were present, before finally retrying with the list of fallback fonts.

    A font renderer that finds a mapping for a precomposed character but not for its NFD equivalent should still use
    the glyph mapped to the precomposed character for any canonically equivalent string.

    Fonts may also be built with mappings only for decomposed sequences: the renderer should be able to locate the
    mapped glyphs in the same way. This will simplify the development of fonts, because the other mappings will only be
    needed for legacy systems that still don't use this fallback mechanism.
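
    A minimal sketch of this implied canonical fallback, using Python's unicodedata.normalize(). The font is again
    modelled, as an assumption, by the set of strings it maps; find_canonical() is a name made up for this example.

```python
import unicodedata
from typing import Optional, Set


def find_canonical(text: str, glyphs: Set[str]) -> Optional[str]:
    """Return the canonically equivalent form of text (as given, NFC, or NFD)
    that the font can render, or None if it maps none of them."""
    forms = (
        text,
        unicodedata.normalize("NFC", text),
        unicodedata.normalize("NFD", text),
    )
    for form in forms:
        # Renderable if the whole string is mapped, or each character is.
        if form in glyphs or all(ch in glyphs for ch in form):
            return form
    return None  # only now would the renderer move on to other fonts


if __name__ == "__main__":
    precomposed_only = {"\u00e9"}      # font maps only the precomposed e-acute
    decomposed_only = {"e", "\u0301"}  # font maps only the base letter and the mark
    print(repr(find_canonical("e\u0301", precomposed_only)))
    print(repr(find_canonical("\u00e9", decomposed_only)))
```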

    Philippe.


