Re: Languages supported by UTF8 and UTF16

From: Michael Everson (everson@evertype.com)
Date: Sat Sep 10 2005 - 16:14:26 CDT


    At 13:56 -0700 2005-09-10, Mark Davis wrote:

    >>>(Part of the problem here is that we didn't
    >>>apply the generative model consistently
    >>>enough; had we done that, many of these
    >>>characters could be represented right now by
    >>>sequences.)
    >>
    >>Well you'd have to give examples of what you mean by THAT, Mark.
    >
    >No problem. One example: the SIL proposed 04FA
    >CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
    >could be represented by <U+0413, U+0335, U+0321>.

    Yes, but that generative model sucks, which is
    why we don't use it. At a minimum the overlays
    can cause winding errors with white space over
    the overlapping bits.

    Personally I am a fan of precomposed glyphs (as
    people have been since the dawn of printing).
    They are problematic for our users, so if we can
    limit the problem at least by not going for the
    overlays, that's something.
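
The encoding side of this exchange can be checked with a small sketch using Python's standard unicodedata module. It assumes a Python build whose Unicode data includes U+04FA (the code point proposed above) and that the character carries no canonical decomposition, which is how overlay-based letters are normally treated. The variable names are illustrative only.

```python
import unicodedata

# The sequence quoted above: GHE, short stroke overlay, palatalized hook below.
sequence = "\u0413\u0335\u0321"
# The code point proposed in the thread (assumed present in this Python's data).
precomposed = "\u04FA"

# Normalization neither composes the sequence nor decomposes the single character.
nfc = unicodedata.normalize("NFC", sequence)
nfd = unicodedata.normalize("NFD", precomposed)

print([f"U+{ord(ch):04X}" for ch in nfc])
print([f"U+{ord(ch):04X}" for ch in nfd])
print(nfc == precomposed)
```

If those assumptions hold, the two spellings stay distinct under normalization, so an application comparing strings would treat them as different text even where a font draws them alike.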

    >There are many other examples in Arabic.

    Which is a completely different thing. I disagree.

    >Had we chosen the same mechanism for Arabic that
    >we did for Latin (eg define common characters as
    >precompositions, and resolution to those in NFC,
    >but also supply generative mechanisms for
    >others), then minority writing systems using
    >Arabic wouldn't have to wait for years to have
    >characters encoded for them.

    I disagree. What I do wish is that normalization
    hadn't been locked down before Africa's needs
    were dealt with. Now, thank goodness, we have
    "named sequences" which will guide font
    developers, and there will, I promise you, be a
    good many African named sequences standardized to
    give font developers the guidance African users
    need them to have.

    >Moreover, we would have avoided security issues
    >with these kinds of characters at the same time.
    >See the examples in
    >http://www.unicode.org/reports/tr36/#Single_Script_Spoofing

    Um, well, the security issues are your bugaboo,
    and they are restricted to a narrow range of
    activity vis à vis the UCS.

    >>No, but it's a problem, because font guys
    >>usually precompose, and only precomposed glyphs
    >>are **guaranteed** 'safe' for good, consistent
    >>typography.
    >
    >As you well know, what is a precomposed glyph in
    >a font is orthogonal to what is a precomposed
    >character in Unicode. For example, a font can
    >have a precomposed glyph for
    >
    >LATIN CAPITAL LETTER A WITH MACRON AND GRAVE
    >
    >while it is represented in Unicode by <U+0100
    >U+0300>. (This is one of many listed in
    >http://unicode.org/Public/UNIDATA/NamedSequences.txt)

    The problem (if you haven't been paying
    attention) is that a lot of people have
    precomposed requirements that aren't met by
    precomposed glyphs because font guys don't know
    what to draw. Europe is lucky; all the important
    letters are precomposed. Africa is unlucky; the
    19 million Yoruba speakers do NOT have ANY
    support for their letters from ANY of the three
    main computer platforms (Windows, Mac, Linux).
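
The character-level half of the quoted point (a letter that exists in Unicode only as a sequence) can be seen in a short sketch with Python's standard unicodedata module. It shows only what NFC does to the code points; whether a font has a precomposed glyph for the result is a separate matter that this code cannot test.

```python
import unicodedata

# A, COMBINING MACRON, COMBINING GRAVE ACCENT, fully decomposed.
decomposed = "A\u0304\u0300"

# NFC composes A + macron into U+0100, but there is no single code point
# for the whole letter, so the grave accent remains a combining mark.
nfc = unicodedata.normalize("NFC", decomposed)

print([f"U+{ord(ch):04X}" for ch in nfc])
```

The result is the two-code-point sequence <U+0100 U+0300> cited above, so a font has to map that sequence to one glyph itself if it wants a precomposed drawing.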

    >>Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY platform.
    >
    >I didn't say 'user-tailorable', I said
    >'language-specific tailorings'. These are two
    >very different things. *All* significant modern
    >platforms offer language-specific tailorings.

    For a very very very very very small number of
    languages. What do we do about that?

    >As to the orthogonal issue of user-tailorable
    >collation: certainly the technology is available
    >to customize locales on the user level. For
    >example:
    >
    >1. Go to
    >http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=root&x=col
    >
    >2. In the custom rules box, type (or copy & paste):
    >& c < b <<< B
    >& everyone < Everson
    >
    >3. In the source box, add a few strings, like:
    >Everson
    >everyone
    >Everyone
    >
    >4. Click on the Sort button. You'll see your
    >desired ordering in the Collated box.

    For a start, the default collation orders everson
    before Everson and god before God, which is not
    preferable. The English alphabet is always
    presented Aa Bb Cc, not aA bB cC (watch The
    Simpsons to see), and so this is A Bad Thing. When
    I click on English, I get the same thing, and
    this is NOT what Oxford practice specifies. Then
    when I click on Ireland or the UK it is still
    wrong.
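
The difference between the two orderings can be illustrated with a deliberately simplified sketch in plain Python. This is not the Unicode Collation Algorithm, not ICU, and not CLDR data: it only mimics the idea of comparing letters first and using case as a tie-breaker, and the function name case_key and its upper_first parameter are hypothetical.

```python
def case_key(word, upper_first):
    """Build a two-level sort key: letters first, then case as a tie-breaker."""
    primary = word.casefold()
    if upper_first:
        # Uppercase sorts before lowercase when the letters are otherwise equal.
        tie_break = tuple(0 if ch.isupper() else 1 for ch in word)
    else:
        # Lowercase sorts before uppercase, as in the behaviour complained about.
        tie_break = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, tie_break)


words = ["everyone", "God", "Everson", "god", "everson", "Everyone"]

lower_first = sorted(words, key=lambda w: case_key(w, upper_first=False))
upper_first = sorted(words, key=lambda w: case_key(w, upper_first=True))

print(lower_first)  # ['everson', 'Everson', 'everyone', 'Everyone', 'god', 'God']
print(upper_first)  # ['Everson', 'everson', 'Everyone', 'everyone', 'God', 'god']
```

In a real collator the same choice is a tailoring option, not a hand-written key, but the sketch shows that the two conventions differ only in the tie-break level.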

    I am not very happy with CLDR in this regard.

    >However, collations are very tricky to specify
    >correctly, because of all the issues described
    >in
    >http://www.unicode.org/reports/tr10/#Introduction,
    >so it is no surprise to me that platforms don't
    >choose to offer this as a user-level option.

    I agree with you about that.

    -- 
    Michael Everson * http://www.evertype.com
    

