Re: Languages supported by UTF8 and UTF16

From: Michael Everson (everson@evertype.com)
Date: Sat Sep 10 2005 - 16:14:26 CDT

Next message: Jukka K. Korpela: "Re: Languages supported by UTF8 and UTF16"

Previous message: Mark Davis: "Re: Languages supported by UTF8 and UTF16"
In reply to: Mark Davis: "Re: Languages supported by UTF8 and UTF16"
Next in thread: Mark Davis: "Re: Languages supported by UTF8 and UTF16"
Reply: Mark Davis: "Re: Languages supported by UTF8 and UTF16"
Reply: Patrick Andries: "Re: Languages supported by UTF8 and UTF16"

At 13:56 -0700 2005-09-10, Mark Davis wrote:

>>>(Part of the problem here is that we didn't
>>>apply the generative model consistently
>>>enough; had we done that, many of these
>>>characters could be represented right now by
>>>sequences.)
>>
>>Well you'd have to give examples of what you mean by THAT, Mark.
>
>No problem. One example: the SIL proposed 04FA
>CYRILLIC CAPITAL LETTER GHE WITH STROKE AND HOOK
>could be represented by <U+0413, U+0335, U+0321>.

Yes, but that generative model sucks, which is
why we don't use it. At a minimum the overlays
can cause winding errors with white space over
the overlapping bits.
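
To make the sequence in question concrete, here is a minimal Python sketch that inspects it at the code point level. It assumes the standard unicodedata module with a Unicode database recent enough to include U+04FA (the character was only proposed when this message was written); the variable names are illustrative, not anyone's actual tooling.

```python
import unicodedata

# The combining sequence quoted above: GHE + stroke overlay + hook below.
sequence = "\u0413\u0335\u0321"
# The precomposed character proposed at U+04FA (assumed present in this
# Python build's Unicode database).
precomposed = "\u04FA"

# Show each code point with its name and canonical combining class.
for ch in sequence:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} ccc={unicodedata.combining(ch)}")

# Normalization neither composes the sequence nor decomposes U+04FA.
nfc_sequence = unicodedata.normalize("NFC", sequence)
nfd_precomposed = unicodedata.normalize("NFD", precomposed)

# So the two spellings are not canonically equivalent.
sequences_equivalent = nfc_sequence == unicodedata.normalize("NFC", precomposed)
print("canonically equivalent:", sequences_equivalent)
```

Since U+04FA carries no canonical decomposition, the sequence and the precomposed letter remain distinct strings under every normalization form, whatever a font does with them.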

Personally I am a fan of precomposed glyphs (as
people have been since the dawn of printing).
Sequences are problematic for our users, so if we can
limit the problem at least by not going for the
overlays, that's something.

>There are many other examples in Arabic.

Which is a completely different thing. I disagree.

>Had we chosen the same mechanism for Arabic that
>we did for Latin (eg define common characters as
>precompositions, and resolution to those in NFC,
>but also supply generative mechanisms for
>others), then minority writing systems using
>Arabic wouldn't have to wait for years to have
>characters encoded for them.

I disagree. What I do wish is that normalization
hadn't been locked down before Africa's needs
were dealt with. Now, thank goodness, we have
"named sequences" which will guide font
developers, and there will, I promise you, be a
good many African named sequences standardized to
give font developers the guidance African users
need them to have.

>Moreover, we would have avoided security issues
>with these kinds of characters at the same time.
>See the examples in
>http://www.unicode.org/reports/tr36/#Single_Script_Spoofing

Um, well, the security issues are your bugaboo,
and they are restricted to a narrow range of
activity vis à vis the UCS.

>>No, but it's a problem, because font guys
>>usually precompose, and only precomposed glyphs
>>are **guaranteed** 'safe' for good, consistent
>>typography.
>
>As you well know, what is a precomposed glyph in
>a font is orthogonal to what is a precomposed
>character in Unicode. For example, a font can
>have a precomposed glyph for
>
>LATIN CAPITAL LETTER A WITH MACRON AND GRAVE
>
>while it is represented in Unicode by <U+0100
>U+0300>. (This is one of many listed in
>http://unicode.org/Public/UNIDATA/NamedSequences.txt)

The problem (if you haven't been paying
attention) is that a lot of people have
precomposed requirements that aren't met by
precomposed glyphs because font guys don't know
what to draw. Europe is lucky; all the important
letters are precomposed. Africa is unlucky; the
19 million Yoruba speakers do NOT have ANY
support for their letters from ANY of the three
main computer platforms (Windows, Mac, Linux).
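
The A WITH MACRON AND GRAVE example quoted above can be checked with a short Python sketch. It assumes only the standard unicodedata module; `code_points` is a hypothetical helper introduced here for display.

```python
import unicodedata

# Named sequence from the quoted text: A WITH MACRON followed by COMBINING GRAVE.
named_sequence = "\u0100\u0300"

def code_points(text):
    """Hypothetical helper: list the code points of a string as U+XXXX labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

# NFC keeps two code points, because no single precomposed character exists.
nfc_form = unicodedata.normalize("NFC", named_sequence)
# NFD splits U+0100 into base letter plus combining macron.
nfd_form = unicodedata.normalize("NFD", named_sequence)

print("NFC:", code_points(nfc_form))
print("NFD:", code_points(nfd_form))
```

However a font chooses to draw it, the text itself stays a multi-code-point sequence, so a font needs a single glyph mapped to the whole sequence if it is to render as one precomposed shape.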

>>Mark, we are a lo-o-o-ng way from user-tailorable collation on ANY platform.
>
>I didn't say 'user-tailorable', I said
>'language-specific tailorings'. These are two
>very different things. *All* significant modern
>platforms offer language-specific tailorings.

For a very very very very very small number of
languages. What do we do about that?

>As to the orthogonal issue of user-tailorable
>collation: certainly the technology is available
>to customize locales on the user level. For
>example:
>
>1. Go to
>http://www-950.ibm.com/software/globalization/icu/demo/locales/en/?_=root&x=col
>
>2. In the custom rules box, type (or copy & paste):
>& c < b <<< B
>& everyone < Everson
>
>3. In the source box, add a few strings, like:
>Everson
>everyone
>Everyone
>
>4. Click on the Sort button. You'll see your
>desired ordering in the Collated box.

For a start the default collation orders everson
before Everson and god before God, which is not
preferable. The English alphabet is always
presented Aa Bb Cc not aA bB cC (watch the
Simpsons to see) and so this is A Bad Thing. When
I click on English, I get the same thing, and
this is NOT what Oxford practice specifies. Then
when I click on Ireland or the UK it is still
wrong.
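
The difference between lowercase-first and uppercase-first ordering can be sketched in a few lines of Python. This is a simplified two-level sort key, not the Unicode Collation Algorithm and not what the ICU demo runs; `sort_key` and `upper_first` are hypothetical names. The corresponding CLDR collation setting is, to the best of my knowledge, caseFirst, which is worth checking against the CLDR documentation.

```python
def sort_key(word, upper_first=False):
    """Hypothetical two-level key: letters first, then case as a tie-breaker."""
    # Primary level: compare letters while ignoring case.
    primary = word.casefold()
    # Case level: one flag per character; the smaller flag sorts first.
    if upper_first:
        case_level = tuple(0 if ch.isupper() else 1 for ch in word)
    else:
        case_level = tuple(1 if ch.isupper() else 0 for ch in word)
    return (primary, case_level)

words = ["Everson", "everyone", "Everyone", "everson", "God", "god"]

# Lowercase before uppercase, as in the default ordering described above.
lower_first = sorted(words, key=sort_key)
# Uppercase before lowercase, matching the Aa Bb Cc presentation.
upper_first = sorted(words, key=lambda w: sort_key(w, upper_first=True))

print(lower_first)
print(upper_first)
```

Only the tie-breaking level changes between the two orderings; words that differ in their letters sort the same way in both.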

I am not very happy with CLDR in this regard.

>However, collations are very tricky to specify
>correctly, because of all the issues described
>in
>http://www.unicode.org/reports/tr10/#Introduction,
>so it is no surprise to me that platforms don't
>choose to offer this as a user-level option.

I agree with you about that.

-- 
Michael Everson * http://www.evertype.com


This archive was generated by hypermail 2.1.5 : Sat Sep 10 2005 - 16:15:39 CDT