From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Mon Aug 15 2005 - 16:42:24 CDT
From: "Gregg Reynolds" <unicode@arabink.com>
>> Out of Topic Note: did you notice the placement problem with the
>> COMBINING DOT BELOW in the Verdana font on Windows XP, as shown in my
>> previous message?
>>
> Yep. Verdana COMBINING DOT BELOW is definitely flakely. I looked in
> MSWord and Babelpad.
It's strange, because I only noticed it today, not before. There must have
been an update in my new Windows distribution, or through Windows Update,
because in the past I could see this COMBINING DOT BELOW correctly placed,
and I used it as a way to encode Latin-based African languages that make
heavy use of consonants with dot below, where the precomposed character is
most often missing from fonts, unlike the decomposed combining dot below.
Now, when I look at some pages I composed in the past, I see that all these
dots below consonants appear shifted under the following letter (for
example a vowel, or even the word-separating space that follows a dotted
word-final consonant). So these pages are now broken. I am sure that I
tested them in the past with Verdana, in addition to Arial, Times and
Courier New.
What is strange is that the Verdana font seems to correctly *center* the
combining dot below the following character, so that the horizontal position
of the dot depends on the width of the following character.
For example, if I code <h, dot below, o> or <h, dot below, i>, the dot
appears under the center of the o or i, not under the left of the o, and not
under the right of the i or after it.
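As an aside, the character sequences above can be checked in a quick script. This sketch uses Python's standard unicodedata module (just an illustration, independent of any font); it shows that the decomposed sequence <h, dot below> has a precomposed equivalent, which is why such pages depend on the combining mark only when fonts lack the precomposed glyph:

```python
import unicodedata

# <h, COMBINING DOT BELOW, o>: the dot must attach to the *preceding*
# base letter (h), regardless of the width of the following letter.
seq = "h\u0323o"
for c in seq:
    print(unicodedata.name(c))

# NFC normalization composes <h, dot below> into the precomposed
# U+1E25 LATIN SMALL LETTER H WITH DOT BELOW; the o is left as-is.
composed = unicodedata.normalize("NFC", seq)
print(composed, len(seq), len(composed))   # 3 code points become 2
```

A renderer that ligates the dot with the wrong neighbor, as described above, is a font-table bug; the underlying character sequence is well-formed either way.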
This means that the Verdana font was explicitly instructed to create a
ligature of this combining dot and a base letter, but the combination was
incorrectly encoded in the recent version, and the internal glyph
composition tables are broken there.
I think that Microsoft made changes to its Verdana font to support some
other languages and confused this dot below with other combining dots below
when it unified its glyph with other characters (for example with the glyph
used for the Hebrew combining meteg point).
Such a visual bug does not occur in Arial, Arial Unicode MS, Times New
Roman, Tahoma, or Courier New.
If you read a plain-text email in Outlook or Outlook Express (and probably
other mail tools as well), the rendered text will be incorrect if you have
set up your mail reader with Verdana as the default font for the Latin
script (because it is more comfortable to read than the default Arial font).
Unfortunately, Microsoft does not offer in Outlook or Outlook Express a way
to temporarily select the font used to render emails when they are in plain
text or do not specify a particular font. You have to set and save new
preferences before reading such an email. The only thing that Microsoft and
others offer is to select an alternate charset to decode the message. Why
not have, in the same menu, an option to set another font to read the email?
For example, if the text appears unreadable because the default font does
not render some characters correctly or lacks glyphs for them, selecting
another font would solve the problem.
For the same reason, I feel irritated when I have to reread an email or page
and the mail reader or browser guesses its default encoding incorrectly
again and reuses the default font. Why doesn't the email reader or browser
keep these preferences attached to the email or page, as additional
metadata in its local cache or mail storage?
----
I also feel irritated when an all-English or all-French website is encoded
only with ISO-8859-1 but does not specify it in the HTML or HTTP headers.
When such a page contains VERY FEW non-ASCII letters (notably people's
names containing vowels with diaeresis), IE for example will use its
"autodetect mechanism" and will guess incorrectly that the page is encoded
with Chinese GB2312: it may completely break the HTML structure, or the
text will not be rendered correctly, showing ideographs instead of pairs or
triplets of Latin-1 letters.

The problem here is that the "autodetect" mechanism has too-lax detection
thresholds: it can guess that the page is in Chinese merely because it has
found a single apparent ideograph within a page that contains tens of
kilobytes of plain ASCII. Although this is not strictly related to Unicode,
it shows that the autodetection of encodings has been worked on mostly for
Asian charsets, and has not been trained to support European charsets and
languages (including the ISO-8859-* encodings).

There's a real need to add non-Asian language/charset profiles to the
encoding autodetection mechanism, and to review the mechanism itself (at
least for correct determination of the encoding, even if some ambiguity
remains about the actual language, which would require more advanced
techniques such as lexical lookups). Until then, charset selection will
remain a nightmare for users, and applications should adopt smarter
behavior by letting users select rendering preferences, including font
selection and effective encoding, and by storing these preferences along
with the page cache or mail store (as this cannot be a global configuration
for all pages or emails).

Conclusion: user preferences are good for the accessibility of software, so
that it will work the way users want for most of the content they work
with, but these global settings cannot solve all problems.
Internationalized software must be smarter and should provide ways for
users to override their preferences for specific resources, and then
remember those decisions as a way to effectively "train" the automatic
behaviors such programs offer.
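A quick byte-level sketch (in Python; the particular letter pair is only an illustration) shows why this mis-detection is even possible: two adjacent non-ASCII Latin-1 letters can form one valid GB2312 byte pair, so an undeclared page can legitimately look Chinese to a naive detector:

```python
# Two Latin-1 letters whose bytes happen to form one valid GB2312
# code point: "ÖÐ" is 0xD6 0xD0 in ISO-8859-1, which GB2312 decodes
# as the single ideograph 中.
raw = "ÖÐ".encode("iso-8859-1")
print(raw)                    # b'\xd6\xd0'
print(raw.decode("gb2312"))   # the ideograph 中
```

The robust fix remains declaring the charset explicitly (in the HTTP Content-Type header or a meta element in the page head), so the detector is never consulted at all.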
This archive was generated by hypermail 2.1.5 : Mon Aug 15 2005 - 16:44:21 CDT