Re: internationalization assumption

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Thu Oct 07 2004 - 04:50:13 CST


    Well, the main issue for internationalization of software is not the
    character sets with which it was tested. It is in fact trivial today to
    make an application compliant with Unicode text encoding.

    What is more complicated is making sure that the text will be properly
    displayed. The issues that cause most of the problems fall in the
    following areas:

    - dialogs and GUI interfaces need to be resized according to text lengths

    - a GUI may have been built with a limited set of fonts, all of them with
    the same line height for the same point size; if you have to display Thai
    characters, you'll need a larger line height for the same point size.
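
    With Swing, for example, you can measure the rendered width and line
    height of each localized string and size the control accordingly, rather
    than relying on a fixed layout. A minimal sketch (the padding values are
    illustrative):

        import java.awt.Dimension;
        import java.awt.FontMetrics;
        import javax.swing.JLabel;

        public class LabelSizing {
            // Size a label to fit the translated text, whatever script it uses.
            static void fitToText(JLabel label, String localizedText) {
                label.setText(localizedText);
                FontMetrics fm = label.getFontMetrics(label.getFont());
                int width = fm.stringWidth(localizedText); // varies widely by language
                int height = fm.getHeight();               // e.g. Thai needs a taller line
                label.setPreferredSize(new Dimension(width + 8, height + 4));
            }
        }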

    - some scripts are not readable at small point sizes, notably Han sinograms
    or Arabic

    - the GUI layout should preferably be mirrored for RTL languages.
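
    In Swing, for instance, this mirroring can be driven by the locale. A
    minimal sketch (the frame variable is illustrative):

        import java.awt.ComponentOrientation;
        import java.util.Locale;
        import javax.swing.JFrame;

        public class RtlLayout {
            static void applyOrientation(JFrame frame, Locale locale) {
                // Hebrew and Arabic locales yield RIGHT_TO_LEFT; this flips the
                // layout of the frame and all of its children in one call.
                ComponentOrientation o = ComponentOrientation.getOrientation(locale);
                frame.applyComponentOrientation(o);
            }
        }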

    - you need to be aware of the BiDi algorithm, and you'll have to manage
    the case of mixed directions each time you include portions of text from
    a general LTR script within an RTL interface (for Hebrew or Arabic
    notably): if you ignore this, your application will not insert the
    appropriate BiDi controls that are needed to properly order the rendered
    text, notably for mirrored characters such as parentheses. When
    substituting variables into an RTL resource string, you may need to
    surround the embedded Latin items with an LRE/PDF pair so that they will
    display correctly.
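
    A minimal sketch of that last point, wrapping a substituted LTR value in
    explicit embedding controls before inserting it into an RTL template (the
    Hebrew template string and the helper name are illustrative):

        import java.text.Bidi;

        public class BidiEmbedding {
            static final char LRE = '\u202A'; // LEFT-TO-RIGHT EMBEDDING
            static final char PDF = '\u202C'; // POP DIRECTIONAL FORMATTING

            // Wrap an LTR fragment (a Latin user name, a file path...) so that
            // it keeps its internal order inside an RTL resource string.
            static String embedLtr(String ltrFragment) {
                return LRE + ltrFragment + PDF;
            }

            public static void main(String[] args) {
                String hebrewTemplate = "\u05E9\u05DC\u05D5\u05DD, %s!";
                String message = String.format(hebrewTemplate,
                        embedLtr("C:\\Users\\report_v2.txt"));
                // java.text.Bidi can tell you whether a string needs BiDi
                // processing at all before rendering it.
                System.out.println(Bidi.requiresBidi(
                        message.toCharArray(), 0, message.length()));
            }
        }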

    - GUI controls such as input boxes should be properly aligned so that
    input is performed from the correct side.

    - Tabular data may have to be presented with distinct alignments, notably if
    items are truncated in narrow but extensible columns (traditionally, tabular
    text items are aligned on the left and truncated on the right, but for
    Hebrew or Arabic, they should be aligned and truncated in the opposite
    direction)

    - You have to be aware of the variety of scripts that may be used even in
    a pure RTL interface: a user may need to enter sections of text in
    another script, most often Latin. You have to consider how these foreign
    text items will be handled.

    - In editable parts of the GUI, mouse selection will be more complex than
    you might expect, notably with mixed RTL/LTR scripts.

    - You can't assume that all text will be readable with a fixed-width font.
    Some scripts require using variable-width letters.

    - You have to worry about grapheme clusters, notably in Hebrew, Arabic,
    and nearly all Indic scripts. This is more complex than for Latin, Greek,
    Cyrillic, Han, Hiragana or Katakana texts. Even with the Latin script,
    you can't assume that all grapheme clusters will be made of only one
    character. For various reasons, common texts will be entered using
    combining characters, without the possibility of making precomposed
    clusters (this is especially true for modern Vietnamese, which uses
    multiple diacritics on the same letter).
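
    A minimal sketch of iterating over user-perceived characters instead of
    individual char values, using java.text.BreakIterator (the Vietnamese
    sample is spelled with combining marks on purpose; recent JDKs segment it
    into whole clusters):

        import java.text.BreakIterator;

        public class Graphemes {
            public static void main(String[] args) {
                // "tieng Viet" with combining circumflex/acute/dot below kept
                // as separate combining characters rather than precomposed.
                String text = "ti\u0065\u0302\u0301ng Vi\u0065\u0323\u0302t";
                BreakIterator it = BreakIterator.getCharacterInstance();
                it.setText(text);
                int start = it.first();
                for (int end = it.next(); end != BreakIterator.DONE;
                        start = end, end = it.next()) {
                    System.out.println("cluster: " + text.substring(start, end));
                }
            }
        }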

    - Text handling routines that change the presentation of text (such as
    capitalisation) will not work properly or will not be reversible: even in
    the Latin script, there are some characters that exist in only one case.
    Titlecasing is another issue. Such automated presentation effects should
    be avoided unless you are aware of the problem.
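
    A minimal sketch of why case mapping is neither reversible nor
    locale-independent (both behaviours are standard Unicode case mapping:
    the German sharp s uppercases to "SS", and the Turkish locale maps the
    letter i to a dotted capital I):

        import java.util.Locale;

        public class CaseMapping {
            public static void main(String[] args) {
                // Not reversible: the round trip loses the original spelling.
                String street = "stra\u00DFe";                         // "straße"
                String upper = street.toUpperCase(Locale.GERMAN);      // "STRASSE"
                System.out.println(upper.toLowerCase(Locale.GERMAN));  // "strasse"

                // Locale-sensitive: Turkish "i" uppercases to U+0130.
                Locale turkish = Locale.forLanguageTag("tr");
                System.out.println("i".toUpperCase(turkish));          // "İ"
                System.out.println("i".toUpperCase(Locale.ENGLISH));   // "I"
            }
        }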

    - Plain-text searches often need to be case-insensitive. This issue is
    closely related to collation order, which is sensitive to local
    linguistic conventions, and not only to the script used. For example,
    plain-text search in Hebrew will often need to match text with or without
    vowel marks, which are combining characters, simply because they are
    optional in the language. When this is used to search and match
    identifiers such as usernames or filenames, you will have to expose
    various matching options. In addition, there is a lot of legacy text that
    is not coded with the most accurate Unicode character, simply because it
    was entered with more restricted input methods or keyboards, or was coded
    with more restricted legacy charsets (the 'oe' ligature in French is
    typical: it is absent from ISO-8859-1 and from standard French keyboards,
    although it is a mandatory character for the language; however, it is
    present in Windows codepage 1252, and may be present in texts coded with
    it, because it will be entered through "assisted" editors or word
    processors that can perform autocorrection of ligatures on the fly).
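
    A minimal sketch of locale-aware, loose matching with java.text.Collator:
    at primary strength, case and accent differences are ignored. Whether a
    given locale also treats Hebrew vowel points or the French œ ligature as
    equivalent to their base forms depends on its collation tailoring, so
    test with the locales you actually target:

        import java.text.Collator;
        import java.util.Locale;

        public class LooseMatching {
            public static void main(String[] args) {
                Collator collator = Collator.getInstance(Locale.FRENCH);
                // PRIMARY strength compares base letters only; accents and
                // case are secondary and tertiary differences.
                collator.setStrength(Collator.PRIMARY);
                System.out.println(collator.compare("café", "CAFE") == 0);   // true
                System.out.println(collator.compare("élève", "eleve") == 0); // true
            }
        }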

    - GUI keyboard accelerators may not be workable with some scripts: you can't
    assume that the displayed menu items will contain a matching ASCII letter,
    so you'll need some way to allow keyboard navigation of the interface. This
    issue is related to accessibility guidelines: you need to offer a way for
    users to see which keyboard accelerators they can use to navigate easily in
    your interface. Don't assume that accelerators for one language will be used
    as easily for another language.
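
    One common approach is to load the mnemonic character from the same
    resource bundle as the label text, so that translators can pick a letter
    that actually occurs in their translation. A minimal sketch (the bundle
    name and keys are illustrative):

        import java.util.Locale;
        import java.util.ResourceBundle;
        import javax.swing.JMenuItem;

        public class LocalizedMnemonics {
            static JMenuItem openItem(Locale locale) {
                // Hypothetical bundles: Menus_fr.properties, Menus_he.properties, ...
                ResourceBundle bundle = ResourceBundle.getBundle("Menus", locale);
                JMenuItem item = new JMenuItem(bundle.getString("file.open"));
                String mnemonic = bundle.getString("file.open.mnemonic");
                if (!mnemonic.isEmpty()) {
                    // Some translations may legitimately leave this empty.
                    item.setMnemonic(mnemonic.charAt(0));
                }
                return item;
            }
        }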

    - toolbar buttons should avoid graphic icons with text elements, unless
    these items are also internationalizable.

    - color coding used to add special semantics to text, or even to icons,
    should be avoided, such as the all-too-common European meanings of
    Red/Orange/Green.

    - Sometimes it will be hard to summarize in a short button label the
    action it performs. Using help tooltip texts (also internationalizable)
    will provide a better experience for users when these buttons need to
    display abbreviations.

    The other internationalization issues are much simpler: date and number
    formats, and common words like Yes/No/OK/Cancel/Retry/Abort, are easily
    solved with text resources and common i18n libraries, such as the basic
    common set of CLDR resources.
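
    A minimal sketch of locale-driven date and number formatting with the
    standard Java library (recent JDKs back these formatters with CLDR data):

        import java.text.NumberFormat;
        import java.time.LocalDate;
        import java.time.format.DateTimeFormatter;
        import java.time.format.FormatStyle;
        import java.util.Locale;

        public class LocaleFormats {
            public static void main(String[] args) {
                LocalDate date = LocalDate.of(2004, 10, 7);
                double amount = 1234567.89;
                for (Locale locale : new Locale[] {Locale.US, Locale.FRANCE, Locale.JAPAN}) {
                    DateTimeFormatter df = DateTimeFormatter
                            .ofLocalizedDate(FormatStyle.LONG).withLocale(locale);
                    System.out.println(locale + ": " + df.format(date) + " | "
                            + NumberFormat.getNumberInstance(locale).format(amount));
                }
            }
        }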

    ----- Original Message -----
    From: Mike Ayers
            For Unicode applications, Latin 1 testing is insufficient, even for
    internationalization testing. Internationalization tests should verify, at
    minimum, that characters >u1000 <=uffff (basically, all of the BMP) can be
    used. It is also good to verify >=u10000 support, or at least determine
    whether or not it exists for your application. I usually test English and
    Japanese for BMP conformance. For >BMP, while all the applications I've
    tested so far have specifically excluded this range, I still have a simple
    strategy based upon snipping the Deseret text from James Kass' script links
    page (http://home.att.net/~jameskass/scriptlinks.htm) and using that
    (thanks, James!).
            Note that none of the above at all refers to localization testing,
    which still must be done for every supported language-charset combination
    (this is where Unicode can really pay off by reducing things to 1 charset
    per language). Internationalization testing should only determine the
    ability of your application to handle other languages, it is localization
    testing that determines whether it actually handles a given language, and
    would include such things as text entry and display, text conversion,
    coexistence, etc., as applicable.


