From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 01 2007 - 12:48:14 CDT
There’s no « UTF-16 subset » in Unicode. UTF-16 is not a subset, but a
transform format that allows encoding every part of the UCS.
I suppose you are then speaking about the Basic Multilingual Plane (BMP),
which contains 65536 code points, of which a small share (the surrogates and
a few noncharacters) are invalid for encoding characters.
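The size of the BMP can be checked with a little arithmetic. The sketch
below only counts code points; the variable names are mine, and the result
says nothing about how many of those code points are actually assigned to
characters.

```python
# Code points in the Basic Multilingual Plane: U+0000..U+FFFF.
BMP_SIZE = 0x10000

# Surrogate code points (U+D800..U+DFFF) are reserved for UTF-16 pairs.
surrogates = range(0xD800, 0xE000)

# Noncharacters located in the BMP: U+FDD0..U+FDEF, U+FFFE and U+FFFF.
noncharacters = list(range(0xFDD0, 0xFDF0)) + [0xFFFE, 0xFFFF]

# What remains can hold assigned, private-use or still unassigned characters.
usable = BMP_SIZE - len(surrogates) - len(noncharacters)

print(BMP_SIZE, len(surrogates), len(noncharacters), usable)
```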
But even with that count, you will not be able to correctly cover all the
text that is encoded using only BMP code points: many characters take part in
combinations involving combining diacritics, required ligatures, or
required contextual forms, and some characters have glyphs that are reordered
both before and after the glyph(s) for another character (or even for several
characters).
This means that even a font with 65536 glyphs (the maximum for a TrueType
font) will not be enough to cover each script correctly.
Don’t pursue the idea of implementing such a large font as a bitmap: it
would be excessive and very inefficient.
Note also that for small sizes, like the one suggested, the rendering will be
extremely poor, unless you use coloring and transparency for halftoning the
small details: a 6x8 black-and-white matrix is not enough to show all Latin
characters properly.
As for the memory needed, even if you limit yourself to some subset of all
the strings encodable with BMP characters only, the requirement will be
extremely high for any practical application, especially if your bitmap has
more than 1 bit of color depth.
You don’t even need to ask us how much memory it will take: Unicode does not
specify it. You can do the computation yourself, by just counting the number
of bitmap glyphs you’ll need and summing their sizes according to your own
resolution and color constraints.
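That computation can be sketched as follows. This is only an estimate under
my own assumptions: every glyph is stored as a fixed-size cell, each row is
padded to a whole byte, and there are no index tables or compression. The
function names and the glyph counts used in the example are hypothetical.

```python
def glyph_bytes(width, height, bits_per_pixel=1):
    """Bytes for one glyph cell, with each row padded to a whole byte."""
    row_bytes = (width * bits_per_pixel + 7) // 8
    return row_bytes * height


def font_bytes(glyph_count, width, height, bits_per_pixel=1):
    """Bytes for a font made of glyph_count fixed-size cells."""
    return glyph_count * glyph_bytes(width, height, bits_per_pixel)


# One 6x8 cell at 1 bit per pixel: 8 rows of 1 byte.
print(glyph_bytes(6, 8))                 # 8
# One 14x14 cell at 1 bit per pixel: 14 rows of 2 bytes.
print(glyph_bytes(14, 14))               # 28
# Hypothetical worst case: one 14x14 glyph for each of 65536 glyph slots.
print(font_bytes(65536, 14, 14))         # 1835008
# The same font with 4 bits per pixel for halftoning.
print(font_bytes(65536, 14, 14, 4))      # 6422528
```

Note that this counts glyphs, not code points: any precomposed combination,
ligature or contextual form you choose to store adds to glyph_count.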
So what you need is a renderer made for your embedded system: there are free
or open-source implementations of text renderers and layout engines. I
suggest you go with them, at least for handling the reordering algorithms
that are complex to support.
If you don’t support the BiDi algorithm, or any reordering algorithm at all,
you won’t be able to properly handle texts encoded with one of the standard
Unicode transform formats (UTF-8, UTF-16, UTF-32…) written in RTL scripts;
you also won’t be able to handle most Indic scripts (and others that are
listed, for example in Windows, as “complex scripts” because they need
specific support in the text layout engine).
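As a rough illustration, here is how a system could at least detect that a
string contains strongly right-to-left characters, so that it can refuse or
flag text it cannot lay out. This is not the BiDi algorithm itself, only a
check of the Bidi_Class property through Python's standard unicodedata
module; the function name needs_bidi is mine, and explicit directional
formatting characters are ignored.

```python
import unicodedata

# Bidi classes of strongly right-to-left characters:
# "R" (e.g. Hebrew letters) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def needs_bidi(text):
    """Return True if text contains any strongly right-to-left character."""
    return any(unicodedata.bidirectional(ch) in RTL_CLASSES for ch in text)


print(needs_bidi("hello"))                       # False
print(needs_bidi("\u05e9\u05dc\u05d5\u05dd"))    # True (Hebrew)
print(needs_bidi("abc \u0645\u0631\u062d\u0628\u0627"))  # True (Arabic)
```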
If you then restrict yourself to LTR alphabets only, you will still have
surprises: there are languages that require encoding letters as sequences of
a base letter and one or several diacritics (Vietnamese, written with the
Latin script, requires supporting at least 2 diacritics on vowels).
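To make this concrete, the sketch below decomposes one Vietnamese letter
with the standard unicodedata module. It assumes the text may reach your
renderer in decomposed (NFD) form; the helper names are mine.

```python
import unicodedata


def decompose(text):
    """List the code points of text after canonical decomposition (NFD)."""
    return [f"U+{ord(ch):04X}" for ch in unicodedata.normalize("NFD", text)]


def count_marks(text):
    """Count the combining marks in the decomposed form of text."""
    decomposed = unicodedata.normalize("NFD", text)
    return sum(1 for ch in decomposed if unicodedata.combining(ch))


# U+1EC7 LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW
letter = "\u1ec7"
print(decompose(letter))    # ['U+0065', 'U+0323', 'U+0302']
print(count_marks(letter))  # 2
```

One user-perceived letter is here a base letter plus two stacked marks, one
below and one above, which a fixed 6x8 cell per code point cannot draw.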
Supporting Unicode is generally not done this way. In fact, the problem is
first split by sorting the characters into classes that are specific to
their script, then by some other general properties. Then an engine parses
the text to render and splits it into units to identify “grapheme clusters”;
for each run of grapheme clusters within the same class, an appropriate font
supporting that class is selected. Then the renderer looks into the font and
works with its internal layout rules to find which glyphs will best
represent the text. According to what it finds, it may then completely
reorder the text, and it will also look for the combining characters that
require multiple separate glyphs that are not ordered the same way (this
occurs in Indic scripts for some vowels) and that require transforming the
characters into glyph ids according to the font instructions. Then it can
use that font to draw the entire layout using the glyph ids determined by
this algorithm.
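The first of these steps, splitting the text into units, can be
approximated as below. This is a deliberately simplified sketch and not the
segmentation defined by Unicode (UAX #29): it only attaches every character
whose General_Category is a mark (Mn, Mc, Me) to the preceding character.
The function name simple_clusters is mine.

```python
import unicodedata


def simple_clusters(text):
    """Group each base character with the marks that follow it.

    Simplified approximation of grapheme clusters: it ignores Hangul
    jamo sequences, joiners and the other rules of UAX #29.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch   # mark: attach to the current cluster
        else:
            clusters.append(ch)  # anything else starts a new cluster
    return clusters


# Latin "e" + dot below + circumflex, then "a": 4 code points, 2 clusters.
print(len(simple_clusters("e\u0323\u0302a")))  # 2
# Devanagari KA + VOWEL SIGN I: 2 code points, 1 cluster.
print(len(simple_clusters("\u0915\u093f")))    # 1
```

In the Devanagari example the vowel sign is encoded after the consonant but
drawn before it, so even this single cluster needs glyph reordering by the
layout engine before it can be painted.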
I have deliberately simplified the steps, but the above should convince you
that writing a layout engine yourself will not be an easy task. In addition,
to support the font directives, you need to be able to parse font files and
their instructions (even a bitmap font contains at least basic instructions
for determining the size of each glyph and its position in the bitmap).
If your embedded system has little processing power and little memory, the
best you’ll be able to do is to support basic fonts that won’t cover the
whole set of characters encodable with the abstract characters encoded in
the UCS by ISO and Unicode.
Really, every system starts by supporting one script, then adds other
scripts one at a time. Supporting many scripts at the same time is a large
project, and it would probably be too costly for your project to redevelop
it (given that it has already taken decades to support them in the existing
desktop/server OSes).
So consider using an existing engine (and participate in its development if
there are features still not working for the languages you need to support).
Remember that you can’t develop such a thing alone (there is a lot to learn
from the tricky cases that arise in every script).
Actually, almost all scripts of the world have their own complexities when
it comes to supporting some languages. (The only possible exception is the
modern Korean Hangul script, which is extremely simple compared with the
complexities of the Latin script.)
So consider learning more about the concept of grapheme clusters. If you
don’t understand it, you can’t understand why supporting “only” characters in
the first plane will fail with your approach.
_____
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
behalf of de Brebisson, Cyrille (Calculator Division)
Sent: Wednesday, August 1, 2007 17:48
To: unicode@unicode.org
Subject: questions on implementing an embeded system that supports unicode
Hello,
I have a couple of questions, which will probably make it obvious that I am
a newbie :-)
I am working on an embedded system that supports the UTF-16 subset of
Unicode. I have of course read the FAQ and lots of “high level articles”.
However, I still have a couple of questions, mainly related to displaying
Unicode strings:
- assuming that I use bitmap fonts (6*8 or so for Latin letters and
12*12~14*14 for Asian, and I do not know how big Arabic and similar letters
need to be), how much memory will I need to dedicate to the fonts?
- how critical is the implementation of the RTL languages? It seems to add
quite a lot of complexity to the system and, once again, in a low power
embedded system might not be worth it?
- is there any existing software package that I could start with/use as a
basis in my system so that I do not have to rewrite everything?
Thanks, Cyrille