RE: Latin w/ diacritics (was Re: benefits of unicode)

From: Marco Cimarosti (marco.cimarosti@essetre.it)
Date: Wed Apr 18 2001 - 10:10:31 EDT


James Kass wrote:
> > [...] but would it really take *millions* of dollars for
> > implementing Unicode on DOS or Windows 3.1?
>
> It could be done with, say, Roman Czyborra's Unifont and QBasic.

Why not? Or, even better, with a Unifont-derived BDF font and GNU C++.

> Funding makes the world revolve, free time makes it rotate. It
> probably wouldn't be too difficult to come up with something
> which would provide basic Unicode file viewing under DOS, maybe
> even with input/editing, too. What's really going to slow down
> the DOS application is the look ups required to emulate OpenType
> features needed for complex scripts like Indic or Latin.

I am not so sure that it takes all those superpowers to substitute a few
contextual glyphs. Don't forget that the vast majority of Unicode characters
just require a single glyph.

Unfortunately, I'll probably never find the time (or funds) to demonstrate
it. But, at the point where I stopped years ago, I demonstrated to myself
that at least European and Middle Eastern scripts (with most of the involved
complexities: bidi algorithm, contextual shapes, 2-3 orders of combining
accents) may be displayed and edited under DOS at a speed that is not
noticeably different from any other text editor.

Also consider that a tiny 100,000-row table (such as the Unicode Character
Database or a Unicode font) would not be considered rocket science in the
world of, e.g., relational databases -- which do exist and do perform well
even under DOS, Linux, older Windows, etc.
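
To put a number on it, here is a minimal sketch (not taken from any real
implementation; the table contents and the `lookup` name are made up for
illustration). A sorted 100,000-row table can be searched with a binary
search in about 17 comparisons per lookup:

```python
from bisect import bisect_left

# Hypothetical table: 100,000 sorted code points, each paired with an
# invented record (the row number stands in for real glyph data).
code_points = list(range(0x0020, 0x0020 + 100_000))
records = list(range(100_000))


def lookup(cp):
    """Return the record for code point cp, or None if there is no row."""
    i = bisect_left(code_points, cp)  # binary search over the sorted keys
    if i < len(code_points) and code_points[i] == cp:
        return records[i]
    return None
```

The same idea works in any language that can hold a sorted array in memory
or on disk, which is why the table size alone should not be a problem.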

> > > Pre-composed Latin characters in the PUA don't require
> > > any special rendering support, they'd be rendered the same
> > > as any precomposed BMP Latin character.
> >
> > I thought that the PUA was being considered here as a place to put
> > the extra *glyphs* needed internally by a rendering engine -- not as
> > a direct means of encoding text.
>
> If the PUA is used in order to display Latin Unicode on older
> systems, like Win 9x, the source page in true Unicode would need
> to be converted to a new file using the PUA encodings before it
> could be displayed.

Yes. But this does not necessarily have to involve converting whole files:
the conversion may (and should) be done internally on a line-by-line basis,
in an internal display buffer.
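
A minimal sketch of that idea, under stated assumptions: the PUA
assignments in `PUA_GLYPHS` are invented, the function names are
hypothetical, and only one combining mark per base letter is handled.

```python
# Hypothetical table: (base letter, combining mark) -> PUA code point
# holding a precomposed glyph. These assignments are invented.
PUA_GLYPHS = {
    ("m", "\u0303"): "\uE000",  # m + COMBINING TILDE
    ("q", "\u0301"): "\uE001",  # q + COMBINING ACUTE ACCENT
}


def to_display_buffer(line):
    """Return a display-only copy of line with known pairs replaced."""
    out = []
    i = 0
    while i < len(line):
        pair = (line[i], line[i + 1]) if i + 1 < len(line) else None
        if pair in PUA_GLYPHS:
            out.append(PUA_GLYPHS[pair])  # one PUA glyph replaces two chars
            i += 2
        else:
            out.append(line[i])
            i += 1
    return "".join(out)


def render_file(lines):
    """Convert one line at a time; the stored text is never modified."""
    for line in lines:
        yield to_display_buffer(line)
```

The text in the file stays in real Unicode; only the short-lived buffer
handed to the display code contains PUA values.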

> In TrueType/OpenType, glyphs don't have to be mapped (assigned to
> code points).

This is a myth that I hope to see eradicated as soon as possible.

The only possible way to display Unicode is to map characters to glyphs
according to a set of rules. There is no magic that can avoid this; existing
"smart font" technologies are simply good implementations of this mapping
process.

What does the 'cmap' table do, if not convert code points to another set
of numbers? Leaving aside the fact that the second domain (the glyph
indexes) is not standardized and not uniform across different fonts, what is
the difference -- in terms of performance -- from using pseudo-Unicode scalars
as glyph indexes?

OpenType just adds one or more similar mapping passes, such as the 'GSUB'
table. Graphite and ATSUI also include similar dictionary tables under
their hoods.
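
As a rough illustration only (this is not how any real font engine is
written; all glyph indexes and table contents are invented, and real
'cmap' and 'GSUB' tables are binary structures with many more lookup
types), the two passes can be sketched as plain table lookups:

```python
# Invented tables standing in for a font's 'cmap' and a single
# GSUB-style ligature lookup.
CMAP = {ord("f"): 10, ord("i"): 11, ord("a"): 12}  # code point -> glyph index
LIGATURES = {(10, 11): 20}  # 'f' glyph + 'i' glyph -> 'fi' ligature glyph
NOTDEF = 0  # glyph index used when the font lacks a character


def map_chars(text):
    """First pass: map each code point to a glyph index."""
    return [CMAP.get(ord(ch), NOTDEF) for ch in text]


def substitute(glyphs):
    """Second pass: replace known glyph pairs with a single glyph."""
    out = []
    i = 0
    while i < len(glyphs):
        pair = tuple(glyphs[i:i + 2])
        if pair in LIGATURES:
            out.append(LIGATURES[pair])
            i += 2
        else:
            out.append(glyphs[i])
            i += 1
    return out
```

Whether the numbers in the second domain are called glyph indexes or
pseudo-Unicode scalars, each pass is a dictionary lookup.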

I am not saying that these technologies do anything wrong: just that other
boys in town might be able to do the same thing (possibly better, or faster,
or using less memory).

> Many fonts have glyphs which the user can't access
> directly, this is especially true now with OpenType.

Sorry, I am missing the implication of this.

If the user can't access them it is probably because she doesn't need
them... The rendering engine (whether or not embedded in the font) clearly
can access and use these glyphs when necessary.

> > In the case of PUA being used as a repository of extra glyphs,
> > special rendering support is indeed required: which is, the part of
> > the rendering engine that maps sequences of base letter + diacritics
> > to the precomposed PUA code points.
>
> This could be handled at the input level, the display mechanism
> would only be presented with the final form.

Yes, it is a possibility. For instance, there was a recent discussion about
running the bidirectional algorithm as early as possible, before
rendering proper. Similar approaches may be studied for many other aspects
of Unicode rendering (e.g. Indic reordering, (de)composition, etc.).

However, there is no reason to start the rendering process *so* early as to
affect the encoding in text files. That would simply be *too* early...

IMHO, using the PUA for encoding should be considered only when there is
*no* other option: e.g., when an entire script is not (yet) part of
Unicode.

> In other words, if the goal is to provide support on older OS,
> valid Unicode documents from other sources would need to be
> converted to the PUA. If one makes a file on their system with
> PUA, it would need to be converted to valid Unicode prior to
> posting it to the web or world. The keyboard layout for the
> PUA should be set up to enter the PUA code points.

Not so, as I said above.

Similarly, you don't need to convert a Unicode document to a series of glyph
indexes *before* you deliver it to an OpenType application... The OpenType
rendering engine (Uniscribe) will do this conversion for you at the proper
time (i.e., a millisecond before going to the screen).

> > > The 'cmap' in TrueType fonts for Windows uses double-byte encoding.
> > > (Windows NT supports the new specs which allow multi-byte.)
> >
> > Does this mean that TrueType fonts for Windows NT would be capable of
> > breaking the 64-KB barrier and support a whole Unicode font which also
> > supports extended planes? Really TrueType or OpenType?
>
> No. The new cmap supports more than double-byte in order to access
> non-BMP encodings. The Glyph IDs (the number/order of the glyphs
> in a font) remain locked at 65536 max. Unfortunately this isn't
> expected to change, last I heard.

What a pity!

Maybe it could be one more reason to come up with a small GNU renderer that
supports 0x1000000 glyphs: the embarrassment could move the big players to
update their tables. :-)

Thanks for the info about Pahawh Hmong!

_ Marco


