From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Oct 19 2005 - 03:17:36 CST
On 10/18/2005 9:00 PM, Christopher Fynn wrote:
>
> Raymond Mercier wrote:
>
>> Unicode is meant for the printed text, is it
>> not ?
>
> Not really - at least not without an additional level of markup or
> formatting. Unicode is specifically meant for *plain* text. Printed
> text kind of implies formatting or rich text.
>
Unicode is meant for unambiguously representing text content on computers.
The vast majority of computerized texts are indeed computer
representations of printed material, or of material that can be
rendered using the same typography as printed material.
The distinction that Raymond is aiming at, between texts that use the
(typically more settled) typography of printed materials and texts that
show the much wider variations common to manuscripts, is a valid one.
The representation of actual printed documents does of course require
additional formatting information. At a minimum, it would have to
include a font style, a font size, line spacing and margin information.
While true, this is not what's interesting in this context.
The question here is how to deal with the representation of variable
appearance of what otherwise would be the 'same' text. Where the
variations are fully regular, as in selecting language- or
script-specific forms or punctuation, or selecting positional forms for
Arabic shaping, or where they are fully defined by rules of typography,
like ligatures in many (but not all) languages, deferring to the
rendering or display engine (together with some overall style
information) is clearly the right thing.
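To make the Arabic shaping and ligature cases concrete, here is a
minimal sketch in Python using only the standard unicodedata module.
It assumes NFKC normalization as a stand-in for "fold a glyph-level
code point back to the plain-text character"; the helper name to_plain
and the choice of the letter BEH are illustrative assumptions, not
anything defined by the standard.

```python
import unicodedata

# ARABIC LETTER BEH, the single code point used in plain text.
BEH = "\u0628"

# Legacy compatibility code points for the four positional shapes of BEH.
# Plain text normally does not use these; the renderer picks the shape.
PRESENTATION_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def to_plain(text: str) -> str:
    """Hypothetical helper: fold compatibility forms via NFKC."""
    return unicodedata.normalize("NFKC", text)


for position, form in PRESENTATION_FORMS.items():
    # Every positional shape folds back to the same plain-text letter.
    assert to_plain(form) == BEH
    print(f"{position:9} U+{ord(form):04X} {unicodedata.name(form)}")

# The same holds for a typographic ligature: U+FB01 folds to "f" + "i".
assert to_plain("\uFB01") == "fi"
```

All four shapes map to one character, which is the sense in which the
choice of shape is left to the rendering engine rather than encoded in
the text.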
For isolated variants, the UTC has consistently supported the addition
of explicit character codes, as opposed to requiring the use of some
generic character code with markup for variant selection. Such markup is
not really generic and acts more like a code extension mechanism (for
example, entity definitions in HTML). That raises portability issues and
issues of semantic processing of text. Therefore, avoiding such markup
is clearly the right thing.
Limiting this support to forms attested in print is pragmatic: the
number of variants is much smaller, and their use and appearance are
much more settled than for manuscripts. Beyond the variations in
particular forms, manuscripts may exhibit many other variations (in line
width, line spacing, etc.) that may or may not need to be modeled
when a particular text is computerized for a particular purpose.
Even if it were a better solution to support such modeling directly in
the Unicode Standard (and it isn't), it would present the problem that
the standardization process might well not be able to cope with the pace
at which exceptional documents are likely to be discovered, each of
which would require additional support.
Making the pragmatic choice that the printed tradition represents
sufficiently generic sets of variation to match the task of
standardization is what Raymond had in mind.
A./