From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Wed Oct 19 2005 - 03:17:36 CST
On 10/18/2005 9:00 PM, Christopher Fynn wrote:
>
> Raymond Mercier wrote:
>
>> Unicode is meant for the printed text, is it
>> not ?
>
> Not really - at least not without an additional level of markup or 
> formatting. Unicode is specifically meant for *plain* text. Printed 
> text kind of implies formatting or rich text.
>
Unicode is meant for unambiguously representing text content on computers.
The vast majority of computerized texts are indeed
computer representations of printed material, or material that can be
rendered using the same typography as printed material.
The distinction that Raymond is aiming at, between texts that use the
(typically more settled) typography of printed materials and texts that
show the much wider variations common to manuscripts, is a valid one.
The representation of actual printed documents does of course require 
additional formatting information. At the minimum, it would have to 
include a font style, a font size, line-spacing and margin information. 
While true, this is not what's interesting in this context.
The question here is how to deal with the representation of variable
appearance of what otherwise would be the 'same' text. Where these
variations are fully regular, as in selecting language- or
script-specific forms or punctuation, or selecting positional forms for
Arabic shaping, or are fully defined by rules of typography, like
ligatures in many (but not all) languages, deferring to the rendering or
display engine (together with some overall style information) is clearly
the right thing.
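
As a rough illustration of that division of labor (a sketch only, not
anything prescribed by the standard): plain text stores one code point
for Arabic BEH, and the positional shapes are left to the renderer. The
separately encoded presentation forms exist for compatibility, and
NFKC normalization folds them back to the base letter. The helper name
fold_presentation_forms is made up for this example.

```python
import unicodedata

# U+0628 ARABIC LETTER BEH: the single code point used in plain text,
# whatever position the letter occupies in a word.
BEH = "\u0628"

# Compatibility presentation forms of BEH:
# isolated, final, initial, medial.
BEH_PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def fold_presentation_forms(text):
    """Hypothetical helper: map compatibility forms back to base letters.

    NFKC applies compatibility decomposition, which replaces positional
    presentation forms and typographic ligatures with the characters
    they stand for, then recomposes canonically.
    """
    return unicodedata.normalize("NFKC", text)


if __name__ == "__main__":
    for form in BEH_PRESENTATION_FORMS:
        folded = fold_presentation_forms(form)
        print(unicodedata.name(form), "->", unicodedata.name(folded))

    # The Latin "fi" ligature (U+FB01) is the same kind of case:
    # a typographic form that the renderer can produce from "f" + "i".
    print(repr(fold_presentation_forms("\uFB01")))
```

The point of the sketch is only that the shaped forms carry no extra
text content: once folded, all four map to the same letter, so choosing
among them can safely be left to the display engine.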
For isolated variants, the UTC has consistently supported the addition 
of explicit character codes, as opposed to requiring the use of some 
generic character code with markup for variant selection. Such markup is 
not really generic and acts more like a code extension mechanism (for 
example, entity definitions in HTML). That raises portability issues and 
issues of semantic processing of text. Therefore, avoiding such markup 
is clearly the right thing.
Limiting this support to forms attested in print is pragmatic: the
number of variants is much smaller, and their use and appearance are
much more settled than for manuscripts. Beyond the variations in
particular forms, manuscripts may exhibit many other variations (in line
width, line spacing, etc.) that may or may not need to be modeled
when a particular text is computerized for a particular purpose.
Even if it were a better solution to support such modeling directly in
the Unicode Standard (and it isn't), it would present the problem that
the standardization process might well not be able to cope with the pace
at which exceptional documents are likely to be discovered, each of
which would require additional support.
Making the pragmatic choice that the printed tradition represents 
sufficiently generic sets of variation to match the task of 
standardization is what Raymond had in mind.
A./