From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Mar 13 2007 - 18:09:41 CST
That's an interesting point of view. Indeed, the escape sequences used in many CESes consist of byte values that fall within the ASCII byte range (or sometimes in higher ranges).
But there is no well-defined CES conversion scheme to convert those
sequences to Unicode, except by reusing the corresponding ASCII (or
ISO 8859) mappings: that's something that breaks proper parsing of the rest
of the text at the character-properties level.
It would be better if Unicode had some special ranges of control
characters mapped to byte values that are part of unconverted CES sequences,
like those in the VT100 and VT200 (and later) protocols, or in other legacy
terminal protocols. Such sequences encode colors, cursor control, and other
rich-text enhancements, as well as user-defined bitmaps for custom
characters or glyphs, notably in some East-Asian Teletext systems; trying
to detect which character those bitmaps represent can be difficult, or even
impossible, because they were truly user-defined and local to the document
containing the glyph definitions.
Consider sequences like:
ESC, [, A, I, R
(in a 7-bit or 8-bit encoded document prepared and sent on media that
support VT100-like enhancements).
Or even this one, in Videotex:
ESC, A, I, R
Do they contain the English word "AIR", or the abbreviation "IR" preceded by
an ANSI/VT100-like control sequence? How can we delimit the length of escape
sequences?
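Under ECMA-48 rules the first sequence is actually unambiguous: after "ESC [", parameter and intermediate bytes lie in 0x20-0x3F, and the first byte in 0x40-0x7E terminates the sequence, so "ESC [ A" ends at the "A" (cursor up) and "IR" is plain text. A minimal sketch of that delimiting rule (assuming a 7-bit stream and ignoring string controls like OSC/DCS):

```python
# Split a byte stream into ('ctl', seq) and ('txt', run) chunks using
# the ECMA-48 CSI delimiting rule: ESC [ then bytes in 0x20-0x3F,
# terminated by the first byte in 0x40-0x7E.
def split_csi(data: bytes):
    out, i = [], 0
    while i < len(data):
        if data[i] == 0x1B and i + 1 < len(data) and data[i + 1] == ord('['):
            j = i + 2
            while j < len(data) and 0x20 <= data[j] <= 0x3F:
                j += 1                  # parameter / intermediate bytes
            if j < len(data) and 0x40 <= data[j] <= 0x7E:
                j += 1                  # final byte: sequence complete
            out.append(('ctl', data[i:j]))
            i = j
        else:
            j = i
            while j < len(data) and data[j] != 0x1B:
                j += 1
            out.append(('txt', data[i:j]))
            i = j
    return out

print(split_csi(b'\x1b[AIR'))  # [('ctl', b'\x1b[A'), ('txt', b'IR')]
```

The Videotex case has no such universal rule, which is exactly the ambiguity described above.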
At least with some ISO-based complex sets, we have well-defined registries
and parsing rules for matching the length of the sequences that introduce a
code-page selection (so that, during conversion, the sequence itself can be
filtered out, and the rest of the text interpreted and converted to the
appropriate code points according to the subset mapping). But for most
protocols, we have no such thing.
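For the ISO case, the fixed structure of the designation escape sequences is what makes filtering possible. A sketch of that filtering step, assuming only the common three-byte single-byte-set designations (ESC followed by one intermediate from "(", ")", "-", "." and a final byte); the full ISO-IR registry has more forms:

```python
# Drop ISO 2022 designation sequences during conversion, keeping the
# text. Handles only 3-byte designations (e.g. ESC ( B for US-ASCII);
# multi-byte-set designations such as ESC $ ( F are not covered here.
def strip_iso2022_designations(data: bytes) -> bytes:
    out, i = bytearray(), 0
    while i < len(data):
        if (data[i] == 0x1B and i + 2 < len(data)
                and data[i + 1] in b'()-.'      # G0/G1 designation intermediates
                and 0x30 <= data[i + 2] <= 0x7E):
            i += 3                              # known length: drop the sequence
        else:
            out.append(data[i])
            i += 1
    return bytes(out)

print(strip_iso2022_designations(b'\x1b(BHello'))  # b'Hello'
```

A real converter would of course also switch its mapping table according to the designated set instead of merely discarding the sequence.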
We even lack proper identification of many national terminal protocols (for
example the BBS or Videotex systems, or the newer CESes used for DVD
subtitles, or in DVB-T subtitle channels or EPGs).
This is a problem when preparing documents for later inclusion in
multiplexed media or streams (like MPEG streams, or DVB channels for
satellite, cable, DSL or over-the-air transmission), as they will require
specific software and specific filters.
Note that proper CES identification is also commonly missing from other
widely used protocols (for example SMS messaging on mobile phones), or is
not interoperable internationally and varies between phone operators, each
of which needs its own custom conversion filters when transmitting anything
other than pure ASCII (and the SMS protocols do not even allow defining
custom bitmaps to map the missing Unicode characters that can't be converted
to the recipient's target CES, according to the mobile phone's capabilities).
Even within the same operator, there are many differences in implementation
and international support between mobile phones, and for some languages that
need extra characters, the received message is completely unreadable.
Even though UTF-8 has progressed significantly in this area, many mobile
phones lack the necessary built-in font support and are unable to display
the associated text. That's something the mobile phone operator should
provide for its subscribers, by allowing phones to report their
capabilities, so that the operator can send small bitmaps defining the
missing glyphs along with the UTF-8 encoded message. The phone could then
keep an internal cache of those "custom" glyphs sent by its operator (in
most cases, for mobile phone usage, the glyphs do not need to be scalable
and can be bitmaps in a single size; the device will then adapt its font
size to the default size of the glyphs associated with the characters
present in the text).
Another difficulty is caused by UTF-8 encoded grapheme clusters: most small
devices are unable to implement the complex decoding and layout algorithms,
so that's a case where an encoded grapheme cluster should be re-encodable as
a single PUA code point, and then sent as that PUA character together with a
glyph definition mapped to it. But here also, this means that the pure UTF-8
content of the text must also allow the inclusion of specific control
sequences that are correctly identified, and that won't generate garbage on
devices that don't know those sequences.
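The sender-side re-encoding step could be sketched as follows; the function name, the PUA allocation scheme and the side table are illustrative assumptions, not an existing protocol:

```python
# Hypothetical sketch: replace grapheme clusters the device cannot lay
# out with freshly allocated PUA code points, returning a side table
# that lets the receiver request the matching bitmap glyph.
PUA_START = 0xE000

def pua_encode(text, clusters):
    """clusters: grapheme clusters the target device cannot render."""
    table = {}
    for c in clusters:
        pua = chr(PUA_START + len(table))   # allocate next PUA code point
        table[c] = pua
        text = text.replace(c, pua)
    # invert: PUA code point -> original cluster, for glyph lookup
    return text, {v: k for k, v in table.items()}

# "a" + combining acute as a cluster the device cannot compose itself
msg, glyphs = pua_encode("ka\u0301", ["a\u0301"])
print(repr(msg), glyphs)  # 'k\ue000' {'\ue000': 'á'}
```

The receiver would display the cached bitmap for U+E000 and could still recover the original cluster from the table if it is ever forwarded to a more capable device.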
This is not unreasonable, given that there are still many scripts missing
from the Unicode standard (and glyphs associated with their sequences that
can't be present in today's devices); it's not even easy, or possible, to
upgrade their internal software. But it should be possible to support those
encoded languages on small devices like handheld PDAs with a stylus, where
users can record their glyphs in internal memory and then use them for
messaging over mobile networks. Here again, a protocol will need to be able
to mix glyphs within the transmitted texts, and such a protocol will need
arbitrary byte values (unless the text is encoded in a rich format like XML,
plus generic data compression like deflate during transmission). This is
then no longer a plain-text format but a computer language with a syntax
used to describe the document, even if there is no layout information and no
rich-text information like colours.
A clean way to avoid false parsing when handling those documents in
intermediate gateways that use Unicode-based algorithms would be the ability
to encode arbitrary control sequences made of bytes within Unicode, and send
them as opaque objects. The corresponding characters would no longer be
associated with normal characters, so there would be no risk of some bytes
being converted to unrelated ones because one CES implementation thinks it
is safe to transform a small Latin letter a with acute into an unaccented
small a, even when the originally encoded bytes did not carry this
incorrectly assumed semantic.
But then, there are other solutions besides encoding those bytes in Unicode:
* Maybe this is where PUAs should be used? Note that PUAs are generally
handled with the semantics of symbols, not the semantics of control
characters, so they are counted, for example, when computing line breaks,
and the insertion of line breaks by an agent may break the encoded byte
sequence needed by some origin or target protocols.
* A transport encoding syntax (TES) or escaping mechanism can be
standardized on top of Unicode; this is similar to the approach taken for
emails with the standard MIME transport encodings, allowing some bytes to
have a specific meaning and requiring some bytes of a plain UTF-8 text to be
transformed or escaped. This already makes it possible to transmit arbitrary
Unicode-encoded plain texts even on a medium with restrictions (like a
limited line length, or bytes reserved for the transmission protocol
itself...).
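The second option can be illustrated with a quoted-printable-style round trip; the choice of reserved bytes here (the escape character itself, control bytes, and anything above 0x7F) is an assumption for the example, not any particular standard's rule:

```python
# Toy transport encoding over UTF-8 bytes: reserved bytes are escaped
# as =XX hex pairs (quoted-printable style) so arbitrary UTF-8 survives
# a byte-restricted transport unchanged.
def tes_encode(data: bytes) -> str:
    out = []
    for b in data:
        if b == ord('=') or b < 0x20 or b >= 0x80:
            out.append('=%02X' % b)     # reserved byte: escape it
        else:
            out.append(chr(b))          # safe printable ASCII: pass through
    return ''.join(out)

def tes_decode(s: str) -> bytes:
    out, i = bytearray(), 0
    while i < len(s):
        if s[i] == '=':
            out.append(int(s[i + 1:i + 3], 16))
            i += 3
        else:
            out.append(ord(s[i]))
            i += 1
    return bytes(out)

print(tes_encode('café'.encode()))  # caf=C3=A9
```

The gateway never needs to understand the escaped payload; it only guarantees the bytes arrive intact.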
Another option would be to encode only two new controls in Unicode:
* start control sequence;
* end control sequence.
In between, every code point present would lose its default Unicode
semantics and properties and would have to be treated as an unbreakable
binary-encoded object... A good question is then: what is the semantic of
the whole sequence itself?
* A control?
* A rich-text enhancement?
* A "graphic" PUA (meaning here a complete grapheme cluster) whose semantic
is global to the document?
* A contextual object that affects the rendering or interpretation of the
rest of the document? Is it then safe to extract substrings from the
document? What is the effect of even a simple truncation of the document to
a limited length?
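Whatever semantic is chosen, a gateway applying Unicode algorithms would have to skip the bracketed payload entirely. A sketch of that behavior, stressing that the two sentinels are NOT real Unicode characters; two PUA code points stand in for the proposed controls purely for illustration:

```python
# Hypothetical "start/end control sequence" handling: everything
# between SCS and ECS is an opaque binary object and must be invisible
# to text algorithms (line breaking, truncation, substring extraction).
SCS, ECS = '\uE0FE', '\uE0FF'   # stand-ins for the proposed controls

def visible_text(s: str) -> str:
    """Return only the code points carrying normal Unicode semantics."""
    out, depth = [], 0
    for ch in s:
        if ch == SCS:
            depth += 1          # entering an opaque payload
        elif ch == ECS and depth:
            depth -= 1          # leaving it
        elif depth == 0:
            out.append(ch)
    return ''.join(out)

# The VT100 example from above: the payload no longer pollutes the text.
print(visible_text('A' + SCS + '\x1b[31m' + ECS + 'IR'))  # AIR
```

Truncation would then have to operate on these visible spans while either keeping each bracketed object whole or dropping it entirely, never cutting inside one.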
> -----Original Message-----
> From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
> Behalf Of Doug Ewell
> Sent: Sunday, March 11, 2007 23:28
> To: Unicode Mailing List
> Subject: ISO 6429 control sequences with non-ASCII CES's
>
> ISO 6429 (equivalently ECMA 48, ANSI X3.64) defines terminal control
> sequences using the control characters in the U+0000 - U+001F block.
> Many control sequences begin with Escape (U+001B) and also include other
> characters in the printable Basic Latin block.
>
> I get the impression from reading ECMA 48 that these control sequences
> are defined directly on byte values, not character values. That means
> they could not be used with Unicode character encoding schemes such as
> UTF-16, UTF-7, or SCSU, which represent U+001B as something other than
> the single byte 0x1B. It also means they *could* be used with UTF-8.
> Is this correct?
This archive was generated by hypermail 2.1.5 : Tue Mar 13 2007 - 18:14:08 CST