Re: definition of plain text

From: Peter Cyrus <pcyrus_at_alivox.net>
Date: Sat, 15 Oct 2011 04:37:11 +0200

Ken, your explanation seems more permissive than I had anticipated.

Your example of "3<sup>2</sup>" would seem to me at risk of behaving
in unforeseen ways if, for instance, it were split up. Wouldn't it
match a string "up > 2"? Wouldn't it fail to match 3²? I guess I
thought that plain text should be more canonical.
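
To make the concern concrete, here is a small Python sketch using only the standard library. The search strings are illustrative assumptions, not taken from Ken's message; the point is only that a naive substring search treats tag characters as ordinary text, and that the markup form and the encoded superscript are different strings.

```python
import unicodedata

marked_up = "3<sup>2</sup>"   # markup representation of "three squared"
plain = "3\u00b2"             # "3²", using U+00B2 SUPERSCRIPT TWO

# A naive substring search sees the tag characters as ordinary text,
# so a query that straddles the tag boundary produces an accidental hit.
print("up>2" in marked_up)    # True

# The markup form and the encoded superscript do not match each other.
print(plain in marked_up)     # False

# Compatibility normalization folds the superscript to a plain digit,
# which loses the distinction rather than reconciling the two forms.
print(unicodedata.normalize("NFKC", plain))  # 32
```
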

I was also being coy about my real question, but perhaps I shouldn't
have been. I'm working on a conscript intended as a universal
phonetic script, and even though it is unlikely ever to merit
inclusion in Unicode, I'd like to design it to Unicode's standards.
(For the curious, the work done to date is online at
www.shwascript.org.)

One particularity of this script is that it is written in different
"gaits", depending on the phonology of the language. Languages with
open syllables, like most Niger-Congo or Austronesian languages, would
write it as a syllabary. Languages with fixed syllables, like
Chinese, Korean or Vietnamese, would write it as blocks, like Hangul.
Languages with variable syllables, like most Indo-European languages,
would write it as an alphabet. And Afro-Asiatic languages would write
the vowels as diacritics to highlight the triliteral roots. But all
these gaits would use the same underlying letters, and the same
underlying Unicode PUA characters.

The obvious way to encode this is to add a set of invisible characters
that specify the gait of the following run of plain text. Each would
also serve as the end-of-run character for the preceding run. This
solution seems to me analogous to the use of LTR and RTL to mark runs
for directionality, but I don't know enough about the UBA to know
where the pitfalls are or whether a better solution is feasible.
These gait characters would be ignored in search, which is the desired
behavior.
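
A minimal Python sketch of the scheme just described, under stated assumptions: the four code points are hypothetical PUA assignments chosen only for illustration, Latin letters stand in for the script's letters, and the helper names (`split_runs`, `strip_gaits`, `find`) are invented here.

```python
# Hypothetical gait marks, assigned to arbitrary Private Use Area
# code points purely for illustration.
GAITS = {
    "\uf8f0": "syllabary",
    "\uf8f1": "block",
    "\uf8f2": "alphabet",
    "\uf8f3": "diacritic",
}

def split_runs(text, default="alphabet"):
    """Split text into (gait, run) pairs; each mark ends the previous run."""
    runs, gait, start = [], default, 0
    for i, ch in enumerate(text):
        if ch in GAITS:
            if i > start:
                runs.append((gait, text[start:i]))
            gait, start = GAITS[ch], i + 1
    if start < len(text):
        runs.append((gait, text[start:]))
    return runs

def strip_gaits(text):
    """Remove gait marks so that search compares letters only."""
    return "".join(ch for ch in text if ch not in GAITS)

def find(needle, haystack):
    """Substring search that ignores gait marks on both sides."""
    return strip_gaits(needle) in strip_gaits(haystack)

# Latin letters stand in for the script's own letters.
sample = "\uf8f2abc \uf8f1de\uf8f2 fg"
print(split_runs(sample))
print(find("c d", sample))
```

Because each mark both opens a run and closes the previous one, a lost or deleted mark silently extends the preceding gait over the following text, which is one of the pitfalls I would like advice on.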

Alternatives might include markup or even different fonts, but the
gaits seem to me as much part of the text as the letters themselves.
Writers will have to explicitly change gaits when they want to embed a
Chinese name in an English text, for example. It seems unwieldy to
capture that information at the keyboard and then package it
separately for encoding, transmission and rendering. Nor does
considering it as a case distinction seem elegant.

May I ask for advice?

On Fri, Oct 14, 2011 at 9:17 PM, Ken Whistler <kenw_at_sybase.com> wrote:
> On 10/14/2011 11:47 AM, Joó Ádám wrote:
>>
>> Peter asked what the Unicode Consortium considers plain text, i.e.
>> what principles it applies when deciding whether to encode a certain
>> element or aspect of writing as a character. In turn, you thoroughly
>> explained that plain text is what the Unicode Consortium considered to
>> be plain text and encoded as characters.
>
> Correct. And basically, that is what it comes down to.
>
> One cannot look at *rendered* text and somehow know, a priori,
> exactly how that text should be represented in characters. (In the case
> of most of what is still being considered for encoding, "rendered text"
> means non-digitally printed historic materials, because there isn't any
> character encoding for it in the first place, and hence no compatibility
> encoding issues.)
>
> Sure, there are some general principles which apply:
>
> 1. We don't represent font size differences by means of encoded characters.
>
> 2. We don't represent text coloration differences by means of encoded
> characters.
>
> 3. We don't represent pictures by means of encoded characters.
>
> and so on. Add your favorites.
>
> But character encoding as a process engaged in by character encoding
> committees (in this case, the UTC and SC2/WG2) is an art form which
> needs to balance: existing practice, if any; graphological analysis of
> writing systems; complexity of implementation for proposed solutions
> to encoding; architectural consistency across the entire standard;
> linguistic politics in user communities; and even national body politics
> involved in voting on amendments to the standard.
>
> It is impossible to codify that process in a set of a priori, axiomatic
> principles about what is and is not plain text, and then sit in committee
> and run down some check list and determine, logically, what exactly
> is and is not a character to be encoded. People can wish all they
> want that it were that way, but it ain't.
>
> So yeah, what the Unicode Consortium considers to be plain text is
> what can be represented by a sequence of Unicode characters, once
> those characters ended up standardized and published in the standard.
>
> You can't start at the other end, define exactly what plain text is, and
> then pick and choose amongst the already standardized characters based
> on that definition. Given the universal (including historic) scope of the
> Unicode Standard, that way lies madness.
>
> --Ken
>
>
>
Received on Fri Oct 14 2011 - 21:44:03 CDT
