RE: Purpose of plain text

From: Doug Ewell <doug@ewellic.org>
Date: Tue, 15 Nov 2011 10:14:42 -0700

If the underlying proposal is to unify all the Indic scripts, after they
have been disunified (and updated independently) in Unicode for 20
years, then there is really no point in continuing this discussion on
the Unicode list.

Unicode has not always been 100% consistent in its principles of what to
encode and what not to encode, but one fundamental concept has always
been the rejection of font hacks.

--
Doug Ewell | Thornton, Colorado, USA | RFC 5645, 4645, UTN #14
www.ewellic.org | www.facebook.com/doug.ewell | @DougEwell

-------- Original Message --------
Subject: Re: Purpose of plain text
From: Christoph Päper <christoph.paeper@crissov.de>
Date: Tue, November 15, 2011 6:11 am
To: Unicode Discussion <unicode@unicode.org>
Doug Ewell:
> How can I search a group of documents, one written in Devanagari and another in Sinhala and another in Tamil and another in Oriya, for a given string if they all use the same encoding, and the only way to tell which is which is to see them rendered in a particular font?
That question would make no sense if you didn’t consider them
different scripts.
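For concreteness, here is a minimal sketch (Python 3, standard
unicodedata module; the example is mine) of how the current,
disunified encoding already separates the scripts at the code point
level, so a plain-text search needs no font information:

    import unicodedata

    # Under today's encoding, each script's "ka" is a distinct code point,
    # so a substring search distinguishes the documents without rendering them.
    for ka in ("\u0915", "\u0B15", "\u0B95", "\u0D9A"):
        print(f"U+{ord(ka):04X}  {unicodedata.name(ka)}")
    # -> DEVANAGARI LETTER KA, ORIYA LETTER KA, TAMIL LETTER KA,
    #    SINHALA LETTER ALPAPRAANA KAYANNA
    # A unified "font-hack" encoding would give all four the same code point,
    # and only the rendering font would reveal the script.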
It is a good indication in favor of unification, for example, when you
use your local script for all loan words from other languages that use
related scripts, but there are counter-examples:
— Japanese Katakana and Hiragana developed from the same source for
the same language, so that criterion does not apply.
— In German Fraktur texts you will see modern Romance borrowings or
English xenisms set in Antiqua (but old Greek and Latin loan words in
Fraktur), which would argue for disunification by the above criterion.
I have no (sufficient) idea how it works in India and its South Asian
neighbors. Most books I read on writing or scripts or writing systems
look at each system – identified by varying definitions – in
isolation and connect them only by descent, not by discrimination.
> Latin (Antiqua) and Fraktur and Gaelic letters are, intrinsically, the same letter. That is not true for Devanagari and Sinhala and Tamil and Oriya letters.
If I understand Naena Guru correctly, they want to unify all the
Brahmic/Indic scripts (similar to ISCII) and, furthermore, unify them
(in a transliterating manner) with the Roman script. The second part is
silly, unless there is a romanization movement I’m unaware of.
Whether to draw the line between two related scripts or between two
hands (fonts, typefaces, …) of the same script is sometimes an
arbitrary, yet informed, decision.
In “Euroscript” – the combination of the Cyrillic, Greek and Roman
scripts – some uppercase letters look the same most of the time, but
the lowercase letters (of those look-alikes) differ, often quite a lot.
That alone is reason enough not to unify them. Yet each of the scripts
shows similar glyphic variation for all of its letters; only if two of
those variants can be used in the same text for different purposes does
one have to distinguish them in the encoding, too. This only applies
below the lexical level, though: an italic ‘a’ inside an otherwise
upright word, or vice versa, is still the same letter, but an isolated
italic ‘a’ may need to be distinguished from an upright ‘a’. Since
this most often happens in formulae, it comes down to the question
whether you want to be able to encode more notations (incl. IPA
phonetics) than written language, i.e. “true writing”.
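To make that concrete, a small sketch (Python 3, unicodedata; my own
illustration): the three visually identical capitals are kept apart
because their lowercase forms diverge, and the isolated formula ‘a’
even has a code point of its own:

    import unicodedata

    # Capitals that usually look identical but belong to three scripts
    # whose lowercase forms (a, α, а) differ considerably:
    for ch in ("\u0041", "\u0391", "\u0410"):
        print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}")
    # -> LATIN CAPITAL LETTER A, GREEK CAPITAL LETTER ALPHA,
    #    CYRILLIC CAPITAL LETTER A

    # An italic 'a' inside a word is still plain U+0061, styled by the font;
    # an isolated italic 'a' in a formula has its own code point:
    math_a = "\U0001D44E"
    print(f"U+{ord(math_a):04X}  {unicodedata.name(math_a)}")
    # -> MATHEMATICAL ITALIC SMALL A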
Alphabetic scripts, i.e. those that use vocalic and consonantal letters
at the same level and no diacritics, are by definition the easiest to
encode digitally. For the rest, however, there is more than one way to
skin a cat. It is quite possible that the Brahmic/Indic family wasn’t
encoded in the best way, for two reasons: related scripts could have
been unified, and you can approach most of them from at least two
directions (segmental or syllabic). Of course, it’s probably too late
to change now.
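A rough sketch of the two directions (Python 3, unicodedata; my own
example), comparing the segmental approach Unicode took for Devanagari
with the syllabic approach it took for precomposed Hangul:

    import unicodedata

    # Segmental: the Devanagari syllable "ki" is a consonant letter
    # plus a dependent vowel sign.
    ki = "\u0915\u093F"
    print([unicodedata.name(c) for c in ki])
    # -> ['DEVANAGARI LETTER KA', 'DEVANAGARI VOWEL SIGN I']

    # Syllabic: precomposed Hangul has one code point per syllable block,
    # which canonically decomposes into its constituent jamo.
    han = "\uD55C"
    print(unicodedata.name(han))  # HANGUL SYLLABLE HAN
    print([unicodedata.name(c) for c in unicodedata.normalize("NFD", han)])
    # -> HANGUL CHOSEONG HIEUH, HANGUL JUNGSEONG A, HANGUL JONGSEONG NIEUN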
It’s tough to find a definition that fairly and usefully distinguishes
symbol, sign, mark, letter, character, stroke, diacritic, glyph, graph,
grapheme, frame … If you have one and can get everybody to agree on
it, you then still have to decide which of the entities to encode, at
which software layer to render them, and how to type them on a keyboard. You
have to stick to that decision.
Sadly I haven’t seen a good definition, not everyone agrees, and
deviation from the decision is common.
Unicode, for instance, usually tries to encode what it thinks of as
characters, but under certain conditions it does accept letters, e.g.
precomposed characters (including Hangul syllabograms and CJKV
sinograms), and symbols, e.g. emoticons.
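A tiny illustration of the precomposed-letter exception (Python 3,
unicodedata; my own example): a precomposed letter and its
character-plus-combining-mark spelling are distinct code point
sequences that only normalization brings together:

    import unicodedata

    precomposed = "\u00E9"      # LATIN SMALL LETTER E WITH ACUTE
    combining   = "e\u0301"     # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT
    print(precomposed == combining)                                # False
    print(unicodedata.normalize("NFC", combining) == precomposed)  # True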