On 31/10/2018 at 11:21, Asmus Freytag via Unicode wrote:
>
> On 10/31/2018 2:38 AM, Julian Bradfield via Unicode wrote:
>
> > You could use the various hacks
> > you've discussed, with modifier letters; but that is not "encoding",
> > that is "abusing Unicode to do markup". At least, that's the view I
> > take!
>
> +1
There seems to be widespread confusion about what plain text is, and what
Unicode is for. From a US-QWERTY point of view, the prevailing mental picture
of plain text may be ASCII-only. UK-QWERTY (not extended) adds vowels with acute.
Unicode grants every language its plain text representation. If superscript
acts as an abbreviation indicator in a given language, this is part of the plain
text representation of that language.
So far, so good. The core problem is now to determine whether superscript is
mandatory and baseline is the fallback, or superscript is optional and decorative
and baseline is correct. That may be a matter of opinion, as has been suggested.
However, we now know of a list of languages where superscript is mandatory and
baseline is the fallback. Leaving English aside, these languages by themselves
need the UTC to grant them the use of preformatted superscript letters.
Back in the beginning, when the Standard was first set up, superscript
was ruled out of plain text, except where there was strong lobbying of some sort,
as when Vietnamese precomposed letters were added. Phoneticians have a strong
lobby, so they got some ranges of preformatted letters. To make sure nobody
would dare use them in running text elsewhere, all *new* superscript letters got
names on a MODIFIER LETTER basis, while subscript letters got straightforward names
having SUBSCRIPT in them. Additionally, strong caveats were published in TUS.
And the trick worked: most of the time, people now refer to the superscript
letters by the “modifier letter” label that Unicode has decked them out with.
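The naming pattern described above can be checked against the character
database. The following is only a minimal sketch, assuming Python's standard
unicodedata module; the sample characters are picked for illustration and are
not a list taken from this thread.

```python
import unicodedata

# Illustrative sample (an assumption, not an exhaustive list):
# two ordinal indicators, two older superscript letters, one newer
# superscript letter, one subscript letter, and one superscript digit.
samples = ["\u00aa", "\u00ba", "\u2071", "\u207f", "\u1d49", "\u2091", "\u00b2"]

# Look up the formal character name of each sample.
names = {ch: unicodedata.name(ch) for ch in samples}

for ch, name in names.items():
    print(f"U+{ord(ch):04X}  {ch}  {name}")
```

In this sample, the older superscript letters and the subscript letter carry
SUPERSCRIPT or SUBSCRIPT in their names, while the newer superscript "ᵉ" is
named as a modifier letter.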
That is why, today, any discussion risks being subject to strong biases
when its outcome would allow some languages to use their traditional abbreviation
indicators in an already encoded and implemented form. Fortunately the front has
begun to move, as the CLDR TC has granted ordinal indicators to the French locale
as of v34.
Ordinal indicators are one category of abbreviation indicators. Consistently, the
ordinal indicators already present in ISO/IEC 8859-1, and now in Unicode, are also
used in titles like "Sª" and "Nª Sª", as found in the navigation pane of:
http://turismosomontano.es/en/que-ver-que-hacer/lugares-con-historia/monumentos/iglesia-de-la-asuncion-peralta-de-alcofea
I doubt that anyone would still argue that this string is not understood
differently from "Na Sa".
> In general, I have a certain sympathy for the position that there is no universal
> answer for the dividing line between plain and styled text; there are some texts
> where the conventional division of plain text and styling means that the plain
> text alone will become somewhat ambiguous.
That is why phonetics needs preformatted super- and subscripts, and so do languages
relying on superscript as an abbreviation indicator.
> We know that for mathematics, a different dividing line meant that it is possible
> to create an (almost) plain text version of many (if not most) mathematical
> texts; the conventions of that field are widely shared -- supporting a case for
> allowing a standard encoding to support it.
Referring to Murray Sargent’s UnicodeMath, a Nearly Plain Text Encoding of Mathematics,
https://www.unicode.org/notes/tn28/
is always a good point in this discussion. UnicodeMath uses the full range of
superscript digits, because that range is complete. It does not use superscript
letters, because their range is not complete. Hence if superscript digits had
stopped at the legacy range "¹²³", only measurement units like the metric
equivalents of sq ft and cu ft could be written with superscripts, and that is
already allowed according to TUS. I don’t know why superscript 1 was added to
ISO/IEC 8859-1, though. Anyway, since phonetics needs a full range of superscript
and subscript digits, these were added to Unicode, and therefore are used in
UnicodeMath.
Likewise, phonetics needs a nearly full range of superscript letters, so these were
added to Unicode, and therefore are used in the digital representation of natural
languages.
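How complete the superscript digit and letter ranges are can be probed from the
character database. The sketch below assumes Python's standard unicodedata
module and treats a character as a superscript form when its compatibility
decomposition is tagged <super> and maps to a single base character; that
criterion is an assumption made for illustration, not a definition from TUS.

```python
import string
import sys
import unicodedata

def superscript_forms():
    """Map each base character to the characters that decompose to '<super> base'."""
    forms = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp.startswith("<super> "):
            continue
        targets = decomp.split()[1:]
        if len(targets) == 1:  # skip multi-character cases such as the trade mark sign
            base = chr(int(targets[0], 16))
            forms.setdefault(base, []).append(chr(cp))
    return forms

forms = superscript_forms()

# Base digits and small Latin letters that have no superscript form at all.
missing_digits = [d for d in string.digits if d not in forms]
missing_letters = [c for c in string.ascii_lowercase if c not in forms]

print("digits without a superscript form: ", missing_digits)
print("letters without a superscript form:", missing_letters)
```

The list of missing letters depends on the Unicode version bundled with the
Python interpreter, so it may differ from the situation at the time of writing.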
> However, it stops short of 100% support for edge cases, as does the ordinary
> plain text when used for "normal" texts. I think, on balance, that is OK.
That is not clear as long as “ordinary plain text” is not defined for the purpose
of this discussion. Since I have superscript small letters on live keys, and the
superscript "ᵉ" is even duplicated on the same level as the digits (most of which
it turns into ordinals), my French keyboard layout driver allows the OS to output
ordinary plain text consisting of various signs, including superscript small
Latin letters.
Now, does Unicode make a distinction between “plain text” and “ordinary plain text”?
There are various ways to “clean up” the UCS, first removing presentation forms,
then historic letters, then mathematical symbols, then why not emoji, and somewhere
in between, phonetic letters, among them the superscripts. The result would then be
“ordinary plain text”, but to what purpose? Possibly so that all documents must be
written up using TeX. Following that logic to its end would mean that precomposed
letters should be removed too, given that they are accurately represented using
escape sequences like "\'e" for "é".
> If there were another important notational convention, widely shared,
> reasonably consistent and so on, then I see no principled objection to considering
> whether it should be supported (minus some edge cases) in its own form of
> plain text (with appropriate additional elements encoded).
I’m pleased to read that. Given that the use of superscript in French is important,
widely shared, and reasonably consistent, we need to know what else it should be.
Certainly: supported by the local keyboard layout. Hopefully it will be, soon.
> The current case, transcribing a post-card to make the text searchable, for
> example, would fit the use case for ordinary plain text, with the warning against
> simulated effects of markup.
Triggering such a warning would first require sorting out whether a given
representation is best encoded using plain text or using markup. If it’s plain
text, then that is not simulating anything. The reverse is true: markup simulates
accurate plain text.
Searchability is ensured by equivalence classes. Google Search has the most
comprehensive equivalence classes, even indexing all preformatted mathematical
Latin letters as plain ASCII.
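One way to picture such equivalence classes is compatibility normalization. The
following is a minimal sketch, assuming Python's standard unicodedata module and
using NFKC plus case folding as a stand-in; how any particular search engine
actually builds its index is not known here, and search_key is a hypothetical
helper name.

```python
import unicodedata

def search_key(text: str) -> str:
    """Fold compatibility variants and case so that variants share one key."""
    return unicodedata.normalize("NFKC", text).casefold()

# Illustrative inputs: a title with ordinal indicators, its baseline fallback,
# a French ordinal with superscript e, and mathematical bold letters "Abc".
examples = ["N\u00aa S\u00aa", "Na Sa", "2\u1d49", "\U0001d400\U0001d41b\U0001d41c"]
keys = [search_key(s) for s in examples]

for original, key in zip(examples, keys):
    print(f"{original!r} -> {key!r}")
```

The stored text keeps its superscripts; only the derived key is folded, so a
query typed on a baseline-only keyboard can still match.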
> All other uses are better served by markup, whether
> SGML / XML style to capture identified features, or final-form rich text like PDF
> just preserving the appearance.
Agreed.
Best regards,
Marcel
Received on Wed Oct 31 2018 - 09:57:47 CDT