Re: U+2018 is not RIGHT HIGH 6 from Michael Probst on 2012-05-03 (Unicode Mail List Archive)

From: Michael Probst <michael.probst03_at_web.de>
Date: Thu, 03 May 2012 08:19:03 +0200

On Sunday, 29.04.2012, 23:43 -0700, Asmus Freytag wrote:
> Even if some minutiae of glyph selection are left to a font, the
> problem is often that there's no specification as to what certain
> languages need, so that fonts cannot be expected to provide the
> correct implementation.

Strange as this matter is, the glyph might even be the same. The
difference in semantics becomes clear only when the overall design of a
font relies on language-specific usage. Then it becomes a bit like
specifying "close" with a ( to get ), or like specifying "Greek" and
"small" with A (U+0041) to get α (U+03B1), or "Maths", "sans-serif",
"bold" and "italic" with that to get 𝞪 (U+1D7AA).

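The code points named above can be checked with a minimal Python sketch
using the standard unicodedata module. The helper name describe is an
illustrative assumption and not something taken from this thread.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return 'U+XXXX NAME' for a single character (hypothetical helper)."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

latin_a = "\u0041"
greek_alpha = "\u03B1"
math_alpha = "\U0001D7AA"

for ch in (latin_a, greek_alpha, math_alpha):
    print(describe(ch))

# Compatibility normalization folds the styled mathematical letter
# back to the plain Greek letter, so the styling is lost on that level.
folded = unicodedata.normalize("NFKC", math_alpha)
print(folded == greek_alpha)
```

The two alphas are separate characters in plain text, and only an
explicit compatibility normalization treats them as the same letter.
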
> When Unicode was first created, the fact that one and the same
> quotation mark character could be both opening and closing was not
> widely realized in the character encoding community.

It rather seems that it is unclear whether to encode

* glyphs: like U+002E (with disc- and square-like variants), a mere "low
dot" to mean 'end of sentence', 'decimal point', 'hexadecimal point',
'sexagesimal point' or 'separator';

* graphemes (most of them),

* or abstract semantics: "minus" and "en dash", for example, which may
be something of an opposite of graphemes: they may look the same but
mean something different;

-- and where these are mixed up they are just called "unified
characters" :-)

> This was rectified over time, and now there is detailed information
> (even though it may not be exhaustive) on common practices in chapter
> 6 of the standard.

Thanks for pointing me to that chapter. I had expected to find the RIGHT
HIGH 6(6) in either the current or the rejected proposals or some
discussion (overlooking the one in 2006) and did not refer to the
standard as they are obviously not encoded.

I would say, though, the rectification is still in progress, and still
required.

> So far, this information is limited to character usage (which
> character code when). Augmenting that with information on required
> design differences, that is elements of glyph variations that are
> encompassed by certain of the characters, and how they track with
> language, would round out the picture.

I do not think this is about glyph variation but rather an incomplete
idea of what a code point points at:

        An abstract character has no concrete form and should not be
        confused with a glyph.

        -- 3.4 Characters and Encoding

Anyone intending to write unambiguous plain text will find an abstract
character missing:

             A. (…)
             B. […]
             C. “…”
             D. „…

Abusing “ (U+201C) at the end of D, relying on the appearance, is
confusing glyph with character.
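
As a rough illustration, the General_Category property already records
an opening or closing role for most of these characters. The sketch
below uses Python's standard unicodedata module; the function name role
is an assumption made for this example only.

```python
import unicodedata

# Ps = open punctuation, Pe = close punctuation,
# Pi = initial quote, Pf = final quote.
def role(ch: str) -> str:
    """Return the General_Category of one character (hypothetical helper)."""
    return unicodedata.category(ch)

for ch in "()[]\u201C\u201D\u201E":
    print(f"U+{ord(ch):04X} {role(ch)} {unicodedata.name(ch)}")
```

Closing case D with U+201C therefore means ending a quotation with a
character whose own property value says "initial quote".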

        Paired punctuation marks containing the qualifier “left” in
        their name are taken to denote opening; characters whose name
        contains the qualifier “right” are taken to denote closing.

        -- Paired Punctuation, in 6.2 General Punctuation

It does not help to rephrase or delete that sentence, or to simply state

        All other quotation marks may represent opening or closing
        quotation marks depending on the usage.

        -- Consequences for Semantics, in Language-Based Usage of
        Quotation Marks, in 6.2 General Punctuation

because 'with quotes just use something that might be rendered like what
you mean' is in stark contrast to many other things:

1) The various brackets are encoded with their "open" and "close"
information.

2) The "mathematical context" is encoded with that tilde:

        For mathematical usage, U+223C “~” tilde operator should be used
        to unambiguously encode the operator.

        -- Tilde, in Dashes and Hyphens, in 6.2 General Punctuation

One is not forced to use U+2053 and provide "mathematical context" on a
higher level than plain text.

3) It is very common to provide "sans-serif", "bold" and "italics" on a
level higher than plain text, yet Unicode enables one to write
unambiguous plain text with both α (U+03B1) and 𝞪 (U+1D7AA).

4) Unicode enables one to write plain text that unambiguously
distinguishes between "minus" − (U+2212), "en dash" – (U+2013) and
"figure dash" ‒ (U+2012), which might be rendered identically, instead
of requiring that these differences be made on a higher level and merely
stating: U+002D may represent all kinds of more or less central, shorter
or longer dashes depending on the usage.

5) One is even free to turn a glyph variant U+201F (‟) into a grapheme
on the plain text level by giving it a meaning that is not the same as
U+201C (“).
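
The dash and tilde characters from points 2) and 4) can be listed with
their names and categories in a short Python sketch. It uses only the
standard unicodedata module; the names CODE_POINTS and properties are
assumptions made for this illustration.

```python
import unicodedata

# Characters from the dash and tilde examples in this message.
CODE_POINTS = [0x002D, 0x2012, 0x2013, 0x2212, 0x007E, 0x2053, 0x223C]

def properties(cp: int) -> tuple:
    """Return (general category, character name) for a code point."""
    ch = chr(cp)
    return unicodedata.category(ch), unicodedata.name(ch)

for cp in CODE_POINTS:
    cat, name = properties(cp)
    print(f"U+{cp:04X} {cat} {name}")

# Look-alike strings stay distinct on the plain-text level.
same = "3 \u2212 2" == "3 \u2013 2"
print(same)
```

However similar the glyphs look, a comparison on the plain-text level
keeps "minus" and "en dash" apart without any higher-level markup.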

So there may be a mistake …

… as this may be:

        The semantics of U+201A and U+201B low-9 quotation marks are
        always opening;

        -- Consequences for Semantics, in Language-Based Usage of
        Quotation Marks, in 6.2 General Punctuation

In Albanian and Greek they may be closing.

        (This is not exactly scientific, but such topics tend to be
        quite comprehensive with a reasonable probability of
        correctness:

http://de.wikipedia.org/wiki/%22#Andere_Sprachen
:-)

Michael

P.S.

As far as I understand the situation, RIGHT HIGH 6(6) have not been
encoded in order to optimise the abstraction of characters, as the
abstractness of brackets can be increased:

Nestable not-so-abstract characters like ( ) [ ] { } 〈 〉 can be
replaced by only half of them plus 2, namely COMBINING OPEN and CLOSE;
thus, for example

(level 1 (level 2) level 1)

becomes

(COlevel 1 (COlevel 2(CC level 1(CC

which is evidently more abstract. The current state is 33 % less
abstract for the given brackets; as this is applicable to all nestable
pairs the abstraction of such abstract characters can almost be doubled
(as their number can almost be halved).
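
For what it is worth, the substitution above can be sketched in a few
lines of Python. CO and CC are plain two-letter stand-ins for the
imagined COMBINING OPEN and COMBINING CLOSE, which do not exist in
Unicode; the function name to_combining and the choice of U+3008 and
U+3009 for the angle brackets are assumptions of this sketch.

```python
# Hypothetical markers standing in for COMBINING OPEN / COMBINING CLOSE.
CO = "CO"
CC = "CC"

# Each closing bracket maps to the opening bracket used as the shared base.
# The angle pair is assumed to be U+3008 / U+3009.
PAIRS = {")": "(", "]": "[", "}": "{", "\u3009": "\u3008"}
OPENERS = set(PAIRS.values())

def to_combining(text: str) -> str:
    """Rewrite paired brackets as base character plus CO or CC marker."""
    out = []
    for ch in text:
        if ch in OPENERS:
            out.append(ch + CO)
        elif ch in PAIRS:
            out.append(PAIRS[ch] + CC)
        else:
            out.append(ch)
    return "".join(out)

print(to_combining("(level 1 (level 2) level 1)"))

# Inventory: 8 bracket characters today versus 4 bases plus 2 markers.
current, proposed = 2 * len(PAIRS), len(PAIRS) + 2
surplus_percent = round((current / proposed - 1) * 100)
print(current, proposed, surplus_percent)
```

The inventory count reproduces the 33 % figure: eight characters are
one third more than six.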

-- 
"Well," Brahma said, "even after a thousand explanations,
a fool is no wiser, but an intelligent man requires only
two hundred and fifty."
        
        – From the Mahābhārata (महाभारत)

Received on Thu May 03 2012 - 01:23:36 CDT
