Re: U+2018 is not RIGHT HIGH 6

From: Michael Probst <michael.probst03_at_web.de>
Date: Wed, 02 May 2012 16:59:58 +0200

On Friday, 27.04.2012, at 18:01 +0200, Werner LEMBERG wrote:

> 2) There might be different quotation characters within a document,
> meaning different things. In other words, there are documents
> where the distinction between various quotation marks is more than
> a glyph variant.

So if two glyphs have enough "visual character" to be used in one
document to express two different meanings, then they should be encoded
as different characters? If all these marks have been encoded so that
they can be used in a single document to mean different things, then
should RIGHT HIGH 6 and 66 not be encoded too, if they occur in
documents alongside U+2018 and U+201C with a different meaning? Works in
English about the German language, and works in German about English,
do not seem too exotic.

> > d) Is U+201F (‟) considered a mistake then? It is only about looks,
> > not about meaning like a RIGHT HIGH 6 Q... would be.
>
> See my second argument above.

Kenneth Whistler surmised in 2006 that U+201F would probably not have
been given a code point had it been proposed then:

        http://unicode.org/mail-arch/unicode-ml/y2006-m06/0300.html

So today U+201B and U+201F would possibly not be encoded at all.

This looks a bit contradictory to me.

Either one leaves it to font selection whether you get

        x \
        \ or x

and thus to the font designer to make the glyph look either way, or one
allows for the use of both forms in one document without provision of
additional meta-information (one could alternate between two fonts each
providing one of the two glyphs for U+2018 and add the font switch info)
by encoding both of them.

> > e) How does the font know which glyph to choose for a given, say,
> > UTF-8 byte sequence? Do we get back to "charset" selection then?
>
> In many cases, plain UTF-8 text doesn't transport enough meta
> information to be rendered correctly. It is the job of the user's
> environment, or the heuristics built into the rendering engine, or
> explicit users settings or document tagging (script, language, etc.)
> to provide this information.
>
> > f) Should a code point not encode meaning and thus a "left opening"
> > mark never be required to be abused as a "right closing" one?
>
> This is not how Unicode works. You can find the fine details in the
> standard.
>
> Whether a certain character is `opening' or `closing' is a meta
> information, usually depending on the language (or the country, or the
> writing direction, or...), to be provided by a higher level.

Does it work at all? Why not specify "ISO-..." for each piece of text
(sequence of bytes) of a document on that level? Specifying information
above the encoding makes the "traditional" encodings perfectly
compatible with each other.

When writing an article in English about the German language, should
one have to switch the font for each piece of German text containing
quotation marks? Or change the locale settings for each piece? Or issue
some variation selection for each piece, so the font can pick the best
glyph?
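
To make the question concrete, here is a minimal Python sketch of what
such a higher level would have to do: pick the marks from a language tag
attached to each piece of text. The table QUOTES and the function
quote() are hypothetical names made up for this illustration; only the
English and German conventions discussed in this thread are listed.

```python
# Hypothetical per-language table: (opening mark, closing mark).
QUOTES = {
    "en": ("\u201C", "\u201D"),  # English: U+201C ... U+201D
    "de": ("\u201E", "\u201C"),  # German:  U+201E ... U+201C
}

def quote(text, lang):
    """Wrap text in the quotation marks conventional for lang."""
    opening, closing = QUOTES[lang]
    return opening + text + closing

english = quote("word", "en")
german = quote("Wort", "de")

# The same code point U+201C opens the English quotation
# and closes the German one.
print(english, german)
```

Every piece of quoted text needs its own language tag here, which is
exactly the per-piece switching asked about above.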

One might as well provide the meta-information "ISO-8859-6" or
"ISO-8859-8" to have the byte E1 displayed as ف (Arabic Faa) or
ב (Hebrew Beth) respectively. In Unicode, the "meta-information"
"Arabic" and "Hebrew" has been encoded into the characters themselves,
so one need not specify it. The right-hand use of the HIGH 6 does not
even depend on just two languages, as the Arabic E1 and the Hebrew E1
do, and LEFT and RIGHT HIGH 6 can occur in a single document just like
these two.
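
The E1 example can be checked with a few lines of Python. This is only
a sketch using the standard codecs iso8859_6 and iso8859_8; the variable
names are mine.

```python
# One and the same byte, interpreted under two "traditional" encodings.
raw = bytes([0xE1])

as_arabic = raw.decode("iso8859_6")  # ISO-8859-6 (Latin/Arabic)
as_hebrew = raw.decode("iso8859_8")  # ISO-8859-8 (Latin/Hebrew)

# Unicode gives the two letters separate code points, so no
# charset label is needed to tell them apart.
print(f"U+{ord(as_arabic):04X}", f"U+{ord(as_hebrew):04X}")
```

Without the charset label the byte alone is ambiguous; with Unicode the
distinction travels inside the character itself.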

Moreover, the "higher level" that distinguishes between U+201E and
U+201C (or between U+201C and U+201D) does not insert just one of these
plus some meta-information; it inserts different codes. So some
meta-information seems to have been encoded into the characters
already. Otherwise even U+201D (for example) would be superfluous, as
"being the other one" can be determined on the higher level more easily
than the less visible but equal difference between U+201C and a RIGHT
HIGH 66.
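
How much is already recorded per code point can be listed with Python's
unicodedata module. A small sketch, where the list marks and the dict
info are names chosen for this example:

```python
import unicodedata

# The single and double quotation marks discussed in this thread.
marks = ["\u2018", "\u2019", "\u201B", "\u201C", "\u201D", "\u201E", "\u201F"]

# Map "U+XXXX" to (character name, general category).
info = {
    f"U+{ord(c):04X}": (unicodedata.name(c), unicodedata.category(c))
    for c in marks
}

for code_point, (name, category) in info.items():
    print(code_point, category, name)
```

The general categories Pi (initial quote), Pf (final quote) and Ps (open
punctuation) show that a notion of "opening" and "closing" is attached
to these code points, even though U+201C, listed as Pi, closes a
quotation in German.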

And variation selection looks more like some "sub-font selection", for
example choosing

        http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=76F4

        "Ten eyes saw no concealment"
                -- http://zhongwen.com/d/170/x189.htm

to look either like "ten eyes on the ground without a concealment" or
more closely depicting its etymological construction -- but both within
one consistently designed font, not breaking the overall appearance,
only serving contemporary (and probably temporary) local taste.

The drive to cope reasonably with the vast variance of Chinese (and
Mayan, and hieroglyphic) characters, the striving for backward
compatibility and the small difference in appearance seem to hide the
fact that the difference between

        U+201C and U+201D,

as well as the difference between

        U+201C and U+201E

is just as big as the difference between

        U+201C and a RIGHT HIGH 66.

(The same goes for the single version -- which reminds me that the
"locale setting" on the "higher level" could as well provide
"meta-information" of whether to start with “ or ‘ on the first level of
quotation and thus render one of the encodings superfluous … :-)

Michael
Received on Wed May 02 2012 - 10:04:31 CDT
