a character for an unknown character
charupdate at orange.fr
Thu Dec 29 18:23:55 CST 2016
On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote:
> On 12/28/2016 5:47 PM, Richard Wordingham wrote:
> > On Tue, 27 Dec 2016 21:33:32 -0800
> > Asmus Freytag wrote:
> > >
> > > When it comes to marks (or symbols) of less generic or more complex
> > > shapes, the
> > > presumption that the mark only has "one" shape may be more common,
> > > and examples of the mark
> > > being repurposed may be less common. Not being as common, fewer
> > > readers will
> > > recognize all stylistic variations as being "the same thing". A
> > > variant form will be more
> > > likely to be understood as a related, but not identical symbol. That
> > > in turn fuels the
> > > misperception that Unicode somehow encodes symbols based on a single
> > > conventional usage.
> > The idea of a single conventional usage is also fuelled by a number of
> > practices and policies:
> > 1) A letter belongs to a single script (not to be confused with
> > writing system)
> Making or not making that distinction makes some stuff easier and other stuff
> harder to support in software. Overall, I think Unicode got this one right.
> > 2) Distinction of punctuation and modifier letters, e.g. the highly
> > confusing distinction between U+2019 RIGHT SINGLE QUOTATION MARK and
> > U+02BC MODIFIER LETTER APOSTROPHE
> I'm beginning to think that 02BC is closer to a mistake than a correct solution;
> there are places where it has to be treated on the same footing as 2019 even
> though the idea was to give it different properties.
That U+02BC might be shifted from a letter to punctuation must have been anticipated
at encoding time, since the original recommendation was to use it as the apostrophe
throughout. Unifying the letter apostrophe and the punctuation apostrophe made IMO
more sense, despite the conflicting properties, than unifying the apostrophe with
a quotation mark, because of the downside in text processing (cf. past yearʼs
thread). The most proper solution would IMO have been to encode all three separately,
the same way as the COMMA has not been unified with the SINGLE LOW-9 QUOTATION MARK,
despite the latter often being informally referred to as a “comma.”
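The property difference between the two apostrophes can be checked directly; a minimal sketch using only Pythonʼs standard unicodedata module:

```python
import unicodedata

# U+2019 RIGHT SINGLE QUOTATION MARK is punctuation (Pf, Final_Punctuation);
# U+02BC MODIFIER LETTER APOSTROPHE is a letter (Lm, Modifier_Letter).
print(unicodedata.category("\u2019"))  # Pf
print(unicodedata.category("\u02bc"))  # Lm

# The letter status matters for text processing: a word spelt with U+02BC
# remains a single alphabetic run, while U+2019 breaks it.
print("can\u02bct".isalpha())  # True
print("can\u2019t".isalpha())  # False
```

This is exactly the tension mentioned above: the two code points carry different properties, yet in running text they often have to be treated alike.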
> > 3) The resolution of U+002D HYPHEN-MINUS into U+2010 HYPHEN, U+2212
> > MINUS SIGN and a few minor punctuation marks
> HYPHEN-MINUS is a bad example, because it's a conflation of several
> quite distinct elements of type onto a single key for the purposes of typewriters.
Confusingly, that typewriter legacy lingers far into the computer era, even while
all other parts of computer science and computing practice are constantly updated.
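The conflation is visible in the character properties themselves; a small Python sketch (standard library only):

```python
import unicodedata

# U+002D HYPHEN-MINUS conflates elements that Unicode later disunified;
# the general categories show the split between punctuation and math symbol.
for ch, name in [("\u002d", "HYPHEN-MINUS"),
                 ("\u2010", "HYPHEN"),
                 ("\u2212", "MINUS SIGN")]:
    print(f"U+{ord(ch):04X} {name}: {unicodedata.category(ch)}")
# HYPHEN-MINUS and HYPHEN are Pd (dash punctuation); MINUS SIGN is Sm (math symbol)
```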
> > 4) Distinction between decimal digits and letters
Perhaps the letters for hexadecimal digits should have been encoded separately?
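The digit/letter distinction shows up in the character properties: ASCII 0–9 carry decimal values, while a–f, although they serve as hexadecimal digits, are plain letters with no numeric value at the character level. A quick Python check:

```python
import unicodedata

print(unicodedata.category("5"))  # Nd (decimal digit)
print(unicodedata.category("a"))  # Ll (lowercase letter)
print(unicodedata.decimal("5"))   # 5
# "a" has no character-level numeric value; its hexadecimal value of 10
# exists only by parsing convention:
print(int("ff", 16))              # 255
```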
> > 5) The nightmare of spacing single and double dots.
> ? spacing vs. combining? Not sure what you mean.
I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT LEADER, along
with U+002E FULL STOP.
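Those three are indeed easy to conflate in processing: both leaders carry compatibility decompositions to FULL STOP, so NFKC folds them away. A minimal Python check:

```python
import unicodedata

# U+2024 ONE DOT LEADER and U+2025 TWO DOT LEADER are separate characters,
# but NFKC normalization folds them to U+002E FULL STOP sequences.
print(unicodedata.normalize("NFKC", "\u2024"))  # "."
print(unicodedata.normalize("NFKC", "\u2025"))  # ".."
print(unicodedata.category("\u2024"))           # Po, same as FULL STOP
```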
> > Ideal solutions can also be defeated by limited keyboard layouts.
Yeah, thatʼs the point. Programming, deploying and customizing keyboard layouts
should by now be quite trivial. Yet it seems that this is not part of the curricula.
> > As a result, I have no idea whether the singular of "fithp" (one of
> > Larry Niven's alien species) should be spelt with U+02BC or U+2019,
> > though in ASCII I can just write "fi'".
Normally on an English or French keyboard layout, all three are accessed on
> The only place where "uni" doesn't apply in Unicode is that there's never just
> a single principle that applies, but always multiple ones that are in tension ---
> and in the edge cases, the tension can be felt keenly.
Sorry I cannot follow. Perhaps an example would make the issue clearer.
As for the apostrophe issue, this is IMO an exception, due to the lack of
a character (the punctuation apostrophe), a lack that in turn seems to have
been triggered by an atrophy of analytical memory in favor of visual
memory, which fails when faced with three “squiggles.” By contrast, there
are two comma-like characters, and four characters that produce period-like
appearances. Not sure whether the lack of an apostrophe reduces that nightmare.