a character for an unknown character

Marcel Schneider charupdate at orange.fr
Thu Dec 29 18:23:55 CST 2016

On Wed, 28 Dec 2016 19:05:17 -0800, Asmus Freytag wrote:
> On 12/28/2016 5:47 PM, Richard Wordingham wrote: 
> > On Tue, 27 Dec 2016 21:33:32 -0800 
> > Asmus Freytag  wrote: 
> > > 
> > > When it comes to marks (or symbols) of less generic or more complex shapes,
> > > the presumption that the mark only has "one" shape may be more common, and
> > > examples of the mark being repurposed may be less common. Not being as
> > > common, fewer readers will recognize all stylistic variations as being "the
> > > same thing". A variant form will be more likely to be understood as a
> > > related, but not identical symbol. That in turn fuels the misperception
> > > that Unicode somehow encodes symbols based on a single conventional usage. 
> > The idea of a single conventional usage is also fuelled by a number of 
> > practices and policies: 
> > 
> > 1) A letter belongs to a single script (not to be confused with 
> > writing system) 
> Making or not making that distinction makes some stuff easier and other stuff 
> harder to support in software. Overall, I think Unicode got this one right. 
> > 
> > 2) Distinction of punctuation and modifier letters, e.g. the highly 
> > confusing distinction between U+2019 RIGHT SINGLE QUOTATION MARK and 
> I'm beginning to think that 02BC is closer to a mistake than a correct solution; 
> there are places where it has to be treated on the same footing as 2019 even 
> though the idea was to give it different properties. 

That U+02BC might be shifted from a letter to a punctuation mark must have been 
anticipated at encoding time, since the original recommendation was to use it as 
the apostrophe throughout. Unifying the letter apostrophe with the punctuation 
apostrophe made more sense IMO, despite the conflicting properties, than unifying 
the apostrophe with a quotation mark, because of the downside in text processing 
(cf. past yearʼs thread). The most proper solution would IMO have been to encode 
all three separately, the same way the COMMA was not unified with the SINGLE 
LOW-9 QUOTATION MARK, despite the latter often being informally referred to as 
a “comma.”
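The conflicting properties can be seen directly in the general categories that
the Unicode Character Database assigns to the three apostrophe-shaped characters;
a minimal Python sketch:

```python
import unicodedata

# The three apostrophe-shaped characters and their general categories:
# U+0027 APOSTROPHE                  -> Po (punctuation, other)
# U+2019 RIGHT SINGLE QUOTATION MARK -> Pf (punctuation, final quote)
# U+02BC MODIFIER LETTER APOSTROPHE  -> Lm (letter, modifier)
for ch in "\u0027\u2019\u02bc":
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {unicodedata.category(ch)}")

# Because U+02BC is a letter, it keeps a word "alphabetic" in simple
# letter-based text processing, while U+2019 breaks that:
print("fi\u02bc".isalpha())  # True  -- Lm counts as alphabetic
print("fi\u2019".isalpha())  # False -- Pf does not
```

This is exactly the tension mentioned above: any process that must treat the
two the same way has to special-case one of them.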

> > 
> > 3) The resolution of U+002D HYPHEN-MINUS into U+2010 HYPHEN, U+2212 
> > MINUS SIGN and a few minor punctuation marks 
> HYPHEN-MINUS is a bad example, because it's a conflation of several 
> quite distinct elements onto a single key for the purposes of typewriters. 

Confusingly, that typewriter legacy persists far into the computer era, while 
all other parts of computer science and computing practice are constantly updated.

> > 
> > 4) Distinction between decimal digits and letters 

Perhaps the letters for hexadecimal digits should have been encoded separately?
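As things stand, the hexadecimal digits ten through fifteen reuse the ordinary
Latin letters, so character properties cannot distinguish a hex digit from a
letter; only the base-16 parser assigns it a value. A small Python check:

```python
import unicodedata

# 'A' used as the hexadecimal digit ten is still just a Latin letter:
print(unicodedata.category("A"))  # 'Lu' -- uppercase letter, not a digit
print("A".isdigit())              # False
print(int("A", 16))               # 10 -- only the base-16 parser gives a value
```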

> > 
> > 5) The nightmare of spacing single and double dots. 
> ? spacing vs. combining? Not sure what you mean.

I think Richard refers to U+2024 ONE DOT LEADER and U+2025 TWO DOT LEADER, along 
with U+002E FULL STOP.
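Assuming that reading is right, the trouble is easy to demonstrate in Python:
the dot leaders look like one or two full stops and even fold into them under
compatibility normalization, yet compare unequal as raw code points:

```python
import unicodedata

# U+2024 ONE DOT LEADER and U+2025 TWO DOT LEADER have compatibility
# decompositions to U+002E FULL STOP, so NFKC folds them away:
print(unicodedata.normalize("NFKC", "\u2024"))  # '.'
print(unicodedata.normalize("NFKC", "\u2025"))  # '..'
# Yet as raw code points they are distinct from the full stop:
print("\u2024" == ".")  # False
```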

> > 
> > Ideal solutions can also be defeated by limited keyboard layouts.

Yes, thatʼs the point. Programming, deploying, and customizing keyboard layouts 
should by now be quite trivial, yet it seems this is not part of the curricula.

> > As a result, I have no idea whether the singular of "fithp" (one of 
> > Larry Niven's alien species) should be spelt with U+02BC or U+2019, 
> > though in ASCII I can just write "fi'". 

Normally, on an English or French keyboard layout, all three are accessible on 
live keys.

> The only place where "uni" doesn't apply in Unicode is that there's never just 
> a single principle that applies, but always multiple ones that are in tension --- 
> and in the edge cases, the tension can be felt keenly. 

Sorry, I cannot follow. Perhaps an example would make the issue clearer.
As for the apostrophe issue, this is IMO an exception, due to the lack of 
a character (the punctuation apostrophe), a lack that in turn seems to have 
been triggered by an atrophy of analytical memory in favor of visual memory, 
which fails when faced with three “squiggles.” On the other hand, there are 
two comma-like characters, and four that produce period-like appearances. 
I am not sure whether the lack of an apostrophe reduces that nightmare 
by half.
