Re: Private Use proposals (long)

From: William Overington (WOverington@ngo.globalnet.co.uk)
Date: Sat May 25 2002 - 03:43:40 EDT


>In this context, I think William is saying that it's risky to overload
>ZWJ to handle Latin ligation because we can't completely rule out the
>possibility that we might need ZWJ to "join" Latin characters the way it
>currently joins Arabic characters. This concern can probably be put to
>rest by reading the description in Section 13.2 of the Unicode 3.1
>Technical Report (UAX #27). The description carefully spells out the
>relationship between "cursively connected" and "ligated" renditions and
>the roles ZWJ and ZWNJ play in determining the rendition to be used.
>

My concern was not quite that. It was more a computer programming view of
the matter than a typographic one: there are two types of operation, joining
and ligating, and it seemed to me that two different operations would best
be implemented using two different codes. I have now had a look at the text
to which you refer, and it looks to be much the same as, or perhaps even
identical to, the text in one of Mr Everson's pdf files on the matter, to
which I referred in a previous post.

The idea of ZWJ ZWNJ ZWJ as a sequence so that every possibility is
accessible seems less than elegant when there was the clear possibility of
defining ZWL and keeping ZWJ as it was.

Still, perhaps extending the definition of ZWJ, rather than defining ZWL as
an extra command, is elegantly subtle. I am still thinking about this;
however, I am learning more about the system by trying to resolve this
matter to my own satisfaction. I suppose that ultimately it comes down to
how a person himself or herself views software and data structures. The ZWJ
ZWNJ ZWJ sequence seems like an ice skater doing a clever spin in the air
then landing badly.
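
To make the three requests concrete, here is a minimal sketch in Python of
how the sequences look as stored text, as I read the description in UAX #27.
The function names are my own and purely illustrative. ZWL has no code point,
so it does not appear. The sketch only builds the strings and lists their
code points; whether a ligature is actually shown depends on the rendering
system and the fount.

```python
# Code points as named in the Unicode Standard.
ZWJ = "\u200D"   # ZERO WIDTH JOINER
ZWNJ = "\u200C"  # ZERO WIDTH NON-JOINER


def request_ligature(a: str, b: str) -> str:
    """a ZWJ b: ask for a ligature (or cursive connection) if available."""
    return a + ZWJ + b


def forbid_joining(a: str, b: str) -> str:
    """a ZWNJ b: ask for neither a ligature nor a cursive connection."""
    return a + ZWNJ + b


def connect_without_ligature(a: str, b: str) -> str:
    """a ZWJ ZWNJ ZWJ b: ask for cursive connection but no ligature."""
    return a + ZWJ + ZWNJ + ZWJ + b


def code_points(text: str) -> str:
    """List the code points of a string in U+XXXX notation."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)


for label, sequence in [
    ("ligature requested", request_ligature("c", "t")),
    ("joining forbidden", forbid_joining("c", "t")),
    ("connected, not ligated", connect_without_ligature("c", "t")),
]:
    print(f"{label}: {code_points(sequence)}")
```

The third case is the one that needs three format characters between the two
letters, which is the sequence I find less than elegant.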

>Ah, but it *isn't* harmless. It causes problems for normalization. For
>homework tonight, read UAX #15, "Unicode Normalization Forms." The key
>point for our discussion is that creation of additional canonical or
>compatibility equivalents -- such as a new ligature for two existing
>characters -- would destabilize the normalization process, because
>normalization engines based on different versions of Unicode might
>produce different results. Beyond a certain point in time (defined as
>Unicode 3.1), no new canonical or compatibility equivalences can be
>defined. Because of this, a new "ft" ligature could not carry the
>obvious compatibility mapping to 0066 0074; but that would destroy most
>of the benefit of encoding it in the first place.
>

Well, I got the document and have looked at it. Unfortunately at the moment
I do not understand it. However, your example about the ft ligature above
provides me with an insight into what I think may be the issue.
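
As a check on that insight, here is a small sketch using the unicodedata
module from the Python standard library. It compares the existing st ligature
at U+FB06, which carries a compatibility mapping, with a Private Use Area
code point, which carries none. The use of U+E000 for an ft ligature is my
own assumption for illustration, not an allocation anyone has published.

```python
import unicodedata

ST_LIGATURE = "\uFB06"       # LATIN SMALL LIGATURE ST, already in Unicode
HYPOTHETICAL_FT = "\uE000"   # assumed private allocation for an ft ligature


def describe(ch: str) -> dict:
    """Collect the properties relevant to normalization for one character."""
    return {
        "code": f"U+{ord(ch):04X}",
        "category": unicodedata.category(ch),
        "decomposition": unicodedata.decomposition(ch),
        "nfc": unicodedata.normalize("NFC", ch),
        "nfkc": unicodedata.normalize("NFKC", ch),
    }


st_info = describe(ST_LIGATURE)
ft_info = describe(HYPOTHETICAL_FT)

# The existing ligature folds to plain "st" under NFKC, but not under NFC.
print(st_info)

# The private use character has no mapping, so every form leaves it alone.
print(ft_info)

# A whole word behaves the same way.
print(unicodedata.normalize("NFKC", "a\uFB06rolabe"))  # astrolabe
```

If I follow the argument, a newly encoded ft ligature would be in the same
position as the private use character here: it could not be given the
mapping to 0066 0074, so normalization would never fold it back to f and t.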

>Fortunately, it is not necessary to assign new Unicode characters in
>order to put your favorite Latin ligatures in a font. Just create the
>"ft" ligature glyph and teach your font to substitute it in place of the
>unconnected 0066 and 0074 glyphs. You can assign the ligature to a PUA
>code point if you like, but if the internal mapping is done right it
>isn't necessary for you to publicize the PUA code point, or for users to
>use it directly. (I'm not a font designer, but the font designers on
>this list say this is easy.)
>

Please consider someone who is trying to produce a German Fraktur document
or a setting of a text from an 18th Century English printed book using
Microsoft Word 97 running under Windows 95. It seems to me that use of
Insert Symbol with a TrueType fount which has the required ligatures encoded
in the Private Use Area is a perfectly reasonable way to produce the
required output. The text might not be encoded in a manner
suitable for sorting of the words into dictionary order, yet that is not
required for that application. It is as if the argument used to prevent
adding further ligatures to the U+FB.. block is that because some
applications need to sort the words in text into alphabetical order, then
all applications must encode text in a way that enables the words in that
text to be sorted into alphabetical order.
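
To show what the sorting problem looks like in practice, and that it can be
handled by an application which needs to handle it, here is a minimal sketch
in Python. The table PRIVATE_LIGATURES and the choice of U+E000 for a ct
ligature are assumptions of mine for illustration only. Sorting is by plain
code point order, not a full collation algorithm.

```python
import unicodedata

# Hypothetical private agreement: U+E000 stands for a ct ligature.
PRIVATE_LIGATURES = {"\uE000": "ct"}


def sort_key(word: str) -> str:
    """Expand private ligatures, then fold standard ones with NFKC."""
    expanded = "".join(PRIVATE_LIGATURES.get(ch, ch) for ch in word)
    return unicodedata.normalize("NFKC", expanded)


words = ["pi\uE000ure", "pig", "a\uFB06rolabe", "atlas"]

# Naive sort: ligature code points are far above the basic letters, so
# "astrolabe" lands after "atlas" and "picture" lands after "pig".
naive = sorted(words)

# Ligature-aware sort: words are compared by their expanded spelling,
# while the stored text keeps its ligature characters.
aware = sorted(words, key=sort_key)

print([sort_key(w) for w in naive])  # ['atlas', 'astrolabe', 'pig', 'picture']
print([sort_key(w) for w in aware])  # ['astrolabe', 'atlas', 'picture', 'pig']
```

The point of the sketch is that the expansion table has to be known to the
sorting application. For U+FB06 the table is part of Unicode itself; for a
private use code point it has to be agreed separately.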

Please consider that someone is writing an applet with the intention of
making it available on the web where people may customize a calling
parameter of the applet (by using a PARAM statement in the HTML call of the
applet) so that the applet typesets a short piece of text, such as a poem,
without an external fount at all, just from software built into the applet.
A set of codes for the ligatures is essential. For example, if one wished
to have the word astrolabe displayed on the screen using an st ligature, one
would enter a\uFB06rolabe in the PARAM statement in the HTML file. It seems
to me that if one wishes to have the word picture displayed on the screen
with a ct ligature, regular Unicode should provide a code for it. Now, I
recognize that the problem over sorting could occur in some applications,
yet I feel that the matter could perhaps be resolved by starting a new block
of alphabetic presentation forms that are called non-normalized ligatures or
whatever and including them there. There is an argument that lots of
ligatures might need to be defined, yet for German Fraktur there is a
specific set and for 18th Century English printing there is probably a
limited set which would be needed. Maybe not every fount maker would use
those codes, just as many fount makers do not use many of the codes that are
available to them in Unicode, yet it does seem to me that people both now
and in the future who may want to code transcriptions of short passages from
older books in a text that is primarily modern language should have
facilities made available by Unicode.
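
The decoding step for such a parameter is simple. The applet itself would be
written in Java, so what follows is only a sketch of the idea in Python, with
a function name of my own choosing. It assumes the parameter value arrives
with the escape written out literally as a backslash, the letter u and four
hexadecimal digits, and it turns each such escape into the single character
which the applet would then draw from its built-in glyph data.

```python
import re

# A literal backslash, the letter u, then exactly four hexadecimal digits.
_ESCAPE = re.compile(r"\\u([0-9A-Fa-f]{4})")


def decode_param(value: str) -> str:
    """Replace each \\uXXXX escape in a parameter value by its character."""
    return _ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), value)


# The raw string keeps the backslash as typed in the HTML file.
raw = r"a\uFB06rolabe"
decoded = decode_param(raw)

print(len(raw))      # 13 characters as typed
print(len(decoded))  # 8 characters after decoding
print([f"U+{ord(ch):04X}" for ch in decoded])
```

The same function would serve for a private use code point, since the escape
notation does not care which block the four digits refer to.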

However, I feel that I have received little support for this position at the
present time. The situation might perhaps change as time goes on. However,
regardless of what is included in regular Unicode at the present time I do
feel that there is a need for a set of code point allocations so I am hoping
to publish a collection of my own Private Use Area allocations later today.
Hopefully those will be useful to people who would like some sort of list of
code points for ligatures, even if it is not a standard list. Maybe some
readers of this list who have no immediate use for such a list might
nevertheless file the list away somewhere in case they need it later.

>In other words, a grass-roots de-facto standard for encoding padlock
>symbols in the PUA would sort of emerge from the PUA code point
>allocations you have suggested. Personally, I am skeptical it would
>work out that way.

My idea was that people who are experimenting with the implications of your
suggestion to include padlock symbols would have the chance to find that the
results of their experiments were compatible with one another as regards any
test founts which had been made.

Though, since you raise the idea and since I have included the padlock
symbols in what is now called the Courtyard Codes collection, maybe those
four symbols will be used widely. There is nothing to stop anyone who has
the necessary fount producing facilities making a fount with those four
characters included at those Private Use Area code points if he or she so
chooses, and indeed publishing it if he or she so chooses.

William Overington

25 May 2002


