From: Eric Muller (emuller@adobe.com)
Date: Thu Mar 22 2007 - 12:10:07 CST
Why does Adobe-Japan1 contain a single sequence for a given character?
The fundamental reason is that 1) CIDs are more restrictive than
characters, and 2) our CID collection is open-ended.
1) if you look at what is in the Adobe-Japan1 CID collection today, you
will notice that it distinguishes shapes that are not distinguished at
the character level. Whereas Unicode unifies two shapes that differ only
in roof-top modification or in rotated strokes, the AJ1 CID collection
retains those distinctions. If you think in terms of glyph sets, a
character is a certain set of glyphs; a CID is a subset of one such set;
the glyphs for that CID in a given AJ1 font family are a subset of that
subset; the glyphs for that CID in a particular face of that family are a
subset of that. Of course, this nesting is not always very clean,
because of duplicate encoding, and various other historical accidents,
but it's still a useful view.
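
The nesting described above can be sketched with plain Python sets. The
shape labels ("s1" ... "s5") and the variable names are hypothetical
stand-ins for real glyph outlines; nothing here comes from the actual
AJ1 data.

```python
# Hypothetical shape labels standing in for glyph outlines.
character_shapes = {"s1", "s2", "s3", "s4", "s5"}  # acceptable for the character
cid_shapes = {"s1", "s2", "s3"}                    # acceptable for one CID
family_shapes = {"s1", "s2"}                       # used by one AJ1 font family
face_shapes = {"s1"}                               # used by one face of that family


def is_nested(*levels):
    """Return True if each set is a subset of the one listed before it."""
    return all(inner <= outer for outer, inner in zip(levels, levels[1:]))


# Shapes that are fine for the character but not for the CID.
only_character = character_shapes - cid_shapes

print(is_nested(character_shapes, cid_shapes, family_shapes, face_shapes))
print(sorted(only_character))
```

The non-empty difference `only_character` is the point of the model: a
single CID for a character still excludes some shapes that the
character itself allows.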
The fact that we identify a single subset of a given character (i.e.
have a single CID for a given character) does not mean that that subset
contains all the glyph shapes for the character. More concretely,
consider U+4FD8, for which we only have CID 4147: there are shapes which
are acceptable for U+4FD8 which are not acceptable for CID 4147. In
other words, these two things are not equivalent, so <U+4FD8> and
<U+4FD8, U+E0100> = CID 4147 express different things. Granted, this is
not explicitly stated in the definition of AJ1, but it is there.
It is true that if I display today <U+4FD8> with an AJ1 font, then I
will always get a shape that satisfies CID 4147, because that is the
only kind of shape that can get in an AJ1 font today. But if I display
with any Unicode font, not just an AJ1 font, <U+4FD8, U+E0100> and
<U+4FD8> can produce different results, and I have "more guarantees"
about how <U+4FD8, U+E0100> will look than I have about how
<U+4FD8> will look.
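
As a small illustration of the two texts being compared, here is how
<U+4FD8> and <U+4FD8, U+E0100> look as Python strings. The helper
`code_points` is a name introduced only for this sketch.

```python
BASE = "\u4fd8"      # U+4FD8
VS17 = "\U000E0100"  # U+E0100, the variation selector used in the sequence

plain = BASE         # <U+4FD8>
ivs = BASE + VS17    # <U+4FD8, U+E0100>


def code_points(text):
    """List the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


print(code_points(plain))
print(code_points(ivs))
print(plain == ivs)  # the two strings are distinct plain text
```

The two strings differ as plain text, which is what lets the second one
carry the extra guarantee without any markup or font requirement
outside the text.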
All this applies equally well to the cases where a character has
multiple CIDs. The only difference in that case is that I can guarantee
that two different occurrences of a given character will show up
differently.
2) our AJ1 CID collection is open-ended, i.e. we can add CIDs to it over
time, as the need arises. For example, suppose that JIS decides in a new
edition to modify the shape of the acceptable glyphs for a given JIS
code point, then we would add a CID for the new shape. Playing that in
the past: consider the shape given in JIS 0208 :1978 to 17-28 (aka
U+958F): we have CID 1246 for that; then :1984 comes along and changes
the shape of 17-28; we do not redefine the shape of CID 1246, instead we
add CID 7641. [This reconstruction does not necessarily match what
really happened; it is only for illustration.]
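
The "add, never redefine" policy can be sketched as an append-only
registry. The class `CidCollection` and the shape descriptions are
hypothetical; only the CID numbers and the 17-28 example come from the
text above, and that example is itself an illustrative reconstruction.

```python
class CidCollection:
    """Append-only mapping from CID to a shape description (a sketch)."""

    def __init__(self):
        self._shapes = {}

    def add(self, cid, shape):
        # An existing CID is never redefined; new shapes get new CIDs.
        if cid in self._shapes:
            raise ValueError(f"CID {cid} is already defined")
        self._shapes[cid] = shape

    def shape(self, cid):
        return self._shapes[cid]


collection = CidCollection()
collection.add(1246, "17-28 as shown in the :1978 edition")
# A later edition changes the shape of 17-28: add a CID, keep the old one.
collection.add(7641, "17-28 as shown in the :1984 edition")

print(collection.shape(1246))
print(collection.shape(7641))
```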
Let's put the two together. If I want the CID 4147 guarantee but there
is no IVS (ideographic variation sequence) for it today, then all I can
put in my document is <U+4FD8>.
We already saw that this may or may not be displayed the way I want. I
need to impose, by means outside my plain text, the use of an AJ1 font
to get what I want. Not wonderful, but I could live with it. Then
tomorrow a new CID shows up for U+4FD8, and we register two sequences,
one for CID 4147 and one for the new CID. I can use those sequences in
new documents, but that leaves the document I created today in the cold.
Even if I can still enforce the use of an AJ1 font, I no longer get the
guarantee that this lone <U+4FD8> is displayed with CID 4147. I would
need a further guarantee that in AJ1 fonts, U+4FD8 is cmapped to CID
4147, now *and forever*. Well, if you look at the history of fonts'
cmaps, that is definitely not happening. Indeed, the changes of shape
mandated by the JIS standards make it more or less impossible to enforce
a "never change the cmaps" rule, and that creates all sorts of very nasty
problems for our customers. By registering a sequence today, even when it
is the only one for a given character, we can offer our customers (and
others) a viable and robust solution.
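
The scenario above can be played out in a few lines. This is a sketch
under stated assumptions: `NEW_CID` is an invented number for the CID
that "shows up tomorrow", the two dictionaries are toy stand-ins for a
font's default mapping and its variation-sequence mapping, and
`resolve` is a hypothetical helper, not any real font API.

```python
NEW_CID = 99999  # hypothetical CID added "tomorrow" for U+4FD8

cmap_today = {0x4FD8: 4147}
cmap_tomorrow = {0x4FD8: NEW_CID}        # the default mapping has changed
ivs_table = {(0x4FD8, 0xE0100): 4147}    # registered today, stable afterwards


def is_ivs_selector(cp):
    """True for the variation selectors U+E0100..U+E01EF."""
    return 0xE0100 <= cp <= 0xE01EF


def resolve(text, cmap, ivs_table):
    """Map text to CIDs, preferring a registered sequence over the cmap."""
    cps = [ord(ch) for ch in text]
    cids = []
    i = 0
    while i < len(cps):
        base = cps[i]
        if i + 1 < len(cps) and is_ivs_selector(cps[i + 1]):
            # Unknown sequences fall back to the default mapping of the base.
            cids.append(ivs_table.get((base, cps[i + 1]), cmap[base]))
            i += 2
        else:
            cids.append(cmap[base])
            i += 1
    return cids


old_plain = "\u4fd8"            # written today without a selector
old_ivs = "\u4fd8\U000E0100"    # written today with the sequence

print(resolve(old_plain, cmap_today, ivs_table))
print(resolve(old_plain, cmap_tomorrow, ivs_table))
print(resolve(old_ivs, cmap_tomorrow, ivs_table))
```

In this model the undecorated document silently changes CID when the
default mapping changes, while the decorated one keeps CID 4147.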
As usual with variation sequences, this is not to say that every occurrence
of a character in a document should be decorated with a variation
selector. Whether to decorate every occurrence, to decorate no
occurrence or anywhere in between depends on what guarantees you need
for your document. I could imagine for example that an official document
would systematically decorate people and place names, but would
systematically not decorate the "boilerplate".
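
A minimal sketch of such selective decoration, assuming the spans that
hold names are already known by some outside means. The function
`decorate`, the span list and the sample text are all hypothetical.

```python
VS17 = "\U000E0100"

# Which selector to append to which base character (illustrative only).
selector_for = {"\u4fd8": VS17}


def decorate(text, name_spans, selector_for):
    """Append a variation selector only to characters inside name spans.

    name_spans is a list of (start, end) index pairs, end exclusive.
    """
    out = []
    for i, ch in enumerate(text):
        out.append(ch)
        in_name = any(start <= i < end for start, end in name_spans)
        if in_name and ch in selector_for:
            out.append(selector_for[ch])
    return "".join(out)


text = "name:\u4fd8 note:\u4fd8"
decorated = decorate(text, [(5, 6)], selector_for)

print([f"U+{ord(ch):04X}" for ch in decorated if ord(ch) > 0x7F])
```

Only the occurrence inside the name span gets the selector; the
occurrence in the "boilerplate" part of the text is left as a bare
character.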
Eric.