Re: "Missing character" glyph

From: Doug Ewell (dewell@adelphia.net)
Date: Thu Aug 01 2002 - 11:42:58 EDT


Martin Kochanski <unicode at cardbox dot net> wrote:

> To look at it another way, virtually the only action that the Unicode
> Consortium needs to take to define UNRENDERED CHARACTER is to promise
> never to define a character at that code point.

I think this is exactly what they have done by creating the
"noncharacters" from U+FDD0 through U+FDEF. These code points are
guaranteed never to be assigned to real characters.

The better-known noncharacters U+FFFE and U+FFFF have some "suggested"
semantics that may discourage their use in applications such as
Martin's. U+FFFE is a byte-swapped UTF-16 BOM, so certain software
might handle it specially with that in mind (e.g. it might byte-swap the
rest of the text). U+FFFF is -1 in a 16-bit environment, so some
software might intentionally use it as a sentinel or other special
value.
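
To make the two numeric coincidences concrete, here is a rough Python
sketch. The helper names byte_swap16 and as_signed16 are made up for
illustration and are not taken from any particular library.

```python
import struct

BOM = 0xFEFF  # U+FEFF, used as the byte order mark


def byte_swap16(value: int) -> int:
    """Swap the two bytes of an unsigned 16-bit code unit."""
    return ((value & 0xFF) << 8) | ((value >> 8) & 0xFF)


def as_signed16(value: int) -> int:
    """Reinterpret an unsigned 16-bit value as a signed 16-bit integer."""
    return struct.unpack("<h", struct.pack("<H", value))[0]


# A BOM read with the wrong byte order shows up as U+FFFE.
swapped_bom = byte_swap16(BOM)

# The same thing seen through the codecs: encode big-endian, then read
# the two bytes back as a little-endian 16-bit unit.
misread = int.from_bytes("\ufeff".encode("utf-16-be"), "little")

# U+FFFF has every bit set, so a signed 16-bit variable sees it as -1.
minus_one = as_signed16(0xFFFF)

print(hex(swapped_bom), hex(misread), minus_one)
```

Software that checks for either of these values for its own purposes is
the kind that could interfere with a private use of U+FFFE or U+FFFF.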

OTOH, there is nothing numerically special about the code points U+FDD0
through U+FDEF, and it seems unlikely that much software knows about
them or handles them in any special way, so Martin can probably use them
without interference from the OS or app.
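
As a minimal sketch of what using them might look like, the Python below
tests whether a code point is a noncharacter and substitutes one for
anything a renderer cannot show. Besides U+FDD0 through U+FDEF, the
standard also sets aside the last two code points of every plane (U+FFFE
and U+FFFF among them), so the test covers those too. The choice of
U+FDD0 as the placeholder, and the names UNRENDERED, is_noncharacter and
mark_unrendered, are assumptions for illustration only.

```python
from typing import Callable

# Hypothetical choice: any code point in U+FDD0..U+FDEF would do equally well.
UNRENDERED = "\uFDD0"


def is_noncharacter(cp: int) -> bool:
    """Return True if cp is one of the Unicode noncharacters."""
    if not 0 <= cp <= 0x10FFFF:
        return False
    if 0xFDD0 <= cp <= 0xFDEF:
        return True
    # The last two code points of each plane: U+nFFFE and U+nFFFF.
    return (cp & 0xFFFE) == 0xFFFE


def mark_unrendered(text: str, can_render: Callable[[str], bool]) -> str:
    """Replace every character the caller cannot render with UNRENDERED."""
    return "".join(ch if can_render(ch) else UNRENDERED for ch in text)


# Example: pretend the display can only show ASCII.
marked = mark_unrendered("a\u03b2c", lambda ch: ord(ch) < 128)
print([hex(ord(ch)) for ch in marked])
```

Because the placeholder is for internal use only, it would make sense to
strip or replace it before the text is passed on to another process.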

> UNRENDERED CHARACTER has to be part of the BMP for backward
> compatibility: it should be renderable as a single glyph, not as a
> pair of glyphs, even on old systems that do not understand surrogates.
> The proposed positioning is intended to persuade older systems that
> this character should be rendered conventionally, like a Latin letter.

The suggested noncharacter code points are indeed in the BMP (there are
others outside the BMP). Putting such a beast in or near an alphabetic
script block, however, implicitly assigns a meaning to it (e.g. "this is
an unrendered character for use with alphabetic scripts"), which is
exactly what Martin was trying to avoid. Special formatting characters
and characters intended to aid special-purpose display scenarios (like
the control pictures) are intentionally segregated far from the
alphabetic script blocks.

> Otto Stolz suggested U+03A2, which would be equally valid. However,
> U+03A2 is quite obviously the code for GREEK CAPITAL LETTER FINAL
> SIGMA. For O.S., this is a reason for using the code (because there
> is, in fact, no such letter, so the code can be used); for me, this
> is a strong reason for *not* using the code, because if it **ever**
> became necessary to encode GREEK CAPITAL LETTER FINAL SIGMA then no
> character other than U+03A2 would be acceptable, whereas U+024F has no
> inherent semantics at all.

The only reason to ever encode GREEK CAPITAL LETTER FINAL SIGMA, or for
that matter, LATIN CAPITAL LETTER SHARP S, would be to make some sort of
typographical point or engage in some kind of spelling reform. As we
know, spelling reforms are usually aimed at making things simpler, not
more complex, and "odd" letters like Greek final sigma and Latin sharp-s
are part of that complexity. So it really shouldn't ever be necessary
to assign such a thing, but of course Asmus is correct; no code point is
100% safe.

My recommendation: Use the noncharacters. That's what they're there
for.

-Doug Ewell
 Fullerton, California



This archive was generated by hypermail 2.1.2 : Thu Aug 01 2002 - 09:47:55 EDT