a character for an unknown character
Jukka K. Korpela
jkorpela at cs.tut.fi
Sun Dec 25 11:31:28 CST 2016
21.12.2016, 4:29, Martin Mueller wrote:
> Is there a Unicode character that says “I represent an alphanumerical
> character, but I don’t know which”.
I think including such a “character” in Unicode would not fit into
the idea of Unicode as a system for encoding plain text characters. You
seem to be asking for a symbol that is not a graphic or control
character but information about uncertainty regarding a character in a
data stream. So I think this does not fall into the category of plain text,
and the information should be expressed at a higher protocol level, e.g.
in markup or as out-of-band information.
When it is not certain what character there is in some text to be
encoded, there is a wide range of possible situations. For example, it
might be a thing like “there is letter U or letter V, probably the
latter” or “there is some Latin letter but no hint of what it might be”
or even “there is an alphanumerical character” (though I find it
difficult to imagine such a situation). Such things can hardly be
described using new characters; rather, they need to be expressed using
verbal descriptions (which are about the encoded text, not part of it)
or some formal notations or both.
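
As a minimal sketch of what such a formal notation could look like, the
Python below keeps the uncertainty outside the encoded text, as a list of
annotations pointing into it. The names Gap, Transcription, candidates and
is_gap are hypothetical, invented for this illustration; they do not come
from EEBO TCP or from any standard.

```python
from dataclasses import dataclass, field

PLACEHOLDER = "\u25cf"  # BLACK CIRCLE, used here only as a visible stand-in


@dataclass
class Gap:
    """Out-of-band note about one unreadable character (hypothetical model)."""
    offset: int            # index into the text where the unreadable character sits
    candidates: str = ""   # possible readings, most likely first; empty if unknown
    note: str = ""         # free-form verbal description


@dataclass
class Transcription:
    text: str
    gaps: list = field(default_factory=list)

    def is_gap(self, index: int) -> bool:
        """True if the character at this index is annotated as unreadable."""
        return any(g.offset == index for g in self.gaps)


t = Transcription(
    text="lo" + PLACEHOLDER + "e " + PLACEHOLDER,
    gaps=[Gap(offset=2, candidates="vu",
              note="letter U or letter V, probably the latter")],
)
```

Here only the character at index 2 is annotated as a gap; the black
circle at index 5 carries no annotation, so it would be read as a symbol
that is actually present in the source.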
> This is a very common problem in
> the transcription of historical texts where you have lacunas. Often, the
> extent of the lacuna is known, and the alphabet is known as well. The
> EEBO TCP transcriptions of English texts before 1700 are good examples.
> They are SGML transcriptions, where missing stuff is represented by
> <gap/> elements with attributes about this or that. This is efficient
> when it comes to pages, very inefficient when it comes to individual
Efficient in what sense? Saving bytes can hardly be an issue here. And
if various attributes are needed to describe the case, then it would
become awkward to try to do the same with encoded characters (or
“characters”, Unicode code points).
> In the TCP project, various code points from the Geometrical were used
> to represent lacunae. The black circle (\u25cf) has been used as the
> character for a missing character. This is OK and unambiguous in its
If some graphic symbol is by convention used to represent a lacuna, then
the issue, as regards Unicode, is simply whether that symbol exists
as an encoded character or whether there is need to add that graphic
symbol to Unicode. But it would be a matter of encoding graphic
characters (irrespective of their meaning in some content), not about
encoding abstract ideas like “an unrecognized character”.
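
To illustrate, Python's standard unicodedata module can be asked what
Unicode itself records about U+25CF. This is only a small check, and the
variable names are arbitrary: the character data describes a graphic
symbol and says nothing about a "missing character" meaning.

```python
import unicodedata

ch = "\u25cf"

# The character name and general category, as recorded in the Unicode data.
name = unicodedata.name(ch)          # 'BLACK CIRCLE'
category = unicodedata.category(ch)  # 'So' (Symbol, other)

print(f"U+{ord(ch):04X} {name} [{category}]")
```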
> But would be nice to have a special character for just that
Various symbols are used in different contexts to indicate situations
like “there is a written symbol that cannot be recognized as a specific
character”. Perhaps there should be a universal convention about this,
but it is unrealistic to expect that to happen. The Unicode Standard can
hardly standardize such things. And if there were such a universal
symbol, it would surely have been encoded in Unicode—not because of its
meaning, but because of its consistent use as a character in plain text.
So I think the conclusion is that you should use established
conventions, if they exist, about using some symbol for such situations,
or define a convention as needed. You should not expect the character to
be recognized in this special meaning without such a higher-level
There’s a theoretical (?) problem with this. Let us assume that you
decide to use a particular character to represent “unknown character” in
your data, when working with some type of written texts. What happens
when you encounter, in the study of those texts, a graphic symbol that is
best identified as the character you decided to use in that special
meaning? Well, I think you can decide to solve that problem if it ever