U+hhhh[h[h]] NAME syntax
Asmus Freytag (c)
asmusf at ix.netcom.com
Sat Aug 13 19:19:15 CDT 2016
On 8/13/2016 2:47 PM, Doug Ewell wrote:
> PDF is a presentation format. If the editorial committee sets
> character names in lowercase "under the hood" so that they will end up
> looking good in Minion smallcaps in the PDF file, and a user
> subsequently scrapes the PDF file for content, it doesn't mean there's
> anything formal or normative about setting character names in lowercase.
Character names, when presented in the Unicode Character Database, are
uppercase. The general approach in Unicode is to define property names
and values so that case distinctions are not needed to resolve
identifiers unambiguously (the same goes for spaces and most hyphens).
That means the presentation can be adapted flexibly to the style of the
document (e.g. the Core Specification has a different style than other
documents) while still retaining unambiguous identification of the
character.
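To make the matching idea concrete, here is a minimal Python sketch of loose name comparison. It assumes the rule meant here is the one I understand UAX #44 to call UAX44-LM2: ignore case, whitespace, underscores, and medial hyphens, with the single exception of U+1180 HANGUL JUNGSEONG O-E. The function name loose_key is hypothetical and this is not an official implementation.

```python
import re

def loose_key(name: str) -> str:
    """Reduce a character name to a key for loose comparison (sketch)."""
    # Special case: the hyphen in HANGUL JUNGSEONG O-E is significant,
    # because HANGUL JUNGSEONG OE is a different character.
    compact = re.sub(r"[\s_]+", "", name).lower()
    if compact == "hanguljungseongo-e":
        return compact
    # Drop only medial hyphens, i.e. those directly between two
    # letters or digits. This must happen before spaces are removed,
    # so that a hyphen following a space (as in "LETTER -A") survives.
    no_medial = re.sub(r"(?<=[A-Za-z0-9])-(?=[A-Za-z0-9])", "", name)
    # Finally drop whitespace and underscores, and fold case.
    return re.sub(r"[\s_]+", "", no_medial).lower()

# Different presentations of the same name reduce to the same key.
print(loose_key("ZERO WIDTH SPACE") == loose_key("zero-width_space"))
# A non-medial hyphen stays significant.
print(loose_key("TIBETAN LETTER -A") == loose_key("TIBETAN LETTER A"))
```

With a key like this, a document can set names in whatever case its style calls for and still resolve them against the uppercase names in the database.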
I believe that small caps generally look nice and distinctive. For HTML,
the way to do this is with a CSS style that allows the underlying text
representation to be uppercase while showing lowercase small-cap
letters. Marcel, I believe, gave an example, although something like
this was used as early as Unicode 5.0 for the UAXs, when we printed them
as part of the book.
For plain text, all caps is the easiest way to make the character name
stick out and prevent misinterpretation of it as part of the surrounding
text. The question then becomes how much of the character name to show,
and in which order.
I'm personally partial to U+nnnn (x) CHARACTER NAME. In some cases, this
requires some edits to make the text flow, but it has the advantage of
being unambiguous, and it works well for characters of all scripts and
categories, including marks and punctuation. In some instances
U+nnnn (x) transliterated name works well. I like the use of ( )
instead of " " (curly or not) because the latter is hopeless at
showing any combining marks above (they get lost among the "").
However, notations like x (U+nnnn) also work pretty well, especially
when all the "x" are from a distinct-looking script. The same goes for x
CHARACTER NAME (U+nnnn). In many cases, there really isn't a need to
quote the glyph, and not doing so can reduce clutter.
In short, this isn't a one-size-fits-all kind of situation.