Characters that should be displayed?
Jukka K. Korpela
jkorpela at cs.tut.fi
Sun Jun 29 16:02:59 CDT 2014
2014-06-29 21:44, Koji Ishii wrote:
> The spec currently has the following text:
>> Control characters (Unicode class Cc) other than tab (U+0009), line
>> feed (U+000A), and carriage return (U+000D) are ignored for the
>> purpose of rendering. (As required by [UNICODE], unsupported
>> Default_ignorable characters must also be ignored for rendering.)
> and there’s a feedback saying that CSS should display visible glyphs
> for these control characters.
That would change the identity of the characters. They are by definition
“control characters”, i.e. they have no visible glyphs, but they may
have control effects. However, it might be argued that rendering them
somehow would not mean normal rendering but be a diagnostic indication
of an error. Those characters are invalid in HTML and XML (except XML
1.1, but who uses it?).
However, the tradition of web browsers is permissive in order to be
user-friendly. E.g., a casual control character somewhere might be
interesting to a *developer* or maintainer to notice, so that he could
analyze and fix the problem that caused it, but to a *user* (visitor),
it would mostly be just disturbing. He can’t fix the problem, and it is
mostly useless to him to see that the page has some control character in
the source. So *developer tools* should indicate such problems or
provide ways to detect them, but it seems correct to ignore them in normal
rendering.
> Since all major browsers do not display
> them today, this is a breaking-change
Well, I would not take that as a strong argument. This would be a change
in error processing. But it would not be useful for other reasons.
> I found the following text in Unicode 6.3, p. 185, "5.21 Ignoring
> Characters in Processing”:
>> Surrogate code points, private-use characters, and control
>> characters are not given the Default_Ignorable_Code_Point property.
>> To avoid security problems, such characters or code points, when
>> not interpreted and not displayable by normal rendering, should be
>> displayed in fallback rendering with a fallback glyph
> By looking at this, my questions are as follows:
> 1. Should control characters that browsers do not interpret be
> displayed in fallback rendering?
It is reasonable to interpret that there are no such control characters,
because all control characters except those with special handling are
interpreted as being invalid data and therefore ignored.
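
To make the rule in the quoted spec text concrete, here is a minimal sketch in Python. It is only an illustration: the names RENDERED_CONTROLS, is_ignored_control and strip_ignored_controls are made up for this example, and a real rendering engine would apply the rule at a quite different layer.

```python
import unicodedata

# The three controls the quoted spec text exempts from being ignored.
RENDERED_CONTROLS = {"\t", "\n", "\r"}


def is_ignored_control(ch: str) -> bool:
    """True for a class Cc character other than tab, LF, and CR."""
    return unicodedata.category(ch) == "Cc" and ch not in RENDERED_CONTROLS


def strip_ignored_controls(text: str) -> str:
    """Drop the characters that would be ignored for rendering."""
    return "".join(ch for ch in text if not is_ignored_control(ch))
```

Note that class Cc covers both the C0 controls (and U+007F) and the C1 controls U+0080 to U+009F, so both ranges are dropped here.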
> 2. Should private-use characters
> (U+E000-F8FF, 0F0000-0FFFFD, 100000-10FFFD) without glyphs be
> displayed in fallback rendering?
They might be seen as “not displayable by normal rendering”, so yes. On
the practical side, although Private Use characters should not be used
in public information interchange, they are increasingly popular in
“icon font” tricks. Whatever we think of such tricks, users should not
be punished for them. If the trick fails (usually because a page uses a
downloadable font for icon glyphs allocated to Private Use codepoints
but something prevents the use of such a font), it is relevant to the
user to know that there is *some* data, which can be crucial (e.g., an
item in a navigation menu). So some dull fallback rendering is probably
better than simply ignoring the characters.
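
As a rough sketch of what such a fallback could look like, assuming Python: the ranges are the ones quoted in the question, while has_glyph is a hypothetical callback standing in for a font lookup, and the use of U+FFFD as the fallback is just an assumption for this example (a browser would more likely draw a box or similar).

```python
# Private Use ranges as listed in the question above.
PRIVATE_USE_RANGES = (
    (0xE000, 0xF8FF),
    (0xF0000, 0xFFFFD),
    (0x100000, 0x10FFFD),
)


def is_private_use(ch: str) -> bool:
    """True if the character lies in one of the Private Use ranges."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in PRIVATE_USE_RANGES)


def with_fallback(text: str, has_glyph, fallback: str = "\ufffd") -> str:
    """Replace Private Use characters that have no glyph with a fallback.

    has_glyph is a hypothetical callable (character -> bool) that stands
    in for "some available font has a glyph for this character".
    """
    return "".join(
        fallback if is_private_use(ch) and not has_glyph(ch) else ch
        for ch in text
    )
```

The point is only that the character leaves a visible trace instead of vanishing, so the user can tell that a menu item or icon is missing.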
> 3. When the above text says “surrogate code points”, does that mean
> everything outside BMP?
No, it means code points that do not represent *any* characters due to
being in certain special areas in the coding space. They are invalid in
HTML and in XML. If they appear in data, the reason is usually that
UTF-16 encoded data containing non-BMP characters is being processed in
a wrong way. At the level of interpreting a byte stream as a stream of
characters, surrogate code *units* in UTF-16 should be processed and
interpreted in pairs so that one pair is taken as one character. And
when CSS gets at it, it only sees the character in the DOM.
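
The pairing can be sketched as follows, in Python. The function names are invented for this illustration, and passing an unpaired surrogate through unchanged is a choice made for the example only, to show how a surrogate code point can leak into the data when pairing goes wrong.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Combine a UTF-16 high/low surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def decode_units(units: list[int]) -> list[int]:
    """Turn UTF-16 code units into code points, pairing surrogates.

    An unpaired surrogate is passed through unchanged (an assumption
    for this sketch); that is the kind of invalid code point that a
    faulty process can leave in the data.
    """
    out = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        if is_high and i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
            out.append(combine_surrogates(unit, units[i + 1]))
            i += 2
        else:
            out.append(unit)
            i += 1
    return out
```

For instance, the code units D83D DE00 combine into the single non-BMP code point U+1F600, whereas D83D followed by an ordinary character stays a lone surrogate.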
It is adequate to ignore surrogate code points, since they are invalid
and signalling them to users (as opposed to developers) would hardly do
> 4. Should every code point that are not
> given the Default_Ignorable_Code_Point property and that without
> interpretations nor glyphs displayed in fallback rendering? I could
> not find such statement in Unicode spec, but there are some people
> who believe so.
> 5. Is there anything else Unicode recommends to
> display in fallback rendering, or not to display? This must be RTFM,
> but pointing out where to read would be appreciated.
From the Unicode point of view, an implementation may decide what
characters it supports. What it does to characters that it does not
support seems to be generally up to the implementation to decide as
regards rendering. Here, too, I would consider the practical impact
on users. If a page contains characters that have no glyphs in the fonts
that are used, then the page has data that is probably valid but cannot
be rendered in a particular situation. Showing some indication of this
is relevant, because the user knows he is missing something real, and he
might be able to fix the situation in various ways (e.g., changing
browser settings, downloading and installing extra fonts, or just
switching to a different browser – browsers are known to differ in their
abilities to use the fonts installed in a system).