Display Problems?
During an early period in the history of the Unicode® Standard, when software products were starting to
support Unicode text, it was often the case that products supported some Unicode characters
and scripts but not others. This created problems for users. For instance, people who
wanted to create Web content in different languages using Unicode characters couldn’t be certain that
the browser used to read the content would be able to display it legibly. As a result,
there was a broad need for tips on how to diagnose and solve display problems.
Today, the situation is much better. Major operating systems and browsers have broad
support for Unicode characters and scripts, and legible display of Unicode text is no longer the
widespread problem that it was in the early days.
Three kinds of text display problems might still occur in modern software
products:
- Lack of font support
- Incorrect shaping
- Incorrect characters
Other special considerations apply to the display of Unicode Emoji, but are not
covered here. For more information regarding emoji, see
FAQ: Emoji and Pictographs.
Lack of Font Support
Most operating systems include fonts that provide extensive coverage of Unicode
characters, and most applications know how to make use of the system fonts. There may be
gaps, however.
When Unicode text is displayed but some characters in the text lack font support,
the typical symptom is the appearance of special character-not-supported or
“tofu” glyphs. (Font vendors often refer to such glyphs as “.notdef” glyphs.) Often, this
will look like a white square box (like a piece of tofu), or a box containing a question mark or
diagonals. Some applications generate a fallback glyph that shows the code point for the
character.
Other symbols might also be used. Sometimes, there might just be blank space.
When this occurs, the underlying issue is most likely to be one of the following:
- The product might not yet have been updated to support characters added in the most
recent versions of the Unicode Standard.
- An operating system might have font support, but an application running on that OS
might have its own font selection or fallback logic that is not up to date with what’s
available in the latest version of the OS.
- Due to limited storage (especially on mobile devices) or other such factors, a vendor
might decide not to include font support for less-frequently-used characters.
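When reporting a gap like the ones above, it helps to know exactly which characters are in the text. The following is a minimal sketch, not part of any vendor tooling, that lists each character with its code point and name, much like the fallback glyphs that show the code point. The function name describe_characters is hypothetical. It relies on the Unicode data bundled with the local Python build, so a missing name can mean the character is unassigned, is newer than that data, or is a control character with no name. It does not inspect fonts.

```python
import unicodedata


def describe_characters(text):
    """Return one (code point label, name) pair per character in text."""
    rows = []
    for ch in text:
        label = f"U+{ord(ch):04X}"
        # unicodedata.name() returns the supplied default for characters
        # that have no name in this Python build's Unicode database.
        name = unicodedata.name(ch, "<no name in local Unicode data>")
        rows.append((label, name))
    return rows


if __name__ == "__main__":
    # The bundled Unicode version bounds which characters can be named.
    print("Local Unicode data version:", unicodedata.unidata_version)
    for label, name in describe_characters("A\u0906\U0001F600"):
        print(label, name)
```

Pasting the text that shows tofu glyphs into such a helper gives a list of code points that can be included in a report to the vendor.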
If you encounter this issue and have access to a font that does support the characters in
the text, you may be able to work around the issue if the application provides a way for
you to indicate that the text should be displayed with that font. In apps that support
text editing, there will usually be a way to select the font used to display the text. In
some cases, the app might not accept the font you select; if that happens, contact the app
vendor for help.
In apps that are not text editors, getting your custom font used might require tailoring
of font fallback logic used by the app. That is not a commonly available feature. Contact
the vendor to see if that is possible, or to report the gap in font support in their app.
If this issue occurs with Web content, it is likely that the content author has assumed
that an appropriate font can be supplied by the browser or by the host OS the browser is
running on. A better approach is for the content to use CSS Web fonts to control what
fonts are used to display the content. Contact the content author to suggest that option.
Incorrect Shaping
In some situations, text might display with recognizable characters of some script, but
not with the expected glyph forms, or without correct positioning of marks. For example,
within Arabic-script text, you might see a character that isn’t connecting to another
character as expected. Or in an Indic-script text, you might see a conjunct form, but not
the expected conjunct form.
These symptoms can be due to one of three issues: incorrect encoding of text, a
limitation or bug in software, or a limitation or bug in the font.
Incorrect Encoding of the Text
The content might not be using the appropriate Unicode characters for the text, or it
might not be using appropriate character sequences to represent certain text elements. The
text may look correct in some specific context (some specific software with a specific
font), but is not represented in an interoperable way that would work as intended in other
contexts.
For example, if Arabic-script text contains characters from the Arabic Presentation
Forms-A or Arabic Presentation Forms-B blocks, those characters would not display with
different connecting forms in different word contexts. The characters in those blocks are
for legacy or special-use purposes only and should not normally be used in Arabic-script
text.
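As a rough illustration, the sketch below looks for characters from the two presentation-forms blocks and maps them back to ordinary Arabic letters. The function names are hypothetical, and the block ranges (U+FB50 to U+FDFF and U+FE70 to U+FEFF) are taken from the Unicode code charts. Using NFKC normalization for the mapping is an assumption about what a content author might want; it also changes unrelated compatibility characters, so the output should be reviewed rather than applied blindly.

```python
import unicodedata

# Block ranges from the Unicode code charts. U+FEFF (the byte order mark)
# sits at the end of Arabic Presentation Forms-B, so it is excluded here.
PRESENTATION_RANGES = ((0xFB50, 0xFDFF), (0xFE70, 0xFEFE))


def find_presentation_forms(text):
    """Return (index, character) pairs for characters in the two blocks."""
    return [
        (i, ch)
        for i, ch in enumerate(text)
        if any(lo <= ord(ch) <= hi for lo, hi in PRESENTATION_RANGES)
    ]


def to_base_letters(text):
    """Map presentation forms to ordinary Arabic letters via NFKC.

    Caution: NFKC also rewrites other compatibility characters (for
    example full-width Latin letters), so review the result before use.
    """
    return unicodedata.normalize("NFKC", text)
```

For instance, find_presentation_forms applied to a string containing U+FE8D (the isolated form of alef) reports that character, and to_base_letters maps it to U+0627.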
Another common situation involves Indic scripts. Some characters, such as vowel letters,
have an appearance that’s like a combination of other characters, but these are not
considered equivalent in Unicode. For instance, U+0906 “आ” looks like a
combination of U+0905 “अ” plus U+093E “ा”. However, that sequence is not
equivalent to U+0906 and, in fact, is explicitly documented as one that should not be used. (See
Table 12-1, Devanagari Vowel Letters.)
Even so, some Devanagari-script content may incorrectly be
using such sequences to represent the vowel letters. And some software or font
implementations may intentionally be displaying the sequence with a different appearance
from the vowel letter to avoid potential security issues.
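The lack of equivalence can be checked directly. The sketch below, with hypothetical function names, confirms that none of the four Unicode normalization forms turns the sequence U+0905 U+093E into U+0906, and scans text for that one sequence. Only the single pair mentioned above is covered; other sequences are not included.

```python
import unicodedata

VOWEL_LETTER_AA = "\u0906"             # the vowel letter AA
DISCOURAGED_SEQUENCE = "\u0905\u093E"  # letter A followed by vowel sign AA


def normalization_unifies(a, b):
    """True if any standard normalization form makes a and b identical."""
    return any(
        unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
        for form in ("NFC", "NFD", "NFKC", "NFKD")
    )


def find_discouraged_sequences(text):
    """Return the start index of every U+0905 U+093E sequence in text."""
    hits = []
    start = text.find(DISCOURAGED_SEQUENCE)
    while start != -1:
        hits.append(start)
        start = text.find(DISCOURAGED_SEQUENCE, start + 1)
    return hits
```

Because normalization does not unify the two representations, searching and comparison will treat them as different text even when they look alike on screen.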
If you suspect the encoded representation used in the text is the problem, contact the
content author.
Software Limitation or Bug
Many scripts have complex rendering behaviours that require specific support in a
rendering or “shaping” engine. An app or operating system might be able to display a
default form of each character in its logical order, but not have the special logic
needed to correctly shape the text so that it appears as expected for that script.
Symptoms can include the following:
- Character sequences are not displayed in the expected direction for that script
(for example, left-to-right rather than right-to-left).
- Characters of a cursive-connecting script display with disconnected glyphs.
- Within syllable clusters, characters appear in the wrong order.
- Within syllable clusters, marks appear on the wrong base glyphs.
- Certain character sequences that are expected to display with a special form
instead display with a different form or with the default glyphs for each character.
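Before reporting one of these symptoms, it can be useful to confirm which characters are right-to-left or are combining marks, since those are the ones the shaping engine has to reorder or position. The following sketch, using a hypothetical function name, lists a few standard character properties from Python's bundled Unicode data. It only describes the encoded text; it cannot tell whether a particular shaping engine handles that text correctly.

```python
import unicodedata


def shaping_properties(text):
    """Return (code point, bidi class, general category, combining class)
    for each character in text."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.bidirectional(ch),  # e.g. "R" or "AL" for right-to-left
            unicodedata.category(ch),       # e.g. "Mn" or "Mc" for marks
            unicodedata.combining(ch),      # canonical combining class
        )
        for ch in text
    ]
```

For example, the Arabic letter U+0627 is reported with bidi class "AL", and the Devanagari sign U+094D is reported as a nonspacing mark ("Mn") with combining class 9.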
These symptoms point to a lack of correct shaping support in software. There might also be
font issues involved, as discussed below. If the same symptoms occur when using different fonts from
different vendors, that even more strongly suggests a software issue.
This could be a known limitation in the software: that version might not yet have shaping
support for characters added in the most recent versions of the Unicode Standard, or the vendor might
not yet have implemented support for that script. On the other hand, the vendor might have
added support for the script but with proprietary logic that doesn’t follow Unicode
specifications. Or, the software might simply have a bug.
If you suspect a software limitation or bug is the cause, contact the software vendor.
For particularly complex scripts, it’s also possible that the Unicode specifications for
that script are incomplete. That could lead to different software implementations
displaying the same character sequences in different ways, because there isn’t a complete
specification for how certain text elements should be encoded, or how the encoded
sequences should be displayed. If that’s the case, the Unicode Technical Committee can
consider proposals to extend the specifications for that script.
Font Limitation or Bug
For scripts that have complex rendering behaviours, fonts need to be correctly
implemented with certain layout data that determines what glyphs will be displayed and how
they will be positioned. (This is in addition to software needing to have appropriate
“shaping engine” support.) Typical symptoms include the following:
- Characters of a cursive-connecting script do not display with the correct connecting
form.
- Marks are not correctly positioned on the base glyph or they display over spaces.
- Certain character sequences that are expected to display with a special form instead
display with a different form or with the default glyphs for each character.
It’s possible the font has an incomplete implementation. For example, the font developer
may have added default glyphs for the characters of a script, matching what they see in
the Unicode code charts, but not added the additional glyphs required for correct display
of the script.
If you suspect a font issue, contact the font vendor, or try using a different font.
Incorrect Characters
Occasionally, you may see garbled text with incorrect characters. In some cases, you
might see several occurrences of “�” or another symbol such as “?”. Garbled text of this
kind is sometimes referred to as “mojibake”. These symptoms suggest an encoding error: most likely,
the text went through an incorrect encoding conversion.
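The “�” symbol is U+FFFD REPLACEMENT CHARACTER, which many decoders substitute for bytes that are not valid in the assumed encoding. As a small sketch with hypothetical function names, the code below decodes bytes as UTF-8 in a lenient mode that produces U+FFFD, and in a strict mode that reports whether the bytes are valid at all. The sample bytes are an assumed illustration: the word “café” saved in Latin-1.

```python
def decode_leniently(data, encoding="utf-8"):
    """Decode bytes, substituting U+FFFD for anything that is not valid."""
    return data.decode(encoding, errors="replace")


def decodes_cleanly(data, encoding="utf-8"):
    """Return True if the bytes are valid in the given encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# "café" saved as Latin-1: the final byte 0xE9 is not valid UTF-8 here.
LATIN1_SAMPLE = "café".encode("latin-1")
```

Here decode_leniently(LATIN1_SAMPLE) ends with U+FFFD in place of the last letter, while decodes_cleanly(LATIN1_SAMPLE) returns False, which is a hint that the bytes were not UTF-8 to begin with.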
The most likely cause is that the text was, at some earlier point, encoded in a
legacy encoding (or character set) but was not labelled, or was incorrectly labelled, with metadata
indicating the exact encoding.
For example, if a file containing the text “Русский” was encoded using the Windows-1251
encoding but was not labelled as such (with metadata contained inside the file or in the
repository holding the file), then an app reading that file might assume a different
encoding and interpret it as different characters. For instance, the software might assume
Windows-1252 encoding and then interpret the text as “Ðóññêèé”. Or, if other heuristics
suggested that the text was using Big 5 (Traditional Chinese) encoding, the app would interpret the
text as “唒嚭膱�”.
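The Windows-1251 and Windows-1252 case above can be reproduced in a few lines. This is a sketch with hypothetical function names, using Python's "cp1251" and "cp1252" codecs for the two Windows encodings. The repair step only works when the wrong decoding lost no bytes, which happens to be true for this sample; it is not a general fix.

```python
def misread(text, actual_encoding, assumed_encoding):
    """Encode text one way, then decode it with a wrong assumption."""
    data = text.encode(actual_encoding)
    # errors="replace" mimics software that shows U+FFFD for bad bytes.
    return data.decode(assumed_encoding, errors="replace")


def repair(garbled, assumed_encoding, actual_encoding):
    """Undo a wrong decoding, provided no bytes were lost along the way."""
    return garbled.encode(assumed_encoding).decode(actual_encoding)


GARBLED = misread("Русский", "cp1251", "cp1252")
RESTORED = repair(GARBLED, "cp1252", "cp1251")
```

GARBLED is “Ðóññêèé”, matching the example above, and RESTORED is the original “Русский”. Once a replacement character has been substituted for a byte, the original byte is gone and this kind of round trip no longer recovers the text.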
Good text encoding practice has always required that the encoding used for content be
explicitly declared in metadata. Today, best practice is that text be encoded using a
Unicode encoding form such as UTF-8.
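In code that reads or writes text, the same practice means naming the encoding explicitly instead of relying on a platform default. A minimal sketch in Python follows; the helper names are hypothetical.

```python
import os
import tempfile


def write_text_utf8(path, text):
    """Write text to path, always encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(text)


def read_text_utf8(path):
    """Read text from path, failing loudly if it is not valid UTF-8."""
    with open(path, "r", encoding="utf-8") as handle:
        return handle.read()


def round_trip(text):
    """Write text to a temporary file and read it back."""
    with tempfile.TemporaryDirectory() as folder:
        path = os.path.join(folder, "sample.txt")
        write_text_utf8(path, text)
        return read_text_utf8(path)
```

Reading with an explicit encoding raises an error on invalid bytes instead of silently producing garbled text, so a mislabelled file is noticed early.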
If you suspect an encoding or encoding conversion issue is the cause of the display
problem, then contact the content author. Or, if the content is maintained in some
repository, contact the agency maintaining that repository.