UTF-8, UTF-16, UTF-32 & BOM
General questions, relating to UTF or Encoding Form
Q: Is Unicode a 16-bit encoding?
A: No. The first version of Unicode was a 16-bit encoding, from 1991 to 1995, but starting with Unicode 2.0 (July, 1996), it has not
been a 16-bit encoding. The Unicode Standard encodes characters in the range U+0000..U+10FFFF, which amounts to a 21-bit code space. Depending on the
encoding form you choose (UTF-8, UTF-16, or UTF-32), each character will then be represented either as a sequence of one to four 8-bit bytes,
one or two 16-bit code units, or a single 32-bit code unit.
Q: Can Unicode text be represented in more than one way?
A: Yes, there are several possible representations of
Unicode data, including UTF-8, UTF-16 and UTF-32. In addition,
there are compression transformations such as the one described in the
UTS #6: A Standard Compression Scheme for Unicode (SCSU).
Q: What is a UTF?
A: A Unicode transformation format (UTF) is an
algorithmic mapping from every Unicode code point (except surrogate code
points) to a unique byte
sequence. The ISO/IEC 10646 standard uses the term “UCS transformation
format” for UTF; the two terms are merely synonyms for the same concept.
Each UTF is reversible, thus every UTF supports lossless round tripping: mapping
from any Unicode coded character sequence S to a sequence of bytes and
back will produce S again. To ensure round tripping, a UTF mapping
must map all code points (except surrogate code points) to
unique byte sequences. This includes reserved (unassigned) code points and the 66 noncharacters
(including U+FFFE and U+FFFF).
compression method, even though it is reversible, is not a UTF because the same string can map to very
many different byte sequences, depending on the particular SCSU
Q: Where can I get more information on
A: For the formal definition of UTFs see
Section 3.9, Unicode Encoding Forms in The Unicode Standard. For more information on encoding
forms see UTR #17: Unicode Character Encoding Model.
Q: How do I write a UTF converter?
A: The freely available open source project International Components for Unicode (ICU) has UTF conversion built into it. The latest version may be downloaded from the ICU Project web site. [AF]
Q: Are there any byte sequences that
are not generated by a UTF? How should I interpret them?
A: None of the UTFs can generate every arbitrary byte
sequence. For example, in UTF-8 every byte of the form 110xxxxx2
must be followed with a byte of the form 10xxxxxx2.
A sequence such as <110xxxxx2 0xxxxxxx2>
is illegal, and must never be generated. When faced with this illegal
byte sequence while transforming or interpreting, a UTF-8 conformant
process must treat the first byte 110xxxxx2 as an
illegal termination error: for example, either signaling an error,
filtering the byte out, or representing the byte with a marker such as
FFFD (REPLACEMENT CHARACTER). In the latter two cases, it will continue
processing at the second byte 0xxxxxxx2.
A conformant process must not interpret illegal or
ill-formed byte sequences as characters, however, it may take error
recovery actions. No conformant process may use irregular byte
sequences to encode out-of-band information.
Q: Which of the UTFs do I need to support?
A: UTF-8 is most common on the web. UTF-16 is used by Java and Windows. UTF-8 and UTF-32
used by Linux and various Unix systems. The conversions between all of them are
algorithmically based, fast and lossless. This makes it easy to support
data input or output in multiple formats, while using a particular UTF
for internal storage or processing.
Q: What are some of the differences
between the UTFs?
A: The following table summarizes some of the properties of
each of the UTFs.
|Smallest code point
|Largest code point
|Code unit size
|Fewest bytes per character
|Most bytes per character
In the table <BOM> indicates that the byte order is
determined by a byte order mark, if present at the beginning of the data
stream, otherwise it is big-endian. [AF]
Q: Why do some of the UTFs have a BE or LE
in their label, such as UTF-16LE?
A: UTF-16 and UTF-32 use code units that are two and four
bytes long respectively. For these UTFs, there are three sub-flavors:
BE, LE and unmarked. The BE form uses big-endian byte serialization
(most significant byte first), the LE form uses little-endian byte
serialization (least significant byte first) and the unmarked form uses
big-endian byte serialization by default, but may include a byte order
mark at the beginning to indicate the actual byte serialization used. [AF]
Q: Is there a standard method to package a
Unicode character so it fits an 8-Bit ASCII stream?
A: There are three or four options for making Unicode fit into
an 8-bit format.
a) Use UTF-8. This preserves ASCII, but not Latin-1,
because the characters >127 are different from Latin-1. UTF-8 uses
the bytes in the ASCII only for ASCII characters. Therefore, it works
well in any environment where ASCII characters have a significance as
syntax characters, e.g. file name syntaxes, markup languages, etc., but
where the all other characters may use arbitrary bytes.
Example: “Latin Small Letter s with Acute” (015B) would be
encoded as two bytes: C5 9B.
b) Use Java or C style escapes, of the form \uXXXXX or \xXXXXX.
This format is not standard for text files, but well defined in the
framework of the languages in question, primarily for source files.
Example: The Polish word “wyjście” with character “Latin Small
Letter s with Acute” (015B) in the middle (ś is one character) would
look like: “wyj\u015Bcie".
c) Use the
&#DDDDD; numeric character escapes
as in HTML or XML. Again, these are not standard for plain text files,
but well defined within the framework of these markup languages.
Example: “wyjście” would look like “
d) Use SCSU.
This format compresses Unicode into 8-bit format, preserving most of
ASCII, but using some of the control codes as commands for the decoder.
However, while ASCII text will look like ASCII text after being encoded
in SCSU, other characters may occasionally be encoded with the same byte
values, making SCSU unsuitable for 8-bit channels that blindly interpret
any of the bytes as ASCII characters.
Example: “<SC2> wyjÛcie” where <SC2> indicates the byte 0x12 and
“Û” corresponds to byte 0xDB. [AF]
Q: Which of these approaches is the best?
A: That depends on the circumstances: Of these four
approaches, d) uses the least space, but cannot be used transparently in most 8-bit environments. a) is the most widely supported in
plain text files and b) and c) use the most space, but are widely
supported for program source files in Java and C, or within HTML and XML files respectively.
Q: Which of these formats is the most standard?
A: All four require that the receiver can understand that
format, but a) is considered one of the three equivalent Unicode
Encoding Forms and therefore standard. The use of b), or c) out of their
given context would definitely be considered non-standard, but could be
a good solution for internal data transmission. The use of SCSU is
itself a standard (for compressed data streams) but few general purpose
receivers support SCSU, so it is again most useful in internal data
Q: What is the definition of UTF-8?
A: UTF-8 is the byte-oriented encoding form of Unicode. For
details of its definition, see Section 2.5, Encoding Forms and Section
3.9, Unicode Encoding Forms ” in The Unicode Standard. See, in particular, Table 3-6 UTF-8 Bit Distribution
and Table 3-7 Well-formed UTF-8 Byte Sequences, which give
succinct summaries of the encoding form. Make sure you refer to the latest version of the
Unicode Standard, as the
Unicode Technical Committee has tightened the definition of UTF-8
over time to more strictly enforce unique sequences and to prohibit
encoding of certain invalid characters. There is an Internet
about UTF-8. UTF-8 is also defined in Annex D of ISO/IEC 10646. See also
the question above, How do I write a UTF converter?
Q: Is the UTF-8 encoding scheme the same
irrespective of whether the underlying processor is little endian or big
A: Yes. Since UTF-8 is interpreted as a sequence of bytes,
there is no endian problem as there is for encoding forms that use
16-bit or 32-bit code units. Where a BOM is used with UTF-8, it is
only used as an encoding signature to distinguish UTF-8 from other encodings — it has nothing
to do with byte order.
Q: Is the UTF-8 encoding scheme the same
irrespective of whether the underlying system uses ASCII or EBCDIC
A: There is only one definition of UTF-8. It is precisely the same,
whether the data were converted from ASCII or EBCDIC based character
sets. However, byte sequences from standard UTF-8 won’t interoperate
well in an EBCDIC system, because of the different arrangements of
control codes between ASCII and EBCDIC.
UTF-EBCDIC defines is a specialized UTF that will
interoperate in EBCDIC systems.
Q: How do I convert a UTF-16 surrogate
pair such as <D800 DC00> to UTF-8? As one 4-byte sequence or as two
separate 3-byte sequences?
A: The definition of UTF-8 requires that supplementary
characters (those using surrogate pairs in UTF-16) be encoded with a
single 4-byte sequence. However, there is a widespread practice of generating
pairs of 3-byte sequences in older software, especially software which pre-dates the
introduction of UTF-16 or that is interoperating with UTF-16
environments under particular constraints. Such an encoding is not conformant
to UTF-8 as defined. See UTR
#26: Compatability Encoding Scheme for UTF-16: 8-bit (CESU) for a
formal description of such a non-UTF-8 data format. When using CESU-8,
great care must be taken that data is not accidentally treated as if it
was UTF-8, due to the similarity of the formats.
Q: How do I convert an unpaired UTF-16 surrogate
A different issue arises if an unpaired surrogate is
encountered when converting ill-formed UTF-16 data. By representing such
an unpaired surrogate on its own as
a 3-byte sequence, the resulting UTF-8 data stream would become
ill-formed. While it faithfully reflects the nature of the input,
Unicode conformance requires that encoding form conversion always
results in a valid data stream. Therefore a converter must treat
this as an error. [AF]
Q: What is UTF-16?
A: UTF-16 uses a single 16-bit code unit to encode the most
common 63K characters, and a pair of 16-bit code units, called
surrogates, to encode the 1M less commonly used characters in Unicode.
Originally, Unicode was designed as a pure 16-bit
encoding, aimed at representing all modern scripts. (Ancient scripts
were to be represented with private-use characters.) Over time, and
especially after the addition of over 14,500 composite characters for
compatibility with legacy sets, it became clear that 16-bits were not
sufficient for the user community. Out of this arose UTF-16.
Q: What are surrogates?
A: Surrogates are code points from two special ranges of Unicode
for use as the leading, and trailing values of paired code units
in UTF-16. Leading, also called high, surrogates are
from D80016 to DBFF16, and trailing, or low,
surrogates are from DC0016 to DFFF16. They are called
surrogates, since they do not represent characters directly, but only as a
Q: What’s the algorithm to convert from
UTF-16 to character codes?
A: The Unicode Standard used to contain a short algorithm,
now there is just a bit distribution table. Here are three short code snippets
that translate the information from the bit distribution table into C
code that will convert to and from UTF-16.
Using the following type definitions
typedef unsigned int16 UTF16;
typedef unsigned int32 UTF32;
the first snippet calculates
the high (or leading) surrogate from a character code C.
const UTF16 HI_SURROGATE_START = 0xD800
UTF16 X = (UTF16) C;
UTF32 U = (C >> 16) & ((1 << 5) - 1);
UTF16 W = (UTF16) U - 1;
UTF16 HiSurrogate = HI_SURROGATE_START | (W << 6) | X >> 10;
where X, U and W correspond to the labels used in Table
3-5 UTF-16 Bit Distribution. The next snippet does the same for the low surrogate.
const UTF16 LO_SURROGATE_START = 0xDC00
UTF16 X = (UTF16) C;
UTF16 LoSurrogate = (UTF16) (LO_SURROGATE_START | X & ((1 << 10) - 1));
Finally, the reverse, where hi and lo are the high and low
surrogate, and C the resulting character
UTF32 X = (hi & ((1 << 6) -1)) << 10 | lo & ((1 << 10) -1);
UTF32 W = (hi >> 6) & ((1 << 5) - 1);
UTF32 U = W + 1;
UTF32 C = U << 16 | X;
A caller would need to ensure that C, hi, and lo are in the
appropriate ranges. [AF]
Q: Isn’t there a simpler way to do this?
A: There is a much simpler computation that does not try to
follow the bit distribution table.
const UTF32 LEAD_OFFSET = 0xD800 - (0x10000 >> 10);
const UTF32 SURROGATE_OFFSET = 0x10000 - (0xD800 << 10) - 0xDC00;
UTF16 lead = LEAD_OFFSET + (codepoint >> 10);
UTF16 trail = 0xDC00 + (codepoint & 0x3FF);
UTF32 codepoint = (lead << 10) + trail + SURROGATE_OFFSET;
Q: Why are some people opposed to UTF-16?
A: People familiar with variable width East Asian character
sets such as Shift-JIS ( SJIS) are understandably nervous about UTF-16,
which sometimes requires two code units to represent a single character.
They are well acquainted with the problems that variable-width
codes have caused.
However, there are some important differences between the mechanisms
used in SJIS and UTF-16:
In SJIS, there is overlap between the leading and
trailing code unit values, and between the trailing and single code unit values. This causes a number of problems:
It causes false matches. For example, searching for
an “a” may match against the trailing code unit of a Japanese character.
It prevents efficient random access. To know whether
you are on a character boundary, you have to search backwards to
find a known boundary.
It makes the text extremely fragile. If a unit is
dropped from a leading-trailing code unit pair, many following characters can be
In UTF-16, the code point ranges for high and low
surrogates, as well as for single units are all completely disjoint.
None of these problems occur:
There are no false matches.
The location of the character boundary can be directly
determined from each code unit value.
A dropped surrogate will corrupt only a single
The vast majority of SJIS characters require 2 units,
but characters using single units occur commonly and often have
special importance, for example in file names.
With UTF-16, relatively few characters require 2 units.
The vast majority of characters in common use are single code units.
Even in East Asian text, the incidence of surrogate pairs should be
well less than 1% of all text storage on average. (Certain
documents, of course, may have a higher incidence of surrogate
pairs, just as phthisique is an fairly infrequent word in
English, but may occur quite often in a particular scholarly text.)
Q: Will UTF-16 ever be extended to more
than a million characters?
A: No. Both Unicode and ISO 10646 have
policies in place that formally limit future code assignment to
the integer range that can be expressed with current UTF-16 (0 to
1,114,111). Even if other encoding forms (i.e. other UTFs) can represent
larger integers, these policies mean that all encoding forms will
always represent the same set of characters. Over a million possible codes is far more than enough
for the goal of Unicode of encoding characters, not glyphs. Unicode is not designed to encode arbitrary data. If
you wanted, for example, to give each “instance of a character on paper
throughout history” its own code, you might need trillions or
quadrillions of such codes; noble as this effort might be, you would not
use Unicode for such an encoding. [AF]
Q: Are there any 16-bit values that are
A: Unpaired surrogates are invalid in UTFs. These include any value
in the range D80016 to DBFF16 not followed by a value in the range DC0016
to DFFF16, or any value in the range DC0016 to DFFF16 not preceded by a
value in the range D80016 to DBFF16. [AF]
Q: What about noncharacters? Are they invalid?
A: Not at all. Noncharacters are valid in UTFs and must be properly converted.
For more details on the definition and use of noncharacters, as well as their correct representation in each UTF,
see the Noncharacters FAQ.
Q: Because most supplementary characters are uncommon, does that mean I can ignore them?
A: Most supplementary characters (expressed with surrogate pairs in
UTF-16) are not too common. However, that does not mean that
supplementary characters should be neglected. Among them are a number of
individual characters that are very popular, as well as many sets
important to East Asian procurement specifications. Among the notable
supplementary characters are:
many popular emoji and emoticons
symbols used for interoperating with Wingdings and Webdings
numerous small sets of CJK characters important for procurement, including personal and place names
variation selectors used for all ideographic variation sequences
numerous minority scripts important for some user communities
some highly salient historic scripts, such as Egyptian hieroglyphics
Ken Lunde has an interesting presentation file on this topic, with a Top Ten list: Why Support Beyond-BMP Code Points?
Q: How should I handle supplementary characters in my code?
A: Compared with BMP characters as a whole, the supplementary characters
occur less commonly in text. This remains true now, even though many
thousands of supplementary characters have been added to the standard,
and a few individual characters, such as popular emoji, have become
quite common. The relative frequency of BMP characters, and of
the ASCII subset within the BMP, can be taken into account when
optimizing implementations for best performance: execution speed, memory
usage, and data storage.
Such strategies are particularly useful for UTF-16 implementations,
where BMP characters require one 16-bit code unit to process or store,
whereas supplementary characters require two.
Strategies that optimize for the BMP are less useful for UTF-8
implementations, but if the distribution of data warrants it, an
optimization for the ASCII subset may make sense, as that subset only
requires a single byte for processing and storage in UTF-8.
Q: What is the difference between UCS-2 and UTF-16?
A: UCS-2 is obsolete terminology which refers to a Unicode implementation up to Unicode 1.1, before surrogate code points and UTF-16 were added to Version 2.0 of the standard. This term should now be avoided.
UCS-2 does not describe a data format distinct from UTF-16, because
both use exactly the same 16-bit code unit representations. However,
UCS-2 does not interpret surrogate code points, and thus
cannot be used to conformantly represent supplementary characters.
Sometimes in the past an implementation has been labeled "UCS-2" to indicate that it does not support supplementary characters and doesn't interpret pairs of surrogate code points as characters. Such an implementation would not handle processing of character properties, code point boundaries, collation, etc. for supplementary characters. [AF]
Q: What is UTF-32?
A: Any Unicode character can be
represented as a single 32-bit unit in UTF-32. This single 4 code unit
corresponds to the Unicode scalar value, which is the abstract number
associated with a Unicode character. UTF-32 is a subset of the encoding
mechanism called UCS-4 in ISO 10646. For more information, see Section 3.9, Unicode Encoding Forms in The Unicode Standard.
Q: Should I use UTF-32 (or UCS-4) for
storing Unicode strings in memory?
A: This depends. If you frequently need to access APIs that
require string parameters to be in UTF-32, it may be more convenient to
work with UTF-32 strings all the time. However, the downside of UTF-32
is that it forces you to use 32-bits for each character, when only 21
bits are ever needed. The number of significant bits needed for the
average character in common texts is much lower, making the ratio
effectively that much worse. In many situations that does not matter,
and the convenience of having a fixed number of code units per character
can be the deciding factor.
Increasing the storage for the same
number of characters does have its cost in applications dealing with
large volume of text data: it can mean exhausting cache limits sooner;
it can result in noticeably increased read/write times or in reaching
bandwidth limits; and it requires more space for storage. What a number of implementations do is to represent strings with
UTF-16, but individual character values with
The chief selling point for Unicode is providing a
representation for all the world’s characters, eliminating the need for
juggling multiple character sets and avoiding the associated data corruption
problems. These features were enough to swing industry to the side of
using Unicode (UTF-16). While a UTF-32 representation does make the
programming model somewhat simpler, the increased average storage size
has real drawbacks, making a complete transition to UTF-32 less compelling.
Q: How about using UTF-32 interfaces in my
A: Except in some environments that store text as UTF-32 in
memory, most Unicode APIs are using UTF-16. With UTF-16 APIs the
indexing is at the storage or code unit level, with higher-level mechanisms
for graphemes or words specifying their boundaries in terms of the
code units. This provides efficiency at the low levels, and the
required functionality at the high levels.
If its ever necessary to locate the nth
character, indexing by character can be implemented as a high level
operation. However, while converting
from such a UTF-16 code unit index to a character index or vice versa is fairly
straightforward, it does involve a scan through the 16-bit units up to
the index point. In a test run, for example, accessing UTF-16 storage as
characters, instead of code units resulted in a 10× degradation. While
there are some interesting optimizations that can be performed, it will
always be slower on average. Therefore locating other boundaries, such
as grapheme, word, line or sentence boundaries proceeds directly from
the code unit index, not indirectly via an intermediate character code
Q: Doesn’t it cause a problem to have
only UTF-16 string APIs, instead of UTF-32 char APIs?
A: Almost all international functions (upper-, lower-,
titlecasing, case folding, drawing, measuring, collation,
transliteration, grapheme-, word-, linebreaks, etc.) should take
string parameters in the API, not single code-points
(UTF-32). Single code-point APIs almost always produce the wrong results
except for very
simple languages, either because you need more context to get the right answer,
or because you need to generate a sequence of characters to return
the right answer, or both.
For example, any Unicode-compliant
collation (See UTS #10: Unicode Collation Algogrithm (UCA)) must be able to handle sequences of more than one
code-point, and treat that sequence as a single entity. Trying to collate by handling single code-points
at a time, would get the wrong answer. The same will happen for drawing
or measuring text a single code-point at a time; because scripts like
Arabic are contextual, the width of x plus the width of y is not equal
to the width of xy. Once you get beyond basic typography, the same is
true for English as well; because of kerning and ligatures the width of
“fi” in the font may be different than the width of “f” plus the width
of “i". Casing operations must return strings, not single code-points;
http://www.unicode.org/charts/case/ . In particular, the title
casing operation requires strings as input, not single code-points at a
Storing a single code point
in a struct or class instead of a string, would exclude support for
graphemes, such as “ch” for Slovak, where a single code point may not be sufficient,
but a character sequence is needed to express what
is required. In other words, most API parameters and fields of composite
data types should
not be defined as a character, but as a string. And if they are
strings, it does not matter what the internal representation of the
Given that any industrial-strength text and
internationalization support API has to be able to handle sequences of
characters, it makes
little difference whether the string is internally represented by a
sequence of UTF-16 code units, or by a sequence of code-points ( = UTF-32 code units).
Both UTF-16 and UTF-8 are designed to make working with substrings easy,
by the fact that the sequence of code units for a given code point is
Q: Are there exceptions to the rule of exclusively using
string parameters in APIs?
A: The main exception are very low-level
operations such as getting character properties (e.g. General Category
or Canonical Class in the UCD). For those it is handy to have interfaces
that convert quickly to and from UTF-16 and UTF-32, and that allow you
to iterate through strings returning UTF-32 values (even though the
internal format is UTF-16).
Q: How do I convert a UTF-16 surrogate
pair such as <D800 DC00> to UTF-32? As one 4-byte sequence or as two
A: The definition of UTF-32 requires that supplementary
characters (those using surrogate pairs in UTF-16) be encoded with a
single 4-byte sequence.
Q: How do I convert an unpaired UTF-16 surrogate
A: If an unpaired surrogate is encountered when
converting ill-formed UTF-16 data, any conformant converter must
treat this as an error. By representing such an unpaired surrogate on its
own, the resulting UTF-32 data stream would become ill-formed. While it
faithfully reflects the nature of the input, Unicode conformance
requires that encoding form conversion always results in valid data
Byte Order Mark (BOM) FAQ
Q: What is a BOM?
A: A byte order mark (BOM) consists of the character
code U+FEFF at the beginning of a data stream, where it can be used
as a signature defining the byte order and encoding form, primarily of unmarked plaintext
files. Under some higher level protocols, use of a BOM may be mandatory
(or prohibited) in the Unicode data stream defined in that
Q: Where is a BOM useful?
A: A BOM is useful at the beginning of files that are typed as
text, but for which it is not known whether they are in big or little endian format—it
can also serve as a hint indicating that the file is in Unicode, as
opposed to in a legacy encoding and furthermore, it act as a signature
for the specific encoding form used. [AF]
Q: What does ‘endian’ mean?
A: Data types longer than a byte can be stored in computer
memory with the most significant byte (MSB) first or last. The former is
called big-endian, the latter little-endian. When data is exchanged, bytes
that appear in the "correct" order on the sending system may appear to be
out of order on the receiving system. In that situation, a BOM would look
like 0xFFFE which is a noncharacter, allowing the receiving system to
apply byte reversal before processing the data. UTF-8 is byte oriented and
therefore does not have that issue. Nevertheless, an initial BOM might be
useful to identify the datastream as UTF-8. [AF]
Q: When a BOM is used, is it only in
16-bit Unicode text?
A: No, a BOM can be used as a signature no matter how the
Unicode text is transformed: UTF-16, UTF-8, or UTF-32. The exact bytes
comprising the BOM will be whatever the Unicode character U+FEFF is
converted into by that transformation format. In that form, the BOM
serves to indicate both that it is a Unicode file, and which of the
formats it is in. Examples:
|00 00 FE FF
|FF FE 00 00
|EF BB BF
Q: Can a UTF-8 data stream contain the BOM
character (in UTF-8 form)? If yes, then can I still assume the remaining
UTF-8 bytes are in big-endian order?
A: Yes, UTF-8 can contain a BOM. However, it makes no
difference as to the endianness of the byte stream. UTF-8 always has the
same byte order. An initial BOM is only used as a signature — an
indication that an otherwise unmarked text file is in UTF-8. Note that
some recipients of UTF-8 encoded data do not expect a BOM. Where UTF-8
is used transparently in 8-bit environments, the use of a BOM
will interfere with any protocol or file format that expects specific
ASCII characters at the beginning, such as the use of "#!" of at the
beginning of Unix shell scripts.
Q: What should I do with U+FEFF in the
middle of a file?
A: In the absence of a protocol supporting its use as a BOM and when not at the
beginning of a text stream, U+FEFF should normally not occur. For
backwards compatibility it should be treated as ZERO WIDTH
NON-BREAKING SPACE (ZWNBSP),
and is then part of the content of the file or string. The use of
U+2060 WORD JOINER is strongly preferred over ZWNBSP for expressing word joining
semantics since it cannot be confused with a BOM. When designing a markup
language or data protocol, the use of U+FEFF can be restricted to that
of Byte Order Mark. In that case, any U+FEFF occurring in the middle of a file can be treated as an
unsupported character. [AF]
Q: I am using a protocol that has BOM at
the start of text. How do I represent an initial ZWNBSP?
A: Use U+2060 WORD JOINER instead.
Q: How do I tag data that does not
interpret U+FEFF as a BOM?
A: Use the tag UTF-16BE to indicate big-endian
UTF-16 text, and UTF-16LE to indicate little-endian UTF-16
text. If you do use a BOM, tag the text as simply UTF-16.
Q: Why wouldn’t I always use a protocol
that requires a BOM?
A: Where the data has an associated type, such as a field in a database,
a BOM is unnecessary. In particular, if a text data stream is marked as
UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE, a BOM is neither necessary nor permitted.
Any U+FEFF would be interpreted as a ZWNBSP.
Do not tag every string in a database or set of fields with a BOM,
since it wastes space and complicates string concatenation. Moreover, it also means two data fields may have
precisely the same content, but not be binary-equal (where one is
prefaced by a BOM).
Q: How I should deal
A: Here are some guidelines to follow:
A particular protocol (e.g. Microsoft conventions for
.txt files) may require use of the BOM on certain Unicode data
streams, such as files. When you need to conform to such a protocol,
use a BOM.
Some protocols allow optional BOMs in the case of
untagged text. In those cases,
Where a text data stream is known to be plain text, but
of unknown encoding, BOM can be used as a signature. If there is no
BOM, the encoding could be anything.
Where a text data stream is known to be plain Unicode
text (but not which endian), then BOM can be used as a signature. If
there is no BOM, the text should be interpreted as big-endian.
Some byte oriented protocols expect ASCII characters at
the beginning of a file. If UTF-8 is used with these protocols, use
of the BOM as encoding form signature should be avoided.
Where the precise type of the data stream is known (e.g.
Unicode big-endian or Unicode little-endian), the BOM should not be
used. In particular, whenever a data stream is declared to be
UTF-16BE, UTF-16LE, UTF-32BE or UTF-32LE a BOM must not be
used. (See also Q: What is the
difference between UCS-2 and UTF-16?.)