Re: How is NBH (U0083) Implemented?

From: Ken Whistler <kenw_at_sybase.com>
Date: Mon, 08 Aug 2011 16:44:04 -0700

On 8/1/2011 7:26 AM, Naena Guru wrote:

This thread wandered off into an argument about whether U+FEFF ZWNBSP or
U+2060 WJ is better supported and which should be used to inhibit line
breaks. However, there are still several other issues which bear
addressing in Naena Guru's questions:

> The Unicode character NBH (No Break Here: U0083) is understood as a
> hidden character that is used to keep two adjoining visual characters
> from being separated in operations such as word wrapping.

As Jukka noted, U+0083 is a C1 control code, whose semantics is not actually
defined by the Unicode Standard. Its function in ISO 6429 is to represent
the control function "No Break Here". U+0083 is unlikely to be supported
(except for pass-through) by any significant Unicode-based software as a
control function. Its only implementation was likely for some
terminal-based software in what are now basically obsolete systems.

See the Wikipedia article on C0 and C1 control codes for a quick summary of
the status of various control codes and their implementation:

http://en.wikipedia.org/wiki/C0_and_C1_control_codes
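
For anyone who wants to check the character properties at issue, here is a
minimal sketch using Python's standard unicodedata module. The helper name
describe() and the fallback string are made up for illustration; only the
General_Category values and character names come from the Unicode data.

```python
import unicodedata

def describe(ch):
    """Return (code point label, General_Category, character name)."""
    label = "U+%04X" % ord(ch)
    category = unicodedata.category(ch)
    # Control codes (Cc) have no formal character name, so pass a fallback
    # instead of letting unicodedata.name() raise ValueError.
    name = unicodedata.name(ch, "<no character name>")
    return label, category, name

# U+0083 is a C1 control (Cc); the other three are format controls (Cf).
for ch in ("\u0083", "\u200C", "\u2060", "\uFEFF"):
    print(describe(ch))
```

U+0083 should come back as category Cc with no name, while U+200C, U+2060
and U+FEFF are all Cf format controls with defined names.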

> It seems to be similar to ZWJ (Zero Width nonJoiner: U200C) in that it
> can prevent automatic formation of a ligature as programmed in a font.

U+200C ZWNJ is the Unicode format control whose function is to break cursive
connection between adjacent characters. That is a different and distinct
function from indicating the position of an inhibited line break.

Also, it is important to recognize that the insertion of *any* random
control code between two characters may end up preventing automatic
formation of a font ligature, if it isn't accounted for in the font tables.
That does not imply that insertion of random control codes (including
U+0083) is a recommended way of inhibiting ligature formation for a pair of
characters in a particular font.

> However, it seems to me that an NBH evokes a question mark (?) Is this
> an oversight by implementers or am I making wrong assumptions?

Because most control codes, including nearly all of the C1 control codes,
are unsupported by typical Unicode-based text processing software, it is
not too surprising that insertion of U+0083 in text would result in a "?"
or other indication of an unsupported and/or undisplayable character.
>
> There is also the NBSP (No-break Space: U00A0), which I think has to
> be mapped to the space character in fonts, that glues two letters
> together by a space. If you do not want a space between two letters
> and also want to prevent glyph substitutions to happen, then NBH seems
> to be the correct character to use.

No. And that leads to the discussion which followed, about U+FEFF and
U+2060.
>
> NBH is more appropriate for use within ISO-8859-1 characters than
> ZWNJ, because the latter is double-byte.

"Double-byte" is not a concept with any applicability to the Unicode
Standard. That is a hold-over from Asian character sets which mixed ASCII
with two-byte encoding of extensions to cover Han characters (and other
additions).
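
To see why byte counts are a property of the encoding form rather than of
the character, a rough sketch along these lines may help. The function name
encoded_lengths() and the choice of three encodings are only illustrative
assumptions, not anything from the original question.

```python
def encoded_lengths(ch):
    """Map encoding name -> byte length, or None if ch is not encodable."""
    result = {}
    for enc in ("iso-8859-1", "utf-8", "utf-16-be"):
        try:
            result[enc] = len(ch.encode(enc))
        except UnicodeEncodeError:
            # The character has no representation in this encoding.
            result[enc] = None
    return result

print("U+0083:", encoded_lengths("\u0083"))
print("U+200C:", encoded_lengths("\u200C"))
```

U+0083 is one byte only in ISO 8859-1; it takes two bytes in UTF-8 and two
in UTF-16. U+200C takes three bytes in UTF-8 and two in UTF-16, and cannot
be represented in ISO 8859-1 at all.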

And U+0083 is no more appropriate for use with ISO 8859-1 implementations
than Unicode implementations, for the same reason: it is a control function
which simply isn't supported.

> Programs that handle SBCS well ought to be afforded the use of NBH as
> it is a SBCS character. Or, am I completely mistaken here?

If you actually run into the byte 0x83 in data which is ostensibly labeled
"ISO-8859-1", in almost all actual cases you would be dealing instead with
0x83 (= U+0192 LATIN SMALL LETTER F WITH HOOK) in mislabeled Windows Code
Page 1252 data. It would be really inadvisable to start expecting it to be
supported as a line break inhibiting control code instead.
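
The two readings of that byte can be compared with a short Python sketch,
using the codec names from the Python standard library. The variable names
are just for illustration.

```python
raw = b"\x83"

# Read strictly as ISO 8859-1, the byte maps to the C1 control U+0083.
as_latin1 = raw.decode("iso-8859-1")

# Read as Windows Code Page 1252, the same byte is U+0192.
as_cp1252 = raw.decode("cp1252")

print("ISO 8859-1: U+%04X" % ord(as_latin1))
print("CP1252:     U+%04X" % ord(as_cp1252))
```

The same byte yields U+0083 under one label and U+0192 under the other,
which is why the labeling of the data matters more than the byte value.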

--Ken
Received on Mon Aug 08 2011 - 18:49:07 CDT
