verdy_p at wanadoo.fr
Mon Jun 2 17:20:49 CDT 2014
I would rather expect "treat them as you like": there will never be any
guarantee of interoperability, and everyone is allowed to use them as they
want and even to change that use at any time. The behavior is not defined in
TUS, and users cannot expect that TUS will ever define it.
There's no clear answer about what to do if you encounter them in data
supposed to be text. For me they are not text, so the whole data could be
rejected, or the text remaining after some filtering may be falsely
interpreted. You need an external specification outside TUS.
I certainly do not consider non-characters to be like unassigned valid code
points, where applications are strongly encouraged not to apply any kind of
filter if they want to remain compatible with evolutions of the standard
that may assign them. The best you can do with unassigned code points is
treat them as symbols with the minimal properties defined in the standard
(notably Bidi properties according to their range, where a direction is
defined for some ranges, or otherwise as symbols with weak direction), even
if applications cannot render them yet (renderers will find a way to show
them, generally using a .notdef glyph such as an empty box). Normalizers will
also not reorder them (the default combining class should be 0).
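To illustrate the point about unassigned code points, here is a minimal
Python sketch. It assumes that U+0378 is unassigned in the Unicode database
shipped with the Python runtime (true for the versions I know of, but it
could be assigned some day, which is exactly why filtering is discouraged).
The helper name describe() is only for illustration.

```python
import unicodedata

def describe(cp: int) -> dict:
    """Return the properties the runtime's Unicode database reports for cp."""
    ch = chr(cp)
    return {
        "category": unicodedata.category(ch),          # "Cn" when not assigned
        "combining_class": unicodedata.combining(ch),  # 0 by default
    }

# Assumption: U+0378 is unassigned in this runtime's Unicode database.
UNASSIGNED = 0x0378
info = describe(UNASSIGNED)

# A normalizer leaves the unassigned code point where it is: with combining
# class 0 it acts as a starter, so nothing is reordered across it.
sample = "a" + chr(UNASSIGNED) + "\u0301"
normalized = unicodedata.normalize("NFC", sample)
```

With these assumptions, normalized should be identical to sample: the
unassigned code point is passed through untouched and not filtered.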
Only applications that want to ensure that the text conforms to a specific
version of the standard are allowed to filter out, or signal as errors, the
presence of unassigned code points. But all applications can do that kind
of thing with non-characters (or with any code unit whose value falls
outside the valid range of a defined UTF). This is an important difference:
non-characters are not like unassigned code points; they are assigned to be
considered invalid and filterable by design by any Unicode-conforming
process for handling text.
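The distinction can be sketched in Python as follows. The 66 non-characters
are a fixed set (U+FDD0..U+FDEF plus the last two code points of each of the
17 planes), so they can be tested without any database, whereas "unassigned"
depends on the Unicode version the runtime knows. The function names and the
filtering policy are only my illustration of the position above, not
something defined by TUS.

```python
import unicodedata

def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for U+xxFFFE / U+xxFFFF in every plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def classify(cp: int) -> str:
    """Rough classification of a code point value (hypothetical labels)."""
    if not 0 <= cp <= 0x10FFFF:
        return "out-of-range"
    if 0xD800 <= cp <= 0xDFFF:
        return "surrogate"
    if is_noncharacter(cp):
        return "noncharacter"      # fixed set, never to be assigned
    if unicodedata.category(chr(cp)) == "Cn":
        return "unassigned"        # may be assigned by a later version
    return "other"

def strip_noncharacters(text: str) -> str:
    """Drop non-characters only; unassigned code points are kept."""
    return "".join(ch for ch in text if not is_noncharacter(ord(ch)))
```

For example, strip_noncharacters("a\uFDD0b\uFFFFc") gives "abc", while a
string containing only unassigned code points is returned unchanged.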
2014-06-02 23:53 GMT+02:00 Markus Scherer <markus.icu at gmail.com>:
> On Mon, Jun 2, 2014 at 1:32 PM, David Starner <prosfilaes at gmail.com> wrote:
>> I would especially discourage any web browser from handling
>> these; they're noncharacters used for unknown purposes that are
>> undisplayable and if used carelessly for their stated purpose, can
>> probably trigger serious bugs in some lamebrained utility.
> I don't expect "handling these" in web browsers and lamebrained utilities.
> I expect "treat like unassigned code points".