Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers
Shawn.Steele at microsoft.com
Sun Jan 31 13:52:32 CST 2016
It should be understood that the output of any algorithm that changes Unicode character data into non-character data is binary, not Unicode. It's inappropriate to shove binary data into Unicode streams, because software that expects text will break it.
From: Unicode [mailto:unicode-bounces at unicode.org] On Behalf Of Chris Jacobs
Sent: Sunday, January 31, 2016 10:08 AM
To: J Decker <d3ck0r at gmail.com>
Cc: unicode at unicode.org
Subject: Re: Encoding/Use of potential unpaired UTF-16 surrogate pair specifiers
J Decker wrote on 2016-01-31 18:56:
> On Sun, Jan 31, 2016 at 8:31 AM, Chris Jacobs <chris.jacobs at xs4all.nl>
>> J Decker wrote on 2016-01-31 03:28:
>>> I've reconsidered and think that, for ease of implementation, I'll
>>> just mask every UTF-16 code unit (not code point) with a 10-bit
>>> value. This will result in no character changing from BMP space to
>>> surrogate pair or vice versa.
>>> Thanks for the feedback.
>> So you are still trying to handle the unarmed output as plaintext.
>> Do you realize that if a string in the output is replaced by a
>> canonically equivalent one, this may mess things up, because the
>> originals are not canonically equivalent?
> I see ... things like mentioned here
Yes, especially the part about normalization.
This would not only spoil the normalized string; because the normalized string can also have a different length, your ever-changing XOR values may go out of sync for everything after that point.
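A minimal sketch of that failure, under stated assumptions: key_stream is a made-up generator standing in for whatever produces the ever-changing 10-bit values, and the plaintext is contrived so that its masked form happens to contain "e" followed by U+0301, which NFC composes into a single U+00E9. None of these names come from the scheme under discussion.

```python
import unicodedata

def key_stream(n, seed=0x155):
    # Hypothetical ever-changing 10-bit key stream (an assumption, not the real generator).
    keys, k = [], seed
    for _ in range(n):
        keys.append(k & 0x3FF)
        k = (k * 5 + 3) & 0x3FF
    return keys

def to_units(text):
    # Split a string into UTF-16 code units.
    data = text.encode("utf-16-le", "surrogatepass")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]

def from_units(units):
    # Rebuild a string from UTF-16 code units.
    data = b"".join(u.to_bytes(2, "little") for u in units)
    return data.decode("utf-16-le", "surrogatepass")

def mask(text):
    # XOR each code unit with its 10-bit key; XOR is its own inverse, so this also unmasks.
    units = to_units(text)
    return from_units([u ^ k for u, k in zip(units, key_stream(len(units)))])

# Contrived plaintext whose masked form is "e" + COMBINING ACUTE ACCENT + " ok".
plaintext = mask("e\u0301 ok")
masked = mask(plaintext)

# Some process treats the masked output as text and normalizes it.
normalized = unicodedata.normalize("NFC", masked)

recovered_raw = mask(masked)      # untouched masked output
recovered_nfc = mask(normalized)  # masked output after normalization

print(len(masked), len(normalized))
print(recovered_raw == plaintext, recovered_nfc == plaintext)
```

Once the composed character shortens the string by one code unit, every later unit is XORed with the wrong key, so the damage is not limited to the normalized character itself.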