From: Doug Ewell (doug@ewellic.org)
Date: Fri Jan 01 2010 - 11:17:15 CST
Happy New Year to all.
"verdy_p" <verdy underscore p at wanadoo dot fr> wrote:
>> Unicode, and even ASCII, contains plenty of seldom-used control
>> characters, with defined semantics if that is desirable, which an
>> internal process can safely insert, use, and remove for purposes like
>> this.
>
> No, you're wrong, there's no such character. If it existed, then this
> character would also have a use within normal strings that would be
> part of a primary key, and that would break the logic. If it is
> "seldom used", it does not qualify, because it will conflict with that
> rare use, so it will unavoidably be UNUSABLE to insert/use/remove
> for such a purpose.
If you are concerned that every possible control character, like U+009C
STRING TERMINATOR or U+0081 <I don't have a name because nobody uses
me>, might appear in the real text, then yes, this is a problem.
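
As a rough illustration of the trade-off (a sketch, not something from the earlier messages): an internal process can check whether a candidate control character actually occurs in the data before using it as a separator. The function name and the candidate list here are hypothetical choices made only for this example.

```python
def find_free_separator(strings, candidates=("\x1f", "\x1e", "\x9c", "\x81")):
    """Return the first candidate character that occurs in none of the
    strings, or None if every candidate is already present in the data.

    The default candidates are illustrative only: U+001F and U+001E
    (information separators), U+009C STRING TERMINATOR, and U+0081.
    """
    used = set().union(*strings)  # every character that appears anywhere
    for ch in candidates:
        if ch not in used:
            return ch
    return None


# Insert, use, and remove the separator inside one process.
fields = ["Smith", "John"]
sep = find_free_separator(fields)
if sep is None:
    raise ValueError("no candidate separator is free in this data")
key = sep.join(fields)
assert key.split(sep) == fields
```

If the function returns None, every candidate really does occur in the text, and that is the case where the objection holds.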
> The BOCU-1 RESET code is NOT a character, and what I wrote was exactly
> the kind of use where it can be beneficial, because BOCU-1 was
> designed with the express purpose of being a binary-ordered encoding
> suitable for collation according to code point's scalar values.
Right, I'm aware the reset byte is not a character.
> I DID NOT say that a RESET code needed to be inserted in the
> plain-text, but its insertion with a collation key as a key separator
> DOES NOT violate the rule, as we can completely guarantee that it will:
> - never be present in encoded plain-texts
> - always sort AFTER any valid Unicode character
> - not be ignored.
If you want to use a mechanism that is internal to BOCU-1 to serve a
metadata purpose, be my guest. You will not be able to convert your
data to any other encoding and still retain this metadata. If that is
not a problem for you, great.
Hopefully you read what I wrote about UTF-8 and tag characters, or
remembered when it happened. It is a valuable lesson.
> And I still maintain that the special RESET code in BOCU-1 should NEVER
> be present in any encoded plain-text (as effectively it has the
> potential of creating multiple distinct encodings for equivalent
> texts).
This is not an absolute rule of BOCU-1, and the authors indicate how it
could be useful for concatenating strings, which seems to me a more
common scenario than sorting multi-column text in BOCU-1 using only the
untailored UCA.
> So it does not absolutely need a leading BOM
With its lack of transparency with ASCII or any other encoding, I can
hardly think of an encoding that is more in need of a BOM than BOCU-1.
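
To show what a signature buys a receiver, here is a minimal sketch of signature sniffing. The byte sequences are the encoded forms of U+FEFF as I recall them (the BOCU-1 value FB EE 28 is the one I remember from UTN #6 and should be verified there); the table and function names are hypothetical.

```python
# Encoded forms of U+FEFF used as signatures. Longer signatures come first
# so that UTF-32LE (FF FE 00 00) is not mistaken for UTF-16LE (FF FE).
SIGNATURES = [
    (b"\x00\x00\xfe\xff", "UTF-32BE"),
    (b"\xff\xfe\x00\x00", "UTF-32LE"),
    (b"\xef\xbb\xbf", "UTF-8"),
    (b"\xfb\xee\x28", "BOCU-1"),  # assumed from UTN #6; verify before relying on it
    (b"\xfe\xff", "UTF-16BE"),
    (b"\xff\xfe", "UTF-16LE"),
]


def sniff_signature(data: bytes):
    """Return the encoding name implied by a leading signature, or None."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name
    return None
```

Unsigned UTF-8 that happens to be pure ASCII is still readable as ASCII; unsigned BOCU-1 gives a receiver no such fallback, so the signature is the only hint it gets.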
> (My opinion is that, for interchange purpose, BOMS should be allowed
> in ALL encodings if they can represent the U+FEFF codepoint,
No argument there.
> and that this codepoint should also exclusively represent a BOM and no
> ZWNBSP semantic
Too late; U+FEFF nominally still has both semantics, but see below.
> if needed one could replace all ZWNBSP by ZWJ, making sure that all
> final renderers will either be able to render it).
This is a hack. Developers of renderers should make ZWNBSP display
correctly. It's not that hard. Creators of documents shouldn't have to
modify their text to appease the renderer. And remember, it's
default-ignorable.
> All the legacy problems about the BOM would have been much simpler if
> it had been mapped to a non-character (exactly like also U+FFFE)
> instead of a legacy control format (like U+FEFF), but now it is too
> late to change it or recommend some other codepoint.
U+2060 WORD JOINER is recommended for the ZWNBSP semantic. And
honestly, when was the last time you saw U+FEFF in real-world text (not
in a test case) used with the ZWNBSP semantic?
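
For anyone who does have such text, a minimal sketch of the migration might look like this. It assumes, as a policy choice of this example only, that a leading U+FEFF is a signature to be dropped and that every other U+FEFF was meant as ZWNBSP; the function name is hypothetical.

```python
BOM = "\ufeff"          # ZERO WIDTH NO-BREAK SPACE, also used as the byte order mark
WORD_JOINER = "\u2060"  # recommended replacement for the ZWNBSP semantic


def normalize_feff(text: str) -> str:
    """Drop a leading U+FEFF (assumed to be a signature) and replace any
    remaining U+FEFF with U+2060 WORD JOINER."""
    if text.startswith(BOM):
        text = text[1:]
    return text.replace(BOM, WORD_JOINER)
```

After this pass, any U+FEFF that later shows up at the start of the text can be treated as a signature without ambiguity.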
--
Doug Ewell | Thornton, Colorado, USA | http://www.ewellic.org
RFC 5645, 4645, UTN #14 | ietf-languages @ http://is.gd/2kf0s