RE: Collation - last character?

From: Lars Kristan (lars.kristan@hermes.si)
Date: Wed Mar 20 2002 - 07:42:31 EST


Kenneth Whistler wrote:
> I do not feel that this is an *encoding* issue at all. Nor is it even
> an issue for the Unicode Collation Algorithm to define such a usage.
Strictly speaking, this may be true. But the catch-22 is that Unicode
'rules' prevent such characters from being defined until they are proven
to be in use, and nobody can use them until their behavior is defined...

>
> What you are looking for is something that could be agreed upon by
> the programming language communities as:
>
> 1. a symbol, from among the vast collection already encoded in
> Unicode, that would be agreed to by the programming language
> communities as acceptable in identifiers, as is "_".
Of course, definitions of which characters are allowed in identifiers vary.
Some restrict identifiers to the alpha characters found in 7-bit ASCII, and
nothing can be squeezed in there. I believe it would suffice if this
character were an alpha character, treated the same as any 'letter' of any
script.
And yes, this makes my previous suggestion of putting such a character in the
General Punctuation block a bad idea.
By identifiers, I meant identifiers in a generic sense, including uses like
user names, file names and so on. For filenames (which generally already
accept even most punctuation) no further agreement or acceptance would be
needed. If collation ensures that a character is sorted last, it can be used
as such immediately.
>
> 2. when using a simple, single-level ordering (e.g., for
> sorting menu-items), would be given a primary weight above
> all alphas, as "_" would be given a primary weight below
> all alphas.
>
> 3. when using a multi-level, sophisticated ordering according
> to the UCA, would also be given a primary weight above all
> alphas, as "_" would be given a primary weight below all
> alphas, so as to preserve the expected behavior, while allowing
> all the sophistication of language-specific ordering behavior
> for sorting lists.
Not just above all alphas, but also above (/after) all other
characters/codepoints (including punctuation and undefined codepoints). If
the latter have a primary weight of FF80, then the proposed character(s)
would need a primary weight higher than that. FFFF seems in order,
though a slightly lower value (FFF0?) would leave some room if something
arises in the future.
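
To make the weight argument concrete, here is a minimal sketch of a
single-level sort. Everything in it is an assumption made for illustration:
the weight table is a toy, not taken from the UCA, FF80 is only the
hypothetical weight for undefined codepoints mentioned above, and U+FFF0 is
used merely as a placeholder for the proposed character.

```python
# Toy single-level collation; all weights are illustrative assumptions.
UNASSIGNED_WEIGHT = 0xFF80  # hypothetical weight for undefined codepoints
LAST_CHAR = "\ufff0"        # placeholder for the proposed "last character"
LAST_WEIGHT = 0xFFF0        # above FF80, leaving room below FFFF

# Hypothetical primary weights for a handful of known characters.
TOY_WEIGHTS = {"_": 0x0200, "a": 0x1000, "b": 0x1001, "z": 0x1019}


def primary_weight(ch):
    """Return the toy primary weight of a single character."""
    if ch == LAST_CHAR:
        return LAST_WEIGHT
    # Anything not in the toy table is treated as an undefined codepoint.
    return TOY_WEIGHTS.get(ch, UNASSIGNED_WEIGHT)


def sort_key(s):
    """Build a sort key from primary weights only."""
    return [primary_weight(ch) for ch in s]


# U+0378 stands in for an undefined codepoint here.
names = ["zz", LAST_CHAR + "ab", "_ab", "\u0378a", "ab"]
ordered = sorted(names, key=sort_key)
```

With these assumed weights, the name starting with the placeholder sorts
after everything else, including the name starting with an undefined
codepoint, while "_ab" still sorts first.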

>
> In either case, whether doing simple sorting or complex, multi-level
> sorting, you are talking about some tailored behavior here. You
> can't just sort on code point order, and you cannot simply use the
> Unicode Collation Algorithm without tailoring to get the effects you
> want.
My original proposal of adding a new character at codepoint U+FFF0 was
intended to give this character very good sorting behavior even when sorting
is based on code point order. This would meet the requirements only if a
codepoint in the Specials block can be defined to be an alpha character.
U+10FFF0 (or U+10FFFC) would behave even better, but other benefits of having
it at U+FFF0 might outweigh the slight advantage that U+10FFFx would have in
raw sort order.
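
Which "raw" order is meant matters here. The sketch below compares the two
candidate codepoints under code point order, UTF-8 byte order and UTF-16
code unit order (big-endian bytes). The sample strings are chosen only for
illustration. Under UTF-16 code unit order U+10FFF0 is stored as a surrogate
pair, so it sorts before U+FFF0 rather than after it.

```python
# Candidate codepoints plus two ordinary characters for comparison.
candidates = ["\ufff0", "\U0010fff0", "z", "\U00010000"]

# Python compares str values by code point.
by_code_point = sorted(candidates)

# UTF-8 byte order agrees with code point order.
by_utf8 = sorted(candidates, key=lambda s: s.encode("utf-8"))

# UTF-16 code unit order: supplementary characters become surrogate
# pairs (D800..DFFF), which compare lower than U+FFF0.
by_utf16 = sorted(candidates, key=lambda s: s.encode("utf-16-be"))
```

So a codepoint at U+10FFFx is "last" only for code point or UTF-8 ordering,
while U+FFF0 ends up last among these samples when raw UTF-16 code units are
compared.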
I understand that underscore does not meet the requirements as the "last
character"'s counterpart. Another new character (the "first character")
would complicate things further, since a codepoint that would sort well in
"raw sort mode" is not available. That does not mean, however, that a "first
character" (for example at codepoint U+FFF1) would be useless. Oh well,
again, I need to change my proposal, to:
U+FFF4 - "first character" [sorted _second_]
U+FFF5 - "last character" [default primary weight FFF7?]
U+FFF6 - "application specific first character" [sorted _first_]
U+FFF7 - "application specific last character" [default primary weight FFF7?]
U+FFF8 (still unassigned).
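
As a rough sketch of the intended ordering, the four characters can be
modelled as ranks around all ordinary text. The rank numbers and the helper
name sentinel_key are hypothetical, ordinary characters simply fall back to
code point order, and placing the application specific last character after
the plain last character is an assumption, since the weights above are still
open.

```python
# Proposed codepoints from the list above; none of them is assigned.
APP_FIRST, FIRST, LAST, APP_LAST = "\ufff6", "\ufff4", "\ufff5", "\ufff7"

# Hypothetical ranks: ordinary characters get rank 2, in the middle.
RANK = {APP_FIRST: 0, FIRST: 1, LAST: 3, APP_LAST: 4}


def sentinel_key(s):
    """Sort key: rank first, then code point within the same rank."""
    return [(RANK.get(ch, 2), ch) for ch in s]


items = [
    "beta",
    LAST + "archive",
    FIRST + "readme",
    "alpha",
    APP_LAST + "lock",
    APP_FIRST + "index",
]
ordered_items = sorted(items, key=sentinel_key)
```

Under these assumptions the entries come out as: application specific
first, first, the ordinary names, last, application specific last.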

Only U+FFF4 and U+FFF5 need a rendering, while U+FFF6 and U+FFF7 might be
rendered like zero-width non-breaking spaces.
U+FFF4 - rendered as top left corner (U+231C)
U+FFF5 - rendered as bottom left corner (U+231E)
Perhaps full size rather than quarter size. This does NOT mean, however,
that those existing symbols themselves would be given this role.

There is some question as to what counts as an application's internal use.
For example, a database could use these characters for its own internal use,
meaning they would no longer be available for the "internal use" of a
developer using this database. Perhaps the database should use U+10FFFF, the
developer (API user) should use U+FFF7 and the end user should use U+FFF5.

>
> By the way, my suggestion for an appropriate, already encoded symbol
> to meet your requirements would be U+221E INFINITY. ;-) Or how about
> U+261F WHITE DOWN POINTING INDEX, if you want something more iconic?
IMHO, changing the collation order of any existing codepoint, especially if
it appears in a larger group, is not a good idea.

I understand why experienced unicoders feel very strongly about proposals
for new characters. But then again, if no character had this specific role
before, why would an existing one get it now? A role that has never existed
before should in my opinion be assigned a new codepoint. I see its rendering
as secondary: it is the role, not the rendering, that justifies a new
codepoint.

Lars Kristan



This archive was generated by hypermail 2.1.2 : Wed Mar 20 2002 - 08:42:30 EST