L2/08-299

Date/Time:   Fri Aug 8 11:29:08 CDT 2008
Contact:     deltasp2002@yahoo.com
Name:        Robin
Report Type: Technical Report or Tech Note issues
Opt Subject: Unicode Collation Comments

You might want to read this in an editor that supports indentation. Otherwise I will send it as an email attachment to the contact address.

Notes on the UCA specification. While I appreciate the UCA specification very much, I really struggled to decipher the UCA document, as there seems to be some ambiguity in places (to me, anyway). I had been looking at it on and off for a while and am only implementing it now, in August 2008. I have some notes here on points that you might consider clarifying further in the UCA specification. I think I have mastered most of the material, but there are some dangling questions, and I don't want to be caught out later by mistakes made now, even if the implementation passes the tests. If some of the topics below were clarified a little further, I think it would save others much time as well.

Implicit Weights
================
In the description of implicit weight derivation, three special values are listed for BASE, but it just says they depend on the TYPE of the character. It does not specify exactly which type; it is evidently the ranges from UnicodeData.txt. This is not so easy to see at first, considering the many "ideograph" references throughout the UCD. This code illustrates it more clearly:

    unsigned int BASE;

    if (c >= 0x4E00 && c <= 0x9FC3)              /* CJK Ideograph */
        BASE = 0xFB40;
    else if ((c >= 0x3400  && c <= 0x4DB5) ||    /* CJK Ideograph Extension A */
             (c >= 0x20000 && c <= 0x2A6D6))     /* CJK Ideograph Extension B */
        BASE = 0xFB80;
    else
        BASE = 0xFBC0;

Or perhaps this table:

    0x4E00..0x9FC3                           BASE = 0xFB40   /* CJK Ideograph */
    (0x3400..0x4DB5) or (0x20000..0x2A6D6)   BASE = 0xFB80   /* CJK Ideograph Extension A/B */
    Anything else                            BASE = 0xFBC0

NOTE: If using the modified CE encoding described below, the values for BASE can be much higher than those used here.
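To make the derivation concrete, here is a minimal sketch of the rest of it as I read the derived-weight section, reusing BASE from above. The AAAA/BBBB split is taken from UTS #10; the struct and function names are only mine, for illustration:

    #include <stdint.h>

    typedef struct { uint16_t p, s, t; } ImplicitCE;   /* illustrative only */

    /* Derive the two implicit collation elements for code point cp,
       per UTS #10: [.AAAA.0020.0002.][.BBBB.0000.0000.] */
    static void derive_implicit(uint32_t cp, uint16_t BASE, ImplicitCE ce[2])
    {
        ce[0].p = (uint16_t)(BASE + (cp >> 15));       /* AAAA */
        ce[0].s = 0x0020;
        ce[0].t = 0x0002;
        ce[1].p = (uint16_t)((cp & 0x7FFF) | 0x8000);  /* BBBB */
        ce[1].s = 0x0000;
        ce[1].t = 0x0000;
    }

For example, U+4E00 gives AAAA = 0xFB40 and BBBB = 0xCE00.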
FOURTH LEVEL WEIGHT CALCULATION
===============================
When describing the collation elements, the text says "there is a fourth level, and this is computable." It goes on to refer to the section on implicit weight derivation as the place to go for how to calculate the fourth level, but that section does not actually do so. A description of the process would be clearer.

Example
=======
To calculate the fourth level weight as listed in allkeys.txt, you must derive the decomposition character(s) for the character you want a fourth level weight for. For most characters this is the character itself, and so the fourth level weight is the character itself; but for a character with a compatibility decomposition, the fourth level weight is its compatibility decomposition. I would assume this is always the first character of the decomposition when the decomposition is longer than one character. Also, the fourth level weight is 0 if levels 1..3 are all zero. That seems to be the case in allkeys.txt, anyway.

VARIABLE WEIGHTING
==================
In the formal description of the UCA, the step after fetching collation elements for the longest match in a string specifies handling variable weighting. Should this step not be a completely separate phase that happens after the weights have all been added to the array? That way you could look ahead to the ignorables after variable weights and there would be no backtracking.

32 BIT INTEGER CE FORMAT
========================
allkeys.txt lists secondary weights that number more than 255, so it seems that the 32-bit layout suggested for the DUCET would not work for those weights. I handle this by adjusting the number of bits for primary, secondary and tertiary in the 32-bit value, like so:

32 BIT INTEGER CE FORMAT HANDLING ALL SECONDARY WEIGHTS IN ALLKEYS, PLUS 2 X MORE PRIMARIES

I just put the code here directly; it should be easy to see how the number of possible primaries is doubled, the number of possible secondaries is doubled, and the number of possible tertiaries is reduced to only 64. Millions of indexes can also be encoded by using the CE in a way similar to how it is done for the DUCET.

    /*
       UCA CE (Collation Element) ENCODING.

       PRIMARY WEIGHT IS 17 BITS  (0..131071)
         (If the primary is >= 130000 then the CE is an index; see below.)
       SECONDARY WEIGHT IS 9 BITS (0..511)
       TERTIARY WEIGHT IS 6 BITS  (0..63)
    */
    #define XL_CE_ENCODE(p,w,t) (xlUINT32)((((xlUINT32)(p)) << 15) | (((xlUINT32)(w)) << 6) | (xlUINT32)(t))

    /* Get primary weight from encoded ce */
    #define XL_CE_PWEIGHT(ce)   ((xlUINT32)(ce) >> 15)

    /* Get secondary weight from encoded ce */
    #define XL_CE_SWEIGHT(ce)   ((((xlUINT32)(ce) >> 6) << 23) >> 23)

    /* Get tertiary weight from encoded ce */
    #define XL_CE_TWEIGHT(ce)   ((((xlUINT32)(ce)) << 26) >> 26)

    /* Index encodings - 35127296 index values. It is an index when the primary weight is >= 130000. */
    #define XL_CE_IS_INDEX(ce)  (XL_CE_PWEIGHT(ce) >= 130000)

    /* Encode an index; pass an index from 0..35127295 (best cap it at 35127294 so the result is never 0xFFFFFFFF) */
    #define XL_CE_ENCODE_INDEX(index) ((((xlUINT32)130000 + ((xlUINT32)(index) / 32768)) << 15) | ((xlUINT32)(index) % 32768))

    #define _XL_CE_INDEX(ce)    ((((xlUINT32)(ce)) << 17) >> 17)

    /* Decode an index; pass the ce in here if the XL_CE_IS_INDEX macro returns true */
    #define XL_CE_DECODE_INDEX(ce) (((XL_CE_PWEIGHT(ce) - (xlUINT32)130000) * (xlUINT32)32768) + _XL_CE_INDEX(ce))
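As a quick sanity check of the bit layout (assuming the macros above, with xlUINT32 as a typedef for an unsigned 32-bit integer):

    #include <assert.h>
    #include <stdint.h>

    typedef uint32_t xlUINT32;

    /* ... XL_CE_* macros from above go here ... */

    int main(void)
    {
        xlUINT32 ce = XL_CE_ENCODE(70000, 300, 50);    /* weights within 17/9/6 bits */
        assert(XL_CE_PWEIGHT(ce) == 70000);
        assert(XL_CE_SWEIGHT(ce) == 300);
        assert(XL_CE_TWEIGHT(ce) == 50);
        assert(!XL_CE_IS_INDEX(ce));

        xlUINT32 ix = XL_CE_ENCODE_INDEX(1234567);     /* index round trip */
        assert(XL_CE_IS_INDEX(ix));
        assert(XL_CE_DECODE_INDEX(ix) == 1234567);
        return 0;
    }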
/* Note that the indexes are not (in the xl implementation) like UCA DUCET indexes; they are always a direct index into one buffer. That is, contractions and expansions are stored in the same buffer of 32-bit unsigned ints, and CEs in the main lookup list can be indexes into that buffer. If the item in the buffer where the index points is 0xFFFFFFFF, then it is the start of a contraction list at index+1. A contraction entry is the list of characters in the contraction (minus the lead character), plus the CEs for the contraction, which are never indexes. The contraction list finishes with an extra 0xFFFFFFFF, which signals that the CEs that follow are what to use if no contraction match was found. A brief statement of the logic might be:

   ' THERE ARE TWO TABLES: A MAIN TABLE OF 32 BIT CES, AND A DATA TABLE OF 32 BIT UINTS.
   ' CES IN THE MAIN TABLE ARE LIKE IN THE DUCET BUT ENCODED AS ABOVE. THE DATA TABLE
   ' HAS ALL THE CONTRACTIONS, EXPANSIONS AND FALLBACKS.

   WHILE I < LEN
       ' GET NEXT INPUT CHARACTER
       CH = INPUT[I]
       ' LOOK IT UP IN THE MAIN TABLE
       CE = LOOKUPCE(CH)
       ' SEE IF THE CE IS VALID
       IF CE != 0xFFFFFFFF THEN
           IF ISINDEX(CE) THEN
               ' AN INDEX MEANS IT IS AN INDEX INTO THE DATA FOR CONTRACTIONS, EXPANSIONS AND FALLBACK
               INDEX = DECODE_INDEX(CE)
               IF DATA[INDEX] == 0xFFFFFFFF THEN
                   ' START OF CONTRACTION LIST, SEE IF THERE IS A MATCH
                   GOTMATCH = FALSE
                   WHILE (1)
                       ' HAIRY CODE HERE TO DISCOVER IF THERE IS A CONTRACTION MATCH
                       ' SEE UCA
                   END WHILE
                   ' IF NOT, GOTO _ISEXPANSION_OR
                   IF NOT GOTMATCH THEN
                       GOTO _ISEXPANSION_OR
                   ELSE
                       ... OUTPUT THE CES FOR THE CONTRACTION MATCH
                   END IF
               ELSE
                   ' EXPANSION OR NORMAL OR INVALID
                   _ISEXPANSION_OR:
                   IF DATA[INDEX] == 0xFFFFFFFE THEN
                       ' THERE IS NO FALLBACK CE, TIME TO COMPUTE
                       GOTO _COMPUTE_CE
                   ELSE
                       ' SIMPLY OUTPUT A LIST OF CES UP TO THE TERMINATOR 0xFFFFFFFF
                       WHILE DATA[INDEX] != 0xFFFFFFFF
                           CEOUT = DATA[INDEX]
                           INDEX = INDEX + 1
                       END WHILE
                   END IF
               END IF
           ELSE
               ' NORMAL CE, OUTPUT
               CEOUT = CE
           END IF
       ELSE
           _COMPUTE_CE:
           ' MUST COMPUTE THE WEIGHT
           IF DECOMPOSITION_TYPE(CH) > DCT_CANONICAL THEN
               ' COMPUTE THE COMPATIBILITY DECOMPOSITION MAPPINGS FROM THE UCA SPEC.
               ' THIS WILL OFTEN BE MORE THAN ONE CE, BECAUSE OF THE COMPATIBILITY DECOMPOSITION.
               CE[0..x] = COMPUTE_COMPAT(CH)
               CEOUT = CE[0..x]
           ELSE
               ' THIS IS SIMPLIFIED; YOU WILL ACTUALLY MAKE TWO CES AS PER THE DERIVED WEIGHT SPECIFICATION.
               P = COMPUTEP(CH)
               W = COMPUTEW(CH)
               T = COMPUTET(CH)
               CEOUT = ENCODECE(P,W,T)
           END IF
       END IF
       I = I + 1
   END WHILE

But of course the same index scheme as the UCA DUCET can be accommodated by the number of indexes in the CE encoding anyway. Using this scheme there is no "binary search" or "boyer moore" style searching required, because indexes always point directly to the contraction or expansion in question. Lookups into the main table for CEs work like the Unicode trie concept. */
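For the expansion/fallback case in the loop above, the C might look something like this (a sketch only; the XL_DATA_* names and the function are mine for illustration, matching the buffer layout described above):

    #include <stddef.h>
    #include <stdint.h>

    #define XL_DATA_END     0xFFFFFFFFu  /* list terminator / start-of-contractions marker */
    #define XL_DATA_COMPUTE 0xFFFFFFFEu  /* no fallback CE stored: derive the weights instead */

    /* Copy the CE list stored at data[index..] into out, stopping at the
       0xFFFFFFFF terminator. Returns the number of CEs copied, or -1 when
       the entry says the weights must be computed (0xFFFFFFFE). */
    static int read_expansion(const uint32_t *data, size_t index,
                              uint32_t *out, size_t max)
    {
        size_t n = 0;
        if (data[index] == XL_DATA_COMPUTE)
            return -1;                        /* caller derives implicit weights */
        while (data[index] != XL_DATA_END && n < max)
            out[n++] = data[index++];
        return (int)n;
    }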
DEFAULT WEIGHTS
===============
One thing that is slightly confusing about derived weights is values like .0020 that are assigned during runtime calculation. Does this mean to use the value that .0020 maps to after tailoring, or does it mean to use the literal value 0x0020? The same question applies to the tertiary weights calculated at runtime. If the tertiary weight table returns 0x08 for some character, does it mean to use 0x08 for the tertiary weight, or the actual final weight for .0008 after tailoring?

VARIABLE WEIGHTING FINAL LEVEL WEIGHT
=====================================
While handling variable weighting for the SHIFTED option, the specification says to add a final level weight to the CE. But this final level weight is not mentioned in the "FORM SORT KEY" section, which only discusses looping either forwards or backwards through the list of returned collation elements, and those loops are only directed at specific weight levels. I suppose it means that if the option is SHIFTED then there is a mandatory "X" level weight at (numLevels + 1), but it is just not that clear.

In the table of final level weights for the SHIFTED option, the final level weight is labelled L4. Now I am not sure at all. Must I replace the computable L4 weight with the L4 weight in the table? Or is that a typing error? It surely can't be both the fourth level weight L4 and the final level weight.

Now, if the algorithm is set by the user to use 3 levels, do you take the fourth level into account when deciding if a collation element is ignorable? Also, if the collation element is to have a final level weight appended, does that happen when only 3 levels will be considered?

For example: a collation table is set to run with only 3 levels. Now a variable element is discovered and the SHIFTED option is enabled. How does the final level weight come into play if there are only three levels? Is it only that variable element that gets another level after it? Or is the final level weight simply dropped because there are only 3 levels?

The examples show that the final level weight behaves like L4, the fourth level. So instead of "final level weight", would it not be more correct to say "replace the fourth level weight with one of the options in the table below" when specifying the variable weight handling?

It also says that elements marked with a * are variable. What about the elements that are all zero? There is no star, but the SHIFTED option table specifies that if the collation element is completely ignorable then 0x0000 is used as the new final level weight. If all-zeroes is not marked with a star, how could it be considered for variable weighting? Evidently because every collation element is given a new final level weight as per the table; but that is not clear straight away, to me anyway. My mind works in sequence, so when I read the first two options, for BLANKED and NON-IGNORABLE, I assume the process only applies to variable elements; then I must decipher that all elements are given a new final level weight, which might or might not be a replacement of the computable fourth level weight.

What if the collation uses 4 levels and is SHIFTED? With the "final level weight" terminology, that could mean using five levels. I will leave my implementation to do just three levels for now (or assume it is the L4 replacement when the spec says "final level weight").
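For what it is worth, here is how I currently read the SHIFTED table, written as the separate pass I suggested under VARIABLE WEIGHTING above. Note this treats "Ignorable (L1, L2) after Variable" as a primary ignorable that follows a variable element, which is exactly the reading I am unsure about; the CE4 struct and the is_variable callback (standing for the * flag in allkeys.txt) are mine for illustration:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint16_t l1, l2, l3, l4; } CE4;   /* illustrative 4-level CE */

    static void apply_shifted(CE4 *ces, size_t n, bool (*is_variable)(const CE4 *))
    {
        bool after_variable = false;
        for (size_t i = 0; i < n; i++) {
            CE4 *ce = &ces[i];
            if (is_variable(ce)) {
                ce->l4 = ce->l1;                   /* Variable: new L4 = old L1 */
                ce->l1 = ce->l2 = ce->l3 = 0;
                after_variable = true;
            } else if (ce->l1 == 0 && ce->l2 == 0 && ce->l3 == 0) {
                ce->l4 = 0x0000;                   /* Completely ignorable */
            } else if (after_variable && ce->l1 == 0) {
                ce->l2 = ce->l3 = 0;               /* Ignorable (L1, L2) after Variable */
                ce->l4 = 0x0000;
            } else {
                ce->l4 = 0xFFFF;                   /* None of the above */
                after_variable = false;            /* a non-ignorable ends the run */
            }
        }
    }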
CONTRACTIONS
============
When searching for the "longest match", it is not immediately clear exactly how that works; at least, there is the question of what to do after removing a character. It says to replace S with S+C and remove C. So far so good. The ambiguity is: what about unblocked characters that were already skipped before C because they did not match? Are they to be considered for matching as well? So if it was S+C, where C was 2 characters ahead of S, would the skipped unblocked character after S then be considered for matching after removing C? It seems best to forget about already-examined unblocked characters and look ahead after C; then, when continuing the contraction match, use the character after the removed C instead of reconsidering already-checked unblocked characters.

    CONTRACTION IS:  0xAA 0xCC
    TEXT IS:         0xAA 0xUNBLOCKED 0xCC

    AFTER REMOVING C AND MAKING S+C, IT LOOKS LIKE:  (0xAA + 0xCC) 0xUNBLOCKED

    MUST YOU RECONSIDER 0xUNBLOCKED, AS IN SEARCH FOR A CONTRACTION MATCH
    THAT IS (0xAA + 0xCC + 0xUNBLOCKED)?

TERTIARY WEIGHT TABLE
=====================
The condition in the tertiary weight table is based on General Category or a specific list of characters. Except that the specific list of characters trails off with ".." after listing only two of the characters. I guessed at which characters to use from UnicodeData.txt by following the character naming scheme, but if those lists were specifically defined it would probably make for more conforming UCA implementations.

TRAILING HANGUL WEIGHTS
=======================
The interleaving method seems best for Hangul because there seems to be no need to tailor the data. Is this correct? It says that the internal weight table is separate from the other weights. So is it correct to assume that for Hangul sequences you can switch to an entirely unrelated and computed data table, the weights of which will not interfere with the other weights? It just says that the weights have nothing to do with the default weights, but there needs to be more discussion of the relation of those weights to the UCA default weights. I will try to implement it assuming the above is correct and that no tailoring is required for Hangul when using the interleaved method. I think if that were explicitly mentioned it would help to eliminate any confusion.

7.3 Compatibility Decompositions
================================
Step 1 of the compatibility decompositions section specifies to "Derive the decomposition". Does this mean to get the NFKD string of the one-character string for the character in question, or does it simply mean to get the character's decomposition sequence from UnicodeData.txt?

Step 2 specifies to get the CE for each element in the decomposition. Does this mean to recursively pass the decomposition to the UCA function to get the collation element array, or does it simply mean to look up each CE in the main CE array? (But what then if one of the characters in the decomposition is an expansion, or some of the characters form a contraction?) If Steps 1 and 2 were more explicit, it would make more sense.

VARIABLE WEIGHTING
==================
A table is presented for the final level weight in variable weighting when the option is SHIFTED. The line "Ignorable (L1, L2) after Variable" is confusing. OK, so it is ignorable; but at what, L1 or L2, or L1 and L2? And what does the "after Variable" mean? Is it a final level weight for subsequent ignorables after the variable weight that triggered the weighting adjustment? Is it after the maximum level variable weighting? Explicit specification here would be more convenient.

        Completely Ignorable                 0000     NULL             [.0000.0000.0000.0000]
    >>> Ignorable (L1, L2) after Variable    0000     COMBINING GRAVE  [.0000.0000.0000.0000]
        Variable                             old L1   SPACE            [.0000.0000.0000.0209]
        None of the above                    FFFF     Capital A        [.06D9.0020.0008.FFFF]

I assume it means after the variable collation element in the array of collation elements? Does the (L1, L2) specification mean to ignore any L3 ignorables?

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
(End of Report)