L2/08-299

Date/Time:   Fri Aug 8 11:29:08 CDT 2008
Contact:     deltasp2002@yahoo.com
Name:        Robin
Report Type: Technical Report or Tech Note issues
Opt Subject: Unicode Collation Comments

You might want to read this in an editor that supports indentation. Otherwise I will send it as an email attachment to the contact address.

Notes on the UCA specification. While I appreciate the UCA specification very much, I really struggled to decipher the UCA document, as there seems to be some ambiguity in places (to me, anyway). I had been looking at it on and off for a while and am only implementing it now, in August 2008. I have some notes here on points that you might consider clarifying further in the UCA specification. I think I have mastered most of the material, but there are some dangling questions, and I don't want to be caught out later by mistakes made now, even if the implementation passes the tests. If some of the topics below were clarified a little further, I think it would save others much time as well.

Implicit Weights
================
In the description of implicit weight derivation, three special values are listed for BASE, but it just says they depend on the TYPE of the character. It does not specify exactly which type; it is evidently the ranges from UnicodeData.txt. This is not so easy to see at first, considering the many "ideograph" references throughout the UCD. This code illustrates it more clearly:

    unsigned int BASE;

    if (c >= 0x4E00 && c <= 0x9FC3)              /* CJK Ideograph */
        BASE = 0xFB40;
    else if ((c >= 0x3400  && c <= 0x4DB5) ||    /* CJK Ideograph Extension A */
             (c >= 0x20000 && c <= 0x2A6D6))     /* CJK Ideograph Extension B */
        BASE = 0xFB80;
    else
        BASE = 0xFBC0;

Or perhaps this table:

    0x4E00..0x9FC3                           BASE = 0xFB40   /* CJK Ideograph */
    (0x3400..0x4DB5) or (0x20000..0x2A6D6)   BASE = 0xFB80   /* CJK Ideograph Extension A/B */
    Anything else                            BASE = 0xFBC0

NOTE: If using the modified CE encoding described below, the values for BASE can be much higher than those used here.
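To make the derivation concrete, here is a minimal sketch of the rest of it as I read the derived-weight section, reusing BASE from above. The AAAA/BBBB split is taken from UTS #10; the struct and function names are only mine, for illustration:

    #include <stdint.h>

    typedef struct { uint16_t p, s, t; } ImplicitCE;   /* illustrative only */

    /* Derive the two implicit collation elements for code point cp,
       per UTS #10: [.AAAA.0020.0002.][.BBBB.0000.0000.] */
    static void derive_implicit(uint32_t cp, uint16_t BASE, ImplicitCE ce[2])
    {
        ce[0].p = (uint16_t)(BASE + (cp >> 15));       /* AAAA */
        ce[0].s = 0x0020;
        ce[0].t = 0x0002;
        ce[1].p = (uint16_t)((cp & 0x7FFF) | 0x8000);  /* BBBB */
        ce[1].s = 0x0000;
        ce[1].t = 0x0000;
    }

For example, U+4E00 gives AAAA = 0xFB40 and BBBB = 0xCE00.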
FOURTH LEVEL WEIGHT CALCULATION
===============================
When describing the collation elements, the text says "there is a fourth level, and this is computable." It goes on to refer to the section on implicit weight derivation as the place to go for how to calculate the fourth level, but that section does not actually do so. A description of the process would be clearer.

Example
=======
To calculate the fourth level weight as listed in allkeys.txt, you must derive the decomposition character(s) for the character you want a fourth level weight for. For most characters this is the character itself, and so the fourth level weight is the character itself; but for a character with a compatibility decomposition, the fourth level weight is its compatibility decomposition. I would assume this is always the first character of the decomposition when the decomposition is longer than one character. Also, the fourth level weight is 0 if levels 1..3 are all zero. That seems to be the case in allkeys.txt, anyway.

VARIABLE WEIGHTING
==================
In the formal description of the UCA, the step after fetching collation elements for the longest match in a string specifies handling variable weighting. Should this step not be a completely separate phase that happens after the weights have all been added to the array? That way you could look ahead to the ignorables after variable weights and there would be no backtracking.

32 BIT INTEGER CE FORMAT
========================
allkeys.txt lists secondary weights that number more than 255, so it seems that the 32-bit layout suggested for the DUCET would not work for those weights. I handle this by adjusting the number of bits for primary, secondary and tertiary in the 32-bit value, like so:

32 BIT INTEGER CE FORMAT HANDLING ALL SECONDARY WEIGHTS IN ALLKEYS, PLUS 2 X MORE PRIMARIES

I just put the code here directly; it should be easy to see how the number of possible primaries is doubled, the number of possible secondaries is doubled, and the number of possible tertiaries is reduced to only 64. Millions of indexes can also be encoded by using the CE in a way similar to how it is done for the DUCET.

    /*
       UCA CE (Collation Element) ENCODING.

       PRIMARY WEIGHT IS 17 BITS  (0..131071)
         (If the primary is >= 130000 then the CE is an index; see below.)
       SECONDARY WEIGHT IS 9 BITS (0..511)
       TERTIARY WEIGHT IS 6 BITS  (0..63)
    */
    #define XL_CE_ENCODE(p,w,t) (xlUINT32)((((xlUINT32)(p)) << 15) | (((xlUINT32)(w)) << 6) | (xlUINT32)(t))

    /* Get primary weight from encoded ce */
    #define XL_CE_PWEIGHT(ce)   ((xlUINT32)(ce) >> 15)

    /* Get secondary weight from encoded ce */
    #define XL_CE_SWEIGHT(ce)   ((((xlUINT32)(ce) >> 6) << 23) >> 23)

    /* Get tertiary weight from encoded ce */
    #define XL_CE_TWEIGHT(ce)   ((((xlUINT32)(ce)) << 26) >> 26)

    /* Index encodings - 35127296 index values. It is an index when the primary weight is >= 130000. */
    #define XL_CE_IS_INDEX(ce)  (XL_CE_PWEIGHT(ce) >= 130000)

    /* Encode an index; pass an index from 0..35127295 (best cap it at 35127294 so the result is never 0xFFFFFFFF) */
    #define XL_CE_ENCODE_INDEX(index) ((((xlUINT32)130000 + ((xlUINT32)(index) / 32768)) << 15) | ((xlUINT32)(index) % 32768))

    #define _XL_CE_INDEX(ce)    ((((xlUINT32)(ce)) << 17) >> 17)

    /* Decode an index; pass the ce in here if the XL_CE_IS_INDEX macro returns true */
    #define XL_CE_DECODE_INDEX(ce) (((XL_CE_PWEIGHT(ce) - (xlUINT32)130000) * (xlUINT32)32768) + _XL_CE_INDEX(ce))
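As a quick sanity check of the bit layout (assuming the macros above, with xlUINT32 as a typedef for an unsigned 32-bit integer):

    #include <assert.h>
    #include <stdint.h>

    typedef uint32_t xlUINT32;

    /* ... XL_CE_* macros from above go here ... */

    int main(void)
    {
        xlUINT32 ce = XL_CE_ENCODE(70000, 300, 50);    /* weights within 17/9/6 bits */
        assert(XL_CE_PWEIGHT(ce) == 70000);
        assert(XL_CE_SWEIGHT(ce) == 300);
        assert(XL_CE_TWEIGHT(ce) == 50);
        assert(!XL_CE_IS_INDEX(ce));

        xlUINT32 ix = XL_CE_ENCODE_INDEX(1234567);     /* index round trip */
        assert(XL_CE_IS_INDEX(ix));
        assert(XL_CE_DECODE_INDEX(ix) == 1234567);
        return 0;
    }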
/* Note that the indexes are not (in the xl implementation) like UCA DUCET indexes; they are always a direct index into one buffer. That is, contractions and expansions are stored in the same buffer of 32-bit unsigned ints, and CEs in the main lookup list can be indexes into that buffer. If the item in the buffer where the index points is 0xFFFFFFFF, then it is the start of a contraction list at index+1. A contraction entry is the list of characters in the contraction (minus the lead character), plus the CEs for the contraction, which are never indexes. The contraction list finishes with an extra 0xFFFFFFFF, which signals that the CEs that follow are what to use if no contraction match was found. A brief statement of the logic might be:

   ' THERE ARE TWO TABLES: A MAIN TABLE OF 32 BIT CES, AND A DATA TABLE OF 32 BIT UINTS.
   ' CES IN THE MAIN TABLE ARE LIKE IN THE DUCET BUT ENCODED AS ABOVE. THE DATA TABLE
   ' HAS ALL THE CONTRACTIONS, EXPANSIONS AND FALLBACKS.

   WHILE I < LEN
       ' GET NEXT INPUT CHARACTER
       CH = INPUT[I]
       ' LOOK IT UP IN THE MAIN TABLE
       CE = LOOKUPCE(CH)
       ' SEE IF THE CE IS VALID
       IF CE != 0xFFFFFFFF THEN
           IF ISINDEX(CE) THEN
               ' AN INDEX MEANS IT IS AN INDEX INTO THE DATA FOR CONTRACTIONS, EXPANSIONS AND FALLBACK
               INDEX = DECODE_INDEX(CE)
               IF DATA[INDEX] == 0xFFFFFFFF THEN
                   ' START OF CONTRACTION LIST, SEE IF THERE IS A MATCH
                   GOTMATCH = FALSE
                   WHILE (1)
                       ' HAIRY CODE HERE TO DISCOVER IF THERE IS A CONTRACTION MATCH
                       ' SEE UCA
                   END WHILE
                   ' IF NOT, GOTO _ISEXPANSION_OR
                   IF NOT GOTMATCH THEN
                       GOTO _ISEXPANSION_OR
                   ELSE
                       ... OUTPUT THE CES FOR THE CONTRACTION MATCH
                   END IF
               ELSE
                   ' EXPANSION OR NORMAL OR INVALID
                   _ISEXPANSION_OR:
                   IF DATA[INDEX] == 0xFFFFFFFE THEN
                       ' THERE IS NO FALLBACK CE, TIME TO COMPUTE
                       GOTO _COMPUTE_CE
                   ELSE
                       ' SIMPLY OUTPUT A LIST OF CES UP TO THE TERMINATOR 0xFFFFFFFF
                       WHILE DATA[INDEX] != 0xFFFFFFFF
                           CEOUT = DATA[INDEX]
                           INDEX = INDEX + 1
                       END WHILE
                   END IF
               END IF
           ELSE
               ' NORMAL CE, OUTPUT
               CEOUT = CE
           END IF
       ELSE
           _COMPUTE_CE:
           ' MUST COMPUTE THE WEIGHT
           IF DECOMPOSITION_TYPE(CH) > DCT_CANONICAL THEN
               ' COMPUTE THE COMPATIBILITY DECOMPOSITION MAPPINGS FROM THE UCA SPEC.
               ' THIS WILL OFTEN BE MORE THAN ONE CE, BECAUSE OF THE COMPATIBILITY DECOMPOSITION.
               CE[0..x] = COMPUTE_COMPAT(CH)
               CEOUT = CE[0..x]
           ELSE
               ' THIS IS SIMPLIFIED; YOU WILL ACTUALLY MAKE TWO CES AS PER THE DERIVED WEIGHT SPECIFICATION.
               P = COMPUTEP(CH)
               W = COMPUTEW(CH)
               T = COMPUTET(CH)
               CEOUT = ENCODECE(P,W,T)
           END IF
       END IF
       I = I + 1
   END WHILE

But of course the same index scheme as the UCA DUCET can be accommodated by the number of indexes in the CE encoding anyway. Using this scheme there is no "binary search" or "boyer moore" style searching required, because indexes always point directly to the contraction or expansion in question. Lookups into the main table for CEs work like the Unicode trie concept. */
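For the expansion/fallback case in the loop above, the C might look something like this (a sketch only; the XL_DATA_* names and the function are mine for illustration, matching the buffer layout described above):

    #include <stddef.h>
    #include <stdint.h>

    #define XL_DATA_END     0xFFFFFFFFu  /* list terminator / start-of-contractions marker */
    #define XL_DATA_COMPUTE 0xFFFFFFFEu  /* no fallback CE stored: derive the weights instead */

    /* Copy the CE list stored at data[index..] into out, stopping at the
       0xFFFFFFFF terminator. Returns the number of CEs copied, or -1 when
       the entry says the weights must be computed (0xFFFFFFFE). */
    static int read_expansion(const uint32_t *data, size_t index,
                              uint32_t *out, size_t max)
    {
        size_t n = 0;
        if (data[index] == XL_DATA_COMPUTE)
            return -1;                        /* caller derives implicit weights */
        while (data[index] != XL_DATA_END && n < max)
            out[n++] = data[index++];
        return (int)n;
    }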
DEFAULT WEIGHTS
===============
One thing that is slightly confusing about derived weights is values like .0020 that are assigned during runtime calculation. Does this mean to use the value that .0020 maps to after tailoring, or does it mean to use the literal value 0x0020? The same question applies to the tertiary weights calculated at runtime. If the tertiary weight table returns 0x08 for some character, does it mean to use 0x08 for the tertiary weight, or the actual final weight for .0008 after tailoring?

VARIABLE WEIGHTING FINAL LEVEL WEIGHT
=====================================
While handling variable weighting for the SHIFTED option, the specification says to add a final level weight to the CE. But this final level weight is not mentioned in the "FORM SORT KEY" section, which only discusses looping either forwards or backwards through the list of returned collation elements, and those loops are only directed at specific weight levels. I suppose it means that if the option is SHIFTED then there is a mandatory "X" level weight at (numLevels + 1), but it is just not that clear.

In the table of final level weights for the SHIFTED option, the final level weight is labelled L4. Now I am not sure at all. Must I replace the computable L4 weight with the L4 weight in the table? Or is that a typing error? It surely can't be both the fourth level weight L4 and the final level weight.

Now, if the algorithm is set by the user to use 3 levels, do you take the fourth level into account when deciding if a collation element is ignorable? Also, if the collation element is to have a final level weight appended, does that happen when only 3 levels will be considered?

For example: a collation table is set to run with only 3 levels. Now a variable element is discovered and the SHIFTED option is enabled. How does the final level weight come into play if there are only three levels? Is it only that variable element that gets another level after it? Or is the final level weight simply dropped because there are only 3 levels?

The examples show that the final level weight behaves like L4, the fourth level. So instead of "final level weight", would it not be more correct to say "replace the fourth level weight with one of the options in the table below" when specifying the variable weight handling?

It also says that elements marked with a * are variable. What about the elements that are all zero? There is no star, but the SHIFTED option table specifies that if the collation element is completely ignorable then 0x0000 is used as the new final level weight. If all-zeroes is not marked with a star, how could it be considered for variable weighting? Evidently because every collation element is given a new final level weight as per the table; but that is not clear straight away, to me anyway. My mind works in sequence, so when I read the first two options, for BLANKED and NON-IGNORABLE, I assume the process only applies to variable elements; then I must decipher that all elements are given a new final level weight, which might or might not be a replacement of the computable fourth level weight.

What if the collation uses 4 levels and is SHIFTED? With the "final level weight" terminology, that could mean using five levels. I will leave my implementation to do just three levels for now (or assume it is the L4 replacement when the spec says "final level weight").
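For what it is worth, here is how I currently read the SHIFTED table, written as the separate pass I suggested under VARIABLE WEIGHTING above. Note this treats "Ignorable (L1, L2) after Variable" as a primary ignorable that follows a variable element, which is exactly the reading I am unsure about; the CE4 struct and the is_variable callback (standing for the * flag in allkeys.txt) are mine for illustration:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    typedef struct { uint16_t l1, l2, l3, l4; } CE4;   /* illustrative 4-level CE */

    static void apply_shifted(CE4 *ces, size_t n, bool (*is_variable)(const CE4 *))
    {
        bool after_variable = false;
        for (size_t i = 0; i < n; i++) {
            CE4 *ce = &ces[i];
            if (is_variable(ce)) {
                ce->l4 = ce->l1;                   /* Variable: new L4 = old L1 */
                ce->l1 = ce->l2 = ce->l3 = 0;
                after_variable = true;
            } else if (ce->l1 == 0 && ce->l2 == 0 && ce->l3 == 0) {
                ce->l4 = 0x0000;                   /* Completely ignorable */
            } else if (after_variable && ce->l1 == 0) {
                ce->l2 = ce->l3 = 0;               /* Ignorable (L1, L2) after Variable */
                ce->l4 = 0x0000;
            } else {
                ce->l4 = 0xFFFF;                   /* None of the above */
                after_variable = false;            /* a non-ignorable ends the run */
            }
        }
    }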
CONTRACTIONS
============
When searching for the "longest match", it is not immediately clear exactly how that works; at least, there is the question of what to do after removing a character. It says to replace S with S+C and remove C. So far so good. The ambiguity is: what about unblocked characters that were already skipped before C because they did not match? Are they to be considered for matching as well? So if it was S+C, where C was 2 characters ahead of S, would the skipped unblocked character after S then be considered for matching after removing C? It seems best to forget about already-examined unblocked characters and look ahead after C; then, when continuing the contraction match, use the character after the removed C instead of reconsidering already-checked unblocked characters.

    CONTRACTION IS:  0xAA 0xCC
    TEXT IS:         0xAA 0xUNBLOCKED 0xCC

    AFTER REMOVING C AND MAKING S+C, IT LOOKS LIKE:  (0xAA + 0xCC) 0xUNBLOCKED

    MUST YOU RECONSIDER 0xUNBLOCKED, AS IN SEARCH FOR A CONTRACTION MATCH
    THAT IS (0xAA + 0xCC + 0xUNBLOCKED)?

TERTIARY WEIGHT TABLE
=====================
The condition in the tertiary weight table is based on General Category or a specific list of characters. Except that the specific list of characters trails off with ".." after listing only two of the characters. I guessed at which characters to use from UnicodeData.txt by following the character naming scheme, but if those lists were specifically defined it would probably make for more conforming UCA implementations.

TRAILING HANGUL WEIGHTS
=======================
The interleaving method seems best for Hangul because there seems to be no need to tailor the data. Is this correct? It says that the internal weight table is separate from the other weights. So is it correct to assume that for Hangul sequences you can switch to an entirely unrelated and computed data table, the weights of which will not interfere with the other weights? It just says that the weights have nothing to do with the default weights, but there needs to be more discussion of the relation of those weights to the UCA default weights. I will try to implement it assuming the above is correct and that no tailoring is required for Hangul when using the interleaved method. I think if that were explicitly mentioned it would help to eliminate any confusion.

7.3 Compatibility Decompositions
================================
Step 1 of the compatibility decompositions section specifies to "Derive the decomposition". Does this mean to get the NFKD string of the one-character string for the character in question, or does it simply mean to get the character's decomposition sequence from UnicodeData.txt?

Step 2 specifies to get the CE for each element in the decomposition. Does this mean to recursively pass the decomposition to the UCA function to get the collation element array, or does it simply mean to look up each CE in the main CE array? (But what then if one of the characters in the decomposition is an expansion, or some of the characters form a contraction?) If Steps 1 and 2 were more explicit, it would make more sense.

VARIABLE WEIGHTING
==================
A table is presented for the final level weight in variable weighting when the option is SHIFTED. The line "Ignorable (L1, L2) after Variable" is confusing. OK, so it is ignorable; but at what, L1 or L2, or L1 and L2? And what does the "after Variable" mean? Is it a final level weight for subsequent ignorables after the variable weight that triggered the weighting adjustment? Is it after the maximum level variable weighting? Explicit specification here would be more convenient.

        Completely Ignorable                 0000     NULL             [.0000.0000.0000.0000]
    >>> Ignorable (L1, L2) after Variable    0000     COMBINING GRAVE  [.0000.0000.0000.0000]
        Variable                             old L1   SPACE            [.0000.0000.0000.0209]
        None of the above                    FFFF     Capital A        [.06D9.0020.0008.FFFF]

I assume it means after the variable collation element in the array of collation elements? Does the (L1, L2) specification mean to ignore any L3 ignorables?

-- -- -- -- -- -- -- -- -- -- -- -- -- -- -- --
(End of Report)