Additional decompositions in decomps.txt
eliz at gnu.org
Sun Feb 21 10:32:44 CST 2016
This question is separate from, though related to, the "Character
folding in text editors" thread.
The UCA database includes the file decomps.txt, which is said to be
based on the normative properties:
# The decompositions used in the generation of DUCET are loosely based
# on the normative decomposition mappings defined in UnicodeData.txt
# in the Unicode Character Database. An examination of this data listing
# clearly shows the close relationship to the decomposition mappings.
# However, those decomposition mappings are adjusted as part of the input
# to the generation of DUCET, in order to produce default weights more
# appropriate for collation. Those adjusted
# decompositions fall into several classes:
# 1. In some cases a decomposition mapping from UnicodeData.txt is
# 2. In some cases a decomposition mapping from UnicodeData.txt is
# 3. In some cases a new decomposition is added for a character which
# has no decomposition mapping in UnicodeData.txt. In this third case,
# a new decomposition tag "<sort>" is introduced, to distinguish these
# introduced decompositions from those derived from UnicodeData.txt.
However, I see entries in decomps.txt that seem to belong to none
of the 3 classes described above. Here are 2 notable examples:
00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE => LATIN SMALL LETTER O + COMBINING LONG SOLIDUS OVERLAY
0142;;006C 0335 # LATIN SMALL LETTER L WITH STROKE => LATIN SMALL LETTER L + COMBINING SHORT STROKE OVERLAY
In both of these cases, UnicodeData.txt defines no decomposition
mapping, yet the "<sort>" tag I expected to see is absent from
decomps.txt. Is there something I'm missing here?
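
For anyone who wants to reproduce the observation, here is a rough sketch of the check. It assumes each entry has the three-field shape shown in the two examples above (code point; tag; decomposition), and it uses Python's unicodedata module as a stand-in for UnicodeData.txt, so the result depends on the Unicode version bundled with the interpreter, which may not match the UCA release in question. The helper names are made up for illustration, and the U+00E9 line is a control entry written by hand in the same format, not one copied from decomps.txt.

```python
import unicodedata


def parse_decomps_line(line):
    """Split one entry into (code point, tag, decomposition), or None.

    Assumes the 'code;tag;decomposition # comment' layout seen in the
    two entries quoted above.
    """
    data = line.split("#", 1)[0].strip()
    fields = [field.strip() for field in data.split(";")]
    if len(fields) != 3 or not fields[0]:
        return None
    code, tag, mapping = fields
    return int(code, 16), tag, [int(cp, 16) for cp in mapping.split()]


def untagged_additions(lines):
    """Return entries with an empty tag whose character has no
    decomposition mapping according to unicodedata."""
    found = []
    for line in lines:
        entry = parse_decomps_line(line)
        if entry is None:
            continue
        code, tag, mapping = entry
        if not tag and not unicodedata.decomposition(chr(code)):
            found.append((code, mapping))
    return found


SAMPLE = [
    "00F8;;006F 0338 # LATIN SMALL LETTER O WITH STROKE",
    "0142;;006C 0335 # LATIN SMALL LETTER L WITH STROKE",
    # Hand-written control line: U+00E9 does have a canonical mapping.
    "00E9;;0065 0301 # LATIN SMALL LETTER E WITH ACUTE",
]

for code, mapping in untagged_additions(SAMPLE):
    print("U+%04X -> %s" % (code, " ".join("%04X" % cp for cp in mapping)))
```

Running the same function over the lines of the real decomps.txt should list every untagged entry of this kind, not only the two I noticed.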