Laurentiu asked whether the confusables.txt file could be
simplified, since many of the lines are repeated, with only the
table type being different. Example:
2028 ; 0020 ; SL #* ( → ) LINE SEPARATOR → SPACE #
2028 ; 0020 ; SA #* ( → ) LINE SEPARATOR → SPACE #
2028 ; 0020 ; ML #* ( → ) LINE SEPARATOR → SPACE #
2028 ; 0020 ; MA #* ( → ) LINE SEPARATOR → SPACE #
He had asked the quite reasonable question: "Is it the case that the
SL confusables form a proper subset of the SA confusables, and so on
compared to ML and then to MA confusables? If yes, the duplication
in confusables.txt would be reduced quite a bit if each set only
listed what that set contains in addition to the previous set, and
inherited everything else from the previous set."
I did some analysis, and here's what I found:
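As it turns out, the sets are not simply nested, with each one a
superset of the previous one. For example, some mappings are in SL
and ML but not in SA. With the version I had, here are the counts
for each combination of tables: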
4523 [MA, ML, SA, SL]
51 [ML, SA, SL]
511 [ML, SL]
122 [SA, SL]
724 [MA, SA]
351 [MA, ML]
330 [MA]
97 [ML]
45 [SA]
1 [SL]
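
For anyone who wants to reproduce counts like these against their own copy of the file, here is a minimal sketch in Python. It is not the tool that produced the numbers above. It assumes the three-field "source ; target ; type" layout shown in the example, and it assumes that a mapping is identified by its (source, target) pair; the function names are made up for illustration.

```python
from collections import Counter, defaultdict

def parse_line(line):
    """Return (source, target, table_type), or None for comments and blanks."""
    data = line.split("#", 1)[0].strip()   # drop the trailing comment
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 3:                   # not a data line in the assumed layout
        return None
    return fields[0], fields[1], fields[2]

def type_combination_counts(lines):
    """Count how many mappings occur in each combination of table types."""
    types_by_mapping = defaultdict(set)
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        source, target, table_type = parsed
        # Assumption: one mapping == one (source, target) pair.
        types_by_mapping[(source, target)].add(table_type)
    return Counter(tuple(sorted(types)) for types in types_by_mapping.values())
```

Calling type_combination_counts(open("confusables.txt", encoding="utf-8")) would give a Counter keyed by tuples such as ("MA", "ML", "SA", "SL"). If the same source maps to different targets in different tables, this grouping treats them as separate mappings, which may or may not match how the figures above were computed.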
However, we could make the file dramatically smaller if we
change the format to make the type field a space-delimited
list. The four example lines above would then become one line:
2028 ; 0020 ; SL SA ML MA #* ( → ) LINE SEPARATOR → SPACE #
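
To give a feel for the conversion, here is a rough sketch in Python of how the current lines could be collapsed into the proposed format. It is an illustration under assumptions, not a proposed tool: it groups on the (source, target) pair, keeps the comment from the first line seen for each pair, and writes the types in the order SL SA ML MA used in the example. The name merge_types is hypothetical.

```python
def merge_types(lines, order=("SL", "SA", "ML", "MA")):
    """Collapse per-type lines into one line per (source, target) mapping."""
    merged = {}  # (source, target) -> (set of types, comment from first line)
    for line in lines:
        data, sep, comment = line.partition("#")
        fields = [f.strip() for f in data.split(";")]
        if len(fields) != 3:               # skip comments, blanks, other layouts
            continue
        source, target, table_type = fields
        entry = merged.setdefault((source, target),
                                  (set(), sep + comment.rstrip()))
        entry[0].add(table_type)

    out = []
    for (source, target), (types, comment) in merged.items():
        known = [t for t in order if t in types]   # preferred order first
        extra = sorted(types - set(order))         # any unexpected type names
        type_list = " ".join(known + extra)
        out.append(f"{source} ; {target} ; {type_list} {comment}".rstrip())
    return out
```

If each count in the statistics above is a number of distinct mappings, the current file has 22134 data lines (4523 x 4 + 51 x 3 + 1708 x 2 + 473) and the merged form would have 6755, one per mapping.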
The question to the committee is whether this is worth doing.