We are using ICU and CLDR with SQLite. I am not a software developer but a user of the output.


We have had some comments from Chinese colleagues on name sorting, and I am unsure whether what we have is correct or whether our development team is expected to use the tools in a different way. We currently sort the phonebook by pinyin. One example of the comments we have had concerns “沈”, which ends up being sorted as Chen, but our China team say it should be Shen.


I am trying to figure out whether the utilities should come up with the generally accepted match out of the box, whether “沈” really does map to two pinyin equivalents, or whether our dev team is supposed to override the default rule so that it sorts as Shen rather than Chen. I did notice that zh.xml in CLDR 24 has an additional section called compounds, which says “Here 沈 collates as shěn/7stk/rad85, between 弞 7/stk/rad57, 审 8stk/rad40”. I have no idea how to interpret this, but I wonder whether it is meant to override the mapping to chén earlier in the table, and whether this is something that was added to CLDR from v24 onwards?


As I cannot read Chinese, I am unsure whether there will be many such examples or only a few. I believe our dev team has a similar problem and is relying on the default collations.


Any advice is very much appreciated. 


PS: I did visit some other sites, such as Chinese tools, and on searching for “沈” I was offered Chén, Shěn and Tán as pinyin equivalents, so I guess there is more than one. I am just wondering whether for names (ours is a phonebook) it is common knowledge that it can only be Shěn.


I also managed to pin down a passing Chinese work colleague, but all he could say was that “沈” is only Shěn and that Chén is a ‘suggestion’ rather than an actual match (and then exited stage left in haste). Is that correct?


