
CLDR Ticket #9944 (accepted data)

Opened 8 months ago

Last modified 4 months ago

Put Hani after Latin in sorting for Simplified Chinese

Reported by: xefbfbd@…
Owned by: markus
Component: collation
Phase: rc
Weeks: 0.1

Description

According to GB/T 13418-92 文字条目通用排序规则 (Character and entry filing principles)

5.3 汉字字符与非汉字字符混合出现时排序

为了便于信息处理,依据国家标准 GB 2312 中字符的排列顺序,确定以下排序的前后次序:
空格—序号—阿拉伯数码—拉丁字母(大写、小写)—日文假名(平假名、片假名)—希腊字母—俄文字母—汉字

There is no official English version. The following translation was produced mainly with Google Translate, with a few tweaks by me.

5.3 Sort Order When Chinese Characters and Non-Chinese Characters Appear Together

To facilitate information processing, the following sort order is determined according to the character order of national standard GB 2312:
Space - Ordinal Number - Arabic Numerals - Latin (uppercase, lowercase) - Kana (Hiragana, Katakana) - Greek - Russian - Chinese
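
As a rough illustration of the group order quoted above, here is a minimal Python sketch that ranks each character by script group and sorts on that rank. The names `PREFIX_RANKS`, `group_rank`, and `filing_key` are hypothetical, the classification by Unicode character-name prefix is a simplification, and the "Ordinal Number" group is left out because its mapping to code points is not clear from the text. This is not how CLDR or ICU implement collation.

```python
import unicodedata

# Ranks follow the translated order: space, (ordinal numbers, omitted),
# digits, Latin, Hiragana, Katakana, Greek, Cyrillic, Han.
# Assumption: script membership is guessed from the Unicode name prefix.
PREFIX_RANKS = [
    ("LATIN", 3),
    ("HIRAGANA", 4),
    ("KATAKANA", 5),
    ("GREEK", 6),
    ("CYRILLIC", 7),
    ("CJK UNIFIED IDEOGRAPH", 8),
]
OTHER_RANK = 9  # anything not covered by the quoted list sorts last


def group_rank(ch):
    """Return the filing group of a single character."""
    if ch == " ":
        return 0
    if ch in "0123456789":
        return 2
    name = unicodedata.name(ch, "")
    for prefix, rank in PREFIX_RANKS:
        if name.startswith(prefix):
            return rank
    return OTHER_RANK


def filing_key(text):
    """Sort key: group rank first, then code point within the group.

    Code point order puts ASCII uppercase before lowercase, which matches
    the "(uppercase, lowercase)" note in the quoted order.
    """
    return [(group_rank(ch), ord(ch)) for ch in text]


entries = ["中文", "Яндекс", "alpha", "Zebra", "αβγ", "かな", "カナ", "2312", " lead"]
sorted_entries = sorted(entries, key=filing_key)
print(sorted_entries)
```

With this key, entries starting with Han characters land after all Latin, Kana, Greek, and Cyrillic entries, which is the behavior the ticket title asks about.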

Change History

comment:1 Changed 8 months ago by xefbfbd@…

GB/T stands for "recommended national standard"; it is not a mandatory one.

see also: ticket:4020

comment:2 Changed 4 months ago by emmons

  • Status changed from new to accepted
  • Priority changed from assess to medium
  • Phase changed from dsub to rc
  • Milestone changed from UNSCH to 32
  • Owner changed from anybody to markus
  • Type changed from unknown to data

comment:3 Changed 4 months ago by emmons

  • Cc fredrik, kristi, kiara added

comment:4 Changed 4 months ago by markus

  • Cc mark, pedberg, emmons, yoshito added
  • Weeks set to 0.1

This needs some discussion. How far do we want to go in matching this standard? Is it a good standard to follow at all?

The subject line suggests [reorder Latn Hani].

CLDR puts the native script first so that lists of names start with native-script ones followed by foreign-script ones. The use of pinyin in Chinese might justify putting Latn first, but this is a big change in how contacts apps sort.

The standard suggests [caseFirst upper][reorder Latn Kana] with Hani naturally going last, or possibly [caseFirst upper][reorder Latn Kana Grek Cyrl Hani]. It recommends this based on the binary order of GB 2312 which stems from character code allocation convenience and thus does not seem like a good basis for a real sort order.
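
To make the two reorder options above concrete, here is a toy Python model of how a `[reorder ...]` setting rearranges script groups: the listed scripts move to the front in the given order, and the remaining scripts keep their default relative order. `ROOT_ORDER` is an abbreviated, assumed default order covering only the scripts mentioned in this ticket, and `apply_reorder` is a hypothetical helper, not an ICU or CLDR API.

```python
# Assumption: abbreviated default script order, limited to the scripts
# discussed in this ticket. The real root collation order has many more.
ROOT_ORDER = ["Latn", "Grek", "Cyrl", "Kana", "Hani"]


def apply_reorder(listed, default_order=ROOT_ORDER):
    """Model of [reorder ...]: listed scripts first, the rest in default order."""
    rest = [script for script in default_order if script not in listed]
    return list(listed) + rest


# [reorder Latn Kana]: Hani stays last without being listed.
short_form = apply_reorder(["Latn", "Kana"])

# [reorder Latn Kana Grek Cyrl Hani]: the same order, spelled out in full.
long_form = apply_reorder(["Latn", "Kana", "Grek", "Cyrl", "Hani"])

# Native script first, as described for lists of names.
native_first = apply_reorder(["Hani"])

print(short_form)
print(long_form)
print(native_first)
```

Under this model the short and long forms produce the same order, so they differ only in how explicit the rule is. `[caseFirst upper]` is a separate setting and is not modeled here.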

We cannot (easily) put Arabic-script digits after ASCII digits because they are primary-equal. Does GB 2312 even have Arabic-script digits? It clearly doesn't before the ASCII letters. Does the Chinese text for "Ordinal Number - Arabic Numerals" mean something different?

I would be more comfortable looking at a typical, printed Chinese phone book that includes Latn names. Are they in the front or in the back?

Last edited 4 months ago by markus