[URL (current version);
L2/07-289 = WG2/N3307 (PDF snapshot)]


西夏文和統一碼

Tangut (Xīxià) Orthography and Unicode

Research notes toward a Unicode encoding of Tangut

by Richard Cook

STEDT, SEI, Unicode



“由於西夏字筆畫繁多, 是世界上最難識的文字之一.”

李范文 (《夏漢字典》, 1997:4)

‘A Tangut character generally consists of many strokes,
and so the Tangut written language is one of the most difficult in the world.’

Lǐ Fànwén (Tangut/Chinese Dictionary, 1997:11)


西夏 Xīxià (a.k.a. Tangut; 黨項 拓跋氏所建, 本名大夏, 又稱白上國, 宋人稱西夏; CH:2063; 大白高國, cf. 李范文 1997:302.1572) is an extinct Sino-Tibetan (ST) language of central China (都興慶府, 今寧夏銀川 modern Níngxià Yínchuān; 郭沫若 GMR 1979.2:41; [N38.4, E106.3]). The Tangut state was conquered by the Mongols in 1227, after a line of 10 emperors spanning some 190 years. Tangut is sometimes thought to bear an especially close affiliation to Qiangic (羌族語系; cp. 黨項羌, CH:1875; JAM:2004, “‘Brightening’ and the place of Xīxià in the Qiangic branch of Tibeto-Burman”), and vestiges of Tangut speech are said to be found in minority languages of 甘肅 Gānsù and 四川 Sìchuān provinces (木雅 = Muya = Mi-nyag; 道孚, 新龍, 爐霍 → 大木雅; 六巴 → 小木雅; 1997:2,11). The script and language have been studied in growing detail since the beginning of the 20th century, by Russian, Chinese, Swedish, Japanese, and American paleographers and phonologists (paleography is a criterion for phonology here), following discoveries in the early 1900s, and gradual publication of a number of manuscripts (some only published in the late 1990s).

The 11th century Tangut rulers, asserting their independence from the Chinese, began to use Tangut as the official language, and decreed that a writing system should be developed, and that classical Buddhist and Chinese texts should be translated. Like the Chinese writing system which served as its conceptual model, the Tangut script (a.k.a. 西夏文 Xīxià Wén, 河西字 Héxī Zì, 番文 Fān Wén, 唐古特文 Tánggǔtè Wén, Тангут, Си Ся, ...) is a heterographic syllabary, which is to say that the writing is phonographic and syllabic (syllabographic), each element of the syllable canon having multiple representations, the representation of a specific member of any given homophone class being semantically motivated. Both cursive and square forms of Tangut writing are known, though the standard writing is the latter. There are also ornamental styles of Tangut writing; for example, there is a Seal Script form styled after Chinese 篆體 zhuàntǐ. As in modern Chinese writing, each Tangut syllabograph is confined to a roughly (depending on the style or font) uniform em-square and conjoins elements drawn from a set of stroke primitives, this set being an extension of that refined in the 宋 Sòng Chinese orthographic and lexicographic traditions.

The Tangut script is componential, which is to say that similar assemblages of strokes recur in different characters, and these recurrent elements are variously used as classifiers for indexing (several competing 部首 bùshǒu radical/stroke and also component schemes have been developed by modern scholars). Thus, Tangut script elements may be described using a stroke-based CDL, and in fact, structural analysis (as by Nishida and Kychanov) of this complex script would benefit greatly from future use of a slightly augmented CJK CDL. Unlike Chinese writing, Tangut script elements (characters and components) are said to be wholly lacking in pictographic basis, which is to say that there do not appear to be native traditions seeing the characters or components as stylized depictions of real-world objects. In this respect Tangut writing is the opposite of its ST cousin 納西 Nàxī Tomba (¹Na-²khi ²Dto-¹mba), which is primarily pictographic (and rather non-phonographic).

All Tangut characters are highly reminiscent of Chinese characters (more or less so, depending upon the calligrapher), and give the general impression that squinting hard might bring them into focus as some form of ancient Chinese writing. But even to native Chinese readers the Tangut script is strange and impenetrable. A story was told to me by an historian at Academia Sinica of the discovery in 明清 Míng-Qīng times of a stela bearing Tangut writing: the people were convinced that it was the work of the devil, and walled it up to protect themselves from its evil influence (they were too afraid to simply destroy it). Tangut characters are not simply non-standard (heretical) Chinese characters: rather, they write a non-Chinese language, and constitute a unique and innovative offshoot of sinitic graphological traditions. Unlike Vietnamese Chu Nom and Japanese Kanji (also writing non-Chinese languages), Tangut characters use several non-CJK stroke types and have mostly unanalysable (from a CJKV perspective) components. Tangut script lies completely outside CJK lexicographic traditions and outside the scope of UCS CJK unification, and Tangut characters are not treated as CJK “ideographs”.

In contrast with Chinese and Chinese-derived scripts, which have well-defined graphological traditions filtered through standardizing reanalyses in the Eastern Hàn Dynasty (121 A.D. 《說文解字》; see Cook 2003), there seems to be no surviving comparable native analysis of the Tangut script elements (cp. 《文海》Wén Hǎi). The “components” (according to some analysis) of Tangut characters are sometimes independent characters, but the contribution to (role in) the character composition (aside from contributing to the formation of a unique sign) is not always clear. Some few Tangut characters and components seem in fact to be abbreviations or variant forms of Chinese characters, though the vast majority do not.

For example, the Tangut character T3712 wuo ‘round’ (李范文 LFW 1997:4743) seems to be a simplified writing of Chinese 員 (貟员贠) yuán (< SBGY:110.14, 142.15, 396.06 /Gjuən/, /Gjuan/, /Gjuəns/) ‘round’; it is used as the left-side component in several other characters with ‘round’ meanings, and so seems to serve as semantic determiner in those compounds. As another example, the Tangut characters T2709 (LFW 1997:3745) dzja ‘surname’ and T2713 dzji ‘male’ (‘雄,男’; 1997:3746) both seem to be variant writings of Chinese 雄 xióng (< SBGY 026.06 /Gjuŋ/) ‘male (bird)’: the phonetic component 厷 gōng (< 肱 SBGY 201.46 /kuəŋ/ ‘arm’) of Chinese 雄 is given unique form in the Tangut (unattested in known Chinese variants), such that 厷 is miswritten (from the Chinese perspective) rather like Chinese 冬 dōng (< SBGY 032.30 /tuoŋ/) ‘winter’; the 隹 zhuī ‘bird’ component written on the right-side of both, in the latter writing (3746) has a rather conservative form of the Chinese (clearly derivative of Eastern Hàn traditions, giving an explicit loop for the head of the bird, as in the Hàn Small Seal Script), whereas the former writing (3745) used for the surname is the normal simple Chinese form 隹. The ‘bird’ component of the Tangut ‘male’ writing is productive in the script (e.g. on the left side of LFW 1997:5567..9, 5543, 5282), but the contribution (phonetic or semantic) in those writings is unclear.

Such examples of relatively obvious parallels to Chinese characters are the exception: most Tangut characters make use of unanalysable (from a CJKV perspective) components, themselves comprised of decidedly non-CJK stroke types. It may turn out that a comprehensive and persuasive graphological analysis of Tangut will appear in the future, but such work has not yet (to my knowledge) been accomplished, and a great deal of groundwork remains to be done. Nevertheless, the character repertory itself is very clearly delimited, as we shall see below.


The 中央研究院語言學研究所 Zhōngyāng Yánjiūyuàn Academia Sinica Linguistics Institute 西夏 Xīxià typeface and mapping database, which played a key part in the early stages of the UCS Proposal development, were developed by researchers under the direction of the historical phonologist 龔煌城教授 Prof. Gong Hwang-cherng (Gōng Huángchéng). The primary source for this work was a native 12th century Tangut phonological text with the Chinese title《同音》 Tóngyīn ‘Homophones’ (TY; Xīxià name /Ge_ ləu/ ‘sound same’, TYYJ:417-43B48, 469-53A72). A database was constructed on the basis of the character forms catalogued in collation of editions of this text analyzed by 李范文 Lǐ Fànwén (1932- ; a.k.a. 卜平, LFW) in his study《同音研究》 Tóngyīn Yánjiū [‘Homophony’ Research; TYYJ] (寧夏人民出版社, 1986). This database originally contained mappings for a total of 5,805 Tangut characters: in the proofing process, a total of 4 omissions were found, bringing the total for this data set (and for the “W” column TY source font in the Multi-column Code Chart) to 5,809 TY Tangut characters (the following 4 missing TY characters were added, and assigned virtual TY indices [TY9900..TY9903]: LFW 1997: 2585, 3007, 3044, 4480).

In his introduction, 李范文 Lǐ Fànwén wrote (1986:1) [and I translate, inserting my notes in square brackets]:

“《同音》是西夏王朝編修的一部韻書, 是研究西夏語言文字尤其是語音系統的重要資料. 這部書最早刊印於西夏崇宗乾順元德七年 (公元 1125 年), 正德六年 (公元 1132 年) 再次刊印, 世稱舊版本. 到了仁宗仁孝乾祐七年 (公元 1176 年) 前後, 西夏著名學者梁德養修訂重編, 於乾祐十八年 (公元 1187 年) 再次刊印出, 世稱新本或再版本.”

‘The Tóngyīn (TY) text is an imperial Xīxià rhyme book compilation, important for the study of Tangut language, literature and especially phonology. The earliest printing of this book dates to the year 1125, and it was reprinted in 1132; these are known as the old editions of the text. Around the year 1176 the famous Tangut scholar Liáng Déyǎng revised the text, and in the year 1187 reprinted what is known today as the new or reprint version of the text.’

“目前海內外流行版本, 即羅福成先生的手抄刊印本, 為西夏正德六年 (公元 1132 年) 刊印的舊版本. 新舊兩種版本原件現均藏蘇聯列寧格勒東方學研究所, 是 1909 年俄國人柯茲洛夫 (P. K. Kozlov) 從我國黑水古城 (今內蒙古額濟納旗) 發掘所得, 原件迄今尚未公布, 只是在索夫羅諾夫 (M. V. Sofronov) 著的《西夏語文法》 (1968) 第二卷裡將《同音》兩種版本按聲韻排列剪貼發表.”

‘The current Chinese editions of the text derive from the hand copy by Luó Fúchéng. [Made by him and published in 1935 in 大連 Dàlián, on the basis of the hand copy made by his father 羅振玉 Luó Zhènyù (1866-1940) in 1919 in Japan, on the basis of photographs; see Lǐ Fànwén 1986:5,14,929: “1935年, 羅福成先生又將其父抄錄出版的《同音》舊版本重新抄寫在大連刊印, 名曰: 《西夏國書字典音同》一卷, 由劉楚人題簽, 庫籍整理處印.” Color photographs of this complete text are available in a 62-page PDF online from the 早稲田大学図書館 Waseda University Library manuscript collection (as of 2010-02-16).] This is a copy [of a copy] of the old text of 1132. [LFW 1986:656-767 publishes yet another hand copy, this one completed by 盧桐 Lú Tóng in October of 1985 in 寧夏 Níngxià, reflecting the results of LFW’s collation and corrections. The 盧桐 copy served as the basis for the (LFW 1986) font produced at Academia Sinica by 龔煌城 et al., corrected for use in the multi-column code charts.] The old and new texts were excavated by Kozlov [Пётр Кузьмич Козлов] in 1909 at ancient Hēishuǐ (modern Éjìnàqí [a.k.a. Khara-Khoto], in Inner Mongolia) and taken back to Russia. The originals up to the present have not been published [** this is no longer true, as of 1997; see 《俄藏黑水城文獻: 西夏文世俗部分》 Hēishuǐchéng Manuscripts Collected in the St. Petersburg Branch of the Institute of Oriental Studies of the Russian Academy of Sciences, Vol. 7; Tangut Secular Ms.; Shanghai Chinese Classics Publishing House, 1997; ISBN: 7-5325-2213 X/Z-307; Sinica Ling. Lib.: 797.9 2652 6:7], and are only known from Sofronov’s 1968 Tangut Grammar [Софронов, М. В., Грамматика тангутского языка].’

“據已公佈的資料分析, 可以看出新舊兩種版本最大的區別在於舊版本大體上不分聲調, 平聲和上聲放在一起, 視為同音; 新版本則把平聲與上聲分開, 各成一類. [...] ”

‘According to analysis of the published material it can be seen that the major difference between the old and new editions of the TY text relates to their respective tonal classifications: the old text in large part does not distinguish the even and rising tone classes, which are seen as homophones; the new text distinguishes the two tones, putting them in separate classes. ...’

(1986:14): “羅福成先生將《同音》一書鈔寫出版, 對學術界的貢獻, 為世所公認. 這部書自一九三五年出版至今整五十年了. 由於羅先生當時未見《文海》原件,也未見《同音》新版本, 他僅根據《同音》舊版的照片. 當時資料缺乏, 他不可能校勘得十分精確, 筆誤和錯別字在所難免. 現在, 我們根據《同音》兩種版本 [見索夫羅諾夫: 《西夏語文法》第二卷, 第 102-273 頁, 及本書 484-655 頁.] 和《文海》原件 [見柯萍等: 《文海》第二卷, 第 499-607 頁, 及史金波《文海研究》第 559-668 頁.], 以及西夏陵墓出土的西夏文殘 [見李范文: 《西夏陵墓出土殘碑粹編》文物出版社 1984 年版.] 等原始資料, 對羅氏抄本進行校勘, 發現錯別字八百六十四個之多, 約占全文 (大字和注字) 7.3%. [...]”

‘The academic contribution which Luó Fúchéng made with his hand-copied edition of TY is generally recognized. A full fifty years have gone by since the book was first published in 1935. Because Luó had not seen the Wén Hǎi [native Tangut dictionary] manuscript, and had not seen the new text of TY, but only had access to photos of the old TY text, materials at that time were lacking. He was unable to collate the different editions with complete accuracy, and slips of the pen and miswritten characters were difficult to avoid. Now, we have access to both old and new TY texts (thanks to Sofronov’s Grammar), the Wén Hǎi manuscript (thanks to Kepping et al. 1969 [К. В. Кепинг, В. С. Колоколов, Е. И. Кычанов: Море письмен], and Shǐ Jīnbō [1983]), as well as the Tangut tomb inscription fragments (see Lǐ Fànwén 1984), and other primary resources. Collating the Luó Fúchéng text against these, we find a total of 864 errors, accounting for roughly 7.3% of the total text (including both large and small characters). [Tabulation of these errors follows.] [...]’

Some ten years later, in the Introduction to his 《夏漢字典》 Tangut-Chinese Dictionary, Lǐ Fànwén (1997:15) says more about these numbers:

“本字典共收集 6,000 個單字 (包括異體字). 西夏國書《同音》字典的作者在《序言》中說: 其書 ‘大字 6,133, 注字 6,230’. 注字無誤, 大字多統計近 300 字, 其後以訛轉訛, 學術界至今誤認為 ‘西夏字共計 6,000 多字’, 從現有的資料可以肯定, 西夏字只有 5,800 多字, 其餘為異體和訛字.”

‘The present dictionary collects altogether 6,000 characters (including variants [and duplicates]). The author of the Tangut national book Tóng Yīn says in his preface that he wrote 6,133 main characters, and 6,230 annotation characters. The total number of annotation characters is not incorrect, but the count of main characters is over-estimated by about 300, and thereafter, as error breeds error, in academic circles down to the present day it is mistakenly reckoned that Tangut has about 6,000 characters. From extant materials we are certain that Tangut writing has only about 5,800 characters, and the rest are variants and mis-spellings.’


According to 韓小忙 Hán Xiǎománg (HXM), Xīxià characters were in use for a total of less than 500 years (1036-1502) [韓小忙, 2004.5:iii; 《西夏文正字研究》 (On Tangut Orthography; Ph.D. dissertation K246.3 H211.7, directed by 李范文 Lǐ Fànwén)]; their use extended beyond the Mongol conquest (1227) for less than 300 years. HXM undertakes a comprehensive and systematic collation of Tangut characters, based on nine Tangut dictionaries (《同音》, 《文海寶韻》, 《同音文海寶韻合編》, 《番漢合時掌中珠》, 《三才雜字》, 《纂要》, 《同義》, 《五音切韻》, 《新集碎金置掌文》), and catalogues a total of 6,066 forms, including 169 variants, 36 errors, and 5,861 unique ‘standard-style characters’ (“正字” zhèngzì ‘orthography’). Because of the relatively short period of usage, and invention by decree, there is fairly little variation in character forms (Lǐ Fànwén 1997:11; especially relative to the Chinese script, which by comparison is written over perhaps 3,000 years, and reinvented, reanalyzed, and redefined periodically). HXM gives a complete set of mappings to the various primary manuscripts, to the Xià-Hàn Dictionary (Lǐ 1997), and to Sofronov (1968). HXM’s total of 5,861 unique forms (larger by 56 than the Sinica TY inventory of 5,805 [+4; see above] characters), is divided into 3 broad categories (given below as 1, 2, 3).
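The counts quoted in this paragraph can be cross-checked with simple arithmetic. The following is only an illustrative sketch: the constant names are hypothetical, and the figures are those quoted in these notes from HXM (2004) and from the Sinica TY data set.

```python
# Figures quoted in these notes (constant names are illustrative only).
HXM_VARIANTS = 169         # variant forms catalogued by HXM
HXM_ERRORS = 36            # erroneous forms catalogued by HXM
HXM_STANDARD = 5861        # unique standard-style characters (zhengzi)
SINICA_TY_ORIGINAL = 5805  # Sinica TY inventory before proofing
SINICA_TY_ADDED = 4        # omissions restored during proofing

# Total forms catalogued by HXM, and the proofed Sinica TY total.
hxm_total = HXM_VARIANTS + HXM_ERRORS + HXM_STANDARD
sinica_total = SINICA_TY_ORIGINAL + SINICA_TY_ADDED

# Difference between HXM's standard forms and the Sinica inventory,
# before and after the 4 additions.
diff_original = HXM_STANDARD - SINICA_TY_ORIGINAL
diff_proofed = HXM_STANDARD - sinica_total

# Period of use of the script, and use after the Mongol conquest.
years_in_use = 1502 - 1036
years_after_conquest = 1502 - 1227

print(hxm_total, sinica_total)             # 6066 5809
print(diff_original, diff_proofed)         # 56 52
print(years_in_use, years_after_conquest)  # 466 275
```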

HXM’s multi-column text-based variant typology makes it clear that the Sinica character set includes all but two of the 5,784 Class 1 elements of the script, and that the Sinica character set also admits a total of 21 other forms which are distributed over Classes 1, 2 and 3. That is, by his analysis, each of these 21 forms is either a variant of a primary class member, or a secondary or tertiary class member. (The precise relation of the HXM varclasses to the Sinica character set is set forth in the online Tangut mapping data.) As with CJK Unified Ideographs, the set of Tangut characters is somewhat ill-defined, due to manuscript deficiencies and scribal variation, and yet, it is anticipated that (barring unexpected manuscript discoveries) the total number of candidates for future Tangut encoding (beyond the proposed repertory) will be small. As with CJK, issues relating to distinctive features and variant mapping can be handled by combination of variation selector (VS) and higher-level protocol (CDL). In contrast with CJK, although component-based lexical classifiers (部首) are in use for Tangut, the classifier systems are not standardized. Though it may be appropriate in the future to encode a block of Tangut Radicals (and perhaps even some stroke types), the present proposal does not include these: instead, we provide radical assignments and residual stroke-counts in the mapping data, following several such systems (e.g. LFW 1986, 1997; HXM 2004; documented in TR43). Character components or recurrent stroke patterns are best identified using CDL, the preferred method for indexing the members of this character set.


As mentioned in the proposal document, the first attempt to define a standard electronic encoding of Tangut was Grinstead’s “Tangut Telecode” (Analysis of Tangut Script, Copenhagen: Scandinavian Institute of Asian Studies Monograph, 1971 [UCB DS 3 A2 S4 M6 No.8-10; ISBN 91-44-09191-5]). This early system, which assigns serial numbers to 5,819 characters and maps them to Wén Hǎi (Kepping et al. 1969), apparently never became widely used. In more recent times, electronic Tangut data has been in use among Japanese and Russian scholars, based (it seems) primarily on the Mojikyo (文字鏡) character collection, which applies a Shift-JIS encoding to all 6,000 李范文 Lǐ Fànwén (1997) entries (Mojikyo numbers 570001 .. 576000 = LFW:1997: 0001 .. 6000; see 荒川慎太郎 Arakawa Shintaro below; this outline font [augmented by a bitmapped radical & component font] was used by Е. И. Кычанов [E.I. Kychanov], Tangut-Russian-English-Chinese Dictionary, Kyoto Univ., 2006; cf. 池田巧 Ikeda Takumi). See also the recent work on Xīxià undertaken at 中易中标电子信息技术有限公司 (Běijīng Zhōng Yì Electronics Ltd.).
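The Mojikyo numbering described above (570001 .. 576000 = LFW 1997: 0001 .. 6000) amounts to a fixed offset from the dictionary entry number. A minimal sketch of that correspondence follows; the function and constant names are hypothetical, and the offset is inferred only from the range endpoints given here, not from any Mojikyo documentation.

```python
# Offset inferred from the stated range: Mojikyo 570001 = LFW 1997 entry 0001.
MOJIKYO_TANGUT_OFFSET = 570000
LFW_MIN, LFW_MAX = 1, 6000  # entry numbers in Li Fanwen (1997)


def lfw_to_mojikyo(lfw_number: int) -> int:
    """Map an LFW 1997 entry number to its Mojikyo number."""
    if not LFW_MIN <= lfw_number <= LFW_MAX:
        raise ValueError(f"LFW number out of range: {lfw_number}")
    return MOJIKYO_TANGUT_OFFSET + lfw_number


def mojikyo_to_lfw(mojikyo_number: int) -> int:
    """Map a Mojikyo number in the Tangut range back to its LFW 1997 entry."""
    lfw_number = mojikyo_number - MOJIKYO_TANGUT_OFFSET
    if not LFW_MIN <= lfw_number <= LFW_MAX:
        raise ValueError(f"Not in the Tangut range: {mojikyo_number}")
    return lfw_number


# LFW 1997:4743 is the entry for 'round' cited earlier in these notes.
print(lfw_to_mojikyo(4743))    # 574743
print(mojikyo_to_lfw(576000))  # 6000
```

Range checks are included so that numbers outside the 6,000 dictionary entries are rejected rather than silently mapped.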

In addition to the primary lexical sources mentioned above, primary manuscript sources of Tangut are translations of Buddhist (cf. Kychanov:1999 Catalogue; 西田龍雄 Nishida Tatsuo, Tangut Lotus Sutra, 2004) and classical Chinese texts (e.g. 林英津 Lín Yīngjīn,「夏譯《孫子兵法》研究」 The Tangut text of ‘The Art of War’, 1994). For a general introduction to Tangut primary and secondary documents, see “西夏語文獻導讀” [Readers’ guide to Tangut literature] (林英津, 《遼夏金元史教研通訊》, 2004.2). For a recent bibliography of Tangut research in Chinese, Japanese, Russian, English, etc., see:《西夏關連研究文獻目錄》 Xīxià guānlián yánjiū wénxiàn mùlù [A catalogue of Tangut-related research literature] (2002: ISBN: 4-902325-00-4). In addition to the recent computerization work by 韓小忙 Hán Xiǎománg (2004), see e.g. 中嶋幹起 NAKAJIMA Motoki et al. (中嶋幹起, 今井健二: 電腦處理 西夏文字諸解對照表 1998; 中嶋幹起, 李范文: 電腦處理 西夏文雜字研究 1997). For further information on Tangut, see also: 中国社会科学院民族学与人类学研究所 Chinese Academy of Social Sciences (CASS), Institute of Ethnology and Anthropology 中国少数民族文字字符总集.



Images of some Tangut (Xīxià) texts and manuscripts



IMG_2158.gif

Xīxià /Ge_ ləu/ ‘sound same’: Tangut title of the ancient Tangut phonology book called《同音》 Tóngyīn (‘Homophones’) in Chinese (TY).


IMG_2159.gif

Title page of 《同音研究》Tóngyīn Yánjiū (‘Homophones’ Research, 李范文 Lǐ Fànwén, 1986, 寧夏人民出版社).


IMG_2160.gif

First page of the 1986 recension (title characters are at upper right, reading top-to-bottom in columns).


IMG_2161.gif

Random inner page (1986:53A,B), showing “large” (大) and “small” (注 ‘note’) Tangut characters.


IMG_2162.gif

Random index pages (1986:820-1), exemplifying primary mapping data print source.


IMG_2167.gif

Sample manuscript pages reproduced in 《夏漢字典》(李范文 1997; ISBN: 7-5004-2113-3): 《文海》, 《同音》,《番漢》,《雜字》.


sc0616eacc.gif

Another page from 李范文 1997, showing (lower right) examples of Tangut in Seal Script (篆體 Zhuàntǐ) style.


sc06172a60.gif

A page from the main body of the dictionary (李范文 1997:302.1572), showing the entry for the native name of the Tangut state (glossed “大白高國”).


strokes_HXM.gif

The set of Tangut stroke types, 韓小忙 Hán Xiǎománg (2004).



For assistance in preparation of the present notes and materials for the encoding proposal, I am indebted to the following people at 中央研究院語言學研究所 Linguistics Institute, Academia Sinica, Taiwan: 龔煌城 Gong Hwang-cherng, 莊德明 Chuang Derming, 高雅琪 Gau Yeachyi, 鄭錦全 C.C. Cheng, 林英津 Lin Ying-chin, 余文生 Jonathan Evans; and thanks also to 黃銘崇 Hwang Ming-chorng of 中央研究院歷史語言研究所 Institute of History and Philology, Academia Sinica.

Special thanks also to 池田巧 Ikeda Takumi, Jim Matisoff, Ken Lunde, Martin Heijdra, Andrew West, 加藤昌彦 Atsuhiko Kato, 阿南康宏 Anan Yasuhiro, and 荒川慎太郎 Arakawa Shintaro. See the full acknowledgments in the encoding proposal.



Page Generated in Perl:
Mon Mar 15 18:25:55 2010
Berkeley, California, USA

