Re: character names (questions)

From: Mark Davis
Date: Thu Apr 06 2000 - 11:00:21 EDT

These are good questions. Some brief answers:

1. The Unicode Standard 3.0 describes how to generate names for the abbreviated lists -- CJK ideographs and Hangul syllables -- so check there. The algorithm for Hangul names is also in TR#15.

The surrogate characters currently have no names, since they are unassigned.
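
For illustration, here is a minimal Python sketch of generating those names. The function name `derived_name` is hypothetical, and the CJK ranges are just the ones quoted from UnicodeData.txt in the question below; the jamo short names follow the Hangul syllable name algorithm in the standard. Treat it as a sketch under those assumptions, not a reference implementation.

```python
# Constants of the Hangul syllable decomposition.
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28
N_COUNT = V_COUNT * T_COUNT      # 588
S_COUNT = L_COUNT * N_COUNT      # 11172

# Jamo short names: leading consonants, vowels, trailing consonants.
JAMO_L = ["G", "GG", "N", "D", "DD", "R", "M", "B", "BB", "S",
          "SS", "", "J", "JJ", "C", "K", "T", "P", "H"]
JAMO_V = ["A", "AE", "YA", "YAE", "EO", "E", "YEO", "YE", "O", "WA",
          "WAE", "OE", "YO", "U", "WEO", "WE", "WI", "YU", "EU", "YI", "I"]
JAMO_T = ["", "G", "GG", "GS", "N", "NJ", "NH", "D", "L", "LG",
          "LM", "LB", "LS", "LT", "LP", "LH", "M", "B", "BS", "S",
          "SS", "NG", "J", "C", "K", "T", "P", "H"]

# Assumption: only the CJK ranges listed in the question are covered.
CJK_RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FA5)]


def derived_name(cp):
    """Return the generated name for a code point, or None if the
    code point is not in one of the abbreviated ranges."""
    s_index = cp - S_BASE
    if 0 <= s_index < S_COUNT:
        l_index = s_index // N_COUNT
        v_index = (s_index % N_COUNT) // T_COUNT
        t_index = s_index % T_COUNT
        return ("HANGUL SYLLABLE "
                + JAMO_L[l_index] + JAMO_V[v_index] + JAMO_T[t_index])
    for first, last in CJK_RANGES:
        if first <= cp <= last:
            return "CJK UNIFIED IDEOGRAPH-%04X" % cp
    return None


print(derived_name(0x5728))
print(derived_name(0xAC00))
```

For the U+5728 example in the question, this yields a name formed from the fixed prefix plus the code point in hex.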

2. The naming guidelines for 10646 will be up whenever it is published. Names are limited to the uppercase letters A-Z, the digits 0-9, and hyphen-minus (maybe another character in there too -- I can't remember offhand).

3. When surrogate characters are assigned, they will be given properties and names just like all other codepoints. We haven't decided the format that they will take. My guess right now is that we will list the codepoint with 5 or 6 digits. [Since the only codepoints needing 6 digits are private use, 5 is enough for this purpose. For other uses, such as character mapping tables, 6 digits would be needed because byte sequences may map to private use codepoints.] Thus you would have something like:

23456;SOME CHAR NAME;Lo;0;L;...

in the main data file, and similar entries in the other files of the Unicode Character Database.
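
If you want to structure your code so that longer codepoints slot in later, one option is to parse the first field as hex of any length and to treat the First/Last marker lines as ranges. A rough sketch, assuming the semicolon-separated layout shown above; `Record`, `parse_record` and `load_ranges` are hypothetical names, and only the first three fields are kept:

```python
from typing import NamedTuple


class Record(NamedTuple):
    code_point: int
    name: str
    category: str


def parse_record(line):
    """Parse one UnicodeData.txt-style line. The code point field is
    read as hex, so 4, 5 or 6 digits all work the same way."""
    fields = line.rstrip("\n").split(";")
    return Record(int(fields[0], 16), fields[1], fields[2])


def load_ranges(lines):
    """Collect (first, last, label) for each '<X, First>' / '<X, Last>'
    pair of marker lines."""
    ranges = []
    pending = None
    for line in lines:
        rec = parse_record(line)
        if rec.name.endswith(", First>"):
            pending = rec
        elif rec.name.endswith(", Last>") and pending is not None:
            label = rec.name[1:-len(", Last>")]
            ranges.append((pending.code_point, rec.code_point, label))
            pending = None
    return ranges


sample = [
    "AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;",
    "D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;",
]
print(load_ranges(sample))
print(parse_record("23456;SOME CHAR NAME;Lo;0;L;..."))
```

Keying everything on the integer code point, rather than on a 4-digit string, means nothing in the lookup code needs to change when 5-digit entries appear.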

4. >why are they called "character names" and not "code point names"?
The term "character name" is a holdover. As we developed more experience with Unicode, we developed better terminology for distinguishing between the many different uses of the term "character", and now distinguish "abstract character", "code point", "code unit", "grapheme" and "glyph" more clearly. "Code point name" would indeed now be a better term because it more precisely indicates the function. (E.g. there is not, in general, a 1:1 relationship between abstract characters and codepoints.)

However, "character name" is ok, as long as you remember which sense of character is being used.
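
To make the code point / code unit distinction concrete: a code point above U+FFFF is a single code point but two UTF-16 code units (a surrogate pair). A small sketch of the standard UTF-16 arithmetic follows; the function names are hypothetical, and the value 23456 is just the made-up codepoint from the example line above.

```python
def to_surrogates(cp):
    """Split a supplementary code point into (high, low) UTF-16 code units."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def to_code_point(high, low):
    """Combine a surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x23456)
print("%04X %04X" % (high, low))
print("%X" % to_code_point(high, low))
```

A name lookup would then be keyed on the combined code point, not on the individual code units.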


Viranga Ratnaike wrote:

> Dear Unicoders,
> I have 4 questions about character names:
> (1) how does one figure out the character names of the code points
> (in ranges in the UnicodeData.txt file)? Is there a separate
> file? Can you auto generate them and if so how?
> For example: if I wanted to find the name of code point U+5728
> where would the information be?
> I'm auto generating data structures; Using UnicodeData.txt, as
> input, gets me most of the way (I think). The gaps occur for
> the ranges:
> 3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;
> 4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;
> 4E00;<CJK Ideograph, First>;Lo;0;L;;;;;N;;;;;
> 9FA5;<CJK Ideograph, Last>;Lo;0;L;;;;;N;;;;;
> AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
> D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
> D800;<Non Private Use High Surrogate, First>;Cs;0;L;;;;;N;;;;;
> DB7F;<Non Private Use High Surrogate, Last>;Cs;0;L;;;;;N;;;;;
> DC00;<Low Surrogate, First>;Cs;0;L;;;;;N;;;;;
> DFFF;<Low Surrogate, Last>;Cs;0;L;;;;;N;;;;;
> ...and also for the private use ranges
> (which we'll probably be needing).
> (2) how do I locate the ISO/IEC character naming guidelines?
> I looked in "The Unicode Standard Version 3.0" and it refers
> me to Informative Annex K of ISO/IEC 10646. Is the information
> available electronically? I looked at the ISO site and it said
> that "there is no electronic access to the contents of ISO
> standards". It did
> mention that this was in the pipeline, but didn't say when.
> (3) when surrogates are introduced, will there be mappings from
> surrogate pairs to character names? Will they be included
> in later versions of UnicodeData.txt? It's not an issue at
> the moment, but I'd like to structure my code such that I can
> just slot in surrogate code later.
> (4) why are they called "character names" and not "code point names"?
> Regards,
> Viranga
> Phone: +61 3 9925 4124 (Work)
