Re: PRI #200: Draft UTR #49, Unicode Character Categories from Andrew West on 2011-07-14 (Unicode Mail List Archive)

From: Andrew West <andrewcwest_at_gmail.com>
Date: Thu, 14 Jul 2011 10:35:17 +0100

On 14 July 2011 00:03, <announcements_at_unicode.org> wrote:
> The Unicode Technical Committee has posted a new issue for public review and
> comment. Details are on the following web page:
>
> PRI #200 Draft UTR #49: Unicode Character Categories
>
> This document presents an approach to the categorization of Unicode
> characters, and documents data files that implementers can use for defining
> and labeling Unicode character categories.

==General Rant==

I like the idea of categorizing characters hierarchically, but any
categorization scheme is necessarily subjective to a greater or lesser
degree, and I do not think that the Unicode Consortium should be
pushing one particular hierarchical categorization model as the
definitive categorization of Unicode characters. It seems to me that
this is one of several recent expansions to the scope of the Unicode
Character Database (ScriptExtensions.txt is another example) that are
neither necessary nor particularly helpful.

==Specific Comment==

There are 18 top-level categories:
[Control]
[Diacritic]
[Format]
[Hieroglyph]
[Ideogram]
[Ideograph]
[Letter]
[Logogram]
[Logograph]
[Mark]
[Number]
[Punctuation]
[Sign]
[Syllable]
[Symbol]
[Virama]
[Vowel]
[Word]

What are the differences between [Ideograph] and [Ideogram], and
between [Logograph] and [Logogram]? Even if UTR #49 does give
distinctly different definitions for each of these four top-level
categories, the distinctions will not be obvious to most users of
Categories.txt, because the -graph and -gram forms are synonymous in
general use:

<http://en.wikipedia.org/wiki/Logogram>
<http://en.wikipedia.org/wiki/Ideogram>
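
For anyone who wants to check a label list for this kind of near-synonym
pair mechanically, here is a minimal sketch in Python. It assumes only the
18 labels exactly as listed above (brackets dropped); the function name
gram_graph_pairs is made up for illustration and is not part of UTR #49 or
Categories.txt, and the sketch does not parse the actual data file.

```python
# Top-level category labels as listed in this message (brackets dropped).
TOP_LEVEL = [
    "Control", "Diacritic", "Format", "Hieroglyph", "Ideogram", "Ideograph",
    "Letter", "Logogram", "Logograph", "Mark", "Number", "Punctuation",
    "Sign", "Syllable", "Symbol", "Virama", "Vowel", "Word",
]


def gram_graph_pairs(labels):
    """Return (X-gram, X-graph) pairs where both labels are present.

    Hypothetical helper: it only compares label spellings and says
    nothing about how UTR #49 defines the categories.
    """
    present = set(labels)
    pairs = []
    for label in sorted(present):
        if label.endswith("gram"):
            # Swap the "-gram" suffix for "-graph" and look for the twin.
            twin = label[: -len("gram")] + "graph"
            if twin in present:
                pairs.append((label, twin))
    return pairs


if __name__ == "__main__":
    print(len(TOP_LEVEL))               # 18
    print(gram_graph_pairs(TOP_LEVEL))  # the two -gram/-graph pairs
```

Run against the list above, this flags exactly the two pairs in question:
Ideogram/Ideograph and Logogram/Logograph.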

Andrew
Received on Thu Jul 14 2011 - 04:40:42 CDT
