Re: Largest character

From: Kenneth Whistler (kenw@sybase.com)
Date: Fri Mar 31 2000 - 17:59:41 EST


Gary responded to Samir's question:

> Characters <128 take one byte.
> Characters <2048 take two bytes.
> All others in the 64K normal range take three bytes each.

True.

> There are provisions for characters above 2,097,152 to use three bytes, but
> normally unicode is only up to 64K. However when additional space is used,
> it still uses two bytes up to the 2,097,152 point.

False.

Unicode (= UTF-16) is defined for the Unicode scalar range 0 .. 0x10FFFF.

Unicode scalar values in the range 0x10000..0x10FFFF (decimal 65,536..1,114,111)
require *four* bytes when expressed in UTF-8. (And also four bytes when
expressed as a surrogate pair for UTF-16, by the way.)
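
As a rough illustration of those byte counts, here is a minimal Python sketch.
The helper names utf8_length and utf16_length are my own, not from any library;
the loop simply cross-checks them against Python's built-in encoders.

```python
def utf8_length(code_point: int) -> int:
    """Bytes UTF-8 needs for one Unicode scalar value (hypothetical helper)."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if code_point < 0x80:       # below 128
        return 1
    if code_point < 0x800:      # below 2048
        return 2
    if code_point < 0x10000:    # rest of the 64K range
        return 3
    return 4                    # 0x10000..0x10FFFF


def utf16_length(code_point: int) -> int:
    """Bytes UTF-16 needs: one 16-bit unit, or a surrogate pair."""
    return 2 if code_point < 0x10000 else 4


# Values on either side of each boundary.
SAMPLES = (0x41, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF)

for cp in SAMPLES:
    ch = chr(cp)
    # Cross-check the range logic against the real encoders.
    assert len(ch.encode("utf-8")) == utf8_length(cp)
    assert len(ch.encode("utf-16-be")) == utf16_length(cp)
    print(f"U+{cp:04X}: UTF-8 {utf8_length(cp)} bytes, UTF-16 {utf16_length(cp)} bytes")
```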

> Using Character Agent from Bjondi, we find that 2048 (hex 0800) is in the
> middle of the arabic characters.

Actually, that boundary is *above* all the Arabic characters, and also above
Syriac and Thaana.
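
A quick way to check this, sketched in Python. The block ranges below are the
main Arabic, Syriac, and Thaana blocks as I have hard-coded them here (the
Arabic presentation forms sit much higher and are not included); BLOCKS and
BOUNDARY are just names chosen for this example.

```python
# Main block ranges, hard-coded for illustration.
BLOCKS = {
    "Arabic": (0x0600, 0x06FF),
    "Syriac": (0x0700, 0x074F),
    "Thaana": (0x0780, 0x07BF),
}

BOUNDARY = 0x0800  # first code point that needs three bytes in UTF-8

for name, (first, last) in BLOCKS.items():
    width = len(chr(last).encode("utf-8"))
    side = "below" if last < BOUNDARY else "at or above"
    print(f"{name}: U+{first:04X}..U+{last:04X} ends {side} U+0800, "
          f"last code point takes {width} bytes")
```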

> Below this are things like hebrew,
> armenian, cyrillic, greek, and some other misc stuff. All the asian sets are
> above this point.

True.

If Samir's question is reinterpreted as "what are those languages which
require use of 3-byte UTF-8 forms?", then the answer is roughly:

   Any language written using one of the Indic scripts (e.g. Hindi, Marathi,
   Nepali, etc. using the Devanagari script, and so on); Sinhalese; Thai
   and Lao; Tibetan and Dzongkha; Myanmar and any other language written
   using the Myanmar script (e.g. Shan, Karen, etc.); Georgian; any
   language using the Ethiopic script (e.g. Amharic, Tigré, etc.); Cherokee;
   any language using Unified Canadian Aboriginal Syllabics (e.g. Cree,
   Inuit, etc.); old Irish written in the Ogham script; old European
   languages written with Runes; Khmer; any language written with the
   Mongolian script (Mongolian, Manchu, Todo, etc.); Vietnamese and a
   number of minority European languages (Livonian, Welsh, etc.) if
   using Latin extended precomposed characters; polytonic Greek, if using
   Greek extended precomposed characters; Japanese, Chinese, Korean, and Yi.
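
To make the list above concrete, here is a small Python sketch that encodes
one sample character from several of those scripts and reports the UTF-8
width. The sample characters are my own picks, one per script, and the names
THREE_BYTE_SAMPLES and TWO_BYTE_SAMPLES are only for this example.

```python
# One arbitrary sample character per script (chosen for illustration).
THREE_BYTE_SAMPLES = {
    "Devanagari": "\u0905",
    "Sinhala": "\u0D85",
    "Thai": "\u0E01",
    "Lao": "\u0E81",
    "Tibetan": "\u0F40",
    "Myanmar": "\u1000",
    "Georgian": "\u10D0",
    "Ethiopic": "\u1200",
    "Cherokee": "\u13A0",
    "Canadian Aboriginal Syllabics": "\u1401",
    "Ogham": "\u1681",
    "Runic": "\u16A0",
    "Khmer": "\u1780",
    "Mongolian": "\u1820",
    "Latin Extended Additional": "\u1EA0",
    "Greek Extended": "\u1F00",
    "Hiragana": "\u3042",
    "CJK ideograph": "\u4E2D",
    "Yi": "\uA000",
    "Hangul syllable": "\uAC00",
}

# For contrast: scripts that sit below U+0800.
TWO_BYTE_SAMPLES = {
    "Greek": "\u03A9",
    "Cyrillic": "\u0416",
    "Armenian": "\u0531",
    "Hebrew": "\u05D0",
}

for group in (TWO_BYTE_SAMPLES, THREE_BYTE_SAMPLES):
    for script, ch in group.items():
        width = len(ch.encode("utf-8"))
        print(f"{script:30} U+{ord(ch):04X} -> {width} bytes")
```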

Furthermore, since most other languages, including English, make use of
some of the general punctuation characters in U+2000..U+206F, including,
notably, the curly quote marks, en dash, em dash, and bullet, it can be
expected that 3-byte UTF-8 forms will be found mixed into Unicode text
for almost any language, even when all of the letters of that text only
require 1- or 2-byte UTF-8.
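
For instance, a minimal Python sketch with an English sample sentence of my
own invention: every letter is ASCII, yet the punctuation pulls in 3-byte
forms.

```python
# English text whose letters are all ASCII, with typographic punctuation
# from U+2000..U+206F written as escapes so the code points are explicit.
text = "\u201cQuoted\u201d text \u2013 or this \u2014 plus a \u2022 bullet"

# Collect every character that needs three bytes in UTF-8.
wide = [ch for ch in text if len(ch.encode("utf-8")) == 3]

for ch in wide:
    print(f"U+{ord(ch):04X} takes {len(ch.encode('utf-8'))} bytes")

# Each 3-byte character adds two bytes beyond the character count.
print(len(text), "characters,", len(text.encode("utf-8")), "bytes")
```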

--Ken

>
>
> ----- Original Message -----
> > Forwarding for Samir...
> >
> > > Subject: Largest character
> > > Date: Fri, 31 Mar 2000 10:16:33 +0530
> > >
> > > Hi,
> > > Which are those languages whose characters requires maximum number of
> > > bytes to store using UTF 8?
> > >
> > > - Samir Mehrotra,


