Re: Abstract character?

From: Mark Davis (mark@macchiato.com)
Date: Mon Jul 22 2002 - 23:34:54 EDT


A small correction to Ken's message:

> The Unicode scalar value
> definitionally excludes D800..DFFF, which are only code unit
> values used in UTF-16, and which are not code points associated
> with any well-formed UTF code unit sequences.

The UTC has decided to make scalar value mean unambiguously the
code points 0000..D7FF, E000..10FFFF, i.e., everything but surrogate
code points. While surrogate code points cannot be represented in
UTF-8 (as of Unicode 3.2), the UTC has not decided that the surrogate
code points are illegal in all UTFs; notably, they are legal in
UTF-16.

Ken is pushing for this change; I believe it would be a very bad idea.
(I think the reasons have already appeared on this list, so I am not
trying to reopen the discussion; just state the current situation.)
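
To make the distinction above concrete, here is a minimal Python sketch of
the two ranges. The function names are hypothetical, chosen only for
illustration; the ranges are the ones given in this thread. The last helper
relies on Python 3's own "utf-8" codec, which refuses lone surrogates; that
is a property of Python's codec, shown as an example and not as a statement
of what the UTC has decided.

```python
# Illustrative helpers; the names are assumptions, not terms from the standard.
SURROGATE_FIRST = 0xD800
SURROGATE_LAST = 0xDFFF
CODESPACE_LAST = 0x10FFFF


def is_code_point(n: int) -> bool:
    """A code point is any number in the codespace 0..10FFFF."""
    return 0 <= n <= CODESPACE_LAST


def is_surrogate_code_point(n: int) -> bool:
    """Surrogate code points are D800..DFFF."""
    return SURROGATE_FIRST <= n <= SURROGATE_LAST


def is_scalar_value(n: int) -> bool:
    """A scalar value is any code point except a surrogate code point."""
    return is_code_point(n) and not is_surrogate_code_point(n)


def python_utf8_accepts(n: int) -> bool:
    """Report whether Python 3's "utf-8" codec will encode code point n."""
    try:
        chr(n).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Python's codec rejects lone surrogates in UTF-8.
        return False
```

With these definitions, 0xD800 is a code point but not a scalar value, and
Python's UTF-8 codec declines to encode it.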

Mark
__________
http://www.macchiato.com
◄ “Eppur si muove” ►

----- Original Message -----
From: "Kenneth Whistler" <kenw@sybase.com>
To: <larsga@garshol.priv.no>
Cc: <unicode@unicode.org>; <kenw@sybase.com>
Sent: Monday, July 22, 2002 13:38
Subject: Re: Abstract character?

> Lars Marius Garshol asked:
>
> > I'm trying to find out what an abstract character is. I've been
> > looking at chapter 3 of Unicode 3.0, without really achieving
> > enlightenment.
> >
> > The term Unicode scalar value (apparently synonymous with code
> > point) seems clear. It is the identifying number assigned to
> > assigned Unicode characters.
>
> Here is one of my attempts at a more rigorous term rectification:
>
> Abstract character
>
> that which is encoded; an element of the repertoire (existing
> independent of the character encoding standard, and often
> identifiable in other character encoding standards, as well
> as the Unicode Standard); the implicit basis of transcodings.
>
> Note that while in some sense abstract characters exist a
> priori by virtue of the nature of the units of various writing
> systems, their exact nature is only pinned down at the point
> that an actual encoding is done. They are not always obvious,
> and many new abstract characters may arise as the result of
> particular textual processing needs that can be addressed by
> characters. (E.g. WORD JOINER, OBJECT REPLACEMENT CHARACTER,
> etc., etc.)
>
> Code point
>
> A number from 0..10FFFF; a "point" in the codespace 0..10FFFF.
>
> Encoded character
>
> An *association* of an abstract character with a code point.
>
> Unicode scalar value
>
> A number from 0..D7FF, E000..10FFFF; the domain of the
> functions which define UTF's. The Unicode scalar value
> definitionally excludes D800..DFFF, which are only code unit
> values used in UTF-16, and which are not code points associated
> with any well-formed UTF code unit sequences.
>
> Assignment (of code points)
>
> Refers to the process of associating abstract characters with
> code points. Mathematically a code point is
> "assigned to" an abstract character and an abstract
> character is "mapped to" a code point.
>
> This is distinguished from the vaguer sense of "assigned"
> in general parlance as meaning "a code point given some
> designated function by the standard", which would include
> noncharacters and surrogates.
>
> >
> > So far, so good. Some questions:
> >
> > - are all assigned Unicode characters also abstract characters?
>
> Yes. Or rather: all encoded characters are assigned to abstract
> characters.
>
> (See above for my distinction between "assigned" and
> "designated", which would apply to noncharacters and surrogate
> code points -- neither of which classes of code points get
> assigned to abstract characters.)
>
> >
> > - it seems that not all abstract characters have code points
> > (since abstract characters can be formed using combining
> > characters). Is that correct?
>
> Yes. (Note above -- abstract characters are also a concept which
> applies to other character encodings besides the Unicode Standard,
> and not all encoded characters in other character encodings
> automatically make it into the Unicode Standard, for various
> architectural reasons.)
>
> >
> > - do <U+00C5> (Å) and <U+0041, U+030A> (A followed by combining
> > ring above) represent the same abstract character?
>
> Yes. That is the implicit claim behind a specification of canonical
> equivalence.
>
> --Ken
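
The canonical-equivalence claim above can be checked with Python's standard
unicodedata module. This is only a sketch: the helper name is hypothetical,
and comparing NFD forms is one common way to test canonical equivalence, not
something prescribed in this thread.

```python
import unicodedata

precomposed = "\u00C5"   # U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
decomposed = "A\u030A"   # U+0041 followed by U+030A COMBINING RING ABOVE


def canonically_equivalent(a: str, b: str) -> bool:
    """Compare two strings after canonical decomposition (NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# The two code point sequences differ...
print(precomposed == decomposed)
# ...but they are canonically equivalent.
print(canonically_equivalent(precomposed, decomposed))
# NFC composes the pair into the single precomposed code point.
print(unicodedata.normalize("NFC", decomposed) == precomposed)
```

The first comparison is a plain code point comparison; the other two go
through normalization, which is where the equivalence shows up.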
>
> >
> > Would be good if someone could clear this up.
> >
> > --
> > Lars Marius Garshol, Ontopian <URL: http://www.ontopia.net >
> > ISO SC34/WG3, OASIS GeoLang TC <URL: http://www.garshol.priv.no >



This archive was generated by hypermail 2.1.2 : Mon Jul 22 2002 - 21:59:27 EDT