Re: Definitions (Was: Re: Charsets + encoding + codesets)

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Oct 09 1997 - 21:26:05 EDT


>
> Definitions (Was: Re: Charsets + encoding + codesets)
>
> In message <9710081510.AA08562@unicode.org> unicode@unicode.org writes:
> > Keld Jørn Simonsen wrote:
> >
> > > John Cowan writes:
> > > > There is, as far as I can tell, no single term used in the Unicode
> > > > Standard for what you are calling an "abstract character" above.
> > > > I would like there to be one, myself.
> > >
> > > 10646 has the term "composite sequence".
>
> John Cowan (http://www.ccil.org/~cowan cowan@ccil.org) wrote:
>
> > No, that won't work. We need a term for the underlying abstraction
> > that can be represented either by a single (concrete) character
> > or by a composite sequence. Ken Whistler has used the
> > term "grapheme" (starting today). This term, AFAIK, is
> > always collocated with "phoneme" and is used in discussions of
> > text-to-speech conversion, speech-to-text conversion, and
> > learning to read (which is a kind of text-to-speech conversion).
> > Still, terminological buccaneering may be useful.
>
> John Clews writes:
>
> Why not do, as ISO/IEC 10646 (and all ISO/IEC/JTC1/SC2 standards) do,
> or at worst imply, and use the term "character" for those elements
> which can be related to a single code point, and "graphic character"
> for that class of characters which can be related to more than one
> code point, using any permitted combining options?
>
> See the definitions section of ISO/IEC 10646 and/or other standards
> from ISO/IEC/JTC1/SC2 for further details, and/or my own email of a
> couple of days ago to the Unicode list.
>

I am afraid that the use of "graphic character" for the concept
we are trying to get at would lead to further misunderstandings.
It is not at all clear to me that the definitions in ISO/IEC 10646
(and other SC2 standards) use "graphic character" as John Clews has
just indicated.

From ISO/IEC 10646 definitions:

4.6 character: A member of a set of elements used for the organisation,
control, or representation of data. [kenw: and we have agreed that this
means the atomic unit of the repertoire of a coded character set,
and that the Chapter 3 usage of the Unicode term "abstract character"
also means this.]

4.19 graphic character: A character, other than a control function, that
has a visual representation normally handwritten, printed, or displayed.

Now, as part of a definition set, I can only read this definition of
"graphic character" as meaning that a "graphic character" is a
type of "character" as defined in 4.6, and in particular means that
portion of the repertoire that normally has a visual representation.
Nearly everybody I know has interpreted that to mean all the characters
in the standard that have a visible form splatted in one of the code
positions, as opposed to control functions encoded as characters.
This interpretation is strengthened by clauses such as 17.3 "Identification
of subsets of graphic characters", which can only mean subsets of the
repertoire of this standard, i.e. of the "characters" of the standard.

So I don't see any way to get "graphic character" to mean anything
representable by more than one code point, using permitted combining
options.

4.20 graphic symbol: The visual representation of a graphic character
or of a composite sequence.

This doesn't work, either, since that refers to the
form of the splat in the box, or of a sequence of characters (composite
sequence) represented as a single splat. By the way, if "graphic
character" meant what John Clews has suggested, then the definition
of "graphic symbol" wouldn't need the "...or of a composite sequence"
tacked on at the end.

Aside on "grapheme" and "phoneme": These terms arise out of the
American structuralist linguistic tradition. They were intended,
roughly, to refer to structurally significant units of orthographies
and of phonological systems, respectively. The terms come in
contrast pairs with:

      phone : phoneme
      graph : grapheme

Where "phone" refers to a phonetically minimal unit of the sound (of
human speech), considered independently of the phonological system
of the language. (There are lots more complications, but that's close
enough.) Likewise, "graph" is basically just a mark on paper (or
in stone, or whatever), a minimal unit of writing, considered
independently of the rules of the orthography. A "grapheme", on
the other hand, is a significant unit within an orthography.

To give an example from Unicode, the written form of an Arabic
letter BEH, considered systemically, could be considered a
grapheme, whereas each of the particular positional presentation
forms could be considered a "graph".
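
To make the graph/grapheme contrast concrete, here is a minimal sketch,
assuming Python and its standard unicodedata module (neither is part of
the discussion above). It treats the four positional presentation forms
of BEH as "graphs" and uses compatibility normalization (NFKC) as a rough
stand-in for "the grapheme they belong to". The helper name
fold_to_letter is hypothetical, and the mapping is only an approximation
of the linguistic notion.

```python
import unicodedata

# The letter considered systemically (the candidate "grapheme").
BEH = "\u0628"  # ARABIC LETTER BEH

# Positional presentation forms (the candidate "graphs"):
# isolated, final, initial, medial.
PRESENTATION_FORMS = ["\uFE8F", "\uFE90", "\uFE91", "\uFE92"]


def fold_to_letter(ch: str) -> str:
    """Hypothetical helper: map a presentation form to its base letter
    via compatibility normalization (NFKC)."""
    return unicodedata.normalize("NFKC", ch)


for form in PRESENTATION_FORMS:
    letter = fold_to_letter(form)
    print(f"U+{ord(form):04X} {unicodedata.name(form)}"
          f" -> U+{ord(letter):04X} {unicodedata.name(letter)}")
```

All four forms fold to the same letter, which is roughly the sense in
which several graphs can realize one grapheme.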

I concur with John Cowan that it was "terminological buccaneering"
to try to coerce the term "grapheme" to also serve for what we
are striving to express, which is actually more abstract than
the written forms of an orthography, though generally related to
orthographies.

So can somebody come up with a term for:

   Unit of text potentially representable either by an encoded character
   or by a composite sequence (= Unicode "combining character
   sequence").

Note that "potentially" is important here, because we are talking
about an open-ended class of "things" that exist prior to decisions
made by standards committees to encode them as "characters", or
to decide that they can be represented by sequences of existing
encoded combining characters. Much of the heat of the "precomposed
character" encoding wars results from the first step over the line,
asserting that one of these "things" *is* a "character"; once it
gains that ontological status in the mind of many standards
mavens, there is no choice but to *encode* it as a "character"--
since it *is* a "character", the composite sequence representation
ceases to exist as a viable option.
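
The "thing" in question can be illustrated with a small sketch, again
assuming Python's standard unicodedata module, which is not part of the
original discussion. It compares one unit of text that has both an
encoded (precomposed) character and a composite sequence with one that,
as far as the module's character data goes, has only the composite
sequence. The helper name same_unit is hypothetical, and canonical
equivalence is used here only as an approximation of "same unit of text".

```python
import unicodedata

# One unit of text, two representations.
precomposed = "\u00E9"   # LATIN SMALL LETTER E WITH ACUTE
composite = "e\u0301"    # "e" + COMBINING ACUTE ACCENT

# A unit for which no precomposed character has been encoded.
only_composite = "x\u0301"  # "x" + COMBINING ACUTE ACCENT


def same_unit(a: str, b: str) -> bool:
    """Hypothetical helper: treat two strings as the same unit of text
    if they are canonically equivalent (equal after NFD)."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)


# Different code point sequences, same underlying unit.
print(precomposed == composite)
print(same_unit(precomposed, composite))

# Composition (NFC) yields a single code point only where a committee
# chose to encode a precomposed character.
print(len(unicodedata.normalize("NFC", composite)))
print(len(unicodedata.normalize("NFC", only_composite)))
```

Whether the unit ends up as one code point or two depends entirely on
an encoding decision, not on anything intrinsic to the unit itself.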

Since this "thing" can't be called a "character" or an "abstract
character" without collapsing things that should be different,
perhaps in the interest of the rectification of names, we should
seek something completely different. (Note: it also isn't a "letter".)

splattee ? (thing which gets a splat in visual representation)

textion ? (arbitrary Greek suffix to suggest atomicity)

chark ? (character quark)

charlecue ? (implying possible structure, reminiscent of molecule
                --besides it has a certain zing,
                rhyming with BAR-B-Q or curlicue.)

-- Ken (The Logomachist) Whistler

"When 500,000 English words just won't do, invent another."


