Re: Definition of character from Jukka K. Korpela on 2011-07-13 (Unicode Mail List Archive)

From: Jukka K. Korpela <jkorpela_at_cs.tut.fi>
Date: Wed, 13 Jul 2011 10:45:17 +0300

12.07.2011 19:57, Asmus Freytag wrote:

> Jukka,
>
> reminding everyone of the definition of "technical term" as opposed to a
> word in everyday language isn't helping address the underlying issue.
> Everyone is familiar with this distinction.

I’m afraid the distinction is not widely enough known, and even people
who know it often fail to apply it. Most people that I know use the word
“term” or its equivalent to denote any word, though usually with an
overtone that suggests some “tech stuff.”

> The truism goes like this: "A character is what character encodings encode".

That’s not a very exact formulation, but a good start. Unicode has the
concept of a code point, and a classification of code points so that
some of them are classified as characters, or denoting characters. The
concept of “character” in that sense is essential, the most important
“character” concept in Unicode. So in good terminology, a single term is
assigned to that concept, and the term consistently means just that
concept and nothing else.
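
To make the classification of code points concrete, here is a rough sketch in Python. The function name is_unicode_character is made up for this illustration, and the rule it applies (every code point counts except surrogates and unassigned or noncharacter code points) is an assumption, not a definition quoted from the standard. The answers depend on the Unicode version built into the Python in use.

```python
import unicodedata

def is_unicode_character(cp: int) -> bool:
    """Hypothetical helper: True if the code point is classified as a character.

    Assumption for illustration: General Category Cs (surrogate) and
    Cn (unassigned or noncharacter) are excluded; everything else,
    including controls (Cc) and private use (Co), is counted.
    """
    return unicodedata.category(chr(cp)) not in ("Cs", "Cn")

for cp in (0x0041, 0x0009, 0xE000, 0xD800, 0xFFFF):
    print(f"U+{cp:04X}", unicodedata.category(chr(cp)), is_unicode_character(cp))
```

Note that the test works on code points only; it says nothing about whether the code point corresponds to a character in the everyday sense.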

The trouble starts with the observation that the concept does not fully
correspond to the age-old concept of “character,” which predates Unicode
and computers by thousands of years. Such problems are not rare in the
modern world. You can solve them either by using a common word as a
technical term, as long as you continuously keep it clear whether you
are using it as a common word or as a term, or by coining a new word, or
sequence of words, or abbreviation.

The Unicode Standard mostly uses “character” as the technical term, but
it makes frequent use of “character” as a common word, too, though
usually prefixing it with the adjective “abstract” (as if Unicode
characters weren’t abstract!).

> Historically, character encodings have also encoded, on otherwise equal
> footing, units that are intended for device control. Over time, some of
> the device control characters have been redefined as indicators of
> logical division of text. (TAB and LF are the most prominent examples of
> this evolution).

Besides, space “characters” might not be seen as characters in the
common-language sense. They are somewhere between “graphic characters”
and “control characters.”
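
The in-between status of spaces shows up in the General Category property. The following is a small sketch in Python; the dictionary names are made up for the illustration, and only the standard unicodedata module is used.

```python
import unicodedata

# A few code points: two spaces, two controls, one letter.
samples = {
    "SPACE": " ",
    "NO-BREAK SPACE": "\u00a0",
    "TAB": "\t",
    "LINE FEED": "\n",
    "LATIN SMALL LETTER A": "a",
}

# Map each label to its General Category (Zs = space separator,
# Cc = control, Ll = lowercase letter).
categories = {label: unicodedata.category(ch) for label, ch in samples.items()}

for label, cat in categories.items():
    print(f"{label}: {cat}")
```

The spaces get a category of their own (Zs), separate from both the letters and the controls, while TAB and LF remain classified as controls (Cc) regardless of how they are used in text.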

This is part of the complexity of the correspondence between Unicode
characters and text characters (i.e., characters in the old everyday sense):
1) some Unicode characters are not text characters but e.g. formatting
controls
2) some text characters cannot be represented as Unicode characters at
all except as Private Use characters
3) some text characters need to be represented as a sequence of two or
more Unicode characters (or as Private Use characters)
4) many text characters have alternative representations as Unicode
characters
5) it is often not self-evident at all how a character used in text
could or should be represented using Unicode characters, and many notes
in the Unicode Standard are meant to clarify such things.
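
Items 1, 3 and 4 of the list above can be illustrated with a short Python sketch. The variable names and the helper nfc are made up for the illustration, and the choice of q with a tilde as an example for item 3 rests on the assumption that no precomposed code point exists for it.

```python
import unicodedata

def nfc(s: str) -> str:
    """Hypothetical shorthand for Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# 1) A Unicode character that is not a text character:
#    LEFT-TO-RIGHT MARK is a formatting control (General Category Cf).
lrm_category = unicodedata.category("\u200e")

# 3) A text character that needs a sequence of Unicode characters:
#    q with tilde stays two code points even after composition
#    (assumption: there is no precomposed form for it).
q_tilde = "q\u0303"
q_tilde_length = len(nfc(q_tilde))

# 4) A text character with alternative representations:
#    e with acute as one code point or as letter + combining accent.
e_precomposed = "\u00e9"
e_decomposed = "e\u0301"
differ_as_strings = e_precomposed != e_decomposed
equal_after_nfc = nfc(e_precomposed) == nfc(e_decomposed)

print(lrm_category, q_tilde_length, differ_as_strings, equal_after_nfc)
```

Items 2 and 5 do not lend themselves to such a demonstration, since they concern what is missing from the repertoire and what needs human judgment.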

> These historical developments have left us with this and other examples
> of deep ambiguities in the definition of the members of those sets we
> call "character encodings".

Ambiguities may exist, but this is basically a matter of distinguishing
two concepts from each other.

> Let's look at the putative benefit of a better definition. I think such
> a benefit has implicitly been claimed to exist, but I would ask for a
> demonstration in this case.

For one thing, defining “Unicode character” as a technical term and
using it consistently makes it possible to formulate clearly its
relation to “character” in the common meaning, thereby helping people to
understand and use Unicode better.

> One possible benefit of a solid definition of the members of a set is in
> helping decide which additional entities should be made members of the
> set.

That’s a completely different issue. The purpose of definitions and
consistent use of terms is not to set guidelines for decisions. It must
be possible to say that a particular text character is not a Unicode
character without implying (as a naturalistic fallacy of a kind) that it
should be.

The entire “definition” of the word “character” in the Unicode Glossary
is highly confusing, and so is “abstract character.” They would perhaps
best be replaced by the following:

Unicode character. A Unicode code point classified as a character
code point. It may represent a text character, a component of a text
character (such as an accent symbol), or a control code for text formatting.

Text character. An element of writing recognized as a basic unit of
text, such as a letter, digit, punctuation mark, currency symbol, a
syllable symbol in syllabic writing, or an ideograph. This is a
non-technical definition, and there are differences in how people
mentally divide text into text characters or recognize different graphic
symbols as forms of a text character or as separate text characters. A
text character is usually representable as a Unicode character or as a
sequence of Unicode characters.

Character. A Unicode character or a text character. Normally the context
makes it clear which one is meant. In the Unicode Standard, “character”
normally means “Unicode character.”

(I’m sure this would need clarifications and tuning. I presented it
mainly to illustrate that clarity is possible.)

-- 
Yucca, http://www.cs.tut.fi/~jkorpela/

Received on Wed Jul 13 2011 - 02:51:21 CDT
