Janusz S. Bień
jsbien at mimuw.edu.pl
Fri Sep 16 08:52:26 CDT 2016
On Thu, Sep 15 2016 at 21:56 CEST, jsbien at mimuw.edu.pl writes:
> 1. Graphemes, if I understand correctly, are language dependent, textels
> are not.
> 2. Textel "ń" means both U+0144 and <U+006E,U+0301>, so it is a notion
> on a higher abstraction level than a grapheme cluster.
In other words, textels are equivalence classes of some set of Unicode
character strings under an equivalence relation which at the moment is
open to discussion but is very close to the official Unicode
canonical equivalence (when working on a corpus of historical Polish we
noticed some cases where standard Unicode equivalence was not
On Thu, Sep 15 2016 at 21:27 CEST, leoboiko at namakajiri.net writes:
> Isn't the Swift "character" and the "textel" merely the same thing as
> what Unicode already named "grapheme clusters"?
As for the Swift "character", perhaps someone fluent in Swift will answer
> (Well, technically UAX
> #29 defines them as "user-perceived characters", but then says
> grapheme clusters approximate user-perceived characters
> And, indeed, Swift "Characters" are explicitly defined as "extended
> grapheme clusters" (also from UAX #29):
Thank you very much for the link. Let me quote the relevant fragment:
Extended Grapheme Clusters
Every instance of Swift’s Character type represents a single extended
grapheme cluster. An extended grapheme cluster is a sequence of one or
more Unicode scalars that (when combined) produce a single
Here’s an example. The letter é can be represented as the single Unicode
scalar é (LATIN SMALL LETTER E WITH ACUTE, or U+00E9). However, the same
letter can also be represented as a pair of scalars—a standard letter e
(LATIN SMALL LETTER E, or U+0065), followed by the COMBINING ACUTE
ACCENT scalar (U+0301). The COMBINING ACUTE ACCENT scalar is graphically
applied to the scalar that precedes it, turning an e into an é when it
is rendered by a Unicode-aware text-rendering system.
In both cases, the letter é is represented as a single Swift Character
value that represents an extended grapheme cluster. In the first case,
the cluster contains a single scalar; in the second case, it is a
cluster of two scalars:
*Two String values (or two Character values) are considered equal if
their extended grapheme clusters are canonically equivalent.*
To me this means that Swift's characters are equivalence classes of the
set of extended grapheme clusters under the canonical equivalence relation.
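
The idea of a character as an equivalence class can be sketched in a few
lines. The example is in Python rather than Swift and uses only the standard
unicodedata module. The helper names canonical_key and canonically_equal are
hypothetical, and NFD normalization is used merely as one convenient way to
pick a representative of each class; it is not a claim about how Swift
implements its comparison.

```python
import unicodedata


def canonical_key(s: str) -> str:
    """Return the NFD form of s, used here as the representative of its
    canonical-equivalence class (hypothetical helper)."""
    return unicodedata.normalize("NFD", s)


def canonically_equal(a: str, b: str) -> bool:
    """True if a and b are canonically equivalent code point sequences."""
    return canonical_key(a) == canonical_key(b)


precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # LATIN SMALL LETTER E + COMBINING ACUTE ACCENT

# Plain Python string comparison works on code points, so these differ ...
assert precomposed != decomposed
# ... but both sequences fall into the same equivalence class.
assert canonically_equal(precomposed, decomposed)

# The same holds for the two spellings of "ń" mentioned earlier.
assert canonically_equal("\u0144", "n\u0301")
```

In this sketch the grapheme cluster is the concrete code point sequence,
while the class shared by both sequences is the higher-level notion under
discussion.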
> Such a notion is indeed needed, but it has been always there.
>  http://unicode.org/reports/tr29/
I don't see a notion of such equivalence classes there.
On Thu, Sep 15 2016 at 16:36 CEST, john.w.kennedy at gmail.com writes:
> In the new Swift programming language, which is white-hot in the Apple
> community, Apple is moving toward a model of a transparent, generic
> Unicode that can be “viewed” as UTF-8, UTF-16, or UTF-32 if necessary,
> but in which a “character” contains however many code points it needs
> (“e” with a stacked macron, acute accent, and dieresis is
> algorithmically one “character” in Swift). Moreover,
> e-with-an-acute-accent and e followed by a combining acute accent, for
> example, compare as equal. At present, the underlying code is still
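
The different "views" mentioned in the quoted message can be illustrated
with a small Python sketch (not Swift code). The variable names are
hypothetical and the counts apply to this particular sequence only. Python's
standard library does not segment text into grapheme clusters, so the sketch
shows only the code point and code unit views of what Swift would treat as
one "character".

```python
# "e" followed by three combining marks: macron, acute accent, diaeresis.
s = "e\u0304\u0301\u0308"

code_points = len(s)                            # Python counts code points
utf8_units = len(s.encode("utf-8"))             # 8-bit code units
utf16_units = len(s.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(s.encode("utf-32-le")) // 4   # 32-bit code units

print(code_points, utf8_units, utf16_units, utf32_units)  # prints: 4 7 4 4
```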
If you insist that Swift's "characters" are just grapheme clusters, then
you add a different, although related, meaning to the term "grapheme
cluster". I think the notion deserves a term of its own.
Prof. dr hab. Janusz S. Bien - Uniwersytet Warszawski (Katedra Lingwistyki Formalnej)
Prof. Janusz S. Bien - University of Warsaw (Formal Linguistics Department)
jsbien at uw.edu.pl, jsbien at mimuw.edu.pl, http://fleksem.klf.uw.edu.pl/~jsbien/