From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Dec 16 2003 - 11:06:13 EST
There was talk recently on this list of mapping grapheme clusters to the
PUA (for application internal use only, obviously, not for export to the
real world). I actually did this recently, though my app ended up in an
incomplete state since I got bored and moved on to something else. The
algorithm worked though, so I present it here and place it in the public
domain, licence free, for anyone to use who wants to do so. Such an
encoded string I called a "grapheme string", or "gstring" for short. Of
course, that was before "grapheme" was renamed as "default grapheme
cluster", so the name doesn't work quite as well now.
The range of characters I reserved for my private use actually
consisted of the surrogate codepoints, not the PUA codepoints. I
reasoned that the PUA area might actually be being used for something
(else), but the surrogate codepoints were illegal in decoded text and
therefore available. Despite the fact that the number of possible
graphemes is infinite, I never actually ran out of codepoints.
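To put a number on how much room each choice gives, here is a small Python sketch. The range boundaries are the standard Unicode ones (surrogates U+D800..U+DFFF, BMP Private Use Area U+E000..U+F8FF); the names `SURROGATES`, `BMP_PUA` and `is_reserved` are made up for illustration and are not from my app.

```python
# Standard Unicode ranges; the names are illustrative only.
SURROGATES = range(0xD800, 0xE000)   # U+D800..U+DFFF
BMP_PUA = range(0xE000, 0xF900)      # U+E000..U+F8FF

def is_reserved(word, reserved=SURROGATES):
    """True if a 16-bit word falls in the range set aside for graphemes."""
    return word in reserved

print(len(SURROGATES))  # 2048 distinct graphemes fit in the surrogate range
print(len(BMP_PUA))     # 6400 would fit in the BMP PUA
```

So reserving the surrogates allows 2048 distinct multi-codepoint graphemes per process before the table is full.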
Here's the algorithm in pseudo-code:
// The following are static and global
max_word                // a 16-bit unsigned integer, initially the lowest
                        // codepoint you reserve (e.g. the start of the PUA)
                        // minus one
map_grapheme_to_word[]  // a mapping from grapheme (= array of codepoints)
                        // to 16-bit word, initially empty
map_word_to_grapheme[]  // a mapping from 16-bit word to grapheme,
                        // initially empty
// Convert unicode text to internal representation with one 16-bit word
// per grapheme
// -- input (text_unicode) is an array of codepoints (i.e. it has already
//    been decoded from UTF-whatever)
// -- output (text_internal) is an array of 16-bit words, each
//    representing one grapheme. THIS STRING MAY NEVER BE EXPORTED.
text_internal = "";
for (each grapheme in text_unicode) // each grapheme is a substring of
                                    // one or more codepoints
{
    grapheme = convert_to_NFC(grapheme);
    if (num_codepoints(grapheme) == 1 && codepoint_of(grapheme) < 0x10000)
    {
        text_internal += codepoint_of(grapheme);
    }
    else
    {
        // "still in range" means max_word is below the highest reserved
        // codepoint, so ++max_word stays inside the reserved range
        if (!exists(map_grapheme_to_word[grapheme]) && max_word still in range)
        {
            map_grapheme_to_word[grapheme] = ++max_word;
            map_word_to_grapheme[max_word] = grapheme;
        }
        if (exists(map_grapheme_to_word[grapheme]))
        {
            text_internal += map_grapheme_to_word[grapheme];
        }
        else
        {
            // Whoa!! Ran out of reserved characters!
            // Could add error handling here.
            text_internal += U+FFFD;
        }
    }
}
return text_internal;
// The converse process
text_unicode = "";
for (each word in text_internal)
{
    if (word in correct range) // e.g. PUA but doesn't have to be
    {
        if (exists(map_word_to_grapheme[word]))
        {
            text_unicode += map_word_to_grapheme[word];
        }
        else
        {
            // error - should never happen
            text_unicode += U+FFFD;
        }
    }
    else
    {
        text_unicode += word;
    }
}
return text_unicode;
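For anyone who would rather read something runnable, here is a rough Python sketch of the same idea. It is not the code from my app. The class name `GStringCodec` and the helper `simple_clusters` are invented for this example, and `simple_clusters` is only an approximation: it attaches combining marks (nonzero canonical combining class) to the preceding character, which is much cruder than real default grapheme cluster segmentation. It also assumes the input is well-formed text with no lone surrogates in it.

```python
import unicodedata

RESERVED_FIRST = 0xD800   # surrogate range, as described in the post
RESERVED_LAST = 0xDFFF
REPLACEMENT = 0xFFFD

def simple_clusters(text):
    """Approximate segmentation (an assumption, not UAX #29): a cluster is
    one character plus any combining marks that follow it."""
    cluster = ""
    for ch in text:
        if cluster and unicodedata.combining(ch) == 0:
            yield cluster
            cluster = ""
        cluster += ch
    if cluster:
        yield cluster

class GStringCodec:
    """Hypothetical holder for the two global maps and the counter."""

    def __init__(self):
        self.next_word = RESERVED_FIRST
        self.grapheme_to_word = {}
        self.word_to_grapheme = {}

    def encode(self, text):
        """Return a list of 16-bit words, one per grapheme."""
        words = []
        for cluster in simple_clusters(text):
            cluster = unicodedata.normalize("NFC", cluster)
            if len(cluster) == 1 and ord(cluster) < 0x10000:
                words.append(ord(cluster))
                continue
            word = self.grapheme_to_word.get(cluster)
            if word is None:
                if self.next_word > RESERVED_LAST:
                    words.append(REPLACEMENT)   # ran out of reserved words
                    continue
                word = self.next_word
                self.next_word += 1
                self.grapheme_to_word[cluster] = word
                self.word_to_grapheme[word] = cluster
            words.append(word)
        return words

    def decode(self, words):
        """The converse process: expand reserved words back to graphemes."""
        parts = []
        for word in words:
            if RESERVED_FIRST <= word <= RESERVED_LAST:
                parts.append(self.word_to_grapheme.get(word, "\ufffd"))
            else:
                parts.append(chr(word))
        return "".join(parts)

codec = GStringCodec()
# 'a', 'e' + acute (composes to U+00E9), 'x' + acute (no precomposed form),
# and U+1F600 (a single codepoint outside the BMP)
words = codec.encode("ae\u0301x\u0301\U0001F600")
print([hex(w) for w in words])  # ['0x61', '0xe9', '0xd800', '0xd801']
```

Note that decoding gives back the NFC form of the input, not necessarily the original codepoints, and that the maps have to live as long as any gstring that depends on them.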
Jill