From: Jim Allan (jallan@smrtytrek.com)
Date: Sat Nov 09 2002 - 17:49:00 EST
Patrick Rourke posted:
/So either there should only be one sigma, with the presentation being
determined by position (unless the font defines both positions as
lunate), or there should only be the medial and terminal and no lunate
"symbol," with lunate being defined only by the font - but then most
people entering Greek text would just use the medial form for all
sigmas, regardless of the position. Maybe text entry could correct
this . . ./
This would only work in /most/ cases in modern Greek, and less often in
historical documents.
Yannis Haralambous in “From Unicode to Typography, a Case Study: the
Greek Script” at http://omega.enstb.org/yannis/pdf/boston99.pdf, writes:
The letter sigma has a final form, written ς. Although this is a
contextual property, there is a Unicode character for this letter:
U+03C2; this is perfectly justified, because in some cases there is
a semantical difference between the medial and final form of σ: for
example, “φιλοσ.,” is necessarily the abbreviation of some word (like
φιλοσοφία) while “φιλος.” is a single non-abbreviated word, followed
by a sentence period. In cases like this the form of the σ cannot be
determined by a simple algorithm.
There is a typographical curiosum, involving the final sigma:
the /Grammar of Pontiac Language/ by K. Topkharas ([Top], reprinted
in [Top₂]), published in 1928, in the Soviet Union, for the
(Pontiac) Greek speaking minorities. This grammar completely
abolishes accents, breathings, diphthongs, and uses only part of the
alphabet. The ς is used for the sound ‘s’, and a double ςς for the
English ‘sh’. Here is an excerpt of this book [Top, p. 49]:
Σιν γλοςανεμυν επεμνεν ας αρχεον τιν γλοςαν κε το ακλιτον το
λεκςοπον α πυ μεταχιριςκυςανατο ι παλιεμυν, ονταν εθελναν να
φανερονε πος καπιον ιδιοτιταν πυ εςς εναν προςοπον για πραμαν,
λιφταςςκετιατο καπιον αλο λ.χ. δινατος κε αδινατος.
From "SIGMA" by Katerina Sarri at
http://users.otenet.gr/~bm-celusy/sigma.html:
By c.400 B.C.E. sigma took its final shape Σ at all greek
city-states. The *final <ς>* was a later calligraphic version, when
ending some words, and gradually, when ending all words. In old
manuscripts it may be marked also within composed words (as the
final letter of the first word) as in: ειςβάλλω = εισβάλλω <
εις+βάλλω ( /I go in, attack/). Also, the *'lunate sigma'* (as looks
the third letter of the latin alphabet) *C* was used instead of
Σ,σ,ς (in the byzantine manuscripts, and today as a calligraphic
variety, especially by the church).
One might indeed work with a smart-sigma text entry routine, like the
smart-quotes routines, but one would also want to be able to turn it off
or override it when necessary, as one can with smart-quotes routines.
That control should not rely on proprietary switches in a particular
font, which are not always accessible through every program and may
apply different algorithms in different fonts.
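A minimal sketch of such a smart-sigma entry routine with an off switch, in Python. The name smart_sigma and the enabled flag are illustrative assumptions, not any existing editor's API. The rule used here (σ becomes ς when it is not followed by a letter) is only the simple positional rule, and it misfires on exactly the abbreviation case Haralambous describes.

```python
import re

MEDIAL = "\u03c3"  # σ GREEK SMALL LETTER SIGMA
FINAL = "\u03c2"   # ς GREEK SMALL LETTER FINAL SIGMA

# A medial sigma that is NOT followed by a letter, i.e. one that sits at
# the end of a word or of the string. [^\W\d_] matches any Unicode letter.
_WORD_FINAL_SIGMA = re.compile(MEDIAL + r"(?![^\W\d_])")


def smart_sigma(text, enabled=True):
    """Replace word-final σ with ς; return text untouched if disabled."""
    if not enabled:
        return text
    return _WORD_FINAL_SIGMA.sub(FINAL, text)


if __name__ == "__main__":
    word = "φιλο" + MEDIAL
    print(smart_sigma(word))                        # word-final sigma is converted
    print(smart_sigma(word + "."))                  # abbreviation is converted too
    print(smart_sigma(word + ".", enabled=False))   # override keeps the typed form
```

With this rule an abbreviation typed as φιλοσ. comes out as φιλος., which is why the writer needs a way to switch the routine off or override it.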
An /intelligent/ font in which the above quotations could not be properly
produced because it has its own ideas where variants ought to appear or
does not have them is less useful than a /stupid/ font which puts out
what the writer produced. Unicode with three versions of lower-case
sigma is more useful than Unicode with a single version. Encoding only
one lower-case sigma would not reduce the complexity, only push it up to
differing and incompatible higher protocols.
When the character variants have distinct semantics or distribution that
cannot be predicted algorithmically and is not random, encoding these
variants at the Unicode plain text level is simple and robust and does
not prevent a higher protocol from treating the characters as identical
for particular purposes.
Patrick Rourke also posted:
/I just can't wait for all the search failures resulting from searching
for τις in a text with τιϲ/
Given the increased number of characters and variants allowed by
Unicode, the complexity of intelligent searching also increases.
Search engines should allow variant insensitivity and diacritic
insensitivity, as they now usually allow case insensitivity. Case
insensitivity is usually the default setting, and so should variant
insensitivity and perhaps diacritic insensitivity be.
This should be better supported than it is.
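A rough sketch, in Python, of what variant-insensitive and diacritic-insensitive matching could look like for the sigma case. The names fold_for_search and contains are hypothetical, and this is one possible folding, not how any particular search engine works. It relies on the standard unicodedata module and on str.casefold(), which already maps ς to σ and ß to ss. Ligatures such as æ and œ have no Unicode decomposition and would need an extra table that is not shown here.

```python
import unicodedata

# Sigma variants mapped onto plain medial sigma (applied after casefold()).
_SIGMA_TABLE = str.maketrans({
    "\u03c2": "\u03c3",  # ς final sigma         -> σ
    "\u03f2": "\u03c3",  # ϲ lunate sigma symbol -> σ
})


def fold_for_search(text, fold_diacritics=True):
    """Return a search key that ignores case, sigma variants and,
    optionally, accents and breathings."""
    text = text.casefold().translate(_SIGMA_TABLE)
    if fold_diacritics:
        # Decompose, then drop combining marks (accents, breathings).
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed
                       if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", text)


def contains(haystack, needle, fold_diacritics=True):
    """Substring test on the folded forms of both strings."""
    return (fold_for_search(needle, fold_diacritics)
            in fold_for_search(haystack, fold_diacritics))


if __name__ == "__main__":
    # Searching for τις (final sigma) in a text written with τιϲ (lunate).
    print(contains("\u03c4\u03b9\u03f2", "\u03c4\u03b9\u03c2"))  # True
```

Folding both the query and the text to the same key keeps the distinct characters in the stored text, so the distinction is still available to anyone who needs it.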
Even Google distinguishes, I think foolishly, between /caesar/ and
/cæsar/ and between /fluss/ and /fluß/, to give two examples.
But even given that a search engine recognizes such variants, one still
has to deal with spelling differences, e.g. /ecumenical/,
/oecumenical/, /œcumenical/, /eucumenical/.
From the specifications for the Pandora search engine at
http://etext.lib.virginia.edu/helpsheets/pandora.html:
Note that Pandora treats medial, final, and lunate sigma as the same
letter.
As Unicode becomes more widely used, search engines will adapt.
Jim Allan
This archive was generated by hypermail 2.1.5 : Sat Nov 09 2002 - 18:38:05 EST