Re: Romanized Singhala - Think about it again from Jean-François Colson on 2012-07-17 (Unicode Mail List Archive)

From: Jean-François Colson <jf_at_colson.eu>
Date: Tue, 17 Jul 2012 20:11:50 +0200

On 17/07/12 02:43, Naena Guru wrote:
> Jean, sorry I am late. I used spare time as and when I got it.
>
> On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson <jf_at_colson.eu>
wrote:
>
> On 09/07/12 01:29, Naena Guru wrote:
>> Jean-François,
>>
>> Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just
trying...
> I don’t know how that transcription should be pronounced but in
IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/.
>
> They came as rectangles (in XP).
That’s not surprising. Windows XP is an old, out-of-date system with, by
default, a very limited set of fonts. But nothing prevents you from
downloading some additional free fonts.

> They showed correctly in your message inside Firefox running in Puppy
Linux, but where I am replying, it shows a reversed Euro-like character
That is surprising. Which font includes a reversed euro sign?

> in place of the a-umlaut.
I didn’t use any umlauts but two tildes.

> This again illustrates how hazardous it is for characters outside
Latin-1.
It illustrates how hazardous it is to use such an old OS as Windows XP.

>
> I can only approximate the first letter as English j+y,
English j + y????? I don’t know that sound, either in English or in French.

/ʒ/ is the French j. It is not the English j plus something but rather
the English j minus something.
The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which
begins as a plosive and evolves to end as a fricative.
The French j is a fricative. Its nearest approximation in English is the
z in azure or the s in leisure.

/ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced as
the a in the English word “car” but with a nasal quality, i.e. some air
passes through the nose.

Jean is a homophone of gens, which you can hear here:
http://fr.wiktionary.org/wiki/gens#Prononciation
The speaker has recorded “des gens”, so focus your attention on the
second syllable.

/f/ is pronounced as in English.

/ʁ/ is the French r, but there are several varieties of r among the
French dialects, so using the English r instead is not a big problem.

/s/ and /w/ are pronounced as in English.

/a/ is very similar to /ɑ/. It is the beginning of the diphthong in the
English word “sky”.

> which is same on Singhala. The rest is pretty close, I think.
>
>
>
>>
>> Thank you for your interest. See inline responses.
>>
>> On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson
<jf_at_colson.eu> wrote:
>>
>> On 05/07/12 10:02, Naena Guru wrote:
>>>
>>>
>>> On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy
<verdy_p_at_wanadoo.fr> wrote:
>>>
>>> Anyway, consider the solutions already proposed in
Sinhalese
>>> Wikipedia. There are various solutions proposed,
including several
>>> input methods supported there. But the purpose of these
solutions is
>>> always to generate Sinhalese texts perfectly encoded
with Unicode and
>>> nothing else.
>>>
>>> Thank you for the kind suggestion. The problem is Unicode
Sinhala does not perfectly support Singhala!
>> What’s wrong? Are there missing letters?
>>
>> Many, many.
>>
>>> The solution is for Sinhala not for Unicode!
>> Or rather for Sinhala by Unicode.
>>
>> Sure, if you want to do it with proper deliberation.
>>
>>
>>
>>> I am not saying Unicode has a bad intention but an
ill-conceived product.
>> What precisely is ill-conceived?
>>
>> Anglo-centric thinking is what is wrong.
> ?????
>
> Letters have no direct relation to speech -- very few. In Singhala
(perhaps as in French too as someone said?) you write what you say. In
Singhala, the exception is clearly understood rule set about how to
pronounce short 'a' -- whether muted or not.
>
> Therefore, the approach should have been to encode the vowels,
diphthongs and consonants as base letters. I assigned the acute accent
to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they
could be assigned independent codepoints.
>
>
>
>> Let me take you on the scenic route:
>>
>> Number of letters in Singhala is only theoretical. In the case
of Singhala orthography, the actually used number depends on the
Sanskrit vocabulary.
> Do you mean there are many conjunct consonants, sometimes with a
separate glyph?
>
> Yes, many. There are three orthographies. Singhala does not have CCs
at all. Sanskrit has a lot. Pali has touch letters in addition to what
Sanskrit has. Modern Singhala is mixed Singhala and Sanskrit. With
Unicode Sinhala, you need to know which ones join and provide the ZWJ
and hope and pray that the font has the CC. Often they are absent.
So I guess your problem could be solved by providing new fonts with
better support for conjunct consonants. What you did for 8-bit Sinhala,
you could do for Unicode Sinhala too.

> Then SLS1134 gives wrong advice too.
Could you explain in detail what is wrong in their advice?

>
> In Devanagari, they’re made by typing two or more consonants
separated by halants. Isn’t that possible with Sinhala?
>
> Yes. That is possible, which means the fonts need lookup tables that
provide the CCs in the Private Use Area of the font.
Isn’t that what you did in your 8-bit Sinhala font?
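
To make the halant-plus-joiner mechanism concrete, here is a minimal sketch. It assumes the usual Unicode Sinhala sequence of consonant, al-lakuna (U+0DCA), zero-width joiner (U+200D), consonant; the example string and the helper name describe are my own and are not taken from any font or from SLS 1134.

```python
import unicodedata

# Hypothetical example: TA + al-lakuna + ZWJ + RA, a sequence that asks
# the font for a conjunct form of the two consonants.
conjunct = "\u0dad\u0dca\u200d\u0dbb"

def describe(text):
    """Return (code point, character name) pairs for each character."""
    return [(f"U+{ord(ch):04X}", unicodedata.name(ch)) for ch in text]

for code, name in describe(conjunct):
    print(code, name)

# Dropping the joiner leaves the same consonants as three code points.
plain = conjunct.replace("\u200d", "")
print(len(conjunct), len(plain))
```

The text itself stays a plain sequence of code points; whether a conjunct is actually drawn depends on whether the font provides a glyph for that sequence.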

>
> BTW, 'halant' means the last consonant in a word. 'hal' means
consonant. What Unicode calls halant (and virama) is the sign that
indicates a letter is a (free-standing) consonant. virama means the
marker of a consonant at end of a sentence. (Sigh, cry)
> In Singhala, we call the sign 'hal kiriima' = the sign which makes it
a hal and is not location specific.
>
>
>
>> My test font has about 1500.
> IIRC there are only 191 code-points in Latin-1. Or 96? I’m not
sure whether ASCII is part of Latin-1.
>
> These are the two that I use to look up letters of the Latin-1
repertoire:
> http://www.unicode.org/charts/PDF/U0000.pdf
> http://www.unicode.org/charts/PDF/U0080.pdf
But there are not “about 1500” code points there.
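
The 191-or-96 question can be checked with a short sketch. It assumes that "graphic character" means anything that is not a C0 or C1 control character, and it relies on Python's latin-1 codec mapping each byte to the code point of the same value.

```python
import unicodedata

# Decode every byte value as ISO-8859-1; each byte maps to exactly one
# code point in the range U+0000..U+00FF.
chars = bytes(range(256)).decode("latin-1")

# Graphic characters: everything that is not a control character (Cc).
graphic = [ch for ch in chars if unicodedata.category(ch) != "Cc"]
ascii_graphic = [ch for ch in graphic if ord(ch) < 0x80]
upper_graphic = [ch for ch in graphic if ord(ch) >= 0x80]

print(len(graphic), len(ascii_graphic), len(upper_graphic))  # 191 95 96
```

So both numbers are right in their own way: 96 graphic characters in the upper half, 191 once the printable ASCII range is included.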

>
>
>
>> I need much more. Pali orthography enforces touch-letter rule
(see later). Modern Singhala, meaning 1st century onwards, is an
admixture of Singhala and Sanskrit. Pali is not mixed into Singhala text
(except to quote like a foreign language).
>>
>> Generally, the Unicode approach is to treat the consonants as
base shapes. Then the vowel signs are added around them. The vowel signs
have their own codepoints. We hit upon the first problem here because
there are two possible codepoints for each single-mora vowel,
double-mora vowel and each diphthong. We lose all hopes of traditional
search, replace etc. It complicates collation too.
> In my first name, I usually type the ç as U+00E7 because I have a
Ç key on my keyboard, but using the two code-points U+0063 U+0327 (c +
combining cedilla) would be correct too, and in many programs the search
function won’t find one form of ç if I’m looking for the other.
>
> You are right!
Thanks

> Notepad does not find ç. It is certainly a bug. I tried Geany text
editor and Abiword in Linux. They both are okay. I think Notepad forgets
those letters in the Latin 1 Supplement. This is not a big deal.
It is no smaller a deal than improving the Sinhalese fonts if they lack
some conjuncts.
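
The search problem with ç can be reproduced in a few lines. This is only a sketch of one common remedy, normalizing both strings to NFC before comparing; it says nothing about how Notepad or any other program searches internally, and the helper name nfc is my own.

```python
import unicodedata

precomposed = "Fran\u00e7ois"   # ç as the single code point U+00E7
decomposed = "Franc\u0327ois"   # c (U+0063) followed by combining cedilla (U+0327)

# A naive code point comparison treats the two spellings as different.
print(precomposed == decomposed)             # False
print(precomposed in "Jean-" + decomposed)   # False

def nfc(text):
    """Normalize to the composed form (NFC) before comparing."""
    return unicodedata.normalize("NFC", text)

print(nfc(precomposed) == nfc(decomposed))   # True
print("\u00e7" in nfc("Jean-" + decomposed)) # True
```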

> I remember that VOLT guys took awhile to realize that Open Type
'simple' script includes those.
>
>
>
>>
>> Then they went up the gum tree of the notion that the Singhala
consonant is actually a consonant with a vowel inside it -- an absurdity
-- the Abugida theory. They then added two ligatures without normalizing
them. Singhala has 15 ligatures in that category of ligatures. They
included upadhmAnIya and left out jihvAmUlIya.
> Can’t those ligatures be typed as separate signs with a zero-width
joiner? I’ve seen there’s one on Sinhalese keyboards.
>
> Yes, you can but only to get a few.
Could you describe in detail the missing conjuncts, which letters
should generate them and what they should look like?
Also, it would be very useful to send those descriptions to the
designers of Sinhalese fonts.

> It hurts the orthography, Jean. People forget orthography when it is
in disuse. It is better to make the CCs and let the user prevent their
formation by means of ZWNJ where they do not want the joining.
Oh! I see. It would be possible to make input methods which work as you
want and generate Sinhalese characters as they are used today.

It is not possible to change the behaviour of Sinhalese characters: that
would mangle all existing texts.

> I implemented this in my font. It is easier for input thanh
>
>
>
>>
>> That concludes that Unicode Sinhala is not grammar compliant.
> What do you mean?
>
> That the rules of orthography are ignored. e.g., just two conjuncts
(conjoints) out of a group of 15
15 or about 1500?

> in their class inside the standard and others missing.
You’re not alone. In the South of Belgium and the North of France, there
are local languages which, in some orthographies, use the ring above the
vowels A and E. Å and å are available as precomposed characters. E̊ and e̊
are not and must be composed of E/e + U+030A (combining ring above).
That’s not a big problem.
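
A quick check of the ring-above case, as a sketch using Python's unicodedata module (the helper name nfc_len is my own): after NFC normalization, A and a with the combining ring collapse to one code point, while E and e keep two because no precomposed form exists.

```python
import unicodedata

def nfc_len(base):
    """Length in code points of base + COMBINING RING ABOVE after NFC."""
    return len(unicodedata.normalize("NFC", base + "\u030a"))

for base in "AaEe":
    print(base, nfc_len(base))
```

Both forms are valid text; the two-code-point letters simply rely on the font and the rendering engine to place the ring.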

I think the good question is “Why were those two CCs included as
precomposed characters?”, not “Why were the other ones forgotten?”.

> There is no consistency in this code block.
Just forget the useless characters if they are a mistake of the past.

> This is why I say, just encode the vowels and consonants. You comply
with the grammar only if you start with the phoneme set. That is
what HK Sanskrit, IAST Sanskrit and PTS Pali did. They are backward
compatible with older text due to this.
>
>
>
>> This is the first requirement. It is not Unicode compliant
because it has canonicals that are not normalized.
> Can’t you describe in detail those missing “canonicals” to make
a proposal?
>
> I can, but that is the wrong approach.
Why is it wrong?

>
> Unicode correctly advised not to introduce canonicals trivially.
What do you mean?

> Then they gave a deadline to register them. The whole idea is to
differentiate the text of one language from the other with these
national ligatures. Those languages that had Latin canonical forms are
justified in declaring codepoints for them. Whereas, Singhala
artificially made two 'canonicals'.
Do you mean there’s a historic way to write Sinhala and a second way,
introduced by the persons who submitted the Sinhalese Unicode proposal
years ago?
Is that what you call “canonicals”?

> Now they have to be normalized down to base codepoints.
>
> Besides, no change can ever happen because of the dog-in-the-manger
effect.
I don’t understand. What is that strange effect?

> If you insist, the alternative is to define Singhala in a new code
block without canonicals
Without canonicals????? Could you reformulate that please? I don’t
understand. Which meaning of canonical
(http://dictionary.reference.com/browse/canonical) are you using here?

> and the codepoints are purely for vowels, diphthongs and consonants.
Next you make fonts to whatever depth of complexity that you want.
OpenType allows this.
I’m not a font designer, but can’t that be done with the present
Sinhalese encoding?

> Then those who care about orthography can make font templates for
traditional orthographies. The user simply types as they speak and the
font responds with the combinations intended by the font author.
Without changing anything, it would be possible to type as you speak with
a well-designed input method.

> I tested it for nine years with various kinds of users. It is in my
(proof-of-concept) font.
Is it possible to download your font?

> There are few instances where (pure) Singhala words that have two
consonants together, such as, kan|ða කන්ද and aþ|þa අත්ත. There are also
names which are like hyphenated ones that have three vowels together
(e.g. සුමිත්‍ර‌ආරච්චි -> sumiþra‌|aaracci.). In those instances, the user can
intervene ZWNJ. The Bar represents ZWNJ in the RS examples. Users know
obvious Singhala words and names.
>
>> It has a jumble of Singhala letters and signs and duplicates the
same phoneme.
>
> Duplicates the same phoneme…
> Do you mean there are characters which are encoded twice? In this
case, you can ignore one of them.
> Do you mean vowels and diphthongs can be written both as separate
letters and as diacritics after a consonant? In this case, that works
like in Indic scripts. I don’t see anything wrong in that.
>
> What is wrong in that is in the struggle you would have to go through
for text processing like search and replace and sorting.
?????
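
For reference, the "two codepoints per vowel" point can be illustrated as follows. This is a sketch that uses the long aa vowel as an example of my own choosing: the Sinhala block has an independent letter (U+0D86) and a dependent sign (U+0DCF) that attaches to a consonant, as in other Indic scripts.

```python
import unicodedata

independent_aa = "\u0d86"     # vowel standing on its own
ka_with_aa = "\u0d9a\u0dcf"   # consonant KA followed by the dependent sign

def categories(text):
    """Unicode general category of each character in text."""
    return [unicodedata.category(ch) for ch in text]

print(categories(independent_aa))  # ['Lo']
print(categories(ka_with_aa))      # ['Lo', 'Mc']
```

A search for the aa sound regardless of position would therefore have to look for both code points, which is presumably the kind of extra work being complained about.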

> It is very straight forward in RS:
> See the sample names getting sorted according to English and Singhala
here:
> http://www.lovatasinhala.com/liyanna.php#sort
> Meanings of the buttons left to right:
> [erase] [show sample] [sort-Singhala] [sort-English]
>
>
>
>> It is good only for the trash can.
> Isn’t that an exaggeration?
>
> Kind of.
> Jean, it can only be redefined. First, if you tinker it, you orphan
the present text made with the flawed one. If you define it elsewhere,
then you can write a routine to convert the flawed one to the new one.
That is what I did in the case of romanized Singhala.
>
>
>
>>
>> Here are the considerations for a successful encoding:
>> A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
>> http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche
>>
>> In the pre-printing tradition, adjoining consonants either had
standard ligatures or they were written touching each other to indicate
they are digraphs or trigraphs. The vowel signs surround these. When a
consonant occurred at the end of a word, that was flagged by the halant
sign. 'halant' means hal at end.
>>
>> With the advent of letterpress printing, touch letter technique
became difficult or impossible to implement. So, now we have a new
concept of 'antara' hal -- interior consonant. The modern orthography
first honors the Sanskrit ligature rules and then drops the touch-letter
rule. The ligatures are described in the following books:
>> A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese
Language (1891) - pp 16-18, Rev Theodore G. Perera - The Sinhala
Language (1932) -- 57 - 58.
>
OK. I’ll buy them, but not immediately: I’m currently unemployed and my
budget is very limited.

>>
>> There is only one way that Singhala could be digitized to do
justice to the continuation of its writing system and to smoothly
support the past. That is to define the vowels, diphthongs, consonants
and prenasals as individual codepoints. You can give a codepoint each
for the anusvara (ng sound) and visarga (the guttural postfix on vowels)
> That’s what has been done in Unicode.
>
> No, Jean, no. What it has as 'consonants' are not consonants.
What are they?

> It has two codepoints for each vowel...
Where is the problem?

> (Please don't make me repeat)
Sorry

>
>
>> as the earlier Sanskrit transliterations did or you could
provide one codepoint each for the modified vowels.
> That’s the Unicode dependent vowel signs.
>
> What I meant by modified vowels is the set of letters you get when
you add the anusvara and visarga, five in each.
>
>
>
>> I did this latter in romanized Singhala.
> What did you exactly do?
>
> anusvara:RS [á í ú é ó] (PTS Pali and HKS use M for anusvara)
> visarga: [ä ï ü ë ö] (HK-Sanskrit uses H and IAST dot below h)
Aren’t those characters supported yet?
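
As a sketch of what "supported" could mean here, the ten letters listed above can be tested against ISO-8859-1 directly. The variable names are my own; the letters are the ones quoted from the RS scheme.

```python
import unicodedata

anusvara_vowels = "\u00e1\u00ed\u00fa\u00e9\u00f3"  # á í ú é ó
visarga_vowels = "\u00e4\u00ef\u00fc\u00eb\u00f6"   # ä ï ü ë ö

for ch in anusvara_vowels + visarga_vowels:
    byte = ch.encode("latin-1")  # raises UnicodeEncodeError if not in Latin-1
    print(f"U+{ord(ch):04X} 0x{byte.hex()} {unicodedata.name(ch)}")
```

Each one is a single precomposed Latin-1 character, so they already exist both in ISO-8859-1 and in Unicode as ordinary Latin letters.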

>
>
>
>
>
>>
>>
>>> The fault is with Lankan technocrats that took the proposal
as it was given and ever since prevented public participation. My
solution is 'perfectly encoded with Unicode'.
>> No. It’s an 8-bit character set independent of Unicode.
>>
>> To think 8-bit is outside Unicode is wrong.
> Could you explain it?
>
> I'll try.
>
> I am wrong technically because there is a leading zero byte added to
each SBCS character to make it UCS-2.
Only if you use the UCS-2 or UTF-16 Big Endian encodings; in Little
Endian the zero byte comes second. But that’s not the problem.
My main concern is the name you are using.
I won’t deny your right to make a Sinhalese 8-bit font: that could be
useful in some old applications which can’t handle Unicode text, such as
the MS-DOS window on Windows.
But when you say it is Unicode compliant, that’s not true because in
Unicode, those code-points are used mostly for Latin letters.
When you say it is compliant with ISO-8859-1, that’s not true because
ISO-8859-1 is used mostly for Latin letters.
Perhaps you could call it SiSCII (Sinhalese Standard Code for
Information Interchange). ☺
Or perhaps SACII (Sinhalese Alternate Code for Information Interchange). ☺

> In the world of putting things into use, what is significant to me is,
> "16-bit characters, however, are not compatible with many current
applications and protocols" -- RFC 2044 (UTF-8).
>
> My constant question is how is this working? Well, if you use
ISO-8859-1, it works. But ISO-8859 is SBCS. But those characters are in
the Unicode standard, and all characters are Unicode. SBCS have zeros
added to make them all same size with others in UCS-2,
?????
SBCS are Single Byte Character Sets. There is no need to add zeros and
SBCS are not used in UCS-2.

> and that makes sense, but it doesn't, because UCS-4 and UTF-16,
?????

> but I don't know about that, may be Chinese -- pulling my hair,
running up and down the stairs.
UCS-2, UCS-4, UTF-16, UTF-8, UTF-7 are encodings, i.e. ways to record
Unicode code-points in memory, but the Unicode numbers remain the same
and those encodings don’t require separate fonts.
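
The difference between a code point and its encodings can be shown with a small sketch. Python has no separate UCS-2 codec, so UTF-16 stands in for it here, on the assumption that the two agree for code points below U+10000; the names codecs_to_try and encoded are my own.

```python
ch = "\u00e7"  # ç, Unicode code point U+00E7

codecs_to_try = ("latin-1", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be")

# Same character, five different byte layouts.
encoded = {name: ch.encode(name) for name in codecs_to_try}

for name, data in encoded.items():
    # Whatever the byte layout, decoding returns the same code point.
    assert ord(data.decode(name)) == 0xE7
    print(name, data.hex())
```

The zero byte sits before E7 in the big-endian forms and after it in the little-endian one, and in every case the text is still U+00E7 and is drawn with the same font.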

>
> It occurs to me HK Sanskrit works. It is pure US-ASCII.
What exactly is HK-Sanskrit?

> Does IAST work? No. Why? It has the Added Bonus characters.
http://en.wikipedia.org/wiki/IAST
But that’s a Latin transliteration. It was not meant to encode Sanskrit
texts in Devanagari but rather to provide a readable version of the text
to users of Latin alphabet languages who didn’t learn to read Devanagari.

> How about transliterating into iso-8859-1? That is the space between
HK and IAST. It works. Romanizing Singhala into iso-8859-1 works!
And then…?

>
> I do not know how time proven applications made believing that a
character is a one-byte data type still work,
Most of them appeared before Unicode and were made by people using SBCS
to write their own languages.

> but they were written before 'wide-character' happened. They still
work. Reading... MSDN discussions, blogs and so on. iso-8859-1 works in
spite of Unicode.
As well as many other SBCS.

> Windows-1252 is the best -- the oasis.
No. CP1252 is the best for Windows users who only use the Western Latin
alphabet.

There are many other SBCS which are not compatible with it.

And Windows CP-1252, ISO-8859-1 and ISO 8859-15 are three different
character sets.
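
A short sketch of how the three differ, using Python's built-in codecs. The byte values picked are my own examples, and the helper names decode_all and encodable are hypothetical.

```python
CODECS = ("latin-1", "iso8859-15", "cp1252")

def decode_all(byte_value):
    """Decode one byte under each of the three character sets."""
    raw = bytes([byte_value])
    return {codec: raw.decode(codec) for codec in CODECS}

for b in (0x80, 0xA4, 0xBD):
    print(hex(b), {k: f"U+{ord(v):04X}" for k, v in decode_all(b).items()})

def encodable(ch, codec):
    """True if the character exists in the given character set."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False

# The French ligature œ (U+0153) is one of the letters missing from ISO-8859-1.
print({codec: encodable("\u0153", codec) for codec in CODECS})
```

The euro sign, for instance, is byte 0x80 in CP1252 and byte 0xA4 in ISO-8859-15, and does not exist at all in ISO-8859-1.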

>
> I hope you now understand.
>
>
>
>> 8-bit character set is the core of Unicode.
> I’m very surprised, voiceless.
>
> Okay, Jean. You win. I am speechless too wondering why people have to
go through all the rigmarole of perfecting a Unicode code block to get
to a thing called Plain Text and then continue to struggle making it
work with all the sweat, blood and money to finally end up with a
crippled system just to protect the reputation of a handful of academics
and technocrats when transliteration is very simple and works instantly.
…

>
>> It is the best place to park any language because it is the
most stable part of the Unicode character database.
> I wouldn’t like to type in Japanese with 8-bit fonts, changing
the font for every new glyph. I tried it once on Windows 3.1… It was
very time-consuming.
>
> You caught the slip! I did not make one qualification. When I write
in Singhala, I make a point to say that this does not include CJKV.
Didn't do it here. I accept my fault.. (Actually, I have to fine my
translator)
>
>
>
>>
>>
>>>
>>> Yes there may remain some issues with older OSes that
have limited
>>> support for standard OpenType layout tables. But
there's now no
>>> problem at all since Windows XP SP2. Windows 7 has the
full support,
>>> and for those users that have still not upgraded from
Windows XP,
>>> Windows 8 will be ready in next August with an upgrade
cost of about
>>> US$ 40 in US (valid offer currently advertized for all
users upgrading
>>> from XP or later), and certainly even less for users in
India and Sri
>>> Lanka.
>>>
>>> The above are not any of my complaints.
>>> Per Capita Income in Sri Lanka $2400. They are content with
cell phones. The practical place for computers is the Internet Cafe.
Linux is what the vast majority needs.
>>>
>>>
>>> And standard Unicode fonts with free licences are
already available
>>> for all systems (not just Linux for which they were
initially
>>> developed);
>>>
>>> Yes, only 4 rickety ones. Who is going to buy them anyway?
>> Why would you buy them if they’re free?
>>
>> Brilliant!
>>
>>
>>
>>> Still Iskoola Pota made by Microsoft by copying a printed
font is the best. You check the Plain Text by mixing Singhala and Latin
in the Arial Unicode MS font to see how pretty Plain text looks. They
spent $2 or 20 million for someone to come and teach them how to make
fonts. (Search ICTA.lk). Staying friendly with them is profitable. World
bank backs you up too.
>>> Sometime in 1990s when I was in Lanka, I tried to select a
PC for my printer brother. We wanted to buy Adobe, Quark Express etc.
The store keeper gave a list and asked us to select the programs.
Knowing that they are expensive, I asked him first to tell me how much
they cost. He said that he will install anything we wanted for free! The
same trip coming back, in Zurich, the guys tried to give me an illicit
copy of Windows OS in appreciation for installing German and Italian
(or French?) code pages on their computers.
>>>
>>> there even exists solutions for older versions of iPhone
>>> 4. OR on Android smartphones and tablets.
>>>
>>> Mine works in them with no special solution. It works
anywhere that supports Open Type -- no platform discrimination
>> Is there any platform discrimination with Unicode Sinhala?
>>
>> You mean Apple / Windows / Linux?
> I mean what you meant one line above.
>
> Okay, whatever.
>
>
>
>> Not really, but Microsoft was ahead of others. They all just
support the crippled system.
>>
>>
>>
>>>
>>> No one wants to get back to the situation that existed
in the 1980's
>>> when there was a proliferation of non-interoperable 8
bit encodings
>>> for each specific platform.
>>>
>>> I agree. Today, 14 languages, including English, French,
German and Italian all share the same character space called ISO-8859-1.
>> In fact, ISO-8859-1 is not well suited for French (my native
language): it lacks a few letters which were added to ISO-8859-15.
However, I always use Unicode today, even for French-only texts.
>>
>> Jean, you are lucky because you use Latin letters. Latin letters
are always bare individual letters. Sinhala is not so. It has all these
other shaping complications and special rules are applied per Complex
language.
>>
>> I think you could appreciate my dilemma. This is how I see it.
Going outside ISO-8859-1 is lot of trouble.
> Which ones?
>
> Like the Added Bonus characters, specifically, the ones with macrons
and dots
But making an 8-bit Sinhalese font already is going outside ISO-8859-1.

>
>
>> Should I enumerate them to you?
> Why not?
>
> It becomes what we call in Singhala, hooðahooða madee = හෝදහෝද මඩේ --
getting in out of the puddle repeatedly washing your feet in it.
>
>
>
Received on Tue Jul 17 2012 - 13:14:23 CDT
