Re: Romanized Singhala - Think about it again from Naena Guru on 2012-07-08 (Unicode Mail List Archive)

From: Naena Guru <naenaguru_at_gmail.com>
Date: Sun, 8 Jul 2012 18:29:38 -0500

Jean-François,

Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just trying...

Thank you for your interest. See inline responses.

On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson <jf_at_colson.eu> wrote:

> Le 05/07/12 10:02, Naena Guru a écrit :
>
>
>
> On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy <verdy_p_at_wanadoo.fr> wrote:
>
>> Anyway, consider the solutions already proposed in Sinhalese
>> Wikipedia. There are various solutions proposed, including several
>> input methods supported there. But the purpose of these solutions is
>> always to generate Sinhalese texts perfectly encoded with Unicode and
>> nothing else.
>>
> Thank you for the kind suggestion. The problem is Unicode Sinhala does not
> perfectly support Singhala!
>
> What’s wrong? Are there missing letters?
>
Many, many.

> *The solution is for Sinhala not for Unicode!*
>
> Or rather for Sinhala by Unicode.
>
Sure, if you want to do it with proper deliberation.

>
>
> I am not saying Unicode has a bad intention, but it is an ill-conceived
> product.
>
> What precisely is ill-conceived?
>
Anglo-centric thinking is what is wrong. Let me take you on the scenic
route:

Number of letters in Singhala is only theoretical. In the case of Singhala
orthography, the actually used number depends on the Sanskrit
vocabulary. My test font has about 1500. I need much more. Pali orthography
enforces touch-letter rule (see later). Modern Singhala, meaning 1st
century onwards, is an admixture of Singhala and Sanskrit. Pali is not
mixed into Singhala text (except to quote like a foreign language).

Generally, the Unicode approach is to treat the consonants as base shapes.
Then the vowel signs are added around them. The vowel signs have their own
codepoints. We hit upon the first problem here because there are two
possible codepoints for each single-mora vowel, double-mora vowel and each
diphthong. We lose all hopes of traditional search, replace etc. It
complicates collation too.
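The search problem can be sketched in a few lines of Python. This is only an illustration, assuming the "two codepoints" are the independent vowel letter and the dependent vowel sign for the same sound; long 'aa' (U+0D86 and U+0DCF) is the sample, and the names in the code are made up for the example.

```python
import unicodedata

INDEPENDENT_AA = "\u0D86"   # the vowel written on its own
DEPENDENT_AA = "\u0DCF"     # the vowel sign attached to a consonant
KA = "\u0D9A"               # a consonant letter

def describe(ch):
    """Return 'U+XXXX NAME' for a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# 'kaa' as stored in Unicode Sinhala: consonant followed by the vowel sign.
kaa_unicode = KA + DEPENDENT_AA

# A plain substring search for the vowel letter misses the vowel sign.
found_in_unicode = INDEPENDENT_AA in kaa_unicode
# In the romanized form the same vowel is always the same letters.
found_in_roman = "aa" in "kaa"

if __name__ == "__main__":
    print(describe(INDEPENDENT_AA))
    print(describe(DEPENDENT_AA))
    print(found_in_unicode, found_in_roman)
```

A search that wants to find the vowel in both positions has to know about both codepoints, which is the extra work being complained about here.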

Then they went up the gum tree of the notion that the Singhala consonant is
actually a consonant with a vowel inside it -- an absurdity -- the Abugida
theory. They then added two ligatures without normalizing them. Singhala
has 15 ligatures in that category of ligatures. They included upadhmAnIya
and left out jihvAmUlIya.

That concludes that Unicode Sinhala is not grammar compliant. This is the
first requirement. It is not Unicode compliant because it has canonicals
that are not normalized. It has a jumble of Singhala letters and signs and
duplicates the same phoneme. It is good only for the trash can.

Here are the considerations for a successful encoding:
A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche

In the pre-printing tradition, adjoining consonants either had standard
ligatures or they were written touching each other to indicate they are
digraphs or trigraphs. The vowel signs surround these. When a
consonant occurred at the end of a word, that was flagged by the halant
sign. 'halant' means hal at end.

With the advent of letterpress printing, touch letter technique became
difficult or impossible to implement. So, now we have a new concept of
'antara' hal -- interior consonant. The modern orthography first honors the
Sanskrit ligature rules and then drops the touch-letter rule. The ligatures
are described in the following books:
A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese Language (1891) -
pp 16-18, Rev Theodore G. Perera - The Sinhala Language (1932) - pp 57-58.

There is only one way that Singhala could be digitized to do justice to the
continuation of its writing system and to smoothly support the past. That
is to define the vowels, diphthongs, consonants and prenasals as individual
codepoints. You can give a codepoint each for the anusvara (ng sound) and
visarga (the guttural postfix on vowels) as the earlier Sanskrit
transliterations did or you could provide one codepoint each for the
modified vowels. I did this latter in romanized Singhala.

>
>
> The fault is with Lankan technocrats that took the proposal as it was
> given and ever since prevented public participation. My solution is
> 'perfectly encoded with Unicode'.
>
> No. It’s an 8-bit character set independent from Unicode.
>
To think that 8-bit is outside Unicode is wrong. The 8-bit character set is
the core of Unicode. It is the best place to park any language because it is
the most stable part of the Unicode character database.

>
>
>
>> Yes, there may remain some issues with older OSes that have limited
>> support for standard OpenType layout tables. But there's now no
>> problem at all since Windows XP SP2. Windows 7 has the full support,
>> and for those users that have still not upgraded from Windows XP,
>> Windows 8 will be ready in next August with an upgrade cost of about
>> US$ 40 in US (valid offer currently advertized for all users upgrading
>> from XP or later), and certainly even less for users in India and Sri
>> Lanka.
>>
> The above are not any of my complaints.
> Per Capita Income in Sri Lanka $2400. They are content with cell phones.
> The practical place for computers is the Internet Cafe. Linux is what the
> vast majority needs.
>
>>
>> And standard Unicode fonts with free licences are already available
>> for all systems (not just Linux for which they were initially
>> developed);
>
> Yes, only 4 rickety ones. Who is going to buy them anyway?
>
> Why would you buy them if they’re free?
>
Brilliant!

>
>
> Still Iskoola Pota made by Microsoft by copying a printed font is the
> best. You check the Plain Text by mixing Singhala and Latin in the Arial
> Unicode MS font to see how pretty Plain text looks. They spent $2 or 20
> million for someone to come and teach them how to make fonts. (Search
> ICTA.lk). Staying friendly with them is profitable. World bank backs you up
> too.
> Sometime in 1990s when I was in Lanka, I tried to select a PC for my
> printer brother. We wanted to buy Adobe, Quark Express etc. The store
> keeper gave a list and asked us to select the programs. Knowing that they
> are expensive, I asked him first to tell me how much they cost. He said
> that he will install anything we wanted for free! The same trip coming
> back, in Zurich, the guys tried to give me an illicit copy of Windows OS in
> appreciation for installing German and Italian (or French?) code pages on
> their computers.
>
> there even exists solutions for older versions of iPhone
>> 4. OR on Android smartphones and tablets.
>>
> Mine works in them with no special solution. It works anywhere that
> supports Open Type -- no platform discrimination
>
> Is there any platform discrimination with Unicode Sinhala?
>
You mean Apple / Windows / Linux? Not really, but Microsoft was ahead of
others. They all just support the crippled system.

>
>
>
>> No one wants to get back to the situation that existed in the 1980's
>> when there was a proliferation of non-interoperable 8 bit encodings
>> for each specific platform.
>>
> I agree. Today, 14 languages, including English, French, German and
> Italian all share the same character space called ISO-8859-1.
>
> In fact, ISO-8859-1 is not well suited for French (my native language): it
> lacks a few letters which were added to ISO-8859-15. However, I always use
> Unicode today, even for French-only texts.
>
Jean, you are lucky because you use Latin letters. Latin letters are always
bare individual letters. Sinhala is not so. It has all these other shaping
complications and special rules are applied per Complex language.

I think you could appreciate my dilemma. This is how I see it. Going
outside ISO-8859-1 is a lot of trouble. Should I enumerate the problems? Just
see the daily questions and dedicated section for Indic at Unicode.org, and
think why ordinary people Anglicize instead of using Unicode Sinhala. (e.g.
elakiri.com).

It's a colossal failure! People Anglicize rather than use Unicode Sinhala.
To be fair, the Lankan technocrats did not have a clue when they were asked
to approve the standard. It was a time when there was (perhaps even now) a
typist in the corner of the office of the bureaucrat. The big guys do not
know touch-typing even now. Proof: A university professor wrote me a
harangue using cyber-sex orthography (no capitals) accusing me of working
for the Americans. I had suggested that Unicode is a conspiracy to confuse us.
(That goes a bit too far; there is no such motive. Nevertheless, the effect
is the same.)

>
>
> Romanized Singhala uses the same. So, what's the fuss about? The font?
>
> The problem is that only your transliteration scheme, with Latin letters,
> is supported by ISO-8859-1, not the Sinhalese letters themselves.
>
You are right partially. I do not need permission from anyone to use any
font.

Jean, the computer thinks it is ISO-8859-1. ISO 8859-1 is only a set of
numbers! [128-255]. Don't get stuck with the names of the codepoints. The
stupid computer cannot read their names. What travels the network are the
bytes in their bare form. When they are viewed, the user has the choice
(theoretically) to select the font.
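A small Python sketch of the "only numbers" point. It shows that each byte value corresponds to the Unicode code point with the same number when read as ISO-8859-1, and that the same bytes become different characters if the reader applies a different 8-bit label. The sample word is from the sentence quoted later in this message; the choice of cp1251 as the "other" label is just an assumption for illustration.

```python
# Every byte value 0..255 decodes to the code point with the same number.
all_bytes = bytes(range(256))
as_latin1 = all_bytes.decode("latin-1")
same_numbers = all(ord(ch) == b for ch, b in zip(as_latin1, all_bytes))

# 'ðeruva' written with an escape for the first letter (U+00F0).
payload = "\u00f0eruva".encode("latin-1")

# What arrives over the network is only the byte values.
wire_values = list(payload)

# The reader decides how to interpret them.
read_as_latin1 = payload.decode("latin-1")
read_as_cp1251 = payload.decode("cp1251")   # hypothetical wrong label

if __name__ == "__main__":
    print(same_numbers)
    print(wire_values)
    print(read_as_latin1, read_as_cp1251)
```

The bytes never change; only the interpretation at the receiving end does.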

Stop your imagination and do this. Go to this site:
http://www.lovatasinhala.com
On the right-hand-side column (directly below the lion), there is a link in
a light-blue box that says, "Latin Script". Click on that and get rid of
the dreaded Singhala script and be happy. What you see is not Icelandic. It
is romanized Singhala. And if you want to really read it, click on the next
link below and see the pronunciation key.

I have requested a fellow to translate at least the page on Unicode to
English. Hopefully, he does it quickly.

>
> Consider that as the oft suggested IME. Haha!
>
>>
>> And your solution also does not work in multilingual contexts;
>
> If mine does not work in some multilingual context, then neither do any of
> the 14 languages I mentioned above, including English and French.
>
> They do because they use Latin letters, not Sinhalese letters.
>
English, French and romanized Singhala do not work in multilingual
contexts. You are confusing letters and codepoints. Letters are provided by
FONTS in the user interface in the LOCAL device.

By the way, how do you localize in France? Do you know that the English
writing was romanized when the English people were forced into
Christianity? The only truly surviving English letter is þorn (þ).

>
>
>
> it does
>> not work with many protocols or i18n libraries for applications.
>
> i18n is for multi-byte characters. Mine are single-byte characters.
>
> OK. Do it as you want, but it won’t be Unicode compliant.
>
Thank you for your generosity, sire. I waited all this long for it. (I am
kidding).

>
>
> As you see, the safest place is SBCS.
>
> I don’t see. Why is it safer?
>
Just compare romanized Singhala and Unicode Sinhala.
First, the display of the script is not guaranteed. You get
Character-not-found rows if you do not have the font. Then you see garbage
with letters and signs mixed up if you did not update your font renderer
(e.g. Uniscribe). (Only Windows 7 comes with the latest Uniscribe). Different
fonts have different levels of letter construction, and some have wrong
letters for wrong codepoints. This is how it is in iPhone.

When you transport Romanized Singhala, you do not need to re-encode it
(e.g. UTF-8) for the purpose and bloat it. There is not even an HTML editor
for it. You need to re-write all well established and seasoned applications
using updated compilers that added wide-character functions.
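The size difference can be checked with a short Python sketch. It compares the byte count of one word in three forms. The romanized word is the stem of 'síhalataþ' from the sample sentence further down; the Unicode Sinhala spelling used here is an assumed equivalent, written with escapes so the snippet survives any mail client.

```python
# 's' + i-acute + 'hala', the romanized form.
romanized = "s\u00edhala"
# Assumed Unicode Sinhala spelling of the same word (five code points).
sinhala = "\u0DC3\u0DD2\u0D82\u0DC4\u0DBD"

# Byte counts under each encoding.
sizes = {
    "romanized, ISO-8859-1": len(romanized.encode("latin-1")),
    "romanized, UTF-8": len(romanized.encode("utf-8")),
    "Unicode Sinhala, UTF-8": len(sinhala.encode("utf-8")),
    "Unicode Sinhala, UTF-16": len(sinhala.encode("utf-16-le")),
}

if __name__ == "__main__":
    for label, size in sizes.items():
        print(f"{label}: {size} bytes")
```

One word is not a benchmark, of course; it only shows where the extra bytes come from.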

Here is a test for you. The following is a Unicode Sinhala paragraph (a
random copy from the news web site http://divaina.com/, Sunday issue).
Your computer must be Plain Text ready for this. I bet it is not.

ළමයින්ගෙ අධ්‍යා පනය කඩාකප්පල් වෙනව තමයි. ඒත් ඉතිං මොකද කරන්නෙ? රටේ ආණ්‌ඩුවට
පණිවිඩයක්‌ දෙන්න ස්‌ට්‍රයික්‌ නැතුව බැරි වීම අවාසනාවන්ත තත්ත්ව යක්‌. මේ
ප්‍රශ්න සමූහය දැන් තීරණාත්මක තැනකට ඇවිත් තියෙනව.

1. Copy it to Notepad
2. From Notepad, copy it to a new MS Word page
3. Copy what you pasted into Word back to Notepad below the original
4. Copy that second one from Notepad back to Word below the one it already
has

Observe that MS Word altered the codepoints in the underlying text runs.
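For anyone who wants to check the result of this test by numbers rather than by eye, here is a minimal Python sketch that lists code points and finds the first place where two copies differ. The helper names are invented for the example, and the "round-tripped" string is a made-up stand-in (a copy with the joiner dropped); it is not a claim about what Word actually changes.

```python
def codepoints(text):
    """List the code points of a string as 'U+XXXX' labels."""
    return [f"U+{ord(ch):04X}" for ch in text]

def first_difference(a, b):
    """Index of the first differing code point, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        return min(len(a), len(b))
    return None

ZWJ = "\u200D"  # zero width joiner, used inside some conjuncts
# A fragment of the second word of the sample paragraph: dha, al-lakuna, ZWJ, ya.
original = "\u0DB0\u0DCA" + ZWJ + "\u0DBA"
# Hypothetical altered copy, for illustration only.
round_tripped = "\u0DB0\u0DCA\u0DBA"

if __name__ == "__main__":
    print(codepoints(original))
    print(codepoints(round_tripped))
    print(first_difference(original, round_tripped))
```

Paste the two Notepad copies into `original` and `round_tripped` to see whether, and where, the underlying text runs differ.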

>
>
> Or it
>> requires specific constraints on web pages requiring complex styling
>> everywhere to switch fonts.
>
> Did you see http://www.lovatasinhala.com? Maybe you are confusing
> Unicode Sinhala and romanized Singhala. Unicode Sinhala has a myriad such
> problems.
>
> Which problems?
>
See above including the test.

>
>
> That is why it should be abandoned!
>
> Why wouldn’t you try to solve the problems, whatever they could be,
> instead of proposing an entirely new character set nobody will support?
>
There are only two solutions. ONE: Completely redefine the Singhala code
block. TWO: Just abandon it and use the transliteration. Why go through
the trouble to satisfy fellows like you who do not use Singhala anyway?

> If the rendering engines don’t work as you expect they should, how could a
> new encoding scheme solve the problem?
>
The rendering engine works just fine! It is the code block that is sick.
You are way off base, buddy.

>
>
> Please look at the web site and say it more coherently, if I
> misunderstood you.
>
>
>> Plain text searches in multilingual pages
>> won't work. Usability tools won't work.
>>
> Have you tried to search a vowel in Unicode Sinhala? Romanized Singhala
> has no search problem. Try it on my web site.
>
> Well, perhaps there’re problems with search engines.
>
Haha! I am not talking about search engines. I am talking about text
processing. I am sorry but talking to you is like the Singhala saying,
"biiri aliyaata veenaa gahanavaa vagee." -- Like playing the violin for the
deaf elephant.

> Wouldn’t it be possible to correct search engines instead of inventing a
> new character set?
>
You need to go back to school. There is no new character set. A Unicode
character is just a numeric code. The Unicode character database goes from
zero to some very big number. There are no holes in it to define character
sets for somebody's fancy. Well, Doug Ewell did one for Esperanto expanding
fuþorc. We need to do something practical, and I did it already.

>
>
>
>> Really consider abandoning the hacked encoding of the Sinhalese
>> script itself.
>
> There is no re-encoding of Singhala. Singhala is transcribed into Latin!
> When I say Singhala, I don't mean Unicode Sinhala. It is the Singhala
> phoneme inventory that was transliterated.
>
> Using Latin letters for a transliteration of Sinhala is not a hack, but
> making fonts said to be Latin-1 with Sinhalese letters instead of the Latin
> letters is a hack.
>
Well, you can characterize the smartfont solution any way you like. The
problem for you is that it works!

Sorry for this Kindergarten lesson, but you should understand the role of
the font. A font is a support application at the User Interface level. It
is what the user decides to use to see underlying text runs in an
application's view port. The same text one person reads at the computer in
Arial others read in Helvetica. In the same manner, if I did not deliver
the font with the web page, you will see it in some sans-serif font your
computer has. It is something that happens locally in the device. When text
moves between applications and between computers, it travels as numeric
codes representing the text in the form of digital bytes. The computer
can't tell French from Singhala.

>
>
>
> It will however be more valuable if you just
>> concentrate on creating a simpler romanization system that will use
>> standard Unicode encoding of Latin
>
> This is exactly what I did. Have I been talking to someone who did not
> know what he was evaluating?
>
> I think he was speaking of the transliteration, not of your hack.
>
I hope the fellow reads the above response. I wish you guys lived close by
here in US so that I could hold a special class to teach you how computers
function.

>
>
>
> (note that you are absolutely not
>> limited to the reduced ISO 8859-1 subset for Latin and that there's
>> already a much richer set of letters, symbols and diacritics for all
>> needs ; but here again this requires using Unicode and not just ISO
>> 8859-1).
>
> Oh, thank you for the generosity of allowing me use of the entire Latin
> repertoire. You don't have to tell that to me. I have traveled quite a bit
> in the IT world. Don't be surprised if it is more than what you've seen.
> (Did you forget that earlier you accused me of using characters outside
> ISO-8859-1 while claiming I am within it? That is because you saw IAST and
> PTS displayed. They use those wonderful letters, symbols and diacritics you
> are trying to tout.) Is there a problem with Asians using ISO-8859-1 code
> space even for transliteration?
>
>
>> The bonus will be that you can still write the Sinhalese
>> language with a romanisation like yours,
>
> Bonus?
>
> but there's no need to
>> reinvent the Sinhalese script
>
> Singhala script existed many, many years since before the English and
> French adopted Latin.
>
> Did any body say it didn’t?
>
He said reinvent the Singhala SCRIPT. The script is the script. I use the
same script in a more complete and correct manner than any Unicode font
even with my incomplete, rough design, proof-of-concept font.

>
>
> What I did was saving it from the massacre going on with Unicode
> Sinhala.
>
> Which massacre? What’s wrong with the Unicode support of Sinhala? Could
> you give details, please?
>
I gave the details earlier in this response

>
>
>
> itself that your encoding is not even
>> capable of completely supporting in all its aspects (your system only
>> supports a reduced subset of the script).
>>
> What is the basis for this nonsense? (Little birds whispering in the
> background. Watch out. They are laughing).
> My solution supports the entire script, Singhala, Pali and Sanskrit plus
> two rare allophones of Sanskrit as well. Tell me what it lacks and I will
> add it, haha! One time you said I assigned Unicode Sinhala characters to
> the 'hack' font. What I do is assigning Latin characters to Singhala
> phonemes. That is called transliteration. There are no 'contextual
> versions' of the same Singhala letters like you said earlier.
>
> Ask your friends what they have more than mine in the Singhala script.
> Ask them why they included only two ligatures when there are 15 such.
>
> Can’t you make a proposal or describe the missing letters?
>
Let it rot in place. (Lankan government might need it to get loans from WB
to feed the IT guys over there). I proved that it is not
necessary. Romanizing takes care of it and the native readers can use the
orthographic font if they want. Otherwise, they can use Latin script just
like you and I do here. Remember that the font is a local decision. It need
not go out of your computer and cause heart ache among people like you. The
following is the first sentence at:
http://www.lovatasinhala.com/liyanna.php
oba kiyavana ðeruva heøa kramaya viðyaaþmakava haa vyaakaraµaanukuulava
saðaa æþi nisaa, eya batahira yuroopiiya bhaaxaa parigaµakaya þula labana
varaprasaaða elesama síhalataþ labaa ðeyi.

I suggest you get with it and move on.

>
>
> Ask them how many Singhala letters there are.
>
>>
>> Even the legacy ISCII system (used in India) is better, because it is
>> supported by a published open standard, for which there's a clear and
>> stable conversion from/to Unicode.
>>
> My solution is supported by two standards: ISO-8859-1 and Open Type.
> ISO-8859-1 is Basic Latin plus Latin-1 Extension part of Unicode standard.
>
> It is not supported by ISO-8859-1. ISO-8859-1 is for Latin letters, not
> Sinhalese ones.
>
It is worth your traveling to America to learn what a character
encoding is. A character set is not something you ask permission to use.
If you use it, you have used it.

>
>
>
> Bottom line is this: If Latin-1 is good enough for English and French,
> it is good enough for Singhala too.
>
> No, because Sinhala is not written with Latin letters.
>
Declarations like that won't work in a technical discussion. You need to
explain. Singhala is a language. Singhala native SCRIPT is the traditional
way it is written. When I write Jean, I really enter four code points:
74, 101, 97 and 110. When you write naena, you enter 110, 97, 101, 110 and 97.
We think the former is a name of a pretty girl and the latter is a name I
made up not in a particular language.
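The numbers above are easy to verify. A minimal Python sketch (the function name is just for this example) that turns each word into its code points and shows that the ISO-8859-1 bytes carry the same values:

```python
def code_points(word):
    """Return the numeric code point of each character in the word."""
    return [ord(ch) for ch in word]

jean = code_points("Jean")
naena = code_points("naena")

# Encoded as ISO-8859-1, each byte holds the same number as its code point.
jean_bytes = list("Jean".encode("latin-1"))

if __name__ == "__main__":
    print(jean)
    print(naena)
    print(jean_bytes == jean)
```

Nothing in those numbers says which language, or which script, the reader will see.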

>
>
> And if Open Type is good for English and French, it is good for
> Singhala too.
>
> Of course.
>
Thank you for that.
Received on Sun Jul 08 2012 - 18:35:25 CDT
