On 17/07/12 02:43, Naena Guru wrote:
 > Jean, sorry I am late. I used spare time as and when I got it.
 >
 > On Sun, Jul 8, 2012 at 10:20 PM, Jean-François Colson <jf_at_colson.eu> 
wrote:
 >
 >     On 09/07/12 01:29, Naena Guru wrote:
 >>     Jean-François,
 >>
 >>     Let me approximate it in Romanized Singhala: 'jyó-frósvaa'. Just 
trying...
 >     I don’t know how that transcription should be pronounced but in 
IPA, Jean-François is /ʒɑ̃.fʁɑ̃.swa/.
 >
 > They came as rectangles (in XP).
That’s not surprising. Windows XP is an old, out-of-date system with, by 
default, a very limited set of fonts. But nothing prevents you from 
downloading some additional free fonts.
 > They showed correctly in your message inside Firefox running in Puppy 
Linux, but where I am replying, it shows a reversed Euro-like character
That is surprising. Which font includes a reversed euro sign?
 > in place of the a-umlaut.
I didn’t use any umlauts but two tildes.
 > This again illustrates how hazardous it is for characters outside 
Latin-1.
It illustrates how hazardous it is to use such an old OS as Windows XP.
 >
 > I can only approximate the first letter as English j+y,
English j + y????? I don’t know that, neither in English nor in French.
/ʒ/ is the French j. It is not the English j plus something but rather 
the English j minus something.
The English j is /dʒ/. That’s an affricate, i.e. roughly a sound which 
begins as a plosive and evolves to end as a fricative.
The French j is a fricative. Its nearest approximation in English is the 
z in azure or the s in leisure.
/ɑ̃/ is not an /ɑ/ with umlaut but an /ɑ/ with tilde. It is pronounced as 
the a in the English word “car” but with a nasal quality, i.e. some air 
passes through the nose.
Jean is a homophone of gens, which you can hear here: 
http://fr.wiktionary.org/wiki/gens#Prononciation
The speaker has recorded “des gens”, so focus your attention on the 
second syllable.
/f/ is pronounced as in English.
/ʁ/ is the French r, but there are several varieties of r among the 
French dialects, so using the English r instead is not a big problem.
/s/ and /w/ are pronounced as in English.
/a/ is very similar to /ɑ/. It is the beginning of the diphthong in the 
English word “sky”.
 > which is same on Singhala. The rest is pretty close, I think.
 >
 >
 >
 >>
 >>     Thank you for your interest. See inline responses.
 >>
 >>     On Thu, Jul 5, 2012 at 7:35 AM, Jean-François Colson 
<jf_at_colson.eu> wrote:
 >>
 >>         On 05/07/12 10:02, Naena Guru wrote:
 >>>
 >>>
 >>>         On Wed, Jul 4, 2012 at 11:33 PM, Philippe Verdy 
<verdy_p_at_wanadoo.fr> wrote:
 >>>
 >>>             Anyway, consider the solutions already proposed in 
Sinhalese
 >>>             Wikipedia. There are various solutions proposed, 
including several
 >>>             input methods supported there. But the purpose of these 
solutions is
 >>>             always to generate  Sinhalese texts perfectly encoded 
with Unicode and
 >>>             nothing else.
 >>>
 >>>         Thank you for the kind suggestion. The problem is Unicode 
Sinhala does not perfectly support Singhala!
 >>         What’s wrong? Are there missing letters?
 >>
 >>     Many, many.
 >>
 >>>         The solution is for Sinhala not for Unicode!
 >>         Or rather for Sinhala by Unicode.
 >>
 >>     Sure, if you want to do it with proper deliberation.
 >>
 >>
 >>
 >>>         I am not saying Unicode has a bad intention but an 
ill-conceived product.
 >>         What precisely is ill-conceived?
 >>
 >>     Anglo-centric thinking is what is wrong.
 >     ?????
 >
 > Letters have no direct relation to speech -- very few. In Singhala 
(perhaps as in French too as someone said?) you write what you say. In 
Singhala, the exception is clearly understood rule set about how to 
pronounce short 'a' -- whether muted or not.
 >
 > Therefore, the approach should have been to encode the vowels, 
diphthongs and consonants as base letters. I assigned the acute accent 
to the 'ng' sound and the umlaut to the guttural H in Sanskrit, but they 
could be assigned independent codepoints.
 >
 >
 >
 >>      Let me take you on the scenic route:
 >>
 >>     Number of letters in Singhala is only theoretical. In the case 
of Singhala orthography, the actually used number depends on the 
Sanskrit vocabulary.
 >     Do you mean there are many conjunct consonants, sometimes with a 
separate glyph?
 >
 > Yes, many. There are three orthographies. Singhala does not have CCs 
at all. Sanskrit has a lot. Pali has touch letters in addition to what 
Sanskrit has. Modern Singhala is mixed Singhala and Sanskrit. With 
Unicode Sinhala, you need to know which ones join and provide the ZWJ 
and hope and pray that the font has the CC. Often they are absent.
So I guess your problem could be solved by providing new fonts with a 
better support of conjunct consonants. What you did for 8-bit Sinhala, 
you could do it for Unicode Sinhala too.
 > Then SLS1134 gives wrong advice too.
Could you explain in detail what is wrong in their advice?
 >
 >     In Devanagari, they’re made by typing two or more consonants 
separated by halants. Isn’t that possible with Sinhala?
 >
 > Yes. That is possible, which means the fonts need lookup tables that 
provide the CCs in the Private Use Area of the font.
Isn’t that what you did in your 8-bit Sinhala font?
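To make the ZWJ mechanism discussed above concrete, here is a minimal Python sketch. It assumes the usual Unicode Sinhala convention that a conjunct is requested with consonant + al-lakuna (U+0DCA) + ZWJ (U+200D) + consonant; the variable names and the ta + ra example are mine, chosen only for illustration. Whether the conjunct is actually displayed still depends on the font.

```python
# Sketch only: shows the code-point sequences, not the rendering.
AL_LAKUNA = "\u0dca"   # Sinhala vowel-killer sign (what Unicode calls the virama)
ZWJ = "\u200d"         # zero-width joiner
TA = "\u0dad"          # Sinhala letter ta
RA = "\u0dbb"          # Sinhala letter ra

plain = TA + AL_LAKUNA + RA                   # ta with visible al-lakuna, then ra
conjunct_request = TA + AL_LAKUNA + ZWJ + RA  # asks the font for the joined form

def code_points(text: str) -> str:
    """List the code points of a string as U+XXXX values."""
    return " ".join(f"U+{ord(ch):04X}" for ch in text)

print(code_points(plain))
print(code_points(conjunct_request))
```

The only difference between the two strings is the ZWJ, so a converter or an input method can add or remove it without touching the letters themselves.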
 >
 > BTW, 'halant' means the last consonant in a word. 'hal' means 
consonant. What Unicode calls halant (and virama) is the sign that 
indicates a letter is a (free-standing) consonant. virama means the 
marker of a consonant at end of a sentence. (Sigh, cry)
 > In Singhala, we call the sign 'hal kiriima' = the sign which makes it 
a hal and is not location specific.
 >
 >
 >
 >>     My test font has about 1500.
 >     IIRC there are only 191 code-points in Latin-1. Or 96? I’m not 
sure whether ASCII is part of Latin-1.
 >
 > These are the two that I use to look up letters of the Latin-1 
repertoire:
 > http://www.unicode.org/charts/PDF/U0000.pdf
 > http://www.unicode.org/charts/PDF/U0080.pdf
But there are not “about 1500” code points there.
 >
 >
 >
 >>     I need much more. Pali orthography enforces touch-letter rule 
(see later). Modern Singhala, meaning 1st century onwards, is an 
admixture of Singhala and Sanskrit. Pali is not mixed into Singhala text 
(except to quote like a foreign language).
 >>
 >>     Generally, the Unicode approach is to treat the consonants as 
base shapes. Then the vowel signs are added around them. The vowel signs 
have their own codepoints. We hit upon the first problem here because 
there are two possible codepoints for each single-mora vowel, 
double-mora vowel and each diphthong. We lose all hopes of traditional 
search, replace etc. It complicates collation too.
 >     In my first name, I usually type the ç as U+00E7 because I have a 
Ç key on my keyboard, but using the two code-points U+0063 U+0327 (c + 
combining cedilla) would be correct too, and in many programs the search 
function won’t find one form of ç if I’m looking for the other.
 >
 > You are right!
Thanks
 > Notepad does not find ç. It is certainly a bug. I tried Geany text 
editor and Abiword in Linux. They both are okay. I think Notepad forgets 
those letters in the Latin 1 Supplement. This is not a big deal.
It is not really a smaller problem than improving the Sinhalese fonts if 
they lack some conjuncts.
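For what it is worth, the search problem with the two spellings of ç can be shown in a few lines. This is a minimal Python sketch, assuming the application is free to normalize both the text and the search term to NFC before comparing; the helper name nfc_contains is hypothetical.

```python
import unicodedata

precomposed = "Fran\u00e7ois"    # ç as the single code point U+00E7
decomposed = "Franc\u0327ois"    # c (U+0063) followed by combining cedilla (U+0327)

def nfc_contains(haystack: str, needle: str) -> bool:
    """Substring search after normalizing both sides to NFC."""
    norm_haystack = unicodedata.normalize("NFC", haystack)
    norm_needle = unicodedata.normalize("NFC", needle)
    return norm_needle in norm_haystack

# A naive search for U+00E7 misses the decomposed spelling.
naive_hit = "\u00e7" in decomposed
# The normalized search finds it.
normalized_hit = nfc_contains(decomposed, "\u00e7")

print(naive_hit, normalized_hit)   # False True
```

A program that skips the normalization step behaves like the Notepad case described above; one that normalizes first finds the letter in either spelling.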
 > I remember that VOLT guys took awhile to realize that Open Type 
'simple' script includes those.
 >
 >
 >
 >>
 >>     Then they went up the gum tree of the notion that the Singhala 
consonant is actually a consonant with a vowel inside it -- an absurdity 
-- the Abugida theory. They then added two ligatures without normalizing 
them. Singhala has 15 ligatures in that category of ligatures. They 
included upadhmAnIya and left out jihvAmUlIya.
 >     Can’t those ligatures be typed as separate signs with a zero-width 
joiner? I’ve seen there’s one on Sinhalese keyboards.
 >
 > Yes, you can but only to get a few.
Could you describe in detail the missing conjuncts, which letters 
should generate them and what they should look like?
Also, it would be very useful to send those descriptions to the 
designers of Sinhalese fonts.
 > It hurts the orthography, Jean. People forget orthography when it is 
in disuse. It is better to make the CCs and let the user prevent their 
formation by means of ZWNJ where they do not want the joining.
Oh! I see. It would be possible to make input methods which work as you 
want and generate Sinhalese characters as they are used today.
It is not possible to change the behaviour of Sinhalese characters: that 
would mangle all presently existing texts.
 > I implemented this in my font. It is easier for input thanh
 >
 >
 >
 >>
 >>     That concludes that Unicode Sinhala is not grammar compliant.
 >     What do you mean?
 >
 > That the rules of orthography are ignored. e.g., just two conjuncts 
(conjoints) out of a group of 15
15, or about 1500?
 > in their class inside the standard and others missing.
You’re not alone. In the South of Belgium and the North of France, there 
are local languages which, in some orthographies, use the ring above the 
vowels A and E. Å and å are available as precomposed characters. E̊ and e̊ 
are not and must be composed of E/e + U+030A (combining ring above). 
That’s not a big problem.
I think the good question is “Why were those two CCs included as 
precomposed characters?”, not “Why were the other ones forgotten?”.
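The difference between the two vowels can be checked mechanically. This is a minimal Python sketch, assuming the ring is entered as the combining ring above (U+030A) and using NFC normalization to see whether a precomposed character exists.

```python
import unicodedata

RING_ABOVE = "\u030a"   # combining ring above

# NFC composes a base letter and a mark only when a precomposed character exists.
a_ring = unicodedata.normalize("NFC", "a" + RING_ABOVE)
e_ring = unicodedata.normalize("NFC", "e" + RING_ABOVE)

print(len(a_ring), [f"U+{ord(ch):04X}" for ch in a_ring])   # 1 ['U+00E5']
print(len(e_ring), [f"U+{ord(ch):04X}" for ch in e_ring])   # 2 ['U+0065', 'U+030A']
```

Both strings are valid Unicode text; the second one simply stays as two code points, and the font is expected to position the ring over the e.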
 > There is no consistency in this code block.
Just forget the useless characters if they are a mistake of the past.
 > This is why I say, just encode the vowels and consonants. You comply 
with the grammar only if you start with the phoneme set. That is 
what HK Sanskrit, IAST Sanskrit and PTS Pali did. They are backward 
compatible with older text due to this.
 >
 >
 >
 >>     This is the first requirement. It is not Unicode compliant 
because it has canonicals that are not normalized.
 >     Can’t you describe in detail those missing “canonicals” to make 
a proposal?
 >
 > I can, but that is the wrong approach.
Why is it wrong?
 >
 > Unicode correctly advised not to introduce canonicals trivially.
What do you mean?
 > Then they gave a deadline to register them. The whole idea is to 
differentiate the text of one language from the other with these 
national ligatures. Those languages that had Latin canonical forms are 
justified in declaring codepoints for them. Whereas, Singhala 
artificially made two 'canonicals'.
Do you mean there’s a historic way to write Sinhala and a second way, 
introduced by the persons who submitted the Sinhalese Unicode proposal 
years ago?
Is that what you call “canonicals”?
 > Now they have to be normalized down to base codepoints.
 >
 > Besides, no change can ever happen because of the dog-in-the-manger 
effect.
I don’t understand. What is that strange effect?
 > If you insist, the alternative is to define Singhala in a new code 
block without canonicals
Without canonicals????? Could you reformulate that please? I don’t 
understand. Which meaning of canonical 
(http://dictionary.reference.com/browse/canonical) are you using here?
 > and the codepoints are purely for vowels, diphthongs and consonants. 
Next you make fonts to whatever depth of complexity that you want. 
OpenType allows this.
I’m not a font designer, but can’t that be done with the present 
Sinhalese encoding?
 > Then those who care about orthography can make font templates for 
traditional orthographies. The user simply types as they speak and the 
font responds with the combinations intended by the font author.
Not changing anything, it would be possible to type as you speak with a 
well designed input method.
 > I tested it for nine years with various kinds of users. It is in my 
(proof-of-concept) font.
Is it possible to download your font?
 > There are few instances where (pure) Singhala words that have two 
consonants together, such as, kan|ða කන්ද and aþ|þa අත්ත. There are also 
names which are like hyphenated ones that have three vowels together 
(e.g. සුමිත්රආරච්චි -> sumiþra|aaracci.). In those instances, the user can 
intervene ZWNJ. The Bar represents ZWNJ in the RS examples. Users know 
obvious Singhala words and names.
 >
 >> It has a jumble of Singhala letters and signs and duplicates the 
same phoneme.
 >
 >     Duplicates the same phoneme…
 >     Do you mean there are characters which are encoded twice? In this 
case, you can ignore one of them.
 >     Do you mean vowels and diphthongs can be written both as separate 
letters and as diacritics after a consonant? In this case, that works 
like in Indic scripts. I don’t see anything wrong in that.
 >
 > What is wrong in that is in the struggle you would have to go through 
for text processing like search and replace and sorting.
?????
 > It is very straight forward in RS:
 > See the sample names getting sorted according to English and Singhala 
here:
 > http://www.lovatasinhala.com/liyanna.php#sort
 > Meanings of the buttons left to right:
 > [erase] [show sample] [sort-Singhala] [sort-English]
 >
 >
 >
 >>     It is good only for the trash can.
 >     Isn’t that an exaggeration?
 >
 > Kind of.
 > Jean, it can only be redefined. First, if you tinker it, you orphan 
the present text made with the flawed one. If you define it elsewhere, 
then you can write a routine to convert the flawed one to the new one. 
That is what I did in the case of romanized Singhala.
 >
 >
 >
 >>
 >>     Here are the considerations for a successful encoding:
 >>     A consonant is called a 'hal akSara'. Check the Sanskrit dictionary:
 >> http://www.sanskrit-lexicon.uni-koeln.de/cgi-bin/tamil/recherche
 >>
 >>     In the pre-printing tradition, adjoining consonants either had 
standard ligatures or they were written touching each other to indicate 
they are digraphs or trigraphs. The vowel signs surround these. When a 
consonant occurred at the end of a word, that was flagged by the halant 
sign. 'halant' means hal at end.
 >>
 >>     With the advent of letterpress printing, touch letter technique 
became difficult or impossible to implement. So, now we have a new 
concept of 'antara' hal -- interior consonant. The modern orthography 
first honors the Sanskrit ligature rules and then drops the touch-letter 
rule. The ligatures are described in the following books:
 >>     A.M. Gunasekara - A Comprehensive Grammar of the Sinhalese 
Language (1891) - pp 16-18, Rev Theodore G. Perera - The Sinhala 
Language (1932) -- 57 - 58.
 >
OK. I’ll buy them, but not immediately: I’m currently unemployed and my 
budget is very limited.
 >>
 >>     There is only one way that Singhala could be digitized to do 
justice to the continuation of its writing system and to smoothly 
support the past. That is to define the vowels, diphthongs, consonants 
and prenasals as individual codepoints. You can give a codepoint each 
for the anusvara (ng sound) and visarga (the guttural postfix on vowels)
 >     That’s what has been done in Unicode.
 >
 > No, Jean, no. What it has as 'consonants' are not consonants.
What are they?
 > It has two codepoints for each vowel...
Where is the problem?
 > (Please don't make me repeat)
Sorry
 >
 >
 >>     as the earlier Sanskrit transliterations did or you could 
provide one codepoint each for the modified vowels.
 >     That’s the Unicode dependent vowel signs.
 >
 > What I meant by modified vowels is the set of letters you get when 
you add the anusvara and visarga, five in each.
 >
 >
 >
 >>     I did this latter in romanized Singhala.
 >     What exactly did you do?
 >
 > anusvara:RS [á í ú é ó]  (PTS Pali and HKS use M for anusvara)
 > visarga: [ä ï ü ë ö] (HK-Sanskrit uses H and IAST dot below h)
Aren’t those characters supported yet?
 >
 >
 >
 >
 >
 >>
 >>
 >>>         The fault is with Lankan technocrats that took the proposal 
as it was given and ever since prevented public participation. My 
solution is 'perfectly encoded with Unicode'.
 >>         No. It’s an 8-bit character set independent from Unicode.
 >>
 >>     To think 8-bit is outside Unicode is wrong.
 >     Could you explain it?
 >
 > I'll try.
 >
 > I am wrong technically because there is a leading zero byte added to 
each SBCS character to make it UCS-2.
Only if you use the UCS-2 or UTF-16 Big Endian encodings (in Little 
Endian the zero byte comes second). But that’s not the problem.
My main concern is the name you are using.
I won’t deny your right to make a Sinhalese 8-bit font: that could be 
useful in some old applications which can’t handle Unicode text, such as 
the MS-DOS window on Windows.
But when you say it is Unicode compliant, that’s not true because in 
Unicode, those code-points are used mostly for Latin letters.
When you say it is compliant with ISO-8859-1, that’s not true because 
ISO-8859-1 is used mostly for Latin letters.
Perhaps you could call it SiSCII (Sinhalese Standard Code for 
Information Interchange). ☺
Or perhaps SACII (Sinhalese Alternate Code for Information Interchange). ☺
 > In the world of putting things into use, what is significant to me is,
 > "16-bit characters, however, are not compatible with many current 
applications and protocols" -- RFC 2044 (UTF-8).
 >
 > My constant question is how is this working? Well, if you use 
ISO-8859-1, it works. But ISO-8859 is SBCS. But those characters are in 
the Unicode standard, and all characters are Unicode. SBCS have zeros 
added to make them all same size with others in UCS-2,
?????
SBCS are Single Byte Character Sets. There is no need to add zeros and 
SBCS are not used in UCS-2.
 > and that makes sense, but it doesn't, because UCS-4 and UTF-16,
?????
 > but I don't know about that, may be Chinese -- pulling my hair, 
running up and down the stairs.
UCS-2, UCS-4, UTF-16, UTF-8, UTF-7 are encodings, i.e. ways to record 
Unicode code-points in memory, but the Unicode numbers remain the same 
and those encodings don’t require separate fonts.
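To illustrate the point that the encoding changes the bytes but not the character, here is a minimal Python sketch. It takes ç (U+00E7) as the sample character and uses Python's codec names for the encodings; both choices are mine, for illustration only.

```python
text = "\u00e7"   # ç, code point U+00E7

encodings = ["latin-1", "utf-8", "utf-16-be", "utf-16-le", "utf-32-be"]
encoded = {name: text.encode(name) for name in encodings}

for name, raw in encoded.items():
    # Decoding with the matching codec always gives back the same code point.
    code_point = ord(raw.decode(name))
    print(f"{name:10} {raw.hex(' '):12} -> U+{code_point:04X}")

# latin-1    e7           -> U+00E7
# utf-8      c3 a7        -> U+00E7
# utf-16-be  00 e7        -> U+00E7
# utf-16-le  e7 00        -> U+00E7
# utf-32-be  00 00 00 e7  -> U+00E7
```

The leading zero byte appears only in the big-endian forms, and in every case the font only ever sees U+00E7, which is why no separate font is needed per encoding.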
 >
 > It occurs to me HK Sanskrit works. It is pure US-ASCII.
What exactly is HK-Sanskrit?
 > Does IAST work?  No. Why? It has the Added Bonus characters.
http://en.wikipedia.org/wiki/IAST
But that’s a Latin transliteration. It was not meant to encode Sanskrit 
texts in Devanagari but rather to provide a readable version of the text 
to users of Latin alphabet languages who didn’t learn to read Devanagari.
 > How about transliterating into iso-8859-1? That is the space between 
HK and IAST. It works. Romanizing Singhala into iso-8859-1 works!
And then…?
 >
 > I do not know how time proven applications made believing that a 
character is a one-byte data type still work,
Most of them appeared before Unicode and were made by people using SBCS 
to write their own languages.
 > but they were written before 'wide-character' happened. They still 
work. Reading... MSDN discussions, blogs and so on. iso-8859-1 works in 
spite of Unicode.
As well as many other SBCS.
 > Windows-1252 is the best -- the oasis.
No. CP1252 is the best for Windows users who only use the Western Latin 
alphabet.
There are many other SBCS which are not compatible with it.
And Windows CP-1252, ISO-8859-1 and ISO 8859-15 are three different 
character sets.
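A quick way to see that the three sets differ is to decode the same bytes with each of them. This is a minimal Python sketch; the three sample bytes are my own choice, and the codec names are the ones Python uses for these character sets.

```python
samples = [0x80, 0xA4, 0xBD]
codecs_to_compare = ["cp1252", "latin-1", "iso8859-15"]

# Decode each sample byte with each single-byte character set.
decoded = {
    name: [bytes([value]).decode(name) for value in samples]
    for name in codecs_to_compare
}

for name, chars in decoded.items():
    listing = ", ".join(
        f"0x{value:02X} -> U+{ord(ch):04X}" for value, ch in zip(samples, chars)
    )
    print(f"{name:11} {listing}")

# cp1252      0x80 -> U+20AC, 0xA4 -> U+00A4, 0xBD -> U+00BD
# latin-1     0x80 -> U+0080, 0xA4 -> U+00A4, 0xBD -> U+00BD
# iso8859-15  0x80 -> U+0080, 0xA4 -> U+20AC, 0xBD -> U+0153
```

The euro sign sits at 0x80 in CP1252 and at 0xA4 in ISO 8859-15 and is absent from ISO-8859-1, while 0xBD is ½ in two of the sets and œ in the third, so a byte stream means nothing until you know which set it was written in.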
 >
 > I hope you now understand.
 >
 >
 >
 >>     8-bit character set is the core of Unicode.
 >     I’m very surprised, speechless.
 >
 > Okay, Jean. You win. I am speechless too wondering why people have to 
go through all the rigmarole of perfecting a Unicode code block to get 
to a thing called Plain Text and then continue to struggle making it 
work with all the sweat, blood and money to finally end up with a 
crippled system just to protect the reputation of a handful academics 
and technocrats when transliteration is very simple and works instantly.
…
 >
 >>     It is the best place to park any language because it is the 
most stable part of the Unicode character database.
 >     I wouldn’t like to type in Japanese with 8-bit fonts, changing 
the font for every new glyph. I tried it once on Windows 3.1… It was 
very time-consuming.
 >
 > You caught the slip! I did not make one qualification. When I write 
in Singhala, I make a point to say that this does not include CJKV. 
Didn't do it here. I accept my fault.. (Actually, I have to fine my 
translator)
 >
 >
 >
 >>
 >>
 >>>
 >>>             Yes, there may remain some issues with older OSes that 
have limited
 >>>             support for standard OpenType layout tables. But 
there's now no
 >>>             problem at all since Windows XP SP2. Windows 7 has the 
full support,
 >>>             and for those users that have still not upgraded from 
Windows XP,
 >>>             Windows 8 will be ready in next August with an upgrade 
cost of about
 >>>             US$ 40 in US (valid offer currently advertized for all 
users upgrading
 >>>             from XP or later), and certainly even less for users in 
India and Sri
 >>>             Lanka.
 >>>
 >>>         The above are not any of my complaints.
 >>>         Per Capita Income in Sri Lanka $2400. They are content with 
cell phones. The practical place for computers is the Internet Cafe. 
Linux is what the vast majority  needs.
 >>>
 >>>
 >>>             And standard Unicode fonts with free licences are 
already available
 >>>             for all systems (not just Linux for which they were 
initially
 >>>             developed);
 >>>
 >>>         Yes, only 4 rickety ones. Who is going to buy them anyway?
 >>         Why would you buy them if they’re free?
 >>
 >>     Brilliant!
 >>
 >>
 >>
 >>>          Still Iskoola Pota made by Microsoft by copying a printed 
font is the best. You check the Plain Text by mixing Singhala and Latin 
in the Arial Unicode MS font to see how pretty Plain text looks. They 
spent $2 or 20 million for someone to come and teach them how to make 
fonts. (Search ICTA.lk). Staying friendly with them is profitable. World 
bank backs you up too.
 >>>         Sometime in 1990s when I was in Lanka, I tried to select a 
PC for my printer brother. We wanted to buy Adobe, Quark Express etc. 
The store keeper gave a list and asked us to select the programs. 
Knowing that they are expensive, I asked him first to tell me how much 
they cost. He said that he will install anything we wanted for free! The 
same trip coming back, in Zurich, the guys tried to give me a illicit 
copy of  Windows OS in appreciation for installing German and Italian 
(or French?) code pages on their computers.
 >>>
 >>>             there even exists solutions for older versions of iPhone
 >>>             4. OR on Android smartphones and tablets.
 >>>
 >>>         Mine works in them with no special solution. It works 
anywhere that supports Open Type -- no platform discrimination
 >>         Is there any platform discrimination with Unicode Sinhala?
 >>
 >>     You mean Apple / Windows / Linux?
 >     I mean what you meant one line above.
 >
 > Okay, whatever.
 >
 >
 >
 >>     Not really, but Microsoft was ahead of others. They all just 
support the crippled system.
 >>
 >>
 >>
 >>>
 >>>             No one wants to get back to the situation that existed 
in the 1980's
 >>>             when there was a proliferation of non-interoperable 8 
bit encodings
 >>>             for each specific platform.
 >>>
 >>>         I agree. Today, 14 languages, including English, French, 
German and Italian all share the same character space called ISO-8859-1.
 >>         In fact, ISO-8859-1 is not well suited for French (my native 
language): it lacks a few letters which were added to ISO-8859-15. 
However, I always use Unicode today, even for French-only texts.
 >>
 >>     Jean, you are lucky because you use Latin letters. Latin letters 
are always bare individual letters. Sinhala is not so. It has all these 
other shaping complications and special rules are applied per Complex 
language.
 >>
 >>     I think you could appreciate my dilemma. This is how I see it. 
Going outside ISO-8859-1 is lot of trouble.
 >     Which ones?
 >
 > Like the Added Bonus characters, specifically, the ones with macrons 
and dots
But making an 8-bit Sinhalese font already is going outside ISO-8859-1.
 >
 >
 >>     Should I enumerate them to you?
 >     Why not?
 >
 > It becomes what we call in Singhala, hooðahooða madee = හෝදහෝද මඩේ -- 
getting in out of the puddle repeatedly washing your feet in it.
 >
 >
 >
Received on Tue Jul 17 2012 - 13:14:23 CDT