From: linguist@artstein.org
Date: Tue Jul 27 2010 - 23:51:36 CDT
Hi Kamal,
Thanks for the helpful comment -- especially the URLs. A quick check
showed that at least on the BBC, U+064A and U+06CC are used
interchangeably, even in final position where the glyphs differ. My
Pashto is extremely weak, but even I can recognize that in the
following article, both 06A9 0631 0632 06CC (in the headline) and 06A9
0631 0632 064A (in the first line of text) spell the name of the
Afghan president.
http://www.bbc.co.uk/pashto/afghanistan/2010/04/100411_hh-kandahar-clash.shtml
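To make the comparison concrete, here is a minimal Python sketch (not part of the original exchange) that builds the two spellings from the code points quoted above and shows where they differ. The helper names are made up for illustration. It also checks an assumption worth stating explicitly: standard Unicode normalization does not merge U+064A and U+06CC, so the two spellings remain distinct strings even after NFKC.

```python
import unicodedata

# The two spellings quoted above, built from their code points.
headline = "\u06A9\u0631\u0632\u06CC"   # ends in U+06CC
body_text = "\u06A9\u0631\u0632\u064A"  # ends in U+064A


def differing_positions(a, b):
    """Return the indices at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def same_after_normalization(a, b, form="NFKC"):
    """Compare two strings after applying a Unicode normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


if __name__ == "__main__":
    print(headline == body_text)                           # False
    print(differing_positions(headline, body_text))        # [3]
    print(same_after_normalization(headline, body_text))   # False
```

Any search or comparison that should treat the two spellings as the same word therefore needs its own folding step; normalization alone will not do it.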
The pattern I thought I had noticed, with an emerging distinction
between yeh with and without dots in final position, appears to be a
fluke of the data I had examined. In a broader sampling of texts,
writers use both U+064A and U+06CC and don't care much about whether
dots appear on the final forms.
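A sampling like this can be tallied mechanically. The following is a rough Python sketch, under stated assumptions: words are split on whitespace, the "final" character is taken to be the last character whose Unicode category is a letter (so trailing punctuation and combining marks are skipped), and the sample string is synthetic, not taken from any of the sites mentioned. The function names are hypothetical.

```python
import unicodedata
from collections import Counter

# The two yeh code points under discussion.
YEH_VARIANTS = {"\u064A": "U+064A", "\u06CC": "U+06CC"}


def final_letter(word):
    """Return the last letter of a word, skipping punctuation and marks."""
    for ch in reversed(word):
        if unicodedata.category(ch).startswith("L"):
            return ch
    return None


def count_final_yeh(text):
    """Count how often each yeh variant appears in word-final position."""
    counts = Counter()
    for word in text.split():
        last = final_letter(word)
        if last in YEH_VARIANTS:
            counts[YEH_VARIANTS[last]] += 1
    return counts


if __name__ == "__main__":
    # Synthetic sample: the same word three times, one followed by an
    # Arabic comma (U+060C), with mixed final yeh code points.
    sample = ("\u06A9\u0631\u0632\u06CC "
              "\u06A9\u0631\u0632\u064A\u060C "
              "\u06A9\u0631\u0632\u06CC")
    print(count_final_yeh(sample))
```

Running the same count per article or per author would show whether the mixing happens within a single writer's text or only across writers.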
I'm still a bit flummoxed as to how a single writer can produce U+064A
and U+06CC in such an apparently random fashion, given that they
require distinct keystrokes. The Mac on which I am presently writing
(actually my wife's computer) has an "Afghan Pashto" keyboard layout
where U+06CC is produced by the "d" key in the QWERTY layout, and
U+064A is produced by shift+d (this is the same as in the keyboard
layouts set by Iranian standards ISIRI 2901 and ISIRI 9147). Are the
BBC typists randomly pressing shift when typing yeh?
On a similar note, it didn't take me too long to find an article where
the word "Pentagon" had two variants for the "g" character -- U+06AB
in the headline, U+06AF in the first line of text.
http://www.dw-world.de/dw/article/0,,5842070,00.html
In my Afghan Pashto keyboard layout, these characters are ' and
option+' respectively. Are the Deutsche Welle typists randomly
pressing option when typing gaf?
(These are intended as rhetorical questions, but if someone has an
answer I'd be happy to hear it.)
-Ron.
Quoting "Mansour, Kamal" <kamal.mansour@monotypeimaging.com>:
> Ron, as you've already noticed, there can be multiple conventions
> for the orthography of a single language.
>
> For the Yeh repertoire, typically the following are used:
> U+06CC
> U+06CD
> U+06D0
>
> For a current corpus, have a look at BBC News
> (http://www.bbc.co.uk/pashto) and Deutsche Welle
> (http://www.dw-world.de/)
>
> Kamal
>
>
> On 2010.7.22 10:17, "linguist@artstein.org" <linguist@artstein.org> wrote:
>
> Hi,
>
> This is a query I had originally sent to the Linguist List, modified
> based on feedback I got there. I am hoping that someone in the Unicode
> community can help resolve this.
>
> I'm interested in knowing if there is a standard way to encode the
> various Pashto yeh-characters in Unicode, and if so, what it is. This
> question is a bit more complicated than it sounds, so here's the
> background.
>
> Pashto is written using a derivative of the Arabic script. The Arabic
> language uses a single character for both /j/ and /i:/ sounds. Like
> many Arabic characters, this one is composed of a base form (which
> changes shape based on its position in a word) and dots (in this case,
> two dots below the base form). In most of the Arabic-speaking world
> the dots are present with both the medial and final form, though in
> Egypt (and possibly other places) the convention is to have two dots
> on the medial form but leave them off the final form. The standard
> arrangement of the two dots is horizontal, but they can be placed
> vertically or diagonally with no change in meaning.
>
> Persian also uses a single character for /j/ and /i:/, with the
> convention of two dots on the medial form, no dots on the final form
> (same as in Egypt).
>
> The two conventions for the /j/-/i:/ character were given distinct
> code points in Unicode despite the fact that they do not contrast;
> documentation is scarce, but presumably this was done in order to
> allow writing both Arabic and Persian in the same document. Therefore,
> Unicode has the following code points (I'm not giving the names, but
> rather the typical visual representation of the glyphs and typical use).
>
> U+064A two dots medially and finally (/j/-/i:/ Arabic convention)
> U+06CC two dots medially, none finally (/j/-/i:/ Persian convention)
>
> There are a few additional yeh-base code points defined, some of which
> are relevant to Pashto (see below).
>
> U+0649 no dots medially or finally (Arabic /a/ from etymological /j/)
> U+0626 hamza above medially and finally (Arabic glottal stop in
> certain contexts)
> U+06D0 two dots medially and finally in vertical arrangement
> U+06CD tail and no dots in final position
>
> As it so happens, there is much confusion in how these characters are
> used in actual electronic documents, which is not surprising given
> that U+06CC looks like U+064A in medial position but like U+0649 in
> final position. There is an excellent article by Jonathan Kew that
> sorts out what this means for various languages that use derivatives
> of the Arabic script.
>
> http://scripts.sil.org/cms/scripts/render_download.php?site_id=nrsi=file_id=arabicletterusagenotes=ArabicLetterUsageNotes.pdf
>
> Unfortunately, this article does not discuss Pashto. I have little
> knowledge of the language, but here's what I managed to understand
> from the inspection of a few documents and with the help of friendly
> people on the Linguist List (and please correct me if I'm wrong).
>
> Traditionally, Pashto used a single character with the same convention
> as in Persian, of two dots in the medial form and none on the final
> form, and with no significance attached to the visual arrangement of
> the dots. The character was three-way ambiguous between the sounds /j/,
> /i:/ and /e/. In recent decades (probably since the 1970s or 1980s)
> there has been some differentiation, partly due to changes in the
> typesetting process and partly due to a deliberate effort of the
> Pashto Academy at the University of Peshawar, Pakistan.
>
> One convention that has gained fairly wide acceptance is a distinction
> between a horizontal arrangement of the dots, representing /j/ or /i:/
> as in Arabic and Persian, and a vertical arrangement representing the
> sound /e/. This distinction is the same as in Uighur, and the
> character with vertical dots has been codified as U+06D0. Additional
> conventions include a hamza (U+0626) or tail (U+06CD) to represent /j/
> at the end of a word in certain grammatical markers. All of these are
> quite standard by now and do not pose much of a problem.
>
> However, a further convention appears to have arisen, which as far as
> I can tell is unique to Pashto in that it distinguishes between /j/
> and /i:/ (though only in word-final position):
>
> /j/ is written with two dots medially, none finally
> /i:/ is written with two dots both medially and finally
>
> I have never seen this codified explicitly, but this is the impression
> I get from examining a few recent Pashto documents. Which brings me to
> my original question, of how to represent these characters in Unicode.
> The linguist in me notices a correspondence between sounds and Unicode
> code points (which, given the history I have just described, is most
> certainly accidental):
>
> /j/ corresponds to U+06CC
> /i:/ corresponds to U+064A
>
> The Wikipedia article on the Pashto alphabet
> http://en.wikipedia.org/wiki/Pashto_alphabet gives a different
> correspondence, based on visual appearance:
>
> forms with dots: U+064A (/i:/ and /j/ medially, /i:/ finally)
> forms without dots: U+0649 (only /j/ in word-final position)
>
> And there is yet a third convention, which I encountered in an
> electronic lexicon and also appears in the following document:
> http://www.afghanan.net/pashto/pashto%20alifba.pdf
>
> U+06CC: medial forms with dots (/i:/ and /j/) and dotless final form (/j/)
> U+064A: final form with dots (/i:/)
>
> To wrap up, are my observations about the Pashto writing conventions
> correct? And is there a standard for assigning the Pashto characters
> representing /j/ and /i:/ to Unicode code points?
>
> -Ron.