encoding ext Latin for PNG

From: Peter Constable (peter_constable@sil.org)
Date: Thu Mar 09 2000 - 19:57:18 EST


       We're taking a look at the encoding needs of one of our SIL
       field entities - in Papua New Guinea - in relation to Unicode
       as a trial for doing the same for our other entities around the
       world that use extended Latin script. There are some issues
       that arise, and I want to see if these have been considered
       before and, if so, whether there are established preferences
       for how these matters should be dealt with.

       The following discussion makes reference to SIL PNG's standard
       codepage, shown in the attached PDF. My understanding (I need
       to double check this) is that this codepage covers the
       character needs of most or all of the written languages of PNG
       (which number in the hundreds).

       The first question relates to the characters 0x8D and 0x8E, L/l
       with equal sign overlay. Neither is currently defined in
       Unicode, nor is there a combining equal sign overlay
       character. Would it be preferable to propose the addition of
       one combining character or of a pair of composite characters
       (with no canonical decomposition)?

       The second question relates to the following pairs of
       characters:

       0x8F, 0x90 L/l with tilde overlay
       0x9A, 0x9B U/u with middle bar
       0xD0, 0xF0 L/l with middle bar

       For each of these pairs, the lower case character - and only
       the lower case character - is already defined in the standard:

       U+026B LATIN SMALL LETTER L WITH MIDDLE TILDE
       U+0289 LATIN SMALL LETTER U BAR
       U+019A LATIN SMALL LETTER L WITH BAR

       All three of these characters could potentially have canonical
       decompositions using existing characters, but in fact none of
       these three characters has a canonical decomposition.
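
       The absence of canonical decompositions can be checked
       mechanically. What follows is only a minimal sketch, assuming
       Python and its standard unicodedata module (whose data reflects
       whichever Unicode version the interpreter ships with); the
       helper name canonical_decomposition is illustrative and not
       part of any standard API.

```python
import unicodedata

# The three lower-case characters listed in the message.
LOWER = ["\u026B", "\u0289", "\u019A"]


def canonical_decomposition(ch: str) -> str:
    """Return the canonical decomposition mapping of ch, or '' if none."""
    mapping = unicodedata.decomposition(ch)
    # Compatibility mappings start with a tag such as '<compat>';
    # they are not canonical, so treat them as "no decomposition".
    if mapping.startswith("<"):
        return ""
    return mapping


for ch in LOWER:
    name = unicodedata.name(ch)
    print(f"U+{ord(ch):04X} {name}: {canonical_decomposition(ch)!r}")
```

       An empty string for each of the three characters means that
       normalization to NFD leaves them untouched.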

       The upper case counterparts to all three could be encoded using
       combining sequences as follows:

       L with tilde overlay: 004C + 0334
       U with middle bar: 0055 + 0335
       L with middle bar: 004C + 0335

       (It's not entirely clear that U+0335 is the appropriate
       combining mark for the latter two; the distinction between
       U+0335 and U+0336 appears to be purely visual. U+0335 seems to
       me to be the better choice here. I think it would be good to
       clarify which should be used for cases like this.)
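
       To make the proposed sequences concrete, here is a small
       sketch, again assuming Python's unicodedata module. The names
       SEQUENCES, describe and is_stable are invented for this
       illustration. It spells out each sequence and checks that
       neither NFC nor NFD normalization changes it, which is what
       one would expect given that no precomposed equivalents exist.

```python
import unicodedata

# Upper-case forms written as base letter + combining overlay mark.
SEQUENCES = {
    "L with tilde overlay": "\u004C\u0334",
    "U with middle bar": "\u0055\u0335",
    "L with middle bar": "\u004C\u0335",
}


def describe(seq: str) -> list[str]:
    """List the code points of seq together with their character names."""
    return [f"U+{ord(c):04X} {unicodedata.name(c)}" for c in seq]


def is_stable(seq: str) -> bool:
    """True if seq is unchanged by both NFC and NFD normalization."""
    return all(unicodedata.normalize(form, seq) == seq
               for form in ("NFC", "NFD"))


for label, seq in SEQUENCES.items():
    print(label, describe(seq), "stable:", is_stable(seq))

# The two candidate marks for the bar differ only in stroke length.
for mark in ("\u0335", "\u0336"):
    print(f"U+{ord(mark):04X} {unicodedata.name(mark)}")
```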

       The question is this: Is there any potential problem having a
       Ll character with no decomposition that gets case mapped to an
       Lu character that is defined only as a (decomposed) sequence?
       The alternative would be to propose the upper case characters
       as additions to the standards, but if added they would
       certainly have to be added without canonical decompositions. (I
       don't think we'd want a case pair where one is decomposable but
       the other is not. Decompositions for the lower case characters
       could, of course, in principle be added. But any additional
       decompositions are to be avoided at all costs since they create
       problems for existing implementations.)
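
       The practical effect of the first option can be sketched in
       code. The table below is purely hypothetical: it encodes the
       mapping discussed in this message and is not taken from the
       Unicode character database. The names TO_UPPER, TO_LOWER,
       png_upper and png_lower are made up for the illustration, and
       the lower-casing step assumes the overlay mark immediately
       follows its base letter.

```python
# Hypothetical case table: one lower-case code point maps to an
# upper-case sequence of base letter + combining overlay mark.
TO_UPPER = {
    "\u026B": "L\u0334",  # l with middle tilde -> L + tilde overlay
    "\u0289": "U\u0335",  # u bar               -> U + short stroke overlay
    "\u019A": "L\u0335",  # l with bar          -> L + short stroke overlay
}
TO_LOWER = {seq: ch for ch, seq in TO_UPPER.items()}


def png_upper(text: str) -> str:
    """Upper-case text, using the hypothetical table where it applies."""
    return "".join(TO_UPPER.get(ch, ch.upper()) for ch in text)


def png_lower(text: str) -> str:
    """Lower-case text, folding known two-code-point sequences first."""
    out = []
    i = 0
    while i < len(text):
        pair = text[i:i + 2]
        if pair in TO_LOWER:
            out.append(TO_LOWER[pair])
            i += 2
        else:
            out.append(text[i].lower())
            i += 1
    return "".join(out)


word = "ka\u0289"
upper = png_upper(word)
print(len(word), len(upper))      # the string grows by one code point
print(png_lower(upper) == word)   # the round trip is preserved
```

       Under these assumptions the mapping round-trips, but case
       conversion changes the length of the string and lower-casing
       has to look at more than one code point at a time, which a
       simple one-to-one case table cannot express.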

       You may notice some other items of curiosity in this codepage.
       I don't have all the facts yet, so I'm not looking to discuss
       anything more than the questions I've raised here.

       Peter


