Document L2/03-416
for UTC and X3L2 meeting, 4 November, 2003

The Cuneiform Encoding Proposal -- a View of its Current Status

The following is written by Lloyd Anderson, 3 November 2003.
It refers to the proposal posted at
http://www.evertype.com/standards/iso10646/pdf/n2664-cuneiform.pdf

Much supporting information on various points is now posted, or will be posted in December, on the web site
http://www.CuneiformSigns.org

***

Our work is an encoding for one of the most important bodies of information in human history, and if our encoding fits the writing system like a glove, it will make both data handling and additional discoveries much easier. To draw a parallel: when one assyriologist was discussing a hypothesis about the earliest records of lunar cycles, the question of a 19-year cycle came up, and I suggested simply arranging the data in 19-year columns so we could see visually any repeating patterns. The result was almost immediate: we could see that the cycles were strongly established almost 200 years before we had previously thought they were (before 700 BC instead of at 524 BC). I believe encoding Cuneiform so it works smoothly will have similar large effects on future discovery and understanding.

In support of this effort, I am contributing results from a concordance to all major sign lists which I am generating in the process of producing an etymological dictionary of the origins and development of Cuneiform signs.

***

It is probably most useful if I summarize my views of where we are now, and what remains to be done.

The proposal which is before the UTC has been drafted with input from quite a number of people, and has benefited greatly from the efforts of Dean Snyder here at Johns Hopkins, in bringing people together so that more professional assyriologists have chosen to get involved. That proposal is progressing relatively well.
We have a substantial repertoire under discussion, and will be refining that as to both sign inclusion / exclusion and sign names.

Here follow some issues which are still being worked on or for which more information will be gathered.

***

Cuneiform is unlike Han characters because Han uses blocks of constant size, so there is never any doubt where one character ends and the next begins. Cuneiform signs are by contrast of varying width. It is unlike Latin, because any combining elements are not singled out as different from base characters when standing alone. We must use more subtle methods in the inevitable borderline cases of various kinds, to identify just which are the independently functioning distinctive characters of the writing system.

We are devising an encoding for a large historical range, of thousands of years, during which time of course some changes occurred. Our principle, fully agreed to, is that a distinctive contrast occurring anywhere in the time we cover must be provided for in the character set, even if not all users will need it at all times. Just as with extended-Latin or any other script. We are increasingly conscious that we run into a few odd cases when doing this, and that we cannot have identical text content of all eras encoded the same way if the set of distinctive elements has changed across those eras and we want them each to be encoded true to their own system of distinctions. In some cases we may make compromises among our several goals.

Specifics:

***

1. Number signs

Cuneiform number signs in general do not share glyphic appearances with signs used for non-numerical text. This is especially clear when we consider the historical range of Cuneiform, since signs which look identical in one era are distinct in another era. No problem here.

There are a very few signs where identity of form is complete or nearly complete between a number sign and a non-number sign.
We will be working further on these to see what will best serve the community of users of Cuneiform texts.

An illustration of some of the oldest number signs will accompany this document at the UTC meeting of 4 November, 2003, to demonstrate the combining-diacritic pattern among early Cuneiform number signs (those overlaid marks which signaled what kind of thing was being counted).

***

2. So-called "compound signs".

We have progressed beyond a blanket rule that anything *called* a compound sign is to be encoded as glyphs which we define as its parts and into which we fragment it. A few such signs may be encoded as single characters we treat as atomic, even if some have sometimes treated them as a sequence. This can be for different reasons:

(a) because in fact they both appear differently and are in functional contrast with the mere sequence of two other signs which look similar to the parts we claim to see, even if they are not distinguished under all circumstances; or

(b) because the political repercussions of not doing so would be a widespread rejection of the encoding by those for whom it is intended, with some very high-profile and/or high-frequency signs.

With the advice of Ken Whistler, our active participants on Nov. 3rd agreed to treat these two spellings as NOT canonically equivalent, and to accept that those who do not understand the texts would be prone to a few spelling errors. Since only professional assyriologists are likely to be inputting significant amounts of text in any case, this was regarded as not a significant problem, but the issue will be discussed further.

Some similarities may exist here to the historical debates over the digraph æ, which is a ligature for users of English, but is a single atomic character for users of Danish. So it was encoded as a distinct character, and is not in Danish usage a structural ligature at all, no more than is the ampersand "&".

***

3.
For both Compound signs (just above) and Container-Infixed signs (next), we are increasingly recognizing that these are not simple or straightforward categories, that there are several groups of signs under each blanket term, and that we *may* choose to distinguish such groups in the final proposal. Additional data and patterns of signs are constantly being accumulated, so we will have gradually increasing support for our choices.

***

4. Container-infixed signs

These can be encoded either as atoms or as code sequences. Our group has so far chosen to encode primarily as atoms. Either is workable and extendable to additional signs as they are discovered, but under particular conditions.

Chief advantages of atomic coding:

The parts of A-with-infixed-B may develop in the combination the same way they do when independent signs, or they may develop in a special way in the combination. We can handle certain changes in fused components over time and treat the sign as still the "same" sign, so texts retain their identity in encoded form across at least a substantial span of historical change, as when an original component NA is replaced later by the rather similar KI, yet the sign as a whole retains its identity. This is an especially obvious solution for irregularities in signs which mostly behave as fused, so that the "parts" cease to be recognizable. In other words, deep etymological origins can be disregarded in such cases; we are not *forced* to encode the sign two different ways simply because we know that it underwent some change and is not a direct inheritance.

Chief advantages of code sequences (SIGN SIGN or more inclusively SIGN SIGN):

Certain of the "container" signs are highly productive, permitting many infixed signs. The vast majority of signs we have found which may be added to the repertoire are of the form container-sign-with-infixed-sign. One of the container signs (GA2) takes the widest range of infixed signs.
It may have conveyed the meaning content "basket of ____", so utterly transparent that it is like an independent phrase in a sentence. That will be more conveniently encoded as a sequence of codes, CONTAINER SIGN INFIXED SIGN. There are other complexes container-with-infix which are at the opposite extreme, fused and not productive.

Glyphic representation:

All systems which can handle the code points for Cuneiform will also be able to handle fonts in which a sequence of codes is represented by a single glyph. So font makers can add support for new sequences and thus new container-infixed signs without needing any change in the standard, if encoding is as a sequence SIGN SIGN. The default binary sort order will continue to work for new signs encoded that way.

There are some fluctuations either at one time or across eras: what appears as A with B infixed at one time may appear as A followed by B at another time. Recognition of equivalences between infixed and extraposed versions of the "same" sign is much easier if the container-infixed signs are encoded as sequences. Yet there are not very many examples of this type.

Neutral or nearly so:

Sort order can keep all signs with the same "Container" component together either by binary sort order or by table-driven sort, under either method of encoding. A difference is that new signs with known components will not be automatically sorted correctly if container-infixed signs are encoded as atoms: not in binary sorts, and not in table-driven sorts until the table is modified. In the case of encoding as atoms, the table can specify sorts *as if* sequences SIGN SIGN, so new container-infix signs can be added; all signs known to be distinct should however be included from the beginning, since only the container-infix ones can be added later and sorted correctly by default.
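
The sorting difference described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every code point below is a hypothetical private-use value, not one taken from the proposal; the infix-marker character and its position below all signs are assumed; and the sign names are placeholders.

```python
# All code points are hypothetical private-use values, for illustration only.
INFIX = "\uE000"   # assumed infix-marker character, placed below all signs
GA2 = "\uE010"     # a container sign
LAGAB = "\uE020"   # another sign, sorting after GA2
AN = "\uE030"      # a sign that can be infixed
NEW = "\uE040"     # a sign newly discovered as infixed in GA2


def as_sequence(container, infixed):
    """Sequence encoding: container, infix marker, infixed sign."""
    return container + INFIX + infixed


# Atomic encoding: each container-infixed sign has its own code point.
GA2_X_AN_ATOM = "\uE011"    # allocated next to GA2 from the beginning
GA2_X_NEW_ATOM = "\uE0F0"   # added later, so it lands far from GA2

# Table-driven sort for atoms: each atom sorts *as if* it were a sequence.
SORT_AS = {
    GA2_X_AN_ATOM: as_sequence(GA2, AN),
    GA2_X_NEW_ATOM: as_sequence(GA2, NEW),
}


def table_key(text):
    """Expand each atom to its sort-as sequence; other signs pass through."""
    return "".join(SORT_AS.get(ch, ch) for ch in text)


seq_signs = [LAGAB, as_sequence(GA2, NEW), GA2, as_sequence(GA2, AN)]
atom_signs = [LAGAB, GA2_X_NEW_ATOM, GA2, GA2_X_AN_ATOM]

# Binary sort of sequences: the new sign stays with its container.
binary_seq = sorted(seq_signs)
# Binary sort of atoms: the late addition sorts after LAGAB, away from GA2.
binary_atoms = sorted(atom_signs)
# Table-driven sort of atoms: correct, but only once the table has the entry.
table_atoms = sorted(atom_signs, key=table_key)
```

Under these assumptions the sequence encoding keeps the new sign next to GA2 with no table at all, while the atomic encoding depends on the `SORT_AS` entry being added for each new sign.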
A particular encoded character will be needed which we can refer to as INFX, or better both to support the table sort instructions and to permit the addition of new container-infixed signs.

A minor problem with encoding as sequences.

We NEED CODE POINT or better under EITHER encoding method.

Problems of hierarchy of the following types (nested infixation or infixation of a sign sequence not merely of a single sign) are very rare; (A B) (C D) or A (C D) so we can probably manage without support for hierarchies, handle only the simple containers with simple infixed signs. Having a large number of atomic encoded signs decreases need for such hierarchical structures.

***

5. Sort order:

I can see no justification yet for any binary sort order other than the de facto standard. And the same for the default table sort order.

The traditional sort order reflects the dominant practice across generations of scholarship. It is based on the types of wedges in the signs, and their arrangement. This has been most often done for later forms of the signs, specifically for Neo-Assyrian. When done for other scribal traditions, the sign order may be slightly different because the actual sign forms are used, or it may be the same as the most common standard because the forms of equivalent signs of the standard NeoAssyrian may be used in determining the sort order.

Concerns about where to interleave additional signs which do not have a later sign representative are now to me very small. Ellermeier's list has already done this for the matches between Neo-Assyrian and the quite old Gudea (Lagash) signs, and most of the rest are equally straightforward.

The proposal currently before you uses an ordering based on a selection from among the many possible ways in which each particular sign is named. Some of this will change since some of the name choices are under discussion and are expected to change.
It is my experience that it is almost *NEVER* a good idea to change a de facto standard, unless there is an overwhelming preponderance of evidence that the newer way will be overall *substantially* better. The reason is that the change itself has such a high cost. Proponents of new ways almost always overrate the virtues of a new way, and underrate the virtues of stability. Encoding standards emphasize stability, and I hope we can in this matter also.

***

6. Sort order and sign identity implementations:

There are a host of particular decisions that cuneiform specialists will need to make about sorting and searching particular sets of signs, and on treating them as same vs. distinct. Also about what changes across historical eras are such as to force us to encode signs differently, and what changes we can treat as purely glyphic, not affecting text content encoding.

I am compiling lists of the most difficult borderline cases to highlight for specialists to consider decisions on them. Most of them have not been offered yet for any discussion at all.

***

7. Inclusion of older signs (Fara, Uruk, etc.).

Assyriologist specialists are most reluctant to consider encoding older signs because they feel that knowledge of the older eras is not complete. Yet inclusion of all signs on which we have secure knowledge is especially important now before an encoding standard reaches a final proposal. What is known about the older signs can make encoding even of later stages more useful, so the encoding fits the long cuneiform tradition more "like a glove".

The wording of the proposal before the UTC is consistent with the approach I am urging, where it says that we "will take into account factors arising from the earliest stages of cuneiform to the extent that these are already known and understood". It is inconsistent where it says there must be a first stage which "will *not* include Archaic Cuneiform". That is a large generalization.
If we add to the previous statement that we "will take into account both signs and factors arising from the earliest stages of cuneiform to the extent that these are already known and understood" I think we would have full agreement in principle.

There is still the question of agreement in practice. There is a great reluctance concerning the earlier stages. I believe this reluctance is in part perfectionism, in part simply the avoidance of what can be a lot of work ("no special effort has been made to go back farther than Ur III"), and in part a consequence of using blanket terms to cover a complex and multi-faceted situation.

Instead of using a word like "archaic", I believe our principle should simply be the one stated in the proposal, that we take into account all secure knowledge of all eras from the beginning of the encoding process. That allows us to use late "dialects" of cuneiform and early attestations of cuneiform (including its earliest forms produced not using wedge tools) whenever we have solid information, and to disregard them when we do not. No one is compelled to work on the earliest levels, or on the latest for that matter, but we should not throw away information which is easily available.

Nor is absolute comprehensiveness a goal -- "comprehensiveness at the level of pre-Old-Akkadian periods is not appropriate given the current state of paleographic research" is certainly a common-sense statement. On the other hand, a large proportion of pre-Old-Akkadian signs are known and understood, including some which function differently from later signs.

To support this inclusion of more solid knowledge, I have been pressing forward more rapidly with a complete concordance to all eras of cuneiform, and expect a nearly publishable version will be done by the end of December.

Here follow, as illustrations, comments about what is known of two features of older signs, to clarify why they are *not* a problem.

***

8.
"Turned" signs

In the Uruk IV stage, a number of signs occur turned 90 or 180 degrees, or 45 degrees. The vast majority of such turned signs do not occur later; they occur only in that oldest layer with substantial writing, Uruk IV. (R. Englund statement).

The vast majority of turned signs appear never to have expressed any distinction in content. They were merely random variation or adjustment of their glyphic shape to better fit their context.

One possible method to represent this is to have a few COMBINING CHARACTERS (turned 90 degrees, 45 degrees, 180 degrees) or the existing variant-selection characters, which can be used as needed, and which can be disregarded if, as appears to be the case, they do not convey differences of content.

We can quite easily list the few cases where turning does create a separate sign, as $E (vegetation, originally upright) vs. harvested vegetation (originally horizontal); or a reversed hand with a meaning referring to the left-hand, or the like. The number of unclear cases, where we can't tell whether we have any substantive evidence for a significant distinction, is tiny.

This paragraph is intended to implement what I agree with Englund is an approach placing high value on security of analysis, and the provision of a mechanism to make additional distinctions which *may* be significant although we do not yet want to attribute status as fully independent signs.

***

9. "Fused" signs in the archaic typography of Uruk.

It is possible to distinguish even in archaic cuneiform between freely recombining signs or sign components, and those combinations which are fused in some way (which together essentially constitute a single sign), as NAMESHDA, NAM x SHE, etc. Citations will be provided to the specialist literature.

Similarly, other questions in archaic signs are not difficult to handle.
Some comments from Englund (1998) on what he considers conservative choices in these matters, and what is known, will be available by noon 4 November on the web page
http://www.CuneiformSigns.org/Strategy.htm

***

10. Implementation specifications are needed.

The major lack I see currently in preparation of the encoding proposal is the lack of a draft of implementation guidelines. The task of preparing such guidelines will bring to the attention of all involved a number of technical and structural questions which it is otherwise much too easy to pass by unnoticed, and questions about individual characters.

Here are a couple of sentences from the proposal which have implications for implementation, but whose implications I think have not yet been discussed, at least not publicly.

The two following sentences both apply to splits of what was one character into two significantly distinct characters, but they recommend different implementations. There is nothing wrong with having these two different implementation strategies available, because there are situations appropriate to each. But the decision will have to be made for each split or merger, and specialists will need to consider each of these. Perhaps generalizations can be discovered which work, but it can be dangerous if we let the generalizations themselves become the goal, rather than the goal being the best implementation for each split and each merger on its own terms. Facing such implementation questions is simply one of the consequences of attempting an encoding across historical changes such as splits and mergers.

(a) "[mergers and splits] must be encoded at the point of maximum differentiation, with reduplication of glyphs as necessary in other periods"

(b) "Glyph variants such as TA*, a Middle Assyrian form of the sign TA which in Neo-Assyrian usage has its own logographic interpretation, will be assigned their own code positions, to be used only when the new interpretation applies."
For sentence (a), if signs S1 and S2 are distinct at one period, then in periods where they are not distinct, the single glyphic rendering shall be duplicated and used for both characters. Users would of course have to choose which of the two characters to input based on their knowledge of the distinction in texts other than the one they are inputting.

For sentence (b), the character code for S2 (TA*) shall not be used except for those texts where S1 (TA) and S2 (TA*) are significantly distinct.

***

Best wishes,
Lloyd Anderson
Ecological Linguistics
PO Box 15156
Washington DC 20003
(202) 547-7683
ecoling@aol.com