=========================================================================
Date:         Thu, 1 Aug 1991 17:34:21 EDT
Reply-To:     "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender:       "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
From:         schein@TOROLAB5.VNET.IBM.COM
Subject:      New version of C0 contribution from Tom Hastings - Digital

ISO
INTERNATIONAL ORGANIZATION FOR STANDARDIZATION
ORGANISATION INTERNATIONALE DE NORMALISATION

Multiple-Octet Coded Character Set
ISO-IEC JTC1/SC2/WG2 N
Date: 27-July-1991

This paper is in response to the proposal that the second ISO DIS 10646 adopt the approach of using the C0 and C1 space for coding graphic characters (and then unifying the coding of the two standards). There is a summary at the end.

Since the first version of this paper, dated 16-July, I've added comments and suggestions (indicated by change bars) [I took the change bars out of this electronic version, because they messed up the formatting. They are in the paper copy that you will get. -Avery]:

1. Fold C1 characters also, since some mail systems remove them.
2. Indicate that escape sequences (control sequences and control strings from ISO 6429 too) can be used in ISO 10646 without interior NULL padding, only initial and final NULL padding.
3. Add ESC 2/5 2/15 F, ESC 2/0 F, and ESC Fs announcer alternatives to using HOP (hex 81).
4. Add parameters to HOP announcer alternatives.
5. Remove announcer for Little Endian, since data should be in a single standard order for interchange and in programs; byte swapping should happen when interchange data is read and written on Little Endian machines.
6. Change terminology from "one-octet form" to "current ISO 2022 form", since ISO 2022 data can be two-byte, such as in current ideographic character sets. Change bars not used, since not a substantive change.

1 Relationship to current coding standards

At first blush, this seems to be a highly incompatible proposal.
However, another way of looking at the proposal is that it expands the fundamental (ISO 2022) coding unit used in character coding from being only one (7-bit or 8-bit) byte to also being two or four octets. With this view, each character in current standards can be thought of as being coded in one, two, or four octets, depending on the form of coding. When a current ISO 2022 conforming standard is represented in this expanded two- or four-octet form, the high-order, most-significant octet(s) are zero (corresponding to the ASCII/ISO 2022 NULL character when viewed in current ISO 2022 form). Thus the C0 and C1 control standards are preserved and can be used in this expanded form, with each C0 (8-bit) bit combination being represented in two octets with the first octet zero.

Converting some software programs in some programming languages may consist solely of specifying that all character data is two or four octets, instead of one octet, and recompiling. This may be worth SC2 pointing out to SC22.

Finally, with this Unicode approach, 65/65536 of the code space is used for coding control functions, which is less than 0.1%, compared to the current ISO DIS 10646, which uses 44% for coding control functions.

In fact, ISO/IEC JTC1 SC2 could consider the expansion to two- and four-octet forms as an addition to ISO 2022 by specifying that an ISO 2022 bit combination can be two or four octets, not just one 7- or 8-bit byte, but more on that later.

NOTE
The reason to consider the expansion as two and four octets, rather than 16 and 32 bits, is 1) to avoid the confusion of Big Endian vs. Little Endian, 2) to make it clear how the data maps onto one-octet transmission channels and storage media, and 3) all protocol standards are octet based.

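The expansion described above can be illustrated with a minimal Python sketch. It assumes nothing beyond the text: each one-octet bit combination is widened by prefixing zero octets, most significant octet first. The function names widen and narrow are hypothetical and are not part of any proposal.

```python
def widen(data: bytes, width: int) -> bytes:
    """Expand one-octet data to the two- or four-octet form.

    Every octet is prefixed with (width - 1) zero octets, so the
    most significant octet(s) come first and are always hex 00.
    """
    if width not in (2, 4):
        raise ValueError("width must be 2 or 4")
    pad = b"\x00" * (width - 1)
    return b"".join(pad + bytes([octet]) for octet in data)


def narrow(data: bytes, width: int) -> bytes:
    """Reverse of widen(); only valid when all high octets are zero."""
    if width not in (2, 4):
        raise ValueError("width must be 2 or 4")
    if len(data) % width:
        raise ValueError("data is not a whole number of characters")
    out = bytearray()
    for i in range(0, len(data), width):
        unit = data[i:i + width]
        if any(unit[:-1]):
            raise ValueError("character is outside the one-octet range")
        out.append(unit[-1])
    return bytes(out)


# CR LF in two-octet form is 000D 000A; ESC in four-octet form is 0000001B.
print(widen(b"\r\n", 2).hex())
print(widen(b"\x1b", 4).hex())
# Share of a 65536-position code space taken by 65 control positions.
print(65 / 65536)
```

Because the padding is always on the most significant side, the widened data is the same on Big Endian and Little Endian machines when treated as an octet string.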
1.1 Representation of escape sequences, control sequences, and control strings

The representation of escape sequences of ISO 2022 and ISO 6429 and the control sequences and control strings of ISO 6429 would not need to require NULL padding of each bit combination in the sequence of bit combinations that follows the introducer character. This is because ISO 2022 and ISO 6429 indicate that the bit combinations that follow (hex range 20 to 7E) are to be interpreted independently of the graphic character set(s) currently designated and invoked. Only initial NULL padding of the introducer character (ESC, CSI, DSC, OSC, PM, etc.) and NULL padding after the last octet to align the next character on a character boundary would be required.

Example: the escape sequence ESC 6/0 (hex 1B60) would be represented as:

    In 2-octets: 001B 6000
    In 4-octets: 0000001B 60000000

Example: the escape sequence ESC 2/5 4/0 (hex 1B2540) would be represented as:

    In 2-octets: 001B 2540
    In 4-octets: 0000001B 25400000

Longer escape and control sequences and control strings approach the efficiency of current ISO 2022 representation.

LET'S NOT ASSIGN GRAPHIC CHARACTERS TO ROW HEX 1B
In order to reduce the possible confusion of ISO 10646M characters with escape sequences, let's skip assigning characters to row hex 1B for now, though 1B appears as the second octet of many characters, so there will still be confusion with existing devices.

1.2 Coding of ISO 10646M controls: PAD, SGCI, HOP, IUCS

The current ISO DIS 10646 uses up 3 precious C1 control character positions (see ISO 6429). They are PAD (hex 80), HIGH OCTET PRESET (HOP = hex 81), and SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI = hex 99). Since PAD can now be done with NUL (hex 00), we don't need PAD.

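To make the padding rule of Section 1.1 concrete, here is a minimal Python sketch that reproduces the two worked examples. It assumes only what that section states: the introducer is padded with leading NULs, the remaining octets are packed without interior padding, and trailing NULs align the next character. The name pad_sequence is hypothetical.

```python
def pad_sequence(sequence: bytes, width: int) -> bytes:
    """Represent an escape/control sequence in two- or four-octet form.

    The first octet (the introducer, e.g. ESC) gets width - 1 leading
    NUL octets; the following octets are appended as-is; NUL octets are
    then added at the end up to the next character boundary.
    """
    if width not in (2, 4):
        raise ValueError("width must be 2 or 4")
    if not sequence:
        raise ValueError("sequence must contain at least the introducer")
    out = b"\x00" * (width - 1) + sequence
    remainder = len(out) % width
    if remainder:
        out += b"\x00" * (width - remainder)
    return out


# ESC 6/0 and ESC 2/5 4/0, as in the examples of Section 1.1.
for seq in (b"\x1b\x60", b"\x1b\x25\x40"):
    print(pad_sequence(seq, 2).hex(), pad_sequence(seq, 4).hex())
```

The sketch does not interpret the sequence; deciding where a sequence ends is left to the caller.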
Since SGCI would need to have C0 and C1 bit combinations following to represent many ISO 10646M graphic characters, SGCI cannot be used in current ISO 2022 coding; therefore, we can code SGCI in ISO 10646M using one of the code extension function code positions: SO (hex 0E), SI (hex 0F), SS2 (hex 8E), or SS3 (hex 8F), since they cannot be used from within ISO 10646M for ISO 2022 code extension. Any of the other C0 or C1 control characters could be used from within ISO 10646M, so we don't want to preclude that by using those bit combinations in row 0 of ISO 10646M.

The IDENTIFY UNIVERSAL CHARACTER SUBSET (IUCS, coded as an ISO 6429 control sequence: CSI Ps... 02/00 6/13, which is hex 9B Ps... 20 6D) is used to identify subsets. We can continue to use it if it is useful outside ISO 10646, or we could use one of the code extension code points: 000E, 000F, 008E, 008F.

2 Interworking/co-existing with existing equipment and software

In order to decide whether we can change ISO DIS 10646 to use the Unicode approach and coding, we have to answer the following three questions:

1. What problems will using the C0 and C1 space for graphic characters in ISO 10646 cause when such data is used with existing equipment and existing software [that only views character data in current ISO 2022 form and so may look for any C0 or C1 octet and take some action]?
2. In order to be fair, we also need to ask the same question about the current ISO DIS 10646: "What problems will using the 02/00 (hex 20) octet (= ASCII SPACE character) for graphic characters in ISO 10646 cause when such data is used with existing equipment and existing software [that only views character data in current ISO 2022 form and may look for the SPACE (hex 20) octet and take some action]?"
3.
We also need to consider future equipment and software that may wish (or need) to support the current ISO 2022 form as well as the new ISO 10646M two- and four-octet forms, in order that new ISO 10646M data and programs can be introduced into existing systems in an evolutionary way.

Existing equipment includes:

    Terminals
    Printers
    Modems
    Terminal/printer concentrators
    Host connections

Existing software includes:

    Operating Systems:
        Terminal and printer I/O drivers
        Command language interpreters
        Call interfaces
        File Systems
    Compilers
    Application programs

3 Terminals and Printers

One area where this new coding form will have a major impact is terminals and printers that use asynchronous, serial full-duplex and half-duplex lines to connect to modems, concentrators, and hosts. There are the following possibilities for system connections using serial-lines:

3.1 Output to existing terminals and printers

Obviously, existing terminals and printers cannot be expected to send/receive the new two- and four-octet data. However, terminals and printers that receive ASCII and ISO 8859-1 one-octet data would be able to receive and image the proposed ISO 10646/Unicode two- and four-octet data forms that are represented by row 0, since the terminals and printers ignore NULL (hex 00) octets.

While existing terminals and printers ignore NULL (hex 00) when received in graphic character data and within single byte control characters, I'm not so sure about embedded NULLs in ISO 6429 escape sequences and control sequences. However, there isn't a need to embed the NULLs in the middle of such sequences (see Section 1.1).

Existing terminals and printers interpret a NULL (or 3 NULLs) preceding the single character format effectors HT, CR, LF, VT, FF, and BS correctly. Even NULL CR NULL LF is interpreted as CR LF on existing terminals and printers.
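As a rough illustration of the NULL-ignoring behavior described above, the following Python sketch models an existing one-octet device as a filter that drops every hex 00 octet. This is a simplifying assumption for illustration, not a model of any particular terminal; the names two_octet_row0 and legacy_receive are hypothetical.

```python
def two_octet_row0(text: str) -> bytes:
    """Encode row-0 (ASCII / ISO 8859-1) text in the two-octet form."""
    out = bytearray()
    for ch in text:
        code = ord(ch)
        if code > 0xFF:
            raise ValueError("only row-0 characters are handled here")
        out += bytes([0x00, code])  # high octet zero, then the character
    return bytes(out)


def legacy_receive(octets: bytes) -> bytes:
    """Model of an existing device that simply ignores NULL octets."""
    return octets.replace(b"\x00", b"")


sent = two_octet_row0("Hi\r\n")
print(sent.hex())            # the octets on the line
print(legacy_receive(sent))  # what the one-octet device acts on
```

Under this assumption the device sees exactly the one-octet text, including CR LF, which is the compatibility property claimed for row 0.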
Existing asynchronous full-duplex serial-line terminals and printers also correctly interpret received NULL XON (DC1 = hex 11) and NULL XOFF (DC3 = hex 13) used for flow control (see below).

These same existing ASCII and ISO 8859-1 terminals and printers receiving ISO DIS 10646 data would image one or three SPACEs between each character. So for output of the ISO 8859-1 subset, the Unicode approach is more compatible with existing terminals and printers than the 1st ISO DIS 10646 coding.

3.2 Output to future terminals and printers

The need for a standardized announcer for switching from current ISO 2022 form to two- or four-octet forms is vital for output to new terminals, since implementation of the terminals is often done by different vendors than the concentrators, modems, and host systems. [We will see later that an announcer is also vital for interchange of file data for software use and for checking that the data is as expected for processing (or to do code expansion/reduction), so we might as well use the same announcers there too.]

3.3 Input from existing terminals

Existing terminals input only in current ISO 2022 form and do not announce that.

3.4 Input from new terminals

New two- and four-octet terminals will probably want to operate in a one-octet ASCII, ISO 8859-1, or other ISO 2022 form (including two-byte sets) as well. Therefore, terminal concentrators, modems, and host systems will want to be able to switch between current ISO 2022, two-octet, and four-octet forms of terminal input. Often this switching will happen during a session. The session would likely start up in current ISO 2022 form and switch to two- or four-octet form if both parties agree and support that.

A new terminal sending two- or four-octet data to existing operating system software, such as command line interpreters and text editors, would probably still work correctly if the user limited himself to row 0 (ASCII or ISO 8859-1 characters) with Unicode coding, since most such software probably filters out NULLs. However, application software might NOT work so well; that depends on the run-time library handling of NULL octets.

However, I predict that most new Unicode/ISO 10646 terminals that support serial-lines will have a user controlled mode to run in current ISO 2022 form as well, so that they can be connected to existing systems.

3.5 Input from existing printers

Existing ISO 6429 and other character oriented serial-line full-duplex printers input limited amounts of status information, if enabled. PostScript[1] printers send arbitrary amounts of input data in ASCII and need flow control on input as well as output.

[1] PostScript is a registered trademark of Adobe Systems, Inc.

3.6 Input from future terminals need announcers

The need for a standardized announcer for switching from current ISO 2022 form to two- or four-octet forms is vital for input from new terminals, since implementation of the terminals is often done by different vendors than the concentrators, modems, and host systems. [We will see later that an announcer is also vital for interchange of file data for software use and for checking that the data is as expected for processing (or to do code expansion/reduction), so we might as well use the same announcers there too.]

4 Modems and Terminal Concentrators

4.1 Asynchronous serial-line communication

Full-duplex, serial-line modems and terminal concentrators pass the character data through in both directions. All data is passed through transparently, except for the C0 octets used for flow control: DC1 (hex 11) and DC3 (hex 13), called XON and XOFF (Control Q and Control S). Also most concentrators and modems pass NULLs through.
This is becoming increasingly necessary, since the IBM PRO printer has embedded NULLs in its escape sequences. This will help get Unicode data through existing concentrators and modems.

5 XOFF/XON Flow Control on asynchronous serial full-duplex lines

The XOFF/XON flow control problem is perhaps the biggest impediment to using the Unicode encoding technique. Unicode coding uses XON (hex 11) and XOFF (hex 13) octets as the second octet for a number of graphic characters, starting with Cyrillic. Hex 11 and hex 13 haven't been assigned as a first octet for any characters yet in Unicode. It may be well to hold off allocating characters to hex rows 11 and 13 for a while.

5.1 Background

The XOFF/XON flow control technique on asynchronous full-duplex serial lines permits the recipient (terminal/printer or modem/concentrator) to stop the sender (modem/concentrator or terminal/printer) temporarily, if the sender sends too much character data for the receiver to keep up. The recipient sends the XOFF down the other line to stop the sender. For terminals it works in both directions. For printers, except PostScript printers, the flow control is only needed for the printer to restrain the sender. For PostScript printers, which can send an arbitrary amount of data, the host needs to be able to restrain the printer's input as well.

The XOFF/XON flow control technique is the least expensive method for implementing flow control in the serial-line asynchronous market. It does not require any additional wires. Wiring in buildings is typically four wires, as is phone company practice. Thus existing wiring in buildings can implement a single full-duplex connection.

Another method for flow control over serial-lines is to use two additional signals, DTR and DSR. This takes two additional wires. Asynchronous printers often offer the customer both methods when connecting up his printer.
However, the video terminal market has not done this, since low cost is even more important and terminals are more often connected to wiring in a building. Printers tend to be installed near the host, concentrator, or modem, so the extra cost of more wires isn't a factor.

Another alternative, rarely used today, is to implement a full handshaking packetized protocol over the full-duplex serial-line. The sender and receiver use the protocol to control their rate of flow.

QUESTION
Is there an ISO, Internet, or other standard for a protocol for use over serial, full-duplex lines?

Some answers: Internet Mail is working on some techniques: uuencode, btoa, and binhex (see attached mail from Greg Vaudreuil, Chairman IETF SMTP Extensions Working Group).

5.2 Possible solutions to handling the XOFF/XON problem on serial lines

There are a number of solutions to avoiding the use of single octet hex 11 and hex 13 with ISO 10646/Unicode data on serial-lines:

5.2.1 Use two- and four-octet XOFF and XON

One approach for new terminals, printers, concentrators, modems, and hosts is to use the two- and four-octet XOFF and XON controls when operating in two- and four-octet form. This requires the use of an announcer to indicate when two- or four-octet form is being used in each direction and when returning to current ISO 2022 form.

5.2.2 Use additional wires

Use out of band flow control with additional wires. This is not possible in many circumstances, but is a good alternative when it can be used (especially for printers). This may become a popular method, but needs a standard (de facto or de jure), so that terminal/printer manufacturers can connect to concentrator/modem manufacturers' equipment.

5.2.3 Use a real protocol on the serial-line

Needs standards here. Are there ones in existence?

5.2.4 Byte stuff hex 11 and hex 13 when it occurs in data

Another approach is for the sender to convert any hex 11 and hex 13 data to something else, so that the existing equipment won't think an XOFF or XON is being sent.
The receiving equipment converts it back. Byte stuffing can be implemented in hardware (or by hardware assist which interrupts on particular bit patterns) or completely in host operating system, run-time library, or even application program software.

5.2.4.1 BISYNCH byte stuffing algorithm

One alternative would be to use IBM's BISYNCH method using a transparency mode entered by DLE STX, left by DLE ETX. Real XON and XOFF would be sent as usual; data that looks like XON and XOFF and DLE would be sent as DLE XON and DLE XOFF and DLE DLE. We might use the HOP code to get into two- or four-octet mode that requires transparency, instead of using DLE STX and DLE ETX (see Section 7).

I strongly recommend that the ISO 10646 standard include an informative annex that recommends a particular byte stuffing algorithm so that XON (hex 11) and XOFF (hex 13) can be used for flow control as single octets as is current practice.

ISSUE
Are there other byte stuffing algorithms? Any ISO standard ones?

5.2.5 Fold graphic two-octet data out of C0, SP, DEL, C1, FF space

Another approach that can be used, more typically in software and new Unicode/ISO 10646 terminals and printers, is for the sender to convert the Unicode/ISO 10646 data about to be sent so that it isn't confused with current data. The Apple proposal from Mark Davis, Rick Sewill, and Rob Hawley seems a good one (reproduced here with slight modification as indicated), though we need to extend it to four-octet ISO 10646 data as well [I didn't do this yet.]. It meets the following properties:

1. The pervasive C0 and ASCII characters are sent as one-octet data compatible with existing standards, equipment, and software.
2. The new Unicode/ISO 10646 graphic characters above hex 007F are folded so that they do not use hex 00..1F (C0), SPACE (hex 20), DEL (hex 7F), C1 (hex 80..9F), or hex FF (sometimes thrown away as if DEL) and are sent in two or three octets.

The algorithm is:

1.
Map Unicode/ISO 10646 characters 0000..007F to 00..7F. [I didn't include mapping the C1 characters (0080..009F to 80..9F), because some mail systems remove C1 characters (see attachment from IETF SMTP Extensions Working Group Chairman). I also didn't include mapping hex 00FF to FF, since 00FF is Unicode/Latin-1 SMALL LATIN LETTER Y WITH DIAERESIS and some existing communication systems and/or software confuse FF with DEL (7F) and remove it.] Then Unicode/ISO 10646 C0, SPACE, ASCII graphics, and DEL characters are represented as current one-byte C0, SPACE, ASCII graphics, and DEL characters and so can pass through existing software and communications channels that assume C0, SPACE, ASCII graphics, and DEL characters.
2. Map the next 93*179=16,647 Unicode/ISO 10646 characters starting with hex 00A0 into two octets in which the first octet is in the range hex A0..FC (93 possible values) and the second octet is in the range hex 21..7E, A0..FC (179 possible values).
3. Map the remaining 2*179*179=64,082 Unicode/ISO 10646 characters into three octets in which the first octet is hex FD..FE, and the second and third octets are hex 21..7E, A0..FC (179 possible values each).

I strongly recommend that the ISO 10646 standard include a second normative annex that recommends this particular folding that can be used to transform Unicode/ISO 10646 data into a form that can get it past existing hardware and software. By having two recommendations in two annexes, implementors will choose among them (or do both), rather than having a proliferation of single vendor, implementor workshop agreements, or various consortia solutions. These annexes would NOT be required for conformance of interchange or equipment.

The announcer technique could flag that this folded data follows. Then it wouldn't require prior agreement between sender and receivers whether the data was being folded or not.

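The folding steps of Section 5.2.5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the Apple algorithm itself. Two points are assumptions of the sketch: first, the trailing-octet ranges as printed (hex 21..7E and A0..FC) contain 94 + 93 = 187 values rather than the 179 counted in the text, so the sketch derives its radix from the printed ranges; second, the text does not spell out where 0080..009F go, so the sketch rejects them. The names fold and unfold are hypothetical.

```python
# Octet values taken from the ranges printed in Section 5.2.5.
TRAIL = list(range(0x21, 0x7F)) + list(range(0xA0, 0xFD))  # 21..7E, A0..FC
LEAD2 = list(range(0xA0, 0xFD))                             # A0..FC (93 values)
LEAD3 = [0xFD, 0xFE]
RADIX = len(TRAIL)            # assumption: derived from the printed ranges
TWO_OCTET_COUNT = len(LEAD2) * RADIX
BASE = 0x00A0                 # first character sent in two octets
TRAIL_INDEX = {octet: i for i, octet in enumerate(TRAIL)}


def fold(code: int) -> bytes:
    """Fold one two-octet character code (0..FFFF) into 1, 2 or 3 octets."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("only two-octet codes are handled")
    if code <= 0x7F:
        return bytes([code])  # C0, SPACE, ASCII graphics, DEL unchanged
    if code < BASE:
        raise ValueError("placement of 0080..009F is not given in the text")
    index = code - BASE
    if index < TWO_OCTET_COUNT:
        lead, trail = divmod(index, RADIX)
        return bytes([LEAD2[lead], TRAIL[trail]])
    lead, rest = divmod(index - TWO_OCTET_COUNT, RADIX * RADIX)
    middle, trail = divmod(rest, RADIX)
    return bytes([LEAD3[lead], TRAIL[middle], TRAIL[trail]])


def unfold(octets: bytes) -> int:
    """Inverse of fold() for exactly one folded character."""
    first = octets[0]
    if first <= 0x7F:
        return first
    if first in LEAD3:
        lead = LEAD3.index(first)
        middle, trail = TRAIL_INDEX[octets[1]], TRAIL_INDEX[octets[2]]
        return BASE + TWO_OCTET_COUNT + (lead * RADIX + middle) * RADIX + trail
    lead = first - LEAD2[0]
    return BASE + lead * RADIX + TRAIL_INDEX[octets[1]]


# CYRILLIC CAPITAL LETTER BE is 0411; unfolded, its second octet is hex 11.
print(fold(0x0411).hex(), hex(unfold(fold(0x0411))))
```

In this sketch every folded character above 007F avoids hex 00..20, 7F, 80..9F, and FF in all of its octets, which is the property the section asks for.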
6 Host connections

Host connections include connecting to concentrators and modems and directly to terminals and printers.

Hosts connected to concentrators or modems control their flow with "out of band" methods, rather than using XOFF/XON.

Hosts that connect directly to the terminal or printer with a serial-line have the same problems that a concentrator or modem has when connecting to the terminal or printer with a serial line (see above).

7 Alternatives for Announcers to indicate two or four-octet form

Announcers are needed to flag data, whether interchanged on communication lines or as complete files. The use of announcers with fields of records is probably not done, since fields are usually declared as to data type (e.g., which character set) when the record is declared.

PostScript has used an announcer on all platforms, consisting of the two ASCII characters PERCENT (%) EXCLAMATION MARK (!). Printing systems then distinguish PostScript data from ordinary (or other) text if the first two characters of the file are %!. This has been an invaluable technique for introducing PostScript into existing systems. ISO 10646 should use a similar technique to announce two-octet form and four-octet form.

It is desirable to also have a method to return to current ISO 2022 form after having entered two- or four-octet form, i.e., return to character sets that conform to the current ISO 2022 code extension standard.

7.1 Requirements for announcers

The announcer mechanism must meet the following requirements:

1. The announcer mechanism must distinguish the following types of interchange data:
   1. ISO 10646 two-octet data BMP (4 needed)
      a. Whether or not non-spacing accents are used (to dynamically compose characters).
      b. Whether or not SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI) is used to select single characters in other planes.
   2. ISO 10646 four-octet data (1 needed)
   3.
return from ISO 10646M data to existing ISO 2022 coded character sets (including ANSI C multi-byte, and ISO 2022 one-byte and multi-byte sets, EUC, etc.).
   4. two-octet compaction in which the ideographic zone of the specified plane of group 00 replaces the corresponding ideographic region of the BMP to form two-octet data (4 * 10 or so needed)
      a. Whether or not non-spacing accents are used (to dynamically compose characters).
      b. Whether or not SINGLE GRAPHIC CHARACTER INTRODUCER (SGCI) is used to select single characters in other planes.

      NOTE
      NO NEED FOR SOME COMPACTION METHODS
      One-octet compaction (using C0 and C1 coding assignments) no longer yields any national or ISO standards, except ISO 8859-1, so I did not list that as an announcement requirement here. Three-octet compaction does not seem to be needed, since reaching out to get seldom used characters can be done from two-octet form using SGCI, or by using full four-octet form, so I did not list that as an announcement requirement here.
   5. Possibly that "folded" data follows (see Section 5.2.5) (1 needed in combination with all of the above)
2. The announcer mechanism must be unambiguously interpretable in all 3 forms (current ISO 2022, two-octet ISO 10646M, and four-octet ISO 10646M forms) or, alternatively, all data (CC-data-elements) is assumed to start out in current ISO 2022 form.
3. The announcer mechanism must also meet the ISO/ANSI C programming language standard requirements for so-called multi-byte data, in which the data stream is assumed to start out in one-octet ASCII and then switch to any other form. (The switch can happen immediately as the first data, so any announcer that is interpretable in one-octet form suffices here.)

   NOTE
   Requirement 2 above means you can read the first two octets or first four octets as a unit if that is what you expect and simply check that the announcer is as you expect; otherwise it is an error condition or requires code conversion.
4.
The announcer must occupy a multiple of two or four octets, depending on whether two- or four-octet form is being announced.

7.2 Use of announcers for conforming interchange

These announcers are required for conforming interchange of files or over communication lines, unless there is a higher level protocol, such as an OSI or SMTP protocol, record description, RPC data description, etc., or unless there is prior agreement. However, prior agreement precludes so-called blind interchange.

The announcer(s) could be used merely as a check that a program was opening a file that was in the anticipated form, or could be used to dynamically convert from one or more types to the desired types, depending on the design of the system run-time (likely to vary between different programming languages). On communication lines, the announcers are used to indicate the form of following data.

In closed systems that are entirely ISO 10646/Unicode, the use of the announcer would be optional. However, if such a system interchanged its data with other types of systems, it must include the announcer, whether the interchange was through files or communication lines, unless there is a higher level protocol, or unless there is prior agreement.

7.3 Alternative proposals for the announcer mechanism

There are several approaches for encoding announcers:

1. Use the C1 HOP (hex 81) that the current ISO DIS 10646 uses for announcing, with one or two octets immediately following being the parameters.
2. Use ESC Fs sequences and ESC I Fs sequences, where the Fs characters are assigned by the ISO Registrar, where I is in the range hex 20..2F and Fs is in the range hex 60..7E.
3. Use ESC 2/0 F announcers from ISO 2022.
4. Use ESC 2/5 2/15 F complete code designators from ISO 2022.

SC2 should pick one of the following alternatives or another one that meets the requirements.

7.3.1 Alternatives using HOP

The following alternatives use the C1 HOP (hex 81) that the current ISO DIS 10646 uses for announcing, with some number of following bytes being parameters, indicating two-octet BMP, using non-spacing accents, using SGCI, two-octet compaction, four-octet, etc.

7.3.1.1 Alternative 0: one-octet HOP character with one parameter

    1st octet: not 81
        existing ISO 2022 data follows
    1st-2nd octets: 81xx
        any of the two- or four-octet forms, SGCI used, non-spacing accents used, depending on the value of xx; a limited range of xx specifies which plane's ideographic zone replaces the corresponding ideographic zone of the BMP.

7.3.1.2 Alternative 1: two-octet HOP character with two parameter octets

    1st-2nd octets: not 0081
        existing ISO 2022 data follows
    1st-4th octets: 0081 xxyy
        any of the two- or four-octet forms, SGCI used, non-spacing accents used, depending on the value of xx; a limited range of yy specifies which plane's ideographic zone replaces the corresponding ideographic zone of the BMP.

RESTRICTIONS
Alternative 0: the first octet must be looked at separately from the second octet; therefore, alternative 0 cannot be used in the middle of data to switch to another form, only at the beginning.
Alternative 1: Group 00 plane 81 shouldn't have any graphic characters assigned to it, so that it wouldn't be confused with the announcer.

7.3.2 Alternatives using ESC 2/0 F or ESC 2/5 I F

An alternative approach would be to use some ISO 2022 announcer escape sequences of the form ESC 2/0 F, where different F values indicate two-octet BMP, using non-spacing accents, using SGCI, two-octet compaction, four-octet, etc. It may be necessary to use ESC 2/0 I F, since 20 to 40 different forms may be needed.

As with the HOP alternatives, either require an initial pad of NULL or require scanning an octet at a time at the beginning (and don't assign characters to row 001B in BMP).

Example:

    Alternative 0:  1B20 xx00 and 1B20 20xx
    Alternative 1a: 001B 20xx and 001B 2020 xx00 (when going to 2-octets)
    Alternative 1b: 001B 20xx and 001B 2020 xx00 0000 (when going to 4-octets)

A second alternative approach would be to use the ISO 2022 invocation of a complete code (which ISO 10646M certainly is) using the ESC 2/5 2/15 F sequences, where different F values indicate two-octet BMP, using non-spacing accents, using SGCI, two-octet compaction, four-octet, etc.

Example:

    Alternative 0:  1B25 2Fxx
    Alternative 1a: 001B 252F xx00 (when going to 2-octets)
    Alternative 1b: 001B 252F xx00 0000 (when going to 4-octets)

NOTE
Even if ISO 10646M were to use the ESC 2/5 4/0 sequence to return to ISO 2022 (see Section 7.4.2), we can't use the ESC 2/5 F sequence to invoke ISO 10646M, because graphic character data could accidentally look like ESC 2/5 4/0. This is because ISO 10646M has already assigned a graphic character to hex 2540 and lots of graphic characters have the second octet hex 1B (ESC). ISO 10646M has to use initial padding of the first octet of ESC 2/5 4/0 to return to ISO 2022 (i.e., hex 001B 2540) in order to avoid the conflict with graphic characters.

7.4 Character Synchronizing

An announcer must occur at the boundary of a two-octet or four-octet character, not in the middle. Thus the sender must know which form the data is in (one, two, or four-octet) when inserting the announcer.

Can anyone think of an announcer scheme that is self-synchronizing for use when the sender isn't sure of the form of data or where the character boundary is?

Perhaps sending four hex 00 octets before the announcer would be sufficient for the recipient to start to look for the hex 00000081 announcer pattern. If it doesn't occur, the data is treated normally. If the sender knows what the form of data is and where the character boundaries are, the extra four hex 00 octets need NOT be sent.
So for example, at the beginning of a file or at the beginning of a communication session, there is no need for the extra four hex 00 octets in front of the announcer for synchronization (00 HOP or 000000 HOP), since it is clear what the form is and that we are at a character boundary.

When files are concatenated (for example, UNIX pipes), the announcer might occur in the middle of the resulting data. However, the concatenator might have to convert the data to that expected by the program anyway, so the announcers in the middle would disappear (or be treated as no-ops if they are the same as the original, or as errors if different from the original).

7.4.1 Little Endian data announcement

In order to ensure data portability, data must be interchanged in a single standard order, namely with the most significant octet first. For programming languages that choose to represent ISO 10646M characters as integers, such as C, their run-time libraries on Little Endian systems can swap the bytes when reading and writing data that could be interchanged. Other languages can choose to represent ISO 10646M characters as two-octet strings, which both Big Endian and Little Endian machines store with most significant octet first; these languages need never swap bytes and can interchange data on the same system with C produced/consumed data.

7.4.2 Alternatives for switching back to current ISO 2022 form

We need a way for a data stream to switch back to the default current ISO 2022 data form of current systems, i.e., to current character sets conforming to the current ISO 2022, including so-called multi-byte character sets. If ISO 10646/Unicode is registered as a complete code according to ISO 2022 and ISO 2375, then it is desirable to have a way to get back.

Alternatives:

1. Use the existing ESC 2/5 4/0 escape sequence specified in ISO 2022 (see ISO 2022 clause 6.3.11) to return to ISO 2022 conforming representation.
This escape sequence would be used to switch back to current ISO 2022 form, i.e., to current ISO 2022 conformant character sets that include one- and multi-byte character sets. The escape sequence ESC 2/5 4/0 would be represented in two-octet and four-octet form as hex:
   2-octets: 001B 2540
   4-octets: 0000001B 25400000

2. A particular parameter value of the HOP announcer sequence indicates return to ISO 2022:
   2-octets: 0081 xxyy
   4-octets: 00000081 xxyy0000

8 Revising ISO 2022 to include two-octet and four-octet forms

I recommend that ISO 2022 be revised after ISO 10646 is approved to include the expansion idea to two and four octets with ISO 10646 given as the only standard for its use. The ISO 10646 announcers would be included.

9 Summary

This paper proposes the following for the 2nd ISO DIS 10646 regarding the C0 space:

1. It's OK for the 2nd ISO DIS 10646 to use C0 and C1 space for graphic characters (see Section 1), as long as:
   1. it specifies that all current C0 (hex 00..1F) and C1 (hex 80..9F) control characters from existing standards and implementations can be represented in the new two and four octet form with leading hex 00 or hex 000000, respectively, and
   2. it specifies announcers which indicate the following data is to be interpreted as two-octet or four-octet forms, i.e., differently than current systems and standards (see Section 7), and
   3. it specifies the additional capabilities used with two octet form: using BMP, using non-spacing accents, using SGCI, two-octet compaction of a specified other basic plane with the BMP, and
   4. it specifies an announcer for "folded" data for use on some existing communications systems that avoids using C0, DEL, and C1 to represent graphic characters, and
   5. it specifies an announcer to return to current ISO 2022 conforming data of current systems.
2. Use HOP xx or 00 HOP xxyy for the announcers.
3. The appropriate announcer must be required for conforming interchange.
It may be omitted only if there is a higher level protocol that specifies the form or if there is prior agreement on the form.
4. Specify an announcer sequence to return to (current) ISO 2022 conformant character sets that include one- and multi-byte character sets (see Section 7.4.2).
5. Specify that four hex 00 octets serve as a synchronization sequence that can be sent when the sender isn't sure what form the communication line is in and/or is not sure where the character boundaries are (see Section 7.4).
6. An informative annex to recommend a particular byte stuffing algorithm or point to a suitable standard for that, so that XON (DC1 = hex 11) and XOFF (DC3 = hex 13) can be used for flow control on asynchronous, full-duplex, serial communication lines. The BISYNCH algorithm is OK (see Section 5.2.4).
7. A second normative annex to specify how to fold ISO 10646/Unicode data so that its two- and four-octet forms cannot be confused with existing C0 (hex 00..1F), DEL (hex 7F), and C1 (hex 80..9F) characters (see Section 5.2.5).
8. If any escape and control sequences and control strings are used within ISO 10646M, they only need initial control character pre-padding and final character post-padding; interior padding is not needed (see Section 1.1).
9. Code any ISO 10646M controls that can only be used inside ISO 10646M using C0 or C1 code extension code points (SO, SI, SS2, SS3); leave the other C0 and C1 control character code points (with 00 and 000000 leading padding in two- and four-octets, respectively) for use with ISO 10646M (see Section 1.2).
=========================================================================
Date: Thu, 1 Aug 1991 15:14:05 PDT
Reply-To: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
From: "F.
Avery Bishop 01-Aug-1991 1512"
Subject: DEC position on 10646M

To: 10646M mailing list
Subj: DEC position on 10646M

Digital supports the work of the 10646M ad hoc group chaired by Ed Hart to form a single worldwide character code set jointly based on 10646 and Unicode. From the national comments on DIS 10646, it is now clear that ISO needs to address the desire of users to have just one universal character set. In addition, the Unicode standard, Unicode member companies, computer users, and others will realize the following benefits from having a single worldwide character code.

- Support by implementors. Implementors will be encouraged to adopt the multilingual code when they know there is only one standard rather than two "competing" standards. Interoperability will be enhanced when companies not now involved in the Unicode consortium support the code in multiple hardware platforms and operating systems.

- Penetration to markets where international standard conformance is required. This includes procurement requirements for contracts with government agencies (including EC) and international organizations.

- Much broader review, resulting in an improved Unicode. The ISO voting procedure will provide more feedback from countries and other information technology groups. For example, there were significant comments from the USSR and Greece on their requirements for DIS 10646. Unicode can be improved to get better acceptance in those areas by meeting the requirements.

- Support by other international and national standards. There are many standards which must be extended to deal with a universal character code, including file structures, ASN.1, all OSI protocols, programming languages, application level standards such as ODA, etc. These standards can only support other de jure standards. On the other hand, they will be strongly encouraged to support the Unicode code structure if it becomes an ISO standard.
Without these extensions, the scope of Unicode will be restricted, and users would need to use other character sets for many applications.

Digital therefore urges all concerned to cooperate with the 10646M effort to create a unified universal character code set that meets the needs of industry and international users and is acceptable as an ISO standard.
=========================================================================
Date: Fri, 2 Aug 1991 13:00:48 PDT
Reply-To: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Comments: Warning -- original Sender: tag was Joseph_D._Becker.OSBU_North@XEROX.COM
From: Becker.OSBU_North@XEROX.COM
Subject: Re: DEC position on 10646M
In-Reply-To: "%pucc.princeton.edu!10646M%JHUVM:BITNET's message of 1 Aug 91 15:14:05 PDT (Thursday)"

I agree with one-worldism in general, and with the force of the DEC position as expressed by Avery in particular. But without meaning to stir up sleeping hornets, I feel the need to point out again that the goal "one standard" may apply at two different levels:

> In the loose sense, "one standard" means one 10646M document that we all give our blessing.

> In the strict sense, "one standard" means that each logical sequence of characters has one and only one legal encoding (modulo byte-swapping).

I am worried that if we are unable to hold down on the "compaction methods" which might make 10646M into a compose-your-own-encoding portmanteau, then many people who sincerely supported unification will find that the bottom line at implementation time and run time is that they will STILL be facing a myriad of incompatible representations.

We had a fruitful discussion of "compaction methods" in San Francisco, coming to understand that their intent is to try to provide a form of backward compatibility (with systems/data using current encodings), by building this compatibility into the language defined by the standard.
After careful consideration, I think we discovered that it is more effective to take compatibility issues out of the encoding syntax and implement them via explicit code-conversion processes. Certainly doing so removes the undesired side effect of ambiguous representation introduced by compaction methods.

If we really hope to arrive at the goals/benefits listed in the DEC position, then I think we have to aim for "one standard encoding" in the narrow sense of just one representation ... or, okay, just one 16-bit representation and its 32-bit extension as specified in 10646U.

Joe
=========================================================================
Date: Fri, 2 Aug 1991 16:52:55 EDT
Reply-To: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
From: schein@TOROLAB5.VNET.IBM.COM
Subject: 10646M effort

To illustrate one of the points in Avery's note I am attaching the X/Open policy statement on standards.

Isai
---------------------------------------------------------------
To   : SSC Standards Policy
From : Andrew Walker
Date : 19th July 1991
Cc   :

Ladies and Gentlemen,

You will be pleased to know that the X/Open Board approved the Standards Policy on 17th July 1991. The final wording, which is below, has one change which was requested by the Board Technical SubCommittee, which pointed out that CCITT 'Standards' are called 'Recommendations'. The words 'Recommendations approved by ' have therefore been added to the second paragraph.

I would like to thank all those who have contributed to the development of this policy. It is a real achievement which I hope will in time help considerably to create a good working relationship between X/Open and the standards world.
I was also asked by the Board Technical SubCommittee to carry out two actions:

1: To prepare a communications plan for the Standards Policy, for approval by the Marketing Managers. (I will also seek the approval of the Standards Steering Committee)

2: To prepare a set of Questions and Answers to clarify, for internal use, how the standards policy will be applied. (I will seek the approval of these by the Standards Steering Committee).

The text of the approved Standards Policy is:

X/Open Standards Policy

1: X/Open shall cooperate with formal standards bodies to bring standards-based Open Systems to the market in a timely and effective manner. It shall make its work available to standards bodies with such release of copyright as is required to permit material to be incorporated into formal standards.

2: Where de jure standards exist, X/Open shall conform to them. Wherever possible, X/Open shall use International Standards approved by ISO/IEC or Recommendations approved by CCITT. In their absence it may adopt Regional or National standards which are likely to become internationally adopted.

3: Where de jure standards are under development, X/Open shall ensure that its specifications are aligned with them.

4: Where the results of X/Open work extend beyond that covered by the development of de jure standards, X/Open shall, in situations where formal ratification is appropriate, and where resources permit, submit its work to the standardization process.

5: Where there is no de jure standard, X/Open may use de facto standards if they are broadly acceptable in the market place.

6: X/Open, and its Technical Working Groups, shall observe the rules of the standards bodies with which they work and shall offer reciprocal liaison as required.
--
----------------------------------------------------------------------------
Andrew Walker                     X/Open Company Limited
Standards Manager                 Apex Plaza, Forbury Road
EMail: a.walker@xopen.co.uk       Reading, England, RG1 1AX
Tel: +(44) (0)734 508311          FAX: +(44) (0)734 500110
----------------------------------------------------------------------------
=========================================================================
Date: Fri, 9 Aug 1991 08:26:38 EDT
Reply-To: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
From: Edwin Hart
Subject: Publication of the Unicode Book

I purposefully delayed writing this until after the successful completion of the CJK-JRG in Japan.

I am asking the Unicode Consortium to delay publication of the Unicode book. I know this is very controversial within the Consortium, but I think you should seriously consider delaying it. I can present several reasons for this:

1. It would be an ADDITIONAL gesture of "good faith" on the part of the Consortium. This would definitely help the merger. It gives more moderate members of WG2 a better bargaining position with those who will oppose any cooperation with the Consortium. In short, it will promote the merger.

2. It gives the Consortium another item to use to bargain with ISO. Many in WG2 feel the competition of Unicode to be the first to publish an approved code. The Consortium does not need to tell WG2 that if the merger discussions fail, the Unicode book will go to press almost immediately. If you publish the book, (a) you have less "good faith" and (b) WG2 MAY have less incentive to cooperate. (I believe that the 9 negative votes that mention a merger between 10646 and Unicode gives WG2 a lot of incentive to reach an accommodation.) In short, do not put your chips in the center of the table too soon--you might need them later.

3. You have delayed publication of the UniHan portion of Unicode.
This means that if you publish the non-Han portion of Unicode now, people will need to buy the UniHan part later. That is extra expense for the customer and the publisher. You could keep the cost down by publishing them together.

4. Assuming the merger is successful (and this is not certain right now) I expect that the merged 10646-Unicode code will be slightly different from what Unicode looks like now. Therefore, if you publish Unicode now, it will be different from what the merged 10646 international standard will be. As a customer, I always hate it when the real thing is subtly different from the documentation. Moreover, it is always a pain to insert the update pages OR buy the new book with the correct information. By the way, guess who will be publishing the book with the correct information? ISO! In short, you do the developers who will implement Unicode a disservice if the book is just "slightly" different from the 10646 standard. They will need to buy another book, either the ISO 10646 or a second edition of the Unicode book to obtain the current information. That also leaves the publisher with a surplus of first edition books that will be obsolete a couple of months after they are published. When you are ready with your second edition and the publisher has a warehouse full of unsold first edition books, he is not going to be very happy and I am sure the unhappiness will be passed on to the Consortium in the cost of publishing the second edition.

In summary, I am begging you to not only consider delaying publication of the Unicode book but to actually delay publication of it. I believe it is in the best interest of obtaining a merger, in the best interest of people who will implement Unicode, and in the best interest of the Unicode Consortium. You cannot wait until the end of the WG2 meeting.
I would suggest that at the WG2 meeting, Mark Davis be prepared to tell WG2 that with satisfactory progress of the merger, the Unicode Consortium will delay publication of its book.

Best regards,
Ed
=========================================================================
Date: Fri, 9 Aug 1991 10:28:49 PDT
Reply-To: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
Sender: "10646M: Multibyte code working group" <10646M@JHUVM.BITNET>
From: "K. Yoshimura"
Subject: Possibility of a TC 46 Representative Attending August WG 2 Meeting

In early July I noted that ISO TC 46 delegates who voiced interest in attending WG 2 meetings to help forge closer cooperation said that they couldn't attend the August WG 2 meeting since it coincides with IFLA. This morning I received a call from Berlin: Axel Ermert of the National Library said he would try to go to at least part of the meeting. I'm faxing him the WG 2 meeting announcement; he said he'd know for sure next week. (He already has Mike Ksar's contact information.) I hope you see him there.

Karen

To: 10646M@JHUVM.BITNET
cc: BB.WED, SALLY, NISONBS