TC154 proposal for UTF-8

> -----Original Message-----

> From: Frank Farance [mailto:frank@farance.com]

> Sent: Thursday, January 10, 2002 11:05 PM

> To: jtc1tag_talk@lists.itic.org

> Subject: FYI: My informal comments on TC154's use of UTF-8 as "sole

> encoding" for info interchange

> _____________

> Dan Gillman (NCITS/L8 Chair) and I (NCITS/L8 IR) received the following

> E-mail regarding TC154 (and ITU?). I've provided an *informal* response

> to the TC154 TAG (below). I suggest that JTC1 get involved ... this

> "character problem" is already a big problem in SC2 vs. SC22, and I'd

> imagine SC34 would have some interest in this topic, too.

> I think Arnold has been working the "character" issue (right?). Arnold,

> do you have any suggestions for addressing the TC154 concerns?

> -FF

> --------------------------------------------------------------

> [Message from TC154 ...]

> From: Henrietta Scully <HSCULLY@ANSI.org>

> To: Gaile Spadin <gspadin@disa.org>, Mary K Blantz <MBlantz@iona.com>,

> Barbara Bennett <bbennett@itic.org>,

> Daniel Gillman <gillman_d@bls.gov>, Frank Farance

> <frank@farance.com>

> Subject: FW: [TC154-P:143] use of UTF-8 UNICODE II of IS 10646-1

> Date: Fri, 28 Dec 2001 13:48:01 -0500

> X-Mailer: Internet Mail Service (5.5.2650.21)

> Dear ISO/TC 154 Distribution:

> The following email message has been received at the American National

> Standards Institute. It is transmitted for information of the ANSI

> Accredited Technical Advisory Group (TAG) for ISO/TC 154.

> Regards,

> Henrietta Scully

> Program Manager

> Standards Facilitation/ISOT

> isot@ansi.org <mailto:isot@ansi.org>

> -----Original Message-----

> From: Francois Vuilleumier [mailto:fvuille@attglobal.net]

> Sent: Friday, December 28, 2001 10:27 AM

> To: TC154-P List

> Subject: [TC154-P:143] use of UTF-8 UNICODE II of IS 10646-1

> hello TC154,

> the MoU/MG on e-business has adopted the following Resolution:

> "The MoU/MG recommends that all e-business standards generated by member

> organisations should support the UTF-8 encoding defined in ISO 10646 [1] as

> the sole encoding for information exchange. Member organisations are invited

> to inform the MoU/MG [2] by end February 2002 of any business or technical

> barriers to implementation of this recommendation."

> [1] of JTC1/SC2

> [2] mailto:moumg@itu.int

> to get more, hook to the TC154 web (see below) and goto:

> ++ b. liaisons/partners

> ++ 1. internal

> ++ MoU on e-business

> ++ click on "IS 10646-1 UTF-8 UNICODE II"

> Bonne Année ;-) François

> ===

> ISO François Vuilleumier

> TC ISO/TC154 chair

> 154 c/o DGD, Monbijoustrasse 40, CH-3003 Berne

> ===

> --------------------------------------------------------------

> [INFORMAL response to TC154 TAG ...]

> [The following is an *informal* response to the issue raised below. If

> you'd like a formal NCITS/L8 position, Dan and I can get a formal

> statement. -FF]

> The use of UTF-8, for the purposes described above, is inadequate for

> "E-business standards", especially when TC154 has framed the question as

> "UTF-8 vs. UTF-16". The use of characters, especially in an

> international context, is a complex standardization process ... merely

> advocating ISO/IEC 10646-1 (the 2000 edition) or UTF-8 (or UTF-16) may

> still cause significant implementation and interoperability problems.

> In order to properly use international characters within international

> standards, one must consider several aspects:

> - The Conceptual Model/Framework. What characters are being

> used (independent of encoding)? This interoperability point is known as

> the "character set repertoire". For example, both ASCII and EBCDIC

> include the Latin alphabet in their repertoire, but their encodings are

> different. ISO/IEC 10646-1 specifies a repertoire.

> - The Character Encoding. There are several encodings of

> ISO/IEC 10646-1 and UTF-8 has its own strengths and weaknesses. For

> example, UTF-8 is space efficient for certain characters and inefficient

> for others; regardless, UTF-8 requires significantly more processing

> effort (with a greater programming error rate). In short, in the space

> vs. time trade-off, UTF-8 favors space efficiency. Meanwhile, UCS-4

> (another ISO/IEC 10646-1 encoding) is less space efficient (requires 4

> octets for each character), but is more efficient to process ... UCS-4

> favors time efficiency.

> - The Character Processing Model. Many systems process

> characters as "strings", i.e., a series of characters. Two types of

> processing are common: multibyte character strings (i.e., characters may

> vary in size, and character "state" information may be embedded) and wide

> character strings (i.e., characters are all the same size and there is

> no embedded "state" information). Note that the choice of character

> processing features (e.g., multibyte vs. wide) is independent of the

> choice of character encodings. Examples: Storage systems typically

> favor space efficiency, while processing (API) systems typically favor

> time efficiency.

> - The Whitespace/Record Processing Model. Even 35 years after

> the standardization of ASCII, improper handling of whitespace can cause

> interoperability problems. The same "whitespace problem" exists today

> with the processing and transfer of XML records: incomplete

> specification of whitespace semantics/processing can cause

> interoperability problems. (Read: Simply specifying UTF-8 is not

> enough.) For example, more attention needs to be focused on

> specification of line processing (e.g., newline boundaries),

> leading/trailing whitespace (may be converted/reduced/removed by

> processing systems), etc.

> - The Markup/Escaping Mechanism. There always will be a need to

> intermix "control" features with "data" features in character

> processing. Markup is at one end of the spectrum (e.g., the top-level

> feature is "control"; and "data" is subordinate), while escaping

> features are at the other end of the spectrum (e.g., the top-level

> feature is "data"; and an escaping mechanism signals "control"). One of

> the main hazards is managing the proper sequencing and precedence of

> substitutions implied by markup and escaping mechanisms (think "macro

> expansion" issues) ... and these issues are non-local behavioral issues.

> - The Localization (L10N) and Internationalization (I18N)

> Context/Features. Regardless of the character set chosen, improper

> consideration of L10N and I18N concerns may reduce international

> adoption, e.g., just specifying UTF-8 is not enough for handling the

> requirements of Japanese usage (<-- the problem is not UTF-8 per se, but

> lack of localization features).
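
[Illustration of the repertoire vs. encoding distinction and the space trade-off described in the list above. This is only a rough Python sketch: the sample strings are arbitrary, and Python's "utf-32-be" codec is used as a stand-in for UCS-4, which is an assumption that holds for the code points used here.]

```python
# The same repertoire (abstract characters) serialized in three encodings.
samples = {
    "latin": "invoice",     # 7 Basic Latin letters
    "japanese": "請求書",    # 3 CJK ideographs
}

def encoded_sizes(text):
    """Return the number of octets each encoding needs for `text`."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # "-be" avoids a byte order mark
        "ucs-4": len(text.encode("utf-32-be")),   # stand-in for UCS-4 (assumption)
    }

for name, text in samples.items():
    print(name, len(text), encoded_sizes(text))

# Whatever the encoding, decoding gives back the same characters:
# the repertoire is independent of the octets on the wire.
for text in samples.values():
    for codec in ("utf-8", "utf-16-be", "utf-32-be"):
        assert text.encode(codec).decode(codec) == text
```

[For these two samples, UTF-8 is the smallest form for the Latin string and UTF-16 is the smallest for the Japanese one, while the fixed-width form is the largest in both cases but lets a program index characters without scanning.]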

> Now after considering the above features, users of international

> characters should consider how much specification is necessary ***and

> only specify that far, i.e., don't overspecify***. For example, in many

> data models, one may only need to specify the **repertoire** of

> ISO/IEC 10646-1 **without** specifying a particular encoding.
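
[A minimal sketch of what "specify the repertoire, not the encoding" could look like in a data model, written in Python. The ranges in ALLOWED_RANGES and the function names are hypothetical, chosen only for illustration; a real specification would name its own subset of ISO/IEC 10646-1.]

```python
# Hypothetical repertoire: space, digits, Basic Latin letters, and hiragana.
ALLOWED_RANGES = [
    (0x0020, 0x0020),  # space
    (0x0030, 0x0039),  # digits 0-9
    (0x0041, 0x005A),  # A-Z
    (0x0061, 0x007A),  # a-z
    (0x3040, 0x309F),  # hiragana block
]

def in_repertoire(text):
    """True if every character's code point falls inside an allowed range."""
    return all(
        any(lo <= ord(ch) <= hi for lo, hi in ALLOWED_RANGES)
        for ch in text
    )

def outside_repertoire(text):
    """List the characters that the repertoire does not permit."""
    return [ch for ch in text if not in_repertoire(ch)]

# The check operates on decoded characters, so it does not matter
# which encoding carried the data.
value = "Order 42 こんにちは"
from_utf8 = value.encode("utf-8").decode("utf-8")
from_utf16 = value.encode("utf-16-le").decode("utf-16-le")
assert in_repertoire(from_utf8) and in_repertoire(from_utf16)

print(outside_repertoire("Order №42"))
```

[The data model constrains only which characters may appear; each binding (file format, wire protocol, API) remains free to pick the encoding that suits it.]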

> As another example, programming language APIs may have datatypes for

> native international character processing ... mandating UTF-8 might be

> problematic, error-prone, and inefficient (e.g., a UTF-8 class would

> probably be a poor choice for Java APIs).

> Thus, choosing (standardizing) the encoding of characters is less

> important in many cases than choosing its conceptual/functional model.

> In some cases, it may be appropriate to choose a specific encoding, but

> not for the purposes of "... the sole encoding for information exchange

> ...".

> A good example of a poorly defined specification is the IETF RFC 2426

> which defines the vCard. Even if UTF-8 (or UCS-4) were used, there

> would still be a problem with interchange of international characters

> because the ***character conceptual model*** is faulty, e.g., it is not

> possible to guarantee interoperability of non-ASCII characters in RFC

> 2426 (even if UTF-8 is used).

> ISO/IEC JTC1 has thorough knowledge and experience in standardizing and

> using the features listed above. The following are some starting points

> (a non-exhaustive list):

> Conceptual Framework: JTC1/SC2, JTC1/SC22, JTC1/SC34

> Character Encoding: JTC1/SC2

> Character Processing: JTC1/SC22, JTC1/SC2, JTC1/SC34, JTC1/SC32

> Whitespace/Record Processing: JTC1/SC22, JTC1/SC34

> Markup/Escaping: JTC1/SC34, JTC1/SC22, JTC1/SC2

> Localization/Internationalization: JTC1/SC22 WG14/WG15/WG20/WG21

> Thus, TC154 has incorrectly framed the problem as "UTF-8 vs. UTF-16".

> Before agreeing to "... UTF-8 as the sole encoding for information

> exchange ...", I recommend contacting JTC1 for feedback on technical

> issues.

> -FF

> _______________________________________________________________________

> Frank Farance, Farance Inc. T: +1 212 486 4700 F: +1 212 759 1605

> mailto:frank@farance.com http://farance.com

> Standards, products, services for the Global Information Infrastructure
