>
> -----Original Message-----
> From: Frank Farance [mailto:frank@farance.com]
> Sent: Thursday, January 10, 2002 11:05 PM
> To: jtc1tag_talk@lists.itic.org
> Subject: FYI: My informal comments on TC154's use of UTF-8 as "sole
> encoding" for info interchange
>
>
> Please be advised that replying to this message goes to the sender. If
> you wish to send a reply to all on the list, please respond with "Reply
> All".
> _____________
> Dan Gillman (NCITS/L8 Chair) and I (NCITS/L8 IR) received the following
> E-mail regarding TC154 (and ITU?). I've provided an *informal* response
> to the TC154 TAG (below). I suggest that JTC1 get involved ... this
> "character problem" is already a big problem in SC2 vs. SC22, and I'd
> imagine SC34 would have some interest in this topic, too.
>
> I think Arnold has been working the "character" issue (right?). Arnold,
> do you have any suggestions for addressing the TC154 concerns?
>
> -FF
> --------------------------------------------------------------
> [Message from TC154 ...]
>
> From: Henrietta Scully <HSCULLY@ANSI.org>
> To: Gaile Spadin <gspadin@disa.org>, Mary K Blantz <MBlantz@iona.com>,
> Barbara Bennett <bbennett@itic.org>,
> Daniel Gillman <gillman_d@bls.gov>, Frank Farance
> <frank@farance.com>
> Subject: FW: [TC154-P:143] use of UTF-8 UNICODE II of IS 10646-1
> Date: Fri, 28 Dec 2001 13:48:01 -0500
> X-Mailer: Internet Mail Service (5.5.2650.21)
>
> Dear ISO/TC 154 Distribution:
>
> The following email message has been received at the American National
> Standards Institute. It is transmitted for information of the ANSI
> Accredited Technical Advisory Group (TAG) for ISO/TC 154.
>
> Regards,
> Henrietta Scully
> Program Manager
> Standards Facilitation/ISOT
> isot@ansi.org
>
>
>
> -----Original Message-----
> From: Francois Vuilleumier [mailto:fvuille@attglobal.net]
> Sent: Friday, December 28, 2001 10:27 AM
> To: TC154-P List
> Subject: [TC154-P:143] use of UTF-8 UNICODE II of IS 10646-1
>
> hello TC154,
>
> the MoU/MG on e-business has adopted the following Resolution:
>
> "The MoU/MG recommends that all e-business standards generated by member
> organisations should support the UTF-8 encoding defined in ISO 10646 [1]
> as the sole encoding for information exchange. Member organisations are
> invited to inform the MoU/MG [2] by end February 2002 of any business or
> technical barriers to implementation of this recommendation."
>
> [1] of JTC1/SC2
> [2] mailto:moumg@itu.int
>
>
> to get more, hook to the TC154 web (see below) and goto:
> ++ b. liaisons/partners
> ++ 1. internal
> ++ MoU on e-business
> ++ click on "IS 10646-1 UTF-8 UNICODE II"
>
> Bonne Année ;-) François
> ===
> ISO François Vuilleumier
> TC ISO/TC154 chair
> 154 c/o DGD, Monbijoustrasse 40, CH-3003 Berne
> ===
>
>
> --------------------------------------------------------------
> [INFORMAL response to TC154 TAG ...]
>
> [The following is an *informal* response to the issue raised below. If
> you'd like a formal NCITS/L8 position, Dan and I can get a formal
> statement. -FF]
>
>
> The use of UTF-8, for the purposes described above, is inadequate for
> "E-business standards", especially when TC154 has framed the question as
> "UTF-8 vs. UTF-16". The use of characters, especially in an
> international context, is a complex standardization process ... merely
> advocating ISO/IEC 10646-1 (the 2000 edition) or UTF-8 (or UTF-16) may
> still cause significant implementation and interoperability problems.
>
> In order to properly use international characters within international
> standards, one must consider several aspects:
>
> - The Conceptual Model/Framework. What characters are being
> used (independent of encoding)? This interoperability point is known as
> the "character set repertoire". For example, both ASCII and EBCDIC
> include the Latin alphabet in their repertoire, but their encodings are
> different. ISO/IEC 10646-1 specifies a repertoire.
>
> - The Character Encoding. There are several encodings of
> ISO/IEC 10646-1 and UTF-8 has its own strengths and weaknesses. For
> example, UTF-8 is space efficient for certain characters and inefficient
> for others; regardless, UTF-8 requires significantly more processing
> effort (with a greater programming error rate). In short, in the space
> vs. time trade-off, UTF-8 favors space efficiency. Meanwhile, UCS-4
> (another ISO/IEC 10646-1 encoding) is less space efficient (requires 4
> octets for each character), but is more efficient to process ... UCS-4
> favors time efficiency.
>
> - The Character Processing Model. Many systems process
> characters as "strings", i.e., a series of characters. Two types of
> processing are common: multibyte character strings (i.e., characters may
> vary in size, and character "state" information may be embedded) and wide
> character strings (i.e., characters are all the same size and there is
> no embedded "state" information). Note that the choice of character
> processing features (e.g., multibyte vs. wide) is independent of the
> choice of character encodings. Examples: Storage systems typically
> favor space efficiency, while processing (API) systems typically favor
> time efficiency.
>
> - The Whitespace/Record Processing Model. Even 35 years after
> the standardization of ASCII, improper handling of whitespace can cause
> interoperability problems. The same "whitespace problem" exists today
> with the processing and transfer of XML records: incomplete
> specification of whitespace semantics/processing can cause
> interoperability problems. (Read: Simply specifying UTF-8 is not
> enough.) For example, more attention needs to be focused on
> specification of line processing (e.g., newline boundaries),
> leading/trailing whitespace (may be converted/reduced/removed by
> processing systems), etc.
>
> - The Markup/Escaping Mechanism. There always will be a need to
> intermix "control" features with "data" features in character
> processing. Markup is at one end of the spectrum (e.g., the top-level
> feature is "control"; and "data" is subordinate), while escaping
> features are at the other end of the spectrum (e.g., the top-level
> feature is "data"; and an escaping mechanism signals "control"). One of
> the main hazards is managing the proper sequencing and precedence of
> substitutions implied by markup and escaping mechanisms (think "macro
> expansion" issues) ... and these issues are non-local behavioral issues.
>
> - The Localization (L10N) and Internationalization (I18N)
> Context/Features. Regardless of the character set chosen, improper
> consideration of L10N and I18N concerns may reduce international
> adoption, e.g., just specifying UTF-8 is not enough for handling the
> requirements of Japanese usage (<-- the problem is not UTF-8 per se, but
> lack of localization features).
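
The space side of the encoding trade-off described in the list above can be
checked directly. What follows is a minimal Python sketch, not anything taken
from the message itself: the sample strings are illustrative choices, the
helper name encoded_sizes is hypothetical, and UTF-32 is used as a stand-in
for UCS-4.

```python
# Sketch: the same characters (one repertoire) cost different numbers of
# octets depending on the encoding. The sample strings are illustrative
# assumptions.
SAMPLES = {
    "latin": "standard",
    "accented": "Bonne Ann\u00e9e",
    "japanese": "\u65e5\u672c\u8a9e",
}


def encoded_sizes(text: str) -> dict:
    """Return the octet count of `text` in each encoding."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),  # big-endian, no byte-order mark
        "ucs-4": len(text.encode("utf-32-be")),   # UTF-32 as a stand-in for UCS-4
    }


for name, text in SAMPLES.items():
    # Character count first, then octet counts per encoding.
    print(name, len(text), encoded_sizes(text))
```

For the Latin sample UTF-8 is the most compact of the three; for the Japanese
sample UTF-16 needs fewer octets than UTF-8; the UCS-4 column is always four
octets per character. This only illustrates the space half of the trade-off
and says nothing about processing cost.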
>
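
The whitespace point can be illustrated the same way: two records can both be
valid UTF-8 and still fail a byte-for-byte comparison until a whitespace
policy is agreed. The sketch below assumes one possible policy (unify line
endings, strip trailing blanks, drop empty lines at either end); the function
name normalize_record and the sample records are hypothetical, and no
standard is being quoted.

```python
def normalize_record(text: str) -> str:
    """Apply one possible whitespace policy (an assumption, not a standard).

    Line endings are unified to LF, trailing spaces and tabs are stripped
    from every line, and empty lines at the start and end are dropped.
    """
    unified = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [line.rstrip(" \t") for line in unified.split("\n")]
    return "\n".join(lines).strip("\n")


# The same logical record as written by two hypothetical systems.
sent = "title: UTF-8 report \r\nstatus: draft\r\n"
received = "title: UTF-8 report\nstatus: draft"

# Byte-level comparison of the two UTF-8 serializations.
print(sent.encode("utf-8") == received.encode("utf-8"))
# Comparison after both sides apply the same whitespace policy.
print(normalize_record(sent) == normalize_record(received))
```

The first comparison fails and the second succeeds, which is the sense in
which specifying the encoding alone leaves interchange underspecified.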
> Now, after considering the above features, users of international
> characters should consider how much specification is necessary ***and
> only specify that far, i.e., don't overspecify***. For example, in many
> data models, one may only need to specify the **repertoire** of
> ISO/IEC 10646-1 **without** specifying a particular encoding.
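
As a rough illustration of specifying a repertoire without an encoding, the
Python sketch below validates a value against a set of allowed code point
ranges and then shows that the choice of serialization does not change the
characters. The names in_repertoire, code_points, and LATIN_SUBSET are
hypothetical, and the ranges (Basic Latin plus Latin-1 Supplement) are only an
example subset, not one defined in the message.

```python
# Hypothetical repertoire subset: printable Basic Latin and Latin-1 Supplement.
LATIN_SUBSET = [(0x0020, 0x007E), (0x00A0, 0x00FF)]


def in_repertoire(text: str, allowed_ranges) -> bool:
    """True if every character's code point falls in one of the ranges."""
    return all(
        any(low <= ord(ch) <= high for low, high in allowed_ranges)
        for ch in text
    )


def code_points(text: str) -> list:
    """The encoding-independent view of a string: its code points."""
    return [ord(ch) for ch in text]


value = "Fran\u00e7ois"
print(in_repertoire(value, LATIN_SUBSET))     # repertoire check, no encoding involved
print(in_repertoire("\u65e5", LATIN_SUBSET))  # a character outside this subset

# Any of these encodings carries the same characters across the wire.
for encoding in ("utf-8", "utf-16", "utf-32"):
    decoded = value.encode(encoding).decode(encoding)
    print(encoding, code_points(decoded) == code_points(value))
```

The data model constraint lives entirely in in_repertoire; the encoding only
appears at the point of serialization and could be negotiated separately.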
>
> As another example, programming language APIs may have datatypes for
> native international character processing ... mandating UTF-8 might be
> problematic, error-prone, and inefficient (e.g., a UTF-8 class would
> probably be a poor choice for Java APIs).
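
One way to see the "error-prone" concern is to compare a native string type
with raw UTF-8 octets. The sketch below uses Python's str and bytes rather
than Java, purely as an illustration; the helper is_valid_utf8 and the sample
word are assumptions, not part of the message.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if `data` decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Ann\u00e9e"          # native string: a sequence of characters
raw = text.encode("utf-8")   # the same text as UTF-8 octets

print(len(text), len(raw))   # character count vs. octet count

# Taking "the first four characters" on the native type is safe.
print(text[:4])

# The same slice on the octets cuts the two-octet character in half.
print(is_valid_utf8(raw[:4]))
```

The native slice keeps the accented character intact, while the octet slice
leaves a fragment that no longer decodes. An API that exposes characters and
applies the encoding only at the I/O boundary avoids this class of error.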
>
> Thus, choosing (standardizing) the encoding of characters is less
> important in many cases than choosing its conceptual/functional model.
> In some cases, it may be appropriate to choose a specific encoding, but
> not for the purposes of "... the sole encoding for information exchange
> ...".
>
> A good example of a poorly defined specification is the IETF RFC 2426
> which defines the vCard. Even if UTF-8 (or UCS-4) were used, there
> would still be a problem with interchange of international characters
> because the ***character conceptual model*** is faulty, e.g., it is not
> possible to guarantee interoperability of non-ASCII characters in RFC
> 2426 (even if UTF-8 is used).
>
> ISO/IEC JTC1 has thorough knowledge and experience in standardizing and
> using the features listed above. The following are some starting points
> (a non-exhaustive list):
>
> Conceptual Framework: JTC1/SC2, JTC1/SC22, JTC1/SC34
> Character Encoding: JTC1/SC2
> Character Processing: JTC1/SC22, JTC1/SC2, JTC1/SC34, JTC1/SC32
> Whitespace/Record Processing: JTC1/SC22, JTC1/SC34
> Markup/Escaping: JTC1/SC34, JTC1/SC22, JTC1/SC2
> Localization/Internationalization: JTC1/SC22 WG14/WG15/WG20/WG21
>
> Thus, TC154 has incorrectly framed the problem as "UTF-8 vs. UTF-16".
> Before agreeing to "... UTF-8 as the sole encoding for information
> exchange ...", I recommend contacting JTC1 for feedback on technical
> issues.
>
> -FF
>
> _______________________________________________________________________
> Frank Farance, Farance Inc. T: +1 212 486 4700 F: +1 212 759 1605
> mailto:frank@farance.com http://farance.com
> Standards, products, services for the Global Information Infrastructure
>
> ________
> This mailing list may not be used for unlawful purposes. All postings
> should be relevant, but ITI accepts no responsibility for any posting
> and may terminate access to any subscriber violating any policies of the
> Association. Please review the JTC 1 TAG Antitrust Guidelines at
> <http://www.jtc1tag.org/policy/atrust.htm>.
>