W3C Character Model (was Re: Unicode Search Engines)

From: David Hopwood (david.hopwood@zetnet.co.uk)
Date: Thu Feb 21 2002 - 19:21:22 EST

Previous message: Chris Pratley: "RE: CRLF vs. LF (was Re: Unicode and end users)"
In reply to: Mark Davis: "Re: Unicode Search Engines"
Next in thread: Marco Cimarosti: "RE: Unicode Search Engines"

-----BEGIN PGP SIGNED MESSAGE-----

Mark Davis wrote:
>
> > > Documents not in UTF-* are normalized by definition, unless it is
> > > *impossible* to convert them to normalized Unicode (typically
> > > because they contain characters not yet present in Unicode).
>
[...]
> Simply saying that a document is "normalized by definition" if it is
> *possible* to convert it to Unicode would ignore reality, since
> converters may not *actually* convert it to normalized Unicode. One
> would have to have the additional requirement in the Character Model,
> that any XML parser that converts an XML document from a legacy
> character set into Unicode is not conformant unless it is (actually)
> normalizing.

That requirement is already in the Character Model:

<http://www.w3.org/TR/2002/WD-charmod-20020220/>

# 4.2.2 Include-normalized Text
[...]
# Text data is include-normalized if:
#
# 1. the data is Unicode-normalized and does not contain any character
# escapes or includes whose expansion would cause the data to become
# no longer Unicode-normalized; or
#
# 2. the data is in a legacy encoding and, if it were transcoded to a
# Unicode encoding form by a normalizing transcoder, the resulting
# data would satisfy clause 1 above.
#
# NOTE: A consequence of this definition is that legacy text (i.e. text
# in a legacy encoding) is always include-normalized unless i) a
# normalizing transcoder cannot exist for that encoding (e.g. because
# the repertoire contains characters not in Unicode) or ii) the text
# contains escapes or includes which, once expanded, result in
# un-normalized text.
[...]
# 4.2.3 Fully Normalized Text
[...]
# Text data is fully normalized if it is include-normalized and none of
# the spans composing the text begin with a non-starter character.
#
# In the remainder of this specification, normalized is used to mean
# 'fully normalized', unless otherwise indicated.
[...]
# 4.3 Responsibility for Normalization
[...]
# [C] All text content on the Web MUST be in include-normalized form and
# SHOULD be in fully normalized form.
#
# [S] Specifications of text-based formats and protocols MUST, as part of
# their syntax definition, require that the text be in normalized form.
[...]
# [I] Implementations which transcode text data from a legacy encoding
# to a Unicode encoding form MUST use a normalizing transcoder.
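
To illustrate the difference between a plain transcoder and a normalizing
one, here is a minimal Python sketch. It assumes that "Unicode-normalized"
means NFC, and it uses windows-1258 (Python codec "cp1258") as the legacy
encoding only because that charset contains combining marks. The function
names are made up for the example and are not from the Character Model.

```python
import unicodedata

def plain_transcode(data: bytes, encoding: str) -> str:
    """Decode only; the result may or may not be Unicode-normalized."""
    return data.decode(encoding)

def normalizing_transcode(data: bytes, encoding: str) -> str:
    """Decode, then apply NFC, so the output is always Unicode-normalized."""
    return unicodedata.normalize("NFC", data.decode(encoding))

# Legacy bytes for 'a' followed by a combining acute accent.
legacy = "a\u0301".encode("cp1258")

raw = plain_transcode(legacy, "cp1258")        # 'a' + U+0301, not in NFC
nfc = normalizing_transcode(legacy, "cp1258")  # U+00E1, in NFC
```

A converter like plain_transcode is the kind Mark describes: the legacy
text could be converted to normalized Unicode, but this converter does
not actually do so.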

I don't think that implicitly redefining 'normalized' as 'fully normalized'
in most of the document is a good idea - it should be spelt out explicitly.
Also, 'fully normalized' doesn't appear to be defined correctly for legacy
charsets; it should be defined like this:

  1. the data is Unicode-normalized, does not contain any character
     escapes or includes whose expansion would cause the data to become
     no longer Unicode-normalized, and none of the spans composing the
     text begin with a non-starter character; or

  2. the data is in a legacy encoding and, if it were transcoded to a
     Unicode encoding form by a normalizing transcoder, the resulting
     data would satisfy clause 1 above.
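
As a rough sketch of how the two clauses above could be checked, assuming
that "Unicode-normalized" means NFC, that a "non-starter" is a character
with a non-zero canonical combining class, and that the text is given as a
list of spans. Character escapes and includes are not modelled here, and
the function names are hypothetical:

```python
import unicodedata
from typing import Sequence

def begins_with_non_starter(span: str) -> bool:
    # Assumption: non-starter == canonical combining class != 0.
    return bool(span) and unicodedata.combining(span[0]) != 0

def satisfies_clause_1(spans: Sequence[str]) -> bool:
    # Each span must already be in NFC and must not begin with a
    # non-starter. Escapes and includes are ignored in this sketch.
    return all(
        unicodedata.normalize("NFC", s) == s and not begins_with_non_starter(s)
        for s in spans
    )

def satisfies_clause_2(spans: Sequence[bytes], encoding: str) -> bool:
    # Transcode each legacy span with a normalizing transcoder
    # (decode, then NFC), then apply clause 1 to the result.
    transcoded = [unicodedata.normalize("NFC", s.decode(encoding)) for s in spans]
    return satisfies_clause_1(transcoded)

# A legacy span that begins with a combining acute accent fails clause 2,
# even though a normalizing transcoder exists for the encoding.
bad_span = "\u0301abc".encode("cp1258")
print(satisfies_clause_2([bad_span], "cp1258"))  # False
```

The point of clause 2 is that the non-starter condition is tested on the
transcoded data, so it applies to legacy text in the same way as to text
already in a Unicode encoding form.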

I'll have to submit some comments about this.

- --
David Hopwood <david.hopwood@zetnet.co.uk>

Home page & PGP public key: http://www.users.zetnet.co.uk/hopwood/
RSA 2048-bit; fingerprint 71 8E A6 23 0E D3 4C E5 0F 69 8C D4 FA 66 15 01
Nothing in this message is intended to be legally binding. If I revoke a
public key but refuse to specify why, it is because the private key has been
seized under the Regulation of Investigatory Powers Act; see www.fipr.org/rip

-----BEGIN PGP SIGNATURE-----
Version: 2.6.3i
Charset: noconv

iQEVAwUBPHMdxzkCAxeYt5gVAQFl6Af+KJDkFbihALZ5KI9AXTVxxJvI5kwZjaT3
M3iiWQoo1eLoRSbjkLJdC0odr3XIxS4FRlrqL842ZwyRM6iRizUyoRqa0LWLzcjv
SOCVywFxuHRR723IPgePjrgNIKSbLRTjVt3m20mHTjncN9MdOV28EiBi1IVcr92h
TKzp/UkEkS7lyzUYV+dIV6X8WflE2ej/Wwpkshyu8pFOtP5mTPqYg2aZw5JX4oSK
Rx0CMmtRek3mxNZ/vVHOM3VZVGhxS5LjH8okwtInFcQ6MJBPXKbt7Zw/sKVnbbMc
2BNxI+cmIikti6sUgy34MJscygLRXYSxNb/t0Q7NuAbMRNwsG5QkWw==
=56c2
-----END PGP SIGNATURE-----

This archive was generated by hypermail 2.1.2 : Thu Feb 21 2002 - 18:51:49 EST