From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Tue Dec 09 2003 - 08:17:05 EST
> -----Original Message-----
> From: Peter Kirk [mailto:peterkirk@qaya.org]
> Sent: Tuesday, December 9, 2003 13:17
> To: verdy_p@wanadoo.fr
> Cc: Unicode@Unicode.Org
> Subject: Re: Coloured diacritics (Was: Transcoding Tamil in the presence
> of markup)
>
>
> On 09/12/2003 03:41, Philippe Verdy wrote:
>
> >Peter Kirk writes:
> >
> >
> >>Philippe, you have now stated this (several times). But just a day
> >>earlier you yourself stated that the rule forbidding combining marks at
> >>the start of a string would never be relaxed because it is fundamental
> >>to the XML containment model. You don't usually contradict yourself
> >>quite so obviously.
> >>
> >>
> >
> >I don't know how you interpreted what I may have said a few days before.
> >I have certainly not said that XML forbids combining marks at the start
> >of XML, just that W3C does not _recommend_ it, nor any other
> >defective combining sequences, as they are known to cause problems
> >(for example when it's difficult to track the effective text file type).
> >
> >
> So, let's get this clear. Within an XML or HTML document, if I want an e
> with a red acute accent on it, it is quite permissible to write:
>
> e<span class="red-text">{U+0301}</span>
>
> where {U+0301} is replaced by the actual Unicode character, and
> "red-text" is defined in the stylesheet. So it is not a problem that
> there is a defective combining sequence, nor that the accent is not
> combined with the e as it would be in NFC. Is that correct?

That's right: the text element within <span> contains just the string with
the isolated diacritic, which is already in NFC even though it is a
defective combining sequence. And it must not be parsed by forming a
combining sequence that includes the ">" terminating the <span> tag:
interpretation of combining sequences is only valid within plain text,
and thus excludes the syntactic characters used in XML.
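
To make this concrete, here is a minimal Python sketch, assuming the
standard-library xml.etree.ElementTree parser and a made-up one-line
document (the <p> wrapper and the variable names are only illustrative).
It shows that the parser delimits text by markup before any Unicode
processing, and that NFC leaves the isolated accent unchanged:

```python
import unicodedata
import xml.etree.ElementTree as ET

ACUTE = "\u0301"  # COMBINING ACUTE ACCENT

# Hypothetical document: a base letter followed by a styled, isolated accent.
doc = f'<p>e<span class="red-text">{ACUTE}</span></p>'

root = ET.fromstring(doc)
span = root.find("span")

# The parser splits text at markup boundaries, so "e" and the accent
# end up in separate strings and are never composed with each other.
base_text = root.text
accent_text = span.text

# NFC leaves a lone combining mark untouched: the string contains no
# starter for it to compose with.
accent_is_nfc = unicodedata.normalize("NFC", accent_text) == accent_text

print(repr(base_text), repr(accent_text), accent_is_nfc)
```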

Note that this is not specific to XML. Any "text/*" format that is not
plain text (notably programming source files, shell scripts, HTML files,
stylesheets, and JavaScript files) should be handled this way: the
syntax of the language governs how the file is parsed, and Unicode
definitions apply only afterwards, to the tokens parsed according to
that language.
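
As a rough illustration of "syntax first, Unicode second", the sketch
below splits a string into tags and text runs and normalizes only the
text runs. The regular expression is a deliberately crude stand-in for a
real parser (it ignores comments, CDATA sections, and ">" inside
attribute values), and the function name is hypothetical:

```python
import re
import unicodedata

# Crude stand-in for a real parser: anything between "<" and ">" is a tag.
TAG = re.compile(r"(<[^>]*>)")

def normalize_text_runs(source: str) -> str:
    """Apply NFC to the text between tags, leaving the tags untouched."""
    parts = TAG.split(source)
    out = []
    for i, part in enumerate(parts):
        # With one capturing group, re.split alternates text, tag, text, ...
        if i % 2 == 1:
            out.append(part)  # markup: copied through verbatim
        else:
            out.append(unicodedata.normalize("NFC", part))  # text run
    return "".join(out)

sample = "<p>e\u0301<span>\u0301</span></p>"
result = normalize_text_runs(sample)
print(ascii(result))
```

Here the "e" plus accent inside one text run is composed, while the
accent isolated in its own <span> is left alone and never combines with
the preceding ">".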

So normalization should never be performed on whole files that are not
explicitly of type "text/plain" (as indicated either by explicit metadata
such as MIME headers during transmission, or locally by OS-specific
conventions such as the ".txt" file extension).
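
The hazard can be shown in a few lines of Python. This is only a sketch
with an invented sample string: text content that begins with U+0338
COMBINING LONG SOLIDUS OVERLAY sits right after a tag, and normalizing
the whole file as if it were plain text composes the tag's ">" with that
mark:

```python
import unicodedata
import xml.etree.ElementTree as ET

def is_well_formed(xml_text: str) -> bool:
    """Return True if the standard-library parser accepts the string."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# Text content that begins with U+0338 COMBINING LONG SOLIDUS OVERLAY.
original = "<b>\u0338x</b>"

# Treating the file as plain text composes ">" + U+0338 into
# U+226F NOT GREATER-THAN, which destroys the end of the start tag.
damaged = unicodedata.normalize("NFC", original)

print(ascii(original), is_well_formed(original))
print(ascii(damaged), is_well_formed(damaged))
```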

When in doubt, for example in CVS repositories or in diff/merge tools,
normalization must not be performed, and the current encoding form of
text files must be preserved, whenever the tool does not implement
an accurate parser for the syntactic and lexical rules of the effective
file type or language. That language may or may not accept defective
combining sequences as valid plain-text strings. (This includes
identifiers; however, Unicode recommends a list of characters that can
start an identifier, and this list excludes all non-starter combining
characters.)
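
For the identifier case, a small Python sketch may help. It assumes that
Python's own str.isidentifier(), which is documented as following the
Unicode XID_Start / XID_Continue properties, is an acceptable proxy for
the Unicode recommendation; the helper name is hypothetical:

```python
import unicodedata

def starts_with_non_starter(s: str) -> bool:
    """True if the first character has a non-zero canonical combining class."""
    return bool(s) and unicodedata.combining(s[0]) != 0

good = "e\u0301t\u00e9"   # base letter first, combining accent second
bad = "\u0301te"          # defective: begins with a combining accent

# A combining mark may continue an identifier but may not start one.
print(good.isidentifier(), starts_with_non_starter(good))
print(bad.isidentifier(), starts_with_non_starter(bad))
```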
This archive was generated by hypermail 2.1.5 : Tue Dec 09 2003 - 09:27:05 EST