RE: Markup for Language (was: Re: Exemplifying apostrophes)

From: Phillips, Addison (addison@amazon.com)
Date: Thu May 29 2008 - 10:32:50 CDT

    > Now that we are on to things that don't work, I should mention that
    > unlike rtl, language identification to the paragraph doesn't work
    > either. It should only be applied to a string of characters. A
    > paragraph may contain several languages.
    >

    Language identification can be applied at many levels of a document. It can certainly be applied to a string of characters. It can also be usefully applied to sentences, paragraphs, chapters, sections, entire documents, and even collections of documents. (And a document need not be written: sound recordings, for example, often contain language.)

    There are at least two types of language identification (see [1]). For the kind you mean here, language identification can work at any appropriate level of granularity. This email, for example, is entirely in English. There is no point in marking up every single sentence, line, word, or character with a language tag when the Content-Language header for the whole message does the job nicely. Certainly a span of text can be in another language, and it should be tagged appropriately. But over-tagging increases complexity and burns bandwidth and storage to no good effect. Or, as we say in language tagging land, "Tag Content Wisely".
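
    To make the granularity point concrete, here is a minimal sketch in Python. It assumes the text has already been split into (text, language tag) runs by some other means. The function name tag_runs and the choice of HTML lang attributes as the output format are illustrative assumptions, not anything prescribed by BCP 47 or by [1]. The default language is declared once on the enclosing element, and only the runs that differ from it get their own tag.

```python
from html import escape


def tag_runs(runs, default_lang):
    """Return an HTML paragraph that declares default_lang once and
    wraps only the runs whose language differs from that default.

    runs is a list of (text, lang) pairs; tag_runs is a hypothetical
    helper written for illustration only.
    """
    parts = []
    for text, lang in runs:
        # Language tags are compared case-insensitively.
        if lang.lower() == default_lang.lower():
            # Same language as the paragraph: no extra markup needed.
            parts.append(escape(text))
        else:
            # A span in another language gets its own lang attribute.
            parts.append('<span lang="{}">{}</span>'.format(escape(lang), escape(text)))
    body = "".join(parts)
    return '<p lang="{}">{}</p>'.format(escape(default_lang), body)


if __name__ == "__main__":
    runs = [
        ("The motto is ", "en"),
        ("Liberté, égalité, fraternité", "fr"),
        (".", "en"),
    ]
    print(tag_runs(runs, "en"))
```

    A paragraph that is entirely in the default language comes out with a single lang attribute and no spans at all, which is the whole point: the tag goes at the highest level where it is still true.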

    Best Regards,

    Addison

    [1] http://www.w3.org/TR/i18n-html-tech-lang/

    Addison Phillips
    Chair, W3C Internationalization Core WG
    Editor, BCP 47 (Language Tags)

    Internationalization is not a feature.
    It is an architecture.

    > -----Original Message-----
    > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > Behalf Of Behnam
    > Sent: Thursday, May 29, 2008 5:14 AM
    > To: Richard Wordingham
    > Cc: Unicode Mailing List
    > Subject: Re: Markup for Language (was: Re: Exemplifying apostrophes)
    >
    >
    > On 28-May-08, at 11:20 PM, Richard Wordingham wrote:
    >
    > > Douglas Davidson wrote on Wednesday, May 28, 2008 at 6:06 PM
    > >
    > >> The alternative mechanism for representing this in plain text
    > >> would be to insert a bidirectional control character, either RLM
    > >> or LRM, at the beginning of each directionally marked paragraph.
    > >> These characters are not specifically marks of paragraph base
    > >> writing directionality, but their presence at the beginning of a
    > >> paragraph would be sufficient to indicate it. However, this is
    > >> not the mechanism currently used in the case you mention.
    > >
    > > They don't quite work. The problem comes with a string of neutrals
    > > between a strong LTR and a strong RTL character. Their ordering
    > > may depend on the directionality of the paragraph, which may depend
    > > on a 'higher level' protocol (e.g. 'always left-to-right').
    > > Initial RLM and LRM work if one is free of such a higher level
    > > protocol; otherwise one has to stick these marks in whenever
    > > neutrals are not bracketed by characters of the same directionality.
    > >
    > > Richard.
    >
