Re: Embedded language ID pr

From: Mark Davis (mark_davis@taligent.com)
Date: Sat Sep 09 1995 - 20:17:44 EDT


Subject: RE>>Embedded language ID pr Time: 4:13 PM Date: 9/9/95

I guess a clearer question would be: what do you want to use language ids for,
and why is it that you don't use rich text in that context?

--------------------------------------
Date: 9/9/95 12:37 PM
To: Mark Davis
From: Mark Leisher
    Mark> Subject: RE>>Embedded language ID proposal Time: 5:42 PM
    Mark> Date: 9/8/95

    Mark> I am still unconvinced of the need to have language
    Mark> information in plain text; there are legitimate needs for
    Mark> that information, but there are needs for other particular
    Mark> attributes that go along with rich text, and it is hard to
    Mark> see why this one should be singled out.

For the most part I personally agree that language identifiers would
seem to belong most logically in markup.

But from a multilingual natural language processing perspective (and
perhaps others), having a single codeset with embedded language
identifier capability would provide an attractive reference text
representation.

Had the proposal not provided any utility for areas other than ours, I
doubt we would have bothered to present it other than as an
idiosyncrasy of our particular Unicode support implementation.

    Mark> In terms of commenting on these particular suggested private
    Mark> use implementations, the string scheme (LANG_ID_START text
    Mark> LANG_ID_END) has the very considerable drawback of
    Mark> introducing fr_FRgarbageen_US into data streams that don't
    Mark> recognize LANG_ID_START, LANG_ID_END. Using independent
    Mark> private use characters exclusively at least allows other
    Mark> implementations to filter them out without knowing
    Mark> bracketing semantics.

Telling point. I hadn't thought of that.
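
The filtering point can be illustrated with a minimal Python sketch. The
code point assignments below (LANG_ID_START, LANG_ID_END, TAG_BASE) are
hypothetical and chosen only for illustration; neither proposal in this
thread specifies them. The sketch assumes a receiver that knows nothing
about language tags and simply drops every character in the BMP Private
Use Area (U+E000..U+F8FF).

```python
# Hypothetical private-use assignments, for illustration only.
LANG_ID_START = "\uE000"
LANG_ID_END = "\uE001"
TAG_BASE = 0xE100  # hypothetical: tag letter c becomes chr(TAG_BASE + ord(c))


def is_private_use(ch):
    """True for code points in the BMP Private Use Area (U+E000..U+F8FF)."""
    return 0xE000 <= ord(ch) <= 0xF8FF


def strip_private_use(text):
    """Drop all private-use characters, with no knowledge of tag semantics."""
    return "".join(ch for ch in text if not is_private_use(ch))


def tag_bracketed(lang):
    """String scheme: LANG_ID_START, the tag as ordinary text, LANG_ID_END."""
    return LANG_ID_START + lang + LANG_ID_END


def tag_independent(lang):
    """Independent scheme: every letter of the tag is itself private-use."""
    return "".join(chr(TAG_BASE + ord(c)) for c in lang)


bracketed = tag_bracketed("fr_FR") + "garbage" + tag_bracketed("en_US")
independent = tag_independent("fr_FR") + "garbage" + tag_independent("en_US")

print(strip_private_use(bracketed))    # fr_FRgarbageen_US
print(strip_private_use(independent))  # garbage
```

With the string scheme the tag text survives the filter and pollutes the
data; with independent private-use characters the same filter removes the
tags completely.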

    Mark> As far as terminology goes, these are not combining
    Mark> characters: they are not positioned relative to a preceding
    Mark> base character; they are not positioned at all! They are
    Mark> more akin to the formatting characters such as RLM or ZWJ.

Our initial conclusion as well.
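
The distinction can be checked against character properties. The sketch
below uses Python's standard unicodedata module, which reflects the
current Unicode Character Database rather than the data available when
this message was written. Note that private-use code points carry the
general category Co, so treating them like formatting characters would be
a matter of private agreement, not something the database records.

```python
import unicodedata

samples = {
    "COMBINING ACUTE ACCENT": "\u0301",
    "RIGHT-TO-LEFT MARK (RLM)": "\u200F",
    "ZERO WIDTH JOINER (ZWJ)": "\u200D",
    "private-use U+E000": "\uE000",
}


def describe(ch):
    """Return (general category, canonical combining class) for ch."""
    return unicodedata.category(ch), unicodedata.combining(ch)


for label, ch in samples.items():
    category, ccc = describe(ch)
    print(f"{label}: category={category} combining_class={ccc}")
```

The combining accent is a nonspacing mark (Mn) with a nonzero combining
class, while RLM and ZWJ are format characters (Cf) with combining class 0.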
-----------------------------------------------------------------------------
mleisher@crl.nmsu.edu
Mark Leisher                    "The trick is not gaining the knowledge,
Computing Research Lab           but surviving the lessons."
New Mexico State University        -- "Svaha," Charles de Lint
Box 30001, Dept. 3CRL
Las Cruces, NM 88003

------------------ RFC822 Header Follows ------------------
Received: by taligent.com with SMTP;9 Sep 1995 12:34:22 -0800
Received: from taligent.com by mailserv.taligent.com (AIX 3.2/UCB 5.64/4.03)
          id AA36205; Sat, 9 Sep 1995 12:35:02 -0700
Received: from UNICODE.ORG by taligent.com with SMTP (5.67/23-Oct-1991-eef)
        id AA26899; Sat, 9 Sep 95 12:31:42 -0700
        for
Received: by Unicode.ORG (NX5.67c/NX3.0M)
        id AA25009; Sat, 9 Sep 95 12:23:10 -0700
Date: Sat, 9 Sep 95 12:23:10 -0700
From: unicode@Unicode.ORG
Message-Id: <9509091923.AA25009@Unicode.ORG>
Reply-To: mleisher@crl.nmsu.edu (Mark Leisher)
Errors-To: uni-bounce@Unicode.ORG
Subject: Re: Embedded language ID pr
To: unicode@Unicode.ORG


