Re: Markup for Language (was: Re: Exemplifying apostrophes)

From: Douglas Davidson (ddavidso@apple.com)
Date: Wed May 28 2008 - 12:06:11 CDT

    On May 28, 2008, at 4:48 AM, Behnam wrote:

    > Right now, on my text editor, I right click and select the
    > directionality of the paragraph. That's what I did on that picture
    > at the end (which shouldn't be confused with right alignment). This
    > doesn't go to the higher level and higher level shouldn't change
    > that (which unfortunately is not always the case).

    As it happens, the paragraph directionality in the case you mentioned
    is handled by a higher-level protocol. Your picture shows a rich-text
    document, for which the paragraph directionality is a feature of the
    paragraph style; its embodiment in a document varies with the format,
    but in the case of RTF it would use the \rtlpar control word to
    indicate RTL paragraphs, while for HTML it would use a dir="rtl"
    attribute.

    The alternative mechanism for representing this in plain text would be
    to insert a bidirectional control character, either RLM (U+200F
    RIGHT-TO-LEFT MARK) or LRM (U+200E LEFT-TO-RIGHT MARK), at the
    beginning of each directionally marked paragraph. These characters
    are not specifically marks of paragraph base direction, but each is
    a strong directional character, so its presence at the beginning of
    a paragraph is sufficient to set that direction. However, this is
    not the mechanism currently used in the case you mention.
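
    As a minimal sketch of this plain-text approach (in Python; the
    helper names mark_paragraph and marked_direction are hypothetical,
    chosen only for illustration), a paragraph could be prefixed with a
    mark and the mark read back later:

```python
import unicodedata

RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRM = "\u200e"  # LEFT-TO-RIGHT MARK

def mark_paragraph(text, rtl):
    """Prefix a paragraph with RLM or LRM (hypothetical helper)."""
    return (RLM if rtl else LRM) + text

def marked_direction(text):
    """Return "rtl" or "ltr" if the paragraph starts with a mark, else None."""
    if text.startswith(RLM):
        return "rtl"
    if text.startswith(LRM):
        return "ltr"
    return None

# Both marks are strong directional characters, which is why a leading
# mark is enough to establish the paragraph's base direction.
assert unicodedata.bidirectional(RLM) == "R"
assert unicodedata.bidirectional(LRM) == "L"
```

    Note that the mark is an ordinary character in the string, so any
    edit at the start of the paragraph can remove or displace it.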

    There are a number of reasons why the insertion of invisible control
    characters is an awkward solution for editing. Great care would need
    to be taken, for example, to make sure that control characters would
    not be accidentally deleted, or copied and pasted to inappropriate
    places. On the other hand, they would need to be carefully preserved
    in certain cases of copying, for example to make sure that copying an
    entire paragraph would preserve its directionality. These
    considerations would be especially important for control characters
    that appear in beginning and ending pairs. A "show invisibles" mode
    would probably be needed, just to assure sophisticated users that the
    control characters were properly positioned, but it would be likely to
    confuse the less sophisticated.

    Higher-level protocols, by contrast, are well suited to the needs of
    editing. They can naturally associate attributes with ranges of text,
    just as they do for style attributes such as fonts, underlines, and so
    forth. The problems of insertion, deletion, copying and pasting, and
    so forth are much more tractable. In general, higher-level protocols
    are more naturally expressive of the user's intent; in computer
    science terms, they separate controls from data, with the underlying
    Unicode character stream representing the data and the higher-level
    protocols representing the control information.
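
    The separation of control from data can be illustrated with a small
    sketch. The Paragraph type and delete_range function below are
    hypothetical, not taken from any particular editor; they only
    contrast an attribute-based model with a plain-text one under the
    same edit:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Paragraph:
    text: str       # the data: Unicode characters only
    direction: str  # the control information: "ltr" or "rtl"

def delete_range(para, start, end):
    """Delete text[start:end]; the direction attribute is untouched."""
    return replace(para, text=para.text[:start] + para.text[end:])

# Attribute-based model: deleting the first character keeps the direction.
para = Paragraph(text="abc", direction="rtl")
edited = delete_range(para, 0, 1)

# Plain-text model: the same edit at the start deletes the mark itself.
marked = "\u200f" + "abc"   # RLM followed by the text
edited_plain = marked[1:]   # the direction information is now lost
```

    In the first model the editor never has to decide whether a given
    deletion should preserve an invisible character; in the second it
    does, on every edit.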

    If one has control of the import and export processes, then it would
    be possible to take text in which information is internally
    represented using higher-level protocols, and export it to plain text
    with appropriate control characters inserted, or to import from plain
    text and replace the control characters with the internal
    representation. The use of control characters in plain text is a
    necessary fallback mechanism if plain text is all that is available
    and the text is not going to be edited or otherwise altered,
    provided that the processes receiving it are sufficiently
    Unicode-savvy to handle the control characters properly. However,
    some form of markup is increasingly available, and where it is, it
    is generally better to make use of it.
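
    A rough sketch of such an export and import step follows. It
    assumes an internal representation of (text, direction) pairs, one
    paragraph per line separated by "\n", and paragraph text that does
    not itself begin with a mark; the names export_plain and
    import_plain are hypothetical:

```python
RLM = "\u200f"  # RIGHT-TO-LEFT MARK
LRM = "\u200e"  # LEFT-TO-RIGHT MARK
MARKS = {"rtl": RLM, "ltr": LRM}

def export_plain(paragraphs):
    """Turn (text, direction) pairs into plain text with leading marks.

    direction is "rtl", "ltr", or None for an unmarked paragraph.
    """
    return "\n".join(MARKS.get(d, "") + t for t, d in paragraphs)

def import_plain(text):
    """Strip leading marks and recover (text, direction) pairs."""
    result = []
    for line in text.split("\n"):
        if line.startswith(RLM):
            result.append((line[1:], "rtl"))
        elif line.startswith(LRM):
            result.append((line[1:], "ltr"))
        else:
            result.append((line, None))
    return result

doc = [("abc", "rtl"), ("def", None), ("ghi", "ltr")]
assert import_plain(export_plain(doc)) == doc  # round trip is lossless
```

    The round trip holds only under the stated assumptions; text that
    passes through a process that strips or moves the marks would come
    back with its direction lost.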

    Douglas Davidson


