RE: Text Editors and Canonical Equivalence (was Coloured diacritics)

From: Arcane Jill (arcanejill@ramonsky.com)
Date: Tue Dec 09 2003 - 10:00:17 EST


    Hmm. Now here's some C++ source code (syntax colored as Philippe
    suggests, to imply that the text editor understands C++ at least well
    enough to color it):

        int n = wcslen(L"café");

    (That's int n = wcslen(L"café"); for those without HTML email)

    The L prefix on a string literal makes it a wide-character string, and
    wcslen() is simply a wide-character version of strlen(). (There is no
    guarantee that "wide character" means "Unicode character", but let's
    just assume that it does, for the moment).

    So, should n equal four or five? The answer would appear to depend on
    whether the source file was saved in NFC (four characters) or in NFD
    (five characters).
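
    A minimal sketch of the two cases, with the code points spelled out as
    escapes so that the result does not depend on how the file was saved.
    The names kCafeNfc, kCafeNfd, nfc_length and nfd_length are invented
    for illustration, and the sketch keeps the assumption above that a
    wide character holds a Unicode character:

```cpp
#include <cstddef>
#include <cwchar>

// NFC spelling: 'c', 'a', 'f', then precomposed U+00E9.
const wchar_t* const kCafeNfc = L"caf\u00E9";

// NFD spelling: 'c', 'a', 'f', 'e', then combining acute accent U+0301.
const wchar_t* const kCafeNfd = L"cafe\u0301";

// Both helpers count wide characters up to the terminating null,
// exactly as the wcslen() call in the example above does.
std::size_t nfc_length() { return std::wcslen(kCafeNfc); }
std::size_t nfd_length() { return std::wcslen(kCafeNfd); }
```

    Under that assumption nfc_length() gives 4 and nfd_length() gives 5,
    although both literals would normally be displayed identically.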

    There is more to consider than just how and whether a text editor
    normalizes. If a text editor is capable of dealing with Unicode text,
    perhaps it should also be able to explicitly DISPLAY the actual
    composition form of every glyph. The question I posed in the previous
    paragraph should ideally be obvious by sight - if you see four
    characters, there are four characters; if you see five characters, there
    are five characters. This implies that such a text editor should display
    NFD text as separate glyphs for each character.
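
    One crude way to make the composition visible, sketched here as a
    plain code point dump rather than as real glyph rendering. The
    function name dump_code_units is hypothetical, and the sketch again
    assumes that each wide character holds one Unicode character:

```cpp
#include <cstdio>
#include <string>

// Render every wide character of `text` as "U+XXXX", separated by spaces,
// so that precomposed and decomposed spellings can be told apart by sight.
std::string dump_code_units(const std::wstring& text) {
    std::string out;
    for (wchar_t unit : text) {
        char buf[24];
        std::snprintf(buf, sizeof buf, "U+%04lX",
                      static_cast<unsigned long>(unit));
        if (!out.empty()) {
            out += ' ';
        }
        out += buf;
    }
    return out;
}
```

    For L"caf\u00E9" this yields "U+0063 U+0061 U+0066 U+00E9", and for
    L"cafe\u0301" it yields "U+0063 U+0061 U+0066 U+0065 U+0301".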

    On the other hand, such a text editor must also acknowledge that "é" and
    "e + U+0301" are actually equivalent. The /intention/ of canonical
    equivalence is that the glyphs should display the same - otherwise we'd
    need precomposed versions of, well, everything. So in other contexts, it
    should display them the same.

    Yuk. That's a lot to think about for anyone considering writing a
    programmers' text editor with /serious/ Unicode support.
    Jill

     -----Original Message-----
    From: Philippe Verdy [mailto:verdy_p@wanadoo.fr]
    Sent: Tuesday, December 09, 2003 2:04 PM
    To: jcowan@reutershealth.com
    Cc: Unicode@Unicode.Org
    Subject: RE: Coloured diacritics (Was: Transcoding Tamil in the
    presence of markup)

    I would not like to use any Unicode plain-text editor that implicitly
    normalizes the text without asking me, to work on programming source
    files or XML or HTML files. But I will accept it, if the editor really
    understands the language or XML syntax (and exhibits it to the user with
    syntax coloring).


