Re: Unicode 4.0 BETA available for review

From: Yung-Fong Tang (ftang@netscape.com)
Date: Thu Feb 27 2003 - 15:12:55 EST


    Kent Karlsson wrote:

    >>The Unicode 4.0 text further strengthens Conformance Clause
    >>C12, to make this crystal clear:
    >>
    >> "C12 When a process generates a code unit sequence which
    >> purports to be in a Unicode character encoding form, it shall
    >> not emit ill-formed code unit sequences.
    >>
    >> "C12a When a process interprets a code unit sequence which
    >> purports to be in a Unicode character encoding form, it
    >> shall treat ill-formed code unit sequences as an error
    >> condition, and shall not interpret such sequences as
    >> characters."
    >>
    >>And just in case anyone still has any trouble reading the
    >>painfully detailed specification of the UTF-8
    >>encoding form, an explicit note is included there:
    >>
    >> "* Because surrogate code points are not Unicode scalar
    >> values, any UTF-8 byte sequence that would otherwise
    >> map to code points D800..DFFF is ill-formed."
    >>
    >>So I don't think there is any hole here. If anyone still
    >>thinks that they can use these 3-octet/3-octet encodings
    >>of supplementary characters and call it UTF-8, then they
    >>are either engaging in wishful thinking or are not reading
    >>the standard carefully enough.
    >>
    The problem I need to deal with is not how to GENERATE such UTF-8, but how
    to handle this DATA when my code receives it. For example, when I receive a
    10K UTF-8 file which has 1000 lines of text, and one UTF-8
    sequence in line 990 is ill-formed, should I raise the "error" for
    1. the whole file (10K, 1000 lines),
    2. all the lines after line 989,
    3. line 990 itself,
    4. the text from the leading byte of that ill-formed UTF-8 sequence to the
    end of the file,
    5. the text from the leading byte of that ill-formed UTF-8 sequence
    to the end of line 990,
    6. the text from the leading byte of that ill-formed UTF-8 sequence to the
    next leading byte in line 990?

    There are other ways you can scope the ERROR. I could probably go on
    and on and give you 10-20 other ways to scope it if I spent 20 more
    minutes.

    I do believe the error handling should be application-specific.
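
To make the scoping question concrete, here is a minimal sketch in Python of how an application might pick its own error scope. The function names (first_ill_formed, error_scope) and the policy names are hypothetical, chosen only for illustration, and the sketch handles only the first ill-formed sequence. It relies on Python's strict UTF-8 decoder, which rejects 3-byte encodings of surrogate code points.

```python
from typing import Optional, Tuple


def first_ill_formed(data: bytes) -> Optional[Tuple[int, int]]:
    """Return (start, end) byte offsets of the first ill-formed sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding; encoded surrogates are rejected
    except UnicodeDecodeError as exc:
        return exc.start, exc.end
    return None


def error_scope(data: bytes, policy: str) -> Optional[Tuple[int, int]]:
    """Return the byte range treated as "in error" under the given policy.

    The policy names are illustrative assumptions, not taken from the standard:
      "file"     - the whole input
      "line"     - the line containing the ill-formed sequence
      "to_eof"   - from the leading byte to the end of the input
      "to_eol"   - from the leading byte to the end of that line
      "sequence" - only the bytes the decoder reports as ill-formed
    """
    bad = first_ill_formed(data)
    if bad is None:
        return None
    start, end = bad
    line_start = data.rfind(b"\n", 0, start) + 1  # 0 when there is no earlier newline
    line_end = data.find(b"\n", start)
    if line_end == -1:  # the ill-formed sequence is on the last, unterminated line
        line_end = len(data)
    scopes = {
        "file": (0, len(data)),
        "line": (line_start, line_end),
        "to_eof": (start, len(data)),
        "to_eol": (start, line_end),
        "sequence": (start, end),
    }
    return scopes[policy]


# ED A0 80 would map to the surrogate code point D800, so it is ill-formed UTF-8.
sample = b"line 1\nbad \xed\xa0\x80 here\nline 3\n"
for policy in ("file", "line", "to_eof", "to_eol", "sequence"):
    print(policy, error_scope(sample, policy))
```

Every policy starts from the same detection step and differs only in how much surrounding text it condemns, which is why the choice can be left to the application.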


