Re: UTS#40 (BOCU-1) ambiguity and possible serious bug about leading BOM

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sun Feb 04 2007 - 06:23:54 CST


    From: "Doug Ewell" <dewell@adelphia.net>
    > I covered all of this three years ago in Unicode Technical Note #14.
    > Look for the paragraph in the BOCU-1 section that begins "Because each
    > character..."
    >
    > It's possible to encode a signature safely in BOCU-1 by following it
    > with an FF reset byte, as you and Frank observed, but the spec
    > discourages FF resets.

    I raised this because what used to be a Technical Note is now presented as a draft for a future UTS (with the differences highlighted, notably the conformance requirement). When I read the document a long time ago, it had no such "draft" status, so the ambiguity was not really a problem. The licensing issue is also still unresolved. The final draft should therefore be listed on the Public Review page so that the exact wording can be fixed.

    The main risk comes from the ambiguity of the sentence: it does not state that the signature really is the code point U+FEFF encoded normally (i.e. that it changes the current state), and it does not specify whether the leading BOM is required or optional.

    If encoding the reset byte FF is not recommended, then the leading BOM should not be recommended either, because it amounts to concatenating an unrelated substring with the text. That is why I think that the BOM, if used, should preferably be followed by a reset byte, even if the rest of the document does not use any other reset byte.
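
To make the concern concrete, here is a minimal Python sketch of how a reader might strip a leading signature while keeping track of the decoder state. It is only an illustration under stated assumptions: the function name `strip_signature` and the idea of returning the state alongside the remaining bytes are mine, not anything defined by the BOCU-1 document. It relies on the signature being FB EE 28 (U+FEFF encoded from the initial state), on the initial state being 0x40, and on the usual rule that the state after a character outside the special-cased scripts is the middle of its 128-code-point block.

```python
from typing import Tuple

# Hypothetical helper for illustration; not an API from the BOCU-1 spec.
BOCU1_SIGNATURE = b"\xfb\xee\x28"   # U+FEFF encoded from the initial state
BOCU1_RESET = b"\xff"               # reset byte: back to the initial state
INITIAL_STATE = 0x40                # initial "previous code point" value

# Assumption: U+FEFF is encoded normally, so afterwards the state is the
# middle of its 128-code-point block, i.e. 0xFE80 + 0x40 = 0xFEC0.
STATE_AFTER_BOM = (0xFEFF & ~0x7F) + 0x40


def strip_signature(data: bytes) -> Tuple[bytes, int]:
    """Return (remaining bytes, state the decoder must start from)."""
    if not data.startswith(BOCU1_SIGNATURE):
        # No signature: decode everything from the initial state.
        return data, INITIAL_STATE
    rest = data[len(BOCU1_SIGNATURE):]
    if rest.startswith(BOCU1_RESET):
        # Signature followed by FF: the text is independent of the BOM.
        return rest[len(BOCU1_RESET):], INITIAL_STATE
    # Signature without FF: the text was encoded relative to U+FEFF, so it
    # cannot be decoded as if it started from the initial state.
    return rest, STATE_AFTER_BOM
```

With the FF reset, the remaining bytes can be handed to any decoder as a fresh BOCU-1 stream. Without it, the bytes after the signature are only meaningful to a decoder that starts from the post-BOM state (0xFEC0 under the assumption above), which is exactly why simply prepending FB EE 28 to an existing BOCU-1 text is unsafe.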


