Frank da Cruz wrote:
> [I]f we look at a Unicode
> text file and see CR and/or LF in it, we don't know if those
> characters came from the private text format of a 7- or 8-bit file
> that was converted to Unicode without any record-format conversion,
> or if they are the "Unicode" CR and LF.
The semantics of CR and LF in Unicode 2.x *are* the ambiguous
ones inherited from the 7-bit controls; there are no other semantics.
But this has been changed in Unicode 3.0: see UTR #13
(http://www.unicode.org/unicode/reports/tr13/), which will be a
normative part of Unicode 3.0. Note well that UTR #13 prescribes
the semantics of CR and LF not only during conversion to and from
Unicode, but also *in* Unicode itself.
XML, a major Unicode application, takes almost the same point of view.
(IMHO, XML should be modified to accept LS as a line-end character.)
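
To make that viewpoint concrete, here is a minimal sketch, assuming
that a reader simply treats every line-end sequence as equivalent and
folds them all to one form. The function name and the exact set of
sequences matched (CRLF, CR, LF, NEL, LS) are my assumptions for
illustration, not a transcription of UTR #13.

```python
import re

# Assumed set of line-end sequences. CRLF is listed first so that it
# matches as a single line end rather than as CR followed by LF.
_LINE_END = re.compile("\r\n|\r|\n|\u0085|\u2028")

def normalize_line_ends(text, target="\n"):
    """Replace every recognized line-end sequence in text with target."""
    # A function is used as the replacement so that target is inserted
    # literally, with no backslash processing by the re module.
    return _LINE_END.sub(lambda match: target, text)
```

Calling normalize_line_ends(text, "\u2028") instead would fold
everything to LS, for anyone who prefers that as the internal form.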
> Therefore this would only
> move the problem of incompatible record formats from the old world
> (of DOS, Windows, UNIX, Macintosh) to the new one.
Indeed. But the only real problem there is that some people and
applications (notably nroff, in its output) use bare CR in plain text
to produce physical or notional overprinting. Otherwise, it
is perfectly fine to take the UTR #13 viewpoint.
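
The overprinting case can at least be flagged before conversion. The
following is a rough sketch, assuming that "bare CR" means a CR not
immediately followed by LF; the function name is hypothetical. It
cannot tell overprinting from Macintosh-style line ends by itself, so
it only reports candidates.

```python
import re

# A CR that is not immediately followed by LF.
_BARE_CR = re.compile("\r(?!\n)")

def find_bare_cr(text):
    """Return the index of every CR in text that is not part of CRLF."""
    return [match.start() for match in _BARE_CR.finditer(text)]
```

If a file also contains LF or CRLF line ends, any positions reported
here are more likely to be overprinting than line ends.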
> It's better to have Unicode characters LS and PS (and I think also
> Tab/Column-Separator and Page Separator) than to recycle the C0
> controls. This ensures round-trip integrity without having to know
> the history of the data ("it came originally from DOS so to convert
> it from Unicode to UNIX we need to...")
As for HT and FF, nobody uses them incompatibly, and
introducing new characters for them is supererogation at best.
-- 
John Cowan    http://www.ccil.org/~cowan    cowan@ccil.org
    Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
    Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
            -- Coleridge / Politzer