Re: Correct definition for an "isLatin1()" function

From: Antoine Leca (Antoine.Leca@renault.fr)
Date: Fri Oct 06 2000 - 11:59:10 EDT


Frank da Cruz wrote:
>
> > "Rogers, Paul" wrote:
> >
> > > We're whipping up a little function named isLatin1() that returns true if
> > > the (UCS-2) string in question is "all Latin1".
> >
> > [snip]
> >
> > > In other words, should we exclude the C0, C1, and Latin Extended code
> > > values?
> >
> > Including or excluding C0 and C1 is a matter of taste. If you mean
> > "strictly containing characters in ISO 8859-1", then they're out.
> > If you mean "representable in typical Latin-1 text files", then at least
> > C0 is in, and C1 will do no great harm. (Provided your Unicode
> > characters don't originate from incorrect transcoding from CP 1252.)
>
> Amen. More chaos and confusion from our friend CP1252. If a C1 byte was
> intended as a control character (such as NL, which is actually used in
> some places), then, by some definitions, the file that contains it might
> be considered Latin-1.

If you go this way, then an otherwise valid ISO 8859-1 file that contains
ESC sequences switching to any character set from the ISO 2375 registry
_other_ than ISO 646 IRV (nr.6, a.k.a. ASCII, ESC ( B) and the
Latin1 supplement (nr.100, ESC - A) should also be excluded...

Obviously, this is not correct. The input to Paul's function should
be in correct Unicode format (and should not result from incorrect transcoding,
which was John's point). So if there are U+001B characters embedded in the
stream, it means that they are *not* to be considered as escape sequences as
per ISO 2022. Similarly, U+0085 embedded in the stream *should* be considered
the Next Line character, and nothing else.
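
To make the two readings concrete, here is a minimal sketch in Python (chosen only for brevity). The name is_latin1 and the strict flag are illustrative assumptions of mine, not anything from Paul's code. With strict off, every code value from U+0000 to U+00FF passes, controls included; with strict on, only the graphic characters of ISO 8859-1 pass. In both cases U+001B and U+0085 are judged as plain characters, never interpreted.

```python
def is_latin1(text: str, strict: bool = False) -> bool:
    """Return True if every character of text falls in the Latin-1 range.

    strict=False: accept U+0000..U+00FF, C0 and C1 controls included.
    strict=True:  accept only the graphic characters of ISO 8859-1,
                  that is U+0020..U+007E and U+00A0..U+00FF.
    """
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            # Outside the first 256 code values: Latin Extended, etc.
            return False
        if strict and (cp < 0x20 or 0x7F <= cp <= 0x9F):
            # C0 controls, DEL, and C1 controls are not graphic characters.
            return False
    return True
```

Which default is right depends on the "matter of taste" discussed above; the sketch only shows that the two definitions differ by one range test.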

> If, on the other hand, it was intended to be a
> "smart quote" or somesuch, it can NOT be Latin-1. Unfortunately, computers
> have not yet reached the level of sophistication needed for mind reading.

Computers have nothing to do with programmers who bypass specifications, IMHO.
If the input is specified as UCS-2, then supplying incompletely converted
CP1252 characters should reap what it deserves: a gunshot in the foot.
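
For what "incompletely converted" means in practice, a small Python illustration (the sample bytes and variable names are mine, purely hypothetical): the same two CP1252 "smart quote" bytes end up as C1 controls when widened as if they were Latin-1, and as U+201C / U+201D when transcoded properly.

```python
# CP1252 bytes for a word wrapped in curly double quotes.
raw = b"\x93quoted\x94"

# Incomplete conversion: each byte is widened as if it were Latin-1,
# so the quotes land on the C1 controls U+0093 and U+0094.
wrong = raw.decode("latin-1")

# Proper conversion: the CP1252 table maps them to U+201C and U+201D.
right = raw.decode("cp1252")

def code_points(s: str) -> list[str]:
    """Format each character of s as U+XXXX for inspection."""
    return [f"U+{ord(c):04X}" for c in s]

print(code_points(wrong)[0], code_points(wrong)[-1])
print(code_points(right)[0], code_points(right)[-1])
```

Only the second string is what the sender meant, and it is not all-Latin-1; the first one is all below U+0100 but carries the wrong characters.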

 
Antoine


