Re: OT: Correct definition for an "isLatin1()" function

From: Michael \(michka\) Kaplan (michka@trigeminal.com)
Date: Thu Oct 05 2000 - 15:06:24 EDT


<RANT>
The assumption here is that the function will be run on Unicode text.
Therefore, the various industrial and other code pages are irrelevant.
Microsoft does not convert the characters it has in the control code range
to those same code points in Unicode, does it? Indeed, a MultiByteToWideChar
call on these code points using cp1252 does not leave them as control codes.
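
To make the point concrete, here is a small sketch in Python rather than the Win32 API. It assumes Python's built-in "cp1252" and "latin-1" codecs are a fair stand-in for the two interpretations of the same bytes; the variable names are only for illustration.

```python
# Three bytes from the 0x80-0x9F range that cp1252 assigns to
# printable characters (ellipsis and the two "smart" double quotes).
raw = bytes([0x85, 0x93, 0x94])

# Read as cp1252, they become punctuation well outside U+0080-U+009F.
as_cp1252 = raw.decode("cp1252")

# Read as ISO 8859-1, each byte maps to the code point with the same
# value, i.e. a C1 control character.
as_latin1 = raw.decode("latin-1")

print([f"U+{ord(c):04X}" for c in as_cp1252])
print([f"U+{ord(c):04X}" for c in as_latin1])
```

So text that was converted correctly from cp1252 should not contain C1 code points for these characters at all.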

No need to let this degenerate into a "Why Microsoft (and its code pages)
suck" discussion, truly. However, there are several newsgroups:

alt.destroy.microsoft
alt.conspiracy.microsoft
alt.microsoft.sucks

and others that would work well for that sort of thing. :-)

</RANT>

Ok, I feel better now. Back to the Unicode List.

michka

Michael Kaplan
Trigeminal Software, Inc.
http://www.trigeminal.com/

----- Original Message -----
From: "Frank da Cruz" <fdc@columbia.edu>
To: "Unicode List" <unicode@unicode.org>
Cc: "Unicode List" <unicode@unicode.org>
Sent: Thursday, October 05, 2000 11:11 AM
Subject: Re: Correct definition for an "isLatin1()" function

> > "Rogers, Paul" wrote:
> >
> > > We're whipping up a little function named isLatin1() that returns true if
> > > the (UCS-2) string in question is "all Latin1".
> >
> > [snip]
> >
> > > In other words, should we exclude the C0, C1, and Latin Extended code
> > > values?
> >
> > Including or excluding C0 and C1 is a matter of taste. If you mean
> > "strictly containing characters in ISO 8859-1", then they're out.
> > If you mean "representable in typical Latin-1 text files", then at least
> > C0 is in, and C1 will do no great harm. (Provided your Unicode
> > characters don't originate from incorrect transcoding from CP 1252.)
> >
> Amen. More chaos and confusion from our friend CP1252. If a C1 byte was
> intended as a control character (such as NL, which is actually used in
> some places), then, by some definitions, the file that contains it might
> be considered Latin-1. If, on the other hand, it was intended to be a
> "smart quote" or somesuch, it can NOT be Latin-1. Unfortunately, computers
> have not yet reached the level of sophistication needed for mind reading.
>
> Perhaps if you know the history of the data, you have some idea of what
> C1 byte values are supposed to represent. If the file was converted to
> UCS-2 from single-byte character sets, the history is important (and the
> precise conversion algorithm). If the data is UCS-2 ab initio, then
> U+0080-009F are well defined: they are C1 controls. Strictly speaking,
> since the data is UCS-2 now, they are C1 controls anyway.
>
> Of course there's also the issue of combining sequences. Unless your
> data is guaranteed to already be in Normalization Form C, your isLatin1()
> function will have to include the entire normalization process, which
> involves lookahead, database lookups, sorts, and more database lookups,
> as described in the Unicode Technical Reports.
>
> - Frank
>
>
>
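
For reference, the choices discussed in the quoted message (C0 in or out, C1 in or out, normalize first or not) can be sketched as one small function. This is only an illustration under stated assumptions, not anyone's actual implementation: the name is_latin1 and its three flags are hypothetical, it is written in Python using the standard unicodedata module, and grouping DEL (U+007F) with the C0 controls is my own choice.

```python
import unicodedata

def is_latin1(text, allow_c0=True, allow_c1=True, normalize=True):
    """Return True if every character of text fits in U+0000-U+00FF.

    allow_c0 / allow_c1 are hypothetical flags for the "matter of
    taste" question of whether control characters count as Latin-1.
    """
    if normalize:
        # Compose combining sequences first, so that "e" followed by
        # U+0301 is tested as the single character U+00E9.
        text = unicodedata.normalize("NFC", text)
    for ch in text:
        cp = ord(ch)
        if cp > 0xFF:
            return False  # outside the Latin-1 range entirely
        if (cp <= 0x1F or cp == 0x7F) and not allow_c0:
            return False  # C0 control (DEL grouped here by assumption)
        if 0x80 <= cp <= 0x9F and not allow_c1:
            return False  # C1 control
    return True

print(is_latin1("caf\u00e9"))                     # precomposed e-acute
print(is_latin1("cafe\u0301"))                    # decomposed, normalized first
print(is_latin1("cafe\u0301", normalize=False))   # decomposed, tested as-is
print(is_latin1("\u201cquoted\u201d"))            # smart quotes
```

Passing allow_c0=False and allow_c1=False gives the strict "only characters in ISO 8859-1" reading; the defaults give the looser "representable in a typical Latin-1 text file" reading.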



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:21:14 EDT