RE: Eudora (was: Is there Unicode mail out there?)

From: Jungshik Shin (jshin@mailaps.org)
Date: Fri Jul 13 2001 - 22:19:45 EDT


On Fri, 13 Jul 2001, Carl W. Brown wrote:

  Carl,

> > What makes me annoyed is that programs like Eudora lie about
> > MIME charset (i.e. it declares it's sending out ISO 8859-1 while it
> > actually sends out Windows-1252).
>
> I have no problem sending it our with a " Windows-1252" character set. If
> you convert to iso-8859-1 you lose characters that is just as bad as sending
> Windows-1252 out as iso-8859-1.

  Well, most characters (smart quotes, bullets, etc.) that are present
in Windows-1252 but not representable in ISO 8859-1 can be transliterated
if ISO 8859-1 is the only option. If it is not, you can always convert
Windows-1252 to UTF-8 before sending it out. MacOS is a
better citizen of the Internet than MS-Windows in that respect (although
it may have been forced to be that way partly because of its small market
share). Most, if not all, Internet tools running under MacOS do not leak
any Mac-specific encodings onto the wire. Of course, I can't turn
the tide, and Windows-1252 is ubiquitous. It's just my wishful thinking
that it would have been better if we had skipped Windows-1252 and company
in going from ISO 8859-x to UTF-8.
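
The two options above (convert to UTF-8, or transliterate down to
ISO 8859-1) can be sketched roughly as follows. This is only an
illustration: the TRANSLIT table and both function names are made up
for this example and cover just a handful of characters, and the codec
names are the ones Python's standard library uses.

```python
# Hypothetical, deliberately incomplete table mapping a few characters that
# exist in Windows-1252 but not in ISO 8859-1 to plain substitutes.
TRANSLIT = {
    "\u2018": "'", "\u2019": "'",    # single smart quotes
    "\u201c": '"', "\u201d": '"',    # double smart quotes
    "\u2013": "-", "\u2014": "--",   # en dash, em dash
    "\u2022": "*",                   # bullet
    "\u2026": "...",                 # ellipsis
    "\u20ac": "EUR",                 # euro sign
}


def cp1252_to_utf8(raw: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8; nothing is lost."""
    return raw.decode("cp1252").encode("utf-8")


def cp1252_to_latin1(raw: bytes) -> bytes:
    """Transliterate Windows-1252 bytes so they fit in ISO 8859-1.

    Characters missing from TRANSLIT and from ISO 8859-1 become '?'.
    """
    text = raw.decode("cp1252")
    text = "".join(TRANSLIT.get(ch, ch) for ch in text)
    return text.encode("iso-8859-1", errors="replace")
```

With the first function the message can honestly be labelled
charset=UTF-8; with the second it can honestly be labelled
charset=ISO-8859-1, at the cost of the substitutions in the table.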

> The problem is that many browsers do not
> yet support iso-8859-1 and the systems do not have iso-8859-15 fonts.
              ^^^^^^^^^^
   You meant ISO 8859-15, didn't you? How about Windows-1252 fonts?
Until recently, most X11-based systems (virtually all Unix and Unix-like
OSes) didn't have any Windows-1252 fonts. All they had were ISO 8859-1
(and a few ISO 8859-15) fonts for Western European scripts/languages.

> > Hmm, are you sure? Netscape 4.7 can handle Korean pages in UTF-8
> > as long as you limit Hangul syllables to the repertoire of KS X 1001
> > (2350 of them).
>
> We could only get NS 4.7 to work for the languages that it had been
> localized.

  You must have done something wrong. Both the MS-Windows version
(under the English version of MS Windows ME) and the Unix/X11 version are
working perfectly well with a UTF-8 encoded Korean page right now in front
of me (try, e.g., <http://jshin.net/~jungshik/i18n/koencodings.html>).
Believe me: I have tested this not just once but many, many times ever
since Netscape 4.x began to support UTF-8. NS 4.7 has no problem
rendering Korean pages encoded in UTF-8 (the MS-Windows version
can render the full repertoire of 11,172 syllables if a font with
the full repertoire is selected). Localization has *nothing whatsoever*
to do with the rendering ability of Netscape 4.7x in the web page display
window as long as you have the necessary fonts. (In the MS-Windows version,
set the fonts for Unicode to 'Arial Unicode MS', 'Bitstream Cyberbit',
another large font, or one of the Korean fonts that come with Global
IME/Korean Language Pack for Korean rendering.)
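
For reference, the figure of 11,172 is the size of the precomposed
Hangul syllable block in Unicode (U+AC00 to U+D7A3), which is laid out
arithmetically as 19 leading consonants x 21 vowels x 28 trailing
consonants (counting "none"). A small sketch of that arithmetic follows;
the constant and function names are my own, chosen for illustration.

```python
S_BASE = 0xAC00                            # first precomposed Hangul syllable
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28     # leading, vowel, trailing (incl. none)
S_COUNT = L_COUNT * V_COUNT * T_COUNT      # size of the full repertoire


def is_hangul_syllable(ch: str) -> bool:
    """True if ch is one of the precomposed Hangul syllables."""
    return S_BASE <= ord(ch) < S_BASE + S_COUNT


def decompose(ch: str) -> tuple:
    """Return (leading, vowel, trailing) indices of a precomposed syllable."""
    if not is_hangul_syllable(ch):
        raise ValueError("not a precomposed Hangul syllable")
    index = ord(ch) - S_BASE
    leading = index // (V_COUNT * T_COUNT)
    vowel = (index % (V_COUNT * T_COUNT)) // T_COUNT
    trailing = index % T_COUNT             # 0 means no trailing consonant
    return leading, vowel, trailing
```

Every one of these code points takes three bytes in UTF-8, so the
encoding itself treats all 11,172 alike; whether a given syllable is
displayed depends on the font, not on the encoding.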

> > I agree with you on most of points, but it's not so insane to support
> > Unicode and many other encodings with non-Unicode-based *fonts* (with
> > Unicode at the hub/center of implementation) as shown by (Solaris and)
> > Mozilla in a sense. Especially, it's reasonable thing to do when there
> > are lots of non-Unicode-based *fonts* distributed and used everyday by
> > target users. This is not to say Mozilla does not use Unicode-based fonts.
> > It makes use of both of them.
>
> Breaking text buffers into script segments a character at a time is a lot of
> overhead and difficult to determine at times.

  I'm afraid there's some misunderstanding here. I didn't say that
segmenting text buffers into many smaller ones is the only or the most
efficient way. I think you're right that it's costly and inefficient.
What I wrote is just that it's one way (and it need not be 'a
character at a time') and that sometimes browsers have to deal with
both kinds of fonts (because they have to in order to work!): fonts in
legacy encodings that cover a small subset of Unicode, and fonts in
Unicode encoding with a lot of glyphs. Moreover, more often than not a
single font does not cover all the characters present in a page (even a
font in Unicode encoding: being in Unicode encoding does not mean at all
that the font covers all characters), so a fontset can be pretty
handy (every font in the fontset is searched from top to bottom for a glyph
to render each character). Segmentation is also necessary if 'language'
tagging is used. X11 even has an API call that does this, given a set of
fonts in many different encodings and a text buffer. People are now talking
about extending that API call to work with multiple iso10646-1 (the XLFD
encoding name for Unicode/ISO 10646 BMP fonts) fonts with different
repertoires/coverage (at the moment, each font in a fontset must have
a distinct encoding).
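
To make the fontset idea concrete, here is a rough sketch of the
top-to-bottom search and of segmenting by runs rather than by single
characters. It is not the X11 API and not how Mozilla does it; the
function names, the (name, coverage) representation of a font, and the
two demo fonts are all assumptions made up for this illustration.

```python
from typing import List, Optional, Set, Tuple

# A "font" here is just a name plus the set of code points it has glyphs for.
Font = Tuple[str, Set[int]]


def pick_font(ch: str, fontset: List[Font]) -> Optional[str]:
    """Search the fontset from top to bottom; return the first font covering ch."""
    for name, coverage in fontset:
        if ord(ch) in coverage:
            return name
    return None  # no font in the set has a glyph for ch


def segment(text: str, fontset: List[Font]) -> List[Tuple[Optional[str], str]]:
    """Split text into maximal runs of consecutive characters sharing a font."""
    runs: List[Tuple[Optional[str], str]] = []
    for ch in text:
        font = pick_font(ch, fontset)
        if runs and runs[-1][0] == font:
            runs[-1] = (font, runs[-1][1] + ch)   # extend the current run
        else:
            runs.append((font, ch))               # start a new run
    return runs


# Hypothetical fontset: a Latin-1 font first, then a Hangul-only font.
DEMO_FONTSET: List[Font] = [
    ("latin1-font", set(range(0x20, 0x7F)) | set(range(0xA0, 0x100))),
    ("hangul-font", set(range(0xAC00, 0xD7A4))),
]
```

Each run can then be drawn with a single call in its own font, so the
per-character cost is one coverage lookup rather than one drawing call.
The same loop works unchanged when two fonts share an encoding but
differ in coverage, which is the case the proposed extension is about.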

   Jungshik Shin


