G. Adam Stanislav wrote:
> Internally a program will presumably decode UTF-8 into whatever format it
> uses. As for being stored on disk, what if the disk is on a LAN consisting
> of PCs and Macs? Should it be stored in little-endian or big-endian order?
Either way, but with an appropriate BOM, and good software will be
able to cope.
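As a rough sketch of what "coping" could look like, the Python below checks the first two bytes for a UTF-16 BOM and picks the byte order from it. The function name `decode_utf16` and the big-endian fallback for BOM-less input are assumptions made for illustration; nothing in this thread prescribes them.

```python
import codecs

def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using the BOM (if present) to pick the byte order."""
    if data.startswith(codecs.BOM_UTF16_LE):   # bytes FF FE
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):   # bytes FE FF
        return data[2:].decode("utf-16-be")
    # Assumption for this sketch: no BOM means big-endian.
    return data.decode("utf-16-be")

# The same text as a little-endian (PC) and a big-endian (Mac) writer might store it.
little = codecs.BOM_UTF16_LE + "disk".encode("utf-16-le")
big = codecs.BOM_UTF16_BE + "disk".encode("utf-16-be")
print(decode_utf16(little) == decode_utf16(big))
```

Either file decodes to the same string, so the byte order chosen by the writer does not matter as long as the BOM is there.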
> Besides, UTF-16 can only contain the first plane.
No, that's UCS-2 (which is moribund). UTF-16 handles planes 0-0x10,
which are all the planes there will ever be.
Current plans are plane 1 for obscure and archaic scripts, plane 2 for
obscure and archaic Han characters, plane 0xE for special magic,
and planes 0xF and 0x10 for private use.
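To illustrate how UTF-16 reaches beyond plane 0, here is a small sketch of the surrogate-pair arithmetic in Python. The helper name `to_surrogates` is hypothetical; the constants 0xD800 and 0xDC00 are the starts of the high and low surrogate ranges.

```python
def to_surrogates(cp):
    """Split a code point in planes 1-0x10 into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point is outside planes 1-0x10")
    offset = cp - 0x10000             # fits in 20 bits
    high = 0xD800 + (offset >> 10)    # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits
    return high, low

# The last code point of plane 0x10 still fits in a single pair.
high, low = to_surrogates(0x10FFFF)
print(hex(high), hex(low))
```

Two 10-bit halves give 2^20 code points, which is exactly planes 1 through 0x10.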
> Even though, strictly
> speaking, Unicode is 16-bit, the ISO standard (is it 10646?) is 32-bit.
31-bit. But the codes above 0010FFFF will never be assigned.
> > o There is less text expansion for non-Latin languages.
>
> Yes, but with a well written expansion library (that I have been proposing)
> it happens fast and is completely transparent to the compiler writer.
I think the issue is space, not speed. UTF-16 can replace double-byte
character sets fairly easily, but UTF-8 makes for 50% expansion
(three bytes per character instead of two).
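The 50% figure can be checked with a few lines of Python. The sample string is an arbitrary choice for illustration: three Han characters, written as escapes, all of which lie in plane 0 above U+07FF and so take three bytes each in UTF-8.

```python
# Sample text (an assumption for this sketch): U+65E5 U+672C U+8A9E.
text = "\u65e5\u672c\u8a9e"

utf16_size = len(text.encode("utf-16-be"))  # 2 bytes per character, no BOM
utf8_size = len(text.encode("utf-8"))       # 3 bytes per character

expansion = (utf8_size - utf16_size) / utf16_size
print(utf16_size, utf8_size, expansion)
```

For text that is mostly ASCII the comparison goes the other way, so the figure applies only to material that used to live in a double-byte character set.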
> Again, that can be completely transparent. More importantly, TCHAR is of
> different sizes in different OS's. For example, under Windows 95+/NT, TCHAR
> is 16 bits wide. Under FreeBSD (and probably other Unices) it is 32 bits
> wide.
I think you are confusing wchar_t (a C standard) with TCHAR (a Microsoft
idea). TCHAR is 16 bits in Unicode mode and 8 bits in "ANSI" (8-bit
code page) mode.
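The width of wchar_t really does vary by platform, and one hedged way to see it without a C compiler is Python's `ctypes`, whose `c_wchar` type maps to the platform's wchar_t. TCHAR is a Windows SDK macro and cannot be probed this way, so this sketch covers only the wchar_t half of the distinction.

```python
import ctypes

# c_wchar corresponds to the C wchar_t of the platform running this script.
wchar_bits = ctypes.sizeof(ctypes.c_wchar) * 8
print(f"wchar_t is {wchar_bits} bits wide on this platform")
```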
> But editors on both systems can handle this minor quirk.
Some editors. Try Notepad (the standard Windows plaintext editor),
which can cope with UTF-16 fine but is baffled by bare-LF.
-- John Cowan   http://www.ccil.org/~cowan   cowan@ccil.org
Schlingt dreifach einen Kreis um dies! / Schliesst euer Aug vor heiliger Schau,
Denn er genoss vom Honig-Tau / Und trank die Milch vom Paradies.
            -- Coleridge / Politzer