Re: Unused code positions and mapping to Unicode

From: Kenneth Whistler (kenw@sybase.com)
Date: Thu Aug 05 1999 - 20:12:18 EDT


Randy asked,

>
> What is the proper mapping to Unicode of unused characters in a legacy encoding?

The unfortunate answer is that there is no single right answer. It depends
on what you are doing.

>
> For example, in Windows Cp1252-latin1 encoding, given the code position of 0x81:
>
> - It appears that notepad will map this to U+0081.
> - Nadine's book shows it mapped to U+FFFE.

This one, at least, can be ruled out. U+FFFE is just wrong, and should be
U+FFFD in these tables from Nadine Kano's book.

> - Java seems to map it to U+FFFD.
> - The mapping tables on the FTP site have it listed as undefined and
> don't give a Unicode value.

Which is probably the right way to define the table. Then an implementation
can choose which way it is going to treat these.
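
As a rough Python sketch of that split between table and policy: the
table leaves the position undefined, and the caller picks the treatment.
The table excerpt, the function name decode_byte and the policy names
are all made up for illustration; they are not taken from the FTP tables.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Hypothetical excerpt of a cp1252-style table for the upper half.
# None marks a position the table deliberately leaves undefined.
TABLE = {0x80: "\u20ac", 0x81: None, 0x82: "\u201a"}


def decode_byte(b, on_undefined="replace"):
    """Map one legacy byte to Unicode under the caller's chosen policy."""
    if b < 0x80:
        return chr(b)              # ASCII half maps straight through
    ch = TABLE.get(b)              # bytes missing from the excerpt count as undefined
    if ch is not None:
        return ch
    if on_undefined == "replace":
        return REPLACEMENT         # the general-purpose choice
    if on_undefined == "passthrough":
        return chr(b)              # treat the byte as a C1 control
    raise ValueError(f"undefined code position 0x{b:02X}")
```

The table itself never commits to a Unicode value for 0x81; only the
policy argument does.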

>
> I would think that U+FFFD is right.

In the general case, yes.

> But if you do that then round-trip
> conversion will not work if there are multiple unused characters in
> the legacy encoding (and for Cp1252 and others there are multiple such
> code positions).

But roundtripping to nonexistent code positions in a character encoding
is not necessarily a desirable goal anyway.
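
To make the loss concrete, here is a small Python sketch. It assumes the
cp1252 codec bundled with CPython, which leaves 0x81, 0x8D, 0x8F, 0x90
and 0x9D undefined; other converters may behave differently.

```python
# The five positions that CPython's cp1252 codec treats as undefined.
UNDEFINED = bytes([0x81, 0x8D, 0x8F, 0x90, 0x9D])

# errors="replace" substitutes U+FFFD for every undecodable byte.
decoded = UNDEFINED.decode("cp1252", errors="replace")

# Going back, U+FFFD has no cp1252 equivalent, so it becomes "?".
reencoded = decoded.encode("cp1252", errors="replace")

# All five distinct inputs collapse to one value in each direction.
distinct_after_decode = len(set(decoded))
distinct_after_reencode = len(set(reencoded))
```

Five different input bytes end up as a single substitute, which is
exactly the round-trip failure described in the question.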

> Doing what notepad did will solve that, but that seems
> wrong since it is not that character in Unicode.

There is a case to be made, in particular for values in the range 0x80..0x9F
in an 8-bit encoding, to just map them through to Unicode U+0080..U+009F,
assuming them to be otherwise unspecified control characters. For character
encodings that obey the C0/C1 restrictions on graphical characters, such
as the 8859 series, this is most likely to be the right answer. However,
for Windows code pages and IBM code pages, which stick graphic characters
in the range 0x80..0x9F, mapping straight through to Unicode controls is
as likely to be wrong as right -- and will certainly be wrong in the future if the
code page in question is extended by adding some specific graphic character
at the formerly undefined position.
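
The difference shows up directly if the same two bytes are decoded both
ways. This is only an illustration using the latin-1 and cp1252 codecs
that ship with Python:

```python
data = bytes([0x80, 0x85])

# ISO 8859-1 reserves 0x80..0x9F for C1 controls, so straight-through
# mapping yields U+0080 and U+0085.
as_8859_1 = data.decode("latin-1")

# Windows cp1252 puts graphic characters at the same positions:
# 0x80 is U+20AC EURO SIGN, 0x85 is U+2026 HORIZONTAL ELLIPSIS.
as_cp1252 = data.decode("cp1252")

# A converter that passed cp1252 bytes through as C1 controls would
# disagree with the code page on both positions.
mismatches = [hex(b) for b, x, y in zip(data, as_8859_1, as_cp1252) if x != y]
```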

> I guess you could use the
> private use area to map them to unique positions, but that does not seem right
> either. And if I did use the private use area then other applications would
> likely not handle it properly when sent to them.

You would do this kind of thing if you needed an internal round-tripping,
but it would be inadvisable to interchange data converted this way openly --
it provides even less information than if you had substituted U+FFFD for
the unconvertible positions.
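
A sketch of how such an internal-only scheme might look in Python. The
base value 0xF700 is an arbitrary private-use choice made up for this
example, as are the function names; the cp1252 codec is the one bundled
with CPython.

```python
PUA_BASE = 0xF700  # hypothetical base inside the private use area U+E000..U+F8FF


def _is_internal(ch):
    """True for the private-use code points this scheme hands out."""
    return PUA_BASE + 0x80 <= ord(ch) <= PUA_BASE + 0xFF


def decode_internal(data):
    """Decode cp1252, parking undefined bytes at unique private-use positions."""
    out = []
    for b in data:
        try:
            out.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            out.append(chr(PUA_BASE + b))
    return "".join(out)


def encode_internal(text):
    """Inverse of decode_internal, restoring the original undefined bytes."""
    out = bytearray()
    for ch in text:
        if _is_internal(ch):
            out.append(ord(ch) - PUA_BASE)
        else:
            out += ch.encode("cp1252")
    return bytes(out)


def export(text):
    """Before open interchange, fall back to U+FFFD for the parked positions."""
    return "".join("\ufffd" if _is_internal(ch) else ch for ch in text)
```

Inside the application, encode_internal(decode_internal(data)) gives the
original bytes back; anything leaving the application goes through
export() so that no private-use values escape.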

> And then what do I do
> when the vendor later defines that code position, particularly when the vendor
> decides not to give it a new name (as happened in some cases when the Euro
> character was added)? I won't be able to tell if this is use of an unused
> code position or a use of that new character.

When the vendor later defines a formerly undefined code position, there really
is no feasible alternative to updating your table(s). Once someone starts
using the newly defined code point, you must map it correctly.

>
> What are others doing to handle this?

Generically, I map to U+FFFD. And when vendors update their definitions, I
update my tables.
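
In Python terms, that practice might look roughly like the following.
The two table excerpts are illustrative stand-ins for an older and a
newer vendor definition, not real published tables, and the names are
invented for the example.

```python
REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def decode(data, high_half):
    """ASCII maps straight through; upper-half bytes go through the table,
    and anything the table does not define becomes U+FFFD."""
    return "".join(
        chr(b) if b < 0x80 else high_half.get(b, REPLACEMENT) for b in data
    )


# Illustrative excerpt from before the vendor defined position 0x80.
HIGH_V1 = {0xA3: "\u00a3"}

# After the vendor assigns 0x80 to the euro sign, the table is updated.
HIGH_V2 = {**HIGH_V1, 0x80: "\u20ac"}

old_result = decode(b"\x80 5", HIGH_V1)  # 0x80 is still undefined here
new_result = decode(b"\x80 5", HIGH_V2)  # 0x80 now maps to U+20AC
```

The conversion logic stays the same across vendor revisions; only the
table data changes.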

--Ken

> An answer that the data should never
> contain those code positions, while an understandable argument, is not helpful.
>
> Thanks in advance.
>
> Randy
>
> ------------------------------------------------------------------------------
> Randolph S. Williams
> National Language Support Voice: 919.677.8000
> SAS Institute Inc. Fax: 919.677.4444
> Cary, NC 27513 USA Email: Randy.Williams@sas.com
>


