RE: Unused code positions and mapping to Unicode

From: Murray Sargent (murrays@microsoft.com)
Date: Fri Aug 06 1999 - 17:56:48 EDT


One can argue that unassigned code points in 8-bit code pages should be
mapped to 0xFFFD. I think that while that's probably true according to the
letter of the current Unicode law, it's nevertheless a bad idea, since it
fails to roundtrip the codes. In fact, if you're working with a newer file
on an older system, you'll map more recently defined codes into oblivion.
And such assignments may be made when the need becomes sufficiently great,
e.g., the Euro being added to most 125x codepages at 0x80. So if older
software mapped 0x80 into 0xFFFD, you'd go broke because you'd lose all your
Euros!
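
A minimal sketch of the roundtrip problem, assuming a hypothetical "old"
table that predates the Euro and therefore leaves 0x80 undefined. The names
decode, encode, to_replacement, and pass_through are illustrative only and
do not come from any real converter:

```python
REPLACEMENT = "\ufffd"

# Hypothetical old table: only 'A' is defined, 0x80 is not.
old_table = {0x41: "A"}


def to_replacement(b):
    """Policy 1: every undefined byte becomes U+FFFD."""
    return REPLACEMENT


def pass_through(b):
    """Policy 2: an undefined byte maps to the code point of the same value."""
    return chr(b)


def decode(data, table, fallback):
    """Decode bytes with the table, using the fallback for undefined bytes."""
    return "".join(table[b] if b in table else fallback(b) for b in data)


def encode(text, table):
    """Inverse of decode: table first, then pass-through, else '?'."""
    reverse = {ch: b for b, ch in table.items()}
    out = []
    for ch in text:
        if ch in reverse:
            out.append(reverse[ch])
        elif ord(ch) < 0x100:
            out.append(ord(ch))   # undo the pass-through mapping
        else:
            out.append(0x3F)      # U+FFFD and anything else is lost
    return bytes(out)


data = bytes([0x41, 0x80])        # 'A' followed by a (newer) Euro at 0x80

lossy = encode(decode(data, old_table, to_replacement), old_table)
kept = encode(decode(data, old_table, pass_through), old_table)
```

With the U+FFFD policy the 0x80 byte cannot be recovered on the way back,
while the pass-through policy returns the original bytes unchanged.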

Accordingly in the 125x code pages, unassigned codes in 0x80 - 0x9F are
mapped to themselves, and unassigned codes in 0xA0 - 0xFF are mapped to EUDC
codes. Similarly in the non-Windows CP 857, the Turkish codes 0xD5, 0xE7,
and 0xF2 you ask about are mapped as:

0xd5 0xf8bb ;Undefined -> EUDC
0xe7 0xf8bc ;Undefined -> EUDC
0xf2 0xf8bd ;Undefined -> EUDC
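
As a sketch of why this choice roundtrips, here are the three mappings above
in a small lookup table. Only the three byte/code point pairs come from the
table above; the helper names are hypothetical, and a real converter would
merge these entries into its full CP 857 table:

```python
# Undefined CP 857 bytes -> EUDC (private use) code points, as listed above.
EUDC_FOR_UNDEFINED = {0xD5: 0xF8BB, 0xE7: 0xF8BC, 0xF2: 0xF8BD}

# Reverse table, so each EUDC code point maps back to exactly one byte.
BYTE_FOR_EUDC = {cp: b for b, cp in EUDC_FOR_UNDEFINED.items()}


def decode_undefined(b):
    """Map an undefined CP 857 byte to its EUDC character."""
    return chr(EUDC_FOR_UNDEFINED[b])


def encode_eudc(ch):
    """Map an EUDC character back to the original CP 857 byte."""
    return BYTE_FOR_EUDC[ord(ch)]


original = bytes([0xD5, 0xE7, 0xF2])
text = "".join(decode_undefined(b) for b in original)
restored = bytes(encode_eudc(ch) for ch in text)
```

Because each undefined byte gets its own code point, the three bytes stay
distinct after conversion, which a single U+FFFD could not preserve.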

Interesting idea to give access to EUDC, a choice that roundtrips the codes.

Murray

> -----Original Message-----
> From: Randy Williams [SMTP:sasrsw@wnt.sas.com]
> Sent: Friday, August 06, 1999 2:03 PM
> To: Murray Sargent; Unicode List
> Subject: RE: Unused code positions and mapping to Unicode
>
>
> Murray,
>
> So are undefined characters outside of the 0x80-0x9F range mapped to
> U+FFFD?
> For example, in Cp857-Turkish the code positions of 0xD5, 0xE7, and 0xF2.
>
> Randy
>
> -----Original Message-----
> From: Murray Sargent [mailto:murrays@microsoft.com]
> Sent: Friday, August 06, 1999 3:30 PM
> To: Unicode List
> Cc: unicode@unicode.org
> Subject: RE: Unused code positions and mapping to Unicode
>
>
> I would like to underline Ken's remark below "There is a case to be made, in
> particular for values in the range 0x80..0x9F in an 8-bit encoding, to just
> map them through to Unicode U+0080..U+009F". The point is that Unicode
> _does_ define these positions as the C1 controls. As such, they should not
> be mapped to 0xFFFD, although undefined character codes should be so mapped.
> In the absence of other definitions, the 0x80 - 0x9F codes are best mapped
> to themselves.
>
> Murray
>
> > -----Original Message-----
> > From: kenw@sybase.com [SMTP:kenw@sybase.com]
> > Sent: Thursday, August 05, 1999 5:09 PM
> > To: Unicode List
> > Cc: unicode@unicode.org; kenw@sybase.com
> > Subject: Re: Unused code positions and mapping to Unicode
> >
> > Randy asked,
> >
> > >
> > > What is the proper mapping to Unicode of unused characters in a legacy
> > > encoding?
> >
> > The unfortunate answer is that there is no single right answer. It
> > depends on what you are doing.
> >
> > >
> > > For example, in Windows Cp1252-latin1 encoding, given the code position
> > > of 0x81:
> > >
> > > - It appears that notepad will map this to U+0081.
> > > - Nadine's book shows it mapped to U+FFFE.
> >
> > This one, at least, can be ruled out. U+FFFE is just wrong, and should
> > be U+FFFD in these tables from Nadine Kano's book.
> >
> > > - Java seems to map it to U+FFFD.
> > > - The mapping tables on the FTP site have it listed as undefined and
> > > don't give a Unicode value.
> >
> > Which is probably the right way to define the table. Then an
> > implementation can choose which way it is going to treat these.
> >
> > >
> > > I would think that U+FFFD is right.
> >
> > In the general case, yes.
> >
> > > But if you do that then round-trip
> > > conversion will not work if there are multiple unused characters in
> > > the legacy encoding (and for Cp1252 and others there are multiple such
> > > code positions).
> >
> > But roundtripping to nonexistent code positions in a character encoding
> > is not necessarily a desirable goal anyway.
> >
> > > Doing what notepad did will solve that, but that seems
> > > wrong since it is not that character in Unicode.
> >
> > There is a case to be made, in particular for values in the range
> > 0x80..0x9F in an 8-bit encoding, to just map them through to Unicode
> > U+0080..U+009F, assuming them to be otherwise unspecified control
> > characters. For character encodings that obey the C0/C1 restrictions on
> > graphical characters, such as the 8859 series, this is most likely to be
> > the right answer. However, for Windows code pages and IBM code pages,
> > which stick graphic characters in the range 0x80..0x9F, mapping straight
> > through to Unicode controls is as likely to be wrong -- and will certainly
> > be wrong in the future if the code page in question is extended by adding
> > some specific graphic character at the formerly undefined position.
> >
> > > I guess you could use the
> > > private use area to map them to unique positions, but that does not seem
> > > right either. And if I did use the private use area then other applications
> > > would likely not handle it properly when sent to them.
> >
> > You would do this kind of thing if you needed an internal round-tripping,
> > but it would be inadvisable to interchange data converted this way openly
> > -- it provides even less information than if you had substituted U+FFFD for
> > the unconvertible positions.
> >
> > > And then what do I do
> > > when the vendor later defines that code position, particularly when the
> > > vendor decides not to give it a new name (as happened in some cases when the
> > > Euro character was added)? I won't be able to tell if this is use of an
> > > unused code position or now use of that new character.
> >
> > When the vendor later defines a formerly undefined code position, there
> > really is no feasible alternative to updating your table(s). Once someone
> > starts using the newly defined code point, you must map it correctly.
> >
> > >
> > > What are others doing to handle this?
> >
> > Generically, I map to U+FFFD. And when vendors update their definitions,
> > I update my tables.
> >
> > --Ken
> >
> > > An answer that the data should never
> > > contain those code positions, while an understandable argument, is not
> > > helpful.
> > >
> > > Thanks in advance.
> > >
> > > Randy
> > >
> > >
> >
> > > --------------------------------------------------------------------------
> > > Randolph S. Williams
> > > National Language Support Voice: 919.677.8000
> > > SAS Institute Inc. Fax: 919.677.4444
> > > Cary, NC 27513 USA Email: Randy.Williams@sas.com
> > >



This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:50 EDT