Re: (TC304WG4.50) Charset vs. codeset

From: Tex Texin (texin@bedford.progress.COM)
Date: Mon Oct 06 1997 - 03:00:33 EDT


OK, I got paid this week so I can afford to throw in my $.02:

With respect to uniqueness of Unicode, my first thought was of the
compatibility characters, since these are redundant with the characters
that have more specific semantics. For example, the hyphen exists for
compatibility with ASCII, but also exists in other (more specific)
forms.
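
A small sketch of the hyphen example, assuming Python's standard
unicodedata module (not something used in this thread). The helper name
and the choice of three code points are illustrative only; it simply
lists the ASCII hyphen next to two of its more specific relatives.

```python
import unicodedata

# Illustrative selection: the ASCII hyphen and two more specific characters.
HYPHEN_LIKE = ["\u002D", "\u2010", "\u2212"]

def hyphen_like_names(chars=HYPHEN_LIKE):
    """Map each character to its Unicode character name."""
    return {ch: unicodedata.name(ch) for ch in chars}

for ch, name in hyphen_like_names().items():
    print(f"U+{ord(ch):04X}  {name}")
```

Normalization leaves all three of these alone, so whether they count as
"the same character" is left to the application.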

However, my second thought is that uniqueness, nice as it is when you are
doing mathematics, is not so significant for us, since whether characters
are unique depends on your application. When I am doing a
case-insensitive search, even ASCII is non-unique.
When I work with Asian characters I fold half-width and full-width
together. In other applications I treat both of these subsets as unique.
I suspect therefore that categorizing a character repertoire as
consisting of unique sets will depend on the eye of the beholder.
(Or the "i" as Alain beholds in his examples!)
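
The two foldings mentioned above can be sketched as follows, assuming
Python's str.casefold and unicodedata.normalize as stand-ins (neither is
from this thread, and the function names are hypothetical). Note that
NFKC folds more than width variants, so it is only an approximation of
"fold half-width and full-width together".

```python
import unicodedata

def fold_case(text):
    """Fold case distinctions, as a case-insensitive search would."""
    return text.casefold()

def fold_width(text):
    """Fold width variants via NFKC (assumption: NFKC also folds other
    compatibility characters, so this is broader than width alone)."""
    return unicodedata.normalize("NFKC", text)

# Under case folding, "A" and "a" are no longer distinct.
print(fold_case("ASCII") == fold_case("ascii"))

# Full-width Latin and half-width katakana fold to their ordinary forms.
print(fold_width("\uFF21\uFF22\uFF23") == "ABC")
print(fold_width("\uFF76") == "\u30AB")
```

An application that skips both functions treats every one of these
characters as unique; one that applies them does not. The repertoire is
the same in both cases.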

Since I can't imagine someone creating a character set with more than one
of a character, without some differentiating characteristic between them,
I don't see that it helps us to debate which sets are unique or to
include uniqueness in the definition.

I thought Ken's definitions were adequate.

tex

On Oct 5, 8:35am, Alain LaBonté - SCT wrote:
> Subject: Re: (TC304WG4.50) Charset vs. codeset
> At 06:42 97-10-05 -0700, Martin J. Dürst wrote:
> >On Sat, 4 Oct 1997, Keld Jørn Simonsen wrote:
> >
> >> I had a few comments to Kenneth Whistler's recent writing:
>
> [Martin] :
> >> > Unicode is an encoded character set.
>
> [Keld] :
> >> I am not so sure about that. It violates the general principle
> >> that an encoded character set encodes each (abstract) character
> >> in only one way.
> >>
> >> > ISO/IEC 10646 is an encoded character set.
> >>
> >> True.
>
> [Martin] :
> >You are probably referring to cases like A + combining ring above
> >vs. A with ring above (sorry I don't remember the official names).
> >
> >In that sense, both Unicode and ISO/IEC 10646 are very much the
> >same. Both include the possibilities to use combining marks.
> >Unicode is a little bit more explicit about them. But it doesn't
> >allow more things than ISO/IEC 10646. ISO/IEC 10646 doesn't explicitly
> >define equivalences, and therefore in theory, it's possible to
> >say that these are different abstract characters (or combinations
> >of them). But Unicode can say the same, namely that they are
> >different abstract characters/combinations. That the difference
> >shouldn't be visible to the user is patently obvious in both cases.
>
> [Alain] :
> My 2 cents:
>
> On one hand some combinations where you would not see a difference even
> with bad implementations are not recognized as equivalent in UNICODE
> (SMALL DOTLESS I WITH CIRCUMFLEX and SMALL DOTLESS I WITH DIAERESIS are
> cases in point which typically affect French; with the I other languages
> are affected as well).
>
> On the other hand, if the implementation is done on the fly by
> overprinting or overdisplaying, the difference will be visible with the
> COMBINING DIACRITICS used with a SMALL DOTTED I (a traditional i!) while
> according to UNICODE there is no difference of interpretation between
> the two encodings.
>
> This is of course only anecdotical. However that should imho be
> corrected in UNICODE. But nobody cares except me, it seems.
>
> I would like the two following rules to be true (wish list) :
>
> 1. Within a given script, combinations which make no difference with a
> precomposed character should be considered equivalent in UNICODE.
>
> 2. It should be disallowed to show differences for UNICODE
> equivalences, when only one font is used.
>
> Personally, I also have a problem buying applications that do double
> encoding, as this (as we all know with QP and SGML entities) multiplies
> the possibilities of bugs, but also of inconsistencies (in particular in
> search engines). I like that all passes through the same coding/decoding
> process, at the lowest possible level (complete application environment
> or even operating system level).
>
> Alain LaBonté
> Québec
>
>-- End of excerpt from Alain LaBonté - SCT
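
The combining-sequence cases in the quoted exchange can be checked with a
short sketch. It assumes Python's unicodedata module and NFC
normalization as the test for equivalence (an illustration only, not
something used in this thread); the function and constant names are
hypothetical.

```python
import unicodedata

def canonically_equivalent(a, b):
    """True if both strings have the same NFC-normalized form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

A_RING_COMBINING = "A\u030A"     # A + COMBINING RING ABOVE
A_RING_PRECOMPOSED = "\u00C5"    # LATIN CAPITAL LETTER A WITH RING ABOVE
I_CIRC_PRECOMPOSED = "\u00EE"    # LATIN SMALL LETTER I WITH CIRCUMFLEX
I_CIRC_COMBINING = "i\u0302"     # dotted i + COMBINING CIRCUMFLEX ACCENT
DOTLESS_I_CIRC = "\u0131\u0302"  # dotless i + COMBINING CIRCUMFLEX ACCENT

# Martin's case: the combining sequence and the precomposed letter.
print(canonically_equivalent(A_RING_COMBINING, A_RING_PRECOMPOSED))

# Alain's two cases: dotted i versus dotless i under the circumflex.
print(canonically_equivalent(I_CIRC_COMBINING, I_CIRC_PRECOMPOSED))
print(canonically_equivalent(DOTLESS_I_CIRC, I_CIRC_PRECOMPOSED))
```

With this test the dotted-i sequence is equivalent to the precomposed
letter and the dotless-i sequence is not, which is the asymmetry Alain
describes.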

-- 
-------------------------------------------------------
Tex Texin                    
Manager International Development and Product Management
                                 
Progress Software Corp.        Voice:   +1-781-280-4271
14 Oak Park                      Fax:   +1-781-280-4949
Bedford, MA 01730  USA          
http://www.progress.com      texin@bedford.progress.com
-------------------------------------------------------


