From: Kenneth Whistler (kenw@sybase.com)
Date: Fri May 16 2003 - 14:58:49 EDT
Ben Dougall asked:
> anyone? : uca and collation to ascertain various possible character
> groupings / categorisations that are specific to various specified
> languages? to get some other matches, more than just an absolute match
> or not absolute match?
Use of the collation algorithm to do this is probably overkill.
If you are looking to make arbitrary sets of equivalences, such
as "all the consonants of English", then you should probably just
write your own ad hoc foldings.
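As a rough illustration of what such an ad hoc folding might look like, here is a minimal Python sketch. Everything in it is an assumption made for the example: the names `english_classes` and `shared_classes` are hypothetical, only the 26 ASCII letters are covered, and "y" is arbitrarily counted as a consonant. A real folding would be driven by whatever linguistic source you settle on.

```python
import string

# Assumption for this sketch: "y" is treated as a consonant.
ENGLISH_VOWELS = set("aeiou")


def english_classes(ch):
    """Fold one character into ad hoc English categories."""
    # Anything that is not a single ASCII letter falls outside the folding.
    if len(ch) != 1 or ch not in string.ascii_letters:
        return {"kind": "other", "case": "none"}
    kind = "vowel" if ch.lower() in ENGLISH_VOWELS else "consonant"
    case = "upper" if ch.isupper() else "lower"
    return {"kind": kind, "case": case}


def shared_classes(a, b):
    """Return the names of the categories on which two characters agree."""
    ca, cb = english_classes(a), english_classes(b)
    return {
        name
        for name in ca
        if ca[name] == cb[name] and ca[name] not in ("other", "none")
    }
```

With this folding, `shared_classes("F", "W")` reports a match on both case and consonant status, which is the comparison described in the question, while `shared_classes("f", "W")` matches on consonant status only.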
>
> am i on the right track there? or is there a better direction maybe?
> i'm looking for a reasonably even coverage of the main languages.
Don't expect to find such information in a character encoding
standard or in the Unicode Collation Algorithm.
The point of the character encoding standard is to encode all
the characters of each *script*, so that it can be used to
represent text in all the languages that use that script. The
character encoding standard, per se, doesn't do any language-specific
categorizations.
The point of the Unicode Collation Algorithm is to provide a
default sorting for Unicode, along with a generic tailoring
mechanism to allow people to customize it to produce
language-specific sorting according to cultural conventions.
Again, the algorithm, per se, doesn't do any language-specific
categorizations.
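To make the default-versus-tailored distinction concrete, here is a toy Python sketch. It is not the Unicode Collation Algorithm and does not use any UCA implementation; `default_key`, `swedish_key`, and `SWEDISH_ALPHABET` are hypothetical names invented for the example. It only mimics one well-known effect: a default ordering groups "ä" with "a", while Swedish convention places "å", "ä", and "ö" after "z".

```python
import unicodedata


def default_key(word):
    """Toy stand-in for a default ordering: base letters first, accents as a tiebreak."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    # Primary level: drop combining marks so that "ä" compares like "a".
    primary = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Secondary level: the decomposed string itself breaks ties between equal bases.
    return (primary, decomposed)


# Assumption for this sketch: a simplified Swedish alphabet, with å, ä, ö after z.
SWEDISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzåäö"


def swedish_key(word):
    """Toy stand-in for a tailoring: rank letters by position in the Swedish alphabet."""
    composed = unicodedata.normalize("NFC", word.lower())
    # Characters outside the alphabet sort after all known letters.
    return [
        SWEDISH_ALPHABET.index(ch) if ch in SWEDISH_ALPHABET else len(SWEDISH_ALPHABET)
        for ch in composed
    ]


words = ["zebra", "äpple", "apple"]
default_order = sorted(words, key=default_key)
swedish_order = sorted(words, key=swedish_key)
```

Note that both keys only change the *order* of the strings. Neither one says anything about which letters are vowels, consonants, or any other language-specific category.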
>
> just checking. thanks.
>
>
> On Thursday, May 15, 2003, at 11:03 pm, Ben Dougall wrote:
>
> > would it be the uca / collation
> > <http://www.unicode.org/unicode/reports/tr10/> that will allow me to
> > do this? :
> >
> > having specified which language is being used, compare one character
> > to another and find out which various groupings they may or may not
> > share. such as comparing in english, an 'F' and 'W' would match on
> > case (and consonants even). case categories i'm sure don't exist in
> > some other languages, but then i'm sure there are many other types of
> > categorisations in other languages that english doesn't have.
> >
> > i'd like to have access to any kind of character categories /
> > groupings that may be applicable to whichever language is initially
> > specified.
You need to start looking up *linguistic* sources for that kind
of information, unless what you are really after is just the list
of characters needed to represent text for each language. In that
case, there are various online sources to get you started, as reported
several times on this list. For example, see Indrek Hein's site:
http://www.eki.ee/itstandard/ladina/
> >
> > is it the uca that's what i need to look into for that type of thing?
No, I don't think so.
> >
> >
> > also i notice icu <http://oss.software.ibm.com/icu/> has a lot of
> > collation stuff. how does that compare to unicode's collation?, (if
> > collation is even what i'm after, that is). how is icu different from
> > unicode's collation?
ICU provides an implementation of the Unicode Collation Algorithm.
It conforms *to* UTS #10.
--Ken