Re: Implementing on UTF8: toUpper(), toFold(), normalisation, collation, etc

From: Addison Phillips [wM] (aphillips@webmethods.com)
Date: Sat May 03 2003 - 12:45:25 EDT

Next message: Carl W. Brown: "RE: Implementing on UTF8: toUpper(), toFold(), normalisation, collation, etc"

Previous message: Doug Ewell: "Re: Transcribing old documents into Unicode compatible document files."
In reply to: Theodore H. Smith: "Implementing on UTF8: toUpper(), toFold(), normalisation, collation, etc"
Next in thread: Theodore H. Smith: "Finite state machines? UTF8: toFold(), normalisation, etc"
Reply: Theodore H. Smith: "Finite state machines? UTF8: toFold(), normalisation, etc"

Dear Mr. Smith,

That covers several different operations, and not all of them are
determined by Unicode properties alone. Collation, for example, is
strongly affected by language.
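
To illustrate the point about language, here is a minimal sketch that
sorts the same three words under two conventions: Swedish, which places
a-ring, a-umlaut and o-umlaut after z, and German dictionary order,
which sorts the umlauted vowels with their base letters. The alphabet
string, the folding table and the function names are hypothetical and
hand-written for this example; real collation relies on much larger
tailored tables.

```python
# Hypothetical, hand-written tables for illustration only.
# Swedish alphabet: a-z followed by a-ring, a-umlaut, o-umlaut.
SWEDISH = "abcdefghijklmnopqrstuvwxyz\u00e5\u00e4\u00f6"

# German dictionary order: umlauted vowels sort as their base letters.
GERMAN_FOLD = {"\u00e4": "a", "\u00f6": "o", "\u00fc": "u"}


def swedish_key(word):
    """Rank each letter by its position in the Swedish alphabet."""
    return [SWEDISH.index(ch) for ch in word]


def german_key(word):
    """Replace umlauted vowels by their base letters before comparing."""
    return "".join(GERMAN_FOLD.get(ch, ch) for ch in word)


words = ["zebra", "\u00f6l", "orm"]
swedish_order = sorted(words, key=swedish_key)
german_order = sorted(words, key=german_key)
```

With these tables the Swedish order comes out as orm, zebra, öl, while
the German order is öl, orm, zebra: same code points, different results.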

Unicode character properties provide the information you need to
implement many of these functions. The Unicode character data files have
fields that you can compile into data tables for this purpose.
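
As a rough sketch of what compiling a table can look like, the following
reads a few lines in the UnicodeData.txt format (semicolon-separated,
with the simple uppercase mapping in field 12, counting from zero) and
builds a lookup table from them. The three sample lines are copied in
for illustration, and the function names are my own; a real
implementation would read the whole file and would likely pack the
result into a more compact structure than a dictionary.

```python
# A few sample lines in the UnicodeData.txt format.
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041
00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;LATIN SMALL LETTER E ACUTE;;00C9;;00C9
"""

UPPER_FIELD = 12  # simple uppercase mapping (fields counted from zero)


def build_upper_table(lines):
    """Map code point -> simple uppercase code point, where one exists."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        upper = fields[UPPER_FIELD]
        if upper:  # empty field means no uppercase mapping
            table[int(fields[0], 16)] = int(upper, 16)
    return table


def to_upper_simple(text, table):
    """Apply the one-to-one mapping; unmapped characters pass through."""
    return "".join(chr(table.get(ord(ch), ord(ch))) for ch in text)


UPPER = build_upper_table(SAMPLE.splitlines())
```

Note that this only covers the one-to-one mappings in that file. Cases
where one character maps to several are listed separately in
SpecialCasing.txt and are not handled here.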

Before you go off and do that, you should take a look at libraries that
have already done it. I recommend a close look at the ICU library
(http://oss.software.ibm.com/icu) as an excellent starting point.

You should also look closely at the FAQ and technical documentation on
the Unicode website, if you have not already.

I should note that very few applications work directly on UTF-8 byte
sequences. Most choose to process Unicode using UTF-16 or UTF-32 in
memory, even if the ultimate representation is UTF-8.
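
A minimal sketch of that pattern, assuming the data arrives and leaves
as UTF-8 bytes: decode to a list of code points (effectively UTF-32),
do the per-character work there, and encode again at the end. The helper
names and the one-entry mapping table are hypothetical and exist only
for this example.

```python
def utf8_to_code_points(data):
    """Decode UTF-8 bytes into a list of integer code points."""
    return [ord(ch) for ch in data.decode("utf-8")]


def code_points_to_utf8(code_points):
    """Encode a list of integer code points back into UTF-8 bytes."""
    return "".join(chr(cp) for cp in code_points).encode("utf-8")


raw = b"caf\xc3\xa9"              # "cafe" with e-acute: 5 bytes, 4 code points
cps = utf8_to_code_points(raw)

# Hypothetical one-entry table: e-acute -> E-acute.
table = {0xE9: 0xC9}
mapped = [table.get(cp, cp) for cp in cps]

out = code_points_to_utf8(mapped)
```

Working on code points means the table lookup never has to care how many
bytes a character occupies in UTF-8.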

I hope this helps for starters.

Best Regards,

Addison

-- 
Addison P. Phillips
Director, Globalization Architecture
webMethods, Inc.
+1 408.962.5487  mailto:aphillips@webmethods.com
-------------------------------------------
Internationalization is an architecture. It is not a feature.
Chair, W3C I18N WG Web Services Task Force
http://www.w3.org/International/ws

Theodore H. Smith wrote:
> Hi list,
> 
> I need some way to implement toUpper(), toFold(), 
> normalisation, collation, and perhaps other Unicode features I may have 
> missed, on UTF-8 strings stored in RAM.
> 
> I need to implement it for Windows (32-bit), MacOS9 and MacOSX.
> 
> I have other Unicode processing code, already, but not these or anything 
> close to these.
> 
> I heard that the only way is to read out the character information from 
> a database? My whole string processing library, with hundreds of 
> functions and a few properties, is only 54k. I don't want to add 200k of 
> database reading code and then huge Unicode database files to this 54k.
> 
> How is this best done, then? I'm assuming there isn't any mathematical 
> way to figure out a codepoint's properties? So where do I get this data 
> and what's the fastest way to do it?
> 
> -- 
>     Theodore H. Smith - Macintosh Consultant / Contractor.
>     My website: <www.elfdata.com/>
> 
>


This archive was generated by hypermail 2.1.5 : Sat May 03 2003 - 13:36:13 EDT