Re: Does Unicode 4.1 change NFC?

From: Marcin 'Qrczak' Kowalczyk (qrczak@knm.org.pl)
Date: Tue Apr 05 2005 - 03:09:09 CST

Next message: Peter Kirk: "Re: Does Unicode 4.1 change NFC?"

Previous message: Arcane Jill: "Re: Does Unicode 4.1 change NFC?"
In reply to: Arcane Jill: "Re: Does Unicode 4.1 change NFC?"
Next in thread: Andrew C. West: "Re: Does Unicode 4.1 change NFC?"

"Arcane Jill" <arcanejill@ramonsky.com> writes:

> In particular, I have played around with writing code-generators, of
> the ilk which Ken mentioned in another post on this thread, and I
> /never/ assumed that all (or indeed, any) generated codepoints would
> be 16-bits wide. That would be a really dumb thing to do. Why is
> anyone even mentioning this as a possibility?

Since code produced by my generator is embedded in every program
compiled by my compiler, the primary goal is small data and code size.
I can live with updating the code when UCD changes some assumptions.

I mean only the tables that hold raw decomposition data. Strings are
represented as ISO-8859-1 and UTF-32; there is no BMP bias in the
interfaces, only in some internally used tables.

The representation I used before for canonical decomposition:
- An array of 256 pointers to arrays of 256 pairs of 16-bit words
  gives decompositions of BMP characters. A pair is 0,0 for no
  decomposition, X,0 for a single-char decomposition and X,Y for
  two-char decomposition. All-zero pages are shared.
- An array of 32-bit words gives single-character decomposition
  for 542 characters starting from U+2F800.
- The remaining 13 characters with decompositions are treated by
  a switch statement in the code.
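
The BMP part of the layout described above could be sketched in C roughly
as follows. This is only an illustration under stated assumptions: the names
Pair, pages, page_00, page_21 and decompose_bmp are hypothetical, only three
sample entries are filled in, and a NULL slot stands in for the shared
all-zero page to keep the sketch short (a generator would presumably emit a
pointer to one shared page instead).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One table entry: {0,0} = no decomposition, {X,0} = single-char,
   {X,Y} = two-char decomposition. */
typedef struct { uint16_t first, second; } Pair;

/* Sample page for U+0000..U+00FF; only two entries shown. */
static const Pair page_00[256] = {
    [0xC0] = {0x0041, 0x0300},  /* U+00C0 -> U+0041 U+0300 */
    [0xE9] = {0x0065, 0x0301},  /* U+00E9 -> U+0065 U+0301 */
};

/* Sample page for U+2100..U+21FF; only one entry shown. */
static const Pair page_21[256] = {
    [0x26] = {0x03A9, 0},       /* U+2126 -> U+03A9 (single-char) */
};

/* Top level: 256 page pointers indexed by the high byte.
   Assumption of this sketch: NULL means "shared all-zero page". */
static const Pair *const pages[256] = {
    [0x00] = page_00,
    [0x21] = page_21,
};

/* Writes up to two code points into out[] and returns how many
   were written (0 = the character has no decomposition). */
static int decompose_bmp(uint32_t cp, uint32_t out[2]) {
    assert(cp <= 0xFFFF);               /* BMP only */
    const Pair *page = pages[cp >> 8];
    if (page == NULL) return 0;
    Pair p = page[cp & 0xFF];
    if (p.first == 0) return 0;
    out[0] = p.first;
    if (p.second == 0) return 1;
    out[1] = p.second;
    return 2;
}
```

With 16-bit entries each non-empty page costs 1 KB, which is the size
argument being made here; 32-bit entries would double that.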

A change needed for Unicode 4.1:
- When 0xFFFF is stored in the place for a single-character
  decomposition, an additional switch statement finds the real
  decomposition. This affects 6 characters.
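
A minimal sketch of that escape, assuming the pair layout described earlier
(the names ESCAPE, resolve_escape and expand_entry are hypothetical). U+FFFF
is a noncharacter, so it can never be a genuine decomposition target, which
is what makes it usable as a marker. The two mappings in the switch are
believed to be among the Unicode 4.1 cases, but they should be checked
against UnicodeData.txt before being relied on.

```c
#include <assert.h>
#include <stdint.h>

/* Marker stored in a 16-bit slot when the real target is outside the BMP. */
#define ESCAPE 0xFFFFu

/* Fallback for the few BMP characters whose single-character
   decomposition does not fit in 16 bits. Sample entries only;
   verify the values against UnicodeData.txt. Returns 0 if unknown. */
static uint32_t resolve_escape(uint32_t cp) {
    switch (cp) {
    case 0xFACF: return 0x2284A;
    case 0xFAD0: return 0x22844;
    default:     return 0;
    }
}

/* Expands one 16-bit table pair for character cp into out[].
   Returns the number of code points written (0 = no decomposition). */
static int expand_entry(uint32_t cp, uint16_t first, uint16_t second,
                        uint32_t out[2]) {
    assert(cp <= 0xFFFF);               /* entries exist for BMP only */
    if (first == 0) return 0;
    if (first == ESCAPE && second == 0) {
        out[0] = resolve_escape(cp);    /* real target is non-BMP */
        return out[0] != 0 ? 1 : 0;
    }
    out[0] = first;
    if (second == 0) return 1;
    out[1] = second;
    return 2;
}
```

The common path pays one extra comparison, and the table entries stay
16 bits wide.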

I claim that it was not a bad idea to use 16-bit entries in the
tables.

Compatibility decomposition is another story. A decomposition may be
longer (up to 18 characters), but currently it produces only BMP
characters. That holds even for the range of 1024 characters (with
some holes) starting at U+1D400, the only non-BMP characters that
have compatibility decompositions. So my code doesn't currently
include a mechanism for producing non-BMP characters here.

-- 
   __("<         Marcin Kowalczyk
   \__/       qrczak@knm.org.pl
    ^^     http://qrnik.knm.org.pl/~qrczak/

This archive was generated by hypermail 2.1.5 : Tue Apr 05 2005 - 03:11:42 CST