From: Francois Yergeau (FYergeau@alis.com)
Date: Tue Jan 20 2004 - 13:01:26 EST
Look at SCSU (http://www.unicode.org/reports/tr6/) and BOCU-1
(http://www.unicode.org/notes/tn6/).
-- 
François

> -----Original Message-----
> From: Elliotte Rusty Harold [mailto:elharo@metalab.unc.edu]
> Sent: January 20, 2004 11:59
> To: unicode@unicode.org
> Cc: xom-interest@lists.ibiblio.org
> Subject: Unicode forms for internal storage
>
> I'm currently working on a project (XOM,
> <http://www.cafeconleche.org/XOM/>) in which the Unicode text data is
> a significant portion of memory usage in many important use cases.
> Currently, for the major class where this is an issue in practice (as
> proved by profiling), I store the data as UTF-8. This means ASCII
> data takes half the space it would in UTF-16, and many other
> characters take only the same amount as they would in UTF-16. However,
> CJK characters tend to take up 50% more space than they would in
> UTF-16.
>
> Last night it occurred to me it might be possible to design an
> internal storage format for this class which had better memory usage
> characteristics. In particular I'd like ASCII data to occupy only a
> single byte, and all other BMP characters from 128 to 65535 to occupy
> only two bytes. Non-BMP characters could be stored in surrogate pairs.
>
> In developing such a format I have a couple of advantages:
>
> 1. Most C0 controls are forbidden, and will not appear in the data.
> That's already verified. If someone tries to pass in a C0 control
> other than tab, linefeed, or carriage return to setValue, an
> exception is thrown and the data is not stored. Potentially one or
> more of these characters could be used as markers in the stream.
>
> 2. I do not need random access to parts of the data, only to whole
> strings. Unlike with UTF-8, it is not important to be able to look at
> a single byte in isolation and tell immediately which part of what
> kind of character it is.
>
> 3. This is all completely private to one class. No data in this form
> will be passed on the wire. None will be exposed via the public API
> which is completely based on Java strings (that is, UTF-16).
>
> However, I would like the translation into and out of this format to
> be at least as fast as the translation between UTF-8 and UTF-16 the
> class is currently performing on every call to setValue and getValue,
> ideally faster.
>
> Has anyone done any work on Unicode formats for this use-case? Does
> anyone have any references or ideas to share?
> --
>
> Elliotte Rusty Harold
> elharo@metalab.unc.edu
> Effective XML (Addison-Wesley, 2003)
> http://www.cafeconleche.org/books/effectivexml
>
> http://www.amazon.com/exec/obidos/ISBN%3D0321150406/ref%3Dnosim/cafeaulaitA
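
The format asked about in the quoted message (one byte for ASCII, two bytes for every other UTF-16 code unit, a forbidden C0 control used as a marker) could be sketched roughly as below. This is only an illustration under stated assumptions, not XOM's actual code and not SCSU or BOCU-1: the class name CompactText, the choice of U+0001 as the mode-switch marker, and the two-mode layout are all hypothetical. It assumes the input never contains U+0001, which matches the restriction on C0 controls described in the message.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical two-mode byte format, for illustration only.
// Narrow mode: one byte per ASCII character.
// Wide mode: two bytes (big-endian) per UTF-16 code unit above U+007F.
// Assumption: U+0001 never occurs in the text, so it can act as the mode switch.
final class CompactText {
    private static final int SWITCH = 0x01;

    static byte[] encode(String s) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(s.length());
        boolean wide = false; // every string starts in narrow (ASCII) mode
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            boolean needWide = c > 0x7F;
            if (needWide != wide) {
                // In wide mode the marker is the full unit 0x0001,
                // so a leading zero byte is written first.
                if (wide) out.write(0x00);
                out.write(SWITCH);
                wide = needWide;
            }
            if (wide) {
                out.write(c >>> 8);   // high byte of the code unit
                out.write(c & 0xFF);  // low byte of the code unit
            } else {
                out.write(c);
            }
        }
        return out.toByteArray();
    }

    static String decode(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length);
        boolean wide = false;
        int i = 0;
        while (i < data.length) {
            if (!wide) {
                int b = data[i++] & 0xFF;
                if (b == SWITCH) wide = true;
                else sb.append((char) b);
            } else {
                int unit = ((data[i] & 0xFF) << 8) | (data[i + 1] & 0xFF);
                i += 2;
                if (unit == SWITCH) wide = false;
                else sb.append((char) unit);
            }
        }
        return sb.toString();
    }
}
```

With this layout, pure ASCII text costs one byte per character, and a run of non-ASCII characters costs two bytes per UTF-16 code unit plus one byte to enter wide mode and two bytes to leave it. Surrogate pairs pass through as two ordinary wide units. Text that alternates frequently between ASCII and non-ASCII characters pays the switching overhead each time, which is the kind of case SCSU and BOCU-1 are designed to handle more gracefully.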