From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Sat Mar 21 2009 - 13:45:14 CST
> From: Petr Tomasek [mailto:tomasek@etf.cuni.cz]
> Sent: Saturday, 21 March 2009 19:59
> To: Philippe Verdy
> Cc: 'Petr Tomasek'; Unicode@unicode.org
> Subject: Re: Does OpenOffice 3.0 handle unicode?
>
>
> On Sat, Mar 21, 2009 at 07:10:32PM +0100, Philippe Verdy wrote:
> > > [mailto:unicode-bounce@unicode.org] On behalf of Petr Tomasek
> > > Sent: Saturday, 21 March 2009 17:42
> > > To: Unicode@unicode.org
> > > Subject: Does OpenOffice 3.0 handle unicode?
> > >
> > >
> > > Can someone, please, confirm whether the new version of OpenOffice
> > > can handle unicode? OpenOffice 2.0 unfortunately can handle only
> > > the BMP, while I need characters from the SMP.
> >
> > That's quite a stupid question: if OpenOffice can "handle" the BMP
> > characters, it means that it "handles" Unicode.
>
> OK, it was a bit of a provocation on my part, but hey,
> supporting only the BMP nowadays should be considered buggy behaviour.
>
> > Apparently you are unaware that OpenOffice was designed with
> > Unicode as a goal, and uses file formats that require the correct
> > support of Unicode.
> > This support has always been part of the file format specifications
> > (that are based on XML files compressed within a zipped archive).
> >
> > I can perfectly open Chinese documents containing characters from the
> > SIP, with OpenOffice (all versions, including those before 2.0).
>
> "My" OpenOffice 2.0.4 (on linux) cannot handle anything but BMP.
>
> If I copy text containing SMP characters into OOo, all I get
> are two "boxes".
> (Which makes me think OpenOffice handles UTF-16 as if it were
> UCS-2 internally, or something like that...)
>
> > This is not a problem of OpenOffice version but of support of the
> > display of the characters and scripts (for complex scripts) in the
> > system's or application's renderer.
>
> AFAIK OpenOffice uses the ICU library on Linux. Other
> programs built upon ICU (such as xetex) work without any
> problem with SMP characters.
>
> > But if you don't have any font for those scripts you want to render
> > and that are part of the SMP, all you'll get is a set of empty boxes.
>
> Of course I have fonts installed and other programs on my
> system (such as those based on Pango - www.pango.org) show
> them as expected.
>
> > So, on the same system, if I can open a document containing non-BMP
> > characters with MS Office, I can as well open it with OpenOffice (or
> > Sun StarOffice).
>
> On the same system I can open a document (in gedit e.g.)
> containing non-BMP characters but cannot open it using OpenOffice.
>
> So the conclusion: OpenOffice is broken and what you
> wrote is quite stupid :)

But you have demonstrated above, through your own contradictions, that your
issue is system-specific and not related to the application itself. You did
not specify which system you are using, but from what you just wrote, I
think it is some Linux distribution: check your system for the relevant
updates or support. You admitted that OOo uses ICU, and ICU is FULLY
compatible with Unicode (in all planes).

If your system displays two boxes, it is not a problem of OpenOffice but of
the renderers installed on your system and the way they are installed (and
made accessible to your OpenOffice installation).

For myself, I have absolutely no problem with SIP characters in OOo, because
I FIRST solved the system requirements and installed the necessary support.
Look at the installation instructions for your office installation and make
sure that your system has the relevant fonts, and that they are correctly
accessible to your rendering libraries (Pango is not the only thing to
check). You also need to check your X11 settings and some parameters of
your locale, because the character encoding support on your graphic console
partly depends on it and affects the way your system fonts are loaded and
handled by your renderer.

If your display settings do not report Unicode capability, all that
OpenOffice or other apps can do is try to adapt to your display locale and
map some characters to it, but there is no way for them to go beyond this
limitation. You may also need to check the version of your X11
implementation (XFree86?) and its built-in support for Unicode fonts: if it
is not enabled, your X11 installation will only expose sets of fonts for
several specific encodings, and your application will just try to adapt to
one or a few of them when it converts a single internal code point (UTF-8
encoded in the XML document) into the target encoding used by your display.
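
As a rough first check of the locale side of this, the following Python
sketch reports whether the encoding advertised by the current process locale
is UTF-8. It is only an illustration: the helper names `is_utf8_encoding`
and `describe_locale` are made up for this example, and the check says
nothing about installed fonts, X11 settings, or the renderer.

```python
import codecs
import locale


def is_utf8_encoding(name: str) -> bool:
    """Return True if `name` is a known alias of the UTF-8 codec."""
    try:
        # codecs.lookup() normalizes aliases such as "UTF8" or "utf_8".
        return codecs.lookup(name).name == "utf-8"
    except LookupError:
        # Unknown encoding names are treated as "not UTF-8".
        return False


def describe_locale() -> str:
    """Describe the encoding reported by the current process locale."""
    encoding = locale.getpreferredencoding(False)
    if is_utf8_encoding(encoding):
        return f"locale encoding {encoding!r}: all of Unicode is representable"
    return f"locale encoding {encoding!r}: characters outside it must be mapped or lost"


if __name__ == "__main__":
    print(describe_locale())
```

A non-UTF-8 result here only suggests where to look next; a UTF-8 result
does not by itself prove that the display side is configured correctly.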

Seeing two boxes does not mean that your office app handles a single code
point as two separate characters: see for yourself whether you can edit the
text and remove only one of the two surrogates that make up a single code
point. (In OpenOffice, documents are UTF-8 encoded in the archived XML
files, so supplementary characters are encoded as a whole, without any
surrogates. I am not able to select surrogates in isolation in the GUI, even
for characters for which I do not have any suitable font, or for code points
that are still not encoded in Unicode but are accepted anyway, displayed as
an empty box glyph, and kept unchanged when saving.)

If you can break a single character into one of its two surrogates, it just
means that internally the application reads the UTF-8 encoded documents
into memory by first converting them to UTF-16, but then ignores the UTF-16
requirements (this is what happened in old C applications that were simply
recompiled after changing "char" into "wchar_t"). But ICU was written to
support all Unicode requirements (including not breaking a character in the
middle), to correctly support all the classic text algorithms, and to
support the new ones needed for internationalization (including complex
scripts).
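
To make the surrogate issue concrete, here is a small Python sketch. It
takes one SMP character (U+10400, chosen arbitrarily for illustration),
shows that it is a single 4-byte sequence in UTF-8 but two 16-bit code
units in UTF-16, and shows what a UCS-2-style edit that drops one unit
leaves behind. The helper names are invented for this example and are not
taken from OpenOffice or ICU.

```python
from typing import List

SMP_CHAR = "\U00010400"  # one code point outside the BMP


def utf16_code_units(text: str) -> List[int]:
    """Return the 16-bit code units of `text` encoded as UTF-16."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


def has_lone_surrogate(units: List[int]) -> bool:
    """Return True if the code units contain an unpaired surrogate."""
    i = 0
    while i < len(units):
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:
            # A high surrogate must be followed by a low surrogate.
            if i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF:
                i += 2
                continue
            return True
        if 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            return True
        i += 1
    return False


if __name__ == "__main__":
    units = utf16_code_units(SMP_CHAR)
    print([hex(u) for u in units])           # two code units, one character
    print(len(SMP_CHAR.encode("utf-8")))     # one 4-byte UTF-8 sequence
    # Deleting "one character" the UCS-2 way removes only the last unit:
    print(has_lone_surrogate(units[:-1]))    # the text is now ill-formed
```

An editor that works on code points (or on grapheme clusters) never
produces the broken state in the last line, which is the behaviour
described above.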

OpenOffice documents using the ODT format are zipped archives containing
several XML files. These XML files are (and have to be) fully conforming
XML, so you can even patch one manually in a plain-text editor and
experiment with it. Not only can you encode these characters using UTF-8,
but you can also represent them as numeric character references (in decimal
or hexadecimal) using the Unicode code point: it works equally well, and
OpenOffice accepts the document without any difference. This means that ODT
documents can be generated or modified by various tools and applications
that know how to handle conforming XML documents, not just by saving them
from OpenOffice itself.
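
The equivalence of the two notations can be checked with any conforming XML
parser. The Python sketch below builds a tiny zip archive in memory holding
a single "content.xml" and reads it back. This is a deliberately simplified
stand-in, not a complete ODT file (a real one also carries a mimetype entry,
a manifest, and a full office:document-content root), and the helper names
are invented for this example.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# Namespace used for text elements in OpenDocument files.
TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0"


def make_archive(paragraph_markup: str) -> bytes:
    """Zip a minimal UTF-8 content.xml holding one text:p element."""
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<text:p xmlns:text="{TEXT_NS}">{paragraph_markup}</text:p>'
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("content.xml", xml.encode("utf-8"))
    return buf.getvalue()


def read_paragraph(archive: bytes) -> str:
    """Parse content.xml from the archive and return the paragraph text."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        root = ET.fromstring(zf.read("content.xml"))
    return root.text or ""


if __name__ == "__main__":
    literal = read_paragraph(make_archive("\U00010400"))    # raw UTF-8
    hexadecimal = read_paragraph(make_archive("&#x10400;"))  # hex reference
    decimal = read_paragraph(make_archive("&#66560;"))       # decimal reference
    print(literal == hexadecimal == decimal)
```

After parsing, all three spellings yield the same single code point, so an
application reading the file cannot tell which one was used.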