Transcoding: Beyond the Basics
Presented by: Deborah Goldsmith - Apple Computer, Inc.
Intended Audience: Managers, Software Engineers, Systems Analysts
Session Level: Beginner, Intermediate
At first glance, converting text from one character encoding to another - and in
particular between Unicode and other encodings - may seem straightforward: just convert
each character in one encoding into the equivalent character - if there is one - in the
other encoding. However, there are many subtle issues involved in deciding when characters
in different encodings are equivalent:
- Similar characters in different encodings may have different ranges of meanings, or
different properties. One encoding may use a single character to represent a range of
meanings that another encoding represents with two or three different characters, or a
character that is in both encodings may have different directional properties in each.
- Information that is explicit in one encoding may be implicit in another, where it
may depend on context or other state.
- How close an equivalence must be often depends on the purpose of a particular
transcoding operation, and may range from "identity" (subject to the caveats mentioned
above), through canonical, compatibility, or other semantic equivalence, all the way to
mere graphic similarity. Encoding conversions need to support different types of
equivalence to serve different types of clients properly.
This talk describes many such issues, and presents techniques for handling them.
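
As a rough illustration of the equivalence levels listed above, the following Python
sketch uses the standard unicodedata module to compare strings and to transcode them
under a caller-chosen level. The function names (equivalent, transcode) and the level
names are hypothetical and are not taken from the talk; Unicode normalization stands in
for whatever mapping tables a real converter would use, and graphic similarity is not
covered.

```python
import unicodedata

# Hypothetical level names mapped to Unicode normalization forms (an assumption
# made for this sketch, not an API described in the talk).
_DECOMPOSED = {"canonical": "NFD", "compatibility": "NFKD"}
_COMPOSED = {"canonical": "NFC", "compatibility": "NFKC"}


def equivalent(a, b, level="identity"):
    """Return True if a and b are equivalent at the requested level."""
    if level == "identity":
        return a == b
    form = _DECOMPOSED[level]  # KeyError for an unknown level
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)


def transcode(text, target, level="identity"):
    """Encode text into the target encoding, loosening the match only as far
    as the requested level allows."""
    if level != "identity":
        text = unicodedata.normalize(_COMPOSED[level], text)
    # Raises UnicodeEncodeError when no equivalent exists at this level.
    return text.encode(target)


if __name__ == "__main__":
    decomposed = "e\u0301"  # 'e' followed by COMBINING ACUTE ACCENT
    ligature = "\ufb01"     # LATIN SMALL LIGATURE FI
    print(equivalent("\u00e9", decomposed, "canonical"))
    print(transcode(decomposed, "latin-1", "canonical"))
    print(transcode(ligature, "latin-1", "compatibility"))
```

A client that needs round-trip fidelity would ask for "identity" and accept an error for
the ligature, while a client that only needs searchable text might ask for
"compatibility" and accept the two-letter result.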