Design and Implementation of a Suite of Chinese Transcoders for Python 2
Thomas Emerson - Basis Technology Corporation
Intended Audience: Software Engineers, Content Developers
Session Level: Intermediate, Advanced
With the release of Python 2.0 in October 2000, Unicode strings became
a fundamental datatype in the language. A new module, codecs, provides
support for registering new encoding converters to transcode between
Unicode and legacy encodings. Codecs are provided for the ISO 8859-n
8-bit encodings, but the Asian encodings are absent.
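As a rough illustration of how a converter is registered through the codecs module, the sketch below installs a toy encoding. It is written against the current codecs API (codecs.CodecInfo) rather than the Python 2.0 one, and the encoding name x_toy_swapcase and the toy_* functions are hypothetical names invented for this example.

```python
import codecs

# Hypothetical toy encoding: ASCII with upper and lower case swapped.
# It exists only to show the registration mechanics.

def toy_encode(text, errors="strict"):
    # An encoder returns (encoded bytes, number of characters consumed).
    return text.swapcase().encode("ascii", errors), len(text)

def toy_decode(data, errors="strict"):
    # A decoder returns (decoded text, number of bytes consumed).
    raw = bytes(data)
    return raw.decode("ascii", errors).swapcase(), len(raw)

def toy_search(name):
    # Search functions are called with a normalized encoding name and
    # return None for names they do not handle.
    if name in ("x_toy_swapcase", "x-toy-swapcase"):
        return codecs.CodecInfo(
            encode=toy_encode,
            decode=toy_decode,
            name="x_toy_swapcase",
        )
    return None

codecs.register(toy_search)

encoded = "Hello".encode("x_toy_swapcase")
decoded = encoded.decode("x_toy_swapcase")
```

Once the search function is registered, the new name can be used anywhere an encoding name is accepted, which is the hook a CJK codec package would rely on.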
The Python Codecs project
is underway to supplement the standard set of encodings with the legacy CJK
encodings. This presentation describes the design and implementation of a single
unified transcoding framework for a wide range of Chinese encodings, including:
EUC-CN, EUC-TW
HZ
ISO-2022-CN, ISO-2022-CN-EXT
GB 18030
Big 5 and its variants, including HKSCS
It is expected that this framework will scale to support other
multibyte 8-bit Asian encodings.
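One reason a unified framework is plausible is that several of the listed encodings are different byte-level wrappers around the same character set. The sketch below, which is an illustration and not necessarily the design presented in this session, rewrites HZ input as EUC-CN so that a single GB 2312 table can serve both. The function name hz_to_euc_cn is hypothetical, and the check at the end assumes the hz and gb2312 codecs that ship with current Python.

```python
def hz_to_euc_cn(data):
    """Rewrite HZ-encoded bytes as EUC-CN bytes (sketch, no error handling)."""
    out = bytearray()
    gb_mode = False
    i = 0
    while i < len(data):
        pair = data[i:i + 2]
        if gb_mode:
            if pair == b"~}":
                # Escape back to ASCII mode.
                gb_mode = False
            else:
                # GB mode carries 7-bit byte pairs; setting the high bit
                # on both bytes yields the EUC-CN form of the character.
                out.extend(b | 0x80 for b in pair)
            i += 2
        elif pair == b"~{":
            # Escape into GB mode.
            gb_mode = True
            i += 2
        elif pair == b"~~":
            # An escaped literal tilde.
            out.append(0x7E)
            i += 2
        elif pair == b"~\n":
            # Line continuation: consumed, produces no output.
            i += 2
        else:
            # Plain ASCII byte.
            out.append(data[i])
            i += 1
    return bytes(out)

# Cross-check against the built-in codecs.
text = "Python 中文 codecs"
hz_bytes = text.encode("hz")
assert hz_to_euc_cn(hz_bytes).decode("gb2312") == text
```

The same idea, a thin per-encoding layer over shared mapping tables, is what would let additional multibyte encodings be added without duplicating the tables.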