From: Mike (mike-list@pobox.com)
Date: Mon Jul 07 2008 - 22:19:36 CDT
>> Writing your code with the assumption that you're dealing with BMP-
>> only is nonetheless still a bad idea, since the day will inevitably
>> come when you want to re-use it in a situation where the assumption is
>> false. Best to write for Unicode with as much generality as possible
>> from the get-go rather than having to rewrite later.
In my own code, I solved this by creating various UTF iterators. For
example, when you ask the UTF-16 iterator for the next character, it
examines the next two bytes of the String to determine whether they
form a surrogate. If they don't, it returns a uint32 holding the code
point and advances two bytes. If they form a high (lead) surrogate, it
checks that the following two bytes form a low (trail) surrogate,
combines the pair into the effective code point, and returns that,
advancing four bytes.
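A minimal sketch of such a UTF-16 iterator, in Python rather than whatever
language my own code uses. The name iter_utf16_be is hypothetical, and two
things are assumptions of the sketch: the input is big-endian with no BOM,
and a malformed or truncated sequence raises an error (other policies, such
as substituting U+FFFD, are equally possible).

```python
def iter_utf16_be(data: bytes):
    """Yield code points (ints) from big-endian UTF-16 bytes.

    Hypothetical helper; raising ValueError on bad input is an assumption.
    """
    pos = 0
    n = len(data)
    while pos < n:
        if n - pos < 2:
            raise ValueError(f"truncated code unit at byte {pos}")
        unit = int.from_bytes(data[pos:pos + 2], "big")
        if 0xD800 <= unit <= 0xDBFF:
            # High (lead) surrogate: the next two bytes must hold a
            # low (trail) surrogate.
            if n - pos < 4:
                raise ValueError(f"unpaired high surrogate at byte {pos}")
            low = int.from_bytes(data[pos + 2:pos + 4], "big")
            if not 0xDC00 <= low <= 0xDFFF:
                raise ValueError(f"unpaired high surrogate at byte {pos}")
            # Combine the pair into a supplementary-plane code point.
            yield 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00)
            pos += 4
        elif 0xDC00 <= unit <= 0xDFFF:
            # A low surrogate with no preceding high surrogate.
            raise ValueError(f"unpaired low surrogate at byte {pos}")
        else:
            # Ordinary BMP code unit: it is the code point itself.
            yield unit
            pos += 2
```

Fed the bytes 00 41 D8 34 DD 1E, this yields U+0041 and then U+1D11E,
consuming two bytes for the first and four for the second.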
By creating these iterators, I was able to write all my other
higher-level code independent of the UTF in use. For example, I can normalize
UTF-8 input directly into UTF-16 output (which requires my Char class
to turn a uint32 into the proper sequence of bytes for the output
encoding). One of these days I plan to write a GB18030 iterator (if I
can ever find a decent reference on how to en/decode it), and all the
high level functions will "just work" without even knowing the form of
the original data.
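The output side can be sketched the same way. This is not my Char class,
just an illustration in Python with hypothetical names (encode_utf8,
encode_utf16_be, transcode): each encoder turns one code point into the
bytes of its target encoding, and transcode accepts any iterable of code
points, so it does not care what the input encoding was. The sketch only
transcodes; it does not normalize, and big-endian UTF-16 output is an
assumption.

```python
def _check_scalar(cp: int) -> None:
    # Only Unicode scalar values can be encoded: U+0000..U+10FFFF,
    # excluding the surrogate range U+D800..U+DFFF.
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError(f"not a Unicode scalar value: {cp:#x}")


def encode_utf8(cp: int) -> bytes:
    """Encode one code point as 1 to 4 UTF-8 bytes."""
    _check_scalar(cp)
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf16_be(cp: int) -> bytes:
    """Encode one code point as 2 or 4 big-endian UTF-16 bytes."""
    _check_scalar(cp)
    if cp < 0x10000:
        return cp.to_bytes(2, "big")
    # Supplementary plane: split into a high and a low surrogate.
    v = cp - 0x10000
    high = 0xD800 | (v >> 10)
    low = 0xDC00 | (v & 0x3FF)
    return high.to_bytes(2, "big") + low.to_bytes(2, "big")


def transcode(code_points, encode) -> bytes:
    """Join the encoded form of every code point from any source."""
    return b"".join(encode(cp) for cp in code_points)
```

For example, transcode([0x41, 0x1D11E], encode_utf16_be) produces the same
six bytes that the UTF-16 iterator would read back as those two code points.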
This approach lets you separate the input byte stream processing from
the rest of the code, and even allows you to expand your ability to
handle different encodings as the demand for them surfaces, with
minimal pain and effort. If you think UTF-16 is right today, perhaps
tomorrow you will be rewriting everything for UTF-8. If you follow my
example, though, the higher-level code needs no rewriting: you add an
iterator for each new encoding and can support as many encodings as
you eventually need.
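To make the separation concrete, here is a rough Python sketch with
hypothetical names throughout (iter_utf8, iter_utf32_be, DECODERS,
count_supplementary). The high-level function sees only code points;
supporting another encoding means adding one iterator to the table.
Raising ValueError on malformed input is an assumption of the sketch.

```python
def iter_utf8(data: bytes):
    """Yield code points from UTF-8 bytes, rejecting malformed input."""
    pos, n = 0, len(data)
    while pos < n:
        lead = data[pos]
        # The lead byte gives the sequence length, its payload bits, and
        # the smallest code point allowed at that length (no overlongs).
        if lead < 0x80:
            length, cp, min_cp = 1, lead, 0
        elif 0xC2 <= lead <= 0xDF:
            length, cp, min_cp = 2, lead & 0x1F, 0x80
        elif 0xE0 <= lead <= 0xEF:
            length, cp, min_cp = 3, lead & 0x0F, 0x800
        elif 0xF0 <= lead <= 0xF4:
            length, cp, min_cp = 4, lead & 0x07, 0x10000
        else:
            raise ValueError(f"invalid lead byte at {pos}")
        if pos + length > n:
            raise ValueError(f"truncated sequence at {pos}")
        for b in data[pos + 1:pos + length]:
            if b & 0xC0 != 0x80:
                raise ValueError(f"invalid continuation byte after {pos}")
            cp = (cp << 6) | (b & 0x3F)
        if cp < min_cp or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at {pos}")
        yield cp
        pos += length


def iter_utf32_be(data: bytes):
    """Yield code points from big-endian UTF-32 bytes."""
    if len(data) % 4:
        raise ValueError("length is not a multiple of 4")
    for pos in range(0, len(data), 4):
        cp = int.from_bytes(data[pos:pos + 4], "big")
        if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
            raise ValueError(f"invalid code point at {pos}")
        yield cp


# Supporting a new encoding means adding one entry here.
DECODERS = {"utf-8": iter_utf8, "utf-32-be": iter_utf32_be}


def count_supplementary(data: bytes, encoding: str) -> int:
    """High-level code: counts code points outside the BMP.

    It never looks at bytes, only at what the iterator yields.
    """
    return sum(1 for cp in DECODERS[encoding](data) if cp > 0xFFFF)
```

The same text encoded as UTF-8 or as UTF-32 gives the same count, because
count_supplementary never learns which encoding it was handed.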
Mike