RE: GB18030

From: Murray Sargent (murrays@microsoft.com)
Date: Fri Sep 21 2001 - 12:01:17 EDT


I think I've figured out a way to find the beginning of a GB18030 character starting anywhere in a document. The algorithm is similar to finding the beginning of a DBCS character in that you scan backward until you find a byte that can only come at the start of a character. The main difference is that you first check whether you are inside a four-byte character (those of the form HdHd, where H is a byte in the range 0x81 - 0xFE and d is an ASCII digit). If a four-byte character isn't involved (ordinary GBxxxx characters don't use d as a trail byte), you revert to the DBCS approach for handling the rest of GB18030.
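
The post does not spell the algorithm out, so the following is only a conservative sketch of the same idea, not necessarily the exact method described above. It assumes well-formed GB18030 that begins on a character boundary. It backs up to the nearest position that is certain to be a boundary (the buffer start, or just after a byte that is neither H nor d, since such a byte always ends a character), then decodes forward. The names is_lead, is_digit, char_len_at and char_start are made up for illustration.

```python
def is_lead(b: int) -> bool:
    """H in the post: a byte that can begin a two- or four-byte character."""
    return 0x81 <= b <= 0xFE


def is_digit(b: int) -> bool:
    """d in the post: an ASCII digit, used as byte 2 and 4 of a four-byte character."""
    return 0x30 <= b <= 0x39


def char_len_at(data: bytes, p: int) -> int:
    """Length (1, 2 or 4) of the character that starts at boundary p.

    Needs at most the first two bytes. If the lead byte is the last byte
    available, 2 is reported because the second byte cannot be inspected.
    """
    if not is_lead(data[p]):
        return 1
    if p + 1 < len(data) and is_digit(data[p + 1]):
        return 4
    return 2


def char_start(data: bytes, i: int) -> int:
    """Index of the first byte of the character that contains data[i]."""
    # Step 1: scan backward to a guaranteed boundary. A byte that is neither
    # H nor d is either a single-byte character or a two-byte trail byte,
    # so a character always ends on it.
    p = i
    while p > 0 and (is_lead(data[p - 1]) or is_digit(data[p - 1])):
        p -= 1
    # Step 2: decode forward from that boundary until the character
    # covering index i is reached.
    while True:
        n = char_len_at(data, p)
        if p + n > i:
            return p
        p += n
```

For a chunk that starts on a boundary, the last character is cut off when char_start(chunk, len(chunk) - 1) plus its char_len_at value exceeds len(chunk). As with any DBCS backward scan, a long run of H bytes can push the scan far back.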
 
This algorithm is handy when you want to stream in a file in chunks and need to know if a chunk ends in the middle of a character. One can also solve this particular problem by keeping track of character boundaries from the start of the stream, but that typically involves more processing.
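
For comparison, here is a minimal sketch of the bookkeeping alternative mentioned above, where boundaries are tracked from the start of the stream. It leans on Python's standard incremental decoder for the "gb18030" codec, which buffers an incomplete trailing character until the next chunk arrives. The function name decode_chunks is an assumption for illustration.

```python
import codecs
from typing import Iterable, Iterator


def decode_chunks(chunks: Iterable[bytes]) -> Iterator[str]:
    """Decode GB18030 byte chunks whose edges may fall inside a character."""
    decoder = codecs.getincrementaldecoder("gb18030")()
    for chunk in chunks:
        # Bytes of a character split across chunks are held back by the
        # decoder and emitted once the rest of the character arrives.
        text = decoder.decode(chunk)
        if text:
            yield text
    # final=True raises UnicodeDecodeError if the stream ends mid-character.
    tail = decoder.decode(b"", final=True)
    if tail:
        yield tail
```

The decoder has to see every byte from the beginning, which is the extra processing the backward scan avoids when all you need is the boundary near a chunk edge.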
 
Murray

        -----Original Message-----
        From: Carl W. Brown [mailto:cbrown@xnetinc.com]
        Sent: Fri 2001/09/21 04:56
        To: Charlie Jolly; unicode@unicode.org
        Cc:
        Subject: RE: GB18030
        
        

        Charlie,
        
        GB18030 is designed to support all Unicode characters. It has the capacity
        to also encode additional characters. I know of no plans to do so.
        
        I don't think it will have much effect on Unicode. Most systems that handle
        GB18030 will want to convert it to Unicode first to reduce processing
        overhead. With most of the common MBCS code pages you can determine the
        length of the character from the first byte. With GB18030 you sometimes
        have to check the first two bytes. UTF-8, for example, is an MBCS
        encoding, but I can still step backwards through a string. With GB18030
        I must start over from the beginning of the string to find the start of
        the previous character.
        
        It is smaller than UTF-8 for Chinese and larger for anyone else.
        
        Carl
        
> -----Original Message-----
> From: unicode-bounce@unicode.org
> [mailto:unicode-bounce@unicode.org]On Behalf Of Charlie Jolly
> Sent: Friday, September 21, 2001 1:42 AM
> To: unicode@unicode.org
> Subject: GB18030
>
>
> GB18030
>
> In what ways will this affect Unicode?
>
> Does it contain anything that Unicode doesn't?

This archive was generated by hypermail 2.1.2 : Fri Sep 21 2001 - 16:22:22 EDT