Re: "Universal Character Set"

From: Asmus Freytag (asmusf@ix.netcom.com)
Date: Sat Feb 17 2007 - 15:54:19 CST


    On 2/17/2007 9:58 AM, Don Osborn wrote:
    >
    > Does anyone currently use the term “Universal Character Set” (UCS) to
    > refer to Unicode/ISO-10646? I guess it is technically correct, but I
    > rarely see it. It seems that folks generally use “Unicode” as the
    > catch-all term, or maybe I’m missing a wider use of UCS?
    >
    I believe your observation about "Unicode" being the common label is to
    the point. A bit of research is illuminating and might explain some of
    the reasons why the term has caught on.

    There are about 33 million pages indexed on Google that can be retrieved
    by a search for "Unicode" and about 111,000 by a search for "Universal
    character set". If you subtract all pages that mention 10646, Unicode,
    or UCS, that number drops to 1/10th for the latter. If you similarly
    subtract the other terms from the search for Unicode, there's hardly a
    reduction in number.

    What that means is that "universal character set" is probably most often
    used as a descriptor, as in "Unicode is a universal character set", and
    not as a label. The common label is clearly "Unicode". That's not
    surprising, because Unicode as a label has the advantage of being
    shorter and clearly referring to a specific character set.

    In the case of UCS as a label, you run into the problem that the letters
    UCS are not unique. Google will pull up the Union of Concerned
    Scientists, UCS Inc., University College School and a number of others
    on the first screen (and also helpfully suggest that you really meant
    USC). Trading distinctiveness for brevity is apparently not a clear
    win - and the use of UCS (in all meanings) is barely 1/6th of that
    for Unicode. If you search for UCS together with 10646 or Unicode to
    sift out when UCS might have been used in the context of character sets,
    you find only about 800K links, which only emphasizes the issue with the
    multiple meanings of UCS.

    10646 by itself gives about 4.5 million hits, of which fully 1/3 don't
    mention ISO, but are in reference to part numbers or are otherwise false
    positives--based on that you can conclude that 10646 is used as a
    designator of the character set about 1/10th as often as Unicode.

    There are instances where referring to Unicode is the only correct
    choice: for example, when referring to the Unicode Normalization Forms,
    the Unicode Bidi Algorithm, Unicode Line Breaking, and the myriad other
    specifications that have been developed or are being developed by the
    Unicode Consortium around the character set and its collection of
    character properties.

    A./


