Re: Ranges/blocks ; font lookup by range

From: Mark Davis (mark.davis@icu-project.org)
Date: Tue May 08 2007 - 11:32:18 CDT


    The important concept for most people is the actual list of characters
    required for writing a given language. This does not align with the notion
    of "block" in the Unicode Standard, which is often a matter of historical
    accident based on when chunks of characters were incorporated. While people
    made efforts to have blocks be reasonably consistent in content, they don't
    necessarily correspond to actual usage.

    Thus a character list for a language may span multiple blocks, and yet not
    include all of the characters in any single block. I think you generally
    just want to avoid using the term "block".
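As a rough illustration of that mismatch, here is a minimal Python sketch. The three block ranges are copied by hand from Blocks.txt; `SAMPLE_LIST` is a hypothetical and deliberately incomplete character list invented for this example, not an authoritative list for any language.

```python
# Hand-copied excerpt of block ranges; the normative list is Blocks.txt in the UCD.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0080, 0x00FF, "Latin-1 Supplement"),
    (0x0100, 0x017F, "Latin Extended-A"),
]


def block_of(cp):
    """Return the block name for a code point, or a placeholder if not in the excerpt."""
    for start, end, name in BLOCKS:
        if start <= cp <= end:
            return name
    return "(not in excerpt)"


def blocks_touched(chars):
    """Map block name -> number of distinct characters from `chars` in that block."""
    counts = {}
    for ch in set(chars):
        name = block_of(ord(ch))
        counts[name] = counts.get(name, 0) + 1
    return counts


# Hypothetical sample list: a-z plus U+00E4, U+00F6, U+00FC, U+00DF.
SAMPLE_LIST = "abcdefghijklmnopqrstuvwxyz\u00e4\u00f6\u00fc\u00df"

counts = blocks_touched(SAMPLE_LIST)
block_sizes = {name: end - start + 1 for start, end, name in BLOCKS}
for name, n in sorted(counts.items()):
    print(f"{name}: {n} of {block_sizes[name]} code points used")
```

The sample list touches two blocks and uses only a small part of each, which is the situation described above.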

    Mark

    On 5/8/07, Don Osborn <dzo@bisharat.net> wrote:
    >
    > Thanks Ken for the detailed explanations and all for the info &
    > discussion.
    >
    > The way I understand it then, if you were talking about a language that
    > uses some extended Ethiopic/Ge'ez characters, you might say you need a
    > font with (selected characters in) the "Ethiopic Supplement block" more
    > properly than "Ethiopic Supplement range" but in the end it's pretty much
    > the same?
    >
    > IOW, anything named (such as "Latin Extended-B" or "Arabic Supplement") is
    > a block but any group of characters that is contiguous could be referred
    > to as a range? So you could refer to a font having all Latin blocks or
    > ranges?
    >
    > Sorry this is tedious, but in introducing the concepts to users who don't
    > necessarily need technical precision but do need to get how the system is
    > organized, one wants clarity and simplicity but not inaccuracy in terms.
    > One thing a user encounters in practice that doesn't look like a "block"
    > is the window you get in Word or Write when doing insert symbol. That is,
    > the character groups almost inevitably start & stop in mid "row" (in the
    > conventional, not 10646 sense) for various reasons (size of window; font
    > that has only selected characters from a block). This is not a complaint
    > in any way - just thinking out loud about one issue among many for a
    > presentation.
    >
    > Anyway, this gives me a better perspective on the terminology, so thanks
    > again.
    >
    > Don
    >
    >
    > > -----Original Message-----
    > > From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org] On
    > > Behalf Of Kenneth Whistler
    > > Sent: Monday, May 07, 2007 7:46 PM
    > > To: asmusf@ix.netcom.com
    > > Cc: unicode@unicode.org; kenw@sybase.com
    > > Subject: Re: Ranges/blocks ; font lookup by range
    > >
    > >
    > > > > 1) Is "character range" or "character block" the preferred term
    > > > > now?
    > >
    > > > In Unicode, a block is a named entity associated with a range of
    > > > characters that is an integral multiple of 16.
    > > > That should provide the relation between these two terms. A 256
    > > > character range inside the Unified CJK Ideographs block, for example,
    > > > is not a block. (In 10646 it's called a 'row', if aligned on even 256
    > > > boundaries, but that's not a widely understood term out of context).
    > >
    > > Refining a little bit on Asmus' definitions:
    > >
    > > A Unicode block is a named entity associated with a range of *code
    > > points* that is an integral multiple of 16.
    > >
    > > You need to specify it that way, because a Unicode block can and often
    > > does contain unassigned (= reserved) code points, and may, in some
    > > instances, even contain noncharacters.
    > >
    > > The exact list of blocks is specified normatively in the UCD file,
    > > Blocks.txt. (Or you can see a comparable listing in Annex A of
    > > 10646.)
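For readers who want to check the "multiple of 16" property themselves, here is a minimal Python sketch that parses lines in the Blocks.txt format ("start..end; name", with "#" starting a comment). The embedded `SAMPLE` text is a small hand-copied excerpt, not the full file, and the helper names `parse_blocks` and `is_whole_columns` are made up for this illustration.

```python
import re

# A few lines in the Blocks.txt format; excerpt only, not the full file.
SAMPLE = """\
# Sample excerpt
0000..007F; Basic Latin
0370..03FF; Greek and Coptic
1200..137F; Ethiopic
1380..139F; Ethiopic Supplement
FFF0..FFFF; Specials
"""

LINE = re.compile(r"^([0-9A-Fa-f]+)\.\.([0-9A-Fa-f]+);\s*(.+?)\s*$")


def parse_blocks(text):
    """Return a list of (start, end, name) tuples from Blocks.txt-style text."""
    blocks = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        m = LINE.match(line)
        if m is None:
            raise ValueError(f"unrecognized line: {raw!r}")
        blocks.append((int(m.group(1), 16), int(m.group(2), 16), m.group(3)))
    return blocks


def is_whole_columns(start, end):
    """True if the range starts at XXX0 and ends at XXXF (whole 16-code-point columns)."""
    return start % 16 == 0 and (end + 1) % 16 == 0


blocks = parse_blocks(SAMPLE)
for start, end, name in blocks:
    print(f"{start:04X}..{end:04X} {name}: whole columns = {is_whole_columns(start, end)}")
```

To run it against the real data, one would read the actual Blocks.txt from the UCD into a string and pass that to `parse_blocks` instead of `SAMPLE`.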
    > >
    > > Another way of thinking about it is that a block is a named entity
    > > consisting of a contiguous range of columns, where a column is
    > > defined as:
    > >
    > > Column: a range of 16 code points XXX0..XXXF
    > >
    > > "Column" isn't a normative term in either 10646 or the Unicode
    > > Standard, but is still a useful concept because it is so visible
    > > in the code charts.
    > >
    > > In the 10646 context, the following terms are also commonly used (these
    > > are my definitions, not normative definitions in the standard):
    > >
    > > Row: a range of 256 code points XX00..XXFF
    > >
    > > Plane: a range of 64K code points X0000..XFFFF
    > >
    > > For comparison, here are the normative 10646 definitions:
    > >
    > > Row: A subdivision of a plane; of 256 cells.
    > >
    > > Plane: A subdivision of a group; of 256 x 256 cells.
    > >
    > > The Unicode Standard has adopted the term "plane" but
    > > doesn't make any regular use of the "row" term.
    > >
    > > On the other hand, the Unicode Standard makes use of the term "range"
    > > in its normal mathematical sense, and it can be used to specify any
    > > ad hoc listing of code points with a start and a stop point.
    > > For example, it is perfectly o.k. to talk about a character
    > > range, U+FFFE..U+10001, even though that particular range happens
    > > to span a column break, a row break, and a plane break, and also
    > > incorporates characters (and noncharacters) from two different blocks.
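The column, row, and plane arithmetic behind that example can be sketched in a few lines of Python. The function names are invented for this illustration; the shifts follow the informal definitions above (16, 256, and 64K code points), and the noncharacter test uses the standard's definition (U+FDD0..U+FDEF plus the last two code points of every plane).

```python
def column(cp):
    """Index of the 16-code-point column XXX0..XXXF containing cp."""
    return cp >> 4


def row(cp):
    """Index of the 256-code-point row XX00..XXFF containing cp."""
    return cp >> 8


def plane(cp):
    """Index of the 64K-code-point plane containing cp."""
    return cp >> 16


def breaks_crossed(first, last):
    """Report which kinds of boundary lie between first and last (inclusive range)."""
    return {
        "column": column(first) != column(last),
        "row": row(first) != row(last),
        "plane": plane(first) != plane(last),
    }


def is_noncharacter(cp):
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


crossed = breaks_crossed(0xFFFE, 0x10001)
nonchars = [cp for cp in range(0xFFFE, 0x10001 + 1) if is_noncharacter(cp)]
print(crossed)
print([f"U+{cp:04X}" for cp in nonchars])
```

For U+FFFE..U+10001 all three flags come out true, and the two noncharacters in the range are U+FFFE and U+FFFF.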
    > >
    > > One of the reasons why the Unicode Standard has generally moved away
    > > from talking too much about "Unicode character blocks", despite their
    > > normative status in the standard, is that they do not correlate
    > > well with script identity. There are a number of instances where
    > > a script is split across more than one block (Latin, Cyrillic, etc.),
    > > and there are instances where more than one script is contained within
    > > a single block (Greek and Coptic).
    > >
    > > People unfamiliar with the standard are likely to expect that if
    > > one talks about "the Ethiopic block", for example, that:
    > >
    > > A. It will contain all the Ethiopic characters.
    > > B. It will be a "block" in the sense Doug talked about, i.e.
    > > a "code page"-like chunk of 256 characters 00..FF (or a
    > > "row" in 10646 parlance).
    > > C. It contains no characters used by other scripts.
    > >
    > > C happens to be true in this case, but A and B are not, because
    > > there are also Ethiopic characters in another supplemental block,
    > > and because the range of the Ethiopic block is 1200..137F.
    > >
    > > Interestingly, because the Ethiopic Supplement block was added
    > > contiguous to the Ethiopic block, the range of Ethiopic characters
    > > is a contiguous range, 1200..139F, even though that spans two blocks.
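The contiguity point can be checked mechanically. Below is a minimal Python sketch that merges block ranges when one starts right after the other ends. It covers only the two blocks named in this message, with ranges copied by hand from Blocks.txt; `merge_adjacent` is a name made up for the example.

```python
# Only the two blocks discussed in this message; ranges hand-copied from Blocks.txt.
ETHIOPIC_BLOCKS = [
    (0x1200, 0x137F, "Ethiopic"),
    (0x1380, 0x139F, "Ethiopic Supplement"),
]


def merge_adjacent(ranges):
    """Merge (start, end) ranges that overlap or touch (next start == previous end + 1)."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


merged = merge_adjacent([(s, e) for s, e, _ in ETHIOPIC_BLOCKS])
ethiopic_size = 0x137F - 0x1200 + 1  # size of the Ethiopic block alone

print([f"{s:04X}..{e:04X}" for s, e in merged])
print(ethiopic_size)
```

The two blocks collapse into the single range 1200..139F, and the Ethiopic block alone covers 384 code points, so it is not a 256-code-point "row".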
    > >
    > > --Ken
    > >
    > >

    -- 
    Mark
    


    This archive was generated by hypermail 2.1.5 : Tue May 08 2007 - 11:34:13 CDT