In my opinion, while treating unassigned code points as base characters (and
the reasons for treating them this way) is logically sound, it would not be
superfluous to formalize that.
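
To make the point concrete, here is a minimal Python sketch (my own, not
taken from UAX #29) of the Control class as Karl quotes it below, together
with the "!Control Grapheme_Extend*" branch of the quoted expression. The
helper names are hypothetical. Grapheme_Extend is only approximated by
Gc=Mn/Me, because the standard unicodedata module does not expose that
property, and Prepend, Hangul syllables, Spacing_Mark and CRLF are left out.
It also assumes U+0378 is still unassigned in the Unicode version your
Python ships with.

```python
import unicodedata

# Code points the quoted definition explicitly excludes from Control.
EXCLUDED = {"\r", "\n", "\u200c", "\u200d"}  # CR, LF, ZWNJ, ZWJ


def is_control(ch):
    """Control as quoted below: Gc in Zl, Zp, Cc, Cf, minus CR, LF, ZWNJ, ZWJ."""
    return ch not in EXCLUDED and unicodedata.category(ch) in {"Zl", "Zp", "Cc", "Cf"}


def is_extend_approx(ch):
    """Rough stand-in for Grapheme_Extend (assumption: Gc=Mn or Gc=Me only)."""
    return unicodedata.category(ch) in {"Mn", "Me"}


def cluster_length(text):
    """Length of the leading '!Control Extend*' match; 1 for the '.' branch."""
    if not text:
        return 0
    first = text[0]
    if first in "\r\n" or is_control(first):
        return 1  # simplified: CRLF pairing is not handled here
    n = 1
    while n < len(text) and is_extend_approx(text[n]):
        n += 1
    return n


UNASSIGNED = "\u0378"   # assumed to be Gc=Cn
ACUTE = "\u0301"        # COMBINING ACUTE ACCENT, Gc=Mn

print(unicodedata.category(UNASSIGNED))
print(is_control(UNASSIGNED))
print(cluster_length(UNASSIGNED + ACUTE + ACUTE + "a"))
print(cluster_length("\u0001" + ACUTE))
```

Under these assumptions the unassigned code point is not Control, so it
absorbs the following combining marks into one cluster, whereas a real
control character (U+0001) stands alone.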
Konstantin
2011/7/5 Mark Davis ☕ <mark_at_macchiato.com>
> Ah, you're right; I wasn't looking carefully enough at what you wrote.
>
> Yes, an unassigned code point (Cn) is treated as a base character.
>
> Unassigned code points are peculiar beasts, since we don't know really how
> they should behave until (and if) they are assigned. Their treatment by the
> Unicode algorithms varies based on some factors:
>
> - safety - don't have them behave in a way that causes problems
> - foresight - have them behave like the most likely candidate for
> future assignment
> - simplicity - since they shouldn't occur normally in text, don't spend
> too much time worrying about them.
>
> These are not formalized principles, just my observations on how we've
> operated over the years.
>
> Mark
> *— Il meglio è l’inimico del bene —*
>
>
>
> On Mon, Jul 4, 2011 at 20:17, Karl Williamson <public_at_khwilliamson.com> wrote:
>
>> On 07/03/2011 05:52 PM, Mark Davis ☕ wrote:
>>
>>>
>>>
>>> Mark
>>> /— Il meglio è l’inimico del bene —/
>>>
>>>
>>> On Sat, Jul 2, 2011 at 14:58, Karl Williamson <public_at_khwilliamson.com> wrote:
>>>
>>> I have two questions about this.
>>>
>>> 1) In UAX #44, it says for information about the Grapheme_Base
>>> property, to see UAX #29, but that document doesn't mention this
>>> property.
>>>
>>>
>>> The documentation on Grapheme_Base in #44 is obsolete. Grapheme_Base has
>>> not been used in the specification of grapheme clusters since (I
>>> believe) Unicode 3.2.
>>>
>>>
>>> 2) The definition in UAX #29 for both legacy and extended grapheme
>>> clusters effectively says that any Gc=Cn code points followed by any
>>> number of grapheme_extend code points is a grapheme cluster. Is
>>> that what is meant? I notice that Grapheme_Base excludes Cn code
>>> points.
>>>
>>>
>>> It doesn't say that. If you had the sequence <Control Extend>, you'd
>>> have a break between them according to the following rule:
>>> GB4. ( Control | CR | LF ) ÷
>>>
>>> It would result in two (degenerate) grapheme clusters.
>>>
>>> We need to fix the documentation to make this clearer. Could you let me
>>> know what led you to think that "any Gc=Cn code points followed by any
>>> number of grapheme_extend code points is a grapheme cluster" so that we
>>> can clarify that?
>>>
>>
>> It says that an extended grapheme cluster matches this:
>> ( CRLF
>> | Prepend* ( Hangul-syllable | !Control )
>> ( Grapheme_Extend | Spacing_Mark)*
>> | . )
>>
>> That tells me that one option for a grapheme cluster is a !Control
>> followed by any number of Grapheme_Extends.
>>
>> Lower down it defines "Control" as
>> "General_Category = Line Separator (Zl), or
>> General_Category = Paragraph Separator (Zp), or
>> General_Category = Control (Cc), or
>> General_Category = Format (Cf)
>> and not U+000D CARRIAGE RETURN (CR)
>> and not U+000A LINE FEED (LF)
>> and not U+200C ZERO WIDTH NON-JOINER (ZWNJ)
>> and not U+200D ZERO WIDTH JOINER (ZWJ)"
>>
>> By that definition of Control, all Gc=Cn code points are !Control.
>> Therefore a grapheme cluster can be a Cn followed by any number of
>> Grapheme_Extends.
>>
>>
>>> Grapheme_Extend, on the other hand, is exactly equivalent to
>>> Grapheme_Cluster_Break=Extend.
>>>
>>>
>>
>
Received on Tue Jul 05 2011 - 15:42:36 CDT