Re: Canonical block names: spaces vs. underscores

From: Ken Whistler <kenwhistler_at_att.net>
Date: Thu, 26 May 2016 11:07:14 -0700

On 5/26/2016 10:05 AM, Mathias Bynens wrote:
>> On 26 May 2016, at 17:47, Mark Davis ☕️ <mark_at_macchiato.com> wrote:
>>
>> The canonical property and property value formats are in the *Alias* files.
> Thanks for confirming!

Well, not quite... See below.

>
> Any chance the canonical names can be used in `Blocks.txt` as well, for consistency? This would simplify scripts that parse the Unicode database text files.

There's always a chance, I guess. But if we did so, we'd end up having to
just invent some other more-or-less ad hoc property:
Block_Name_Usable_For_Display, with the values we already have in the
Blocks.txt file. Or we would have to change the format to include the
block short alias as an additional field in the file, which would have its
own maintenance and consistency issues. Or we would be introducing a
historical inconsistency in the UCD between versions, which would
*complicate* certain other scripts that parse the UCD.

>
>> On 26 May 2016, at 18:03, Ken Whistler <kenwhistler_at_att.net> wrote:
>>
>> […] "canonical block name" is not a defined term in the standard.
> I didn’t mean to imply it was — it’s just an English word. I meant “canonical” as in “without loose matching applied”.

Ah, but "canonical" is a very freighted word in Unicode parlance. There are
58 instances of the word "canonical" in the current version of UAX #44,
Unicode Character Database. Every one of them is a term of art, and none of
them means what you mean there. ;-)

What are actually in PropertyValueAliases.txt are "preferred aliases" (one
"abbreviated", and one "long"), plus a few "other aliases" for various
compatibility reasons.

UAX #42 follows suit. The block property is represented by the blk
attribute, and the enumerated values of the blk attribute:

http://www.unicode.org/reports/tr42/#w1aac13c13c19b1

use the *abbreviated* "preferred aliases" from PropertyValueAliases.txt.

>
>> For enumerated properties, and especially for catalog properties such as Block and Script,
>> the value of the property may be multi-word, and the best form to use in one context might
>> not be exactly (as in binary string equality exact) the same as in another.
> That makes sense, but shouldn’t it be consistent throughout the Unicode database text files?

Well, let's take an example. The entry in Blocks.txt for the Arabic
Presentation Forms-A block is:

FB50..FDFF; Arabic Presentation Forms-A

The entry for that block in PropertyValueAliases.txt is:

blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; Arabic_Presentation_Forms-A

So then which would it be? Should Blocks.txt be changed to the long
preferred alias:

FB50..FDFF; Arabic_Presentation_Forms_A

or to the abbreviated preferred alias:

FB50..FDFF; Arabic_PF_A

which would be more consistent with the XML attribute and with most regex
usage? If the latter, you would end up with systematically less
identifiable labels in Blocks.txt, which would make it a bit more obscure
for other uses, and which would also then create ambiguities about what
might be the "best" or "preferred" label for blocks for an API returning a
block name -- which certainly wouldn't be the abbreviated "preferred alias".
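
To illustrate how a script can bridge the two spellings without any change
to Blocks.txt, here is a minimal Python sketch of a simplified loose match
in the spirit of rule UAX44-LM3 in UAX #44 (ignore case, whitespace,
underscores and hyphens). The function name loose_key is hypothetical, and
the sketch deliberately leaves out LM3's handling of an initial "is"
prefix, so treat it as an approximation rather than a full implementation.

```python
import re


def loose_key(value: str) -> str:
    """Fold a property value alias for loose comparison.

    Simplified from UAX44-LM3: drops whitespace, underscores and hyphens,
    then case-folds. The "is" prefix rule is intentionally not handled.
    """
    return re.sub(r"[\s_-]", "", value).casefold()


blocks_txt_name = "Arabic Presentation Forms-A"   # as in Blocks.txt
long_alias = "Arabic_Presentation_Forms_A"        # long preferred alias
other_alias = "Arabic_Presentation_Forms-A"       # other alias
short_alias = "Arabic_PF_A"                       # abbreviated preferred alias

# The Blocks.txt name and the long/other aliases fold to the same key.
same = loose_key(blocks_txt_name) == loose_key(long_alias) == loose_key(other_alias)

# The abbreviated alias is a different string and needs a lookup table.
short_differs = loose_key(short_alias) != loose_key(long_alias)
```

Under this folding, the Blocks.txt name matches the long alias directly,
while the abbreviated alias can only be reached by looking it up in
PropertyValueAliases.txt.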

I suppose a proposal to the UTC to further modify the UCD handling of block
names could change this situation. But I'm not convinced that we shouldn't
just leave things as they stand -- for stability. And then live with the
complications required for scripts or other parsing algorithms that
actually need to deal with Blocks.txt to either parse out block ranges (its
main function) or to get usable block names (its subsidiary function).
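
For the two functions of Blocks.txt just mentioned, a parser can stay quite
small. The following Python sketch assumes the "start..end; name" line
layout shown in the example above, with "#" starting a comment; the
function name parse_blocks and the sample lines are illustrative only.

```python
def parse_blocks(lines):
    """Return a list of (start, end, name) tuples from Blocks.txt-style lines."""
    blocks = []
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line:
            continue  # blank or comment-only line
        range_part, name = (field.strip() for field in line.split(";", 1))
        # Code point ranges are written as hexadecimal "start..end".
        start, end = (int(cp, 16) for cp in range_part.split(".."))
        blocks.append((start, end, name))
    return blocks


sample = [
    "# sample lines in the Blocks.txt layout",
    "",
    "0000..007F; Basic Latin",
    "FB50..FDFF; Arabic Presentation Forms-A",
]

blocks = parse_blocks(sample)
```

The names come back exactly as they appear in the file, so any mapping to
the preferred aliases remains a separate step for the caller.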

--Ken
Received on Thu May 26 2016 - 13:08:02 CDT
