Re: Canonical block names: spaces vs. underscores

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Thu, 26 May 2016 20:44:55 +0200

2016-05-26 20:07 GMT+02:00 Ken Whistler <kenwhistler_at_att.net>:

> Well, let's take an example. The entry in Blocks.txt for the Arabic
> Presentation Forms-A block is:
>
> FB50..FDFF; Arabic Presentation Forms-A
>
> The entry for that block in PropertyValueAliases.txt is:
>
> blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ;
> Arabic_Presentation_Forms-A
>
> So then which would it be? Should Blocks.txt be changed to the long
> preferred alias:
>
> FB50..FDFF; Arabic_Presentation_Forms_A
>
> or to the abbreviated preferred alias:
>
> FB50..FDFF; Arabic_PF_A
>

I think this would break parsers that expect the name used in Blocks.txt
to be directly "readable", with spaces. My opinion is to keep Blocks.txt
untouched (with spaces): it has been part of the core standard for a long
time, in sync with the ISO standard, and it holds the *normative* block
name.

But we could add this normative value (with spaces) to
PropertyValueAliases.txt (a file that ISO 10646 does not have or need in
its standard):

blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ;
Arabic_Presentation_Forms-A ; Arabic Presentation Forms-A

The other solution would be to *add* the abbreviated preferred alias to
Blocks.txt:

FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A

But this could break existing Blocks.txt parsers, whereas parsers should
not fail when they find new aliases in PropertyValueAliases.txt.
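
To illustrate the parser concern, here is a minimal sketch of a Blocks.txt line parser that tolerates extra fields instead of requiring exactly two. The function name `parse_blocks_line` and the idea of returning the extra fields as a list are assumptions made for this illustration; the three-field line is the hypothetical format proposed above, not an existing one.

```python
def parse_blocks_line(line):
    """Parse one Blocks.txt-style line into (first, last, name, extra_fields).

    Returns None for blank or comment-only lines. Any fields after the
    block name are kept in a list rather than treated as an error.
    """
    # Drop a trailing comment, then surrounding whitespace.
    line = line.split("#", 1)[0].strip()
    if not line:
        return None
    fields = [field.strip() for field in line.split(";")]
    if len(fields) < 2:
        raise ValueError("expected at least 'range; name': %r" % line)
    first, last = fields[0].split("..")
    return int(first, 16), int(last, 16), fields[1], fields[2:]


# Current format: two fields.
current = parse_blocks_line("FB50..FDFF; Arabic Presentation Forms-A")
# Hypothetical extended format: a third field carrying the short alias.
extended = parse_blocks_line(
    "FB50..FDFF; Arabic Presentation Forms-A ; Arabic_PF_A")
```

A parser that unpacks the result of `line.split(";")` into exactly two variables would raise an error on the extended line, which is the breakage described above.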

Another solution would be to explain properly that, to look up values in
PropertyValueAliases.txt, you can search by replacing spaces in block names
with underscores, or to make sure that underscores and spaces in the
*middle* of values are considered equivalent (so that even if underscores
are rendered visually, we can also display the listed aliases using spaces
instead of underscores).
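
The lookup described here can be sketched as follows. This is only an illustration: the helper names (`middle_spaces_to_underscores`, `build_alias_index`, `lookup_block`) are hypothetical, and the sample data is limited to the single entry quoted in this thread, joined onto one line.

```python
import re

# The PropertyValueAliases.txt entry quoted above, on a single line.
SAMPLE = ("blk; Arabic_PF_A ; Arabic_Presentation_Forms_A ; "
          "Arabic_Presentation_Forms-A")


def middle_spaces_to_underscores(name):
    """Trim the name, then turn each run of inner whitespace into one '_'."""
    return re.sub(r"\s+", "_", name.strip())


def build_alias_index(lines):
    """Map every listed block alias to the short alias of its entry."""
    index = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = [field.strip() for field in line.split(";")]
        if fields[0] != "blk" or len(fields) < 3:
            continue
        short_alias = fields[1]
        for alias in fields[1:]:
            if alias:
                index[middle_spaces_to_underscores(alias)] = short_alias
    return index


def lookup_block(index, block_name):
    """Look up a Blocks.txt name (with spaces); comparison is case-sensitive."""
    return index.get(middle_spaces_to_underscores(block_name))


index = build_alias_index([SAMPLE])
print(lookup_block(index, "Arabic Presentation Forms-A"))  # Arabic_PF_A
```

The name from Blocks.txt is found here only because the entry already lists "Arabic_Presentation_Forms-A", which differs from it by underscores alone.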

However, it must be clear that these aliases are case-sensitive by default
("Arabic_Presentation_Forms_A" is not the same as
"Arabic_presentation_forms_A" but is the same as
"Arabic Presentation_Forms A"), unless the block name property is
normatively declared case-insensitive (in that case the following are also
aliases: "arabic_pf_a", "arabic pf a"). But adding case insensitivity has a
cost, which is much higher than *only* allowing basic replacements of
spaces and underscores. Such replacement will work provided that there are
no "special" aliases starting with underscores or using pairs of
underscores: I doubt ISO will use pairs of spaces in block names, which are
supposed to be trimmed, with whitespace in the middle compressed.
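
A small sketch of the two comparison policies, using the examples above. The function `same_alias` and its `ignore_case` flag are hypothetical names for this illustration; the one-for-one replacement of spaces assumes, as stated, that no alias starts with an underscore or contains a pair of underscores.

```python
def same_alias(a, b, ignore_case=False):
    """Compare two block aliases, treating ' ' and '_' as equivalent.

    Case is significant unless ignore_case is set.
    """
    a = a.replace(" ", "_")
    b = b.replace(" ", "_")
    if ignore_case:
        a, b = a.casefold(), b.casefold()
    return a == b


# Spaces and underscores are interchangeable, case is not.
print(same_alias("Arabic_Presentation_Forms_A",
                 "Arabic Presentation_Forms A"))   # True
print(same_alias("Arabic_Presentation_Forms_A",
                 "Arabic_presentation_forms_A"))   # False
# Only a normatively case-insensitive property would accept this one.
print(same_alias("Arabic_PF_A", "arabic pf a", ignore_case=True))  # True
```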

Removing or replacing the space-separated words in block names in the UCD
would break compatibility and synchronization with the ISO standard,
which lists them with spaces.
Received on Thu May 26 2016 - 13:45:42 CDT
