Time format characters 'h' and 'k'

Philippe Verdy via CLDR-Users cldr-users at unicode.org
Wed Aug 23 14:13:07 CDT 2017

2017-08-23 19:06 GMT+02:00 Doug Ewell via CLDR-Users <cldr-users at unicode.org>:

> Philippe Verdy wrote:
> > But the discussion was not out of topic: Mark Davis did not justify
> > really why he would have prefered the surrogates at end of the BMP.
> He said it was a historical anomaly.

And he does not say why, though that is the most interesting part. Calling
something an "anomaly" raises exactly that question; if no one can explain
it, then there was no "anomaly" at all, and the parenthetical aside is
likewise irrelevant (for this subject line). But it is still an interesting
question: are UTF-16, BOMs and surrogates really useful as a normative part
of the standard, rather than as a technical annex kept only for historical
reference because of their use in the Windows API, or because of the
indirect reference from CESU-8 (which is also not part of the standard, but
is kept as another possibility for handling 16-bit code units without the
byte-order problem)?
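To make the byte-order point concrete, here is a minimal sketch (in Python,
with a hypothetical function name, not any standard API) of what the plain
"UTF-16" scheme forces every consumer to do: inspect a possible BOM to
recover the byte order, falling back to big-endian when none is present:

```python
def decode_utf16_with_bom(data: bytes) -> str:
    """Decode UTF-16 bytes, using a leading BOM to pick the byte order.

    Without a BOM, big-endian is assumed (the Unicode default for the
    plain "UTF-16" encoding scheme). UTF-16BE/UTF-16LE avoid this
    detection step entirely, since their byte order is fixed by name.
    """
    if data.startswith(b"\xff\xfe"):          # BOM FF FE: little-endian
        return data[2:].decode("utf-16-le")
    if data.startswith(b"\xfe\xff"):          # BOM FE FF: big-endian
        return data[2:].decode("utf-16-be")
    return data.decode("utf-16-be")           # no BOM: default big-endian

# "ABC" serialized little-endian, preceded by the BOM FF FE:
print(decode_utf16_with_bom(b"\xff\xfeA\x00B\x00C\x00"))  # prints "ABC"
```

This is exactly the extra state that in-band BOMs impose on every reader,
where out-of-band metadata (a filesystem attribute, an HTTP header) would
have carried the same information without touching the text itself.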

As a parenthetical aside, I can use the same argument: the disunification
of ZWNBSP (U+FEFF, from which the word joiner was split off) was also a
historical anomaly, caused only by its predominant use as a BOM in Windows
"Notepad.exe" to support UTF-16-encoded texts instead of UTF-16BE or
UTF-16LE, two other related and unneeded encoding schemes that are probably
even worse than CESU-8, which at least does not need these damned BOMs that
have polluted the usage of UTF-8. Windows clearly did not even need these
BOMs: it could have stored the text encoding in NTFS or ReFS metadata (e.g.
in a tiny alternate data stream, which would use no extra cluster space
because it fits directly in directory entries, just as Windows already uses
an ADS to mark files downloaded from a third-party Internet domain). For
FAT32 and exFAT there are also solutions using conventional metadata
folders (the scheme MacOS used for storing its structured "resource forks",
with similar goals and capabilities, and that web servers use for HTTP
metadata via an additional database or index file); on OS/2 and VMS there
were "extended attributes" with the same goal.

In my opinion there are still more (and better) ways to remap code points
bijectively to 16-bit code units, and UTF-16 is just one of them. (It is
not even strictly bijective: not every sequence of 16-bit code units maps
back to a code point, which causes additional problems for the reverse
conversion.) It is not the best choice for all uses, notably those that
would rather not check for exceptions everywhere (in particular unpaired
surrogates, except at the end of a stream that was truncated prematurely).
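The exception-checking burden described above can be sketched as follows (a
Python illustration with hypothetical helper names, not any standard API):
the forward mapping to 16-bit code units is simple, but the reverse
conversion must reject unpaired surrogates, because not every sequence of
code units is a valid image of a code point sequence:

```python
def to_utf16_units(cp: int) -> list[int]:
    """Map one code point to its UTF-16 code unit(s)."""
    if cp < 0x10000:
        return [cp]
    cp -= 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (cp >> 10), 0xDC00 + (cp & 0x3FF)]

def from_utf16_units(units: list[int]) -> list[int]:
    """Reverse conversion; raises ValueError on unpaired surrogates."""
    out, i = [], 0
    while i < len(units):
        u = units[i]
        if 0xD800 <= u <= 0xDBFF:            # high surrogate: needs a pair
            if i + 1 >= len(units) or not (0xDC00 <= units[i + 1] <= 0xDFFF):
                raise ValueError(f"unpaired high surrogate at index {i}")
            out.append(0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00))
            i += 2
        elif 0xDC00 <= u <= 0xDFFF:          # stray low surrogate
            raise ValueError(f"unpaired low surrogate at index {i}")
        else:
            out.append(u)
            i += 1
    return out

# U+1F600 round-trips through the surrogate pair D83D DE00:
assert to_utf16_units(0x1F600) == [0xD83D, 0xDE00]
assert from_utf16_units([0xD83D, 0xDE00]) == [0x1F600]
```

The validity checks in the `while` loop are precisely the "exceptions
everywhere" the paragraph above complains about: a strictly bijective
16-bit remapping would not need them.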