Re: Is it safe to dig into comment contents of PropList.txt?

From: Philippe Verdy <verdy_p_at_wanadoo.fr>
Date: Wed, 6 Nov 2013 00:24:56 +0100

2013/11/5 Steffen Daode <sdaoden_at_gmail.com>

> Hello,
> ...i came to this solution in order to generate test data with
> awk(1) in a memory-friendly way?
>
> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
> for some properties from PropList.txt, say, _PRINT can't be set
> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
> order to know that i had to parse UnicodeData.txt and store
> character information in memory first, (not thinking about further
> options), but that requires a lot of memory, more than is
> available on low-end machines.)

TAB is "printable" (for the isprint() macro in standard C libraries) because
it has a whitespace property, even if its general category is very weakly
defined (it is kept for upward compatibility; the GC property alone is not
enough for most applications). TAB is handled, for example, by the word and
line breaking properties.
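
As a small illustration of the distinction above (a sketch, not part of the
original message), Python's standard unicodedata module reports the general
category of U+0009 as Cc, while str.isspace() still treats it as whitespace.
Note that str.isspace() is Python's own approximation and is not guaranteed to
match the White_Space property listed in PropList.txt; the helper name
describe() is hypothetical.

```python
import unicodedata

def describe(ch):
    """Return (general category, whitespace flag) for one character.

    The whitespace flag comes from str.isspace(), which is only an
    approximation of the White_Space property in PropList.txt.
    """
    return unicodedata.category(ch), ch.isspace()

# Compare TAB with an ordinary space and a letter.
for ch in ("\t", " ", "A"):
    cat, ws = describe(ch)
    print(f"U+{ord(ch):04X} gc={cat} whitespace={ws}")
```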

The character mapping for the isprint() macro is defined by an expression
based on existing Unicode properties. Most C libraries optimize this
expression using a fast compressed lookup table. The exceptions are legacy
libraries built only for EBCDIC or for 7-bit or 8-bit encodings based on
ISO 646 (including ASCII, ISO 8859, and national encodings from Russia,
Ukraine, India, Japan, Korea, China), where this may be a very weak test on
some 8-bit value ranges. VISCII needs a special exception, as it allocates
some printable characters needed for accented letters at code positions of
ISO 646 controls that are not needed and rarely used in plain text. The same
remark applies to old PC code pages, where additional symbols are mapped to
those positions and are found in old texts encoded for PC-DOS/MS-DOS.
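
The "compressed lookup table" idea can be sketched as a two-stage table:
code points are grouped into fixed-size blocks, identical blocks are stored
once, and a small index maps each block number to its shared copy. The code
below is only an illustration under stated assumptions. The predicate
is_print_slow() (everything outside the C* general categories) is a
placeholder, not the expression any particular C library uses, and the names
BLOCK, build_tables() and is_print_fast() are hypothetical.

```python
import unicodedata

BLOCK = 256  # code points per block; an arbitrary choice for this sketch

def is_print_slow(cp):
    """Placeholder rule (an assumption): any category not starting with 'C'."""
    return not unicodedata.category(chr(cp)).startswith("C")

def build_tables(predicate, limit):
    """Build a two-stage table: index[block number] -> position in blocks."""
    index, blocks, seen = [], [], {}
    for start in range(0, limit, BLOCK):
        # One byte (0 or 1) per code point in this block.
        bits = bytes(predicate(cp) for cp in range(start, start + BLOCK))
        if bits not in seen:          # store each distinct block only once
            seen[bits] = len(blocks)
            blocks.append(bits)
        index.append(seen[bits])
    return index, blocks

def is_print_fast(cp, index, blocks):
    """Two array lookups replace the property computation."""
    return bool(blocks[index[cp // BLOCK]][cp % BLOCK])

# Demo restricted to the BMP to keep the build quick.
index, blocks = build_tables(is_print_slow, 0x10000)
print(len(index), "blocks indexed,", len(blocks), "distinct blocks stored")
```

Large ranges with uniform properties (unassigned areas, CJK ideographs)
collapse to a handful of shared blocks, which is where the saving comes from;
a real implementation would typically pack bits rather than bytes.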
Received on Tue Nov 05 2013 - 17:27:14 CST