Re: Is it save to dig into comment contents of PropList.txt? from Markus Scherer on 2013-11-05 (Unicode Mail List Archive)

From: Markus Scherer <markus.icu_at_gmail.com>
Date: Tue, 5 Nov 2013 08:10:15 -0800

On Tue, Nov 5, 2013 at 5:38 AM, Steffen Daode <sdaoden_at_gmail.com> wrote:

> Hello,
> ...i came to this solution in order to generate test data with
> awk(1) in a memory-friendly way?
>

Comments like at the end of this line?

0009..000D ; White_space # Cc [5] <control>..<control>

> (The problem i'm facing is that _PRINT and _GRAPH cannot be set
> for some properties from PropList.txt, say, _PRINT can't be set
> for U+0009, CHARACTER TABULATION (ht), since it's a Cc, but in
> order to know that i had to parse UnicodeData.txt and store
> character information in memory first, (not thinking about further
> options), but that requires a lot of memory, more than is
> available on low-end machines.)
>

The comments are just that, comments, for human consumption, and their
format may change without notice. One exception is the syntax in the
@missing lines.
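
To illustrate reading only the data fields, here is a minimal Python sketch.
It assumes the "range ; property # comment" layout of the sample line above;
the function name parse_prop_line is hypothetical. Everything after "#" is
discarded, so @missing lines (which themselves start with "#") are skipped
here and would need separate handling.

```python
def parse_prop_line(line):
    """Return (first, last, prop) for a data line, or None for blank/comment-only lines."""
    # Drop the comment: only the part before '#' is treated as data.
    data = line.split("#", 1)[0].strip()
    if not data:
        return None
    fields = [f.strip() for f in data.split(";")]
    if len(fields) != 2:
        # Assumption: PropList.txt-style lines carry exactly two fields.
        raise ValueError(f"unexpected field count in line: {line!r}")
    rng, prop = fields
    first, sep, last = rng.partition("..")
    start = int(first, 16)               # raises ValueError on non-hex input
    end = int(last, 16) if sep else start
    if end < start:
        raise ValueError(f"descending range in line: {line!r}")
    return start, end, prop
```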

It is normal that you need to parse multiple Unicode data files for
extracting useful data.

It also does not require "a lot of memory" considering how much memory is
available even on ten-year-old clunkers at this point, unless you are
especially extravagant with how you store the data. Besides, after parsing,
you would normally build more compact data structures for the data you need.
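
One possible compact structure, as a sketch only: collapse the
General_Category field of UnicodeData.txt into sorted runs and look code
points up by binary search. The function names are hypothetical. The sketch
assumes the file is sorted by code point, that large ranges are given as
"<..., First>" / "<..., Last>" line pairs, and that code points not listed
are unassigned (Cn).

```python
import bisect

def build_category_runs(lines):
    """Collapse UnicodeData.txt-style lines into sorted runs of equal General_Category."""
    starts, cats = [], []   # parallel lists: first code point of a run, its category
    next_cp = 0             # first code point not yet covered by a run
    range_first = None      # start of a pending "<..., First>" range

    def add_run(start, cat):
        # Only open a new run when the category actually changes.
        if not cats or cats[-1] != cat:
            starts.append(start)
            cats.append(cat)

    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        if len(fields) < 3:
            raise ValueError(f"too few fields: {line!r}")
        cp, name, cat = int(fields[0], 16), fields[1], fields[2]
        if name.endswith(", First>"):
            range_first = cp
            continue
        first = cp
        if name.endswith(", Last>"):
            if range_first is None:
                raise ValueError(f"'Last' without 'First': {line!r}")
            first, range_first = range_first, None
        if first < next_cp:
            raise ValueError(f"input not sorted at: {line!r}")
        if first > next_cp:
            add_run(next_cp, "Cn")   # gap in the file: treated as unassigned
        add_run(first, cat)
        next_cp = cp + 1
    add_run(next_cp, "Cn")           # everything after the last listed code point
    return starts, cats

def category(starts, cats, cp):
    """Look up the category of cp in the run table built above."""
    return cats[bisect.bisect_right(starts, cp) - 1]
```

Only the two parallel lists need to stay in memory after parsing, one entry
per run rather than one record per code point.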

Having said that, if your parsing works with the files you see and the data
you want to extract, then go for it. Just make sure that if the format
changes, you have enough checks in your parser so that it fails with an
error rather than silently producing garbage. You should also spot-check
that the data you get from the comments does indeed match the real data.
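
As a rough sketch of such checks in Python: the comment layout assumed here
("Gc [count] names" for ranges, "Gc name" for single code points) is taken
only from what the sample line looks like, which is exactly why anything
unexpected raises an error. The names category_from_comment and spot_check
are hypothetical, and the lookup argument stands for whatever authoritative
source is available (for example, data parsed from UnicodeData.txt).

```python
import re

# Assumed comment shape: two-character category, optional "[n]" count, then names.
COMMENT_RE = re.compile(r"(?P<gc>[A-Z][a-z&])\s+(?:\[(?P<count>\d+)\]\s+)?\S.*")

def category_from_comment(line):
    """Extract (first, last, category) from a data line's trailing comment, strictly."""
    data, sep, comment = line.partition("#")
    if not sep:
        raise ValueError(f"no comment on line: {line!r}")
    rng = data.split(";", 1)[0].strip()
    first, dots, last = rng.partition("..")
    start = int(first, 16)
    end = int(last, 16) if dots else start
    m = COMMENT_RE.fullmatch(comment.strip())
    if m is None:
        raise ValueError(f"comment format changed? {line!r}")
    count = m.group("count")
    if dots:
        # A range is expected to carry a count that matches its length.
        if count is None or int(count) != end - start + 1:
            raise ValueError(f"range/count mismatch: {line!r}")
    elif count is not None:
        raise ValueError(f"count on a single code point: {line!r}")
    return start, end, m.group("gc")

def spot_check(line, lookup):
    """Compare the comment's category with lookup(cp) at both ends of the range."""
    start, end, gc = category_from_comment(line)
    # "L&" in comments stands for a mix of cased letters (Lu, Ll, Lt).
    allowed = {"Lu", "Ll", "Lt"} if gc == "L&" else {gc}
    for cp in (start, end):
        real = lookup(cp)
        if real not in allowed:
            raise ValueError(f"U+{cp:04X}: comment says {gc}, data says {real}")
    return gc
```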

markus
Received on Tue Nov 05 2013 - 10:12:42 CST