I have created a tool in Python to extract and transform the information in
the UNIHAN database. It’s open source (MIT-licensed) and offers users customized
outputs. It’s documented extensively at https://unihan-etl.git-pull.com. In
addition, the project’s source code can be found at
https://github.com/cihai/unihan-etl.
I split this tool off into its own project because of the time and effort it
took to study the fields and extract the information correctly. The hope is
that one day a traveller going down the same path will find it useful.
It has been mentioned before on this list at least once, back in 2004:
http://unicode.org/mail-arch/unicode-ml/y2004-m04/0255.html
> I'm trying to pare Unihan.txt down to a less unwieldy size for my own use
> by eliminating properties that are of no interest to me and would like to
> be certain that eliminating the four properties containing the actual
> values for those dictionaries can be done safely because the information
> can be reconstituted if necessary from the kIRG* properties since I'm not
> certain if those properties are of interest to me.
There are developers who may only want to extract a pre-determined set of
fields.
$ pip install --user unihan-etl
Then export the values to a CSV file (UNIHAN is downloaded automatically):
$ unihan-etl
Only pull custom fields (once downloaded, Unihan.zip is cached for reuse):
$ unihan-etl -f kMandarin kNelson kMorohashi
That pulls out only those fields. Let’s get structured output in JSON
(empty values are pruned automatically):
$ unihan-etl -f kMandarin kNelson kMorohashi -F json
With pyyaml installed, you can also use -F yaml:
$ pip install pyyaml
$ unihan-etl -f kMandarin kNelson kMorohashi -F yaml
To see all the command line options:
http://unihan-etl.git-pull.com/en/latest/cli.html
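Once a JSON export exists, it can be consumed with the standard library
alone. The sketch below is only an illustration: it assumes each record is an
object with a "char" key plus whichever fields were requested, and that
values are plain strings. The inline sample records and the helper name
index_by_field are hypothetical, so check the real export layout against the
documentation before relying on it.

```python
import json

# Illustrative records only, not real tool output. The layout (a list of
# objects keyed by "char", "ucn" and field names) is an assumption.
SAMPLE_EXPORT = """
[
  {"char": "一", "ucn": "U+4E00", "kMandarin": "yī"},
  {"char": "丁", "ucn": "U+4E01", "kMandarin": "dīng"},
  {"char": "丂", "ucn": "U+4E02"}
]
"""


def index_by_field(records, field):
    """Map each character to its value for `field`.

    Records that lack the field (pruned empty values) are skipped.
    """
    return {rec["char"]: rec[field] for rec in records if field in rec}


records = json.loads(SAMPLE_EXPORT)
mandarin = index_by_field(records, "kMandarin")
print(mandarin)
```

To read a real export instead, replace json.loads(SAMPLE_EXPORT) with
json.load() on the opened output file.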
Container format: To keep the data exports as portable as possible, they
follow the Data Packages standard (
http://frictionlessdata.io/data-packages/). This is a trickier data set
since the fields pack quite a bit of detail into them. Other data sets such as
CEDict will also be made available as data packages.
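As a rough illustration of the container format, a Data Package pairs the
data file with a JSON descriptor listing its resources and their fields. The
sketch below builds a minimal descriptor for a single CSV resource. The
package name, file path, field list, and the helper make_descriptor are
assumptions made for the example, not the descriptor the project actually
ships.

```python
import json


def make_descriptor(name, csv_path, fields):
    """Build a minimal Data Package descriptor for one CSV resource.

    Every field is declared as a string here, which is a simplification.
    """
    return {
        "name": name,
        "resources": [
            {
                "name": name,
                "path": csv_path,
                "format": "csv",
                "schema": {
                    "fields": [{"name": f, "type": "string"} for f in fields]
                },
            }
        ],
    }


# Hypothetical package name, path and field selection.
descriptor = make_descriptor("unihan", "unihan.csv", ["char", "ucn", "kMandarin"])
print(json.dumps(descriptor, indent=2))
```

The printed JSON would be saved as datapackage.json next to the CSV file.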
Backstory: I am trying to create a spiritual successor to cjklib (
https://pypi.python.org/pypi/cjklib). The project aims to pull in CJK
datasets and make them accessible under one library. Datasets are also
going to be available a la carte via a consistent data standard (Data
Packages). I am opting to use the UNIHAN database as the core of the CJK data
sources. The project’s homepage is https://cihai.git-pull.com.
Received on Tue May 30 2017 - 10:22:20 CDT