From: Stephane Bortzmeyer (bortzmeyer@nic.fr)
Date: Fri Sep 07 2007 - 09:47:56 CDT
On Fri, Jul 06, 2007 at 11:32:36AM +0200,
Stephane Bortzmeyer <bortzmeyer@nic.fr> wrote
a message of 11 lines which said:
> For various studies of the Unicode database, I prefer to work with a
> SQL version.
Several suggestions have been made.
1) Some people suggested simply loading UnicodeData.txt into a DBMS
(most DBMSs can load a CSV or CSV-like file easily). This is not a
good solution, because of the data kept in other files (such as the
Han properties) and because of the character ranges, which
UnicodeData.txt lists only by their first and last code points.
2) Some people suggested waiting for the XML version of the UCD (which
is now in beta test, see http://www.unicode.org/review/pr-109.html).
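To illustrate the character-range problem mentioned in point 1: UnicodeData.txt describes some large blocks with only two lines, one whose name ends in ", First>" and one whose name ends in ", Last>". A plain CSV import would store those as two ordinary characters. The following is a minimal Python sketch of how a loader might collapse such pairs into ranges. The function name `parse_entries` and the tuple layout are assumptions made for this example, not taken from the attached program, and the three sample lines merely follow the UnicodeData.txt field layout.

```python
# Sample lines in the UnicodeData.txt format (semicolon-separated fields).
SAMPLE = """\
0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;
AC00;<Hangul Syllable, First>;Lo;0;L;;;;;N;;;;;
D7A3;<Hangul Syllable, Last>;Lo;0;L;;;;;N;;;;;
"""

FIRST_SUFFIX = ", First>"
LAST_SUFFIX = ", Last>"


def parse_entries(text):
    """Yield (first, last, name, category) tuples.

    Hypothetical helper: a single character gives first == last,
    while a First/Last pair is collapsed into one range entry.
    """
    pending = None  # (first codepoint, range label) while inside a range
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split(";")
        codepoint = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.startswith("<") and name.endswith(FIRST_SUFFIX):
            pending = (codepoint, name[1:-len(FIRST_SUFFIX)])
        elif name.startswith("<") and name.endswith(LAST_SUFFIX):
            if pending is None:
                raise ValueError("range end without range start: " + line)
            first, label = pending
            pending = None
            yield (first, codepoint, label, category)
        else:
            yield (codepoint, codepoint, name, category)


entries = list(parse_entries(SAMPLE))
for first, last, name, category in entries:
    print("U+%04X..U+%04X %s (%s)" % (first, last, name, category))
```

Storing a range as one row with a first and a last code point keeps the table small; expanding every code point into its own row is the other obvious choice and makes joins simpler at the cost of many more rows.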
So I wrote my own (very incomplete) solution. It is a simple program
(257 lines, which is far from enough to handle everything in the UCD,
a rich and complicated database). It was more complicated than
foreseen because the UCD is complex and the structure of its text
files is not always easy to handle. But it works for my purposes; I
can now write things like:
SELECT To_U(Characters.codepoint) AS UCodepoint, name, definition
   FROM Characters, Han_Properties
   WHERE Characters.codepoint = Han_Properties.codepoint AND
         definition ILIKE '%turtle%';
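For readers without the attached schema at hand, here is a small, self-contained Python sketch of the same kind of query, run against an in-memory SQLite database instead of PostgreSQL. Everything about the schema is an assumption: the two tables contain only the columns used in the query above, `To_U` is assumed to format a code point as "U+XXXX", and the two sample rows were typed in by hand for illustration rather than loaded from the UCD files. SQLite has no ILIKE, so the sketch relies on its LIKE operator, which is case-insensitive for ASCII by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Assumed behaviour of To_U: render an integer code point as "U+XXXX".
conn.create_function("To_U", 1, lambda cp: "U+%04X" % cp)

# Hypothetical minimal schema: only the columns the query needs.
conn.executescript("""
CREATE TABLE Characters (
    codepoint INTEGER PRIMARY KEY,
    name      TEXT NOT NULL
);
CREATE TABLE Han_Properties (
    codepoint  INTEGER PRIMARY KEY REFERENCES Characters(codepoint),
    definition TEXT
);
""")

# Illustrative sample rows, entered by hand (not read from the UCD).
conn.executemany(
    "INSERT INTO Characters (codepoint, name) VALUES (?, ?)",
    [
        (0x4E00, "CJK UNIFIED IDEOGRAPH-4E00"),
        (0x9F9C, "CJK UNIFIED IDEOGRAPH-9F9C"),
    ],
)
conn.executemany(
    "INSERT INTO Han_Properties (codepoint, definition) VALUES (?, ?)",
    [
        (0x4E00, "one; a, an; alone"),
        (0x9F9C, "turtle or tortoise; cracked"),
    ],
)

# Same join as in the message, with LIKE standing in for ILIKE.
rows = conn.execute("""
    SELECT To_U(Characters.codepoint) AS UCodepoint, name, definition
    FROM Characters
    JOIN Han_Properties ON Characters.codepoint = Han_Properties.codepoint
    WHERE definition LIKE '%turtle%'
""").fetchall()

for row in rows:
    print(row)
```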
I have attached, in case some people find them useful, the SQL
schema (tested on PostgreSQL; remember that very few real-world SQL
files are portable) and the program, written in Lua
(http://www.lua.org/). Feel free to use them as you want.