From: Ben Dougall (bend@freenet.co.uk)
Date: Fri Jun 27 2003 - 10:44:15 EDT
i'm a bit confused. i thought this type of thing was already pretty
well covered by the various unicode resources? (though i guess there's
a good chance it isn't, if you're asking this question.)
this is the way i see it:
it's for you to decide which format you internally normalise to (i'm
not even sure if that's the right word): which specific *base format*
you decide to adhere to. (i'm talking about things like whether you
treat text in composed or decomposed form, for example.) it doesn't
matter which internal base format you choose, so long as you stick to
it and never try to compare two texts in different 'base formats'.
then on top of that you'd also need a way to make use of character
mappings, for when you get various versions of characters that amount
to the same meaning. there are different levels to that, and decisions
for you to make (no right or wrong) about the extent to which you
allow various characters to amount to the same one. that obviously
includes case mappings, for example.
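(a minimal sketch of what i mean, in python, assuming the standard
unicodedata module; the choice of NFD is just an example of a base
format, not a recommendation:)

    import unicodedata

    def to_base_form(text):
        # pick one base format and stick to it throughout: NFD
        # (decomposed) here, but NFC would do just as well so long
        # as every comparison goes through the same function
        text = unicodedata.normalize("NFD", text)
        # case mapping: fold to a caseless form for comparison
        return text.casefold()

    # both sides of any comparison get the same treatment
    assert to_base_form("Caf\u00e9") == to_base_form("Cafe\u0301")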
i don't see how language differences come into this. the japanese
no-space thing you mention: if someone types in a particular phrase in
japanese (therefore without spaces, if that is actually the case),
then the search will not try to use spaces, and the text being
searched won't be using spaces either, as it'll also be in japanese.
as for all that 'remove' and 'replace' part: you don't have to
transform the text, surely. you just have to set up rules (or filters)
within the code that say, for example, "one or any number of tabs plus
one or any number of spaces = 1 space". and if you apply those rules
*throughout*, to the text being searched and to the text strings that
are input and searched for, then all'll be cool (?) maybe.
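(again just a sketch, in python: one such rule written as a filter
that gets applied identically to both sides, without rewriting the
stored text itself:)

    import re

    def whitespace_rule(text):
        # one or any number of tabs and/or spaces = 1 space
        return re.sub(r"[ \t]+", " ", text)

    # apply the same rule to the searched text and to the query
    print(whitespace_rule("foo \t  bar"))   # -> "foo bar"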
> - replace all dashes with a standard ASCII hyphen-minus
like that part. i wouldn't replace or change any text in any way. i'd
just say in the code that any dash amounts to any other dash (where
'any dash' is what you mean by 'all dashes').
basically i wouldn't go about changing characters, just allowing each
one to represent an array of characters (including nothing/no
character, in some cases maybe).
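(an illustration in python of letting a dash represent an array of
dashes at match time, rather than changing the text; the dash list
here is just a sample, not the full unicode dash set:)

    import re

    DASHES = "\u002d\u2010\u2011\u2012\u2013\u2014\u2015"

    def query_to_pattern(query):
        # each dash in the query matches the whole dash class
        parts = ["[" + DASHES + "]" if ch in DASHES else re.escape(ch)
                 for ch in query]
        return re.compile("".join(parts))

    # a query typed with a plain hyphen still finds an em dash
    print(bool(query_to_pattern("full-text").search("full\u2014text")))  # True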
so it's 2 main basic things: convert to a base format throughout, and
set up rules / filters for characters. those rules will make heavy use
of data from unicode (is it the 'properties' data? the stuff for
character grouping and mappings), plus a bit more of your own, such as
saying that a variable-length run of any white space amounts to one
space, if you want things with variable amounts of space in them to
match, that is.
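(and yes, i believe that's the character properties data: the general
category gives you groupings like Pd for any dash and Zs for any space
separator. a small sketch in python, again via unicodedata:)

    import unicodedata

    def char_class(ch):
        cat = unicodedata.category(ch)
        if cat == "Pd":
            return "-"    # every dash amounts to the same dash
        if cat == "Zs":
            return " "    # every space separator likewise
        return ch

    print("".join(char_class(c) for c in "full\u2013text\u00a0search"))
    # -> "full-text search"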
On Friday, June 27, 2003, at 12:46 pm, Philippe Verdy wrote:
> In order to implement a plain-text search algorithm, in a
> language-neutral way that would still work with all scripts, I am
> looking for advice on how this can be done "safely" (notably for
> automated search engines), to allow searching for text matching some
> basic encoding styles.
>
> My first approach to the problem is to try to simplify the text into
> an indexable form that would unify "similar" characters.
> So I'd like to have comments about possible issues in modern languages
> if I perform the following "search canonicalization":
>
> - Decompose the string into NFKD (this will remove font-related
> information and isolate combining marks)
> - Remove all combining characters (with combining class > 0),
> including Hebrew and Arabic cantillation.
> (are there significant combining vowel signs that should be kept?)
> - apply case folding using the Unicode standard (to lowercase
> preferably)
> - possibly perform a closure of the above three transforms
> - remove all controls, excepting TAB, CR, LF, VT, FF
> - replace all dashes with a standard ASCII hyphen-minus
> - replace all spacing characters with an ASCII space
> - replace all other punctuation with spaces.
> - canonicalize the remaining spaces (no leading or trailing spaces,
> and all other sequences replaced with a single space).
> - (maybe) recompose Korean Hangul syllables?
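(for what it's worth, those steps sketch out roughly like this in
python, using the standard unicodedata module. the exact control and
punctuation choices are just my reading of the list, not a spec, and
i've left out the closure and hangul recomposition steps:)

    import unicodedata

    KEEP_CONTROLS = {"\t", "\r", "\n", "\x0b", "\x0c"}  # TAB CR LF VT FF

    def canonicalize(text):
        text = unicodedata.normalize("NFKD", text)
        out = []
        for ch in text:
            if unicodedata.combining(ch) > 0:
                continue                    # drop combining marks
            cat = unicodedata.category(ch)
            if cat == "Cc" and ch not in KEEP_CONTROLS:
                continue                    # drop other controls
            if cat == "Pd":
                out.append("-")             # any dash -> ASCII hyphen-minus
            elif cat == "Zs" or ch in KEEP_CONTROLS:
                out.append(" ")             # any spacing character -> space
            elif cat.startswith("P"):
                out.append(" ")             # other punctuation -> space
            else:
                out.append(ch)
        folded = "".join(out).casefold()    # unicode case folding
        return " ".join(folded.split())     # collapse and trim spaces

    print(canonicalize("  Fran\u00e7ais \u2014 \u201cSearch\u201d "))
    # -> "francais - search"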
>
> What are the possible caveats, notably for Japanese, Korean and
> Chinese, which traditionally do not use spaces?
>
> How can we improve the algorithm for searches in Thai without using a
> dictionary, so that word breaks could be more easily detected (and
> marked by inserting an ASCII space)?
>
> Should I insert a space when there's a change of script (for example
> in Japanese, between Hiragana, Katakana, Latin and Kanji ideographs)?
>
> Is there an existing and documented conversion table used in
> plain-text search engines?
>
> Is Unicode working on such a search-canonicalization algorithm?
>
> Thanks for the comments.
>
> -- Philippe.
>
>