mapping characters with visual similarities

From: Chris Weber (Casaba Security) (chris@casabasecurity.com)
Date: Fri Mar 07 2008 - 16:30:30 CST

Next message: Kenneth Whistler: "Re: mapping characters with visual similarities"

Previous message: Rick McGowan: "Update to UAX #29 now available"
Next in thread: Kenneth Whistler: "Re: mapping characters with visual similarities"
Maybe reply: Kenneth Whistler: "Re: mapping characters with visual similarities"
Reply: Mark Davis: "Re: mapping characters with visual similarities"

Hi group, I thought this might be the right place to ask this question, and
I apologize if it has been answered in the past.

How can I check input against a large blacklist of words when all Unicode
blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario
would be a web application written in .NET that supports UTF-8. It consumes a
string of input, then compares the string against a wordlist of disallowed, or
blacklisted, words.

Background:
A sample of the blacklist includes profanity and trademark names like:
- microsoft
- wal-mart
- apple

The word 'microsoft' in its UCN form would be:
\u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
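
As a small illustration, the UCN form above can be reproduced with a few
lines of Python. The helper name to_ucn is made up for this sketch and is not
part of any library.

```python
def to_ucn(text: str) -> str:
    # Render each character as a universal character name:
    # four hex digits for BMP characters, eight for anything above U+FFFF.
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            parts.append("\\u{:04X}".format(cp))
        else:
            parts.append("\\U{:08X}".format(cp))
    return "".join(parts)


print(to_ucn("microsoft"))
```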

The core of the problem seems to be that any one of these letters can be
visually represented by the glyph of a different code point. For example, here
are just some of the ways the letter 'm' can be visually represented:

м \u043C
М \u041C
Ｍ \uFF2D
ｍ \uFF4D
ʍ \u028D
Μ \u039C
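
To see why ordinary normalization does not solve this on its own, here is a
rough Python sketch that prints the Unicode name of each code point listed
above and what it becomes after NFKC normalization plus case folding. The
helper name fold is made up for this sketch. Only the two fullwidth forms
collapse to a plain 'm'; the Cyrillic, Greek and IPA letters remain distinct
characters.

```python
import unicodedata

# The six look-alikes for 'm' listed above.
variants = ["\u043C", "\u041C", "\uFF2D", "\uFF4D", "\u028D", "\u039C"]


def fold(ch: str) -> str:
    # Compatibility normalization followed by case folding.
    return unicodedata.normalize("NFKC", ch).casefold()


for ch in variants:
    print("U+{:04X} {} -> {!r}".format(ord(ch), unicodedata.name(ch), fold(ch)))
```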

So the ideal solution might map every letter against all possible visual
representations of that letter. I know that's really tricky business, as even
something like 'fn' might look like an 'm' in some fonts, and two v's 'vv'
could look like a 'w'. Fonts play a part in this of course, and the problem
starts to look unsolvable.
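
One way to picture the mapping idea is the Python sketch below. It is only a
sketch under stated assumptions: the CONFUSABLES table is a tiny hand-picked
sample rather than a complete list, the names skeleton and is_blacklisted are
made up, and it handles single characters only, so sequences such as 'vv' for
'w' and font-dependent effects are not covered. A real table would have to be
far larger, for example one built from the confusables data published with
Unicode Technical Standard #39.

```python
import unicodedata

# Hypothetical, hand-picked sample; NOT a complete confusables table.
CONFUSABLES = {
    "\u043C": "m",  # CYRILLIC SMALL LETTER EM
    "\u041C": "M",  # CYRILLIC CAPITAL LETTER EM
    "\u039C": "M",  # GREEK CAPITAL LETTER MU
    "\u028D": "m",  # LATIN SMALL LETTER TURNED W
    "\u0456": "i",  # CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I
    "\u0441": "c",  # CYRILLIC SMALL LETTER ES
    "\u043E": "o",  # CYRILLIC SMALL LETTER O
    "\u0455": "s",  # CYRILLIC SMALL LETTER DZE
}

BLACKLIST = {"microsoft", "wal-mart", "apple"}


def skeleton(text: str) -> str:
    # 1. NFKC folds compatibility forms such as fullwidth Latin.
    normalized = unicodedata.normalize("NFKC", text)
    # 2. Replace known look-alikes with the letter they resemble.
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in normalized)
    # 3. Case-fold so the comparison ignores case.
    return mapped.casefold()


BLACKLIST_SKELETONS = {skeleton(word) for word in BLACKLIST}


def is_blacklisted(word: str) -> bool:
    # Whole-word comparison of skeletons, not a substring search.
    return skeleton(word) in BLACKLIST_SKELETONS


# Cyrillic capital EM, Cyrillic i and Cyrillic es, followed by Latin letters.
print(is_blacklisted("\u041C\u0456\u0441rosoft"))
# Fullwidth 'm' followed by Latin letters.
print(is_blacklisted("\uFF4Dicrosoft"))
```

Comparing skeletons rather than raw strings means the blacklist itself stays
in plain form and only the mapping table grows as new look-alikes turn up.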


This archive was generated by hypermail 2.1.5 : Fri Mar 07 2008 - 16:39:13 CST