mapping characters with visual similarities
From: Chris Weber (Casaba Security) (chris@casabasecurity.com)
Date: Fri Mar 07 2008 - 16:30:30 CST
Hi group, I thought this might be the right place to ask this question, and
I apologize if it has already been answered in the past.
How can I blacklist a large set of words from a wordlist when all Unicode
blocks are allowed (e.g. full-width Latin, Cyrillic, etc.)? The scenario
is a web application written in .NET that supports UTF-8. It consumes a
string of input, then compares the string against a wordlist of disallowed, or
blacklisted, words.
Background:
A sample of the blacklist includes profanity and trademark names like:
- microsoft
- wal-mart
- apple
The word 'microsoft' in its UCN (universal character name) form would be:
\u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
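As an illustration only, the escaped form above could be produced with a few lines of Python. The helper name to_ucn is hypothetical, and Python is used here purely as a sketch language rather than the .NET platform of the scenario.

```python
def to_ucn(text):
    """Return text with every character written as a \\uXXXX or \\UXXXXXXXX escape."""
    parts = []
    for ch in text:
        cp = ord(ch)
        # Code points beyond the Basic Multilingual Plane need the 8-digit form.
        parts.append(f"\\u{cp:04X}" if cp <= 0xFFFF else f"\\U{cp:08X}")
    return "".join(parts)


print(to_ucn("microsoft"))
# prints \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
```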
The core of the problem seems to be that any one of these letters can be
represented visually (as a glyph) by another code point. For example, here
are just some of the different ways the letter 'm' can be rendered:
м \u043C
М \u041C
Ｍ \uFF2D
ｍ \uFF4D
ʍ \u028D
Μ \u039C
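To see how far standard normalization gets with the list above, here is a small Python sketch. It assumes NFKC compatibility normalization followed by case folding as the first step; the helper name fold is hypothetical. Under that assumption the two full-width forms collapse to a plain 'm', while the Cyrillic, Greek, and IPA letters remain distinct characters.

```python
import unicodedata

# The look-alikes for 'm' listed above, written as escapes.
VARIANTS = ["\u043C", "\u041C", "\uFF2D", "\uFF4D", "\u028D", "\u039C"]


def fold(ch):
    """Apply NFKC compatibility normalization, then case folding."""
    return unicodedata.normalize("NFKC", ch).casefold()


for ch in VARIANTS:
    folded = fold(ch)
    codes = " ".join(f"U+{ord(c):04X}" for c in folded)
    verdict = "plain m" if folded == "m" else "still distinct"
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)} -> {codes} ({verdict})")
```

Only U+FF2D and U+FF4D come out as a plain 'm' here, so normalization alone does not cover cross-script look-alikes.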
So the ideal solution might map every letter against all possible visual
representations of that letter. I know that's really tricky business, as
even something like 'fn' might look like an 'm' in some fonts, and two v's
'vv' could look like a 'w'. Fonts play a part in this of course, and the
problem starts to look unsolvable.
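A minimal Python sketch of that mapping idea follows, under loud assumptions: the CONFUSABLES table is hand-built and covers only the look-alikes for 'm' listed earlier, the names skeleton and is_blacklisted are hypothetical, and Python stands in for the .NET platform of the scenario. Each input is reduced to a canonical "skeleton" before it is compared against the wordlist.

```python
import unicodedata

# Hypothetical, hand-built table: only the look-alikes for 'm' from this
# message. A usable table would need entries for every letter.
CONFUSABLES = {
    "\u043C": "m",  # CYRILLIC SMALL LETTER EM (also the folded form of U+041C)
    "\u028D": "m",  # LATIN SMALL LETTER TURNED W
    "\u03BC": "m",  # GREEK SMALL LETTER MU (the folded form of U+039C)
}

BLACKLIST = {"microsoft", "wal-mart", "apple"}


def skeleton(text):
    """Reduce text to a canonical form for comparison."""
    # NFKC folds compatibility forms such as full-width Latin;
    # casefold removes case differences across scripts.
    folded = unicodedata.normalize("NFKC", text).casefold()
    # Replace each remaining look-alike with its Latin counterpart.
    return "".join(CONFUSABLES.get(ch, ch) for ch in folded)


def is_blacklisted(text):
    """Return True if the skeleton of text is on the blacklist."""
    return skeleton(text) in BLACKLIST


print(is_blacklisted("\u041Cicrosoft"))  # Cyrillic capital Em
print(is_blacklisted("\uFF2Dicrosoft"))  # full-width Latin capital M
print(is_blacklisted("rnicrosoft"))      # two-letter look-alike
```

The first two checks match, but the last one does not: a per-character table cannot catch sequences such as 'vv' for 'w', which is exactly the part of the problem that depends on fonts.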