From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Mar 07 2008 - 17:42:11 CST
Take a look at http://www.unicode.org/reports/tr36/ and then
http://www.unicode.org/reports/tr39/
Mark
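
For a concrete sense of what the reports suggest, a minimal sketch of a
skeleton-based wordlist check might look like the following. It assumes the
SpoofChecker class from a recent ICU4J, which implements the UTS #39
confusable mapping; the BlacklistDemo wrapper and the sample strings are
only illustrative, not anything defined by the reports themselves.

    import com.ibm.icu.text.SpoofChecker;
    import java.util.HashSet;
    import java.util.Set;

    public class BlacklistDemo {
        public static void main(String[] args) {
            // The default SpoofChecker loads the UTS #39 confusables data.
            SpoofChecker sc = new SpoofChecker.Builder().build();

            // Precompute the confusable "skeleton" of every blacklisted word.
            Set<String> blockedSkeletons = new HashSet<String>();
            for (String word : new String[] {"microsoft", "wal-mart", "apple"}) {
                blockedSkeletons.add(sc.getSkeleton(word));
            }

            // A disguised spelling: Cyrillic small es (U+0441) and small o
            // (U+043E) in place of the Latin 'c' and the second 'o'.
            String input = "mi\u0441ros\u043Eft";

            // Strings that fold to the same skeleton are confusable, so the
            // hash lookup catches the disguised spelling as well.
            boolean blocked = blockedSkeletons.contains(sc.getSkeleton(input));
            System.out.println(blocked ? "blocked" : "allowed");
        }
    }

The skeletons of the wordlist can be computed once and stored alongside it,
so each incoming string costs one getSkeleton() call and one hash lookup.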
On Fri, Mar 7, 2008 at 2:30 PM, Chris Weber (Casaba Security)
<chris@casabasecurity.com> wrote:
> Hi group, I thought this might be the right place to ask this question, and
> I apologize if this has been answered in the past.
>
> How can I blacklist a large set of words from a wordlist when all Unicode
> blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario is a
> web application written in .NET supporting UTF-8. It consumes a string of
> input, then compares the string against a wordlist of disallowed, or
> blacklisted, words.
>
> Background:
> A sample of the blacklist includes profanity and trademark names like:
> - microsoft
> - wal-mart
> - apple
>
> The word 'microsoft' in its escaped (UCN) form would be:
> \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
>
> The core of the problem seems to be that any one of these letters can be
> glyphically (visually) represented using another code point. For example,
> just look at some of the different ways the letter 'm' can be visually
> represented:
>
> м \u043C
> М \u041C
> M \uFF2D
> m \uFF4D
> ʍ \u028D
> Μ \u039C
>
> So the ideal solution might map every letter against all possible visual
> representations of that letter. I know that's really tricky business, as
> even something like 'rn' might look like an 'm' in some fonts, and two v's
> 'vv' could look like a 'w'. Fonts play a part in this of course, and the
> problem starts to look unsolvable.
>
>
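
The per-letter mapping described in the quoted message is essentially what
the confusable data behind UTS #39 already provides, so the table does not
have to be built by hand. As a toy sketch of the idea, folding only the six
'm' look-alikes listed in the question (the class name and the tiny fold
table are illustrative only):

    import java.util.HashMap;
    import java.util.Map;

    public class ConfusableFoldSketch {
        // Hand-built fold table covering only the 'm' look-alikes above; a
        // real implementation would take the mapping from confusables.txt.
        private static final Map<Character, Character> FOLD =
                new HashMap<Character, Character>();
        static {
            FOLD.put('\u043C', 'm'); // Cyrillic small em
            FOLD.put('\u041C', 'm'); // Cyrillic capital em
            FOLD.put('\uFF2D', 'm'); // fullwidth Latin capital M
            FOLD.put('\uFF4D', 'm'); // fullwidth Latin small m
            FOLD.put('\u028D', 'm'); // Latin small letter turned w
            FOLD.put('\u039C', 'm'); // Greek capital mu
        }

        // Fold each character to its canonical look-alike (lowercased
        // identity if it is not in the table). Note that a char-based loop
        // ignores supplementary-plane characters; real code should iterate
        // code points.
        static String fold(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                Character mapped = FOLD.get(c);
                sb.append(mapped != null ? mapped : Character.toLowerCase(c));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // 'microsoft' spelled with a Cyrillic small em up front.
            System.out.println(fold("\u043Cicrosoft").equals(fold("microsoft")));
        }
    }

Multi-letter cases such as 'rn' next to 'm', or 'vv' next to 'w', are exactly
the part a one-character-at-a-time table cannot capture, which the question
already anticipates.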
-- Mark