From: Mark Davis (mark.davis@icu-project.org)
Date: Fri Mar 07 2008 - 17:42:11 CST
Take a look at http://www.unicode.org/reports/tr36/ and then
http://www.unicode.org/reports/tr39/
Mark
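
For a concrete sense of what the reports suggest, a minimal sketch of a
skeleton-based wordlist check might look like the following. It assumes the
SpoofChecker class from a recent ICU4J, which implements the UTS #39
confusable mapping; the BlacklistDemo wrapper and the sample strings are
only illustrative, not anything defined by the reports themselves.

    import com.ibm.icu.text.SpoofChecker;
    import java.util.HashSet;
    import java.util.Set;

    public class BlacklistDemo {
        public static void main(String[] args) {
            // The default SpoofChecker loads the UTS #39 confusables data.
            SpoofChecker sc = new SpoofChecker.Builder().build();

            // Precompute the confusable "skeleton" of every blacklisted word.
            Set<String> blockedSkeletons = new HashSet<String>();
            for (String word : new String[] {"microsoft", "wal-mart", "apple"}) {
                blockedSkeletons.add(sc.getSkeleton(word));
            }

            // A disguised spelling: Cyrillic small es (U+0441) and small o
            // (U+043E) in place of the Latin 'c' and the second 'o'.
            String input = "mi\u0441ros\u043Eft";

            // Strings that fold to the same skeleton are confusable, so the
            // hash lookup catches the disguised spelling as well.
            boolean blocked = blockedSkeletons.contains(sc.getSkeleton(input));
            System.out.println(blocked ? "blocked" : "allowed");
        }
    }

The skeletons of the wordlist can be computed once and stored alongside it,
so each incoming string costs one getSkeleton() call and one hash lookup.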
On Fri, Mar 7, 2008 at 2:30 PM, Chris Weber (Casaba Security)
<chris@casabasecurity.com> wrote:
> Hi group, I thought this might be the right place to ask this question, and
> I apologize if this has been answered in the past.
>
> How can I blacklist a large set of words from a wordlist when all Unicode
> blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)? The scenario is a
> web application written in .NET supporting UTF-8. It consumes a string of
> input, then compares the string against a wordlist of disallowed, or
> blacklisted, words.
>
> Background:
> A sample of the blacklist includes profanity and trademark names like:
> - microsoft
> - wal-mart
> - apple
>
> The word 'microsoft' in its escaped (UCN) form would be:
> \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
>
> The core of the problem seems to be that any one of these letters can be
> glyphically (visually) represented using another code point. For example,
> just look at some of the different ways the letter 'm' can be visually
> represented:
>
> м \u043C
> М \u041C
> M \uFF2D
> m \uFF4D
> ʍ \u028D
> Μ \u039C
>
> So the ideal solution might map every letter against all possible visual
> representations of that letter. I know that's really tricky business, as
> even something like 'rn' might look like an 'm' in some fonts, and two v's
> 'vv' could look like a 'w'. Fonts play a part in this of course, and the
> problem starts to look unsolvable.
>
>
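
The per-letter mapping described in the quoted message is essentially what
the confusable data behind UTS #39 already provides, so the table does not
have to be built by hand. As a toy sketch of the idea, folding only the six
'm' look-alikes listed in the question (the class name and the tiny fold
table are illustrative only):

    import java.util.HashMap;
    import java.util.Map;

    public class ConfusableFoldSketch {
        // Hand-built fold table covering only the 'm' look-alikes above; a
        // real implementation would take the mapping from confusables.txt.
        private static final Map<Character, Character> FOLD =
                new HashMap<Character, Character>();
        static {
            FOLD.put('\u043C', 'm'); // Cyrillic small em
            FOLD.put('\u041C', 'm'); // Cyrillic capital em
            FOLD.put('\uFF2D', 'm'); // fullwidth Latin capital M
            FOLD.put('\uFF4D', 'm'); // fullwidth Latin small m
            FOLD.put('\u028D', 'm'); // Latin small letter turned w
            FOLD.put('\u039C', 'm'); // Greek capital mu
        }

        // Fold each character to its canonical look-alike (lowercased
        // identity if it is not in the table). Note that a char-based loop
        // ignores supplementary-plane characters; real code should iterate
        // code points.
        static String fold(String s) {
            StringBuilder sb = new StringBuilder(s.length());
            for (int i = 0; i < s.length(); i++) {
                char c = s.charAt(i);
                Character mapped = FOLD.get(c);
                sb.append(mapped != null ? mapped : Character.toLowerCase(c));
            }
            return sb.toString();
        }

        public static void main(String[] args) {
            // 'microsoft' spelled with a Cyrillic small em up front.
            System.out.println(fold("\u043Cicrosoft").equals(fold("microsoft")));
        }
    }

Multi-letter cases such as 'rn' next to 'm', or 'vv' next to 'w', are exactly
the part a one-character-at-a-time table cannot capture, which the question
already anticipates.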
-- Mark