mapping characters with visual similarities

From: Chris Weber (Casaba Security) (chris@casabasecurity.com)
Date: Fri Mar 07 2008 - 16:30:30 CST

    Hi group, I thought this might be the right place to ask this question, and
    I apologize if it has been answered in the past.

    How can I blacklist a large set of words from a wordlist when all Unicode
    blocks are allowed (e.g. fullwidth Latin, Cyrillic, etc.)?  The scenario
    is a web application written in .NET that supports UTF-8.  It consumes a
    string of input, then compares the string against a wordlist of disallowed,
    or blacklisted, words.

    Background:
    A sample of the blacklist includes profanity and trademark names like:
    - microsoft
    - wal-mart
    - apple

    The word 'microsoft' in its UCN form would be:
    \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
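
    A minimal sketch of how that form can be derived, in Python rather than
    .NET purely for brevity.  The helper name to_ucn is made up for
    illustration.

```python
def to_ucn(text: str) -> str:
    """Return the universal-character-name escape for every code point."""
    parts = []
    for ch in text:
        cp = ord(ch)
        if cp <= 0xFFFF:
            # Code points in the Basic Multilingual Plane use \uXXXX.
            parts.append(f"\\u{cp:04X}")
        else:
            # Supplementary code points need the 8-digit \UXXXXXXXX form.
            parts.append(f"\\U{cp:08X}")
    return "".join(parts)


print(to_ucn("microsoft"))
# \u006D\u0069\u0063\u0072\u006F\u0073\u006F\u0066\u0074
```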

    The core of the problem seems to be that any one of these letters can be
    glyphically (visually) represented using another code point.  For example,
    here are just some of the ways the letter 'm' can be visually represented:

    м  \u043C
    М  \u041C
    Ｍ  \uFF2D
    ｍ  \uFF4D
    ʍ  \u028D
    Μ  \u039C
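
    As a rough check of how far standard normalization gets with this list,
    the sketch below applies NFKC normalization plus case folding to each
    code point using Python's unicodedata module.  The function name
    folds_to_m is hypothetical.

```python
import unicodedata

# The six look-alikes for 'm' listed above, in the same order.
LOOKALIKES = ["\u043C", "\u041C", "\uFF2D", "\uFF4D", "\u028D", "\u039C"]


def folds_to_m(ch: str) -> bool:
    """True if NFKC normalization plus case folding yields ASCII 'm'."""
    return unicodedata.normalize("NFKC", ch).casefold() == "m"


for ch in LOOKALIKES:
    # Print the code point, its official character name, and the result.
    print(f"U+{ord(ch):04X}  {unicodedata.name(ch)}: {folds_to_m(ch)}")
```

    Only the two fullwidth forms fold to 'm' this way.  The Cyrillic, Greek
    and IPA characters are distinct letters with no compatibility mapping to
    Latin, so normalization alone does not cover them.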

    So the ideal solution might map every letter against all possible visual
    representations of that letter.  I know that's really tricky business, as even
    something like 'fn' might look like an 'm' in some fonts, and two v's 'vv'
    could look like a 'w'.  Fonts play a part in this of course, and the problem
    starts to look unsolvable.  
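
    To make the idea concrete, here is a minimal sketch of such a mapping, in
    Python for brevity.  It assumes a tiny hand-written table that covers only
    the 'm' look-alikes and the 'vv' case mentioned above.  The names
    CONFUSABLES, skeleton and is_blacklisted are hypothetical, and a real
    table would be far larger and would produce false positives (for
    example, 'vv' in ordinary words).

```python
import unicodedata

# Assumption: an illustrative, hand-written table, nowhere near complete.
CONFUSABLES = {
    "\u043C": "m",  # CYRILLIC SMALL LETTER EM
    "\u041C": "m",  # CYRILLIC CAPITAL LETTER EM
    "\u028D": "m",  # LATIN SMALL LETTER TURNED W
    "\u039C": "m",  # GREEK CAPITAL LETTER MU
    "vv": "w",      # two-character sequence that can resemble 'w'
}

BLACKLIST = {"microsoft", "wal-mart", "apple"}


def skeleton(text: str) -> str:
    """Reduce text to a comparison form: NFKC, table lookup, case fold."""
    result = unicodedata.normalize("NFKC", text)  # folds fullwidth forms
    # Replace longer sequences first so multi-character keys win.
    for source in sorted(CONFUSABLES, key=len, reverse=True):
        result = result.replace(source, CONFUSABLES[source])
    return result.casefold()


def is_blacklisted(text: str) -> bool:
    """Compare the reduced form of the input against the wordlist."""
    return skeleton(text) in BLACKLIST


print(is_blacklisted("\u041Cicrosoft"))  # Cyrillic capital Em + "icrosoft"
print(is_blacklisted("\uFF4Dicrosoft"))  # fullwidth m + "icrosoft"
print(is_blacklisted("micro"))
```

    The wordlist itself stays in plain ASCII; only the input is reduced
    before the comparison, so the table can grow without touching the list.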


