RE: Subj: Unicode form field validation in javascript

From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 01 2007 - 13:50:59 CDT

Next message: Christopher Fynn: "Re: questions on implementing an embeded system that supports unicode"

Previous message: William J Poser: "RE: questions on implementing an embeded system that supports unicode"
In reply to: Magda Danish (Unicode): "Subj: Unicode form field validation in javascript"

knez.dusan@gmail.com wrote:
> $legalChars = "/\p{L}|\p{Pc}|\p{N}/"; // check for letters, numbers and underscores
> $legalCharsCount = preg_match_all($legalChars,$strng,$blb);
> $illegalCharsCount = mb_strlen($strng,"UTF-8") - $legalCharsCount;
>
> I wonder, how to implement the similar javascript validation with regular
> expressions on client side.

Your test basically just verifies that the string contains no characters
that fail to match your regexp. The regexp would be better written with an
explicit set rather than an alternation:
"/[\p{L}\p{Pc}\p{N}]/"

Note, however, that your expression will not allow people to enter all their
names the way they want them written, because it excludes combining
characters (for example Hebrew and Arabic vowel points, or the dependent
vowel signs of the Indic abugidas).

I think you will also receive complaints from people whose names use
apostrophes (don't assume they are necessarily "closing", on the right),
hyphens, or required spaces in compound names. (Here you are only accepting
letters, connector punctuation, and digits/numbers, but no combining
diacritics, no hyphens, no apostrophes, and no spaces; for some languages
you'll need other "punctuation" marks as well, because in the native
language they are not used as punctuation but as part of the orthographic
system, as if they were letters.)

If you want to make sure that people's names will be accepted correctly, you
may first look at the list of languages you want to support, and then look at
the list of characters needed for each of them, which you can find in the
many UDHR texts now present on the Unicode site: each text comes with a list
of the characters it uses.

Javascript does not have built-in support for this level of regular
expressions. However, it does support regular expressions that could be
built using a simpler syntax for the set of allowed characters.
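
As a rough illustration of that simpler syntax, a character class can list
explicit \uXXXX ranges. The ranges below (ASCII letters and digits, the
underscore, and the Latin letters from U+00C0 to U+024F) are only an assumed
sample, far smaller than the real \p{L}, \p{N} and \p{Pc} sets, and the names
simpleLegal and isSimpleLegal are hypothetical:

```javascript
// Assumed sample set: ASCII letters/digits, underscore, and Latin letters
// U+00C0..U+024F (skipping U+00D7 and U+00F7, which are math signs).
var simpleLegal = /^[A-Za-z0-9_\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]*$/;

// Returns true when every code unit of s belongs to the sample set.
function isSimpleLegal(s) {
  return simpleLegal.test(s);
}
```

Anchoring the class with ^ and $ rejects the whole string as soon as one
character falls outside the set, so no separate count of legal and illegal
characters is needed.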

But note also that Javascript internally handles strings as UTF-16, not as
UTF-8 as you are doing in PHP on the server side. This means that the set
must be described using only UTF-16 code units. Note also that Javascript
does not enforce codepoint boundaries in the UTF-16 encoding (so characters
outside the BMP that are accepted by your regexp would not be accepted by a
Javascript RegExp, where they would be seen only as separate surrogates).
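
To make the code-unit point concrete, here is a small sketch using U+10400
(DESERET CAPITAL LETTER LONG I), written with escapes so that the source
encoding does not matter. The helper name codeUnits is hypothetical:

```javascript
// Returns the UTF-16 code units of s as an array of integers.
function codeUnits(s) {
  var units = [];
  for (var i = 0; i < s.length; i++) {
    units.push(s.charCodeAt(i));
  }
  return units;
}

// One codepoint outside the BMP is stored as two code units.
var deseret = "\uD801\uDC00";           // U+10400
var units = codeUnits(deseret);         // [0xD801, 0xDC00]

// A character class sees the two surrogates as two independent members,
// so it matches one code unit at a time and ignores their order.
var oneUnit = /^[\uD801\uDC00]$/.test(deseret);            // false
var swapped = /^[\uD801\uDC00]{2}$/.test("\uDC00\uD801");  // true
```

The last line shows why a plain character class is not enough here: a
reversed, ill-formed pair is accepted just as readily as the valid one.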

So I suggest you initialise a constant array containing the ranges of
accepted characters. This can be done with just a single sorted array of
Javascript integers (each even index stores the start of an accepted range,
each odd index stores the start of a rejected range), and then using a
simple dichotomic (binary) search for each codepoint to check.
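
A minimal sketch of that array and search follows. The boundary values are
an assumed sample (digits, ASCII letters, underscore, Latin letters up to
U+024F), not the full set matched by the PHP expression, and the names
BOUNDS and isAccepted are hypothetical:

```javascript
// Sorted boundaries: even index = first accepted codepoint of a range,
// odd index = first rejected codepoint after that range.
var BOUNDS = [
  0x0030, 0x003A,  // 0-9
  0x0041, 0x005B,  // A-Z
  0x005F, 0x0060,  // _
  0x0061, 0x007B,  // a-z
  0x00C0, 0x00D7,  // Latin letters before the multiplication sign
  0x00D8, 0x00F7,  // Latin letters before the division sign
  0x00F8, 0x0250   // Latin letters up to U+024F
];

// Binary search: count how many boundaries are <= cp.
// An odd count means the last boundary passed opened an accepted range.
function isAccepted(cp, bounds) {
  var lo = 0, hi = bounds.length;
  while (lo < hi) {
    var mid = (lo + hi) >>> 1;
    if (bounds[mid] <= cp) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return (lo & 1) === 1;
}
```

For example, isAccepted(0x0161, BOUNDS) looks up U+0161 (s with caron) and
isAccepted(0x0020, BOUNDS) looks up the space; with the sample table the
first is accepted and the second is rejected.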

You'll also need a simple loop to scan the Javascript string, detect
surrogates, and associate them in pairs to convert them into codepoints. If
surrogates are unpaired, you can return a false value from your test
function to signal that the string contains invalid Javascript "characters".
Note that Javascript characters are not the same as Unicode codepoints or
abstract characters: Javascript strings are just vectors of 16-bit code units
(a Javascript string can safely store invalid codepoints, and in fact any
vector, in any order, of code units in the full range \u0000 to \uFFFF).
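
The scanning loop could look roughly like the sketch below. The function
name checkString and its acceptCodePoint callback are hypothetical; the
callback stands for whatever per-codepoint test you use (a range lookup,
for instance):

```javascript
// Scans s code unit by code unit, pairs surrogates into codepoints, and
// returns false on an unpaired surrogate or a rejected codepoint.
function checkString(s, acceptCodePoint) {
  for (var i = 0; i < s.length; i++) {
    var unit = s.charCodeAt(i);
    var cp = unit;
    if (unit >= 0xD800 && unit <= 0xDBFF) {
      // High surrogate: a low surrogate must follow immediately.
      if (i + 1 >= s.length) return false;
      var next = s.charCodeAt(i + 1);
      if (next < 0xDC00 || next > 0xDFFF) return false;
      cp = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
      i++; // the low surrogate has been consumed
    } else if (unit >= 0xDC00 && unit <= 0xDFFF) {
      // Low surrogate with no preceding high surrogate.
      return false;
    }
    if (!acceptCodePoint(cp)) return false;
  }
  return true;
}

// Usage sketch: collect the codepoints seen while accepting everything.
var seen = [];
var ok = checkString("a\uD801\uDC00", function (cp) {
  seen.push(cp);
  return true;
});
```

Passing the per-codepoint test as a callback keeps the surrogate handling
separate from the table of accepted ranges, so either part can be changed
without touching the other.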

However, it seems difficult to enter such invalid text in a browser, which
must validate its input for XML conformance even if a Javascript string could
store it (what this means is that not all Javascript strings can be assigned
to or retrieved from an HTML/XML-bound input object). But I would not bet on
it, because of possible bugs in browsers, or limitations in their editors
that allow entering text that is invalid for Unicode or for XML/HTML
transmission, even though it may be present in the local input object.

The implementation is then simple if your regular expression is fixed: it
just requires an equivalent initialized constant vector of integers.


This archive was generated by hypermail 2.1.5 : Wed Aug 01 2007 - 13:51:57 CDT