RFC 2070 says the following with regard to HTML:

    NOTE -- RFC 1866 section 4.2.2 specifies that an HTML user agent
    should treat an end of line as a word space, except in
    preformatted text. This should be interpreted in the context of
    the script being processed, as the way words are separated in
    writing is script-dependent. For some scripts (e.g. Latin), a
    word space is just a space, but in other scripts (e.g. Thai) it is
    a zero-width word separator, whereas in yet other scripts (e.g.
    Japanese) it is nothing at all, i.e. totally ignored.

That's nice. However, so far I can't find anyone who can tell me how
to implement this particular note. Given the two characters on either
side of an end of line, can I tell algorithmically whether the line
break should become a space, a zero-width word separator, or nothing
at all? Obviously I can test whether the characters fall in a
particular script range in Unicode (which is what I'm working with),
but I don't have an exhaustive list of which scripts get which
treatment. Can anyone help?
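
To make the question concrete, here is a minimal Python sketch of the
script-range test described above. Everything in it is an assumption
for illustration: the names (treatment, line_break_replacement,
join_lines) are hypothetical, the range table covers only the scripts
the RFC note mentions and is nowhere near exhaustive, and the rule of
requiring both neighbouring characters to agree is just one plausible
reading of the note, with a plain space as the fallback.

```python
ZWSP = "\u200b"  # ZERO WIDTH SPACE, used here as the zero-width word separator

# Illustrative, NOT exhaustive: (first code point, last code point, treatment).
# Filling in this table properly is exactly the open question.
_RANGES = [
    (0x0E00, 0x0E7F, "zwsp"),  # Thai
    (0x3040, 0x309F, "none"),  # Hiragana
    (0x30A0, 0x30FF, "none"),  # Katakana
    (0x4E00, 0x9FFF, "none"),  # CJK Unified Ideographs
]


def treatment(ch):
    """Return 'zwsp', 'none', or 'space' for a single character."""
    cp = ord(ch)
    for first, last, kind in _RANGES:
        if first <= cp <= last:
            return kind
    return "space"  # default: Latin-style word space


def line_break_replacement(before, after):
    """Pick the string that replaces an end of line between two characters.

    Assumed rule: use the script-specific treatment only when both
    neighbours agree; otherwise fall back to an ordinary space.
    """
    a, b = treatment(before), treatment(after)
    if a == b == "none":
        return ""
    if a == b == "zwsp":
        return ZWSP
    return " "


def join_lines(lines):
    """Join non-preformatted lines, skipping empty ones."""
    out = ""
    for line in lines:
        if not line:
            continue
        if out:
            out += line_break_replacement(out[-1], line[0])
        out += line
    return out
```

With this table, join_lines(["hello", "world"]) inserts a space,
join_lines(["日本", "語"]) inserts nothing, and two Thai lines are
joined with U+200B. Mixed-script boundaries fall back to a space,
which may or may not be what the note intends.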
pr
-- Pete Resnick <mailto:presnick@qualcomm.com> Eudora Engineering - QUALCOMM Incorporated Ph: (217)337-6377 or (858)651-4478, Fax: (858)651-1102