Mark E Davis wrote:
> Note that with these simple characters
> there is no way to distinguish the few semantic ligatures (that actually
> should not be ignored) from the bulk of the decorative ligatures (which
> must be ignored), unless one uses markup or adds yet more characters!
Distinguishing the two cases only makes sense when plain text must be
extracted from the rich text, because in this case you need to grasp the
"essence" of the text, dropping all the decorative markup. This is an
important process, and it happens not only when "the file is saved as plain
text" but also, e.g., when:
- sections of text are extracted to be compared with a plain-text search
string;
- sections of text are copied from a rich text application to be pasted into
a plain text one;
- parts of the text need to be stored to or loaded from some sort of binary
database (including, e.g., the document's header, TOCs, indices).
Adding more markup would be the ideal solution to keep the rich text scheme
consistent.
If the rich text scheme has separate markup to indicate runs of text as
"semantic" or "decorative" ligatures, the distinction is possible, of
course:
* rich text "a<SEMANTIC_LIG>bcde</SEMANTIC_LIG>f"
plain text "ab<ZWL>c<ZWL>d<ZWL>ef"
* rich text "a<DECOR_LIG>bcde</DECOR_LIG>f"
plain text "abcdef"
* rich text "a<DECOR_LIG>b<SEMANTIC_LIG>cd</SEMANTIC_LIG>e</DECOR_LIG>f"
plain text "abc<ZWL>def"
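The three conversions above can be sketched in a few lines of Python. This is
only an illustration under some assumptions: the tag names are the
hypothetical ones used in the examples, "<ZWL>" is kept as a literal
placeholder token rather than a real code point, the function name
rich_to_plain is made up, and SEMANTIC_LIG runs are assumed not to nest
inside each other.

```python
import re

ZWL = "<ZWL>"  # placeholder token from the examples, not an actual code point

# Matches the opening or closing form of either hypothetical ligature tag.
TAG = re.compile(r"</?(?:SEMANTIC_LIG|DECOR_LIG)>")


def rich_to_plain(rich: str) -> str:
    """Drop all ligature markup, keeping semantic ligatures as ZWL joins."""
    out = []
    in_semantic = False   # currently inside <SEMANTIC_LIG>...</SEMANTIC_LIG>
    prev_in_run = False   # previous character belonged to the same semantic run
    pos = 0
    while pos < len(rich):
        match = TAG.match(rich, pos)
        if match:
            tag = match.group(0)
            if tag == "<SEMANTIC_LIG>":
                in_semantic, prev_in_run = True, False
            elif tag == "</SEMANTIC_LIG>":
                in_semantic, prev_in_run = False, False
            # DECOR_LIG tags are purely decorative: skipped without a trace.
            pos = match.end()
            continue
        if in_semantic and prev_in_run:
            out.append(ZWL)  # join this character to the previous one
        out.append(rich[pos])
        prev_in_run = in_semantic
        pos += 1
    return "".join(out)


print(rich_to_plain("a<SEMANTIC_LIG>bcde</SEMANTIC_LIG>f"))
print(rich_to_plain("a<DECOR_LIG>bcde</DECOR_LIG>f"))
print(rich_to_plain("a<DECOR_LIG>b<SEMANTIC_LIG>cd</SEMANTIC_LIG>e</DECOR_LIG>f"))
```

The decorative tags simply vanish, while each semantic run leaves a ZWL
between every pair of adjacent characters it contained.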
An economical (but less elegant) solution is to allow casual use of ZW*L codes
in the rich text too. These would map to themselves when converted to plain
text:
* rich text "ab<ZWL>c<ZWL>d<ZWL>ef"
plain text "ab<ZWL>c<ZWL>d<ZWL>ef"
* rich text "a<GENERIC_LIG>bcde</GENERIC_LIG>f"
plain text "abcdef"
* rich text "a<GENERIC_LIG>bc<ZWL>de</GENERIC_LIG>f"
plain text "abc<ZWL>def"
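With this second scheme the extraction becomes almost trivial, since only
the generic tag has to be stripped. A minimal sketch, again assuming the
hypothetical GENERIC_LIG tag name from the examples, "<ZWL>" as a literal
placeholder token, and a made-up function name:

```python
import re

# Matches the opening or closing form of the hypothetical generic tag.
GENERIC_TAG = re.compile(r"</?GENERIC_LIG>")


def strip_generic_ligatures(rich: str) -> str:
    """Remove GENERIC_LIG markup; ZWL codes in the text pass through untouched."""
    return GENERIC_TAG.sub("", rich)


print(strip_generic_ligatures("ab<ZWL>c<ZWL>d<ZWL>ef"))
print(strip_generic_ligatures("a<GENERIC_LIG>bcde</GENERIC_LIG>f"))
print(strip_generic_ligatures("a<GENERIC_LIG>bc<ZWL>de</GENERIC_LIG>f"))
```

The converter needs no state at all here, which is what makes this option
cheaper than tracking two kinds of ligature markup.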
Ciao.
Marco
This archive was generated by hypermail 2.1.2 : Tue Jul 10 2001 - 17:20:57 EDT