To go with Lukas's Perl code, I'll provide a C version, not really tested
either, with ICU, to give him a choice. No error checking etc., just to give
the idea. If you want UTF-16 you'll need to use the macros in
unicode/utf16.h to generate surrogate pairs properly.
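
On the UTF-16 point, here is a rough sketch of the arithmetic behind a surrogate pair, done by hand so it does not depend on ICU. The helper name to_utf16 is made up for illustration and is not an ICU function. It assumes the input is a valid code point no larger than U+10FFFF and does no checking.

```c
#include <stdint.h>

/* Hypothetical helper, not part of ICU: split one code point into
   UTF-16 code units. Returns the number of units written (1 or 2).
   Assumes c is a valid code point (<= 0x10FFFF, not a surrogate). */
static int to_utf16(uint32_t c, uint16_t out[2]) {
    if (c < 0x10000) {
        out[0] = (uint16_t)c;                  /* BMP: one unit */
        return 1;
    }
    c -= 0x10000;                              /* 20 bits remain */
    out[0] = (uint16_t)(0xD800 + (c >> 10));   /* lead surrogate: top 10 bits */
    out[1] = (uint16_t)(0xDC00 + (c & 0x3FF)); /* trail surrogate: low 10 bits */
    return 2;
}
```

In real code the ICU macros are the better choice. This only shows what "properly" involves.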

#include <stdio.h>
#include <string.h> /* strlen */
#include <unicode/utf8.h>

#define LINE_MAX 80 /* Whatever. */

int main(void) {
    char buf[LINE_MAX];
    while (fgets(buf, sizeof(buf), stdin)) {
        size_t i;
        size_t len = strlen(buf);
        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = 0; /* We don't want that one in the output. */
        }
        for (i = 0; i < len;) {
            int32_t c;
            UTF8_NEXT_CHAR_UNSAFE(buf, i, c);
            /* As Lukas's code, use entities only above ASCII. */
            if (c < 0x80) {
                putchar(c);
            } else {
                printf("&#%ld;", (long)c); /* cast: int32_t need not be long */
            }
        }
        putchar('\n'); /* Separate lines; will produce white space in HTML. */
    }
    return 0;
}

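
If ICU is not at hand, the same idea can be sketched with a hand-written decoder. The names next_unsafe and escape_non_ascii are made up for this example. Like the unsafe macro above, it assumes the input is well-formed UTF-8 and does no validation, and it assumes the caller supplies a large enough output buffer.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for the ICU macro: decode one code point from
   well-formed UTF-8 at s[*i] and advance *i past it. No validation. */
static int32_t next_unsafe(const char *s, size_t *i) {
    unsigned char b = (unsigned char)s[(*i)++];
    int32_t c;
    int trail;
    if (b < 0x80) return b;                         /* ASCII, single byte */
    if (b < 0xE0)      { c = b & 0x1F; trail = 1; } /* 2-byte sequence */
    else if (b < 0xF0) { c = b & 0x0F; trail = 2; } /* 3-byte sequence */
    else               { c = b & 0x07; trail = 3; } /* 4-byte sequence */
    while (trail-- > 0) {
        /* Each trail byte contributes its low 6 bits. */
        c = (c << 6) | ((unsigned char)s[(*i)++] & 0x3F);
    }
    return c;
}

/* Copy len bytes of src to dst, replacing every non-ASCII character
   with a decimal numeric character reference. dst must be big enough
   (at most 10 bytes of output per character, plus the terminator). */
static void escape_non_ascii(const char *src, size_t len, char *dst) {
    size_t i = 0;
    while (i < len) {
        int32_t c = next_unsafe(src, &i);
        if (c < 0x80) {
            *dst++ = (char)c;
        } else {
            dst += sprintf(dst, "&#%ld;", (long)c);
        }
    }
    *dst = '\0';
}
```

Writing into a buffer rather than straight to stdout is only there to make the function easy to try out on its own. For untrusted input you would want a checking decoder instead.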
Hope this helps,
YA
This archive was generated by hypermail 2.1.2 : Fri Jul 06 2001 - 00:18:17 EDT