From: Doug Ewell (firstname.lastname@example.org)
Date: Wed Nov 13 2002 - 12:02:13 EST
Dominikus Scherkl <Dominikus dot Scherkl at glueckkanja dot com> wrote:
> Hm. <lang=en>...<\lang>
> that are 9+7 = 16 characters to indicate the language (and end of tag)
> All of them are ASCII, therefore encoded as 1 byte utf-8 each.
> Plane 14 requires 4 byte utf-8 each, and at least 3 characters
> (two tag-letters and the end-tag) - this is 12 bytes.
> Ok, this is less heavy, but not very much.
> Or what do you think what "weight" in this context means?!?
I definitely *don't* mean the number of bytes in UTF-8, which is just
one way of representing Unicode text. In UTF-16, each Plane 14 tag
character requires only two 16-bit code units (cf. one for each Basic
Latin character),
while in SCSU they can take as little as one byte each, after the
initial 3-byte overhead to set up the window and maybe another byte at
the start of each tag (not character) to switch to that window.
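
To put rough numbers on the UTF-8 and UTF-16 cases, here is a minimal
Python sketch. The helper names (plane14_language_tag, encoded_sizes)
are made up for illustration, and the Plane 14 form assumed here is
U+E0001 LANGUAGE TAG followed by the tag letters for "en", with no
closing sequence. SCSU is left out because Python's standard library
has no SCSU codec.

```python
def plane14_language_tag(lang: str) -> str:
    """Build a Plane 14 language tag (hypothetical helper).

    U+E0001 LANGUAGE TAG, then each ASCII character of `lang`
    shifted into the tag block (U+E0000 + ASCII value).
    """
    return "\U000E0001" + "".join(chr(0xE0000 + ord(c)) for c in lang)


def encoded_sizes(text: str) -> dict:
    """Return the size of `text` in bytes for UTF-8 and UTF-16 (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-be")),
    }


# Markup form: opening plus closing tag, 16 ASCII characters in total.
markup = "<lang=en></lang>"
# Plane 14 form: 3 supplementary-plane characters.
tag = plane14_language_tag("en")

sizes_markup = encoded_sizes(markup)
sizes_tag = encoded_sizes(tag)
print("markup  :", sizes_markup)
print("plane 14:", sizes_tag)
```

In UTF-8 the difference is 16 bytes against 12, but in UTF-16 it is 32
bytes against 12, which is why counting UTF-8 bytes alone is a poor
measure of "weight".
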
Interpretation of UTF-8 bytes needs to happen at the very earliest
stages of text processing. It does not belong in the same stage as
language tag interpretation, normalization, bidi, etc.
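
A rough Python sketch of that layering, under the assumption that tag
handling is written against code points only. The function names are
hypothetical, and the tag-handling stage is reduced to simply
filtering out the Plane 14 tag block (U+E0000..U+E007F) to keep the
example short.

```python
# Code points of the Plane 14 tag block, U+E0000..U+E007F.
TAG_BLOCK = range(0xE0000, 0xE0080)


def decode_stage(raw: bytes, encoding: str) -> str:
    """Earliest stage: turn encoded bytes into Unicode code points."""
    return raw.decode(encoding)


def strip_tags_stage(text: str) -> str:
    """Later stage: works on code points and never sees bytes.

    This sketch only drops tag characters; a real process would
    interpret them instead.
    """
    return "".join(ch for ch in text if ord(ch) not in TAG_BLOCK)


# "hello" preceded by a Plane 14 language tag for "en".
tagged = "\U000E0001\U000E0065\U000E006Ehello"

# The later stage gives the same result whichever encoding form the
# text arrived in, because decoding has already happened.
results = [
    strip_tags_stage(decode_stage(tagged.encode(enc), enc))
    for enc in ("utf-8", "utf-16-le", "utf-32-be")
]
print(results)
```
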
Furthermore, with Plane 14 you don't always need a closing tag like
</lang>, and if it's present you don't need to check it for syntax.
(Well, I suppose you could have something illegal like U+E0001 U+E0065
U+E006E U+E007F, but that's no worse than having to check for </lanf>.)