From: Kenneth Whistler (kenw@sybase.com)
Date: Mon May 19 2008 - 18:42:20 CDT
> But even the mail from this list seems not to be sent with a
> Unicode encoding; another mystery.
Not really a mystery at all. The encodings depend on the various
and sundry mail clients used by the people sending mail *to*
the list.
> > if
> > you mean normalization in the sense of transforming to a Normalization
> > Form
>
> Sorry, I meant normalizing according to a set of Unicode-inspired
> orthographic norms.
While Unicode may inspire people who work on orthographic norms,
it is important to note that Unicode itself (and the Unicode Consortium)
is *not* about orthographic norms. The Unicode Standard identifies
characters, but it does not attempt to tell people which characters
they *must* use for particular orthographies. That is up to folks
who are concerned with orthographic norms.
I realize that the use of apostrophes is a particularly fraught
question, but this is a result of the peculiar character encoding
history of the much-overloaded ASCII 0x27 apostrophe, and of the
subsequent introduction of directional single quotation marks
into character encodings, followed by U+02BC MODIFIER LETTER
APOSTROPHE in Unicode.
But there is no one "right answer" -- no matter how much people
might want one -- for which of the alternates should be used
under all circumstances.
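
For anyone who wants to see how the three usual candidates differ in
the Unicode Character Database, here is a minimal Python sketch. The
helper name describe() and the choice of these three code points are
just for illustration. It shows only that the characters carry
different names and general categories, and says nothing about which
one a given orthography ought to use.

```python
import unicodedata

# The three characters most often confused as "apostrophe".
CANDIDATES = ["\u0027", "\u2019", "\u02BC"]

def describe(ch):
    """Return (code point, character name, general category) for ch."""
    return (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))

for ch in CANDIDATES:
    print(*describe(ch))
# U+0027 APOSTROPHE Po
# U+2019 RIGHT SINGLE QUOTATION MARK Pf
# U+02BC MODIFIER LETTER APOSTROPHE Lm
```

Only U+02BC has a letter category (Lm); the other two are punctuation
(Po and Pf).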
> I appreciate your caution. On the other hand, not touching it is a decision,
> too. If different sources represent the same lexeme with different apostrophes
> and we refrain from touching them, then we’re asserting (in our project) that
> these lexemes are distinct, and this interferes with our discovery of
> translation paths through the lexeme.
In which case you should probably be doing some version of
"apostrophe folding" for the purposes of your lexemic analysis.
> Apparently, though it was by some (e.g., James Kass, who argued--against the
> view of Asmus Freytag--that Web pages are more, not less, subject to an
> expectation of standard conformity than are paper-printed works, and finished
> with: “Web pages on the Unicode site should be exemplary”). For my purposes,
> it would certainly help if they were exemplary, and it casts doubt on the
> claim of practicality of the standard when the standardizing authority doesn’t
> comply.
Doesn't "comply" with what? As I noted above, the Unicode Standard is
not about specifying orthographic norms. And the standardizing authority
for HTML is W3C, not the Unicode Consortium.
>
> > There is this problem in the Ukrainian language, where the apostrophe means a hard sign.
> > How to reproduce it in the original Cyrillic script? It would not be a "diacritic"
> > character as an apostrophe, but it is really the original Cyrillic character at
> > the moment (the Ukrainian National Library takes it as an apostrophe, U+0027).
>
> > Same as in the Latin script: U+2019
> > http://www.unics.uni-hanover.de/nhtcapri/cyrillic-script.html5
>
> Why? This seems to conflict with the standard as I understand it. I believe
> it’s a letter with a phonological value, not a punctuation mark, so I
> understand the standard to state that the correct character is 02BC (MODIFIER
> LETTER APOSTROPHE). I believe that this is argued for at
> http://linux.org.ua/cgi-bin/yabb/YaBB.pl?num=1189996822/75
> in message 87. If I’m incorrect, I’d appreciate an explanation. Thanks.
Followed in message 89 by a quotation from Unicode 4.1.0 about
the distinctions (or non-distinctions) between U+0027, U+02BC,
and U+2019 -- and we are back chasing our tails again. It really
is the same set of arguments for every orthography that uses
a raised comma-shaped "apostrophe" in one or more contexts. Does
it systematically distinguish between letter and punctuation uses,
and if so, in what contexts?
--Ken