From: Tom Emerson (tree@basistech.com)
Date: Wed Aug 10 2005 - 15:05:21 CDT
Philippe Verdy writes:
> OK but this is not a text encoding decoder: this means that you have to
> build a list of candidate charsets that pass at the plain-text level, then
> try to parse the text using an HTML parser to filter out parts that should
> not count in statistics:
[...]
Yes, yes, yes, we've done all of this, and more, for HTML and numerous
other markup languages and document formats: I'm speaking from
experience here, not just spouting off random ideas.
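
As a rough illustration of the two-stage filtering described in the quoted text (not the code we actually ship), here is a minimal Python sketch. It assumes a candidate charset "passes at the plain-text level" when the bytes decode strictly, and it treats only script and style content as non-countable; the names VisibleTextExtractor and visible_text are hypothetical.

```python
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collects text nodes, skipping the content of script and style elements."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(raw, charset):
    """Return the countable text of an HTML page, or None if the
    candidate charset cannot decode the bytes at all."""
    try:
        decoded = raw.decode(charset, errors="strict")
    except (UnicodeDecodeError, LookupError):
        return None  # candidate rejected at the plain-text level
    parser = VisibleTextExtractor()
    parser.feed(decoded)
    parser.close()
    return " ".join(parser.chunks)
```

Character statistics would then be computed over the returned string for each surviving candidate charset. A real filter would also have to deal with comments, attribute values, and broken markup, which this sketch ignores.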
> You'll also have to consider the case where some or all of these text
> elements and attributes are already marked with a language indicator. In that
> case, the language autodetection should ignore them, and instead the
> statistics of characters should be computed separately per indicated
> language.
A lot of the time we find that the language attribute on a given tag
is wrong. User-supplied metadata is useful, but can rarely be
trusted. Often more useful are the font tags that authors sprinkle
around. These can be used to help infer language and, later, encoding.
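
To make the idea concrete, here is a hedged Python sketch that groups a page's text by the hints in effect around it: the inherited lang attribute and the face of the enclosing font tag. It only collects the hints; how much weight each deserves is left to the detector. HintCollector, collect_hints and VOID_TAGS are hypothetical names, and the sketch does not skip script or style content.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not be pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class HintCollector(HTMLParser):
    """Groups text chunks by the (lang, font face) hints surrounding them."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._stack = []  # (tag, lang, face) in effect inside each open element
        self.text_by_hint = defaultdict(list)

    def _current(self):
        return self._stack[-1][1:] if self._stack else (None, None)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        lang, face = self._current()  # inherit from the parent element
        attrs = dict(attrs)
        lang = attrs.get("lang") or lang
        if tag == "font":
            face = attrs.get("face") or face
        self._stack.append((tag, lang, face))

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open element, tolerating sloppy markup.
        for i in range(len(self._stack) - 1, -1, -1):
            if self._stack[i][0] == tag:
                del self._stack[i:]
                break

    def handle_data(self, data):
        if data.strip():
            self.text_by_hint[self._current()].append(data.strip())


def collect_hints(html):
    """Return {(lang, face): [text chunks]} for an already-decoded page."""
    parser = HintCollector()
    parser.feed(html)
    parser.close()
    return dict(parser.text_by_hint)
```

Each group can then be run through the language detector on its own, with the declared language and font face treated as evidence to be checked rather than as ground truth.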
> The other problem is that most composed pages forget to explicitly label the
> foreign language used in small spans of text. These spans can be very
> frequent, especially within technical documents (like a JavaDoc page, or
> document speaking about some standards, with lots of acronyms or
> untranslated terms).
We have technology here that can detect occurrences of multiple
languages in a single document, though not at the level of one or two
words.
> To detect a language, you could also try searching for very common terms
> like "the", "is", "are", "have", "and" in English, "le", "un", "a", "à",
> "est", "et" in French, "der", "das", "ist" in German. These general terms
> are exactly those that are generally ignored by search engines due to their
> frequency in each language.
Right, isn't this what the Netscape detector does? Building these term
lists is easy, and they can indeed be useful when disambiguating
possible matches. You can also use these lists to differentiate very
similar languages, like Malay and Indonesian, something we can do
quite reliably when given enough text.
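
A minimal Python sketch of the term-list approach, using only the handful of words quoted above. It is an illustration under stated assumptions, not the Netscape detector or our own implementation: FUNCTION_WORDS, score_languages and guess_language are hypothetical names, and lists this short are far too small for real use.

```python
import re
from collections import Counter

# Tiny illustrative lists taken from the quoted message; real lists are much longer.
FUNCTION_WORDS = {
    "en": {"the", "is", "are", "have", "and"},
    "fr": {"le", "un", "a", "à", "est", "et"},
    "de": {"der", "das", "ist"},
}


def score_languages(text):
    """Return, per language, the fraction of tokens that are listed function words."""
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {
        lang: sum(counts[word] for word in words) / total
        for lang, words in FUNCTION_WORDS.items()
    }


def guess_language(text):
    """Return the best-scoring language code, or None if no listed word occurs."""
    scores = score_languages(text)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

For example, guess_language("Le chat est un animal et il dort") picks "fr" because four of its eight tokens are on the French list and none are on the others. Telling apart close pairs such as Malay and Indonesian would need much longer lists of words that differ between the two, and enough text for them to show up.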
--
Tom Emerson
Basis Technology Corp.
Software Architect
http://www.basistech.com
"You can't fake quality any more than you can fake a good meal." (W.S.B.)