From: Philippe Verdy (verdy_p@wanadoo.fr)
Date: Wed Aug 10 2005 - 14:48:59 CDT
From: "Tom Emerson" <tree@basistech.com>:
> Philippe Verdy writes:
>> This is absolutely not needed for a charset detector (i.e. the detection
>> of the encoding used to serialize the text). HTML escapes are perfectly
>> valid in HTML, and even if they refer to non-Latin-1 characters, this does
>> not change the fact that the page remains encoded in ISO-8859-1.
>>
>> You don't need to take HTML escapes into account with regard to which
>> encoding is used, because these escapes are independent of the actual
>> encoding used.
>
> Agreed. But if you are interested in the language of the page as well
> as the encoding, which some applications do care about, then you have
> to take these into account. And, as I said, building a model that
> accounts for language as well as encoding can help differentiate the
> various Latin-n versions.
OK, but this is not a text-encoding decoder: you have to build a list of
candidate charsets that pass at the plain-text level, then try to parse the
text with an HTML parser to filter out the parts that should not count in
the statistics (a sketch follows the list below):
- the document type declaration and its inline DTD, if any
- processing instructions
- HTML comments
- the syntactic HTML tag delimiters < = / > and the quotes around attribute
values
- element and attribute names
- the spaces around block elements, and within opening tags around the
attributes
- most attribute values, except enumerated, ID or name attributes (but not
all of them, as there are localizable CDATA attribute values)
- a few text elements with a specific syntax (for example the content of
<script> and <style> elements), which are not rendered as plain text.
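For illustration, a minimal sketch of such a filter in Python, using only the
standard html.parser module; which attributes count as localizable here
(title, alt, summary) is just an assumption for the example:

    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        SKIP_CONTENT = {"script", "style"}        # content is code, not prose
        KEEP_ATTRS = {"title", "alt", "summary"}  # localizable CDATA attributes

        def __init__(self):
            super().__init__(convert_charrefs=True)  # entities -> plain text
            self.chunks = []
            self._skip_depth = 0

        def handle_starttag(self, tag, attrs):
            if tag in self.SKIP_CONTENT:
                self._skip_depth += 1
            for name, value in attrs:
                if name in self.KEEP_ATTRS and value:
                    self.chunks.append(value)

        def handle_endtag(self, tag):
            if tag in self.SKIP_CONTENT and self._skip_depth:
                self._skip_depth -= 1

        def handle_data(self, data):
            # Comments, PIs and the DTD never reach handle_data; here we only
            # drop text nested inside <script>/<style>.
            if not self._skip_depth:
                self.chunks.append(data)

    def extract_text(html_source):
        p = TextExtractor()
        p.feed(html_source)
        return " ".join(p.chunks)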
This done, you can use the *parsed* text elements and attributes (where
character entities like "&#896;" have been converted to their plain-text
equivalents) to feed a statistics counter if you are trying to detect the
language.
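A small sketch of that counting step, assuming the text has already been run
through a parser (or html.unescape) so that numeric references are real
characters again:

    from collections import Counter

    def char_stats(parsed_text):
        # Count only letters; entities were already decoded upstream,
        # so the counts reflect actual characters of the text.
        return Counter(ch.lower() for ch in parsed_text if ch.isalpha())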
You'll also have to consider the case where some or all of these text
elements and attributes are already marked with a language indicator. In that
case, the language autodetection should ignore them, and the character
statistics should instead be computed separately per indicated language.
This means that you'll end up with several statistic vectors, one for each
explicit language, plus one for the unspecified language (note that the
document headers or HTTP headers may include their own language indicator;
however, this indication is notoriously incorrect, especially in the HTTP
headers, because it is often generated from common headers or page templates
for a whole site, even if the HTML page uses another language).
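A rough sketch of keeping one vector per declared language, plus one for text
with no language indicator (again Python and purely illustrative; it tracks
only the HTML lang attribute and does not special-case void elements such as
<img>, which a real implementation would have to do):

    from collections import Counter, defaultdict
    from html.parser import HTMLParser

    class PerLanguageStats(HTMLParser):
        def __init__(self):
            super().__init__(convert_charrefs=True)
            self.vectors = defaultdict(Counter)  # lang tag (or None) -> Counter
            self._lang_stack = [None]            # None = unspecified language

        def handle_starttag(self, tag, attrs):
            # Inherit the innermost lang="" in scope when none is given here.
            lang = dict(attrs).get("lang") or self._lang_stack[-1]
            self._lang_stack.append(lang)

        def handle_endtag(self, tag):
            if len(self._lang_stack) > 1:
                self._lang_stack.pop()

        def handle_data(self, data):
            self.vectors[self._lang_stack[-1]].update(
                ch.lower() for ch in data if ch.isalpha())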
All the above remains specific to HTML. But there are other options to
consider that also apply to plain-text-only documents, without markup:
The other problem is that most composed pages forget to explicitly label the
foreign language used in small spans of text. These spans can be very
frequent, especially within technical documents (like a JavaDoc page, or a
document discussing standards, with lots of acronyms or untranslated terms).
To detect a language, you could also try searching for very common terms
like "the", "is", "are", "have", "and" in English; "le", "un", "a", "à",
"est", "et" in French; "der", "das", "ist" in German. These general terms
are exactly those that search engines generally ignore because of their
frequency in each language. I could have taken examples from other languages
than these three, which generally use the same ISO-8859-1 charset, but their
frequency in a text indicates that the document is probably not encoded in
ISO-8859-2, -4 or -15.
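A sketch of that stop-word heuristic, with deliberately tiny word lists taken
from the examples above (a real detector would use much larger lists and
weight them):

    STOPWORDS = {
        "en": {"the", "is", "are", "have", "and"},
        "fr": {"le", "un", "a", "à", "est", "et"},
        "de": {"der", "das", "ist"},
    }

    def guess_language(text):
        words = text.lower().split()
        scores = {lang: sum(w in vocab for w in words)
                  for lang, vocab in STOPWORDS.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] else None  # None when nothing matched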