Re: [long] Use of Unicode in AbiWord

From: Mark Liberman (myl@unagi.cis.upenn.edu)
Date: Wed Mar 24 1999 - 08:56:33 EST


John,

This is not directly about Unicode any longer -- except to the very
limited extent that it connects to the question of whether vocabulary
size puts a limit on how many hanzi would be plausible to encode --
but I think that your estimate of English speakers' vocabulary size
is too low, by at least a factor of three:

>Analogously, an advanced English-speaker will have a vocabulary of about
>20,000 to 30,000 words. An unabridged English dictionary will have about
>600,000.

Of course, any estimate of how many words an English speaker knows
depends on careful definitions of "word", "English speaker", and
"know". Plausible changes in these definitions can result in more
than an order of magnitude difference in estimates.

However, one careful study (Nagy and Herman, "Breadth and Depth of
Vocabulary Acquisition", in McKeown and Curtis, Eds., The Nature of
Vocabulary Acquisition, Erlbaum, 1987), with a rather conservative
definition of "word" (regular derivational and inflectional sets
counted only once; no phrases; no proper names; no acronyms; etc.),
and a plausible definition of "know" (in the passive sense of reading
knowledge), found that ordinary American high school graduates in
their sample "knew" about 40,000 "word families".

"Advanced" speakers will surely have a much larger passive
vocabulary than this. Using definitions and methods similar to Nagy
and Herman's -- checking a random sample from a large list -- I see
vocabulary estimates for many Penn undergraduates in the range of
70,000 to 90,000.
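
For anyone who wants to try the sampling method on themselves, here is
a minimal sketch of the arithmetic. It is not Nagy and Herman's actual
procedure, only the generic idea: draw a random sample from a large
word list, count the fraction the speaker knows, and scale that
fraction up to the size of the list. The names estimate_vocabulary,
word_list, and knows_word are hypothetical, and the error estimate
assumes a simple random sample that is small relative to the list.

```python
import math
import random


def estimate_vocabulary(word_list, knows_word, sample_size, seed=0):
    """Estimate how many entries of word_list a speaker knows.

    word_list   -- the large reference list (assumed to hold one entry
                   per "word family", however that is defined)
    knows_word  -- hypothetical callback returning True if the speaker
                   "knows" the word, for whatever definition is in use
    sample_size -- how many words to test
    Returns (estimate, standard_error), both in units of words.
    """
    rng = random.Random(seed)
    sample = rng.sample(word_list, sample_size)  # without replacement

    known = sum(1 for word in sample if knows_word(word))
    fraction = known / sample_size

    # Scale the known fraction up to the whole list.
    estimate = fraction * len(word_list)

    # Binomial standard error of the fraction, scaled the same way.
    # This ignores the finite-population correction.
    std_error = math.sqrt(fraction * (1 - fraction) / sample_size) * len(word_list)
    return estimate, std_error
```

Note that the result can never exceed the length of the list, so the
list has to be considerably larger than any plausible vocabulary, and
the definition of "word" is fixed by whatever the list counts as one
entry.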

If you introduce proper names and "words" with internal white space
(i.e. collocations whose meaning is not compositional), the estimates
(for whatever kind of speaker) would of course be much larger.

Estimating active vocabulary is much more difficult -- I don't know of any
satisfactory method -- but for the purposes of this list, I presume that
it's the passive vocabulary that matters.

--

-Mark Liberman


