Borgmann Project
List all the words in the English language
Chris Cole
Ur Studios, Inc.
Dmitri Alfred Borgmann
“Father of logology”
Recreational linguistics
Systematic wordplay
Author of two seminal books
Language on Vacation: An Olio of Orthographical Oddities (Scribner's, 1965)
Beyond Language (Adventures in Words and Thought) (Scribner's, 1967)
Founder of Word Ways: The Journal of Recreational Linguistics (1968)
How many words are in the English Language?
“The English language has a complement of somewhere between two million and three million "short" words...”-Dmitri Borgmann, Beyond Language, p. 226
How many words are in the largest unabridged dictionaries?
Philip Babcock Gove, Preface to Webster's Third New International Dictionary of the English Language, Unabridged (G. & C. Merriam, 1961), p. 7a:
“This dictionary has a vocabulary of over 450,000 words. It would have been easy to make the vocabulary larger although the book, in the format of the preceding edition, could hardly hold any more pages or be any thicker. By itself, the number of entries is, however, not of first importance. The number of words available is always far in excess of and for a one volume dictionary many times the number that can possibly be included.”
How many words are in the largest unabridged dictionaries?
John Simpson, Chief Editor, Oxford English Dictionary, Preface to the Third Edition, March 2000:
“There are a number of myths about the Oxford English Dictionary, one of the most prevalent of which is that it includes every word, and every meaning of every word, which has ever formed part of the English language. Such an objective could never be fully achieved. […] It is also often claimed that a ‘word’ is not a ‘word’ (or is not ‘English’) unless it is in ‘the dictionary’. This may be acceptable logic for the purposes of word games, but not outside those limits. […] It may be added here that the question ‘How many words are there in the English language?’ cannot be answered by recourse to a dictionary.”
How many words are in the largest unabridged dictionaries?
Victoria Neufeldt, editor of the Webster's New World family of dictionaries, quoted in Kenneth F. Kister, Kister's Best Dictionaries for Adults and Young People, A Comparative Guide, The Oryx Press, Phoenix, Arizona, 1992, p. 79:
“I hate the word "unabridged." It's stupid and misleading, since it is used for all large dictionaries, regardless of whether an abridged edition of a given dictionary exists; and also, because the word sort of implies the idea of completeness, it encourages the buyer to believe that the dictionary so described contains all the words of the language. No dictionary comes anywhere near doing that.”
What is a word?
A word is the smallest unit of meaning. Analogous to:
A letter is the smallest unit of spelling.
A phoneme is the smallest unit of pronunciation.
How many words are in the English language?
Unabridged dictionaries contain about 500,000 words.
If “many times” (Gove) implies a multiple of 4 to 6, then 2 to 3 million (Borgmann) is a reasonable estimate.
How to find these words?
The problem of names
A name is a word that designates an individual or a class of individuals.
The number of names is unlimited.
The problem of prefixes and suffixes
“countermeasures,” “countercountermeasures,” “countercountercountermeasures,” etc. are all understandable and distinct, hence words.
The problem of compounds
English is loose about closing open compounds, e.g., “airvent,” “air vent,” “air-vent.”
The problem of derived forms
Is "shanghaiings" a word?
shanghai (verb) → shanghaiing (participle) → shanghaiing (noun) → shanghaiings (plural)
If so, it is interesting because each letter in it occurs exactly twice.
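The letter-count claim is easy to check mechanically. The following is a minimal sketch; the function name `every_letter_twice` is chosen here for illustration and does not come from the slides.

```python
from collections import Counter


def every_letter_twice(word: str) -> bool:
    """Return True when each distinct letter of `word` occurs exactly twice."""
    counts = Counter(word.lower())
    # An empty string has no letters, so it does not qualify.
    return bool(counts) and all(n == 2 for n in counts.values())


if __name__ == "__main__":
    for candidate in ("shanghaiings", "shanghaiing"):
        print(candidate, every_letter_twice(candidate))
```

The plural qualifies because the final "s" pairs with the initial one; the singular "shanghaiing" does not.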
The problem of rare words
English comprises American, Canadian, British, Australian, etc. dialects.
Web3 lists words printed since 1752. OED lists many older forms. What is the cutoff?
In addition there are jargon, technical terms, slang, loan words, etc.
What is the English language?
Example: “amitular”
Incorrectly formed by analogy with “avuncular.” The ending “-ular” appended to “amit-” from the Latin “amita” (“aunt”), whereas the most appropriate adjective is probably “amital.”
Independently coined in 1982, 2003, 2004, 2007. Listed in several reference books.
Solving these problems
What does “word” mean?
Paradox of the heap: if you remove one grain of sand from a heap, it is still a heap, hence logically even one grain is a heap.
Resolution: the word “heap” is less likely to apply to smaller heaps.
“Word” is a vague term like “heap.”
Probability is the key
Wittgenstein: no private language.
To be in a language, a word must be understood by multiple speakers of that language.
The probability that a string is a word is just the probability that it will be understood by a speaker.
Solves previous problems
Names: specific names are understood by only a few speakers.
Prefixes and suffixes: highly stacked words are difficult to understand.
Compounds: most are in fact words.
Rare: understood by few speakers.
Derived forms: unusual derivations are difficult to understand.
Using dictionaries to determine probability
Words are included because of their likelihood of being useful to customers.
Example: early dictionaries did not include common words.
Example: “airvent” is not in any dictionary because its meaning is obvious.
The size limit applies to printed dictionaries, but not to electronic dictionaries.
Not going to be fixed by electronic dictionaries
It costs money to define a word.
Word inclusion requires a cost/benefit analysis.
Faulty assumption: words that are easily understood will be in the dictionary.
Using corpora to determine probability
Large corpora are available:
USENET: over one million distinct strings in one billion instances
Google: over ten million distinct strings occurring over 200 times in one trillion instances
Problem with using corpus to determine probability
Example: “countercountermeasure”
Defined in a college-level dictionary (11th Collegiate).
Google hits: initially reported as 1,000, really about 100.
Why is it not in the Google corpus? It does not reach the cutoff of 200 occurrences.
How many dictionary words are not in the corpus?
Examples of college-level words not in corpus: airmanships, airposts, airpowers
More than 40% of unabridged dictionary words are not in the Google corpus. The problem is the corpus cutoff requiring at least 200 occurrences.
                 College-level   Unabridged
In corpus              109,796      241,213
Not in corpus            9,882      200,523
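The share of dictionary words missing from the corpus follows directly from the table. The sketch below recomputes it from the four counts given above; the dictionary layout and the helper name `missing_share` are illustrative choices, not part of the original analysis.

```python
# Counts copied from the table above.
COUNTS = {
    "College-level": {"in_corpus": 109_796, "not_in_corpus": 9_882},
    "Unabridged": {"in_corpus": 241_213, "not_in_corpus": 200_523},
}


def missing_share(in_corpus: int, not_in_corpus: int) -> float:
    """Fraction of dictionary words that do not appear in the corpus."""
    return not_in_corpus / (in_corpus + not_in_corpus)


if __name__ == "__main__":
    for name, row in COUNTS.items():
        print(f"{name}: {missing_share(**row):.1%} not in corpus")
```

By this arithmetic roughly 8% of the college-level words and roughly 45% of the unabridged words are missing, so the figure of more than 40% corresponds to the unabridged column.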
Signal versus noise
Why is the Google corpus cutoff 200 occurrences?
Frequency          Count          Ratio
2,000,000,000      44             -
200,000,000        325            7.38636
20,000,000         3,844          11.8277
2,000,000          20,403         5.30775
200,000            83,972         4.11567
20,000             387,649        4.61641
2,000              2,134,600      5.50653
200                10,957,554     5.13331
20                 55,000,000     5
2                  275,000,000    5
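The Ratio column appears to be each count divided by the count in the row above it. The sketch below recomputes it under that assumption for the rows down to a frequency of 200. The last two rows are left out because they show round counts and a ratio of exactly 5, which suggests they may be extrapolated rather than measured; that reading is also an assumption.

```python
# (frequency threshold, number of distinct strings), copied from the table above.
ROWS = [
    (2_000_000_000, 44),
    (200_000_000, 325),
    (20_000_000, 3_844),
    (2_000_000, 20_403),
    (200_000, 83_972),
    (20_000, 387_649),
    (2_000, 2_134_600),
    (200, 10_957_554),
]


def successive_ratios(counts: list[int]) -> list[float]:
    """Ratio of each count to the one before it."""
    return [later / earlier for earlier, later in zip(counts, counts[1:])]


if __name__ == "__main__":
    counts = [count for _, count in ROWS]
    for (freq, _), ratio in zip(ROWS[1:], successive_ratios(counts)):
        print(f"{freq:>13,}  {ratio:.5f}")
```

Each tenfold drop in the frequency threshold multiplies the number of distinct strings by roughly 4 to 12, which is what makes the region below the cutoff so large.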
Signal versus noise
Too many noise words below 200 hits.
Hits          College-level   Unabridged   Words   Non-words
100           5               31           766     1,178
1,000         21              31           249     258
10,000        29              21           64      44
100,000       25              8            4       0
1,000,000     10              0            0       0
10,000,000    3               0            0       0
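One way to read the table is as a sample of strings at each hit level, sorted into four mutually exclusive groups. Under that assumption (the slides do not state how the sample was drawn), the share of non-words at each level can be sketched as follows; the helper name `noise_share` is illustrative.

```python
# hits: (college-level, unabridged, other words, non-words), copied from the table above.
SAMPLE = {
    100: (5, 31, 766, 1_178),
    1_000: (21, 31, 249, 258),
    10_000: (29, 21, 64, 44),
    100_000: (25, 8, 4, 0),
    1_000_000: (10, 0, 0, 0),
    10_000_000: (3, 0, 0, 0),
}


def noise_share(row: tuple[int, int, int, int]) -> float:
    """Fraction of sampled strings classified as non-words.

    Assumes the four columns partition the sample; this is an interpretation,
    not something the table states.
    """
    total = sum(row)
    return row[3] / total if total else 0.0


if __name__ == "__main__":
    for hits, row in SAMPLE.items():
        print(f"{hits:>10,} hits: {noise_share(row):.1%} non-words")
```

On this reading, non-words make up well over half of the sample at 100 hits and disappear by 100,000 hits.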
Example: 2747 strings that start with “air”
Samples of non-words: aircraaft aircracft aircract aircradft aircradt aircraf aircraf5 aircraf5t aircraf6 aircraf6t aircrafc aircrafct aircrafdt aircraff aircrafft aircrafg aircrafgt aircrafi aircrafr aircrafrt
Samples of words: airbagged airbalancer airball airballed airballoon airballs airband airbands airbanks airbath airbaths airbats airbattle airbeam airbeams airbear airbearing airbed airbeds airbell airbelt
Corpora are not the solution by themselves
Use versus mention, names, spam, etc.
Faulty assumption: a word that is easily understood will be used.
Modeling human understanding
Bayesian model of word understanding: neuroscience results give us reasons to believe that understanding can be modeled by Bayesian inference.
Generative model of word formation: linguistics gives us reasons to believe that word formation follows a predictable historical process.
Bayesian model of word understanding
Example: shanghaiings (plural of noun, p1) → shanghaiing (noun from participle, p2) → shanghaiing (participle from verb, p3) → shanghai (in dictionary, p4)
Probability = p1 * p2 * p3 * p4
Each pi is determined via Bayes’ Law from observed ratios of occurrences (of similar cases).
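The chain product can be sketched in a few lines. The slides do not say how each step probability is estimated, so the sketch below swaps in a simple stand-in: the posterior mean under a uniform Beta(1, 1) prior (Laplace's rule of succession) applied to counts of similar cases. The function names and every count in the example are hypothetical and serve only to show the shape of the calculation.

```python
from math import prod


def step_probability(attested: int, candidates: int) -> float:
    """Estimate one derivation step from counts of similar cases.

    Stand-in estimator (an assumption, not the slides' method): posterior mean
    under a uniform Beta(1, 1) prior, i.e. (attested + 1) / (candidates + 2).
    """
    return (attested + 1) / (candidates + 2)


def chain_probability(step_probs: list[float]) -> float:
    """Probability of the whole derivation: p1 * p2 * p3 * p4."""
    return prod(step_probs)


if __name__ == "__main__":
    # Hypothetical counts: (similar cases attested, similar cases examined).
    p1 = step_probability(90, 100)   # plural of noun
    p2 = step_probability(30, 100)   # noun from participle
    p3 = step_probability(98, 100)   # participle from verb
    p4 = 1.0                         # base word is in the dictionary
    print(f"P(shanghaiings) ~ {chain_probability([p1, p2, p3, p4]):.3f}")
```

Because the steps multiply, a single unusual derivation in the chain pulls the overall probability down sharply, which matches the point that unusual derivations are difficult to understand.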
Generative model of word formation
Rules of word formation (etymology, parallelism, sound change, spelling change, etc.)
Example: avunculus (Latin, “uncle”) → avuncular
amita (Latin, “aunt”) → amitular
Work in progress
Iterative approach:
Work “outward” from the dictionary using linguistic rules of word formation.
Work “inward” from the corpus using Bayesian inference on grammar rules.
Goal is a process instead of a list.
Work in progress
[Figure: words starting with “pro.” Legend: Compound Dictionary 500K, Dictionary 250k, Dictionary 125k, Dictionary 60k. Axis ticks run from 0 to 6000 and from 25% to 100%. Results from parsing 10 million sentences collected from USENET 1992.]