
Borgmann Project: List all the words in the English language

Chris Cole

Ur Studios, Inc.

Dmitri Alfred Borgmann

“Father of logology”

Recreational linguistics

Systematic wordplay

Author of two seminal books

Language on Vacation: An Olio of Orthographical Oddities (Scribner's, 1965)

Beyond Language (Adventures in Words and Thought) (Scribner's, 1967)

Founder of Word Ways: The Journal of Recreational Linguistics (1968)

How many words are in the English Language?

“The English language has a complement of somewhere between two million and three million "short" words...”-Dmitri Borgmann, Beyond Language, p. 226

How many words are in the largest unabridged dictionaries?

Philip Babcock Gove, Preface to Webster's Third New International Dictionary of the English Language, Unabridged (G. & C. Merriam, 1961), p. 7a:

“This dictionary has a vocabulary of over 450,000 words. It would have been easy to make the vocabulary larger although the book, in the format of the preceding edition, could hardly hold any more pages or be any thicker. By itself, the number of entries is, however, not of first importance. The number of words available is always far in excess of and for a one volume dictionary many times the number that can possibly be included.”


John Simpson, Chief Editor, Oxford English Dictionary, Preface to the Third Edition, March 2000:

“There are a number of myths about the Oxford English Dictionary, one of the most prevalent of which is that it includes every word, and every meaning of every word, which has ever formed part of the English language. Such an objective could never be fully achieved. […] It is also often claimed that a ‘word’ is not a ‘word’ (or is not ‘English’) unless it is in ‘the dictionary’. This may be acceptable logic for the purposes of word games, but not outside those limits. […] It may be added here that the question ‘How many words are there in the English language?’ cannot be answered by recourse to a dictionary.”


Victoria Neufeldt, editor of the Webster's New World family of dictionaries, quoted in Kenneth F. Kister, Kister's Best Dictionaries for Adults and Young People, A Comparative Guide, The Oryx Press, Phoenix, Arizona, 1992, p. 79:

“I hate the word "unabridged." It's stupid and misleading, since it is used for all large dictionaries, regardless of whether an abridged edition of a given dictionary exists; and also, because the word sort of implies the idea of completeness, it encourages the buyer to believe that the dictionary so described contains all the words of the language. No dictionary comes anywhere near doing that.”

What is a word?

A word is the smallest unit of meaning. Analogous to:

A letter is the smallest unit of spelling.

A phoneme is the smallest unit of pronunciation.

How many words are in the English language?

Unabridged dictionaries contain about 500,000 words. If “many times” (Gove) implies a multiple of 4 to 6, then 2 to 3 million (Borgmann) is a reasonable estimate.

How to find these words?

The problem of names

A name is a word that designates an individual or a class of individuals. Unlimited number of names.

The problem of prefixes and suffixes

“countermeasures,” “countercountermeasures,” “countercountercountermeasures,” etc. are all understandable and distinct, hence words.
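The stacking pattern can be generated mechanically, which is one way to see why no finite list can contain every such form. A minimal sketch; `stack_prefix` is a hypothetical helper name chosen for illustration.

```python
def stack_prefix(prefix: str, base: str, depth: int) -> str:
    """Return `base` with `prefix` applied `depth` times."""
    return prefix * depth + base


# The three forms quoted on the slide; any larger depth yields another distinct string.
words = [stack_prefix("counter", "measures", depth) for depth in range(1, 4)]
for word in words:
    print(word)
```

Nothing in the rule bounds `depth`, so the limit has to come from somewhere other than the rule itself.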

The problem of compounds

English is loose about closing open compounds, e.g., “airvent,” “air vent,” “air-vent.”

The problem of derived forms

Is "shanghaiings" a word?

shanghai (verb) → shanghaiing (participle) → shanghaiing (noun) → shanghaiings (plural)

If so, it is interesting because each letter in it occurs exactly twice.
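The letter-count property is easy to check. A minimal sketch, where `each_letter_occurs` is a hypothetical helper name:

```python
from collections import Counter


def each_letter_occurs(word: str, times: int) -> bool:
    """True if every distinct letter in `word` occurs exactly `times` times."""
    counts = Counter(word.lower())
    return all(n == times for n in counts.values())


print(each_letter_occurs("shanghaiings", 2))
print(each_letter_occurs("shanghaiing", 2))
```

The second call drops the plural "s", so the property only holds if the last step of the derivation is accepted.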

The problem of rare words

English comprises American, Canadian, British, Australian, etc. dialects.

Web3 lists words printed since 1752. OED lists many older forms. What is the cutoff?

In addition you have jargon, technical terms, slang, loan words, etc.

What is the English language?

Example: “amitular”

Incorrectly formed by analogy with “avuncular.” The ending “-ular” appended to “amit-” from the Latin “amita” (“aunt”), whereas the most appropriate adjective is probably “amital.”

Independently coined in 1982, 2003, 2004, 2007. Listed in several reference books.

Solving these problems

What does “word” mean?

Paradox of the heap: If you remove one grain of sand from a heap, it is still a heap, hence logically even one grain is a heap.

The word “heap” is less likely to apply to smaller heaps.

“Word” is a vague term like “heap.”

Probability is the key

Wittgenstein: no private language.

To be in a language, a word must be understood by multiple speakers of that language.

The probability that a string is a word is just the probability that it will be understood by a speaker.

Solves previous problems

Names: specific names are understood by only a few speakers

Prefixes and suffixes: highly stacked words are difficult to understand

Compounds: most are in fact words

Rare: understood by few speakers

Derived forms: unusual derivations are difficult to understand

Using dictionaries to determine probability

Words are included because of their likelihood of being useful to customers.

Example: Early dictionaries did not include common words.

Example: “airvent” is not in any dictionary because its meaning is obvious.

There is a limit to the size of printed dictionaries, but it does not apply to electronic dictionaries.

Not going to be fixed by electronic dictionaries

It costs money to define a word.

Word inclusion requires cost/benefit analysis.

Faulty assumption: Words that are easily understood will be in the dictionary.

Using corpora to determine probability

Large corpora are available:

USENET: over one million distinct strings in one billion instances

Google: over ten million distinct strings occurring over 200 times in one trillion instances

Problem with using corpus to determine probability

Example: “countercountermeasure”

Defined in a college-level dictionary (11th Collegiate).

Google initially reports 1000 hits; the real number is about 100.

Why is it not in the Google corpus? It does not have enough occurrences (at least 200 are required).

How many dictionary words are not in the corpus?

Examples of college-level words not in corpus: airmanships, airposts, airpowers

More than 40% of unabridged dictionary words (and about 8% of college-level words) are not in the Google corpus. The problem is the corpus cutoff requiring at least 200 occurrences.

                 College-level   Unabridged
In corpus              109,796      241,213
Not in corpus            9,882      200,523
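The share of dictionary words missing from the corpus follows directly from the counts in the table. A minimal sketch; the variable and function names are illustrative only.

```python
# Counts copied from the table above.
counts = {
    "College-level": {"in_corpus": 109_796, "not_in_corpus": 9_882},
    "Unabridged": {"in_corpus": 241_213, "not_in_corpus": 200_523},
}


def missing_share(in_corpus: int, not_in_corpus: int) -> float:
    """Fraction of a dictionary's words that do not appear in the corpus."""
    return not_in_corpus / (in_corpus + not_in_corpus)


shares = {name: missing_share(**c) for name, c in counts.items()}
for name, share in shares.items():
    print(f"{name}: {share:.1%} not in corpus")
```

On these counts the gap is concentrated in the unabridged word list, which is where rarer words live.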

Signal versus noise

Why is the Google corpus cutoff 200 occurrences?

Frequency        Count          Ratio
2,000,000,000    44             -
200,000,000      325            7.38636
20,000,000       3,844          11.8277
2,000,000        20,403         5.30775
200,000          83,972         4.11567
20,000           387,649        4.61641
2,000            2,134,600      5.50653
200              10,957,554     5.13331
20               55,000,000     5
2                275,000,000    5
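The Ratio column is each Count divided by the Count one row above it, that is, at ten times the frequency threshold. The sketch below recomputes it from the first eight rows. It also assumes, since the slide does not say so, that the round figures in the last two rows are extrapolations using a ratio of 5 rather than measured counts.

```python
# (frequency threshold, number of distinct strings) from the table above.
rows = [
    (2_000_000_000, 44),
    (200_000_000, 325),
    (20_000_000, 3_844),
    (2_000_000, 20_403),
    (200_000, 83_972),
    (20_000, 387_649),
    (2_000, 2_134_600),
    (200, 10_957_554),
]

# Growth in the number of strings each time the threshold drops tenfold.
ratios = [rows[i][1] / rows[i - 1][1] for i in range(1, len(rows))]
for (freq, count), ratio in zip(rows[1:], ratios):
    print(f"{freq:>13,} {count:>12,} {ratio:.5f}")

# Assumed extrapolation below the cutoff, using a ratio of 5 per tenfold drop.
ASSUMED_RATIO = 5
at_20 = rows[-1][1] * ASSUMED_RATIO
at_2 = at_20 * ASSUMED_RATIO
print(f"{at_20:,} {at_2:,}")
```

Under that assumption the estimates come out close to the 55,000,000 and 275,000,000 shown in the table.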

Signal versus noise

Too many noise words below 200 hits.

Hits          College-level   Unabridged   Words   Non-words
100                       5           31     766       1,178
1,000                    21           31     249         258
10,000                   29           21      64          44
100,000                  25            8       4           0
1,000,000                10            0       0           0
10,000,000                3            0       0           0
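One way to read the table is as a noise fraction per hit level. The sketch below assumes that each row is a sample of strings at roughly that hit count and that the four columns are disjoint categories; the slide does not state this, so treat it as an interpretation.

```python
# hits: (college-level, unabridged, words, non-words), copied from the table above.
samples = {
    100: (5, 31, 766, 1_178),
    1_000: (21, 31, 249, 258),
    10_000: (29, 21, 64, 44),
    100_000: (25, 8, 4, 0),
    1_000_000: (10, 0, 0, 0),
    10_000_000: (3, 0, 0, 0),
}

# Share of each sample classified as non-words, assuming disjoint columns.
noise_share = {
    hits: row[3] / sum(row) for hits, row in samples.items()
}
for hits, share in noise_share.items():
    print(f"{hits:>10,} hits: {share:.1%} non-words")
```

Under this reading, non-words are the majority of the sample at 100 hits and fall away as the hit count rises.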

Example: 2747 strings that start with “air”

Samples of non-words: aircraaft aircracft aircract aircradft aircradt aircraf aircraf5 aircraf5t aircraf6 aircraf6t aircrafc aircrafct aircrafdt aircraff aircrafft aircrafg aircrafgt aircrafi aircrafr aircrafrt

Samples of words: airbagged airbalancer airball airballed airballoon airballs airband airbands airbanks airbath airbaths airbats airbattle airbeam airbeams airbear airbearing airbed airbeds airbell airbelt

Corpora are not the solution by themselves

Use versus mention, names, spam, etc.

Faulty assumption: A word that is easily understood will be used.

Modeling human understanding

Bayesian model of word understanding: Neuroscience results give us reasons to believe that understanding can be modeled by Bayesian inference.

Generative model of word formation: Linguistics gives us reasons to believe that word formation follows a predictable historical process.

Bayesian model of word understanding

Example: shanghaiings (plural of noun, p1) → shanghaiing (noun from participle, p2) → shanghaiing (participle from verb, p3) → shanghai (in dictionary, p4)

Probability = p1 * p2 * p3 * p4

Each pi is determined via Bayes’ Law from observed ratios of occurrences (of similar cases).
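The chain above can be sketched numerically. The slides do not specify how each pi is estimated, so this sketch substitutes one simple Bayesian choice: the posterior mean of a success rate under a uniform Beta(1, 1) prior. All counts are hypothetical and exist only to make the arithmetic concrete.

```python
from math import prod


def posterior_mean(successes: int, trials: int) -> float:
    """Posterior mean of a Bernoulli rate under a uniform Beta(1, 1) prior."""
    return (successes + 1) / (trials + 2)


# Hypothetical counts: (similar cases that are attested, similar cases examined).
observed = {
    "p1: plural of noun": (90, 98),
    "p2: noun from participle": (30, 98),
    "p3: participle from verb": (95, 98),
    "p4: base word in dictionary": (98, 98),
}

steps = {name: posterior_mean(k, n) for name, (k, n) in observed.items()}
probability = prod(steps.values())

for name, p in steps.items():
    print(f"{name}: {p:.2f}")
print(f"Probability that the derived form is a word: {probability:.3f}")
```

Because the result is a product, a single unusual derivation step (here p2) dominates the outcome, which matches the claim that unusual derivations are difficult to understand.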

Generative model of word formation

Rules of word formation (etymology, parallelism, sound change, spelling change, etc.)

Example: avunculus (Latin, “uncle”) → avuncular

amita (Latin, “aunt”) → amitular

Work in progress

Iterative approach

Work “outward” from dictionary using linguistic rules of word formation.

Work “inward” from corpus using Bayesian inference on grammar rules.

Goal is a process instead of a list

Work in progress

[Chart. Series: Compound Dictionary 500K, Dictionary 250k, Dictionary 125k, Dictionary 60k. Axis ticks: 0 to 6000 in steps of 1000; 25% to 100% in steps of 25%.]

Words starting with “pro.” Results from parsing 10 million sentences collected from USENET 1992.