Post on 27-Jun-2020
Hypothesis Testing Hypothesis Generation
Corpus LinguisticsA Simple Introduction
Niko Schenkn.schenk@em.uni-frankfurt.de
Applied Computational Linguistics LabComputer Science Department /
Department of English- and American StudiesGoethe University Frankfurt, Germany
April 17, 2019
Niko Schenk Corpus Linguistics – Introduction 1 / 35
Hypothesis Testing Hypothesis Generation
1 Hypothesis Testing
2 Hypothesis Generation
Niko Schenk Corpus Linguistics – Introduction 2 / 35
Hypothesis Testing Hypothesis Generation
1 Hypothesis Testing
2 Hypothesis Generation
Niko Schenk Corpus Linguistics – Introduction 3 / 35
Hypothesis Testing Hypothesis Generation
“Correct” vs. “Incorrect” Use of Language
Figure: Source: http://www.elektrojournal.at/bilder/d166/Saturn_Claim.jpg
Niko Schenk Corpus Linguistics – Introduction 4 / 35
Hypothesis Testing Hypothesis Generation
Soo! muss Technik
Regarding the slogan four things are striking:
Missing main verb “sein” (verbal ellipsis?/“ungrammatical”)Orthographic variation “soo” vs. “so” (spelling “error”)Misleading punctuation “soo! ...”, all letters capitalized...
Question: How would you objectively/formally test that, “soo”—compared to“so”—is not “regular” language?
Dictionary...Better: → www.google.com!
Niko Schenk Corpus Linguistics – Introduction 5 / 35
Hypothesis Testing Hypothesis Generation
Comparing Frequencies of Contrastive Elements
(a) Google results for “so” (b) Google results for “soo”
Niko Schenk Corpus Linguistics – Introduction 6 / 35
Hypothesis Testing Hypothesis Generation
Refining Comparisons of Contrastive Elements
(a) Google results for “so schon” (b) Google results for “soo schon”
Niko Schenk Corpus Linguistics – Introduction 7 / 35
Hypothesis Testing Hypothesis Generation
Congratulations!
We’ve successfully performed our first corpus linguistic search.
How is that different from using a dictionary?Answer:
1 consult tons of real data instead single of contemporary rule.
2 use tendencies instead of absolute true/false answer (Thedictionary claims that “soo” is false or—even worse–that thephrase does not exist).
Niko Schenk Corpus Linguistics – Introduction 8 / 35
Hypothesis Testing Hypothesis Generation
Language Change-An Example
Figure: “Thrived” vs. ...
Niko Schenk Corpus Linguistics – Introduction 9 / 35
Hypothesis Testing Hypothesis Generation
Language Change-An Example cont’d
Figure: ... “throve”.
Niko Schenk Corpus Linguistics – Introduction 10 / 35
Hypothesis Testing Hypothesis Generation
Language Change-An Example cont’d
Question: How would you objectively/formally show that “throve” is obsolete/nolonger in use?
You could ask a native speaker...(Much) better: → Google books Ngram Viewer
Niko Schenk Corpus Linguistics – Introduction 11 / 35
Hypothesis Testing Hypothesis Generation
Figure: Distributions of “thrived” vs. “throve” in the Google books corpus.
Possible explanation: Low-frequency words change to fit the main paradigm.Niko Schenk Corpus Linguistics – Introduction 12 / 35
Hypothesis Testing Hypothesis Generation
Again, how is that different from asking a native speaker?Answer: consult real data instead of intuition of one individualspeaker.
Niko Schenk Corpus Linguistics – Introduction 13 / 35
Hypothesis Testing Hypothesis Generation
What is Corpus Linguistics? (I)
Starting with a linguistic phenomenon (see previous examples) and a hypothesis, youuse
large textual resources (a corpus!) and software to objectively test(falsify/verify) the hypothesis.
hypothesis testing is usually based on frequencies obtained through search.
Niko Schenk Corpus Linguistics – Introduction 14 / 35
Hypothesis Testing Hypothesis Generation
1 Hypothesis Testing
2 Hypothesis Generation
Niko Schenk Corpus Linguistics – Introduction 15 / 35
Hypothesis Testing Hypothesis Generation
Extracting Useful Information
Contrary to the previous examples, you don’t have to start with a concrete hypothesis.
You could just “do something” with the corpus itself:
e.g., compute various statistics and inspect the output.
Usually you count words, phrases, etc. This is done automatically with the helpof computer programs.
→ As a result, you can come up with a hypothesis.
Niko Schenk Corpus Linguistics – Introduction 16 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Co-occurrence probabilities between words...
Niko Schenk Corpus Linguistics – Introduction 17 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 18 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 19 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 20 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 21 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 22 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 23 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 24 / 35
Hypothesis Testing Hypothesis Generation
Motivation
Niko Schenk Corpus Linguistics – Introduction 25 / 35
Hypothesis Testing Hypothesis Generation
How to Automatically Find Collocations
Example: Automatically collect words which co-occur more frequently than whatwould be expected:
Generated hypothesis: → These words are idiomatic expressions, proper names...Niko Schenk Corpus Linguistics – Introduction 26 / 35
Hypothesis Testing Hypothesis Generation
How to Automatically Detect Similar Facebook Users
Based on the data, you could just count the words which are used by different peopleand compare the numbers.
Niko Schenk Corpus Linguistics – Introduction 27 / 35
Hypothesis Testing Hypothesis Generation
How to Automatically Detect Similar Facebook Users
Generated hypothesis: → Groups of people with the same opinion share a similarvocabulary.
Niko Schenk Corpus Linguistics – Introduction 28 / 35
Hypothesis Testing Hypothesis Generation
How to Automatically Identify the Author of a Text
The same technique can be applied to find the author of a document.
For each document (written by an author), count the number of distinct wordswhich he/she uses.
Niko Schenk Corpus Linguistics – Introduction 29 / 35
Hypothesis Testing Hypothesis Generation
How to Automatically Identify the Author of a Text
●
●
●
● ●
●
●
●●
●
●
●
●
●
●
● ●
●
●
●
●
●
●
●
●
●
●
●
●●
●
●
●
0.30 0.35 0.40 0.45 0.50
0.25
0.30
0.35
0.40
0.45
0.50
standard type / token ratio
type
(le
mm
a) /
toke
n ra
tio
christopher.txt
christopher2.txt
instructions.txt
lisa.txt lisa2.txt
lisa3.txt
lisa4.txt
lisa5.txtlisa6.txt
marie−luise.txt
marie−luise2.txt
marie−luise3.txt
marie−luise4.txt
miriam.txt
miriam2.txt
miriam3.txtmiriam4.txt
miriam5.txt
miriam6.txt
philipp.txt
simon.txt
simon2.txt
simon3.txt
thu.txt
thu2.txt
thu3.txt
thu4.txt
thu5.txt
thu6.txtthu7.txt
thu8.txt
viktor.txt
viktor2.txt
Niko Schenk Corpus Linguistics – Introduction 30 / 35
Hypothesis Testing Hypothesis Generation
Hypothesis Generation
Generated hypothesis:→ Students appearing closer together in the visualization are similar in language use.
Niko Schenk Corpus Linguistics – Introduction 31 / 35
Hypothesis Testing Hypothesis Generation
What is Corpus Linguistics? (II)
Starting with the corpus and no specific hypothesis, you use
large textual resources and statistics to detect contrastive (interesting)patterns, i.e. you generate a hypothesis.
Niko Schenk Corpus Linguistics – Introduction 32 / 35
Hypothesis Testing Hypothesis Generation
Summary
Corpus Linguistics can be divided into two parts
1 hypothesis testing
2 hypothesis generation
You usually use
large textual (linguistic) resources which are electronically available.
software to analyze (search) the data.
Frequencies are essential.
Niko Schenk Corpus Linguistics – Introduction 33 / 35
Hypothesis Testing Hypothesis Generation
Homework Assignment
Niko Schenk Corpus Linguistics – Introduction 34 / 35
Hypothesis Testing Hypothesis Generation
Corpus Linguistics—An Example
Google offers an exploratory search functionality as a corpus linguistic application.
Cf. Google Books Corpus1 / Google Ngram Viewer2
1http://googlebooks.byu.edu/x.asp – AE: 155 billion words2Cf. http://books.google.com/ngrams/
Niko Schenk Corpus Linguistics – Introduction 35 / 35