Corpus Linguistics - A Simple Introduction

Corpus Linguistics - A Simple Introduction. Niko Schenk, Institut für England- und Amerikastudien, Goethe-Universität Frankfurt am Main. Summer Term 2017, April 19, 2017.


Page 1: Corpus Linguistics - A Simple Introduction

Hypothesis Testing Hypothesis Generation

Corpus Linguistics - A Simple Introduction

Niko Schenk

Institut für England- und Amerikastudien, Goethe-Universität Frankfurt am Main

Summer Term 2017

April 19, 2017

Niko Schenk Corpus Linguistics

Page 2: Corpus Linguistics - A Simple Introduction


1 Hypothesis Testing

2 Hypothesis Generation

Page 4: Corpus Linguistics - A Simple Introduction

“Correct” vs. “Incorrect” Use of Language

Figure: Source: http://www.elektrojournal.at/bilder/d166/Saturn_Claim.jpg

Page 5: Corpus Linguistics - A Simple Introduction

Soo! muss Technik - cont’d

Regarding the slogan, four things are striking:

Missing main verb “sein” (verbal ellipsis?/“ungrammatical”)

Orthographic variation “soo” vs. “so” (spelling “error”)

Misleading punctuation “soo! ...”, all letters capitalized...

Question: How would you objectively/formally test that “soo”, compared to “so”, is not “regular” language?

Dictionary...

Better: → www.google.de!

Page 6: Corpus Linguistics - A Simple Introduction

Comparing Frequencies of Contrastive Elements

Figure: Google results for “so”

Figure: Google results for “soo”

Page 7: Corpus Linguistics - A Simple Introduction


Refining Comparisons of Contrastive Elements

Figure: Google results for “so schön”

Figure: Google results for “soo schön”
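The comparison of contrastive elements can also be tried offline on any text you have at hand. The following is a minimal Python sketch under stated assumptions, not the procedure used on the slides (which compares Google hit counts): the helper names `tokenize` and `phrase_count` and the toy corpus are invented for illustration.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into word tokens (punctuation is dropped)."""
    return re.findall(r"\w+", text.lower())

def phrase_count(tokens, phrase):
    """Count how often a word or multi-word phrase occurs in a token list."""
    target = tokenize(phrase)
    n = len(target)
    return sum(1 for i in range(len(tokens) - n + 1) if tokens[i:i + n] == target)

# Hypothetical toy corpus; a real study would use a large collection of texts.
toy_corpus = (
    "Das ist so schön. Es ist soo schön hier! "
    "So muss das sein. Das war so gut und so schön."
)
tokens = tokenize(toy_corpus)

for variant in ("so", "soo", "so schön", "soo schön"):
    count = phrase_count(tokens, variant)
    # The relative frequency makes counts comparable across corpora of different size.
    print(f"{variant!r}: {count} hits, relative frequency {count / len(tokens):.3f}")
```

The point is the same as with the search engine: the answer is a tendency (one variant is much more frequent than the other), not a yes/no verdict.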

Page 8: Corpus Linguistics - A Simple Introduction


Congratulations!

We’ve successfully performed our first corpus linguistic search.

How is that different from using a dictionary?

Answer:

1 consult real data instead of a single contemporary rule.

2 use tendencies instead of an absolute true/false answer (dictionary).

Page 9: Corpus Linguistics - A Simple Introduction

Language Change - An Example

Figure: “Thrived” vs. ...

Page 10: Corpus Linguistics - A Simple Introduction

Language Change - An Example cont’d

Figure: ... “throve”.

Page 11: Corpus Linguistics - A Simple Introduction

Language Change - An Example cont’d

Question: How would you objectively/formally show that “throve” is obsolete/no longer in use?

You could ask a native speaker...

(Much) better: → Google Books Ngram Viewer

Page 12: Corpus Linguistics - A Simple Introduction

Figure: Distributions of “thrived” vs. “throve” in the Google Books corpus.

Possible explanation: Low-frequency words change to fit the main paradigm.
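The idea behind such a diachronic plot can be sketched in a few lines of Python: for each time slice, compute the share of one variant among all occurrences of both variants. This is only an illustration, assuming a small hand-made corpus keyed by year; the function name `variant_share` and the example sentences are invented and are not data from the Google Books corpus.

```python
import re
from collections import Counter

def variant_share(texts_by_year, variant_a, variant_b):
    """For each year, return the share of variant_a among both variants (None if neither occurs)."""
    shares = {}
    for year, text in sorted(texts_by_year.items()):
        counts = Counter(re.findall(r"\w+", text.lower()))
        a, b = counts[variant_a], counts[variant_b]
        total = a + b
        shares[year] = a / total if total else None
    return shares

# Invented toy sentences, for illustration only.
toy_texts = {
    1850: "The colony throve. Trade throve as well, and the town thrived.",
    1950: "The business thrived. The arts thrived, though one old firm throve.",
}

shares = variant_share(toy_texts, "thrived", "throve")
for year, share in shares.items():
    print(year, f"share of 'thrived': {share:.2f}")
```

Using a share rather than raw counts keeps the comparison fair when the time slices differ in size.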

Page 13: Corpus Linguistics - A Simple Introduction


Again, how is that different from asking a native speaker?

Answer: consult real data instead of the intuition of one individual speaker.

Page 14: Corpus Linguistics - A Simple Introduction

What is Corpus Linguistics? (I)

Starting with a linguistic phenomenon (see previous examples) and a hypothesis, you use

large textual resources (a corpus!) and software to objectively test (falsify/verify) the hypothesis.

Hypothesis testing is usually based on frequencies obtained through search.

Page 15: Corpus Linguistics - A Simple Introduction

1 Hypothesis Testing

2 Hypothesis Generation

Page 16: Corpus Linguistics - A Simple Introduction

Extracting Useful Information

Contrary to the previous examples, you don’t have to start with a concrete hypothesis.

You could just “do something” with the corpus itself:

e.g., compute various statistics and inspect the output.

Usually you count words, phrases, etc. This is done automatically with the help of computer programs.

→ Afterwards, you can come up with a hypothesis.
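Counting words and phrases automatically takes very little code. Below is a minimal Python sketch using only the standard library; the function name `frequency_list` and the toy sentence are assumptions made for illustration, and "phrases" are approximated here by pairs of adjacent words (bigrams).

```python
import re
from collections import Counter

def frequency_list(items, n=3):
    """Return the n most frequent items together with their counts."""
    return Counter(items).most_common(n)

# Hypothetical toy text; in practice this would be read from corpus files.
toy_text = "the cat sat on the mat and the dog sat on the rug"
tokens = re.findall(r"\w+", toy_text.lower())

# Pairs of adjacent words serve as a simple stand-in for "phrases".
bigrams = list(zip(tokens, tokens[1:]))

top_words = frequency_list(tokens)
top_bigrams = frequency_list(bigrams)

print("Most frequent words:", top_words)
print("Most frequent bigrams:", top_bigrams)
```

Inspecting such lists is one way to get from "just doing something" with the corpus to a hypothesis worth testing.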

Page 17: Corpus Linguistics - A Simple Introduction

How to Automatically Find Collocations

Example: Automatically collect words which co-occur more frequently than what would be expected:

Generated hypothesis: → These words are idiomatic expressions, proper names...
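"More frequently than what would be expected" can be made concrete with an association measure. The sketch below uses pointwise mutual information (PMI), which compares the observed probability of a word pair with the probability expected if the two words were independent. PMI is one common choice and an assumption here; the slides do not say which measure was used. The function name `pmi_scores`, the `min_count` threshold and the toy text are likewise invented for illustration.

```python
import math
import re
from collections import Counter

def pmi_scores(text, min_count=2):
    """Score adjacent word pairs by PMI = log2( P(w1, w2) / (P(w1) * P(w2)) )."""
    tokens = re.findall(r"\w+", text.lower())
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # very rare pairs give unreliable scores
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_tokens
        p_w2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores

# Hypothetical toy text, far too small for real collocation extraction.
toy_text = "I moved to New York. New York is big. The new house is big. York is old."
scores = pmi_scores(toy_text)

for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(pair, round(score, 2))
```

A positive score means the pair occurs together more often than chance would predict. On a real corpus the top-ranked pairs are the candidates to inspect by hand.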

Page 18: Corpus Linguistics - A Simple Introduction


How to Automatically Detect Similar Facebook Users

Based on the data, you could just count the words which are used by different people and compare the numbers.

Page 19: Corpus Linguistics - A Simple Introduction

How to Automatically Detect Similar Facebook Users

Generated hypothesis: → Groups of people with the same opinion share a similar vocabulary.
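One simple way to "count the words and compare the numbers" is to turn each user's text into a word-count vector and measure how similar two vectors are. The sketch below uses cosine similarity, which is an assumption on our part (the slides do not name a measure). The user names and posts are invented, and `word_counts` and `cosine_similarity` are illustrative helper names.

```python
import math
import re
from collections import Counter

def word_counts(text):
    """Map each word in the text to its frequency."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors (0 = no shared words, 1 = same profile)."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Invented posts by hypothetical users.
posts = {
    "user_a": "lower taxes help small business owners",
    "user_b": "small business owners want lower taxes",
    "user_c": "the concert last night was amazing",
}
profiles = {user: word_counts(text) for user, text in posts.items()}

sim_ab = cosine_similarity(profiles["user_a"], profiles["user_b"])
sim_ac = cosine_similarity(profiles["user_a"], profiles["user_c"])
print("user_a vs user_b:", round(sim_ab, 2))
print("user_a vs user_c:", round(sim_ac, 2))
```

Users whose vocabulary overlaps get a score close to 1, users with no words in common get 0. With real data you would typically remove very frequent function words first, since they overlap for almost everyone.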

Page 20: Corpus Linguistics - A Simple Introduction

How to Automatically Identify the Author of a Text

The same technique can be applied to find the author of a document.

For each document (written by an author), count the number of distinct words which he/she uses.
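Counting distinct words per document leads directly to the type/token ratio: the number of distinct words (types) divided by the total number of words (tokens). Here is a minimal Python sketch, assuming a simple regular-expression tokenizer; the function name `type_token_ratio`, the file names and the texts are invented for illustration. A lemma-based variant would additionally need a lemmatizer, which is not shown.

```python
import re

def type_token_ratio(text):
    """Number of distinct words (types) divided by the total number of words (tokens)."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Hypothetical documents; real input would be one text file per author.
toy_documents = {
    "author_a.txt": "the cat and the dog and the bird",
    "author_b.txt": "a quick brown fox jumps over lazy dogs",
}

for name, text in sorted(toy_documents.items()):
    print(name, round(type_token_ratio(text), 3))
```

Note that the ratio drops as texts get longer, because frequent words repeat, so it is best compared across documents of similar length.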

Page 21: Corpus Linguistics - A Simple Introduction

How to Automatically Identify the Author of a Text

Figure: Scatter plot of student texts. x-axis: standard type / token ratio; y-axis: type (lemma) / token ratio (both axes range from about 0.25 to 0.50). Each point is one document: texts by christopher, lisa, marie-luise, miriam, philipp, simon, thu and viktor, plus instructions.txt.

Page 22: Corpus Linguistics - A Simple Introduction

Hypothesis Generation

Generated hypothesis: → Students appearing closer together in the visualization are similar in language use.

Page 23: Corpus Linguistics - A Simple Introduction

What is Corpus Linguistics? (II)

Starting with the corpus and no specific hypothesis, you use

large textual resources and statistics to detect contrastive (interesting) patterns, i.e. you generate a hypothesis.

Page 24: Corpus Linguistics - A Simple Introduction

Summary

Corpus Linguistics can be divided into two parts

1 hypothesis testing

2 hypothesis generation

You usually use

large textual (linguistic) resources which are electronically available.

software to analyze (search) the data.

Page 26: Corpus Linguistics - A Simple Introduction

Corpus Linguistics—An Example

Google offers specialized (exploratory) search as a corpus linguistic application.

Cf. Google Books Corpus [1] / Google Ngram Viewer [2]

[1] http://googlebooks.byu.edu/x.asp – AE: 155 billion words

[2] Cf. http://books.google.com/ngrams/
