Exploring the internal heterogeneity of a corpus of...

21
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1] D Exploring the internal heterogeneity of a corpus of Classical French with DiaCollo Bryan Jurish Annette Gerstenberg Berlin-Brandenburgische Akademie der Wissenschaften Freie Universit¨ at Berlin [email protected] [email protected] Global Philology Open Conference Universit¨ at Leipzig 22 nd February, 2016

Transcript of Exploring the internal heterogeneity of a corpus of...

Page 1: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]

D

Exploring the internal heterogeneity of a corpus ofClassical French with DiaCollo

Bryan Jurish Annette GerstenbergBerlin-Brandenburgische Akademie der Wissenschaften Freie Universitat Berlin

[email protected] [email protected]

Global Philology Open Conference

Universitat Leipzig

22nd February, 2016

Page 2: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]

Overview

The Situation

p

Diachronic (heterogeneous) Text Corpora

p

Collocation Profiling

p

Diachronic Collocation Profiling

DiaCollo

p

Requests & Parameters

p

Profile, Diffs & Indices

APWCF Corpus

p

Background

p

Subcorpora

p

Sources & Enrichments

Examples

Summary & Conclusion

Page 3: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [2]

The Situation: Diachronic Text Corpora

p

heterogeneous text collections, especially with respect to date of origin

t other partitionings may be relevant too, e.g. by genre, location, etc.

p

increasing number available for linguistic & humanities research, e.g.

t Deutsches Textarchiv (DTA) (Geyken 2013)

t Referenzkorpus Altdeutsch (DDD) (Richling 2011)

t Corpus of Historical American English (COHA) (Davies 2012)

p

. . . but even putatively “synchronic” corpora have a temporal extension, e.g.

t DWDS/ZEIT (Kohl ∼ politician vs. “cabbage”) (1946–2016)

t DDR Presseportal (Ausreise ∼ “departure”) (1945–1993)

t DWDS/Blogs (“Browser”) (1994–2016)

p

should expose temporal effects of e.g. semantic shift, discourse trends

p

problematic for conventional natural language processing tools

t implicit assumptions of homogeneity

Page 4: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [3]

The Situation: Collocation Profiling

“You shall know a word by the company it keeps”— J. R. Firth

Basic Idea (Church & Hanks 1990; Manning & Schutze 1999; Evert 2005)

p

lookup all candidate collocates (w2) occurring with the target term (w1)

p

rank candidates by association score

t “chance” co-occurrences with high-frequency items must be filtered out!

t statistical methods require large data sample

What for?

p

computational lexicography (Kilgarriff & Tugwell 2002; Didakowski & Geyken 2013)

p

neologism detection (Kilgarriff et al. 2015)

p

distributional semantics (Schutze 1992; Sahlgren 2006)

p

“text mining” / “distant reading” (Heyer et al. 2006; Moretti 2013)

Page 5: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [4]

Diachronic Collocation Profiling

The Problem: (temporal) heterogeneity

p

conventional collocation extractors assume corpus homogeneity

p

co-occurrence frequencies are computed only for word-pairs (w1, w2)

p

influence of occurrence date (and other document properties) is irrevocably lost

A Solution (sketch)

p

represent terms as n-tuples of independent attributes (including occurrence date)

t alternative: “document” level co-occurrences over sparse TDF matrix

p

partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)

p

collect independent slice-wise profiles into final result set

Advantages Drawbacks

t full support for diachronic axis t sparse data requires larger corpora

t variable query-level granularity t computationally expensive

t flexible attribute selection t large index size

t multiple association scores t no syntactic relations (yet)

Page 6: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [5]

DiaCollo: Requests & Parameters

p

Perl API, RESTful web-service (Fielding 2000) + web-form GUI

p

accepts user requests as set of parameter=value pairs

p

parameter passing via URL query string or HTTP POST request

p

common parameters:

Parameter Description

query target collocant lemma(ta), regular expression, or DDC query

date target date(s), interval, or regular expression

slice epoch granularity or “0” (zero) for a date-independent profile

groupby projected collocate attributes with optional restrictions

score association score function for collocate ranking

kbest maximum number of collocate items to return per epoch

diff binary score comparison operation for diff profiles

global request global profile pruning (vs. default epoch-local pruning)

profile profile type to be computed ({native,tdf,ddc} × {unary,diff})

format output format or visualization mode (e.g. TSV, JSON, HTML, d3-cloud, . . .)

Page 7: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [6]

DiaCollo: Profiles, Diffs & Indices

Profiles & Diffs

p

simple request → unary profile for target term(s) (profile, query)

t filtered & projected to selected attribute(s) (groupby)

t trimmed to k-best collocates for target word(s) (score, kbest, global)

t aggregated into independent epoch-wise sub-intervals (date, slice)

p

diff request → comparison of two independent targets (profile, bquery, . . . )

t highlights differences or similarities of target queries (diff)

t can be used to compare different words (query 6= bquery)

. . . or different corpus subsets w.r.t. a given word (e.g. date 6= bdate)

Indices & Attributes

p

compile-time filtering of native indices: frequency threshholds, PoS-tags

p

default index attributes: Lemma (l), Pos (p)

p

finer-grained queries possible with TDF or DDC back-ends

p

batteries not included: corpus preprocessing, analysis, & full-text search index

t see e.g. Jurish (2003); Geyken & Hanneforth (2006); Jurish et al. (2014), . . .

Page 8: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [7]

DiaCollo: Scoring & Comparison Functions

Selected Association Score Functions

p

f raw collocation frequency = f12

p

lf collocation log-frequency = log2(f12 + ε)

p

mi1 pointwise mutual information ≈ log2

f12×N

f1×f2

p

milf pointwise MI × log-frequency ≈ log2

f12×N

f1×f2× log2 f12

p

ll log-likelihood (Dunning 1993) ≈ sgn(f12|f1, f2) × log(1 + log λ)

p

ld log-Dice coefficient (Rychly 2008) ≈ 14 + log22×f12

f1+f2

Selected Diff Operations

p

diff raw score difference = sa − sb

p

adiff absolute score difference = |sa − sb|

p

avg arithmetic average = 1

2(sa + sb)

p

max maximum = max{sa, sb}

p

min minimum = min{sa, sb}

p

havg harmonic average ≈ 2×sa×sb

sa+sb

Page 9: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [8]

APWCF: From diplomatic correspondence to a corpus of Classical French

p

Classical French: same structures as modern French, but

t linguistic norm is not yet stable

t variation and patterns of usage of grammatical features

t linguistic change on the levels of semantics and pragmatics

p

Acta Pacis Westphalicae (1643–1648): The French correspondence

t Diplomatic letters between Paris (government) and diplomats at Munster

t Ambassadors are committed to achieving diplomatic goals

convincing the government to adapt the instructions

t Diplomatic letters: formal constraints versus expressive needs

p

Linguistic interest

t Diachronic variation: comparison with existing resources of Classical French

t Synchronic variation: genre-internal heterogeneity

p

Genre-internal heterogeneity: hypothesis of different levels of formality

t Two subcorpora: ”government” (Paris) and ”ambassadors” (Munster)

t Register-variation reflected in the use of linguistic variables

Page 10: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [9]

APWCF: Correspondence

Page 11: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [10]

APWCF: Subcorpora

Page 12: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [11]

APWCF: Data

p

Letters mostly conserved in French Archives

t Archives du Ministere des Affaires Etrangeres

t microforms: Zentrum fur Historische Forschung, Bonn

p

Digital edition (PDF, XML)

t Bayerische Staatsbibliothek

t Zentrum fur historische Forschung Bonn

p

Linguistic corpus: AG

t Part-of-Speech Tagging (PRESTO, Cologne/Lyon)

t XML / TXM (Lyon)

p

Corpus size: 8 volumes of French edition, 2.4M Tokens

Page 13: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [12]

APWCF: Transcription

Diplomatic transcription, spelling variants preserved

p

traittes vs. traittez vs. traittez

p

estat (old) or etat (mod.) as appearing in the manuscript

p

Punctuation almost preserved, but. . .

Modernized

p

Some adaptations of punctuation

p

u/v, i/j modernized

p

Capitalization of proper names and titles

p

Diacritics normalized (lavis → l’avis, francais → francais)

p

Abbreviated titles/words: full form

Page 14: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [13]

Examples

Page 15: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [13]

Example 1: ledict (chancellery style)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=mi1&f=html ...query: doc.loc=Paris slice: 0

∼query: doc.loc=Munster ∼slice: 0

score: mi1 groupby: w=ledict

Paris Munster

N 2,462,443 2,462,443

f1 746,786 1,153,939

f2 = f12 284 414

score (mi1) 1.721 1.093

diff (Paris - Munster) 0.6278

p

simple example uses “unigrams” comparison profile (f2 = f12)

p

pointwise mutual information (mi1) score function

p

“Paris” shows definite preference the archaic form ledict (chancellery style)

Page 16: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [14]

Example 2: PLAIRE (situational context)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0

∼query: doc.loc=Munster kbest: 25

score: ll groupby: w,l=PLAIRE

plaira

plairoit

plaisoit

plait

plaist

plueplaire

plaisant

plaisante

plaîtplaiseplaisent

plaict

p

situational context: lemma PLAIRE, e.g. s’il vous plaıt (“please”)

p

log-likelihood association scores (robust)

p

attribute-cloud visualization: warm colors ∼ Paris, cool colors ∼ Munster

Page 17: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [15]

Example 3: plaıt vs. plaist (orthographic variation)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0

∼query: doc.loc=Munster score: ll

groupby: w=plaıt|plaist

p

zoom: “plaist/plaıt de ’s’il vous plaıst”

p

more frequent in Munster

p

relatively more frequent use of the archaic variant plaist

p

transcription in general respects orthographic variation

t typically not transparent in historical editions

Page 18: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [16]

Example 4: POUVOIR (request & response)

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0

∼query: doc.loc=Munster kbest: 25

score: ll groupby: w,l=POUVOIR

pouvant

pouvionspeustpouvons

pourrions

pouvez

peuvent

peultpourrons

pourroit

pourrez

pouvoir

puissiez

pouvoirs

puissions

pourront

puisse

pussent

puissent

pu

pourra

peut

pouvois

pouvans

pouvoient

p

Paris: request “you could” (puissiez/pourrez)

p

Munster: response “we could/will be able” (pouvions/pourrons)

Page 19: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [17]

Example 5: Speech Acts

http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris kbest: 50

∼query: doc.loc=Munster score: ll

groupby: w,l=ESCRIRE|ECRIRE|PARLER|COMMUNICQUER|COMMUNIQUER|DISCUTER

parlasmes

escrivit

parleront

parlois

escriroit

communiquer

escriptes

écrireparlera

parloit

escrittes

parlons

escrivismesescript

escrive

escrira

escrivoit

escrivonsescriroient

parlions

parleronsescriront discuté

communiquécommunicqué

parlent

escris

parlé

escriray

escrite

escrivois

escrivisse

escrivis

escrivés

escripre

parlèrent

escrivezdiscuter

parliez

parleray

comuniquerparloient

escripte

escrivent

escrirons

escritesparla

parlay

parlant

escriviez

p

Diplomatic negotiations: overt speech act verbs

p

Paris: discuter, discute, escrivez, escrivois, . . .

p

Munster: comuniquer, escrivons, escrivismes, parlasmes, parlerent, . . .

Page 20: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [18]

Summary & Conclusion

Diachronic Collocation Profiling

p

diachronic text corpora semantic shift, discourse trends

p

conventional tools implicit assumptions of homogeneity

p

diachronic profiling date-dependent lexemes

DiaCollo

p

on-the-fly corpus partitioning arbitrary query granularity

p

DDC/D* integration fine-grained queries, corpus KWIC links

p

RESTful web service external API, online visualization

APWCF + DiaCollo

p

metadata-based filtering location-specific profiles

p

“diff” profile mode inter-subcorpus comparisons

p

metadata-based aggregation subcorpus preference profiles

Page 21: Exploring the internal heterogeneity of a corpus of ...kaskade.dwds.de/~jurish/pubs/jg2017apwcf-diacollo-slides.pdf · Exploring the internal heterogeneity of a corpus of ... e.g.

2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [19]

— The End —

treuwirklich

lieb

herzlich

lächeln

gut

schön

persönlich

warm

letzte lieb

dankenklein

glücklich

kurzliebenswürdig

jungganzfreundschaftlich

freundlich

Thank you for listening!

http://kaskade.dwds.de/dstar/apwcf/diacollo

http://metacpan.org/release/DiaColloDB

APWCF: http://wikis.fu-berlin.de/pages/viewpage.action?pageId=594936338