Exploring the internal heterogeneity of a corpus of...
Transcript of Exploring the internal heterogeneity of a corpus of...
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]
D
Exploring the internal heterogeneity of a corpus ofClassical French with DiaCollo
Bryan Jurish Annette GerstenbergBerlin-Brandenburgische Akademie der Wissenschaften Freie Universitat Berlin
[email protected] [email protected]
Global Philology Open Conference
Universitat Leipzig
22nd February, 2016
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [1]
Overview
The Situation
p
Diachronic (heterogeneous) Text Corpora
p
Collocation Profiling
p
Diachronic Collocation Profiling
DiaCollo
p
Requests & Parameters
p
Profile, Diffs & Indices
APWCF Corpus
p
Background
p
Subcorpora
p
Sources & Enrichments
Examples
Summary & Conclusion
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [2]
The Situation: Diachronic Text Corpora
p
heterogeneous text collections, especially with respect to date of origin
t other partitionings may be relevant too, e.g. by genre, location, etc.
p
increasing number available for linguistic & humanities research, e.g.
t Deutsches Textarchiv (DTA) (Geyken 2013)
t Referenzkorpus Altdeutsch (DDD) (Richling 2011)
t Corpus of Historical American English (COHA) (Davies 2012)
p
. . . but even putatively “synchronic” corpora have a temporal extension, e.g.
t DWDS/ZEIT (Kohl ∼ politician vs. “cabbage”) (1946–2016)
t DDR Presseportal (Ausreise ∼ “departure”) (1945–1993)
t DWDS/Blogs (“Browser”) (1994–2016)
p
should expose temporal effects of e.g. semantic shift, discourse trends
p
problematic for conventional natural language processing tools
t implicit assumptions of homogeneity
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [3]
The Situation: Collocation Profiling
“You shall know a word by the company it keeps”— J. R. Firth
Basic Idea (Church & Hanks 1990; Manning & Schutze 1999; Evert 2005)
p
lookup all candidate collocates (w2) occurring with the target term (w1)
p
rank candidates by association score
t “chance” co-occurrences with high-frequency items must be filtered out!
t statistical methods require large data sample
What for?
p
computational lexicography (Kilgarriff & Tugwell 2002; Didakowski & Geyken 2013)
p
neologism detection (Kilgarriff et al. 2015)
p
distributional semantics (Schutze 1992; Sahlgren 2006)
p
“text mining” / “distant reading” (Heyer et al. 2006; Moretti 2013)
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [4]
Diachronic Collocation Profiling
The Problem: (temporal) heterogeneity
p
conventional collocation extractors assume corpus homogeneity
p
co-occurrence frequencies are computed only for word-pairs (w1, w2)
p
influence of occurrence date (and other document properties) is irrevocably lost
A Solution (sketch)
p
represent terms as n-tuples of independent attributes (including occurrence date)
t alternative: “document” level co-occurrences over sparse TDF matrix
p
partition corpus on-the-fly into user-specified intervals (“date slices”, “epochs”)
p
collect independent slice-wise profiles into final result set
Advantages Drawbacks
t full support for diachronic axis t sparse data requires larger corpora
t variable query-level granularity t computationally expensive
t flexible attribute selection t large index size
t multiple association scores t no syntactic relations (yet)
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [5]
DiaCollo: Requests & Parameters
p
Perl API, RESTful web-service (Fielding 2000) + web-form GUI
p
accepts user requests as set of parameter=value pairs
p
parameter passing via URL query string or HTTP POST request
p
common parameters:
Parameter Description
query target collocant lemma(ta), regular expression, or DDC query
date target date(s), interval, or regular expression
slice epoch granularity or “0” (zero) for a date-independent profile
groupby projected collocate attributes with optional restrictions
score association score function for collocate ranking
kbest maximum number of collocate items to return per epoch
diff binary score comparison operation for diff profiles
global request global profile pruning (vs. default epoch-local pruning)
profile profile type to be computed ({native,tdf,ddc} × {unary,diff})
format output format or visualization mode (e.g. TSV, JSON, HTML, d3-cloud, . . .)
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [6]
DiaCollo: Profiles, Diffs & Indices
Profiles & Diffs
p
simple request → unary profile for target term(s) (profile, query)
t filtered & projected to selected attribute(s) (groupby)
t trimmed to k-best collocates for target word(s) (score, kbest, global)
t aggregated into independent epoch-wise sub-intervals (date, slice)
p
diff request → comparison of two independent targets (profile, bquery, . . . )
t highlights differences or similarities of target queries (diff)
t can be used to compare different words (query 6= bquery)
. . . or different corpus subsets w.r.t. a given word (e.g. date 6= bdate)
Indices & Attributes
p
compile-time filtering of native indices: frequency threshholds, PoS-tags
p
default index attributes: Lemma (l), Pos (p)
p
finer-grained queries possible with TDF or DDC back-ends
p
batteries not included: corpus preprocessing, analysis, & full-text search index
t see e.g. Jurish (2003); Geyken & Hanneforth (2006); Jurish et al. (2014), . . .
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [7]
DiaCollo: Scoring & Comparison Functions
Selected Association Score Functions
p
f raw collocation frequency = f12
p
lf collocation log-frequency = log2(f12 + ε)
p
mi1 pointwise mutual information ≈ log2
f12×N
f1×f2
p
milf pointwise MI × log-frequency ≈ log2
f12×N
f1×f2× log2 f12
p
ll log-likelihood (Dunning 1993) ≈ sgn(f12|f1, f2) × log(1 + log λ)
p
ld log-Dice coefficient (Rychly 2008) ≈ 14 + log22×f12
f1+f2
Selected Diff Operations
p
diff raw score difference = sa − sb
p
adiff absolute score difference = |sa − sb|
p
avg arithmetic average = 1
2(sa + sb)
p
max maximum = max{sa, sb}
p
min minimum = min{sa, sb}
p
havg harmonic average ≈ 2×sa×sb
sa+sb
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [8]
APWCF: From diplomatic correspondence to a corpus of Classical French
p
Classical French: same structures as modern French, but
t linguistic norm is not yet stable
t variation and patterns of usage of grammatical features
t linguistic change on the levels of semantics and pragmatics
p
Acta Pacis Westphalicae (1643–1648): The French correspondence
t Diplomatic letters between Paris (government) and diplomats at Munster
t Ambassadors are committed to achieving diplomatic goals
convincing the government to adapt the instructions
t Diplomatic letters: formal constraints versus expressive needs
p
Linguistic interest
t Diachronic variation: comparison with existing resources of Classical French
t Synchronic variation: genre-internal heterogeneity
p
Genre-internal heterogeneity: hypothesis of different levels of formality
t Two subcorpora: ”government” (Paris) and ”ambassadors” (Munster)
t Register-variation reflected in the use of linguistic variables
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [9]
APWCF: Correspondence
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [10]
APWCF: Subcorpora
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [11]
APWCF: Data
p
Letters mostly conserved in French Archives
t Archives du Ministere des Affaires Etrangeres
t microforms: Zentrum fur Historische Forschung, Bonn
p
Digital edition (PDF, XML)
t Bayerische Staatsbibliothek
t Zentrum fur historische Forschung Bonn
p
Linguistic corpus: AG
t Part-of-Speech Tagging (PRESTO, Cologne/Lyon)
t XML / TXM (Lyon)
p
Corpus size: 8 volumes of French edition, 2.4M Tokens
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [12]
APWCF: Transcription
Diplomatic transcription, spelling variants preserved
p
traittes vs. traittez vs. traittez
p
estat (old) or etat (mod.) as appearing in the manuscript
p
Punctuation almost preserved, but. . .
Modernized
p
Some adaptations of punctuation
p
u/v, i/j modernized
p
Capitalization of proper names and titles
p
Diacritics normalized (lavis → l’avis, francais → francais)
p
Abbreviated titles/words: full form
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [13]
Examples
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [13]
Example 1: ledict (chancellery style)
http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=mi1&f=html ...query: doc.loc=Paris slice: 0
∼query: doc.loc=Munster ∼slice: 0
score: mi1 groupby: w=ledict
Paris Munster
N 2,462,443 2,462,443
f1 746,786 1,153,939
f2 = f12 284 414
score (mi1) 1.721 1.093
diff (Paris - Munster) 0.6278
p
simple example uses “unigrams” comparison profile (f2 = f12)
p
pointwise mutual information (mi1) score function
p
“Paris” shows definite preference the archaic form ledict (chancellery style)
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [14]
Example 2: PLAIRE (situational context)
http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0
∼query: doc.loc=Munster kbest: 25
score: ll groupby: w,l=PLAIRE
plaira
plairoit
plaisoit
plait
plaist
plueplaire
plaisant
plaisante
plaîtplaiseplaisent
plaict
p
situational context: lemma PLAIRE, e.g. s’il vous plaıt (“please”)
p
log-likelihood association scores (robust)
p
attribute-cloud visualization: warm colors ∼ Paris, cool colors ∼ Munster
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [15]
Example 3: plaıt vs. plaist (orthographic variation)
http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0
∼query: doc.loc=Munster score: ll
groupby: w=plaıt|plaist
p
zoom: “plaist/plaıt de ’s’il vous plaıst”
p
more frequent in Munster
p
relatively more frequent use of the archaic variant plaist
p
transcription in general respects orthographic variation
t typically not transparent in historical editions
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [16]
Example 4: POUVOIR (request & response)
http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris slice: 0
∼query: doc.loc=Munster kbest: 25
score: ll groupby: w,l=POUVOIR
pouvant
pouvionspeustpouvons
pourrions
pouvez
peuvent
peultpourrons
pourroit
pourrez
pouvoir
puissiez
pouvoirs
puissions
pourront
puisse
pussent
puissent
pu
pourra
peut
pouvois
pouvans
pouvoient
p
Paris: request “you could” (puissiez/pourrez)
p
Munster: response “we could/will be able” (pouvions/pourrons)
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [17]
Example 5: Speech Acts
http://kaskade.dwds.de/dstar/apwcf/diacollo/?as=0&bs=0&p=d1&sf=ll&f=cloud ...query: doc.loc=Paris kbest: 50
∼query: doc.loc=Munster score: ll
groupby: w,l=ESCRIRE|ECRIRE|PARLER|COMMUNICQUER|COMMUNIQUER|DISCUTER
parlasmes
escrivit
parleront
parlois
escriroit
communiquer
escriptes
écrireparlera
parloit
escrittes
parlons
escrivismesescript
escrive
escrira
escrivoit
escrivonsescriroient
parlions
parleronsescriront discuté
communiquécommunicqué
parlent
escris
parlé
escriray
escrite
escrivois
escrivisse
escrivis
escrivés
escripre
parlèrent
escrivezdiscuter
parliez
parleray
comuniquerparloient
escripte
escrivent
escrirons
escritesparla
parlay
parlant
escriviez
p
Diplomatic negotiations: overt speech act verbs
p
Paris: discuter, discute, escrivez, escrivois, . . .
p
Munster: comuniquer, escrivons, escrivismes, parlasmes, parlerent, . . .
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [18]
Summary & Conclusion
Diachronic Collocation Profiling
p
diachronic text corpora semantic shift, discourse trends
p
conventional tools implicit assumptions of homogeneity
p
diachronic profiling date-dependent lexemes
DiaCollo
p
on-the-fly corpus partitioning arbitrary query granularity
p
DDC/D* integration fine-grained queries, corpus KWIC links
p
RESTful web service external API, online visualization
APWCF + DiaCollo
p
metadata-based filtering location-specific profiles
p
“diff” profile mode inter-subcorpus comparisons
p
metadata-based aggregation subcorpus preference profiles
2017-02-22 / Jurish, Gerstenberg / APWCF + DiaCollo [19]
— The End —
treuwirklich
lieb
herzlich
lächeln
gut
schön
persönlich
warm
letzte lieb
dankenklein
glücklich
kurzliebenswürdig
jungganzfreundschaftlich
freundlich
Thank you for listening!
http://kaskade.dwds.de/dstar/apwcf/diacollo
http://metacpan.org/release/DiaColloDB
APWCF: http://wikis.fu-berlin.de/pages/viewpage.action?pageId=594936338