Mining the 20th Century’s History from Folgert Karsdorp, Mike Kestemont, Antal van den Bosch, Walter Daelemans & Dan Roth Guest Lecture, AI Course, ULB, 9 May 2014


What is TIME?

• Weekly newsmagazine

• American

• Largest of its kind

• ~25M readers

• Famous cover

• Founded 1923

• Co-founded by Henry Luce (among others)

• Readers without time

• “digest”

• First of its kind

History

Journalism

• “American”

• Conservative

• Opinionated (!)

• “Timese” style

Continuity since 1923

http://content.time.com/time/archive

Corpus statistics

• 1923-2006

• M. Davies

• TIME U.S.

• ~194M words

• ~873K unique

• ~274K documents

• Stanford NLP

[Figure: Documents (y-axis, ticks 2000 to 4000) by Year (x-axis, 1920 to 2000)]

[Figure: Words (y-axis, ticks 500 to 1250) by Year (x-axis, 1920 to 2000)]

Culturomics

• Reflection of cultural phenomena in language

• Michel et al. 2011

• Google Books

• Typically:

• Diachronic aspect

• Large corpora

• Also Leetaru 2011

Frequentist assumption

If something is “important”, it will be mentioned a lot in “large corpora”

Parsimonious Language Models

• Hiemstra et al. 2004

• IR context: term relevance

• Compact language model per document

• Low probabilities for

• stopwords

• rare terms

• Probabilistic TF-IDF
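The estimation behind a parsimonious model can be sketched as the EM procedure described by Hiemstra et al. (2004). A document model P(t|D) is mixed with a corpus-wide background model P(t|C), and probability mass shifts towards terms that the background explains poorly. The function name, the default lambda and the fixed iteration count are illustrative assumptions, and the sketch assumes the corpus counts include the document itself.

```python
from collections import Counter


def parsimonious_lm(doc_tokens, corpus_tokens, lam=0.1, iterations=50):
    """Estimate a parsimonious document model P(t|D) with EM.

    Assumption: corpus_tokens includes the tokens of the document, so every
    document term has a non-zero background probability P(t|C).
    """
    tf = Counter(doc_tokens)
    corpus_tf = Counter(corpus_tokens)
    corpus_total = sum(corpus_tf.values())

    # Background model P(t|C): relative frequency in the whole corpus.
    p_c = {t: corpus_tf[t] / corpus_total for t in tf}

    # Start from the maximum-likelihood document model.
    doc_total = sum(tf.values())
    p_d = {t: tf[t] / doc_total for t in tf}

    for _ in range(iterations):
        # E-step: expected count of each term explained by the document model.
        e = {
            t: tf[t] * lam * p_d[t] / ((1 - lam) * p_c[t] + lam * p_d[t])
            for t in tf
        }
        # M-step: renormalise the expected counts into probabilities.
        norm = sum(e.values())
        p_d = {t: e[t] / norm for t in tf}
    return p_d
```

With a small lambda, terms that are frequent everywhere (stopwords) lose almost all of their probability, while terms specific to the document gain it. This is the behaviour the bullets above summarise as a probabilistic TF-IDF.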

Control for specificity

The model brings an IR perspective to historical analysis: “If somebody were to search for a document from the forties, which terms would (s)he use?”


Setup

• Build composite documents:

• per year

• per decade (e.g. sixties)

• Run PLM, only on nouns (NN, NNS)

• Extract characteristic vocabulary: P(w|D)

• Automated characterization of historical periods

• Zeitgeist analysis
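A minimal sketch of the composite-document step, assuming each article is available as a year plus a list of (token, POS tag) pairs. The input layout, the lowercasing and the function names are assumptions for illustration, not the pipeline used in the talk.

```python
from collections import defaultdict


def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1963 -> 1960."""
    return (year // 10) * 10


def composite_documents(articles, key=decade_of, noun_tags=("NN", "NNS")):
    """Concatenate the nouns of all articles that share a period.

    articles: iterable of (year, [(token, pos_tag), ...]) pairs (assumed layout).
    key: function mapping a year to a period label; pass `lambda y: y`
         to build one composite document per year instead of per decade.
    """
    composites = defaultdict(list)
    for year, tagged_tokens in articles:
        nouns = [tok.lower() for tok, tag in tagged_tokens if tag in noun_tags]
        composites[key(year)].extend(nouns)
    return dict(composites)
```

Each resulting token list can then be treated as a single document when estimating the characteristic vocabulary of a year or decade.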

Decade?


Tune Lambda

[lambda = 0.1] [lambda = 0.01] [lambda = 0.001]

Evaluation?

• Quantitative evaluation not straightforward

• Bring in historian

• Evidence = self-referential, self-explanatory

• … because you already know the corpus well

• Better suited for new, unknown corpora

Breaking points?

• No more “manually” setting periods

• Extract 5000 top-characteristic words from each year

• Variability-Based Nearest Neighbour Clustering

• Identification of temporal stages in diachronic data

• Gries & Hilpert 2008

• Cluster tree only merges adjacent nodes (e.g. 1943-1944)

• Ward linkage, cosine distance
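The adjacency constraint can be sketched in a few lines: at every step only neighbouring periods are candidates for merging. Note that this simplified version compares cluster centroids by cosine distance rather than using Ward linkage as on the slide. The function names are hypothetical and the vectors are assumed to be non-zero.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def mean_vector(vectors):
    """Component-wise mean of equally long vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def adjacent_clustering(years, vectors, n_clusters):
    """Greedily merge temporally adjacent clusters until n_clusters remain.

    years and vectors must be in chronological order. Centroid linkage is a
    simplification of the Ward linkage mentioned above.
    """
    clusters = [([y], [v]) for y, v in zip(years, vectors)]
    while len(clusters) > n_clusters:
        # Only pairs (i, i + 1) are considered, never distant periods.
        distances = [
            cosine_distance(mean_vector(clusters[i][1]),
                            mean_vector(clusters[i + 1][1]))
            for i in range(len(clusters) - 1)
        ]
        i = distances.index(min(distances))
        merged = (clusters[i][0] + clusters[i + 1][0],
                  clusters[i][1] + clusters[i + 1][1])
        clusters[i:i + 2] = [merged]
    return [cluster_years for cluster_years, _ in clusters]
```

Recording the order and distance of each merge would give the cluster tree. Large jumps in merge distance are then candidate breaking points between periods.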

TIME 100

• 20th century's 100 most influential people (1999, now yearly)

• Five categories:

• Leaders & Revolutionaries

• Builders & Titans

• Artists & Entertainers

• Scientists & Thinkers

• Heroes & Icons

Wikifier

• Problems:

• Which Clinton?

• Anne + Frank...

• Cross-document coreference resolution

• “Wikification”

<a href="http://en.wikipedia.org/wiki/Ronald_Reagan">Ronald Reagan</a> launched his campaign to make <a href="http://en.wikipedia.org/wiki/United_States">America</a> great again in <a href="http://en.wikipedia.org/wiki/Detroit">Detroit</a> in <a href="http://en.wikipedia.org/wiki/1980">1980</a> . Let's go back to the <a href="http://en.wikipedia.org/wiki/Detroit">Motor City</a> and hold our <a href="http://en.wikipedia.org/wiki/2016">2016</a> national <a href="http://en.wikipedia.org/wiki/United_States_presidential_nominating_convention">nomination convention</a> in <a href="http://en.wikipedia.org/wiki/Detroit">Detroit</a> .

adolf_hitler

akio_morita

alan_turing

albert_einstein

alexander_fleming

amadeo_giannini

andrei_sakharov

anne_frank

aretha_franklin

bart_simpson

bill_gates

bill_w.

billy_graham

bob_dylan

bruce_lee

charles_e._merrill

charles_lindbergh

charlie_chaplin

che_guevara

coco_chanel

david_ben-gurion

david_sarnoff

diana,_princess_of_wales

edmund_hillary

edwin_hubble

eleanor_roosevelt

emmeline_pankhurst

enrico_fermi

francis_crick

frank_sinatra

franklin_d._roosevelt

g.i._(military)

harvey_milk

helen_keller

henry_ford

ho_chi_minh

igor_stravinsky

jackie_robinson

james_joyce

jean_piaget

jim_henson

john_maynard_keynes

jonas_salk

juan_trippe

kennedy_family

kurt_gödel

le_corbusier

lech_wałęsa

leo_baekeland

leo_burnett

louis_armstrong

louis_b._mayer

louis_leakey

lucille_ball

lucky_luciano

ludwig_wittgenstein

mao_zedong

margaret_sanger

margaret_thatcher

marilyn_monroe

marlon_brando

martha_graham

martin_luther_king,_jr.

mikhail_gorbachev

mother_teresa

muhammad_ali

nelson_mandela

oprah_winfrey

pablo_picasso

pelé

pete_rozelle

philo_farnsworth

pope_john_paul_ii

rachel_carson

robert_h._goddard

rodgers_and_hammerstein

ronald_reagan

rosa_parks

ruhollah_khomeini

sam_walton

sigmund_freud

steven_spielberg

t._s._eliot

tank_man

tenzing_norgay

the_beatles

theodore_roosevelt

thomas_watson,_jr.

tim_berners-lee

vladimir_lenin

walt_disney

walter_reuther

william_levitt

william_shockley

willis_carrier

winston_churchill

wright_brothers

[Figure legend, CATEGORY: ArtistsAndEntertainers, BuildersAndTitans, HeroesAndIcons, LeadersAndRevolutionaries, ScientistsAndThinkers]

Scientists & Thinkers Heroes & Icons

A. Einstein Person of Century?

Simple measure

Criterion: Longest continuous span of trimesters of TIME issues in which a person is mentioned...

Shared 1st place with Joyce, Eliot, Roosevelt, ...
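The criterion above amounts to finding the longest run of consecutive periods in which a person is mentioned. A minimal sketch follows, assuming mentions have already been mapped to integer period indices. The helper `period_index` and its default of three months per period are assumptions, since the slide does not define the trimester boundaries.

```python
def period_index(year, month, months_per_period=3):
    """Map a (year, month) pair to a consecutive integer period index.

    Assumption: a trimester is a block of three calendar months.
    """
    return (year * 12 + (month - 1)) // months_per_period


def longest_continuous_span(mentioned_periods):
    """Length of the longest run of consecutive period indices."""
    best = 0
    current = 0
    previous = None
    for period in sorted(set(mentioned_periods)):
        if previous is not None and period == previous + 1:
            current += 1  # the run continues
        else:
            current = 1   # a gap: start a new run
        best = max(best, current)
        previous = period
    return best
```

Because the measure saturates once a person is mentioned in every period of the corpus, several people can tie for first place, as the slide notes.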

Person of the Year

• 11/12/13...

• Since 1927 (Lindbergh)

• Most influence

• For better or worse

• Sometimes deviant (“You”)

• Predict this

Methods

• Cast task as ranking problem: POY@1

• TIME publishes shortlist

• More insightful: inspect top ranking

• Evaluation:

• Mean Reciprocal Rank (cutoff @20)

• Accuracy @10, @5, @1
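Both evaluation measures are easy to state in code. The sketch below assumes one ranked candidate list and one actual Person of the Year per evaluated year. The function names are illustrative.

```python
def reciprocal_rank(ranking, winner, cutoff=20):
    """1 / rank of the winner within the top `cutoff`, or 0.0 if absent."""
    for rank, candidate in enumerate(ranking[:cutoff], start=1):
        if candidate == winner:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(rankings, winners, cutoff=20):
    """Average reciprocal rank over all evaluated years."""
    scores = [reciprocal_rank(r, w, cutoff) for r, w in zip(rankings, winners)]
    return sum(scores) / len(scores)


def accuracy_at_k(rankings, winners, k):
    """Fraction of years in which the winner appears in the top k."""
    hits = sum(1 for r, w in zip(rankings, winners) if w in r[:k])
    return hits / len(winners)
```

Accuracy @1 only rewards an exact hit, whereas MRR also gives partial credit when the winner sits just below the top. That makes MRR the more informative measure when the top of the ranking is inspected rather than a single prediction.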

Learning to Rank

• Steps:

1. Retrieve 100 candidates via DF (baseline)

2. Extract features for each candidate

3. Rerank 100 via Learning to Rank

• Ranklib (LambdaMART)

Feature Types

• Frequency

• Topical

• Time series

• Similarity

• Network

Frequency features

• Document frequency (baseline)

• Frequency increase wrt previous year
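A sketch of the document-frequency baseline and the two frequency features, assuming every document has already been reduced to the list of person identifiers that the Wikifier found in it. The data layout is an assumption, as is the choice to measure the increase as an absolute difference in DF.

```python
from collections import Counter


def document_frequency(docs_entities):
    """Count, per person, the number of documents mentioning that person."""
    df = Counter()
    for entities in docs_entities:
        df.update(set(entities))  # count each person once per document
    return df


def frequency_features(df_this_year, df_prev_year, top_n=100):
    """Frequency features for the top_n candidates ranked by DF.

    Assumption: "increase" is the absolute DF difference to the previous year.
    """
    candidates = [person for person, _ in df_this_year.most_common(top_n)]
    return {
        person: {
            "df": df_this_year[person],
            "df_increase": df_this_year[person] - df_prev_year.get(person, 0),
        }
        for person in candidates
    }
```

Sorting candidates by the `df` value alone reproduces the baseline ranking. The reranker receives the same candidates with the additional features attached.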

Topical features

• Topic metadata (science, politics, music, …)

• # of topics person appears in

• Topical variance: coefficient of variation over topics
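The topical features can be sketched as follows, assuming that for each person we have a count of documents per topic label, with zeros included for topics in which the person never appears. The input layout and the use of the population standard deviation are assumptions.

```python
import statistics


def topical_features(topic_counts):
    """Number of topics and coefficient of variation of counts over topics.

    topic_counts: dict mapping every topic label to the number of documents
    in which the person appears under that topic (zeros included).
    """
    counts = list(topic_counts.values())
    if not counts:
        return {"n_topics": 0, "topical_cv": 0.0}
    mean = statistics.mean(counts)
    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.pstdev(counts) / mean if mean > 0 else 0.0
    return {
        "n_topics": sum(1 for c in counts if c > 0),
        "topical_cv": cv,
    }
```

A low coefficient of variation means the person is spread evenly over topics, while a high value means the mentions are concentrated in one or a few topics.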

Time series

Longest continuous time span

(Jimmy Carter, diachronic DF)

Similarity features

• Distributional properties of persons:

• Similarity to previous POY(s)

• Similarity to “year” of election

• Word2vec (Mikolov et al. 2013)

PageRank Centrality

[Eisenhower POY 1959]
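A minimal power-iteration sketch of PageRank over a person network. How the network in the talk was built is not stated on the slide. The sketch assumes a graph in which two persons are linked when they are mentioned in the same document. The damping factor and iteration count are conventional defaults, not values from the talk.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """PageRank by power iteration.

    edges: dict mapping each node to an iterable of nodes it links to.
    For an undirected co-mention graph, list every link in both directions.
    """
    nodes = set(edges) | {t for targets in edges.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = list(edges.get(node, ()))
            if targets:
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling node: distribute its rank over all nodes.
                share = damping * rank[node] / n
                for target in nodes:
                    new_rank[target] += share
        rank = new_rank
    return rank
```

In such a graph a person who co-occurs with many other well-connected persons receives a high score, which can then be used as a network feature for that year.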

Overall results

For the ~7000 persons mentioned each year

              Baseline   LTR
MRR@20        .28        .43
Top-10 (%)    .53        .64
Top-5 (%)     .46        .57
Top-1 (%)     .16        .31
Top-100 (%)   .91 (upper bound)

Ablation experiments

                              MRR@20
All features                  .43
without frequency features    .37
without topical features      .35
without temporal features     .37
without network features      .31
without similarity features   .34

Adding features from previous years

                 MRR@20
Election year    .33
+1 year back     .43
+2 years back    .41
+3 years back    .41
+4 years back    .40
+5 years back    .37

2013 elections…!

• Fewer documents (ca. 1000)

• Worse metadata after 2006

• Wikifier 2011

• Give it a try anyway…

Our top-10

1. B. Obama

2. Vl. Putin

3. M. Cyrus

4. M. Zuckerberg

5. A. Schwarzenegger

6. B. Bernanke

7. S. Jobs

8. Pope Francis

9. L. Grossman

10. A. Jolie

Shortlist

• B. Assad

• J. Bezos

• T. Cruz

• M. Cyrus (our #3)

• Pope Francis (our #8)

• B. Obama (our #1)

• H. Rouhani

• K. Sebelius

• E. Snowden

• E. Windsor

Pope Francis

Our number 8... (Snowden?)

Discussion (1)


• Jeff Bezos

• POY in 1999

• DF ranks him @93

• System reranks him @1

Discussion (2)

“Dispersion” beats frequency (DF):

• E.g. topical variety ranks A. Jolie high

Discussion (3)

• Newcomers are difficult to predict

• Bias towards slowly emerging candidates:

• Model has a very smooth view of history

• Conservative choice: similarity!

• Our system seems more “objective”:

• more women

• more bad guys

Modeling the unmodelable?

• Accuracy versus insight

• Top-10 is easy — Top-1 remains hard

• World knowledge (cf. Guardian and Snowden)

• Can we truly model such an ad hoc decision?

• Do it again next year? (Shared task?)

• Other data than just Time?

• Twitter (but conservative choice…)