October 6, 2015 WAHSP/BILAND Towards flexible and stable CLARIN-supported open-source...

July 4, 2022 WAHSP/BILAND Towards flexible and stable CLARIN-supported open-source web-applications for historical data-mining in public media


Page 1: October 6, 2015 WAHSP/BILAND Towards flexible and stable CLARIN-supported open-source web-applications for historical data-mining in public media.

April 21, 2023

WAHSP/BILAND

Towards flexible and stable CLARIN-supported open-source web-applications for historical data-mining in public media

Page 2:

WAHSP/BILAND

Research team: Stephen Snelders (UU), Pim Huijnen (UU), Daan Odijk (ISLA, UvA), Fons Laan (ISLA), Maarten de Rijke (ISLA), Toine Pieters (UU)

Page 3:

Research

Creating big-data resources

Page 4:

National Library of the Netherlands, Digital Newspaper Archive

> 10,000,000 pages

> 1200 titles

1618-1995

> 30,000,000 articles

Still growing...

Page 5:

How did/do you study 30 million newspaper articles?

Page 6:

Dutch press on Germany, Frank van Vree (1989)

> 1200 titles

1618-1995

> 31,000,000 articles

4

1930-1939

4,000

Sampling

Page 7:

Research

Page 8:

Developing semantic document selection tools

Page 9:

Research

WE NEED:

A semi-automatic and interactive open-source application.

An application that does not replace, but supports the intuition and insights of the historical researcher with expert knowledge of a specific topic or domain.

An application that is user-friendly.

Page 10:

Research

Problem:

Context and background of Dutch drug and eugenics debates over time

Aim

Understanding and evaluation of public debates around drugs, addiction and eugenics in the Netherlands, 1900-1945

Research question

What are the dynamics (in terms of patterns and trends) of public debates and sentiments around drugs, addiction and eugenics in Dutch newspapers in the first half of the twentieth century?

Page 11:

Research

Poe’s detective finds the truth by using data in those newspaper articles that do not concern the murder.

In a similar way we will find terms and sentiments in those newspaper articles that may seem irrelevant, but are not.

Page 12:

E-everything

Information-extraction

Recognize structure in text

Part of speech

Noun, verb, …

Entities

people, organisations, locations, temporal expressions, …

Relations

Who, what, with whom, how, why
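As a rough illustration of the "Entities" step listed above, here is a minimal sketch that pulls year mentions and multi-word capitalized names out of a sentence using only regular expressions. It is not the pipeline used in the project. The function names, the year range and the capitalization heuristic are assumptions made for this example, and a real system would rely on a trained tagger.

```python
import re

# Assumption: a four-digit number from 1600 to 1999 is treated as a temporal expression.
YEAR_RE = re.compile(r"\b(1[6-9]\d{2})\b")

# Assumption: two or more consecutive capitalized words form a candidate entity.
# Requiring at least two words avoids matching every sentence-initial word.
NAME_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+\b")


def extract_years(text):
    """Return all year mentions found in the text, as integers."""
    return [int(year) for year in YEAR_RE.findall(text)]


def extract_name_candidates(text):
    """Return runs of capitalized words as candidate people, organisations or locations."""
    return NAME_RE.findall(text)


if __name__ == "__main__":
    # Made-up example sentence, not taken from the newspaper archive.
    sentence = "In 1924 the Opium Act was debated in The Hague."
    print(extract_years(sentence))
    print(extract_name_candidates(sentence))
```

A heuristic like this cannot tell a person from a location and misses single-word names, which is why part-of-speech and entity tagging are normally done with dedicated language tools.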

Page 13:

E-everything

Information-extraction (2)

Page 14:

Enjoyable but what does it tell us?

Page 15:

Research

Page 16:

Research

Start query: Opium

Page 17:

Research

Drugs and drug policy

Page 18:

Odijk, D., de Rooij, O., Peetz, M.-H., Pieters, T., de Rijke, M., Snelders, S. (2012). "Semantic Document Selection". TPDL 2012: Theory and Practice of Digital Libraries. Springer, September.

Page 19:

Combining and clustering queries

Page 20:

Research

By carefully inspecting the word counts, we found quantitative evidence for historical turning points that indicated the criminalization of the drugs debate around 1924.
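One way to inspect word counts over time is sketched below. This is an illustration, not the project's actual method: it computes, per year, the share of articles that mention a term and then reports the year with the largest year-over-year change. The function names and the `(year, text)` input format are assumptions, and the sample data is invented purely to exercise the code.

```python
import re
from collections import Counter


def yearly_relative_frequency(articles, term):
    """Map each year to the share of that year's articles mentioning `term`.

    `articles` is an iterable of (year, text) pairs (an assumed input format).
    Normalizing by the yearly total keeps a growing archive from
    masquerading as growing interest in the term.
    """
    totals, hits = Counter(), Counter()
    term = term.lower()
    for year, text in articles:
        totals[year] += 1
        if term in re.findall(r"\w+", text.lower()):
            hits[year] += 1
    return {year: hits[year] / totals[year] for year in sorted(totals)}


def largest_change(series):
    """Return (year, change) for the biggest year-over-year change in `series`."""
    years = sorted(series)
    changes = [(series[later] - series[earlier], later)
               for earlier, later in zip(years, years[1:])]
    delta, year = max(changes, key=lambda change: abs(change[0]))
    return year, delta


if __name__ == "__main__":
    # Invented toy data, not drawn from the newspaper archive.
    sample = [
        (1901, "opium trade"), (1901, "weather report"),
        (1902, "opium medicine"), (1902, "harvest news"),
        (1903, "opium smuggling arrest"), (1903, "opium police raid"),
    ]
    frequencies = yearly_relative_frequency(sample, "opium")
    print(frequencies)
    print(largest_change(frequencies))
```

A jump flagged this way is only a pointer to where the historian should start reading; it does not explain why the debate changed.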

Page 21:

Eugenics case: query "overerving" (heredity), 1867

Research

Primarily associations with health-related terms/entities
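Associations between a query term and other terms can be approximated by counting which words appear in the same articles as the query. The sketch below shows that idea in its simplest form. It is an assumption about how such associations might be computed, not a description of the tool shown on the slide, and the function name, parameters and sample texts are made up for illustration.

```python
import re
from collections import Counter


def cooccurring_terms(articles, query, top_n=3, stopwords=frozenset()):
    """Count words that share an article with `query`.

    Each word is counted once per article, so the result is the number of
    articles in which the word and the query appear together.
    """
    query = query.lower()
    counts = Counter()
    for text in articles:
        tokens = set(re.findall(r"\w+", text.lower()))
        if query in tokens:
            counts.update(tokens - {query} - set(stopwords))
    return counts.most_common(top_n)


if __name__ == "__main__":
    # Invented snippets ("ziekte" = disease, "arts" = physician), not archive text.
    snippets = [
        "overerving ziekte arts",
        "overerving ziekte",
        "weer en wind",
    ]
    print(cooccurring_terms(snippets, "overerving", top_n=1))
```

Running the same count on articles from different years gives two term lists that can be compared side by side, which is one way to see the context of a term shift over time.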

Page 22:

Research

Eugenics case

Page 23:

Eugenics case: query "overerving", 1935

Research

In 1935, however, the medical context in which the term inheritance was used made way for a legal and racial context.

Toine Pieters

Page 24:

E-Humanity Approaches to Reference Cultures: The Emergence of the United States in Public Discourse in the Netherlands, 1890-1990

Challenges:

1. OCR repair

2. Improving text-mining software and data infrastructure

3. Developing new historical research strategies

4. Educating historians and other humanities researchers

NEW HORIZONS in DIGITAL HUMANITIES