Visualizing the Transcribe Bentham Corpus

Visualizing the Transcribe Bentham Corpus Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo, Thierry Poibeau LATTICE Lab: ENS CNRS U Paris 3, PSL USPC Tim Causer, Melissa Terras UCL Bentham Project, UCL Digital Humanities UCLDH Seminar, December 2016

Transcript of Visualizing the Transcribe Bentham Corpus

Page 1: Visualizing the Transcribe Bentham Corpus

Visualizing the Transcribe Bentham Corpus

Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo, Thierry Poibeau

LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC

Tim Causer, Melissa Terras

UCL Bentham Project, UCL Digital Humanities

UCLDH Seminar, December 2016

Page 2: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 3: Visualizing the Transcribe Bentham Corpus

Jeremy Bentham (1748-1832)

• Jurist, philosopher, and legal and social reformer

• Leading theorist in Anglo-American philosophy of law

• Influenced the development of welfarism

• Advocated utilitarianism

• Animal rights

• Work on the “panopticon”

• Not founder of UCL, but...

• 60,000 folios in UCL Special Collections

• 40,000 untranscribed

• Auto-icon

Page 4: Visualizing the Transcribe Bentham Corpus

The Bentham Project

• http://www.ucl.ac.uk/Bentham-Project/

• Since 1959

• “aims to produce a new scholarly edition of the works and correspondence of Jeremy Bentham”

• Twenty-six volumes of the new Collected Works have been published

• 50 years to transcribe 20,000 folios

• Previous AHRC grant catalogued the manuscripts

– http://www.benthampapers.ucl.ac.uk/

Page 5: Visualizing the Transcribe Bentham Corpus
Page 6: Visualizing the Transcribe Bentham Corpus
Page 7: Visualizing the Transcribe Bentham Corpus
Page 8: Visualizing the Transcribe Bentham Corpus
Page 9: Visualizing the Transcribe Bentham Corpus
Page 10: Visualizing the Transcribe Bentham Corpus

Facts and Figures (as of 1st July 2016)

• 16,205 manuscripts transcribed/partially-transcribed

• 15,351 (94%) checked and approved

• 83,955 visits

• 34,359 unique views

• Average session time: 14 minutes 13 seconds

• 140 countries

• 514 people have transcribed something

• Most of the work done by the 26 Super Transcribers

• Average of 54 transcripts edited per week since the start of the project

• Average of 56 per week during the last twelve months

• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)

Page 11: Visualizing the Transcribe Bentham Corpus

Transcribe Bentham progress, 8 September 2010 to 20 March 2015

[Figure: line chart of "Manuscripts worked on" and "Completed transcripts"; y-axis from 0 to 12,000; x-axis from 8 Sep 2010 to 6 Mar 2015; annotations: "NYT article", "BL manuscripts made available"]

Page 12: Visualizing the Transcribe Bentham Corpus

With thanks to:

• Prof Philip Schofield (UCL Bentham Project, Principal Investigator)
• Dr Tim Causer (Bentham Project)
• Dr Kris Grint (Bentham Project)
• Richard Davis (University of London Computer Centre)
• José Martin (ULCC)
• Martin Moyle (UCL Library Services)
• Lesley Pitman (UCL Library Services)
• Tony Slade (UCL Creative Media)
• Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and Raheel Nabi (UCL Creative Media)
• Dr Arnold Hunt (British Library)
• Anna-Maria Sichani (Bentham Project)
• Dr Justin Tonra (National University of Ireland Galway) and Dr Valerie Wallace (Victoria University Wellington), both formerly of the Bentham Project
• All the partners in Transcriptorium http://transcriptorium.eu/consortium/
• And Transcribe Bentham’s volunteers!
• Project previously funded by the AHRC and the Andrew W. Mellon Foundation

Page 13: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 14: Visualizing the Transcribe Bentham Corpus

Relevant access to a large corpus


Page 15: Visualizing the Transcribe Bentham Corpus

Relevant access to a large corpus

• A search index?

• Topic models?

• Corpus cartography?

Challenges for this corpus

• Not an all-English corpus

• Difficulties posed by an historical variety

• Technical language

• Revision history, additions and deletions


Page 16: Visualizing the Transcribe Bentham Corpus

Stats for analyzed corpus sample

• Total TEI files: 29,900

• In English: 29,400

• That we dated: 16,700

• We only visualized English transcripts that we could date (with a simple heuristic)¹

• Work is based on ca. 55% of all the TEI files in our sample

¹ We were not using the corpus’ date metadata for this exercise
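
The slides do not say what the dating heuristic was. Purely as an illustration, one could take the first four-digit year mentioned in a transcript that falls within Bentham's lifetime (1748-1832). The function name `guess_year` and the rule itself are assumptions, not the authors' method.

```python
import re
from typing import Optional

# Hypothetical rule: only four-digit years starting with 17 or 18 are candidates.
YEAR_RE = re.compile(r"\b(1[78]\d\d)\b")

def guess_year(text: str, lo: int = 1748, hi: int = 1832) -> Optional[int]:
    """Return the first year in [lo, hi] mentioned in the text, or None."""
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if lo <= year <= hi:
            return year
    return None

print(guess_year("Panopticon. 1791 Jan. 12. Postscript"))
print(guess_year("No date on this folio"))
```

Transcripts for which such a rule finds nothing would simply stay undated, which is consistent with only part of the sample being visualized.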

Page 17: Visualizing the Transcribe Bentham Corpus

Corpus Cartography

• Lexical extraction (of relevant sequences)

• Clustering based on similarity measures

• Visual representation (map of the corpus) based on layout algorithms

Page 18: Visualizing the Transcribe Bentham Corpus

Cartography tool: CorText

• CorText Manager covers all cartography steps:

– Lexical extraction

– Clustering

– Visualization

• Each step can be used independently, thanks to standard import/export formats

Page 19: Visualizing the Transcribe Bentham Corpus

Tools combined with CorText

CARTOGRAPHY STEP: TOOLS and RESOURCES

Lexical Extraction: DBpedia Spotlight, YaTeA, human domain-expert

Clustering: CorText Analysis

Visualization (static): Gephi + Sigma JS plugin, CorText MapExplorer, Inkscape

Visualization (dynamic): CorText Heatmaps, Tubes, Distant Reading

Page 20: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 21: Visualizing the Transcribe Bentham Corpus

Lexical Extraction

• CorText native option

– Noun-Phrase chunks (based on TreeTagger)

• Our options:

– Entity Linking / Wikification to DBpedia

– Keyphrase extraction tools like YaTeA

• In all cases: manual selection of pre-ranked candidate terms by a domain-expert

Page 22: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Given a database with encyclopedic knowledge (e.g. Wikipedia)

- Finds references (mentions) to DB terms in text

- Deals with variability in the mentions for a term

Page 23: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Given a database with encyclopedic knowledge (e.g. Wikipedia)

- Finds references (mentions) to DB terms in text

- Deals with variability in the mentions for a term

Database

Page 24: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Given a database with encyclopedic knowledge (e.g. Wikipedia)

- Finds references (mentions) to DB terms in text

- Deals with variability in the mentions for a term

Database

Page 25: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Given a database with encyclopedic knowledge (e.g. Wikipedia)

- Finds references (mentions) to DB terms in text

- Deals with variability in the mentions for a term

Database Corpus

- judicatory - judicial - judicature - Judicatory - Judicial

Page 26: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Tool: DBpedia Spotlight

• Compares the context of sequences of words in a text against DBpedia articles:

– Term definition’s text

– Links

– DBpedia structure (redirections etc.)

• Assigns a DBpedia term to the sequence if a good match is found
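
The idea of matching a mention's context against candidate entries can be sketched with a toy bag-of-words comparison. This is not DBpedia Spotlight's actual model or API: the candidate "definitions", the `link` function, and the threshold value are all invented for illustration.

```python
import math
from collections import Counter

def bag(text):
    """Lower-cased bag of words."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def link(context, candidates, threshold=0.15):
    """Return the candidate whose text best matches the context,
    or None if no score reaches the threshold (no good match)."""
    ctx = bag(context)
    best_term, best_score = None, 0.0
    for term, definition in candidates.items():
        score = cosine(ctx, bag(definition))
        if score > best_score:
            best_term, best_score = term, score
    return best_term if best_score >= threshold else None

# Invented one-line descriptions standing in for knowledge-base article text.
CANDIDATES = {
    "Bill (law)": "proposed law under consideration by a legislature or parliament",
    "Invoice": "commercial document listing goods sold and the amount due for payment",
}

print(link("the bill was read a second time in parliament", CANDIDATES))
```

The threshold plays the role of the "good match" condition: a mention whose context resembles no candidate is left unlinked rather than forced onto the nearest entry.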


Page 27: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

Example terms and their variants

Term: Variants

Judiciary: judicature, judicatory, judicial

Jury: jury, juries

Monarch: king, monarch

Quantity: amount, quantity

Saint Peter: Simon Peter, Cephas
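
Once mentions are linked, counting can be done per term rather than per surface form. The sketch below assumes a hand-built lookup copied from the example terms on this slide; `VARIANT_TO_TERM` and `count_terms` are hypothetical names, and a real entity-linking tool produces this mapping from context rather than from a fixed dictionary.

```python
from collections import Counter

# Variant -> term pairs taken from the slide's examples (lower-cased).
# Multi-word variants such as "simon peter" would need phrase matching,
# which this single-token sketch does not attempt.
VARIANT_TO_TERM = {
    "judicature": "Judiciary", "judicatory": "Judiciary", "judicial": "Judiciary",
    "jury": "Jury", "juries": "Jury",
    "king": "Monarch", "monarch": "Monarch",
    "amount": "Quantity", "quantity": "Quantity",
    "cephas": "Saint Peter",
}

def count_terms(tokens):
    """Count terms, folding case and surface variants together."""
    counts = Counter()
    for token in tokens:
        term = VARIANT_TO_TERM.get(token.lower())
        if term is not None:
            counts[term] += 1
    return counts

tokens = ["The", "Judicatory", "and", "the", "judicial", "power", "of", "juries"]
print(count_terms(tokens))
```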

Page 28: Visualizing the Transcribe Bentham Corpus

Entity Linking / Wikification

• Applying a current knowledge-base (DBpedia) to 18th-19th century texts

• Is this a valid method?

Page 29: Visualizing the Transcribe Bentham Corpus

Keyphrase extraction

• YaTeA (Aubin and Hamon, 2006)

• Extracts noun-phrases of configurable structure and length
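
What "configurable structure and length" can mean is easier to see on a toy example. The sketch below is not YaTeA: it assumes tokens that are already part-of-speech tagged with a made-up tagset, keeps runs of adjectives and nouns that end in a noun, and filters them by length. The function name and parameters are illustrative.

```python
def noun_phrases(tagged, min_len=2, max_len=3):
    """Collect maximal ADJ/NOUN runs that end in a NOUN and have
    between min_len and max_len tokens."""
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if min_len <= len(run) <= max_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

tagged = [("the", "DET"), ("greatest", "ADJ"), ("happiness", "NOUN"),
          ("principle", "NOUN"), ("guides", "VERB"), ("penal", "ADJ"),
          ("law", "NOUN")]
print(noun_phrases(tagged))
```

Changing `max_len` or the accepted tag set changes which candidates reach the domain expert for manual selection.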


Page 30: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 31: Visualizing the Transcribe Bentham Corpus

Clustering

• CorText offers several similarity metrics

– we chose the default method (distributional) for homogeneous networks (Weeds & Weir 2005)
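
A distributional measure treats two terms as similar when they co-occur with the same other terms, even if they never appear together. The sketch below uses plain cosine similarity over document-level co-occurrence counts as a stand-in; it is not CorText's implementation nor the Weeds & Weir framework, and the documents are invented.

```python
import math
from collections import Counter, defaultdict
from itertools import permutations

def cooccurrence_profiles(docs):
    """Map each term to counts of the terms it shares a document with."""
    profiles = defaultdict(Counter)
    for terms in docs:
        for a, b in permutations(set(terms), 2):
            profiles[a][b] += 1
    return profiles

def cosine(p, q):
    dot = sum(p[t] * q[t] for t in p if t in q)
    norm = (math.sqrt(sum(v * v for v in p.values()))
            * math.sqrt(sum(v * v for v in q.values())))
    return dot / norm if norm else 0.0

# Invented documents, each reduced to its extracted terms.
docs = [
    ["punishment", "prison", "panopticon"],
    ["punishment", "prison", "convict"],
    ["happiness", "pleasure", "pain"],
    ["happiness", "pleasure", "utility"],
]
profiles = cooccurrence_profiles(docs)
print(cosine(profiles["panopticon"], profiles["convict"]))
print(cosine(profiles["panopticon"], profiles["utility"]))
```

In this toy data "panopticon" and "convict" share no document but have identical neighbours, so they score as highly similar, while "panopticon" and "utility" share no neighbours at all. Term pairs with high similarity become the edges that a clustering step then groups.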


Page 32: Visualizing the Transcribe Bentham Corpus

Visualization

• Static (one map for all dated transcripts)

• Dynamic: temporal slices on the corpus

– Heatmaps

– “River” or Sankey networks (“Tubes layout”)
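
Temporal slicing amounts to grouping dated transcripts into periods and counting terms per period. The sketch below assumes decade-wide slices and invented data; the slice width and the function names are assumptions, not CorText settings.

```python
from collections import Counter, defaultdict

def decade(year):
    """Start year of the decade containing `year` (1807 -> 1800)."""
    return year - year % 10

def term_counts_by_decade(dated_docs):
    """Group (year, terms) pairs into decades and count terms per decade."""
    slices = defaultdict(Counter)
    for year, terms in dated_docs:
        slices[decade(year)].update(terms)
    return slices

# Invented (year, extracted terms) pairs.
dated_docs = [
    (1802, ["panopticon", "prison"]),
    (1807, ["panopticon", "evidence"]),
    (1815, ["evidence", "codification"]),
]
slices = term_counts_by_decade(dated_docs)
for start in sorted(slices):
    print(f"{start}-{start + 9}", dict(slices[start]))
```

Per-slice counts like these are the raw material for both kinds of dynamic view: a heatmap highlights what is salient within one slice, and a tubes layout links related clusters across consecutive slices.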


http://apps.lattice.cnrs.fr/bentham

Page 33: Visualizing the Transcribe Bentham Corpus

Static visualization


CorText network visualized with Gephi

Page 34: Visualizing the Transcribe Bentham Corpus

Static visualization


CorText network visualized with Gephi

Page 35: Visualizing the Transcribe Bentham Corpus

Static visualization


Page 36: Visualizing the Transcribe Bentham Corpus

Example term: Bill


Page 37: Visualizing the Transcribe Bentham Corpus

Example term: happiness


CorText network made interactive thanks to Gephi’s Sigma JS Exporter

Page 38: Visualizing the Transcribe Bentham Corpus


Example term: happiness

Page 39: Visualizing the Transcribe Bentham Corpus


Example term: happiness

Page 40: Visualizing the Transcribe Bentham Corpus

Example term: suffering


Page 41: Visualizing the Transcribe Bentham Corpus

Example term: suffering


Page 42: Visualizing the Transcribe Bentham Corpus

Example term: death

Page 43: Visualizing the Transcribe Bentham Corpus

Example term: death

Page 44: Visualizing the Transcribe Bentham Corpus

Examples: nodes linking clusters


Page 45: Visualizing the Transcribe Bentham Corpus

Examples: nodes linking clusters


Page 46: Visualizing the Transcribe Bentham Corpus

Heatmaps: Saliency per subcorpus


Page 47: Visualizing the Transcribe Bentham Corpus

Heatmaps: 1800-1809 subcorpus


Page 48: Visualizing the Transcribe Bentham Corpus

Heatmaps: 1810-1819 subcorpus


Page 49: Visualizing the Transcribe Bentham Corpus

Dynamic visualization


Page 50: Visualizing the Transcribe Bentham Corpus

Dynamic visualization

[Figure: time axis 1795, 1800, 1805, 1810]

Page 51: Visualizing the Transcribe Bentham Corpus

Dynamic visualization

[Figure: time axis 1795, 1800, 1805, 1810]

Page 52: Visualizing the Transcribe Bentham Corpus

Dynamic visualization

[Figure: time axis 1795, 1800, 1805, 1810]

Page 53: Visualizing the Transcribe Bentham Corpus

Dynamic visualization

[Figure: time axis 1795, 1800, 1805, 1810]

Page 54: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 55: Visualizing the Transcribe Bentham Corpus

Evaluation

• Static maps: terms in the clusters correspond closely to issues dealt with by Bentham for the thematic areas of each cluster

• Heatmaps: The evolution depicted corresponds to the evolution of topics in Bentham’s work

• DBpedia vs. keyphrase extraction: The keyphrases provide more relevant evidence for specialized scholars, whereas a general encyclopedia can help other users

Page 56: Visualizing the Transcribe Bentham Corpus

Challenges

[Figure labels: Deleted material; Additions]

Page 57: Visualizing the Transcribe Bentham Corpus

Challenges

Thematic Variety

• Animal Welfare

• Arts

• Capital punishment

• Civil Code

• Constitutional Code

• Convict transportation

• Correspondence

• Crime & Punishment

• Education

• Law

• Legislation

• Moral Philosophy

• New South Wales

• Panopticon

• Penal Code

• Political Economy

• Preventive Police

• Religion

• Science

• Sexual Morality

• Torture

Formal Variety

• Text sheets

• Copies / Fair copies

• Marginal summary sheets

• Correspondence

• Collectanea

• Rudiments

• Spencers


From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and

http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category

Page 58: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 59: Visualizing the Transcribe Bentham Corpus

Distant Reading Module

• Follow the evolution of selected lexical sequences

Page 60: Visualizing the Transcribe Bentham Corpus

Evolution of a lexical item


Temporal evolution

Temporal evolution profiles:

- Here: Rising, but present at all dates

- Other examples: falling, regular spikes etc.
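
A profile such as "rising, but present at all dates" can be approximated from per-period counts. The sketch below is an assumption about how one might label a series, not the Distant Reading Module's method: it uses the sign of a least-squares slope for the trend and a separate check that no period has a zero count.

```python
def profile(counts, tolerance=0.0):
    """Return (trend, present_everywhere) for a list of per-period counts."""
    n = len(counts)
    if n < 2:
        raise ValueError("need at least two periods")
    mean_x = (n - 1) / 2
    mean_y = sum(counts) / n
    # Least-squares slope of count against period index.
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in enumerate(counts))
             / sum((x - mean_x) ** 2 for x in range(n)))
    if slope > tolerance:
        trend = "rising"
    elif slope < -tolerance:
        trend = "falling"
    else:
        trend = "flat"
    return trend, all(c > 0 for c in counts)

# Invented counts per period.
print(profile([3, 5, 4, 9, 12]))
print(profile([10, 6, 0, 2]))
```

A slope cannot detect the "regular spikes" profile; that would need a different test, for instance on the variance of period-to-period differences.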

Page 61: Visualizing the Transcribe Bentham Corpus

Contexts: WordTree


Page 62: Visualizing the Transcribe Bentham Corpus

Contexts: WordTree


Page 63: Visualizing the Transcribe Bentham Corpus

Contexts: WordTree


Page 64: Visualizing the Transcribe Bentham Corpus

Context evolution: Bump Charts


• Example: evil

Page 65: Visualizing the Transcribe Bentham Corpus

Neighbours evolution

Bump Charts

Page 66: Visualizing the Transcribe Bentham Corpus

Neighbours evolution

Bump Charts

Page 67: Visualizing the Transcribe Bentham Corpus

Relations in the context: Egonetworks

• Example: relations among neighbours of evil

Page 68: Visualizing the Transcribe Bentham Corpus

Evolution of neighbours’ relations

Egonetworks (Period 2)

Page 69: Visualizing the Transcribe Bentham Corpus

Evolution of neighbours’ relations

Egonetworks (Period 3)

Page 70: Visualizing the Transcribe Bentham Corpus

Evolution of neighbours’ relations

Egonetworks (Period 4)

Page 71: Visualizing the Transcribe Bentham Corpus

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Page 72: Visualizing the Transcribe Bentham Corpus

Other Lexical Analyses

• TXM “textometry” tool

– Automatic part-of-speech tagging

– Partition texts according to metadata

– Query corpus using linguistic criteria

– Statistical analyses (overrepresentation, underrepresentation)

[ http://textometrie.ens-lyon.fr/?lang=en ]

Page 73: Visualizing the Transcribe Bentham Corpus

Lexical Analysis with TXM


Page 74: Visualizing the Transcribe Bentham Corpus

Lexical Analysis with TXM

• Partition the corpus according to Category, Year, Decade, Main headings, or other available metadata

Page 75: Visualizing the Transcribe Bentham Corpus

Lexical Analysis with TXM

Number of words per Category


Page 76: Visualizing the Transcribe Bentham Corpus

Lexical Analyses with TXM

• Over- (or under-) representation of given words per decade (after partitioning per decade)
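
One common way to quantify over-representation is a hypergeometric tail probability: if the decade's tokens were a random draw from the whole corpus, how likely is it to see the word at least this often? The sketch below shows that calculation with invented toy numbers; whether it matches TXM's exact computation is an assumption, and the function name is hypothetical.

```python
from math import comb

def over_representation_p(k: int, f: int, t: int, T: int) -> float:
    """P(X >= k) for a word occurring f times in a corpus of T tokens,
    when a part of t tokens is drawn at random without replacement."""
    upper = min(f, t)
    hits = sum(comb(f, i) * comb(T - f, t - i) for i in range(k, upper + 1))
    return hits / comb(T, t)

# Toy numbers: the word occurs 5 times in a 20-token corpus,
# and all 5 occurrences fall in a 10-token decade.
p = over_representation_p(k=5, f=5, t=10, T=20)
print(f"{p:.4f}")
```

A small probability suggests the word is over-represented in that decade; the mirror-image tail, P(X <= k), would flag under-representation.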


Page 77: Visualizing the Transcribe Bentham Corpus

TXM linguistic queries

• Evil followed by a noun, per text-category


Page 78: Visualizing the Transcribe Bentham Corpus

TXM linguistic queries

• Sentences containing an adjective + evil
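
The two queries shown on these slides (evil followed by a noun, and an adjective followed by evil) are patterns over word forms and part-of-speech tags. The sketch below mimics them in plain Python over pre-tagged tokens; it does not use TXM or its query syntax, and the tagset and sentence are invented.

```python
def evil_plus_noun(tagged):
    """Pairs where the word 'evil' is immediately followed by a noun."""
    return [(w1, w2)
            for (w1, _), (w2, t2) in zip(tagged, tagged[1:])
            if w1.lower() == "evil" and t2 == "NOUN"]

def adjective_plus_evil(tagged):
    """Pairs where an adjective is immediately followed by the word 'evil'."""
    return [(w1, w2)
            for (w1, t1), (w2, _) in zip(tagged, tagged[1:])
            if t1 == "ADJ" and w2.lower() == "evil"]

# Invented tagged sequence.
tagged = [("the", "DET"), ("greatest", "ADJ"), ("evil", "NOUN"),
          ("of", "ADP"), ("evil", "ADJ"), ("deeds", "NOUN")]
print(evil_plus_noun(tagged))
print(adjective_plus_evil(tagged))
```

Running the same pattern separately on each partition (for example per text category) gives the per-category breakdown described on the slide.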


Page 79: Visualizing the Transcribe Bentham Corpus

Summary

• Accessing a large unedited corpus

– Cartography methods

• Lexical extraction

• Maps

– Static picture of the corpus

– Temporal evolution

– Other visualizations (Distant, WordTree)

• Domain-expert feedback

• Challenges

• Other lexical analyses


http://apps.lattice.cnrs.fr/bentham

Page 80: Visualizing the Transcribe Bentham Corpus

Bibliography

Aubin, S., and Hamon, T. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.

Auer, Sören, et al. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web. Springer.

Causer, Tim, and Terras, Melissa (2014a). Many hands make light work. Many hands together make merry work: Transcribe Bentham and crowdsourcing manuscript collections. In Crowdsourcing Our Cultural Heritage, ed. M. Ridge. Ashgate.

Causer, Tim, and Terras, Melissa (2014b). Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History. International Journal of Humanities and Arts Computing, 8.

Chavalarias, David, and Jean-Philippe Cointet (2013). Phylomemetic Patterns in Science Evolution—The Rise and Fall of Scientific Fields. PLoS ONE 8 (2).

Cortext Manager Documentation (2016). https://docs.cortext.net/.

Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, 1–8. ACM.

Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T., Causer, T., and Terras, M. (2016). Mapping the Bentham Corpus. In Digital Humanities Conference (DH 2016). Kraków, Poland.

Poibeau, T., and Ruiz, P. (2015). Generating Navigable Semantic Maps from Social Sciences Corpora. In Digital Humanities Conference (DH 2015). Sydney, Australia.

Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman (2015). Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014. Proceedings of the National Academy of Sciences 112 (35).

Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1.

Wattenberg, M., and Viégas, F. B. (2008). The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics, 14(6), pp. 1221-1228.

Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics 31(4), 439–475.

Page 81: Visualizing the Transcribe Bentham Corpus


Page 82: Visualizing the Transcribe Bentham Corpus


& return you all due thanks

[email protected] http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541 http://apps.lattice.cnrs.fr/