Visualizing the Transcribe Bentham Corpus

Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo, Thierry Poibeau

LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC

Tim Causer, Melissa Terras

UCL Bentham Project, UCL Digital Humanities

UCLDH Seminar, December 2016

Outline

• UCL Bentham Project & Transcribe Bentham

• How to navigate this corpus? Visualizations

– Lexical extraction

– Co-occurrence networks

• Static view and Temporal evolution

• Evaluation and Challenges

• Other corpus explorations via visualization

• Distant Reading Module, WordTree

• Other lexical analyses

Jeremy Bentham (1748-1832)

• Jurist, philosopher, and legal and social reformer

• Leading theorist in Anglo-American philosophy of law

• Influenced the development of welfarism

• Advocated utilitarianism

• Animal rights

• Work on the “panopticon”

• Not founder of UCL, but...

• 60,000 folios in UCL Special Collections

• 40,000 untranscribed

• Auto-icon

The Bentham Project

• http://www.ucl.ac.uk/Bentham-Project/

• Since 1959

• “aims to produce a new scholarly edition of the works and correspondence of Jeremy Bentham”

• Twenty-six volumes of the new Collected Works have been published

• 50 years to transcribe 20,000 folios

• Previous AHRC grant catalogued the manuscripts

– http://www.benthampapers.ucl.ac.uk/

Facts and Figures (as of 1st July 2016)

• 16,205 manuscripts transcribed/partially-transcribed

• 15,351 (94%) checked and approved

• 83,955 visits

• 34,359 unique views

• Average session time: 14 minutes 13 seconds

• 140 countries

• 514 people have transcribed something

• Most of the work done by the 26 Super Transcribers

• Average of 54 transcripts edited per week since the start of the project

• Average of 56 per week during the last twelve months

• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)

Transcribe Bentham progress, 8 September 2010 to 20 March 2015

[Figure: line chart with two series, “Manuscripts worked on” and “Completed transcripts”; x-axis from 8 Sep 2010 to 6 Mar 2015, y-axis from 0 to 12,000; annotated events: NYT article, BL manuscripts made available]

With thanks to:

• Prof Philip Schofield (UCL Bentham Project, Principal Investigator)

• Dr Tim Causer (Bentham Project)

• Dr Kris Grint (Bentham Project)

• Richard Davis (University of London Computer Centre)

• José Martin (ULCC)

• Martin Moyle (UCL Library Services)

• Lesley Pitman (UCL Library Services)

• Tony Slade (UCL Creative Media)

• Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and Raheel Nabi (UCL Creative Media)

• Dr Arnold Hunt (British Library)

• Anna-Maria Sichani (Bentham Project)

• Dr Justin Tonra (National University of Ireland Galway) and Dr Valerie Wallace (Victoria University Wellington), both formerly of the Bentham Project

• All the partners in Transcriptorium http://transcriptorium.eu/consortium/

• And Transcribe Bentham’s volunteers!

• Project previously funded by the AHRC and the Andrew W. Mellon Foundation

Relevant access to a large corpus

• A search index?

• Topic models?

• Corpus cartography?

Challenges for this corpus

• Not an all-English corpus

• Difficulties posed by an historical variety

• Technical language

• Revision history, additions and deletions

Stats for analyzed corpus sample

• Total TEI files: 29,900

• In English: 29,400

• That we dated: 16,700

• We only visualized English transcripts that we could date (with a simple heuristic) [1]

• Work is based on ca. 55% of all the TEI files in our sample

[1] We were not using the corpus’ date metadata for this exercise
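The slides do not describe the dating heuristic. Purely as a hypothetical illustration of what a simple heuristic could look like, the sketch below picks the most frequent four-digit year mentioned in a transcript, restricted to an assumed plausible range. The function name, the year range and the tie-breaking rule are all assumptions, not the authors' method.

```python
import re
from collections import Counter
from typing import Optional

# Hypothetical bounds: Bentham died in 1832; the lower bound is an assumption.
MIN_YEAR, MAX_YEAR = 1760, 1832

# Four-digit numbers starting with 17 or 18, as whole words.
YEAR_RE = re.compile(r"\b(1[78]\d{2})\b")

def guess_year(text: str) -> Optional[int]:
    """Return the most frequent plausible year mentioned in text, or None."""
    years = [int(y) for y in YEAR_RE.findall(text)]
    years = [y for y in years if MIN_YEAR <= y <= MAX_YEAR]
    if not years:
        return None
    counts = Counter(years)
    best = max(counts.values())
    # Break ties by taking the earliest year, to keep the result deterministic.
    return min(y for y, c in counts.items() if c == best)
```

A transcript with no plausible year yields `None`, which corresponds to the transcripts that could not be dated and were left out of the visualizations.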

Corpus Cartography

• Lexical extraction (of relevant sequences)

• Clustering based on similarity measures

• Visual representation (map of the corpus) based on layout algorithms
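To make the link between extraction and the map concrete, here is a minimal sketch of how a co-occurrence network could be derived once terms have been extracted per document. It is not CorText's implementation; the function name and the toy data are assumptions for illustration.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_edges(docs_terms):
    """Count, for each unordered term pair, the documents containing both."""
    edges = Counter()
    for terms in docs_terms:
        # Deduplicate and sort so each pair has one canonical orientation.
        for a, b in combinations(sorted(set(terms)), 2):
            edges[(a, b)] += 1
    return edges

# Toy input: terms already extracted from three hypothetical transcripts.
docs = [
    ["punishment", "offence", "judge"],
    ["punishment", "offence"],
    ["happiness", "punishment"],
]
edges = cooccurrence_edges(docs)
```

The resulting weighted pairs are the kind of edge list that a clustering step and a layout tool would then consume.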

Cartography tool: CorText

• CorText Manager covers all cartography steps:

– Lexical extraction

– Clustering

– Visualization

• Each step can be used independently, thanks to standard import/export formats

Tools combined with CorText

CARTOGRAPHY STEP          TOOLS and RESOURCES
Lexical Extraction        DBpedia Spotlight; YaTeA; Human domain-expert
Clustering                CorText Analysis
Visualization - Static    Gephi + Sigma JS plugin; CorText MapExplorer; Inkscape
Visualization - Dynamic   CorText Heatmaps, Tubes, Distant Reading

Lexical Extraction

• CorText native option

– Noun-Phrase chunks (based on TreeTagger)

• Our options:

– Entity Linking / Wikification to DBpedia

– Keyphrase extraction tools like YaTeA

• In all cases: manual selection of pre-ranked candidate terms by a domain-expert

Entity Linking / Wikification

• Given a database with encyclopedic knowledge (e.g. Wikipedia)

- Finds references (mentions) to DB terms in text

- Deals with variability in the mentions for a term

[Figure: database and corpus; corpus mentions shown: judicatory, judicial, judicature, Judicatory, Judicial]


• Tool: DBpedia Spotlight

• Compares the context of sequences of words in a text against DBpedia articles:

– Term definition’s text

– Links

– DBpedia structure (redirections etc.)

• Assigns a DBpedia term to the sequence if a good match is found


Example terms and their variants

Term          Variants
Judiciary     judicature, judicatory, judicial
Jury          jury, juries
Monarch       king, monarch
Quantity      amount, quantity
Saint Peter   Simon Peter, Cephas
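The effect of grouping variants under one term can be sketched with a plain lookup built from the variants listed above. This is only an illustration of the outcome: DBpedia Spotlight decides matches from context, whereas this hypothetical `count_terms` function ignores context and handles single tokens only.

```python
from collections import Counter

# Variant -> term, copied from the examples above (keys lowercased).
VARIANT_TO_TERM = {
    "judicature": "Judiciary", "judicatory": "Judiciary", "judicial": "Judiciary",
    "jury": "Jury", "juries": "Jury",
    "king": "Monarch", "monarch": "Monarch",
    "amount": "Quantity", "quantity": "Quantity",
    "simon peter": "Saint Peter", "cephas": "Saint Peter",
}

def count_terms(tokens):
    """Count terms, merging case-insensitive variants under one label."""
    counts = Counter()
    for tok in tokens:
        term = VARIANT_TO_TERM.get(tok.lower())
        if term is not None:
            counts[term] += 1
    return counts

counts = count_terms(["Judicatory", "judicial", "the", "juries", "King"])
```

Multi-word variants such as "Simon Peter" would need to be passed in as a single token for this lookup to find them.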


• Applying a current knowledge-base (DBpedia) to 18th-19th century texts

• Is this a valid method?

Keyphrase extraction

• YaTeA (Aubin and Hamon, 2006)

• Extracts noun-phrases of configurable structure and length
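As a rough idea of what pattern-based noun-phrase extraction with a configurable length involves, the sketch below collects adjective/noun runs that end in a noun from tokens that are already part-of-speech tagged. It is not YaTeA and does not use its configuration; the function name, the Penn-style tags and the `max_len` parameter are assumptions.

```python
def noun_phrases(tagged, max_len=3):
    """Extract adjective/noun runs that end in a noun.

    tagged: list of (word, tag) pairs with Penn-style tags (assumed).
    max_len: longest phrase to keep, in tokens; longer runs are dropped.
    """
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the last run
        if tag.startswith("JJ") or tag.startswith("NN"):
            run.append((word, tag))
            continue
        # Trim trailing adjectives so the phrase ends in a noun.
        while run and not run[-1][1].startswith("NN"):
            run.pop()
        if 0 < len(run) <= max_len:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases

tagged = [("the", "DT"), ("greatest", "JJS"), ("happiness", "NN"),
          ("of", "IN"), ("the", "DT"), ("penal", "JJ"), ("code", "NN")]
```

Candidates produced this way would still need ranking and the manual selection by a domain expert mentioned earlier.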

Clustering

• CorText offers several similarity metrics

– We chose the default (distributional) method for homogeneous networks (Weeds & Weir 2005)
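The general idea of a distributional measure is that two terms are similar when they co-occur with the same other terms. The sketch below uses cosine similarity over co-occurrence profiles as a simplified stand-in; it is not the Weeds & Weir measure and not CorText's code, and the profile data are invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical co-occurrence profiles: term -> {context term: count}.
profiles = {
    "punishment": {"offence": 4, "judge": 2},
    "penalty": {"offence": 2, "judge": 1},
    "happiness": {"pleasure": 3},
}
```

Here "punishment" and "penalty" have proportional profiles and score close to 1, while "punishment" and "happiness" share no context and score 0.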

31

Visualization

• Static (one map for all dated transcripts)

• Dynamic: temporal slices on the corpus

– Heatmaps

– “River” or Sankey networks (“Tubes layout”)
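Temporal slicing can be pictured as grouping dated transcripts into periods and comparing term weights per period. The sketch below groups by decade and uses relative frequency as a stand-in weight; the saliency measure CorText uses for heatmaps is not specified in these slides, and the names and toy data are assumptions.

```python
from collections import Counter, defaultdict

def decade_of(year):
    """Map a year to the first year of its decade, e.g. 1804 -> 1800."""
    return year - year % 10

def relative_frequencies(dated_docs):
    """Per decade, return each term's share of all term occurrences."""
    counts = defaultdict(Counter)
    for year, terms in dated_docs:
        counts[decade_of(year)].update(terms)
    return {
        decade: {t: n / sum(c.values()) for t, n in c.items()}
        for decade, c in counts.items()
    }

# Toy input: (year, extracted terms) for hypothetical transcripts.
dated_docs = [
    (1804, ["panopticon", "prison"]),
    (1808, ["panopticon", "police"]),
    (1815, ["constitution", "code"]),
]
freqs = relative_frequencies(dated_docs)
```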

http://apps.lattice.cnrs.fr/bentham

Static visualization

CorText network visualized with Gephi

Example term: Bill

Example term: happiness

CorText network made interactive thanks to Gephi’s Sigma JS Exporter

Example term: suffering

Example term: death

Examples: nodes linking clusters

Heatmaps: Saliency per subcorpus

Heatmaps: 1800-1809 subcorpus

Heatmaps: 1810-1819 subcorpus

Dynamic visualization

[Figures: dynamic views with time-axis ticks at 1795, 1800, 1805, 1810]

Evaluation

• Static maps: terms in the clusters correspond closely to issues dealt with by Bentham for the thematic areas of each cluster

• Heatmaps: the evolution depicted corresponds to the evolution of topics in Bentham’s work

• DBpedia vs. keyphrase extraction: the keyphrases provide more relevant evidence for specialized scholars, while a general encyclopedia can help other users

Challenges: deleted material, additions

Challenges

Thematic Variety

• Animal Welfare

• Arts

• Capital punishment

• Civil Code

• Constitutional Code

• Convict transportation

• Correspondence

• Crime & Punishment

• Education

• Law

• Legislation

• Moral Philosophy

• New South Wales

• Panopticon

• Penal Code

• Political Economy

• Preventive Police

• Religion

• Science

• Sexual Morality

• Torture

Formal Variety

• Text sheets

• Copies / Fair copies

• Marginal summary sheets

• Correspondence

• Collectanea

• Rudiments

• Spencers

From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category







Distant Reading Module

• Follow evolution of selected lexical sequences

Evolution of a lexical item

Temporal evolution

Temporal evolution profiles:

- Here: Rising, but present at all dates

- Other examples: falling, regular spikes etc.
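One simple way to label such profiles automatically is to look at the sign of a least-squares slope over the per-period frequencies. This is a hypothetical sketch, not part of the Distant Reading Module; it only separates rising, falling and flat series and would not detect regular spikes.

```python
def trend(values, tol=0.0):
    """Label a series 'rising', 'falling', or 'flat' by its least-squares slope."""
    n = len(values)
    if n < 2:
        return "flat"
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    # Slope of the best-fit line through (0, v0), (1, v1), ...
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    slope = num / den
    if slope > tol:
        return "rising"
    if slope < -tol:
        return "falling"
    return "flat"
```

For example, a term whose counts per period are 1, 2, 4, 7 would be labelled "rising", matching the profile described above.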

Contexts: WordTree

Context evolution: Bump Charts

• Example: evil

Neighbours evolution: Bump Charts

• Example: relations among neighbours of evil

Relations in the context: Egonetworks

Evolution of neighbours’ relations

Egonetworks (Period 2)

Egonetworks (Period 3)

Egonetworks (Period 4)

Other Lexical Analyses

• TXM “textometry” tool

– Automatic part-of-speech tagging

– Partition texts according to metadata

– Query corpus using linguistic criteria

– Statistical analyses (overrepresentation, underrepresentation)

http://textometrie.ens-lyon.fr/?lang=en



Lexical Analysis with TXM

• Partition the corpus according to Category, Year, Decade, Main headings, or other available metadata


Number of words per Category

Lexical Analyses with TXM

• Over- (or under-) representation of given words per decade (after partitioning per decade)
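Over-representation can be quantified by asking how surprising a word's count in one decade is, given its count in the whole corpus. The sketch below assumes a hypergeometric model, a common choice for this kind of specificity score; it is an illustration with hypothetical function names, not TXM's code, and is meant for small toy counts.

```python
from math import comb

def hypergeom_pmf(k, N, K, n):
    """P(X = k): k of the word's K occurrences fall in a part of n tokens out of N."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)

def over_representation_p(k, N, K, n):
    """P(X >= k): small values suggest the word is over-represented in the part."""
    upper = min(K, n)
    return sum(hypergeom_pmf(i, N, K, n) for i in range(k, upper + 1))
```

With a corpus of N = 10 tokens, a word occurring K = 4 times overall, and a decade of n = 5 tokens containing all 4 occurrences, `over_representation_p(4, 10, 4, 5)` gives 6/252, about 0.024. Under-representation can be assessed symmetrically with the lower tail.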

TXM linguistic queries

• Evil followed by a noun, per text-category

• Sentences containing an adjective + evil
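In TXM such queries are written in the CQP query language over the tagged corpus. To show the logic only, the sketch below runs the two patterns over a list of already tagged tokens in plain Python; the function names, the Penn-style tags and the sample sentence are assumptions.

```python
def evil_plus_noun(tagged):
    """Return (evil, noun) bigrams: 'evil' immediately followed by a noun."""
    return [
        (w1, w2)
        for (w1, _), (w2, t2) in zip(tagged, tagged[1:])
        if w1.lower() == "evil" and t2.startswith("NN")
    ]

def adjective_plus_evil(tagged):
    """Return (adjective, evil) bigrams: an adjective immediately before 'evil'."""
    return [
        (w1, w2)
        for (w1, t1), (w2, _) in zip(tagged, tagged[1:])
        if t1.startswith("JJ") and w2.lower() == "evil"
    ]

tagged = [("the", "DT"), ("greater", "JJR"), ("evil", "NN"), ("and", "CC"),
          ("evil", "JJ"), ("consequences", "NNS")]
```

Grouping the matches by a text-category field would give the per-category view mentioned above.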

Summary

• Accessing a large unedited corpus

– Cartography methods

• Lexical extraction

• Maps

– Static picture of the corpus

– Temporal evolution

– Other visualizations (Distant, WordTree)

• Domain-expert feedback

• Challenges

• Other lexical analyses

http://apps.lattice.cnrs.fr/bentham

Bibliography

Aubin, S., and Hamon, T. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.

Auer, Sören, et al. (2007). DBpedia: A nucleus for a web of open data. In The Semantic Web. Springer.

Causer, Tim, and Terras, Melissa (2014a). Many hands make light work. Many hands together make merry work: Transcribe Bentham and crowdsourcing manuscript collections. In Crowdsourcing Our Cultural Heritage, ed. M. Ridge. Ashgate.

Causer, Tim, and Terras, Melissa (2014b). Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History. International Journal of Humanities and Arts Computing, 8.

Chavalarias, David, and Jean-Philippe Cointet (2013). Phylomemetic Patterns in Science Evolution: The Rise and Fall of Scientific Fields. PLoS ONE 8 (2).

Cortext Manager Documentation (2016). https://docs.cortext.net/

Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, 1–8. ACM.

Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T., Causer, T., and Terras, M. (2016). Mapping the Bentham Corpus. In Digital Humanities Conference (DH 2016). Kraków, Poland.

Poibeau, T., and Ruiz, P. (2015). Generating Navigable Semantic Maps from Social Sciences Corpora. In Digital Humanities Conference (DH 2015). Sydney, Australia.

Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman (2015). Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014. Proceedings of the National Academy of Sciences 112 (35).

Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1.

Wattenberg, M., and Viégas, F. B. (2008). The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics, 14(6), pp. 1221-1228.

Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics 31(4), 439–475.

& return you all due thanks

http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541

http://apps.lattice.cnrs.fr/
