Visualizing the Transcribe Bentham Corpus
Frédérique Mélanie, Estelle Tieberghien, Pablo Ruiz Fabo,
Thierry Poibeau
LATTICE Lab: ENS – CNRS – U Paris 3, PSL – USPC
Tim Causer, Melissa Terras
UCL Bentham Project, UCL Digital Humanities
UCLDH Seminar, December 2016
Outline
• UCL Bentham Project & Transcribe Bentham
• How to navigate this corpus? Visualizations
– Lexical extraction
– Co-occurrence networks
• Static view and Temporal evolution
• Evaluation and Challenges
• Other corpus explorations via visualization
• Distant Reading Module, WordTree
• Other lexical analyses
Jeremy Bentham (1748-1832)
• Jurist, philosopher, and legal and
social reformer
• Leading theorist in Anglo-American
philosophy of law
• Influenced the development of
welfarism
• Advocated utilitarianism
• Animal rights
• Work on the “panopticon”
• Not founder of UCL, but...
• 60,000 folios in UCL Special Collections
• 40,000 untranscribed
• Auto-icon
The Bentham Project
• http://www.ucl.ac.uk/Bentham-Project/
• Since 1959
• “aims to produce a new scholarly
edition of the works and
correspondence of Jeremy Bentham”
• Twenty-six volumes of the new
Collected Works have been published
• 50 years to transcribe 20,000 folios
• Previous AHRC grant catalogued the
manuscripts
– http://www.benthampapers.ucl.ac.uk/
Facts and Figures (as of 1st July 2016)
• 16,205 manuscripts transcribed/partially-transcribed
• 15,351 (94%) checked and approved
• 83,955 visits
• 34,359 unique views
• Average session time: 14 minutes 13 seconds
• 140 countries
• 514 people have transcribed something
• Most of the work done by the 26 Super Transcribers
• Average of 54 transcripts edited per week since the start of the project
• Average of 56 per week during the last twelve months
• Greatest number of transcripts in any one week: 300 (w/c 14 June 2014)
Transcribe Bentham progress, 8 September 2010 to 20 March 2015
[Chart: manuscripts worked on and completed transcripts over time (y-axis 0 to 12,000), with annotations marking the NYT article and the BL manuscripts being made available]
With thanks to:
• Prof Philip Schofield (UCL Bentham Project, Principal Investigator)
• Dr Tim Causer (Bentham Project)
• Dr Kris Grint (Bentham Project)
• Richard Davis (University of London Computer Centre)
• José Martin (ULCC)
• Martin Moyle (UCL Library Services)
• Lesley Pitman (UCL Library Services)
• Tony Slade (UCL Creative Media)
• Miguel Faleiro Rodrigues, Alejandro Salinas Lopez, and Raheel Nabi (UCL Creative Media)
• Dr Arnold Hunt (British Library)
• Anna-Maria Sichani (Bentham Project)
• Dr Justin Tonra (National University of Ireland Galway) and Dr Valerie Wallace (Victoria University Wellington), both formerly of the Bentham Project
• All the partners in Transcriptorium http://transcriptorium.eu/consortium/
• And Transcribe Bentham’s volunteers!
• Project previously funded by the AHRC and the Andrew W. Mellon Foundation
Relevant access to a large corpus
• A search index?
• Topic models?
• Corpus cartography?
Challenges for this corpus
• Not an all-English corpus
• Difficulties posed by an historical variety
• Technical language
• Revision history, additions and deletions
Stats for analyzed corpus sample
• Total TEI files: 29,900
• In English: 29,400
• That we dated: 16,700
• We only visualized English transcripts that
we could date (with a simple heuristic)*
• Work is based on ca. 55% of all the
TEI files in our sample
* We were not using the corpus’ date metadata for this exercise
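The slides do not say which dating heuristic was used. As a purely hypothetical illustration of what a "simple heuristic" could look like, the sketch below takes the first four-digit year in a transcript that falls within Bentham's lifetime (1748-1832). The function name, the regular expression and the year bounds are all assumptions made for this example.

```python
import re
from typing import Optional

# Hypothetical heuristic, not the one used in the study: accept the first
# four-digit number that falls within Bentham's lifetime (1748-1832).
YEAR_RE = re.compile(r"\b(1[78]\d{2})\b")


def guess_year(text: str, lo: int = 1748, hi: int = 1832) -> Optional[int]:
    """Return the first plausible year mentioned in a transcript, or None."""
    for match in YEAR_RE.finditer(text):
        year = int(match.group(1))
        if lo <= year <= hi:
            return year
    return None
```

A transcript for which `guess_year` returns `None` would simply be left out of the dated subset.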
Corpus Cartography
• Lexical extraction (of relevant sequences)
• Clustering based on similarity measures
• Visual representation (map of the corpus)
based on layout algorithms
Cartography tool: CorText
• CorText Manager covers all cartography
steps:
– Lexical extraction
– Clustering
– Visualization
• Each step can be used independently,
thanks to standard import/export formats
Tools combined with CorText
CARTOGRAPHY STEP          TOOLS and RESOURCES
Lexical Extraction        DBpedia Spotlight, YaTeA, Human domain-expert
Clustering                CorText Analysis
Visualization - Static    Gephi + Sigma JS plugin, CorText MapExplorer, Inkscape
Visualization - Dynamic   CorText Heatmaps, Tubes, Distant Reading
Lexical Extraction
• CorText native option
– Noun-Phrase chunks (based on TreeTagger)
• Our options:
– Entity Linking / Wikification to DBpedia
– Keyphrase extraction tools like YaTeA
• In all cases: manual selection of pre-ranked
candidate terms by a domain-expert
Entity Linking / Wikification
• Given a database with encyclopedic
knowledge (e.g. Wikipedia)
- Find references (mentions) to database terms in the text
- Deal with variability in the mentions of a term
Corpus mentions (example): judicatory, judicial, judicature, Judicatory, Judicial
Entity Linking / Wikification
• Tool: DBpedia Spotlight
• Compares the context of sequences of
words in a text against DBpedia articles:
– Term definition’s text
– Links
– DBpedia structure (redirections etc.)
• Assigns a DBpedia term to the sequence if
a good match is found
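The matching idea on this slide can be illustrated with a toy linker. This is not DBpedia Spotlight's algorithm or API: it only compares the words around a mention with a short description of each candidate, using bag-of-words cosine similarity, and keeps the best candidate if it passes a threshold. The candidate identifiers, their descriptions and the threshold value are invented for the example.

```python
import math
import re
from collections import Counter
from typing import Dict, Optional


def bag(text: str) -> Counter:
    """Lower-cased bag of words."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def link(context: str, candidates: Dict[str, str],
         threshold: float = 0.1) -> Optional[str]:
    """Return the best-matching candidate, or None if no match is good enough."""
    ctx = bag(context)
    best, best_score = None, 0.0
    for term, description in candidates.items():
        score = cosine(ctx, bag(description))
        if score > best_score:
            best, best_score = term, score
    return best if best_score >= threshold else None


# Made-up descriptions standing in for knowledge-base article text.
candidates = {
    "Bill_(law)": "proposed law under consideration by a legislature parliament",
    "Bill_(payment)": "invoice statement of money owed for goods payment",
}
print(link("the bill was read in parliament before the legislature", candidates))
```

A real linker would also use link structure and redirections, as listed on the slide; the sketch covers only the textual context.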
Entity Linking / Wikification
Example terms and their variants
Term          Variants
Judiciary     judicature, judicatory, judicial
Jury          jury, juries
Monarch       king, monarch
Quantity      amount, quantity
Saint Peter   Simon Peter, Cephas
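Once variants are mapped to a term, occurrences can be counted per term rather than per surface form. A minimal sketch, using the single-word variants from the table above; the lower-casing step and the function name are assumptions made for the example, and multi-word variants such as "Simon Peter" are left out for brevity.

```python
from collections import Counter
from typing import Dict, Iterable

# Variant-to-term mapping taken from the example table; matching on the
# lower-cased token is an assumption made for this sketch.
VARIANTS: Dict[str, str] = {
    "judicature": "Judiciary", "judicatory": "Judiciary", "judicial": "Judiciary",
    "jury": "Jury", "juries": "Jury",
    "king": "Monarch", "monarch": "Monarch",
    "amount": "Quantity", "quantity": "Quantity",
}


def count_terms(tokens: Iterable[str]) -> Counter:
    """Count occurrences per term, collapsing all known variants."""
    counts: Counter = Counter()
    for token in tokens:
        term = VARIANTS.get(token.lower())
        if term is not None:
            counts[term] += 1
    return counts
```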
Entity Linking / Wikification
• Applying a current knowledge-base
(DBpedia) to 18th-19th century texts
• Is this a valid method?
Keyphrase extraction
• YaTeA (Aubin and Hamon, 2006)
• Extracts noun-phrases of configurable
structure and length
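To make "configurable structure and length" concrete, here is a toy extractor over tokens that are already part-of-speech tagged. It is not YaTeA (a separate tool with its own rules); the ADJ* NOUN+ pattern, the coarse tag names and the length bounds are assumptions chosen for the example.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, coarse POS tag)


def noun_phrases(tagged: List[Tagged], min_len: int = 2,
                 max_len: int = 4) -> List[str]:
    """Collect maximal ADJ* NOUN+ runs whose length lies in [min_len, max_len]."""
    phrases: List[str] = []
    run: List[Tagged] = []

    def flush() -> None:
        # Trim trailing adjectives so that the phrase ends in a noun.
        while run and run[-1][1] != "NOUN":
            run.pop()
        if min_len <= len(run) <= max_len:
            phrases.append(" ".join(word for word, _ in run))
        run.clear()

    for word, tag in tagged:
        if tag == "ADJ" and run and run[-1][1] == "NOUN":
            flush()  # an adjective after a noun starts a new candidate
        if tag in ("ADJ", "NOUN"):
            run.append((word, tag))
        else:
            flush()
    flush()
    return phrases
```

Changing `min_len`, `max_len` or the accepted tags changes which candidates are proposed to the domain expert.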
Clustering
• CorText offers several similarity metrics
– we chose the default method (distributional)
for homogeneous networks (Weeds & Weir 2005)
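The general idea of a distributional measure is that two terms are similar when they co-occur with the same other terms. The sketch below uses plain cosine similarity over document-level co-occurrence counts; this is a simpler stand-in, not the Weeds & Weir measure and not CorText's implementation, and the toy documents are invented.

```python
import math
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set


def cooccurrence(docs: Iterable[Set[str]]) -> Dict[str, Dict[str, int]]:
    """Count, for each pair of terms, the documents in which both occur."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for terms in docs:
        for a, b in combinations(sorted(terms), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts


def cosine(u: Dict[str, int], v: Dict[str, int]) -> float:
    """Cosine similarity between two co-occurrence profiles."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


docs = [
    {"pain", "pleasure", "utility"},
    {"pain", "punishment", "utility"},
    {"pleasure", "punishment", "utility"},
    {"jury", "judge"},
]
co = cooccurrence(docs)
print(cosine(co["pain"], co["pleasure"]), cosine(co["pain"], co["jury"]))
```

Terms with high pairwise similarity end up close together on the map; the clustering step then groups them.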
Visualization
• Static (one map for all dated transcripts)
• Dynamic: temporal slices on the corpus
– Heatmaps
– “River” or Sankey networks (“Tubes layout”)
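Temporal slicing can be sketched as grouping dated transcripts into periods and counting terms per period. The decade-sized slices below mirror the 1800-1809 and 1810-1819 subcorpora shown in the heatmaps; the record layout and function name are assumptions made for the example.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def slice_by_decade(docs: Iterable[Tuple[int, List[str]]]) -> Dict[str, Counter]:
    """Group (year, terms) records into decade slices and count terms per slice."""
    slices: Dict[str, Counter] = defaultdict(Counter)
    for year, terms in docs:
        start = year - year % 10
        slices[f"{start}-{start + 9}"].update(terms)
    return dict(slices)
```

Each slice can then be mapped or scored separately, which is what the dynamic views compare.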
http://apps.lattice.cnrs.fr/bentham
Static visualization
[Figures: CorText network visualized with Gephi]
Example term: Bill
Example term: happiness
CorText network made interactive thanks to Gephi’s Sigma JS Exporter
Example term: suffering
Example term: death
Examples: nodes linking clusters
Heatmaps: Saliency per subcorpus
Heatmaps: 1800-1809 subcorpus
Heatmaps: 1810-1819 subcorpus
Dynamic visualization
[Figures: time axis 1795, 1800, 1805, 1810]
Evaluation
• Static maps: terms in the clusters
correspond closely to issues dealt with by
Bentham for the thematic areas of each
cluster
• Heatmaps: The evolution depicted
corresponds to the evolution of topics in
Bentham’s work
• DBpedia vs. keyphrase extraction: The
keyphrases provide more relevant
evidence for specialized scholars; a
general encyclopedia can help other users
Challenges
• Deleted material
• Additions
Thematic Variety
• Animal Welfare
• Arts
• Capital punishment
• Civil Code
• Constitutional Code
• Convict transportation
• Correspondence
• Crime & Punishment
• Education
• Law
• Legislation
• Moral Philosophy
• New South Wales
• Panopticon
• Penal Code
• Political Economy
• Preventive Police
• Religion
• Science
• Sexual Morality
• Torture
Formal Variety
• Text sheets
• Copies / Fair copies
• Marginal summary sheets
• Correspondence
• Collectanea
• Rudiments
• Spencers
From http://www.transcribe-bentham.da.ulcc.ac.uk/td/Manuscripts and
http://www.benthampapers.ucl.ac.uk/help.aspx?subject=category
Distant Reading Module
• Follow evolution of selected lexical
sequences
Evolution of a lexical item
Temporal evolution
Temporal evolution profiles:
- Here: Rising, but present at all dates
- Other examples: falling, regular spikes etc.
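One hypothetical way to label such profiles automatically is to fit a straight line to a term's frequency per period and look at the sign of the slope. This is only an illustration and not how the Distant Reading Module works; the tolerance value is arbitrary, the input must be non-empty, and spiky profiles would need a different test.

```python
from typing import Sequence


def slope(values: Sequence[float]) -> float:
    """Least-squares slope of values against their positions 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den if den else 0.0


def profile(values: Sequence[float], tolerance: float = 0.05) -> str:
    """Label a frequency series as rising, falling or flat."""
    s = slope(values)
    if s > tolerance:
        return "rising"
    if s < -tolerance:
        return "falling"
    return "flat"
```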
Contexts: WordTree
Context evolution: Bump Charts
• Example: evil
Neighbours evolution (Bump Charts)
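The data behind a bump chart is a ranking of a term's neighbours in each period. A minimal sketch, assuming the co-occurrence counts per period have already been computed; the function name and the input layout are invented for the example.

```python
from collections import Counter
from typing import Dict, List


def neighbour_ranks(counts_by_period: Dict[str, Counter],
                    top_n: int = 3) -> Dict[str, List[str]]:
    """For each period, list the top_n neighbours by co-occurrence count."""
    return {
        period: [term for term, _ in counts.most_common(top_n)]
        for period, counts in counts_by_period.items()
    }
```

Plotting each neighbour's rank position across periods gives the crossing lines of the chart.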
Relations in the context: Egonetworks
• Example: relations among neighbours of evil
Evolution of neighbours’ relations
[Figures: Egonetworks for Period 2, Period 3 and Period 4]
Other Lexical Analyses
• TXM “textometry” tool [ http://textometrie.ens-lyon.fr/?lang=en ]
– Automatic part-of-speech tagging
– Partition texts according to metadata
– Query corpus using linguistic criteria
– Statistical analyses (overrepresentation, underrepresentation)
Lexical Analysis with TXM
• Partition the corpus according to Category,
Year, Decade, Main headings, or other
available metadata
• Number of words per Category
• Over- (or under-) representation of given
words per decade (after partitioning per decade)
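A common way to quantify over-representation in textometry is the hypergeometric model: given a word's total count in the corpus, how likely is it to see at least the observed count in one part by chance? The sketch below computes that tail probability directly. It illustrates the model only and is not TXM's code; the parameter names are chosen for the example, and exact binomials become slow on corpora of realistic size.

```python
from math import comb


def over_representation_p(k: int, n: int, K: int, N: int) -> float:
    """P(X >= k) for X ~ Hypergeometric(N, K, n).

    N: tokens in the whole corpus, K: occurrences of the word in the corpus,
    n: tokens in the part (e.g. one decade), k: occurrences in that part.
    """
    upper = min(K, n)
    total = comb(N, n)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total
```

A very small probability suggests that the word is over-represented in that decade; the symmetric lower tail would flag under-representation.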
TXM linguistic queries
• Evil followed by a noun, per text-category
• Sentences containing an adjective + evil
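In TXM such queries are written in the CQP query language. The Python below only mimics the two patterns over a list of tagged tokens, to show what they match; the coarse tag names are assumptions, and the functions return the matching word pairs rather than whole sentences.

```python
from typing import List, Tuple

Tagged = Tuple[str, str]  # (word, coarse POS tag); the tagset is an assumption


def evil_plus_noun(tokens: List[Tagged]) -> List[str]:
    """Return every 'evil' + noun pair of adjacent tokens."""
    return [
        f"{w1} {w2}"
        for (w1, _), (w2, t2) in zip(tokens, tokens[1:])
        if w1.lower() == "evil" and t2 == "NOUN"
    ]


def adjective_plus_evil(tokens: List[Tagged]) -> List[str]:
    """Return every adjective + 'evil' pair of adjacent tokens."""
    return [
        f"{w1} {w2}"
        for (w1, t1), (w2, _) in zip(tokens, tokens[1:])
        if t1 == "ADJ" and w2.lower() == "evil"
    ]
```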
Summary
• Accessing a large unedited corpus
– Cartography methods
• Lexical extraction
• Maps
– Static picture of the corpus
– Temporal evolution
– Other visualizations (Distant Reading, WordTree)
• Domain-expert feedback
• Challenges
• Other lexical analyses
http://apps.lattice.cnrs.fr/bentham
Bibliography
Aubin, S., and Hamon, T. (2006). Improving Term Extraction with Terminological Resources. In Advances in Natural Language Processing: 5th International Conference on NLP, FinTAL 2006, pp. 380-387. LNAI 4139. Springer.
Auer, Sören, et al. (2007). DBpedia: A nucleus for a web of open data. The Semantic Web. Springer.
Causer, Tim, and Terras, Melissa (2014a). Many hands make light work. Many hands together make merry work: Transcribe Bentham and crowdsourcing manuscript collections. In Crowdsourcing Our Cultural Heritage, ed. M. Ridge. Ashgate.
Causer, Tim, and Terras, Melissa (2014b). Crowdsourcing Bentham: Beyond the Traditional Boundaries of Academic History. International Journal of Humanities and Arts Computing, 8.
Chavalarias, David, and Jean-Philippe Cointet (2013). Phylomemetic Patterns in Science Evolution—The Rise and Fall of Scientific Fields. PLoS ONE 8 (2).
Cortext Manager Documentation (2016). https://docs.cortext.net/.
Mélanie, F., Tieberghien, E., Ruiz, P., Poibeau, T., Causer, T., and Terras, M. (2016). Mapping the Bentham Corpus. In Digital Humanities Conference (DH 2016). Kraków, Poland.
Mendes, Pablo N., Max Jakob, Andrés García-Silva, and Christian Bizer (2011). DBpedia Spotlight: Shedding Light on the Web of Documents. In Proceedings of the 7th International Conference on Semantic Systems, 1–8. ACM.
Poibeau, T., and Ruiz, P. (2015). Generating Navigable Semantic Maps from Social Sciences Corpora. In Digital Humanities Conference (DH 2015). Sydney, Australia.
Rule, Alix, Jean-Philippe Cointet, and Peter S. Bearman (2015). Lexical Shifts, Substantive Changes, and Continuity in State of the Union Discourse, 1790–2014. Proceedings of the National Academy of Sciences 112 (35).
Venturini, T., N. Baya Laffite, J.-P. Cointet, I. Gray, V. Zabban, and K. De Pryck (2014). Three Maps and Three Misunderstandings: A Digital Mapping of Climate Diplomacy. Big Data & Society 1.
Wattenberg, M., and Viégas, F. B. (2008). The word tree, an interactive visual concordance. IEEE Transactions on Visualization and Computer Graphics, 14(6), pp. 1221-1228.
Weeds, J., and Weir, D. (2005). Co-occurrence retrieval: A flexible framework for lexical distributional similarity. Computational Linguistics 31(4), 439–475.
& return you all due thanks
http://www.lattice.cnrs.fr/Pablo-Ruiz-Fabo,541
http://apps.lattice.cnrs.fr/