[poster] Structured and Unstructured: Extracting Information from Classics Scholarly Texts


description

Poster I have presented at the DH2010 conference on my ongoing PhD project

Transcript of [poster] Structured and Unstructured: Extracting Information from Classics Scholarly Texts


EXTRACTING INFORMATION FROM CLASSICS SCHOLARLY TEXTS

Matteo Romanello, [email protected]

Centre of Computing in the Humanities (CCH), King's College London

THE PROJECT AT A GLANCE

● PhD in Digital Humanities (DH)

● Project started in October 2009

● Project co-supervised by:

  ● Willard McCarty (CCH KCL)

  ● Jonathan Ginzburg (DCS KCL)

● Supported by the Arts and Humanities Research Council (AHRC)

INFORMATION RETRIEVAL IN CLASSICS

● APh: state of the art in IR (Information Retrieval) in the Classics

● Model: centralized, analytic (selective), manual labour, highly (absolutely?) accurate, subscription-based access

● Future(s): How can we complement it? With which tools? What can be improved in the tool used by the majority of Classicists to retrieve bibliography?

...OTHER POSSIBLE MODEL(S)

● Open source and open access

● Decentralized: harvesting rather than centralising

● Automatic: how to reduce the labour of manual annotation?

● Errorful/noisy input and acceptable error rate

● Measurable completeness/exhaustivity

● How do the two approaches compare to each other?

WHICH CORPUS TO BE MINED?

3 scenarios:

1) LEXIS: electronic version of a Classics journal. Online papers are PDFs produced by scanning the printed copies (open access, no OCR, low-quality images);

2) PSWPC (Princeton/Stanford Working Papers in Classics): open archive; text extracted from the PDFs is often inaccurate (e.g. Greek sequences);

3) JSTOR's Data for Research API: no control over text processing; access to data for ~180k Classics papers (word frequencies, extracted references, bigrams/trigrams, key terms).
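To make the bigram/trigram data mentioned in the third scenario concrete, here is a minimal sketch of how such counts can be derived from plain text. It is an illustration only: the function names `tokenize` and `ngram_counts` and the sample sentence are assumptions, and it does not reproduce how JSTOR's service computes its own figures.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return runs of letters (digits and punctuation are dropped)."""
    return re.findall(r"[^\W\d_]+", text.lower())


def ngram_counts(tokens, n):
    """Count contiguous n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Hypothetical sample text, not taken from any of the corpora above.
sample = "The wrath of Achilles and the wrath of Apollo"
tokens = tokenize(sample)
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)

print(bigrams.most_common(2))
print(trigrams.most_common(1))
```

With noisy input such as scanned or badly extracted text, the same counting applies, but the tokens themselves may be corrupted, which is one reason an acceptable error rate has to be defined.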

A KNOWLEDGE-BASED APPROACH

Idea:

● exploiting accurate information contained in structured data sources;

● using information in the KB to train machine-learning based components;

Obstacles:

● different (heterogeneous) data formats, XML dialects, DB schemas etc.;

● lack of semantic interoperability.

Possible solution: using high-level ontologies (e.g. CIDOC CRM) to make data interoperable.

Once integrated into a Knowledge Base, that information can be used to train machine-learning based components.
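One way to picture how Knowledge Base content could feed a machine-learning component is automatic labelling of training data: known names are matched against text and the matches become labels. The sketch below assumes a toy dictionary `KB` with invented entries and a hypothetical helper `label_tokens`; it is not the project's actual pipeline, and it ignores the ontology mapping step entirely.

```python
# Hypothetical knowledge-base entries: token sequence -> entity type.
KB = {
    ("Hom.",): "AUTHOR",
    ("Il.",): "WORK",
    ("Apollonius", "Rhodius"): "AUTHOR",
}


def label_tokens(tokens, kb):
    """Assign BIO labels to tokens by longest-match lookup in the knowledge base."""
    labels = ["O"] * len(tokens)
    max_len = max(len(key) for key in kb)
    i = 0
    while i < len(tokens):
        matched = False
        # Try the longest candidate span first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in kb:
                labels[i] = "B-" + kb[key]
                for j in range(i + 1, i + n):
                    labels[j] = "I-" + kb[key]
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


tokens = "See Hom. Il. 1.1 and Apollonius Rhodius 3.1".split()
labels = label_tokens(tokens, KB)
for token, label in zip(tokens, labels):
    print(token, label)
```

Labels produced this way are only as good as the Knowledge Base, so they would normally be treated as noisy training data rather than as a gold standard.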

EXTRACTING CANONICAL REFERENCES

Classicists usually refer to primary sources by means of abbreviated references, called canonical references.

Explicit linking of the primary and secondary sources contained in a Digital Library requires the ability to automatically extract and interpret such references, which is still an open problem.

CRefEx is a tool I'm developing for the automatic extraction of canonical references.
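To give a sense of what a canonical reference looks like to a program, here is a deliberately naive, pattern-based sketch. It is not how CRefEx works: the abbreviation list, the `SCOPE` pattern, and the function name `extract_canonical_references` are all assumptions made for illustration, and a fixed list of this kind would miss most real-world variation.

```python
import re

# Hypothetical, tiny list of author/work abbreviations; a real list would be far longer.
ABBREVIATIONS = ["Hom. Il.", "Hom. Od.", "Thuc.", "Verg. Aen."]

# Scope: dot-separated numbers, optionally followed by a range, e.g. "2.35" or "1.1-10".
SCOPE = r"\d+(?:\.\d+)*(?:\s*[-–]\s*\d+(?:\.\d+)*)?"

PATTERN = re.compile(
    r"(?P<work>" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
    r"\s+(?P<scope>" + SCOPE + r")"
)


def extract_canonical_references(text):
    """Return (abbreviation, scope) pairs for every match of the pattern in the text."""
    return [(m.group("work"), m.group("scope")) for m in PATTERN.finditer(text)]


# Hypothetical sentence of the kind found in secondary literature.
text = "Compare Hom. Il. 1.1-10 with Thuc. 2.35 and Verg. Aen. 6.847."
refs = extract_canonical_references(text)
print(refs)
```

The limits of this approach (unknown abbreviations, variant punctuation, references split across lines, noisy OCR) illustrate why extraction remains difficult and why a trainable component is attractive.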

EXTRACTING BIBLIOGRAPHIC REFERENCES

The task of extracting modern bibliographic references for the purpose of Automatic Citation Indexing is well established in disciplines such as Computer Science.

However, some assumptions behind a tool such as ParsCit (used to build CiteSeerX) are not always applicable to Classics papers.

This is an example of how the complexity of Humanities materials can lead to an improvement of tools and algorithms drawn from other fields.

EMERGING RESEARCH QUESTIONS

● How did the practice of citing ancient texts change after tools such as the TLG were introduced?

● What do the citation networks in Classics look like?

● What emerges from the comparison with citation networks in other disciplines?

● What are the access points to information that are meaningful for Classicists?

The digital image of the Venetus A (Marc. Graec. Z. 454) used for the graphics has been produced by the CHS (Harvard U).