NLP and Data Mining: From ChartEx to Traces Through Time and beyond

Dr Roger Evans
Natural Language Technology Group & Cultural Informatics Research Group, University of Brighton

Transcript of IHR seminar, June 2015

One man, two guvnors

ChartEx | TTT
'Deep' processing

Two men, two guvnors

ChartEx | TTT
Natural language processing (Brighton)
Data mining (Leiden)

ChartEx Architecture

[Architecture diagram. Development track: expert elicitation over 5-10 charters yields a markup scheme; manual markup of 100-200 charters yields marked-up charters, which drive NLP development and DM development. Runtime track: 1000's of charters flow through natural language processing and data mining into the ChartEx repository, accessed via a virtual workbench (VWB requirements feeding VWB development, alongside repository development).]

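To make the runtime track concrete, here is a minimal Python sketch of a ChartEx-style flow: charters pass through a toy NLP extraction step, a toy data-mining step links documents that share entities, and the results land in a stand-in repository. The names and the extraction/linking logic are illustrative assumptions, not the project's actual code.

    import re
    from dataclasses import dataclass

    @dataclass
    class Charter:
        doc_id: str
        text: str

    @dataclass
    class Record:
        doc_id: str
        entities: list

    def nlp_extract(charter):
        # Toy 'NLP': treat capitalised tokens as candidate names and places.
        return Record(charter.doc_id, re.findall(r"\b[A-Z][a-z]+\b", charter.text))

    def data_mine(records):
        # Toy 'data mining': link any two documents that share an entity.
        links = []
        for i, a in enumerate(records):
            for b in records[i + 1:]:
                shared = set(a.entities) & set(b.entities)
                if shared:
                    links.append((a.doc_id, b.doc_id, sorted(shared)))
        return links

    charters = [
        Charter("c1", "Grant by William de Percy of land in Bootham"),
        Charter("c2", "Confirmation to William de Percy of the same land"),
    ]
    records = [nlp_extract(c) for c in charters]
    # Stand-in for the ChartEx repository, which a virtual workbench would query.
    repository = {"records": records, "links": data_mine(records)}
    print(repository["links"])  # [('c1', 'c2', ['Percy', 'William'])]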

Runtime architecture

TTT architecture

[Architecture diagram: documents pass through shallow language processing to extract content, with deep language processing as a further stage; comparison and optimisation/statistics drive record linkage over the extracted records, and the linked results feed visualisation.]
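The comparison and optimisation boxes can be illustrated with a hedged sketch of pairwise record comparison: score candidate pairs field by field and keep those above a threshold. The similarity measure, field weights, and threshold here are assumptions for illustration, not TTT's actual method.

    from difflib import SequenceMatcher

    def similarity(a, b):
        # Crude string similarity in [0, 1].
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    def compare(rec_a, rec_b, weights=None):
        # Weighted average of per-field similarities.
        weights = weights or {"name": 0.6, "place": 0.4}
        return sum(w * similarity(rec_a.get(f, ""), rec_b.get(f, ""))
                   for f, w in weights.items())

    def link(records, threshold=0.75):
        # Keep pairs whose comparison score clears the threshold.
        return [(a["id"], b["id"], round(compare(a, b), 2))
                for i, a in enumerate(records)
                for b in records[i + 1:]
                if compare(a, b) >= threshold]

    records = [
        {"id": "r1", "name": "William de Percy", "place": "Bootham"},
        {"id": "r2", "name": "Will. de Percy", "place": "Botham"},
        {"id": "r3", "name": "John Smith", "place": "London"},
    ]
    print(link(records))  # links r1 and r2; r3 stands apart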

Comparison

[Both architecture diagrams shown side by side: the TTT runtime pipeline and the ChartEx pipeline, as above, compared along three dimensions.]

Range of data:
• ChartEx: medieval charters, English and Latin, free text
• TTT: early and modern records, text and data

Analytic complexity:
• ChartEx: focus on places, broad relational view
• TTT: focus on people, detailed view

Target users:
• ChartEx: 'researchers', controlled environment
• TTT: web users, less control
• (Heritage) enterprise: bespoke

What can Computer Science do?

• State of the art is broadly based on statistics
• Answers are always only approximate
• Different kinds of approximation:
  • Precision: focus on making sure answers are right (but may miss some)
  • Recall: focus on getting as many right answers as possible (but may give some wrong answers too)

Precision and recall
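The two kinds of approximation have standard definitions, sketched below; the answer sets and names are invented for illustration, not project results.

    def precision_recall(retrieved, relevant):
        # Precision: what fraction of the answers we gave are right?
        # Recall: what fraction of the right answers did we find?
        true_positives = len(retrieved & relevant)
        precision = true_positives / len(retrieved) if retrieved else 0.0
        recall = true_positives / len(relevant) if relevant else 0.0
        return precision, recall

    relevant = {"r1", "r2", "r3", "r4"}            # the genuinely right answers
    cautious = {"r1", "r2"}                        # precision-first: right but incomplete
    greedy = {"r1", "r2", "r3", "r4", "x", "y"}    # recall-first: complete but noisy

    print(precision_recall(cautious, relevant))    # (1.0, 0.5)
    print(precision_recall(greedy, relevant))      # (0.666..., 1.0)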

What does Digital Humanities want?

• Perfect results?
  • How do you respond if we say we can't do that?
• Control over the tradeoff? (a threshold sketch follows below)
  • How easy is it to understand what control you have?
  • Does this help you interpret the results you get?
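One concrete form that control could take is a confidence threshold the researcher can move. This sketch, with invented scores rather than real system output, shows how sweeping the threshold trades precision against recall.

    scored = [("r1", 0.95), ("r2", 0.90), ("x", 0.80),
              ("r3", 0.60), ("y", 0.40), ("r4", 0.30)]   # (answer, confidence)
    relevant = {"r1", "r2", "r3", "r4"}

    for threshold in (0.9, 0.7, 0.5, 0.2):
        retrieved = {a for a, score in scored if score >= threshold}
        tp = len(retrieved & relevant)
        print(f"threshold={threshold:.1f}  "
              f"precision={tp / len(retrieved):.2f}  "
              f"recall={tp / len(relevant):.2f}")
    # Lowering the threshold raises recall and lowers precision.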

Where are we now, and where are we going?

• Human in the loop
  • Tools always require human interpretation of results
  • Is this really just a cop-out by computer scientists?
  • Or just a pragmatic expression of the state of the art?
• Deskilling
  • Do we really mean an expert in the loop?
• Conversations
  • Are we really only just at the point of negotiating what is possible and what is required?