A Collaborative, Indeterministic and Partly Automatized Approach to ...

40
A Collaborative, Indeterministic and Partly Automatized Approach to Text Annotation Thomas Bögel 1 , Evelyn Gius 2 , Marco Petris 2 , Jannik Strötgen 1 The heureCLÉA Project 1 Heidelberg University, 2 University of Hamburg http://www.heureclea.de/ July 8, 2014

Transcript of A Collaborative, Indeterministic and Partly Automatized Approach to ...

A Collaborative, Indeterministic and PartlyAutomatized Approach to Text Annotation

Thomas Bögel1, Evelyn Gius2, Marco Petris2, Jannik Strötgen1

The heureCLÉA Project1Heidelberg University, 2University of Hamburg

http://www.heureclea.de/

July 8, 2014

A Collaborative, Indeterministic and PartlyAutomatized Approach to Text Annotation

Thomas Bögel1, Evelyn Gius2, Marco Petris2, Jannik Strötgen1

The heureCLÉA Project1Heidelberg University, 2University of Hamburg

http://www.heureclea.de/

July 8, 2014

Motivation NLP & UIMA UIMA-Catma

Motivation

Different types of annotations:some annotation tasks are complex, e.g.,

flashbacks, prolepsis, analepsis, ...

some annotation tasks are simple, e.g.,sentences, part-of-speech information, tenses, ...

Manual creation of simple annotationsslowtime-consumingboring

Spend time on complex tasks, add simple annotationsautomatically

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8

Motivation NLP & UIMA UIMA-Catma

Motivation

Different types of annotations:some annotation tasks are complex, e.g.,

flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,

sentences, part-of-speech information, tenses, ...

Manual creation of simple annotationsslowtime-consumingboring

Spend time on complex tasks, add simple annotationsautomatically

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8

Motivation NLP & UIMA UIMA-Catma

Motivation

Different types of annotations:some annotation tasks are complex, e.g.,

flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,

sentences, part-of-speech information, tenses, ...

Manual creation of simple annotationsslowtime-consumingboring

Spend time on complex tasks, add simple annotationsautomatically

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8

Motivation NLP & UIMA UIMA-Catma

Motivation

Different types of annotations:some annotation tasks are complex, e.g.,

flashbacks, prolepsis, analepsis, ...some annotation tasks are simple, e.g.,

sentences, part-of-speech information, tenses, ...

Manual creation of simple annotationsslowtime-consumingboring

Spend time on complex tasks, add simple annotationsautomatically

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 1 / 8

Motivation NLP & UIMA UIMA-Catma

Natural Language Processing

Automatic processing of text datagoals (among others):

information extractionautomatic annotations

UIMA

UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure

(Common Analysis Structure, CAS)

UIMA programs are processing pipelines

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8

Motivation NLP & UIMA UIMA-Catma

Natural Language Processing

Automatic processing of text datagoals (among others):

information extractionautomatic annotations

UIMA

UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure

(Common Analysis Structure, CAS)

UIMA programs are processing pipelines

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8

Motivation NLP & UIMA UIMA-Catma

Natural Language Processing

Automatic processing of text datagoals (among others):

information extractionautomatic annotations

UIMA

UIMA: Unstructured Information Management Architecturecomponent framework for unstructured data (e.g., text)helps to connect tools not developed to be used together→ all components rely on the same data structure

(Common Analysis Structure, CAS)

UIMA programs are processing pipelines

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 2 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

UIMA pipelines contain three components

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

AnalysisEngines

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

Collection Readerreads documents from source (e.g., file system, database)

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

CAS

Collection Readerreads documents from source (e.g., file system, database)instantiates a CAS for each document

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

CAS­ doc text­ metadata

Collection Readerreads documents from source (e.g., file system, database)instantiates a CAS for each documentinitializes CAS with doc text (metadata, etc.)

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

CAS­ doc text­ metadata

Analysis Enginesusually several Analysis Enginesanalyze the document

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

CAS­ doc text­ metadata

Analysis Enginesusually several Analysis Enginesanalyze the documentread content of the CAS

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

AnalysisEngines

CAS­ doc text­ metadata­ annotations

Analysis Enginesusually several Analysis Enginesanalyze the documentread content of the CASadd annotations to the CAS

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

CAS­ doc text­ metadata­ annotations

CAS Consumerreads content of the CAS

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

CAS­ doc text­ metadata­ annotations

CAS Consumerreads content of the CASdoes final processing

evaluation, visualization, indexing

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

CAS

UIMA - What’s the clue?

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

AnalysisEngines

CAS

UIMA - What’s the clue?single components are not directly connected to each otherinstead: CAS object

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Components of a UIMA Pipeline

   

CollectionReader

CASConsumer

DocsResults

CAS

AnalysisEnginesAnalysisEngines

UIMA - What’s the clue?single components are not directly connected to each otherinstead: CAS objectcomponents are independent of each othercomponents only have to be able to handle CAS

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 3 / 8

Motivation NLP & UIMA UIMA-Catma

Example Tasks

heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8

Motivation NLP & UIMA UIMA-Catma

Example Tasks

heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense

further projects:sentence splittingtokenizationpart of speech taggingtemporal expressionsgeographic expressionsnamed entities (persons)event extraction

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8

Motivation NLP & UIMA UIMA-Catma

Example Tasks

heureCLÉA project:sentence splittingtokenizationpart of speech taggingtemporal expressionstemporal signalstense

further projects:sentence splittingtokenizationpart of speech taggingtemporal expressionsgeographic expressionsnamed entities (persons)event extraction

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 4 / 8

Motivation NLP & UIMA UIMA-Catma

The Temporal Tagger HeidelTime

Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08

Most of the work so far: English news documents

HeidelTime: Multilingual, Cross-domain Temporal Tagger

Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian

Publicly available:UIMA & standalone versions, online demo@ Google Code

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8

Motivation NLP & UIMA UIMA-Catma

The Temporal Tagger HeidelTime

Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08

Most of the work so far: English news documents

HeidelTime: Multilingual, Cross-domain Temporal Tagger

Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian

Publicly available:UIMA & standalone versions, online demo@ Google Code

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8

Motivation NLP & UIMA UIMA-Catma

The Temporal Tagger HeidelTime

Extraction and normalization of temporal expressionsJuly 8, 2014 → 2014-07-08today → 2014-07-08

Most of the work so far: English news documents

HeidelTime: Multilingual, Cross-domain Temporal Tagger

Current languages:English, Spanish, German, French, ItalianDutch, Arabic, Vietnamese, Chinese, Russian

Publicly available:UIMA & standalone versions, online demo@ Google Code

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 5 / 8

Motivation NLP & UIMA UIMA-Catma

The UIMA-Catma Pipeline

   

CollectionReader

AnalysisEngines

CASConsumer

DocsResults

AnalysisEngines

Collection Reader: reads documents from a sourceAnalysis Engines: add annotationsCAS Consumer: does final processing

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 6 / 8

Motivation NLP & UIMA UIMA-Catma

The UIMA-Catma Pipeline

   

CATMACollectionReader

CATMACAS

Consumer

AnalysisEngines

Collection Reader: reads document from CATMAAnalysis Engines: add annotationsCAS Consumer: sends annotations to CATMA

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 6 / 8

Motivation NLP & UIMA UIMA-Catma

Live Demo

So far: manual annotations of adverbs and superlatives→ nice annotation task to get familiar with CATMA→ boring if you want to annotate a whole book

Our CATMA-UIMA workflow:CATMA Collection Reader → get documents from CATMATreeTagger wrapper (part of UIMA HeidelTime kit)HeidelTime→ sentence splitting, tokenization, part-of-speech tagging→ temporal expressions

CATMA CAS Consumer: → send annotations to CATMA

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 7 / 8

Motivation NLP & UIMA UIMA-Catma

Live Demo

So far: manual annotations of adverbs and superlatives→ nice annotation task to get familiar with CATMA→ boring if you want to annotate a whole book

Our CATMA-UIMA workflow:CATMA Collection Reader → get documents from CATMATreeTagger wrapper (part of UIMA HeidelTime kit)HeidelTime→ sentence splitting, tokenization, part-of-speech tagging→ temporal expressions

CATMA CAS Consumer: → send annotations to CATMA

DH’14, Lausanne Partly Automatized Approach to Text Annotation c© Bögel, Strötgen 7 / 8