
Experiences with UIMA in NLP teaching and research

Manuela Kunze,

Dietmar Rösner

University of Magdeburg Knowledge Based Systems and Document Processing


Overview

• What is UIMA?

• First Experiments

• NLP Teaching

• Conclusion


UIMA: Unstructured Information Management Architecture

• a software architecture for developing and deploying unstructured information management (UIM) applications

• UIM application: a software system that analyses large volumes of unstructured information to
  – discover,
  – organize, and
  – deliver relevant knowledge to the end user

• software architecture which specifies
  – component interfaces, data representations, …

• http://www.research.ibm.com/UIMA/


UIMA: Unstructured Information Management Architecture

… interfaces to a collection of data items (e.g., documents) to be analyzed. Collection Readers return CASes that contain the documents to analyze, possibly along with additional metadata.

… takes a CAS, analyzes its contents, and produces an enriched CAS. Analysis Engines can be recursively composed of other Analysis Engines (called an Aggregate Analysis Engine). Aggregates may also contain CAS Consumers.

… may be used by a Collection Reader to populate a CAS from a document. An example of a CAS Initializer is an HTML parser that de-tags an HTML document and also inserts paragraph annotations (determined from <P> tags in the original HTML) into the CAS.

… consume the enriched CAS that was produced by the sequence of Analysis Engines before it, and produce an application-specific data structure, such as a search engine index or database.

CAS: Common Analysis Structure
CPE: Collection Processing Engine

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]


UIMA: Unstructured Information Management Architecture

• Analysis Engine (AE):
  – a component that analyzes artifacts (e.g. documents) and infers information about them
  – consists of two parts:
    • Java classes (typically packaged as one or more JAR files) and
    • AE descriptors (one or more XML files), containing
      – the configuration settings for the Analysis Engine as well as
      – a description of the AE's input and output requirements.


UIMA: Unstructured Information Management Architecture

[Diagram: parts of an analysis engine, with regions labelled "Java" and "XML"]

• type system: define annotation type
  – name
  – features (begin, end, …)
• analysis engine: describe analysis engine
  – annotator class
  – input parameter
  – output of annotations
  – external resources
  – linked to a type system
• Annotator: define an annotator
  – interface
  – resources
• Annotation Interface
• processing resources
• edge labels in the diagram: uses, create


UIMA: Unstructured Information Management Architecture

• Aggregate Analysis Engine:
  – combines different analysis engines within one Analysis Engine

[Ferrucci et al.: Unstructured Information Management Architecture (UIMA): SDK User's Guide and Reference]
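
The idea of composing analysis engines can be sketched in a few lines of Python. This is only an illustration of the concept and not the UIMA API: `Cas`, `Annotation`, `token_annotator`, `number_annotator` and `aggregate` are hypothetical stand-ins invented for this example.

```python
from dataclasses import dataclass, field
import re


@dataclass
class Annotation:
    type: str   # annotation type name, e.g. "Token"
    begin: int  # character offset where the span starts
    end: int    # character offset just past the span


@dataclass
class Cas:
    """Hypothetical stand-in for a CAS: the document text plus its annotations."""
    text: str
    annotations: list = field(default_factory=list)


def token_annotator(cas):
    # Primitive engine: mark every run of non-whitespace characters as a Token.
    for m in re.finditer(r"\S+", cas.text):
        cas.annotations.append(Annotation("Token", m.start(), m.end()))


def number_annotator(cas):
    # Primitive engine that builds on the Token annotations already in the CAS.
    for a in [a for a in cas.annotations if a.type == "Token"]:
        if cas.text[a.begin:a.end].isdigit():
            cas.annotations.append(Annotation("Number", a.begin, a.end))


def aggregate(*engines):
    # An aggregate is itself an engine: it runs its delegates in order on one CAS.
    def run(cas):
        for engine in engines:
            engine(cas)
    return run


pipeline = aggregate(token_annotator, number_annotator)
cas = Cas("Room 12 opens at 9")
pipeline(cas)
numbers = [cas.text[a.begin:a.end] for a in cas.annotations if a.type == "Number"]
```

Because `aggregate` returns something with the same call shape as a primitive engine, aggregates can be nested inside other aggregates, which mirrors the recursive composition described in the SDK guide.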


Overview

• Introduction

• First Experiments

• NLP Teaching

• Conclusion


First Experiments: UIMA vs. GATE

• base line:
  – 2 persons, 2 systems, 1 corpus and 1 extraction task
  – skills/experiences of the persons:

[Table: rows Person 1 and Person 2; columns UIMA, GATE, Eclipse/Java; cell values missing]


Task of the Experiment

• process a corpus of websites
  – to detect and extract information relevant for tourists
    • opening times of museums, prices of hotels, …

• corpus:
  – 30 tourism web sites of Egypt
  – additional 20 web sites of Washington, New York, London

• output:
  – Prolog facts for a reasoner
  – Questions:
    • Which museum is now open?
    • …


Evaluation Topics/Points

• ease of getting acquainted with the system?

– quality of documentation: completeness, clarity, up-to-dateness, …?

– tutorials, use cases, …?

• processing and linguistic resources?

– lexica, Gazetteer lists, tools

• tools for resource maintenance and extension?

– quality: self-explanatory, robust, comfortable

• speed of processing?

• single document vs. large corpora?

• limitations, suggestions for improvement?

• support for im-/export of a variety of document formats?


Excerpts from the Corpus

• The Egyptian Museum is open the hours: 9am-5pm daily

• The Military Museum is open the hours: Summer: 8am-5:30pm; winter: 8am-4:30pm

• Palace Museum is open the hours: 8am-5:30pm (summer) 8am-4:30pm (winter)

• 10am-2pm, 6pm-9pm Sat-Wed; 6pm-9pm Fri

• …


UIMA Application

• several annotators (like a pipeline)

[Pipeline diagram. Annotators, in order: museum pattern, time pattern, interval of times, restrictions, museum information. Techniques named in the diagram: regular expressions (three times), window covering two time intervals and a restriction, window covering a museum and opening hours.]

Input text:
... *Fraunces Tavern Museum*
54 Pearl St. - 1-212-425-1778
Tuesday-Friday, 12pm-5pm; …

Prolog facts:
museumopen('Fraunces Tavern Museum ', '2005-12-01T12:00:00', '2005-12-01T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-02T12:00:00', '2005-12-02T17:00:00').
museumopen('Fraunces Tavern Museum ', '2005-12-03T12:00:00', '2005-12-03T17:00:00').
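
A minimal sketch of the regular-expression step and the Prolog export, written in Python rather than as a UIMA annotator. It assumes that intervals are written with a plain hyphen and a 12-hour clock, as in the corpus excerpts. The names `to_24h`, `find_intervals` and `prolog_fact` are hypothetical, the date is passed in by hand, and the weekday restriction ("Tuesday-Friday") is not handled here.

```python
import re

# One clock time such as "9am", "5:30pm" or "12pm" (12-hour clock, minutes optional).
TIME = r"(\d{1,2})(?::(\d{2}))?(am|pm)"
# Two clock times joined by a hyphen form an opening interval, e.g. "8am-5:30pm".
INTERVAL = re.compile(TIME + r"-" + TIME)


def to_24h(hour, minute, meridiem):
    """Convert 12-hour clock parts to an 'HH:MM:SS' string."""
    h = int(hour) % 12          # maps 12 to 0 so that 12am -> 00 and 12pm -> 12
    if meridiem == "pm":
        h += 12
    return f"{h:02d}:{int(minute or 0):02d}:00"


def find_intervals(text):
    """Return (open, close) pairs for every time interval found in text."""
    return [(to_24h(*m.groups()[:3]), to_24h(*m.groups()[3:]))
            for m in INTERVAL.finditer(text)]


def prolog_fact(museum, date, interval):
    """Format one museumopen/3 fact in the style of the facts shown above."""
    start, end = interval
    return f"museumopen('{museum}', '{date}T{start}', '{date}T{end}')."


text = "Fraunces Tavern Museum 54 Pearl St. Tuesday-Friday, 12pm-5pm"
facts = [prolog_fact("Fraunces Tavern Museum", "2005-12-01", iv)
         for iv in find_intervals(text)]
```

In the application described on the slide, the museum name and the restriction come from separate annotators and are combined through a window over neighbouring annotations. The sketch covers only the time-interval part.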


UIMA: Results

• information annotated in the documents:
  – names of museums, hotels
  – times, time intervals
  – time restrictions
  – prices, intervals of prices (hotel prices)
  – keywords for museum category
  – names of pharaohs (annotated with a correction of misspellings)

• information about hotels and museums is exported into Prolog facts and into a short textual summary
  – templates filled with the detected information
    • hotels: Price information about Cosmopolitan Hotel : $157
    • museums:
      *** *Fraunces Tavern Museum* ***
      Open from 12:00:00 to 17:00:00;
      Restriction: Tuesday-Friday


UIMA vs. GATE: Conclusion

• no final judgement about: use GATE or UIMA
  – depends on
    • your task
      – task description
      – expected results
      – which processing resources are necessary
    • your preferences for interface
      – prefer the Eclipse environment (or other Java editors)
      – prefer a comfortable GUI


UIMA vs. GATE: Conclusion

• GATE:
  – tools available
  – comfortable GUI

• UIMA:
  – plain framework
  – simplified definition of (complex) result structures
  – simplified pre- and postprocessing of annotations

• both are extensible
  – e.g. for processing German documents


'German' Extension of Processing Resources

• XDOC document suite
  – tools for processing German documents
  – tools implemented in Common Lisp

• for UIMA
  – Java reimplementation of the tools
  – several analysis engines


XDOC in UIMA

• annotation of
  – part-of-speech (Morphix, heuristics)
  – semantic categories
  – named entities (vehicles, cities, …)

• a coarse approach for classification of PP
  – using maxent library


UIMA: Evaluation

documentation?

- good
- illustrative examples (tutorial)
- completeness: some parts are described only very briefly
- prior experience with Eclipse and Java programming is helpful


UIMA: Evaluation

processing and linguistic resources?

- annotators only from tutorial
  - sentence annotation
  - word annotation
  - date/time annotators
  - examples for using regular expressions etc.
- external resources can be integrated:
  - lexical resources as external resources (text files)
  - existing processing resources: implementation of an interface is necessary
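
Using a plain text file as a lexical resource can be illustrated with a small gazetteer lookup. This is a Python sketch under assumptions, not UIMA's external-resource mechanism: `load_gazetteer` and `lookup` are hypothetical helpers, and `io.StringIO` stands in for an opened text file with one entry per line.

```python
import io
import re


def load_gazetteer(lines):
    """Read one entry per line, skipping blank lines."""
    return [line.strip() for line in lines if line.strip()]


def lookup(text, entries, type_name):
    """Return (type, begin, end) for every whole-word occurrence of an entry."""
    # Longest entries first so that "New York" wins over the shorter entry "York".
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    pattern = re.compile(r"\b(?:" + alternatives + r")\b")
    return [(type_name, m.start(), m.end()) for m in pattern.finditer(text)]


# io.StringIO stands in for an opened text file such as a hypothetical cities.txt.
resource = io.StringIO("London\nNew York\nYork\n\nWashington\n")
cities = load_gazetteer(resource)
text = "Museums in New York and London"
found = [text[b:e] for _, b, e in lookup(text, cities, "City")]
```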


UIMA: Evaluation

tools for resource maintenance and extension?

- specific Eclipse component editors or
- simple text editors


UIMA: Evaluation

speed of processing?

- faster than GATE?
- the CPE gives detailed information about processing time for each module


UIMA: Evaluation

single docs vs. large corpora?

- Collection Reader
  - document(s) from a directory


UIMA: Evaluation

limitations, suggestions for improvement?

• no limitations:
  – all is possible, but implementation or interfacing by user

• wish:
  – more processing and linguistic resources within the distribution


UIMA: Evaluation

im-/export of document formats?

- import: CAS Initializer
- export: CAS Consumer
  - transform annotations in any other format
  - export of
    - document + annotations
    - only annotations
  - required: Java application


Overview

• Introduction

• First Experiments

• NLP Teaching

• Conclusion


NLP Teaching

• course: Information Extraction

• aim of the course: to acquaint our students with information extraction as a basic NLP technology
  – UIMA, GATE

• students: computer science, data-knowledge engineering

• skills of the students: Java programming


NLP Teaching

• different corpora:
  – news about the FIFA World Cup 2006 in Germany,
  – descriptions of drugs,
  – announcements of new books, …

• tasks for students
  – to develop different analysis engines and combine them for annotation of
    • URLs,
    • email addresses,
    • names of players,
    • results of games, …
  – using regular expressions, external resources, maximum entropy models
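
A regular-expression annotator for URLs and email addresses, of the kind set as a student task, might look roughly like the following Python sketch. The patterns are deliberately simple assumptions for illustration, and `PATTERNS` and `annotate` are hypothetical names. In the course itself such annotators were written as UIMA analysis engines.

```python
import re

# Deliberately simple patterns for a classroom exercise; they do not cover
# every valid URL or address (e.g. trailing punctuation is not stripped).
PATTERNS = {
    "Email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "Url": re.compile(r"https?://[^\s<>\"]+"),
}


def annotate(text):
    """Return (type, begin, end) triples sorted by start offset."""
    spans = [(name, m.start(), m.end())
             for name, pattern in PATTERNS.items()
             for m in pattern.finditer(text)]
    return sorted(spans, key=lambda s: s[1])


text = "Tickets: http://www.example.org/tickets or mail info@example.org today"
spans = annotate(text)
covered = [(name, text[b:e]) for name, b, e in spans]
```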


UIMA: A Student's View

• easy to handle
• Java programming (environment)
• problems of students:
  – understanding the dependencies between the several descriptors

• helpful for teaching (future work):
  – a 'comparator' of different student solutions
  – which solution is the best, relative to a 'master' solution
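
The slides do not say how such a comparator would score solutions. One plausible reading, given here only as an assumption, is to compare each student's annotation set with the master set using precision, recall and F1 over exact span matches. The `compare` function and the sample offsets are made up for illustration.

```python
def compare(student, master):
    """Score a student's annotation set against a reference ('master') set.

    Both arguments are sets of (type, begin, end) triples; only exact
    matches count as correct.
    """
    correct = len(student & master)
    precision = correct / len(student) if student else 0.0
    recall = correct / len(master) if master else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


master = {("Email", 10, 26), ("Url", 40, 62), ("Player", 70, 82)}
solution_a = {("Email", 10, 26), ("Url", 40, 62)}
solution_b = {("Email", 10, 26), ("Url", 40, 60),
              ("Player", 70, 82), ("Player", 90, 99)}

scores = {name: compare(s, master)
          for name, s in [("A", solution_a), ("B", solution_b)]}
# Rank solutions by F1, the third element of each score tuple.
best = max(scores, key=lambda name: scores[name][2])
```

Exact matching is strict: solution B loses credit for a URL span that is off by two characters. A more lenient comparator could count overlapping spans as partial matches.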


Overview

• Introduction

• First Experiments

• NLP Teaching

• Conclusion


Conclusion

• UIMA:
  – easy to learn and to handle
  – supports the management of
    • different annotations
    • different processing resources
  – integration of external resources (processing resources as well as lexical resources)
  – splitting of 'processing steps':
    • reader, initializer, analysis engine, consumer

• 'wish-list':
  – a kind of JAPE transducer
    • interface to GATE's processing resources is available
  – 'comparator' for evaluation of solutions