Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing


Centro Interdisciplinar de Documentação Linguística e Social, Minde/Portugal

Vera Ferreira, [email protected]; Bouda, [email protected]; Lopes, [email protected]

Language documentation

Aim of developing a "lasting, multipurpose record of a language"

Collection, distribution, and preservation of primary data of a variety of communicative events

Data is normally transcribed, translated, and it should also be annotated

Archives to preserve and publish documentation:

The Language Archive

Endangered Languages Archive (ELAR)

Natural Language Processing

Any kind of computer manipulation of natural language

Mostly for major languages like English, Spanish, German, etc.

NLP is rarely used on LD data

Archiving needs led to digitization

Now we see corpus-based XYZ in General Linguistics

Individual examples are hand-picked

(Semi-)automated tagging of lesser-known languages

Quantitative Language Comparison

In contrast to corpus linguistics (see Michael Cysouw's research group)

Based on LD data, bible texts, movie subtitles etc.

Supports typological research

Annotation Graphs, LAF and GrAF

ISO standard 24612 "Language resource management - Linguistic annotation framework (LAF)"

Annotation graphs as the underlying data model for linguistic annotations

Developed for MASC of the American National Corpus

Existing connectors for UIMA and GATE

Radical stand-off approach

Unsupervised collaboration
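The stand-off idea can be illustrated with a minimal sketch: the primary text is stored once and never modified, and every annotation only points into it through character offsets. The names `Region` and `Annotation`, the sample text, and the labels are illustrative assumptions; they are not taken from LAF/GrAF or from any library.

```python
from dataclasses import dataclass

# Primary data is stored once and never modified (hypothetical sample text).
PRIMARY_TEXT = "the dog barks"


@dataclass(frozen=True)
class Region:
    """A span of the primary text, given as character offsets."""
    start: int
    end: int

    def text(self, primary: str) -> str:
        return primary[self.start:self.end]


@dataclass(frozen=True)
class Annotation:
    """A label/value pair that refers to a region instead of embedding text."""
    label: str
    value: str
    region: Region


# Layers can be added independently; none of them touches PRIMARY_TEXT.
annotations = [
    Annotation("pos", "DET", Region(0, 3)),
    Annotation("pos", "NOUN", Region(4, 7)),
    Annotation("pos", "VERB", Region(8, 13)),
    Annotation("phrase", "NP", Region(0, 7)),
]

for a in annotations:
    print(a.label, a.value, repr(a.region.text(PRIMARY_TEXT)))
```

Because annotations only reference offsets, two annotators can work on the same primary data without coordinating, which is one way to read "unsupervised collaboration".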

Poio API

Part of Clarin-D curation project at the University of Cologne

Connectors to The Language Archive and Clarin Weblicht

Layered architecture:

API

Internal representation (LAF)

File format plugins (EAF, Toolbox, TCF)

Based on PyAnnotation and graf-python

Poio API

Data Structure Types (1/2)

List of lists, tree structure:

[ utterance, [ word, wfw ], translation ]

For example GRAID (Grammatical Relations and Animacy in Discourse):

[ utterance, [ clause unit, [ word, wfw, graid1 ], graid2 ], translation ]
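As a rough illustration, the two hierarchies above can be written as nested Python lists and walked recursively. The sketch assumes that the first item of each list is the parent tier of the remaining items; the function name `tiers_with_depth` is hypothetical and not part of Poio API.

```python
# The two hierarchies from the slides as nested lists.
# Assumption: the first item of each list is the parent of the other items.
WORDS = ["utterance", ["word", "wfw"], "translation"]
GRAID = ["utterance",
         ["clause unit", ["word", "wfw", "graid1"], "graid2"],
         "translation"]


def tiers_with_depth(hierarchy, depth=0):
    """Yield (tier, depth) pairs in depth-first order."""
    head, *rest = hierarchy
    yield head, depth
    for item in rest:
        if isinstance(item, list):
            # A nested list opens a sub-hierarchy one level deeper.
            yield from tiers_with_depth(item, depth + 1)
        else:
            yield item, depth + 1


# Print the GRAID hierarchy as an indented outline.
for tier, depth in tiers_with_depth(GRAID):
    print("  " * depth + tier)
```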

Data Structure Types (2/2)

Objective:

Mapping the tree structures into GrAF structure

Advantages:

Flexibility in the construction of annotation hierarchies

Automatic transformation of the tree structures into a user interface (Poio Editor and Analyzer)

Customization and collaboration

Disadvantages:

Not all annotation schemes can be mapped onto a tree-like structure
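The mapping from a tier tree to a graph can be sketched in plain Python as a list of nodes plus parent-to-child edges. This is only a stand-in under stated assumptions: it does not use graf-python, the function name `tree_to_graph` is hypothetical, and it does not claim to reproduce how Poio API builds GrAF nodes and edges.

```python
def tree_to_graph(hierarchy):
    """Return (nodes, edges) for a tier hierarchy.

    Assumption: the first item of each list is the parent of the other items.
    """
    nodes, edges = [], []

    def visit(item, parent):
        if isinstance(item, list):
            head, *children = item
        else:
            head, children = item, []
        nodes.append(head)
        if parent is not None:
            edges.append((parent, head))
        for child in children:
            visit(child, head)

    visit(hierarchy, None)
    return nodes, edges


graid = ["utterance",
         ["clause unit", ["word", "wfw", "graid1"], "graid2"],
         "translation"]
nodes, edges = tree_to_graph(graid)
for parent, child in edges:
    print(parent, "->", child)
```

The edge list is more general than the nested list: a tier with two parents can be written as two edges but has no place in the tree notation, which is the limitation named under "Disadvantages".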

Annotation Tree

Graf-python (1/3)

Python implementation of GrAF

Developed by Stephen Matysik for ANC

Provides the underlying data structure for all data and annotations that Poio API can manage (interoperability)

Accessing the nodes, edges, regions and their annotations from the parsed files (GrAF ISO)

Graf-python (2/3)

Example: accessing the nodes in a graid1 tier

Block Code:

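    import codecs
    from graf import GraphParser

    # Parse the GrAF file into a graph object
    gparser = GraphParser()
    file = 'example-graid1.xml'
    file_stream = codecs.open(file, 'r', 'utf-8')
    g = gparser.parse(file_stream)

    # Print each node, its annotations and, if present, the graid1 feature
    for node in g.nodes:
        print(node)
        for annotation in node.annotations:
            print(annotation)
            graid1 = annotation.features.get('graid1')
            if graid1 is not None:
                print(graid1)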

Result:

    NodeID = word-n1
    Annotation('word', 'a-112')
    Annotation('graid1', 'a-508')
    comp
    NodeID = word-n2
    Annotation('word', 'a-113')
    Annotation('graid1', 'a-509')
    deti
    NodeID = word-n3
    Annotation('word', 'a-114')
    Annotation('graid1', 'a-510')
    np.h:s=cop:predp

Graf-python (3/3)

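[Figure: graph for Utterance 1 (Region [0-20]) with the word nodes Word-n1 (Region [0 2]) and Word-n2 (Region [3 7]); each word node carries a word annotation and a graid1 annotation. Labels visible in the figure: ki, deti, yag, comp.]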

The future: Usage of graphs

Graph-coloring algorithm to provide insight on LD data

Make common subgraphs visible after merge of corpora

Graph-traversal algorithms to collect statistical data

Clusters of annotation values

Weighted graphs to reflect links between sources

Quantitative Historical Linguistics with dictionaries

Linked via Spanish translations
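As one possible reading of "graph-traversal algorithms to collect statistical data", the following sketch runs a breadth-first traversal over a toy annotation graph and counts the annotation values it meets. The graph, the node ids, and the labels are invented for illustration, and the function name `count_labels` is hypothetical; nothing here is taken from Poio API or graf-python.

```python
from collections import Counter, deque

# Hypothetical toy graph: node id -> child node ids.
EDGES = {
    "utterance-n1": ["word-n1", "word-n2", "word-n3"],
    "word-n1": [],
    "word-n2": [],
    "word-n3": [],
}
# Hypothetical annotation values attached to the word nodes.
LABELS = {"word-n1": "np", "word-n2": "v", "word-n3": "np"}


def count_labels(root):
    """Breadth-first traversal from root, counting labels of visited nodes."""
    counts = Counter()
    seen = {root}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node in LABELS:
            counts[LABELS[node]] += 1
        for child in EDGES.get(node, []):
            if child not in seen:  # guard against revisiting shared nodes
                seen.add(child)
                queue.append(child)
    return counts


print(count_labels("utterance-n1"))
```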

Thank you for your attention!


Links

Poio (API): http://media.cidles.eu/poio/

ISO 24612: http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326

The Language Archive: http://tla.mpi.nl/

Weblicht: http://weblicht.sfs.uni-tuebingen.de/index.shtml

Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu

@ ACRH-2, Lisbon, 29.11.2012