Poio API - An annotation framework to bridge Language Documentation and Natural Language Processing
Centro Interdisciplinar de Documentação Linguística e Social
Minde/Portugal
Vera Ferreira, [email protected]
Peter Bouda, [email protected]
Lopes, [email protected]
Language documentation
Aim of developing a "lasting, multipurpose record of a language"
Collection, distribution, and preservation of primary data of a variety of communicative events
Data is normally transcribed and translated, and should also be annotated
Archives to preserve and publish documentation:
The Language Archive
Endangered Languages Archive (ELAR)
Natural Language Processing
Any kind of computer manipulation of natural language
Mostly for major languages like English, Spanish, German, etc.
NLP is rarely used on LD data
Archiving needs led to digitization
Now we see corpus-based XYZ in General Linguistics
Individual examples are hand-picked
(Semi-)automated tagging of lesser-known languages
Quantitative Language Comparison
In contrast to corpus linguistics (see Michael Cysouw's research group)
Based on LD data, bible texts, movie subtitles etc.
Supports typological research
Annotation Graphs, LAF and GrAF
ISO standard 24612 "Language resource management - Linguistic annotation framework (LAF)"
Annotation graphs as the underlying data model for linguistic annotations
Developed for MASC of the American National Corpus
Existing connectors for UIMA and GATE
Radical stand-off approach
Unsupervised collaboration
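A minimal sketch of what a stand-off approach means in practice: the primary text is never modified, and annotations only point at character regions in it. The class names, the sample string and the offsets are illustrative assumptions and are not taken from LAF, GrAF or Poio API.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Region:
    """A character span [start, end) in the primary text (hypothetical helper)."""
    start: int
    end: int


@dataclass
class Annotation:
    """A labelled feature set that refers to a region instead of embedding text."""
    label: str
    region: Region
    features: dict = field(default_factory=dict)


def covered_text(primary: str, annotation: Annotation) -> str:
    """Resolve an annotation back to the text it points at."""
    return primary[annotation.region.start:annotation.region.end]


# Placeholder primary data; it stays untouched however many layers are added.
primary_text = "first second"

annotations = [
    Annotation("word", Region(0, 5), {"word": "first"}),
    Annotation("word", Region(6, 12), {"word": "second"}),
    # A second, independent layer over the same region,
    # for example added later by another annotator.
    Annotation("pos", Region(0, 5), {"pos": "ADJ"}),
]

for a in annotations:
    print(a.label, covered_text(primary_text, a), a.features)
```

Because each layer only stores offsets, several people can add layers over the same primary data without coordinating edits to a shared file.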
Poio API
Part of Clarin-D curation project at the University of Cologne
Connectors to The Language Archive and Clarin Weblicht
Layered architecture:
API
Internal representation (LAF)
File format plugins (EAF, Toolbox, TCF)
Based on PyAnnotation and graf-python
Data Structure Types (1/2)
List of lists, tree structure:
[ utterance, [ word, wfw ], translation ]
For example GRAID (Grammatical Relations and Animacy in Discourse):
[ utterance, [ clause unit, [ word, wfw, graid1 ], graid2 ], translation ]
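The two data structure types above can be written directly as nested Python lists. The sketch below is only an illustration: the helper names `flatten` and `depth` are hypothetical and are not functions of Poio API.

```python
# Tier hierarchies as nested lists, mirroring the two examples above.
WORDS_HIERARCHY = ["utterance", ["word", "wfw"], "translation"]
GRAID_HIERARCHY = [
    "utterance",
    ["clause unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
]


def flatten(hierarchy):
    """Return all tier names in depth-first order."""
    tiers = []
    for item in hierarchy:
        if isinstance(item, list):
            tiers.extend(flatten(item))
        else:
            tiers.append(item)
    return tiers


def depth(hierarchy):
    """Return the nesting depth of a hierarchy (1 for a flat list)."""
    return 1 + max(
        (depth(item) for item in hierarchy if isinstance(item, list)),
        default=0,
    )


print(flatten(GRAID_HIERARCHY))
print(depth(WORDS_HIERARCHY), depth(GRAID_HIERARCHY))
```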
Data Structure Types (2/2)
Objective:
Mapping the tree structures onto GrAF structures
Advantages:
Flexibility in the construction of annotation hierarchies
Automatic transformation of the tree structures into a user interface (Poio Editor and Analyzer)
Customization and collaboration
Disadvantages:
Not all annotation schemes can be mapped onto a tree-like structure
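One way to picture the mapping from a tree structure to a graph is to turn the nested list into parent-child edges between tiers. This is a sketch under a stated assumption: the first element of every list is taken to be the parent of the remaining elements of that list. The function `hierarchy_edges` is hypothetical and does not reproduce how Poio API builds GrAF objects.

```python
def hierarchy_edges(hierarchy, parent=None):
    """Yield (parent_tier, child_tier) pairs for a nested-list hierarchy.

    Assumption: the first element of each list is the parent of the
    remaining elements of that list.
    """
    head, *rest = hierarchy
    if parent is not None:
        yield (parent, head)
    for item in rest:
        if isinstance(item, list):
            # A nested list is a subtree hanging below the current head.
            yield from hierarchy_edges(item, parent=head)
        else:
            yield (head, item)


GRAID_HIERARCHY = [
    "utterance",
    ["clause unit", ["word", "wfw", "graid1"], "graid2"],
    "translation",
]

edges = list(hierarchy_edges(GRAID_HIERARCHY))
for parent_tier, child_tier in edges:
    print(parent_tier, "->", child_tier)
```

Every tier except the root ends up with exactly one parent, which is also why annotation schemes that need several parents per tier cannot be expressed this way.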
Annotation Tree
Graf-python (1/3)
Python implementation of GrAF
Developed by Stephen Matysik for ANC
Provides the underlying data structure for all data and annotations that Poio API can manage (interoperability)
Accessing the nodes, edges, regions and their annotations from the parsed files (GrAF ISO)
Graf-python (2/3)
Example: accessing the nodes in a graid1 tier
Block Code:
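import codecs
from graf import GraphParser

# Parse the GrAF file that holds the graid1 tier
gparser = GraphParser()
filename = 'example-graid1.xml'
file_stream = codecs.open(filename, 'r', 'utf-8')
g = gparser.parse(file_stream)

# Print every node, its annotations and, where present, the graid1 value
for node in g.nodes:
    print(node)
    for annotation in node.annotations:
        print(annotation)
        graid1 = annotation.features.get('graid1')
        if graid1 is not None:
            print(graid1)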
Result:
NodeID = word-n1
Annotation('word', 'a-112')
Annotation('graid1', 'a-508')
comp
NodeID = word-n2
Annotation('word', 'a-113')
Annotation('graid1', 'a-509')
deti
NodeID = word-n3
Annotation('word', 'a-114')
Annotation('graid1', 'a-510')
np.h:s=cop:predp
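The loop from the example can be wrapped in a small helper that collects one feature per node. This is a sketch, assuming only the attributes used in the example (`nodes`, `annotations`, `features.get`) plus a node `id` attribute, which is an assumption here. The name `collect_feature` is hypothetical, and the stand-in objects below replace a parsed graph so that the block runs without graf-python or an input file.

```python
from types import SimpleNamespace


def collect_feature(graph, feature_name):
    """Map node id -> value of `feature_name`, skipping nodes without it."""
    values = {}
    for node in graph.nodes:
        for annotation in node.annotations:
            value = annotation.features.get(feature_name)
            if value is not None:
                values[node.id] = value
    return values


def make_node(node_id, **features):
    """Build a stand-in node with one annotation per feature (not graf-python)."""
    annotations = [
        SimpleNamespace(label=name, features={name: value})
        for name, value in features.items()
    ]
    return SimpleNamespace(id=node_id, annotations=annotations)


# Stand-in graph using two graid1 values from the result above.
# The word values are placeholders.
demo_graph = SimpleNamespace(nodes=[
    make_node("word-n1", word="w1", graid1="comp"),
    make_node("word-n2", word="w2", graid1="deti"),
    make_node("word-n3", word="w3"),
])

print(collect_feature(demo_graph, "graid1"))
```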
Graf-python (3/3)
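[Figure: graph for Utterance 1 (Region [0-20]) with the nodes Word-n1 (Region [0 2]) and Word-n2 (Region [3 7]); each node carries a word annotation and a graid1 annotation. Values shown in the figure: ki, comp, deti, yag.]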
The future: Usage of graphs
Graph-coloring algorithm to provide insight on LD data
Make common subgraphs visible after merge of corpora
Graph-traversal algorithms to collect statistical data
Clusters of annotation values
Weighted graphs to reflect links between sources
Quantitative Historical Linguistics with dictionaries
Linked via Spanish translations
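The idea of a weighted graph that links dictionary sources via Spanish translations could look roughly like the sketch below. It is not an implementation from Poio API: the function name, the data layout and all entries are invented placeholders, and the weight is simply the number of shared translations.

```python
from collections import defaultdict
from itertools import combinations

# Toy dictionaries: headword -> set of Spanish translations.
# All entries are invented placeholders, not real lexical data.
dictionaries = {
    "source_a": {"a1": {"agua"}, "a2": {"casa", "hogar"}},
    "source_b": {"b1": {"agua", "lluvia"}, "b2": {"casa", "hogar"}},
}


def link_by_translation(dictionaries):
    """Return {(entry, entry): weight}, weight = number of shared translations.

    Entries are (source, headword) pairs; only entries from different
    sources are linked.
    """
    by_translation = defaultdict(set)
    for source, entries in dictionaries.items():
        for headword, translations in entries.items():
            for translation in translations:
                by_translation[translation].add((source, headword))

    weights = defaultdict(int)
    for entries in by_translation.values():
        for left, right in combinations(sorted(entries), 2):
            if left[0] != right[0]:
                weights[(left, right)] += 1
    return dict(weights)


links = link_by_translation(dictionaries)
for (left, right), weight in sorted(links.items()):
    print(left, "<->", right, "weight", weight)
```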
Thank you for your attention!
Links
Poio (API): http://media.cidles.eu/poio/
ISO 24612: http://www.iso.org/iso/catalogue_detail.htm?csnumber=37326
The Language Archive: http://tla.mpi.nl/
Weblicht: http://weblicht.sfs.uni-tuebingen.de/index.shtml
Centro Interdisciplinar de Documentação Linguística e Social, http://www.cidles.eu
@ ACRH-2, Lisbon, 29.11.2012