CODA – CATCHPlus Open Document Annotation Hennie Brugman OAC II Project Review meeting Chicago – July 26-27, 2012

CODA – CATCHPlus Open Document Annotation

Hennie Brugman

OAC II Project Review meeting

Chicago – July 26-27, 2012

Annotation context

• Audiovisual – ASR, language, gesture, oral history

• Text – Semantic annotation

• Music – lyrics, music notation

• Linguistic Annotation – named entities

• Image annotation

• Programs: CATCH, CATCHPlus, CLARIN

CODA main use cases

• Queen’s Cabinet (Henny van Schie/National Archive, Lambert Schomaker/Univ Groningen)

– Line strip and word zone annotations

– ML: search in manuscript images

– Add Named Entity annotations

• Sailing Letters (Nicoline van der Sijs/Meertens + consortium, Lambert Schomaker)

– Support manual annotation

– Line strip detection service

Line annotation tools (catchplus)

<txt>godefroit</txt>
<id>navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho</id>
<user>mceunen</user>
<time>Wed Jan 26 16:37:01 2011</time>
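
The identifier in the record above appears to pack the zone geometry into key=value pairs separated by hyphens. A minimal Python sketch for unpacking it follows. The function name `parse_zone_id` is hypothetical, and the reading of the keys (x, y, w, h as a pixel box; y1 and y2 as the vertical extent of the line strip) is an assumption that the slides do not confirm.

```python
import re


def parse_zone_id(identifier):
    """Collect the key=value pairs embedded in a line annotation tool id.

    Assumption: values never contain a hyphen, so each value ends at the
    next '-' (or at the end of the string).
    """
    pairs = re.findall(r"([A-Za-z]\w*)=([^-]+)", identifier)
    return dict(pairs)


example_id = (
    "navis-SAL7316_0195-line-026-y1=2094-y2=2317-zone-HUMAN"
    "-x=1145-y=105-w=315-h=116-unshear=0.0-version=ortho"
)
fields = parse_zone_id(example_id)
```

The values are kept as strings, so a caller would still convert `fields["x"]` and friends to numbers before using them as coordinates.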

OAC representation

[Diagram: an ImageAnnotation and TextAnnotations connected by hasBody, hasTarget and constrains links. Node labels: imageScan.jpg, Canvas1, linestrip.jpg, ia:1, ia:2, ib:0, page:0, zone:2, line:1, ct:1, ct:2, cb:1, cb:2, Named Entity, and the inline text (cnt:chars) “Dit is een beschrijving van DenHaag. En dit is een tweede zin.” (“This is a description of DenHaag. And this is a second sentence.”)]

OAC representation – Named Entities

[Diagram: an ImageAnnotation, TextAnnotations and an EntityAnnotation connected by hasBody, hasTarget and constrains links. Node labels: imageScan.jpg, Canvas1, ia:1, ta:0, ta:1, ta:2, ea:1, ib:0, ib:1, ct:1, ct:2, ct:3, ct:4, cb:1, cb:2, the inline text (cnt:chars) “Dit is een beschrijving van DenHaag. En dit is een tweede zin.” (“This is a description of DenHaag. And this is a second sentence.”), and the entity body (cnt:chars) “location”]

! Annotation of annotations?

! Annotation of segments of inline text?

InlineTextConstraint:

<rdf:Description rdf:about="urn:uuid:533624bb-d565-40ba-a14a-2e95c19c20df">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/ConstrainedTarget"/>
  <constrains xmlns="http://www.openannotation.org/ns/" rdf:resource="http://oas.dev.seecr.nl:8000/resolve/urn%3Auuid%3Ad8741024-18bf-40a8-a648-2cd5ebb9acfd"/>
  <constrainedBy xmlns="http://www.openannotation.org/ns/" rdf:resource="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208"/>
</rdf:Description>

<rdf:Description rdf:about="urn:uuid:4f6b7d34-2329-4ab6-be89-a0feec9e7208">
  <rdf:type rdf:resource="http://www.openannotation.org/ns/Constraint"/>
  <rdf:type rdf:resource="http://www.catchplus.nl/annotation/InlineTextConstraint"/>
  <rdf:type rdf:resource="http://www.w3.org/2008/content#ContentAsText"/>
  <chars xmlns="http://www.w3.org/2008/content#">"&lt;textsegment offset="279" range="2"/&gt;"</chars>
  <characterEncoding xmlns="http://www.w3.org/2008/content#">UTF-8</characterEncoding>
</rdf:Description>
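
As an illustration of how such a constraint could be resolved against inline text, here is a minimal Python sketch. It assumes that `offset` is a zero-based character offset and that `range` is a length in characters; the slides do not state either. The function name `apply_inline_text_constraint` is hypothetical, and the offset and range used below are chosen for the sample sentence, not taken from the RDF above.

```python
import xml.etree.ElementTree as ET


def apply_inline_text_constraint(text, constraint_xml):
    """Return the segment of `text` selected by a <textsegment/> element.

    Assumptions: `offset` counts characters from 0 and `range` is the
    number of characters in the segment.
    """
    segment = ET.fromstring(constraint_xml)
    offset = int(segment.get("offset"))
    length = int(segment.get("range"))
    return text[offset:offset + length]


inline_text = "Dit is een beschrijving van DenHaag. En dit is een tweede zin."
# The unescaped form of a cnt:chars value, with illustrative numbers.
constraint = '<textsegment offset="28" range="7"/>'
selected = apply_inline_text_constraint(inline_text, constraint)
```

With these illustrative values the selected segment is the entity mention "DenHaag" in the sample sentence.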

KdK-2-OAC conversion

• Implicit line and page text

• Word and line order

• Text offsets and ranges

• Spatial information

• Identifiers and ‘annotatability’

• Redundant text for searchability

! Need for explicit representation of Sequence?

! Search on text of ConstrainedTarget/Body?
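
The first three points (implicit line text, word order, text offsets and ranges) can be sketched in Python as follows. This is only one possible reading: it assumes that the line text is made explicit by joining the ordered word transcriptions with a single space, which the slides do not specify, and the function name `build_line_text` is hypothetical.

```python
def build_line_text(words, separator=" "):
    """Make the implicit line text explicit and record word positions.

    Returns the joined line text plus one (word, offset, length) tuple
    per word, in the given word order.
    """
    segments = []
    position = 0
    for index, word in enumerate(words):
        if index > 0:
            position += len(separator)  # account for the joining separator
        segments.append((word, position, len(word)))
        position += len(word)
    return separator.join(words), segments


line_text, word_segments = build_line_text(["van", "den", "Haag"])
```

Each tuple could then feed a text constraint with an offset and a range, while the joined text serves as the redundant, searchable copy of the line.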

KdK2OAC conclusions

• Bidirectional mapping is possible

• Compatible with SharedCanvas model

• OAC + Canvas links everything together

• Implicit information made explicit

• Supports alternative text segmentations

• OAC representation is extremely verbose

! For many annotation tasks OA may be overkill

Open Annotation Service (OAS)

• Upload annotation RDF using SRU/Update

• Inlines external text and XML Bodies and authors

• Indexes OA and DC properties

• Assigns resolvable http URIs and resolves those

• Implementation: RDF store in combination with Solr, production quality software components (Meresco)

• Built-in OAI-PMH data provider and harvester for ‘annotation sets’

• Query: SRU/CQL, SPARQL, OAI-PMH

• Simple management dashboard (authentication and authorization, collection management, harvesting)

• Easy installation and Open Source

! Model does not support Annotation “sets”
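
As a rough illustration of the SRU/CQL query route listed above, the following Python sketch builds a searchRetrieve URL. The request parameters (`operation`, `version`, `query`, `maximumRecords`) come from the SRU standard; the base URL and the CQL index `dc.creator` are placeholders, since the slides do not say which endpoint path or indexes OAS actually exposes.

```python
from urllib.parse import urlencode


def build_sru_search_url(base_url, cql_query, maximum_records=10):
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": "1.2",
        "query": cql_query,
        "maximumRecords": str(maximum_records),
    }
    return base_url + "?" + urlencode(params)


# Hypothetical endpoint and index name, for illustration only.
url = build_sru_search_url("http://oas.example.org/sru", 'dc.creator = "mceunen"')
```

The sketch only constructs the URL; sending the request and parsing the XML response are left out because they depend on the actual service configuration.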

OAS: issues

• Annotation publication

• Searchability: ‘harvest and index’

• Text search on external bodies

• Annotation boundaries

• ‘Bypassing’ oac:constrains

! In RDF, what are the boundaries of an annotation?

Entity Recognition service

[Diagram: components labelled service, frog, converter and OAS; flow labels: URL or text, resolve, source_text, FoLiA_document, URL or ID, entity annotations]

‘frog’ and FoLiA

• ‘Frog’ tool generates FoLiA XML document with

– Segmentation of text in paragraphs, sentences and words (tokens): XML hierarchy

– Part of speech, lemma, morphology, chunking, dependency structure and named entities

• Mix of inline and standoff annotation

– ‘Frog’ does not keep track of character offsets

– Explicit ordering: numbering system in ids

• Trained for Dutch

• Widely used for Dutch corpora

• Made available by: ILK @ Tilburg University

FoLiA-2-OAC conversion

• Reconstruct character offsets after tokenization

• Operates on inline text as published by OAS

• Construct and add entity text from tokens + sequence (the+hague != hague+the)

• Two approaches

1. Minimal: extract entity annotations and tokens, and convert to OAC

2. Maximal: full conversion to OAC
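
Reconstructing character offsets after tokenization can be sketched in Python as below. This is an assumed approach, not the converter's confirmed implementation: it scans the source text left to right and expects every token to occur verbatim and in order. The function names `reconstruct_offsets` and `entity_span` are hypothetical.

```python
def reconstruct_offsets(text, tokens):
    """Find (offset, length) for each token by scanning the text in order."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = text.find(token, cursor)
        if start < 0:
            raise ValueError("token not found in order: %r" % token)
        offsets.append((start, len(token)))
        cursor = start + len(token)  # later tokens must start after this one
    return offsets


def entity_span(text, offsets, first, last):
    """Return (offset, length, entity_text) covering tokens first..last."""
    start = offsets[first][0]
    end = offsets[last][0] + offsets[last][1]
    return start, end - start, text[start:end]


source_text = "Dit is een beschrijving van Den Haag."
tokens = ["Dit", "is", "een", "beschrijving", "van", "Den", "Haag", "."]
token_offsets = reconstruct_offsets(source_text, tokens)
span = entity_span(source_text, token_offsets, 5, 6)
```

Taking the entity text from the source span, rather than concatenating tokens, preserves the token sequence and the original spacing, which is why the+hague and hague+the yield different results.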

Linguistic Annotation

! Mix-in domain semantics as subtypes/subproperties?

! Maximal OA mapping or embed linguistic standards?

! Layers, hierarchies (syntax) and Documents

! Sequence (e.g. entities, morpheme breakup)

Synchronized viewing client demo

• Demo/screenshot

Summary of OA issues

! Annotation of annotations?

! Annotation of segments of inline text?

! Need for explicit representation of Sequence?

! Search on ConstrainedTarget/Body?

! For many annotation tasks OA may be overkill

! Model does not support Annotation sets

! In RDF, what are the boundaries of an annotation?

Future work

• Finalize and integrate software (with web services)

• Upgrade to new OA spec (incl OAS)

• Line strip detection web service

• Possible applications

– AV annotation in CATCHPlus

– Nederlab

Questions?