Annotation as Algebra: a formal framework for linguistic annotation

Post on 12-Jan-2016

Penn

HP Labs Bangalore, 8/21/2003

Annotation as Algebra: a formal framework for linguistic annotation

Mark Liberman
University of Pennsylvania

myl@cis.upenn.edu

(joint work with Steven Bird, Melbourne University)

Outline

- Motivation
- Sketch of the idea
- Survey of linguistic annotation
- Annotation graphs as a formal framework
- Practical implementations and experience
- Issues for the future

What linguistic annotation is (and isn’t)

“Linguistic annotation” means symbolic descriptions of specific linguistic signals, e.g. transcriptions, parses, etc.

It does not include things like:
- metadata
  - e.g. information about speakers, recordings, documents, etc.
  - typically stored in RDB referenced by elements of linguistic annotation
- lexicons

but these can be treated in a common framework

Motivation

- A jungle of annotation file formats
  - e.g. more than 20 common formats for time-marked orthographic transcriptions
  - Many new formats every year
- Multiple annotations of the same data
- No good way to search annotations
  - different coding needed for each format
  - extra difficulty of searches across formats
- Problems for:
  - tool builders
  - researchers
  - corpus builders and maintainers

Basic idea #1: what to do

- Abstract away from file formats, to the logical structure of linguistic annotation
- Replace two-level model with three-level model
  - as in database technology several decades ago
  - so many applications can access many kinds of data through a consistent API
- Choose a logical structure with good properties
  - simple, conceptually natural, computationally efficient
  - algebra to facilitate boolean combination of queries

Two-level model:

Three-level model:

Basic idea #2: how to do it

- Three kinds of assertion recur in linguistic annotation
  - assigning a label: “This chunk of stuff has property X”
  - sequencing labels: “chunk B immediately follows chunk A”
  - anchoring the edges of labels: “this chunk boundary has coordinates k” (in time, space, text...)
- Formalized as a labeled DAG, these primitives provide a logical structure adequate for all linguistic annotation
- The result also defines an algebra useful for searching and in other ways

Basic assertion type 1: labeling

Associate a “label” (typed, structured symbolic information) with a region of a linguistic signal

Basic assertion type 2: sequencing

Example:

The stretch of signal labeled “this” is followed by a stretch of signal labeled “is”

Basic assertion type 3: anchoring

Example:

The stretch of signal labeled “this” begins 137.4592 seconds from the start of file XYZ.

Informal formalization

An “annotation graph” (AG) is:
- a directed acyclic graph
- whose arcs are labeled with fielded records, e.g. phoneme=“p” or word=“this”
- whose nodes may be labeled with signal coordinates, e.g. 3.45692 seconds

Labeling → arc labels
Sequencing →
Anchoring → signal coordinates on nodes

That’s all!
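The three assertion types can be sketched in a few lines of Python. This is only an illustration: the names `Node`, `Arc` and `follows` are hypothetical and are not the AGTK API. It reuses the “this” / “is” example and the 137.4592 second anchor from the preceding slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Node:
    """A graph node; `offset` is the optional anchor (a signal coordinate)."""
    name: str
    offset: Optional[float] = None  # None means "not anchored"


@dataclass(frozen=True)
class Arc:
    """A labeled arc; `label` is a (field, value) record such as ("word", "this")."""
    source: Node
    target: Node
    label: tuple


# Anchoring: the stretch labeled "this" begins 137.4592 seconds into the file.
n0 = Node("n0", 137.4592)
n1 = Node("n1")  # boundary whose time is not known
n2 = Node("n2")

# Labeling: each arc carries a fielded record.
this = Arc(n0, n1, ("word", "this"))
is_ = Arc(n1, n2, ("word", "is"))


def follows(a: Arc, b: Arc) -> bool:
    """Sequencing: b immediately follows a when b starts at the node where a ends."""
    return a.target == b.source
```

Here sequencing is not stored separately; it falls out of two arcs sharing a node, which is one plausible reading of the slide.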

Outcome

API, open source toolkit (C,C++,TCL,Python); sample tools:

Java version (“ATLAS”) developed by NIST

Annotation formats & tools

Surveyed in 1999 by Liberman and Bird

Documented on web page http://ldc.upenn.edu/annotation

Used in designing annotation graph system & AG software

Survey is updated periodically

Some animals in the annotation zoo

1 TIMIT
2 BAS Partitur
3 CHILDES
4 LACITO
5 LDC CALLHOME
6 NIST UTF
7 Switchboard (four types of annotation)
8 ... etc. ...

Sample TIMIT data

train/dr1/fjsp0/sa1.wrd:    train/dr1/fjsp0/sa1.phn:
2360 5200 she               0 2360 h#
5200 9680 had               2360 3720 sh
9680 11077 your             3720 5200 iy
11077 16626 dark            5200 6160 hv
16626 22179 suit            6160 8720 ae
22179 24400 in              8720 9680 dcl
24400 30161 greasy          9680 10173 y
30161 36150 wash            10173 11077 axr
36720 41839 water           11077 12019 dcl
41839 44680 all             12019 12257 d
44680 49066 year            ...

TIMIT interpreted graphically

[Figure: the word “had” spanning the phones hv, ae, dcl, with boundaries at 5200, 6160, 8720 and 9680]

TIMIT as Annotation Graph

W = word level
  5200 9680 had

P = phoneme level
  5200 6160 hv
  6160 8720 ae
  8720 9680 dcl
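As a rough sketch of how the TIMIT lines for “had” become one graph, the Python below turns each `start end label` line into an arc tagged with its tier. The function name `timit_to_arcs` and the tuple layout are assumptions made for illustration. Nodes are identified by their offsets, which works here only because every TIMIT boundary is anchored.

```python
def timit_to_arcs(lines, tier):
    """Turn 'start end label' lines into (start, end, tier, label) arcs."""
    arcs = []
    for line in lines:
        start, end, label = line.split()
        arcs.append((int(start), int(end), tier, label))
    return arcs


words = timit_to_arcs(["5200 9680 had"], "W")
phones = timit_to_arcs(["5200 6160 hv", "6160 8720 ae", "8720 9680 dcl"], "P")

# Arcs from both tiers that share a boundary share a node, because a node
# is just its offset in this simplified encoding.
arcs = words + phones
nodes = sorted({offset for start, end, _, _ in arcs for offset in (start, end)})
```

The word arc and the chain of three phone arcs end up running between the same first and last node, with no explicit hierarchy stored anywhere.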

BAS Partitur

Goal: a common format for research results from many German speech projects.

A multi-tier description of speech signals:
- KAN - the canonical transcription
- ORT - orthographic transcription
- TRL - transliteration
- MAU - phonetic transcription
- DAS - dialogue act transcription

BAS Partitur: example

KAN:0 j'a:       ORT:0 ja        MAU: 4160 1119 0 j
KAN:1 S'2:n@n    ORT:1 schönen   MAU: 5280 2239 0 a:
KAN:2 d'aNk      ORT:2 Dank      MAU: 7520 2399 1 S
KAN:3 das+       ORT:3 das       MAU: 9920 1599 1 2:
KAN:4 vE:r@+     ORT:4 wäre      MAU: 11520 479 1 n
KAN:5 z'e:6      ORT:5 sehr      MAU: 12000 479 1 n
KAN:6 n'Et       ORT:6 nett      MAU: 12480 479 -1

DAS:0,1,2 @(THANK_INIT BA)
DAS:3,4,5,6 @(FEEDBACK_ACKNOWLEDGEMENT BA)

BAS Partitur graphical structure:

[Figure: the tiers KAN (j'a:, S'2:n@n), ORT (ja, sch"onen), DAS (@(THANK_INIT BA)) and MAU (j, a:) drawn over the time points 4160, 5280 and 7520]

KAN:0 j'a:       ORT:0 ja         MAU: 4160 1119 0 j
KAN:1 S'2:n@n    ORT:1 sch"onen   MAU: 5280 2239 0 a:
DAS:0,1,2 @(THANK_INIT BA)

Partitur differences from TIMIT

- File organization: everything is in a single file (even metadata)
- Time marking:
  - time anchors are in only one tier (MAU)
  - time anchors use <start offset, duration-1>
- Relationship between the tiers:
  - KAN tier supplies a set of identifiers
  - MAU tier: several lines for each KAN line
  - DAS tier: one line for several KAN lines
- Temporal structure: MAU and DAS define convex intervals
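The `<start offset, duration-1>` convention is easy to get wrong by one, so here is a small Python sketch of the conversion to start/end pairs. The function name `mau_interval` is hypothetical, and treating the end as exclusive is an assumption, chosen because it makes each segment in the sample data end exactly where the next one starts.

```python
def mau_interval(line):
    """Parse 'MAU: start duration-1 kan_index phone' into (start, end, kan_index, phone).

    The second number is the duration minus one, so the (exclusive) end
    is start + value + 1.
    """
    fields = line.split()
    if fields[0] != "MAU:":
        raise ValueError("not a MAU line: " + line)
    start, dur_minus_1, kan_index = int(fields[1]), int(fields[2]), int(fields[3])
    phone = fields[4] if len(fields) > 4 else None
    return (start, start + dur_minus_1 + 1, kan_index, phone)


rows = ["MAU: 4160 1119 0 j", "MAU: 5280 2239 0 a:", "MAU: 7520 2399 1 S"]
intervals = [mau_interval(row) for row in rows]
```

The third field is the KAN index, which is how several MAU segments attach to one canonical word.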

BAS Partitur: Annotation graph

ORT: 0 ja          MAU: 4160 1119 0 j
ORT: 1 sch"onen    MAU: 5280 2239 0 a:
                   MAU: 7520 2399 1 S
                   MAU: 9920 1599 1 2:
                   MAU: 11520 479 1 n

DAS:0,1,2 @(THANK_INIT BA)

CHILDES

- Child language acquisition data
- Archive organized by Brian MacWhinney at CMU
- CHAT transcription format
- Tools for creating, browsing, searching
- Contributions by many researchers around the world

CHILDES Annotation

*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999
*MAR: yeah.
%snd: "boys73a.aiff" 10482 10839
*MAR: because I'm not ready to go to <the bathroom> [>] +/.
%snd: "boys73a.aiff" 11621 13784

CHILDES differences from TIMIT

- long recordings with multiple speakers
- time specified at turn level only
- there are gaps between the turns
- the transcription contains embedded annotations

CHILDES annotation graph

*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999

NB: incomplete time info, disconnected structure
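A hedged Python sketch of what “incomplete time info, disconnected structure” might look like in practice: each turn becomes a chain of word arcs in which only the first and last node carry a time, and the chains of different turns share no nodes. The function names and the node layout `(speaker, turn_start, position, offset)` are assumptions for illustration; whitespace tokenization is a simplification that also treats the `#` pause marker as a token.

```python
import re

SND = re.compile(r'%snd:\s*"([^"]+)"\s+(\d+)\s+(\d+)')


def chat_turns(text):
    """Pair each '*SPK: utterance' line with the '%snd:' line that follows it."""
    turns = []
    speaker = utterance = None
    for line in text.splitlines():
        if line.startswith("*"):
            speaker, utterance = line[1:].split(":", 1)
            utterance = utterance.strip()
        elif line.startswith("%snd:") and speaker is not None:
            match = SND.match(line)
            turns.append((speaker, utterance, int(match.group(2)), int(match.group(3))))
            speaker = utterance = None
    return turns


def turn_arcs(turn):
    """Word arcs for one turn; interior nodes have offset None (time unknown)."""
    speaker, utterance, start, end = turn
    words = utterance.split()
    offsets = [start] + [None] * (len(words) - 1) + [end]
    return [
        ((speaker, start, i, offsets[i]), (speaker, start, i + 1, offsets[i + 1]), word)
        for i, word in enumerate(words)
    ]


sample = '''*ROS: yahoo.
%snd: "boys73a.aiff" 7349 8338
*FAT: you got a lot more to do # don't you?
%snd: "boys73a.aiff" 8607 9999'''

turns = chat_turns(sample)
arcs = [arc for turn in turns for arc in turn_arcs(turn)]
```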

CHILDES: RDB connection

ID  NAME   ROLE    AGE     SEX   BIRTH
1   Ross   Child   6;3.11  male  23-DEC-1977
2   Mark   Child   4;4.15  male  19-NOV-1979
3   Brian  Father
4   Mary   Mother

“metadata” about speakers, recordings etc. stored separately in relational tables

LACITO

- Langues et Civilisations à Tradition Orale
  - recordings of unwritten languages, collected and transcribed over three decades
  - preservation and dissemination
- Based on XML
  - markup for alignment to audio signal
  - different XSL style sheets for display
  - generating HTML with hyperlinks to audio clips

LACITO example

<S id="s1">
  <AUDIO start="2.3656" end="7.9256"/>
  <TRANSCR>
    <W><FORM>nakpu</FORM> <GLS>deux</GLS></W>
    <W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W>
    <W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W>
    <W><FORM>pa</FORM> <GLS>faire</GLS></W>
    <W><FORM>la&x0294;natshem</FORM> <GLS>allerent</GLS></W>
    <W><FORM>are</FORM> <GLS>dit.on</GLS></W>
    <PONCT>.</PONCT>
  </TRANSCR>
  <TRADUC lang="Francais">On raconte que deux soeurs allerent chercher du bois.</TRADUC>
  <TRADUC lang="Anglais">They say that two sisters went to get firewood.</TRADUC>
</S>

LACITO as AG

<AUDIO start="2.3656" end="7.9256"/>
<W><FORM>nakpu</FORM> <GLS>deux</GLS></W>
<W><FORM>nonotso</FORM> <GLS>soeurs</GLS></W>
<W><FORM>si&x014b;</FORM> <GLS>bois</GLS></W>
<W><FORM>pa</FORM> <GLS>faire</GLS></W>
<TRADUC lang="Francais">On raconte que deux ...</TRADUC>
<TRADUC lang="Anglais">They say that two ...</TRADUC>

LACITO discussion

- Two kinds of partiality for times:
  - where they are simply unknown
  - where they are inappropriate
- Unknown times:
  - the annotation is incomplete
  - time-alignment is coarse-grained
- Inappropriate times:
  - for word boundaries in the phrasal translation
  - for punctuation?

LDC Call Home example

980.18 989.56 A: you know, given how he's how far he's gotten, you know, he got his degree at &Tufts and all, I found that surprising that for the first time as an adult they're diagnosing this. %um
989.42 991.86 B: %mm. I wonder about it. But anyway.
991.75 994.65 A: yeah, but that's what he said. And %um
994.19 994.46 B: yeah.
995.21 996.59 A: He %um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So he found this new job as a financial consultant and seems to be happy with that.
1003.14 1003.45 B: Good.

LDC CallHome as AG

995.21 996.59 A: He %um
996.51 997.61 B: Whatever's helpful.
997.40 1002.55 A: Right. So ...

CallHome discussion

- Speaker overlap
  - No special devices, just turn time-marks
  - Scales for an arbitrary number of speakers
  - Information about word-level overlap is left ambiguous
  - Additional time references could easily specify word overlap
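To illustrate “no special devices, just turn time-marks”, the Python sketch below recovers speaker overlap from the turn start and end times alone, using the three CallHome turns shown earlier. The helper names are hypothetical. Note that it reports only the overlapping time span; which words fall inside it remains unknown, as the slide says.

```python
def parse_turn(line):
    """Parse 'start end SPEAKER: text' into (start, end, speaker, text)."""
    start, end, rest = line.split(" ", 2)
    speaker, text = rest.split(":", 1)
    return float(start), float(end), speaker, text.strip()


def overlaps(turns):
    """Yield (speaker_a, speaker_b, start, end) for turns of different speakers that overlap."""
    for i, (s1, e1, spk1, _) in enumerate(turns):
        for s2, e2, spk2, _ in turns[i + 1:]:
            lo, hi = max(s1, s2), min(e1, e2)
            if spk1 != spk2 and lo < hi:
                yield spk1, spk2, lo, hi


turns = [parse_turn(line) for line in [
    "995.21 996.59 A: He %um",
    "996.51 997.61 B: Whatever's helpful.",
    "997.40 1002.55 A: Right. So ...",
]]
found = list(overlaps(turns))
```

Nothing in the pairwise comparison assumes two speakers, so the same loop works for any number of them.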

NIST UTF (circa 1999)

- NIST: National Institute of Standards and Technology (USA)
- UTF: “Universal Transcription Format”
- Intended to generalize over several earlier LDC broadcast news and conversation transcription formats
- Special treatment for: metadata, time stamps, speaker overlap, contractions
- N.B. now abandoned in favor of AG-based representations

NIST UTF example (from BN)

<turn speaker="Roger_Hedgecock" spkrtype="male" dialect="native"
      start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high">
<time sec="2387.353875"> on welfare and away from real ownership \{breath and
<contraction e_form="[that=>that]['s=>is]">that's a real problem in this
<b_overlap start="2391.115375" end="2391.606000"> country<e_overlap>
</turn>
<turn speaker="Gloria_Allred" spkrtype="female" dialect="native"
      start="2391.299625" end="2439.820312" mode="spontaneous" fidelity="high">
<b_overlap start="2391.299625" end="2391.606000">well i<e_overlap> think the real problem is that
%uh these kinds of republican attacks <time sec="2395.462500">i see as code words for discrimination
</turn>

NIST UTF: turn element

<turn speaker="Roger_Hedgecock" spkrtype="male" dialect= "native" start="2348.811875" end="2391.606000" mode="spontaneous" fidelity="high">

NIST UTF: Contraction

<contraction e_form="[that=>that]['s=>is]"> that's

NIST UTF: overlap

<b_overlap start="2391.115375" end="2391.606000">country<e_overlap>

NIST UTF: discussion

- Relational data (e.g. speaker demographics) is embedded in the annotation (redundantly).
- Time stamps are stored in three different places.
- Speaker overlap is convolved with the speaker turn, so time relation with an external event disrupts the internal structure of a turn
- Contractions are treated in a way that facilitates link to lexicon, but may be hard to ignore in a search function

NIST UTF as AG

AG contraction treatment

Additional textual annotations (e.g. for expanding a contraction) don't complicate the existing representation

-- facilitates search

NIST UTF / AG version

- Metadata: stored in a separate RDB table (cf. CHILDES)
- Time stamps: stored in a single place -- AG nodes
- Speaker overlap: not convolved with the speaker turn, so temporal relationship with an external event remains external to the structure of a turn
- Contractions: no new device, easily ignored in search
- No artificial order on speaker turns

Switchboard

- Corpus of 2400 5-minute telephone conversations collected at Texas Instruments in 1991
- Transcribed and aligned on three levels: conversation, speaker turn, word
- Subsequently annotated for: POS, syntactic structure, breath groups, disfluencies, speech acts, phonetic segments, etc.
- Then re-transcribed with many corrections!

-- Proliferation of layers with different tokenizations
-- Problem of correction after annotation

SWB example (1, 2)

B 21.86 0.26 Metric
B 22.12 0.26 system,
B 22.38 0.18 no
B 22.56 0.06 one's
B 22.86 0.32 very,
B 23.88 0.14 uh,
B 24.02 0.16 no
B 24.18 0.32 one
B 24.52 0.28 wants
B 24.80 0.06 it
B 24.86 0.12 at
B 24.98 0.22 all
B 25.66 0.22 seems
B 25.88 0.22 like.

[ Metric/JJ system/NN ]
,/, [ no/DT one/NN ]
's/BES
very/RB ,/, [ uh/UH ] ,/, [ no/DT one/NN ]
wants/VBZ [ it/PRP ]
at/IN [ all/DT ]
seems/VBZ
like/IN ./.

SWB example (3, 4)

B.22: Yeah, / no one seems to be adopting it. / Metric system, [ no one's very, + {F uh, } no one wants ] it at all seems like. /

((S (NP-TPC Metric system) , (S-TPC-1 (EDITED (RM [) (S (NP-SBJ no one) (VP 's (ADJP-PRD-UNF very))) , (IP +)) (INTJ uh) , (NP-SBJ no one) (VP wants (RS ]) (NP it) (ADVP at all))) (NP-SBJ *) (VP seems (SBAR like (S *T*-1))) . E_S))

Switchboard: AG

Another multiple annotation

It is quite realistic to have this many diverse annotations (and more!) for the same material...

AG formalization: Background

- Annotation - the basic action:
  - associate a label with an extent of signal
  - labels may be of different types
  - different types may span different amounts of time; need not form a hierarchy
- Minimal formalization:
  - directed graph
  - typed, fielded records on the arcs
  - optional time references on the nodes

Timelines

- Nodes are anchored to signals using offsets
- An annotation may reference more than one signal, e.g.
  - simultaneous audio and video signals
  - signals from multiple microphones
  - audio and physiological signals
- All the signals covered by a given annotation must be from the same "flow of time" = timeline T
  - but signals may cover a timeline only partially
- (Other ordered sets, such as the sequence of characters in a text, may also be treated as timelines... )

Two Signals, One Timeline

(Could be treated as a single multi-channel signal -- but different channels might be in different files, have different frame rates, etc.)

AG: Formal Definition

An Annotation Graph G over a label set L and timeline T is a 3-tuple <N,A,t>:

  N = set of nodes
  A = set of arcs labelled with elements of L
  t = partial function from N to T

satisfying the following conditions:

1. <N,A> is acyclic, with no nodes of degree zero
2. for any path from node n1 to n2, if t(n1) and t(n2) are defined, then t(n1) <= t(n2)
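The two conditions can be checked mechanically. The Python below is a minimal sketch, assuming nodes are hashable ids, arcs are `(source, target, label)` triples and the partial function t is a dict; the name `is_valid_ag` is hypothetical. It uses plain recursion, so it is suited to small graphs only.

```python
def is_valid_ag(nodes, arcs, t):
    """Check conditions 1 and 2 for <N, A, t>.

    nodes: set of node ids; arcs: iterable of (source, target, label);
    t: dict from some node ids to times (a partial function).
    """
    succ = {n: [] for n in nodes}
    touched = set()
    for src, dst, _label in arcs:
        succ.setdefault(src, []).append(dst)
        succ.setdefault(dst, [])
        touched.update((src, dst))
    # Condition 1 (second half): every node lies on some arc. This test
    # also rejects arcs that mention nodes missing from N.
    if touched != set(nodes):
        return False

    # Condition 1 (first half): no cycles, via three-colour depth-first search.
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {n: WHITE for n in nodes}

    def has_cycle(n):
        colour[n] = GREY
        for m in succ[n]:
            if colour[m] == GREY or (colour[m] == WHITE and has_cycle(m)):
                return True
        colour[n] = BLACK
        return False

    if any(colour[n] == WHITE and has_cycle(n) for n in nodes):
        return False

    # Condition 2: anchored times never decrease along any path.
    def reachable(n, seen):
        for m in succ.get(n, ()):
            if m not in seen:
                seen.add(m)
                reachable(m, seen)
        return seen

    for n1 in t:
        for n2 in reachable(n1, set()):
            if n2 in t and t[n1] > t[n2]:
                return False
    return True
```

For example, a chain 1 → 2 → 3 with t defined only on nodes 1 and 3 is valid as long as t(1) <= t(3); the unanchored middle node imposes no constraint of its own.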

Condition 1

1. <N,A> is acyclic, with no nodes of degree zero

1a. AGs are acyclic
  - expresses the linearity of signal annotations
  - an important property wrt implementations and to QLs containing path expressions

1b. AGs have no orphan nodes
  - the only point of nodes is to anchor the arcs
  - avoids the situation of AGs that are identical but for orphan nodes

Condition 2

for any path from node n1 to n2, if t(n1) and t(n2) are defined, then t(n1) <= t(n2)

2. AGs respect the flow of time (or the structure of another anchoring space)

AG: Interpretation of Labels

Arc labels may be interpreted as:
- substantive content conforming to a coding practice
- as meta-commentary
- as a reference to other material
- as an identifier
- as arbitrary binary data

Choice of label interpretations falls outside the scope of the formalism

AG: Expressiveness

Is the formalism too minimalist?

Some things that some people want:
1. cross-reference from a label to another arbitrary label, arc or node
2. labels as well as anchors for nodes
3. anchoring nodes to arcs or labels rather than timelines
4. anchoring arcs/labels in 2- or 3-dimensional spaces
5. recursive structures in labels

“Core AG” has sufficient expressive capacity to encode, in an intuitive way, all commonly used formats, and also good properties wrt creation, maintenance, search

Our strategy:
- see how far we can go with this core
- dispense with more complex syntax and focus on semantics
- but some of (1) has been added in core AG implementation, and (4) has been added in “ATLAS” (NIST version)

Structures for a single layer

All of these have (one or more) natural representations in the basic AG formalism.

Multiple layers can of course be added in a general way.

Equivalence classes

Equivalence classes (joint reference to an external ID) provide a way to establish symmetrical inter-label linkages without any new formal devices
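One way to picture this, as a sketch only: arcs that are linked simply carry the same value in an ordinary label field, and the equivalence class is recovered by grouping on that field. The field name `ref`, the ID `E1` and the example words are all invented for illustration.

```python
from collections import defaultdict

# Arcs as (source, target, fields). "ref" is a hypothetical field that
# names an external ID; it is an ordinary label field, not a new device.
arcs = [
    (1, 2, {"word": "Mary", "ref": "E1"}),
    (2, 3, {"word": "said"}),
    (3, 4, {"word": "she", "ref": "E1"}),
]


def equivalence_classes(arcs, field="ref"):
    """Group arcs by the external ID they jointly refer to."""
    classes = defaultdict(list)
    for arc in arcs:
        fields = arc[2]
        if field in fields:
            classes[fields[field]].append(arc)
    return dict(classes)


classes = equivalence_classes(arcs)
```

Because neither arc points at the other, the link is symmetrical, and dropping either arc leaves the rest of the graph well formed.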

AG as algebra

- An AG can be represented as a set of arcs, each with an associated label and (optionally-anchored) source and destination nodes
- The power set of this arc set defines a boolean algebra (as usual)
- Every member of the power set is itself a well-defined AG
- This algebra can be used for queries, just as the relational algebra is for RDBs
- Adding e.g. pointers from labels to other arcs compromises this property (because arc subsets are not well-formed if pointers cannot be dereferenced)
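A small Python sketch of the idea, using the TIMIT arcs for “had”. The tuple encoding (each node is an `(id, offset)` pair, each arc a `(source, target, label)` triple) and the `select` helper are assumptions for illustration. Since every arc carries its own nodes, any subset of the arc set is again a complete AG, and queries combine with ordinary set operations.

```python
# An arc is (source, target, label); a node is (id, offset-or-None).
w = ((1, 5200), (4, 9680), ("W", "had"))
p1 = ((1, 5200), (2, 6160), ("P", "hv"))
p2 = ((2, 6160), (3, 8720), ("P", "ae"))
p3 = ((3, 8720), (4, 9680), ("P", "dcl"))
ag = frozenset({w, p1, p2, p3})


def select(graph, predicate):
    """A query returns the subset of arcs satisfying the predicate, itself an AG."""
    return frozenset(arc for arc in graph if predicate(arc))


phones = select(ag, lambda arc: arc[2][0] == "P")
starts_at_5200 = select(ag, lambda arc: arc[0][1] == 5200)

both = phones & starts_at_5200    # boolean AND of two queries
either = phones | starts_at_5200  # boolean OR
not_phones = ag - phones          # complement relative to the full AG
```

If a label held a pointer to another arc, a subset could contain the pointer without its target, which is the loss of closure the last bullet warns about.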

AG as RDB

- An AG can therefore also be interpreted as a relational table, or (more conveniently) as a set of three relational tables
- This allows standard RDB implementations to be used for AG storage and retrieval
- Obvious advantages, though standard RDB may not use AG structure optimally...

Relational Representation

[Figure: an arc Ann1 with label <l1,l2,...,ln> running from anchor a1 (offset t1) to anchor a2 (offset t2)]

Three relations: anchor, annotation (=arc), feature (=label)

Anchor Relation

AnchorId  Offset
a1        t1
a2        t2

Annotation (arc) Relation

AnnotationId  Source  Destination
Ann1          a1      a2

Feature Relation

AnnotationId  Feature  Value
Ann1          F1       l1
Ann1          F2       l2
...           ...      ...
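The decomposition into anchor, annotation and feature relations can be sketched as a simple flattening step. The function name `to_relations`, the input layout and the feature names `type` and `label` are assumptions for illustration; the row shapes follow the three tables above.

```python
def to_relations(annotations):
    """Flatten annotations into anchor, annotation (arc) and feature rows.

    `annotations` maps an annotation id to
    ((source_anchor, source_offset), (target_anchor, target_offset), {feature: value}).
    """
    anchor, annotation, feature = {}, [], []
    for ann_id, ((a1, t1), (a2, t2), features) in annotations.items():
        anchor[a1] = t1  # shared anchors are stored once
        anchor[a2] = t2
        annotation.append((ann_id, a1, a2))
        for name, value in features.items():
            feature.append((ann_id, name, value))
    return sorted(anchor.items()), annotation, feature


anchor, annotation, feature = to_relations({
    "Ann1": (("a1", 5200), ("a2", 9680), {"type": "W", "label": "had"}),
})
```

Storing offsets only in the anchor rows keeps each time stamp in a single place, so correcting a boundary means updating one row.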

Queries across multiple tables

ID    Sex  DR  Ht
AKS0  F    1   5'04"
ASW0  F    5   5'06"
BJL0  F    5   5'07"

train/dr2/fbjl0/

ha     /hh aa1/
habit  /hh ae1 b ix t/
had    /hh ae1 d/
hafta  /hh ae1 f t ax/

Queries on AG Tables

select * from FEATURE where FEATURE.AGID="Timit:AG80"

select ANNOTATIONID,SPKRINFO.ID
from FEATURE,SPKRINFO
where SPKRINFO.DR=1
and SPKRINFO.Ht=70
and FEATURE.VALUE="dark"
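A runnable sketch of a query of this kind, using Python's standard `sqlite3` module with an in-memory database. Several things here are assumptions and not taken from the slides: the exact table schemas, the `SPKRID` column used to join annotations to speakers, heights stored as whole inches, and the three FEATURE rows, which are toy data invented for the example. The speaker rows reuse the IDs, sex and dialect region from the speaker table shown earlier.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE SPKRINFO (ID TEXT PRIMARY KEY, SEX TEXT, DR INTEGER, HT INTEGER);
CREATE TABLE FEATURE  (ANNOTATIONID TEXT, AGID TEXT, SPKRID TEXT, NAME TEXT, VALUE TEXT);
""")

# Speaker rows; HT is height in inches (an assumed encoding).
con.executemany("INSERT INTO SPKRINFO VALUES (?, ?, ?, ?)", [
    ("AKS0", "F", 1, 64), ("ASW0", "F", 5, 66), ("BJL0", "F", 5, 67),
])

# Toy feature rows, invented for illustration; SPKRID is a hypothetical join key.
con.executemany("INSERT INTO FEATURE VALUES (?, ?, ?, ?, ?)", [
    ("Ann1", "Timit:AG1", "AKS0", "word", "dark"),
    ("Ann2", "Timit:AG1", "AKS0", "word", "suit"),
    ("Ann3", "Timit:AG2", "BJL0", "word", "dark"),
])

# Annotations of the word "dark" spoken by speakers from dialect region 1.
rows = con.execute("""
    SELECT FEATURE.ANNOTATIONID, SPKRINFO.ID
    FROM FEATURE JOIN SPKRINFO ON FEATURE.SPKRID = SPKRINFO.ID
    WHERE SPKRINFO.DR = ? AND FEATURE.VALUE = ?
""", (1, "dark")).fetchall()
```

The explicit join condition is the main design point: without some column relating annotations to speakers, a query over both tables returns every pairing of matching rows.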

AG software

- AGTK provides API and language bindings
  - version 2.0 recently released
- Sample applications
- Open-source license
- Available on sourceforge:

AGTK architecture

API Summary

- Functions for creating, accessing, modifying, storing and loading AGs
- C++ library
- Compiles on Unix and Windows
- Scripting language access: Python, Tcl/tk

File I/O Library

Approach:
- build import methods for all widely used formats
- public API & documentation to encourage others to contribute code for their formats

Currently supported:
- AIF (ATLAS Interchange Format - XML)
- BAS, BU, CALLHOME, CSV, Switchboard, TIMIT, Treebank, xlabel

Integration with other tools

Example: WaveSurfer/SNACK
Sjölander and Beskow
www.speech.kth.se/wavesurfer/

- open source software for sound visualization, analysis and manipulation
- Linux, Windows 95/98/NT/2k, Mac, Solaris, ...
- customizable, extensible, embeddable
- can read and write: wav, au, aiff, mp3, csl, sd, sphere
- unlimited file size
- Unicode support

Wavesurfer Screenshot 1

Wavesurfer Screenshot 2

Wavesurfer Screenshot 3

Wavesurfer Screenshot 4

Annotation Component: Spreadsheet (TRAINS+DAMSL)

Annotation here presented in spreadsheet mode

Each row is an annotation of a stretch of signal
Each column is a type of annotation

TableTrans tool

Seamless integration of AGTK for annotation, and Wavesurfer for audio display and playback.

Components in TableTrans

Another annotation GUI

Issues for the future

Some positive things
- “stand-off” (rather than in-line) annotation is now common
  - though by no means universal
  - but in-line annotators mostly realize they are sinful
- AGTK implementation is mature
  - libraries are well designed & implemented
  - good integration with GUIs and DB backends
  - can read/write many common formats
- Some AG-based tools are good
  - basically, those that have really been used
  - demand pull & influence of users on development

Issues for the future

Some things need more work
- AG API and AGTK are not yet widely used
- Many AG-based tools are rough sketches
- NIST ATLAS is not popular with researchers (Java, complexity)
- For many projects, something simpler & less general is still the local optimum:
  - lines of tab-separated fields, or
  - in-line mark-up (XML or ad hoc), or
  - other legacy or new ad hoc formats

but it’s still early days...