Enabling xComForTable Mapping to the Linguistic Annotation Framework
Marion Freese
Sony International (Europe) GmbH;
IMS, Universität Stuttgart;
hmb Datentechnik GmbH
LREC 2004 05/29/2004 Marion Freese
Overview
xComForT – Outline
Relevance for richly annotated corpora
xComForT Features
– Adaptation to new text formats
– Integration of annotation tools
Proposal for integration into LAF
Summary
xComForT – What is it?
extensible Common Format for Text based on
– XML
– Text Encoding Initiative (TEI)
– Corpus Encoding Standard (CES / XCES)
provides extensibility and reusability
xComForT – What’s it for?
NOT
– Standard for linguistic annotation
BUT
– Standards proposal for structural annotation of primary data
– Common anchor for linguistic annotations (LA)
– Set of guidelines for LA architecture (company-internal standard)
Example: Newspaper (plain text)
[Figure: plain-text newspaper page with labeled structural units: meta information, copyright, headline, byline, dateline, quotation, paragraph]
xComForT – Primary Document Example
<xcomfortDoc type="text" extension="SZ" version="v0.6" TEIform="TEI.2">
  <cesHeader ...> <!-- ... --> </cesHeader>
  <text xml:lang="de">
    <!-- ... -->
      zu erhalten.</p>
      <byline type="signer">
        <docAuthor type="short">mgd</docAuthor>
      </byline>
    </div>
    <div type="article" id="d19990104_a12">
      <opener id="d19990104_a12o">
        <divMeta>
          <publDate>Montag, 4. Januar 1999</publDate>
          <cat target="ns8"><hi>BAYERN</hi></cat>
          <!-- ... -->
        </divMeta>
        <head id="d19990104_a12hl1">Kafkaeskes Augsburg</head>
        <head id="d19990104_a12hl2" type="sub">Der nächste Akzent <!-- ... --></head>
        <byline type="main">Von <docAuthor type="full">Peter Richter</docAuthor></byline>
        <dateline><location>Augsburg</location> – </dateline>
        <p id="d19990104_a12p1">Auch wenn nicht <!-- ... --></p>
  <!-- ... -->
</xcomfortDoc>
xComForT – Data Architecture
[Figure: xComForT storage format, organized in four levels]
level 1: base document
level 2: token level (token stream; e.g. morpheme, syllable streams)
level 3: 1st linguistic level (e.g. sentence, chunk, mw streams; e.g. PoS, lemma, pronunciation streams)
level 4: 2nd linguistic level (e.g. parse tree stream, intonation stream; segInfo)
Link types shown between levels: substring / 1:1, range-to / 1:1, 1:1 (#id)
Relevance for richly annotated Corpora
Standoff-Markup
– supports huge amount of annotation data
  » alternative / concurrent / ambiguous annotations
  » partial / underspecified results
  » flexible merging
  » various annotation types (multimodal, multimedia, metadata, …)
  media independence
– reduces annotation dependencies
Support for integration of external tools for annotation and exploitation
common standards-based starting point for rich annotation
Comparison with CES
Structural markup and linguistic annotation are strictly separated in xComForT
provides common base format for arbitrary linguistic annotation
allows for using consistent annotation schema
Primary document DTD is easily extensible while retaining TEI conformance
xComForT provides more flexibility than CES wrt. resource formats (e.g. integration of different modalities possible)
Creation of an extended DTD for storage
[Figure: creation of an extended DTD for storage]
core markup definition: xComForT.ent, xComForT.dtd
extension definition (from template): xcomfort_new.ent, xcomfort_new.dtd
template sections: class.mod, class.new, class.comments, elem.mod, elem.new, elem.comments
TEI conformant storage format: xComForT_store.dtd
TEI conformant extension storage format: xComForT_store_new.dtd
Extension Definition Support
core markup definition contains extension entity for each element and entity, e.g.
» <!ENTITY % x.byline ''>
» <!ELEMENT byline (#PCDATA | author %x.byline;)*>
<!ENTITY % x.byline '| interviewer'>
<!ELEMENT byline (#PCDATA | author | interviewer)*>
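The extension mechanism above relies on a standard XML rule: when a parameter entity is declared more than once, the first declaration is binding. The following is a minimal Python sketch of that behaviour using plain string substitution, not a real DTD parser. The function name `expand_dtd` and the two regular expressions are illustrative assumptions and not part of the xComForT toolbox. The content model is written with a trailing `*`, as XML requires for mixed content.

```python
import re

# Hypothetical helpers for illustration only; a real DTD parser does much more.
ENTITY_DECL = re.compile(r"<!ENTITY\s+%\s+([\w.]+)\s+(['\"])(.*?)\2\s*>")
ENTITY_REF = re.compile(r"%([\w.]+);")


def expand_dtd(*fragments):
    """Expand parameter entities across DTD fragments read in order.

    As in XML, the first declaration of an entity is binding, so an
    extension fragment read before the core fragment overrides it.
    """
    text = "\n".join(fragments)
    entities = {}
    for name, _quote, value in ENTITY_DECL.findall(text):
        entities.setdefault(name, value)  # first declaration wins
    body = ENTITY_DECL.sub("", text)  # drop the entity declarations
    expanded = ENTITY_REF.sub(lambda m: entities[m.group(1)], body)
    return [line.strip() for line in expanded.splitlines() if line.strip()]


CORE = """
<!ENTITY % x.byline ''>
<!ELEMENT byline (#PCDATA | author %x.byline;)*>
"""
EXTENSION = "<!ENTITY % x.byline '| interviewer'>"

core_only = expand_dtd(CORE)            # extension entity stays empty
extended = expand_dtd(EXTENSION, CORE)  # extension is read first
print(core_only)
print(extended)
```

Reading the extension fragment before the core fragment is what lets the core DTD stay untouched while the effective `byline` declaration gains the `interviewer` element.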
Integration of Annotation Tools
Toolbox support for converting annotation tool output to xComForT
[Figure: annotate.perl workflow; labels: xComForT document, element names (e.g. <elem>p</elem>), type of annotation (e.g. sentence), annotation stream (e.g. <s xlink:href=".."/>)]
text nodes for annotation tool input:
<tn ancestors="div p" parentID="div1.p1">With</tn>
<tn ancestors="div p" parentID="div1.p1">the</tn>
...
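As a rough illustration of the text-node format shown on this slide, the Python sketch below walks a small document and emits one `<tn>` element per whitespace-separated word inside the selected elements. This is not the annotate.perl implementation. The function name `text_nodes`, the sample document, and the decision to split on whitespace are assumptions made for the example.

```python
import xml.etree.ElementTree as ET


def text_nodes(root, wanted):
    """Yield <tn> elements for words directly inside elements named in `wanted`.

    Each <tn> records the chain of ancestor element names and the id of the
    element that contained the word (assumed layout, modelled on the slide).
    """
    def walk(elem, ancestors):
        path = ancestors + [elem.tag]
        if elem.tag in wanted and elem.text:
            for word in elem.text.split():
                tn = ET.Element(
                    "tn",
                    {"ancestors": " ".join(path), "parentID": elem.get("id", "")},
                )
                tn.text = word
                yield tn
        for child in elem:
            yield from walk(child, path)

    return walk(root, [])


# Hypothetical two-level sample document.
doc = ET.fromstring('<div id="div1"><p id="div1.p1">With the help</p></div>')
lines = [ET.tostring(tn, encoding="unicode") for tn in text_nodes(doc, {"p"})]
for line in lines:
    print(line)
```

Keeping the ancestor path and parent id on every text node is what allows the tool output to be linked back to the primary document afterwards.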
Linguistic Annotation Tools – implemented examples
input and output formats of
– Tokenizer (from IMS, University of Stuttgart)
  » tokens
  » sentences
– IMS TreeTagger
  » lemma
  » part-of-speech
Relation to current LAF standardization issues (1)
General requirements for the standard for a Linguistic Annotation Framework (LAF) (cf. Ide & Romary 2003)
xComForT conforms to these requirements, i.e. to
– Media independence
– Human readability
– Processability
Relation to current LAF standardization issues (2)
Remaining requirements are xComForT’s main features, i.e.
– Consistency
– Uniformity
– Incrementality
– Expressiveness
Two proposals for integration into the LAF
Mapping between proprietary resource formats and the LAF annotation data model
Resource reusability
Proposal to the LAF (1-1)
LAF architecture (Ide & Romary)
Dump format
Proposal to the LAF (1-2)
Dump Format conforming to xComForT guidelines
Advantages
– Direct mapping from/to user-defined formats
– Support for annotation tool integration
– Easy conversion into proprietary formats
Disadvantages
– xComForT is possibly not the most adequate/efficient processing format
– Different requirements of processing format vs. exchange format
Proposal to the LAF (2-1)
LAF architecture (Ide & Romary)
Intermediate Format between resource and LAF dump format
Proposal to the LAF (2-2)
Intermediate Format (Common Document Format)
Disadvantages
– One more mapping step
Advantages
– Standards-based adaptation to proprietary formats
– Mapping to dump format tightly defined and targeted
– Common mapping tool, e.g. provided by the LAF
Example: Potential LAF dump format
“Jones followed him into the front room, closing the door behind him” (Ide & Romary 2001)
<struct id="s0" type="S">
  <struct id="s1" type="NP" xlink:href="xptr(substring(p/s[1]/text(),1,5))" rel="SBJ"/>
  <struct id="s2" type="VP" xlink:href="xptr(substring(p/s[1]/text(),7,8))"/>
  <struct id="s3" type="NP" xlink:href="xptr(substring(p/s[1]/text(),16,3))"/>
  <struct id="s4" type="PP" xlink:href="xptr(substring(p/s[1]/text(),20,4))" rel="DIR">
    <struct id="s5" type="NP" xlink:href="xptr(substring(p/s[1]/text(),25,14))"/>
  </struct>
  <struct id="s6" type="S" rel="ADV"> <!-- ... --></struct>
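The `substring(...)` pointers in this dump format use XPath conventions: positions are 1-based and the last argument is a length, not an end offset. The Python sketch below resolves the five pointers from the example against the sentence text. The regular expression and the names `STRUCT` and `xpath_substring` are illustrative assumptions, and the snippet is matched as plain text rather than parsed as XML.

```python
import re

SENTENCE = "Jones followed him into the front room, closing the door behind him"

# The struct elements from the slide, kept as plain text for this sketch.
DUMP = """
<struct id="s0" type="S">
  <struct id="s1" type="NP" xlink:href="xptr(substring(p/s[1]/text(),1,5))" rel="SBJ"/>
  <struct id="s2" type="VP" xlink:href="xptr(substring(p/s[1]/text(),7,8))"/>
  <struct id="s3" type="NP" xlink:href="xptr(substring(p/s[1]/text(),16,3))"/>
  <struct id="s4" type="PP" xlink:href="xptr(substring(p/s[1]/text(),20,4))" rel="DIR">
    <struct id="s5" type="NP" xlink:href="xptr(substring(p/s[1]/text(),25,14))"/>
"""

# Captures the struct id plus the start and length arguments of substring().
STRUCT = re.compile(r'id="(\w+)"[^>]*?substring\([^,]*,(\d+),(\d+)\)')


def xpath_substring(text, start, length):
    """Mimic XPath substring() for integer arguments: 1-based start, then length."""
    return text[start - 1:start - 1 + length]


spans = {
    sid: xpath_substring(SENTENCE, int(start), int(length))
    for sid, start, length in STRUCT.findall(DUMP)
}
for sid, covered in spans.items():
    print(sid, repr(covered))
```

Because every pointer addresses character offsets in the primary text, any change to that text invalidates the offsets. The xComForT representation on the following slides points at token ids instead.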
Example: Possible xComForT Representation (1)
[Figure: xComForT storage format for the example]
level 1: PTBraw.xml
level 2 (token level): token.xml, linked to level 1 via substring
level 3 (1st linguistic level): sentence.xml, chunk.xml (segments), linked to token.xml via range-to
level 4 (2nd linguistic level): chunk_relation.xml (segInfo), linked 1:1 (#id)
Example: Possible xComForT Representation (2)
chunk.xml
<segments level="ling1" type="chunk" xml:base="token.xml">
  <chunk id="div1.p1.chunk1" type="NP" xlink:href="#div1.p1.tok1"/>
  <chunk id="div1.p1.chunk2" type="VP" xlink:href="#div1.p1.tok2"/>
  <chunk id="div1.p1.chunk3" type="NP" xlink:href="#div1.p1.tok3"/>
  <chunk id="div1.p1.chunk4" type="PP" xlink:href="#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))"/>
  <chunk id="div1.p1.chunk5" type="NP" xlink:href="#xpointer(id('div1.p1.tok5')/range-to(id('div1.p1.tok7')))"/>
</segments>
chunk_relation.xml
<segInfo level="ling2" type="rel" xml:base="chunk.xml">
  <rel id="div1.p1.chunk1.rel" xlink:href="#div1.p1.chunk1">SBJ</rel>
  <rel id="div1.p1.chunk4.rel" xlink:href="#div1.p1.chunk4">DIR</rel>
</segInfo>
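To make the two link styles in chunk.xml concrete, the Python sketch below resolves both a plain `#id` reference and an `id(...)/range-to(id(...))` reference to the tokens they cover. The token table is an assumption reconstructed from the example sentence (token.xml itself is not shown on the slides). The pattern `RANGE` and the function `resolve` are illustrative, expect balanced parentheses in the pointer, and do not implement XPointer in general.

```python
import re

# Assumed content of token.xml: one token per word of the example clause.
WORDS = "Jones followed him into the front room".split()
TOKENS = {f"div1.p1.tok{i}": word for i, word in enumerate(WORDS, start=1)}
ORDER = list(TOKENS)  # token ids in document order

# Matches only the single range-to form used on the slide.
RANGE = re.compile(
    r"#xpointer\(id\('([^']+)'\)/\s*range-to\(id\('([^']+)'\)\)\)"
)


def resolve(href):
    """Return the token ids covered by a '#id' or range-to reference."""
    match = RANGE.fullmatch(href)
    if match:
        first, last = (ORDER.index(token_id) for token_id in match.groups())
        return ORDER[first:last + 1]
    return [href[1:]] if href.startswith("#") else [href]


np_chunk = resolve("#div1.p1.tok1")
pp_chunk = resolve("#xpointer(id('div1.p1.tok4')/range-to(id('div1.p1.tok7')))")
pp_text = " ".join(TOKENS[token_id] for token_id in pp_chunk)
print(np_chunk, pp_text)
```

Since chunks point at token ids rather than character offsets, the relation stream in chunk_relation.xml can in turn point at chunk ids with simple 1:1 `#id` links.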
Summary
standards-based
common tools available and usable
stand-off annotation
easy plugging-in of linguistic annotation schema
easily extensible markup of primary document
easy adaptation to arbitrary resource
Standard base format, e.g. to simplify support for mapping into the Linguistic Annotation Framework
xComForTable Mapping to the LAF
Thanks for your attention!
… Any questions?
Structural Markup improves Analysis
e.g. sentence boundary detection
Then things would get even worse. (see also pages 4 and 11)
SHADOWS
By Leena Dhingra
I couldn’t possibly do that.
tokenizer input: <p>-elements (without <rs>-elements)
correct sentence markup
<p>[..]Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p>
</div>
<div>
  <head>SHADOWS</head>
  <byline>By Leena Dhingra</byline>
  <p>I couldn’t possibly do that.</p>
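The idea on this slide, feeding the tokenizer only `<p>` content and leaving out `<rs>` material, can be sketched in Python as follows. The function name `paragraph_text`, the wrapping `<body>` element, and the `skip` parameter are assumptions for the example. Only direct children of `<p>` are checked, which is enough for the markup shown here.

```python
import xml.etree.ElementTree as ET

# The slide's example, wrapped in a hypothetical <body> so it parses as one document.
SAMPLE = """<body>
<div><p>Then things would get even worse.<rs type="see also"> (see also pages 4 and 11)</rs></p></div>
<div><head>SHADOWS</head><byline>By Leena Dhingra</byline><p>I couldn’t possibly do that.</p></div>
</body>"""


def paragraph_text(p, skip=("rs",)):
    """Return the text of a <p> element without the content of skipped children."""
    parts = [p.text or ""]
    for child in p:
        if child.tag not in skip:
            parts.append("".join(child.itertext()))
        parts.append(child.tail or "")  # text after the child still belongs to <p>
    return "".join(parts).strip()


root = ET.fromstring(SAMPLE)
tokenizer_input = [paragraph_text(p) for p in root.iter("p")]
for text in tokenizer_input:
    print(text)
```

With the headline, byline, and cross-reference excluded, each remaining string ends at a genuine sentence boundary, so a sentence splitter is not misled by the unpunctuated material between the two paragraphs.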
Example – Discontinuous Material
CES
xComForT
<div id="d19990607_a1" type="article">
  <opener><!-- ... --></opener>
  <discontinuous id="d19990607_a1.discontinuous" type="rubbish">
    Die Gewinnzahlen
    Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
  </discontinuous>
  <closer><!-- ... --></closer>
</div>
Die Gewinnzahlen
Lotto (5. Juni): 5, 19, 21, 31, 43, 48 Zusatzzahl: 32, Superzahl: 9 Toto: lag noch nicht vor
<!ELEMENT discontinuous (#PCDATA)>
<!ATTLIST discontinuous
  id   ID                          #REQUIRED
  type (rubbish | editorial | ..)  #IMPLIED>
Example – Meta Information
Montag, 7. Juni 1999 NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7
CES
<opener><date>Montag, 7. Juni 1999</date> NACHRICHTEN M / F Süddeutsche Zeitung Nr. 127 / Seite 7</opener>
xComForT
<opener>
  <divMeta>
    <publDate>Montag, 7. Juni 1999</publDate>
    <cat target="ns1">NACHRICHTEN</cat>
    <distribution>M / F</distribution>
    <publBy>Süddeutsche Zeitung</publBy>
    <volNr>Nr. 127</volNr> / <pageNr>Seite 7</pageNr>
  </divMeta>
</opener>
reference to taxonomy