What is the relevant information in a text?
Silvia Giannini
Visiting PhD student
Politecnico di Bari
Web & Media Group meeting | 27.10.2014
The scenario
• Entertainment domain: BBC TV-programs (TV-series, movies, documentaries, …)
• Aim: Enrich the content description with links to the Web of Data
• Applications: Linked Data patterns for recommendations; multi-domain dataset creation, …
Following the grandeur of Baroque, Rococo art is often dismissed as frivolous and unserious, but Waldemar Januszczak disagrees. The first episode is about travel in the 18th century and how it impacted greatly on some of the finest art ever made. The world was getting smaller and took on new influences shown in the glorious Bavarian pilgrimage architecture, Canaletto's romantic Venice and the blossoming of exotic designs and tastes all over Europe. The Rococo was art expressing itself in new, exciting ways.
How?
• Yet another semantic annotation tool?
• Peculiarities:
- Different formats
- Broad coverage of topics
Multiple annotators integration
text enrichment
“Canaletto” -> ontology:Location
“Rococo” -> dbpedia:Rococo_(band)
• Type mis-classification
• URI mis-annotation
• Not relevant labels
The NERD framework
Multiple annotators integration
• Feature-based solution for entity relevance definition and entity classification
• Majority vote and disagreement metrics
• Extractors can disagree on:
– The existence of a label, e.g. some identify a label and others do not
– The span of the label, e.g. ‘Myra’ VS ‘Myra Gail’
– The type of the label, e.g. ‘Building’ VS ‘Organization’
– The URI of the label
Proposal
DEFINE THE RELEVANCE OF A LABEL
INFLUENCE THE CONFIDENCE OF AN ANNOTATION
Workflow for relevance assessment
Following the grandeur of Baroque, Rococo art is often dismissed as frivolous and unserious, but Waldemar Januszczak disagrees. […] The first episode is about travel in the 18th century and how it impacted greatly on some of the finest art ever made. The world was getting smaller and took on new influences shown in the glorious Bavarian pilgrimage architecture, Canaletto's romantic Venice and the blossoming of exotic designs and tastes all over Europe. The Rococo was art expressing itself in new, exciting ways.
[Workflow diagram: NLP tools; metrics; features matrix extraction; classifier; crowd; relevant labels]
• Disagreement between TextRazor annotations in NERD and standalone TextRazor in terms of missing labels, missing types, granularity of types.
PID: b0074t2b | Title: Great plains
Synopsis: “The great plains are the vast open spaces of our planet. […] Close on their heels come an array of plains predators including eagles, wolves and lions. […]”
Label: eagles#439#445
Extractor | Type | URI
textrazor(nerd) | nerd:Thing | http://en.wikipedia.org/wiki/Eagle
textrazor | dbpedia-owl:Bird | http://en.wikipedia.org/wiki/Eagle
Workflow for relevance assessment
Pre-processing
• Alignment of extractors’ results:
- Label: each label has a list of alternative labels contained in or overlapping with the given one
- Type: same vocabulary for all extraction methods (529 classes of the DBpedia ontology, extended with owl:Thing and an Amount type)
- URI: DBpedia resources
• Label • NERD ontology class • sameAs link
• Label • DBpedia ontology class • Wikipedia page
• Label • DBpedia category • Wikipedia page
• Label • DBpedia ontology class • DBpedia URI
Majority-vote for relevance: longest-span strategy*
extractor | label | startOffset | endOffset | Aligned label
#1 | Rococo | 35 | 41 | Rococo art#35#45
#2 | Rococo art | 35 | 45 | Rococo art#35#45
#3 | Rococo | 35 | 41 | Rococo art#35#45
#3 | Art | 42 | 45 | Rococo art#35#45
#4 | Rococo | 35 | 41 | Rococo art#35#45
#4 | Art | 42 | 45 | Rococo art#35#45
Label & span alignment
The LONGEST-SPAN strategy
*Analogously, the shortest-span strategy can be applied
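The longest-span alignment can be sketched in a few lines of Python. This is a minimal sketch, not the implementation behind these slides: `Span` and `align_longest_span` are hypothetical names, and the sketch assumes that a label is aligned to the longest label, from any extractor, that fully contains it.

```python
from typing import NamedTuple


class Span(NamedTuple):
    text: str
    start: int
    end: int


def align_longest_span(spans):
    """Map every span to the longest span (from any extractor) containing it."""
    aligned = {}
    for s in spans:
        # A span always contains itself, so this list is never empty.
        containing = [c for c in spans if c.start <= s.start and s.end <= c.end]
        longest = max(containing, key=lambda c: c.end - c.start)
        aligned[s] = f"{longest.text}#{longest.start}#{longest.end}"
    return aligned


# Labels from the Rococo example, pooled over all extractors.
spans = [Span("Rococo", 35, 41), Span("Rococo art", 35, 45), Span("Art", 42, 45)]
aligned = align_longest_span(spans)
for s in spans:
    print(s.text, "->", aligned[s])
```

Swapping `max` for `min` over the spans contained in `s` would give the shortest-span variant mentioned in the footnote.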
Issues
• In the previous example, Rococo and Art are related to the same category (Arts). Thus, the longest-span strategy for label alignment leads to a consistent conceptual category for the new label (Rococo Art).
• Consider this program description:
A journey back to the 1950s for a look at the wildest pop music of all time in a film that tells the stories of Bill Haley, Elvis Presley, Little Richard, Chuck Berry, Jerry Lee Lewis and Buddy Holly, giants from an era when pop music really was mad, bad and dangerous to know. The programme features the artists themselves, alongside people like Bill Haley's original Comets, the Crickets, Buddy Holly's widow Maria Elena, Jerry Lee Lewis's former wife Myra Gail and his sister, Chuck Berry's son and many more, including June Juanico, Elvis' first serious girlfriend. Other contributors include Tom Jones, Jamie Callum, Paul McCartney, Cliff Richard, Joe Brown, Marty Wilde, Green Day, Minnie Driver, Jack White, the Mavericks, Jools Holland, Hank Marvin, Fontella Bass, John Waters and more. Elvis's pelvis was just the start. Who had to change the lyrics to their biggest hit because the originals were too obscene? Who married their 13-year-old cousin? Who used lard to get their hair just right? And what happened on the day the music died?
BBC Program: Kings of Rock and Roll (Pid: b007c95q)
extractor | label | startOffset | endOffset | Type | Aligned label
#1 | Myra Gail | 453 | 462 | Person | Myra Gail#453#462
#2 | Myra | 453 | 457 | Settlement | Myra Gail#453#462
#2 | Myra Gail | 453 | 462 | Person | Myra Gail#453#462
#3 | Myra | 453 | 457 | Band,Artist | Myra Gail#453#462
#3 | Gail | 458 | 462 | Person | Myra Gail#453#462
#4 | Myra Gail | 453 | 462 | Thing | Myra Gail#453#462
The HYBRID-SPAN strategy [1]
Given two labels l1 and l2 and an upper ontology O, l1 and l2 belong to the same annotation span if:
1. l1 is contained in l2 or l2 is contained in l1, and type(l1) and type(l2) are in a super(sub)class relationship (e.g. Royal Academy[Organization] in Royal Academy of Music[University])
OR
2. l1 and l2 overlap but neither is contained in the other (e.g., Royal Academy[Organization] and Academy of Music[Building])
OR
3. l1 coincides with l2 (e.g., Royal Academy[Organization] and Royal Academy[Museum])
What about the Thing type?
[1] Chen, L., Ortona, S., Orsi, G., & Benedikt, M. (2013). Aggregating Semantic Annotators. Proceedings of the VLDB Endowment, Vol. 6, No. 13, pp. 1486-1497. Riva del Garda, Trento, Italy.
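The three hybrid-span conditions can be written as a single predicate. The sketch below is illustrative only: `Label`, `PARENT` and `same_span` are hypothetical names, the class hierarchy is a toy fragment assumed for the example (not the full DBpedia ontology), identical types are assumed to count as related, and the open question about the Thing type is not addressed (Thing is left out of the toy hierarchy).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    text: str
    start: int
    end: int
    type: str


# Toy child -> parent fragment of an upper ontology (assumption for this sketch).
PARENT = {"University": "EducationalInstitution",
          "EducationalInstitution": "Organization"}


def ancestors(t):
    """All superclasses of t in the toy hierarchy."""
    found = set()
    while t in PARENT:
        t = PARENT[t]
        found.add(t)
    return found


def related(t1, t2):
    """True if the types are equal or in a super(sub)class relationship."""
    return t1 == t2 or t1 in ancestors(t2) or t2 in ancestors(t1)


def contains(outer, inner):
    return outer.start <= inner.start and inner.end <= outer.end


def same_span(l1, l2):
    if (l1.start, l1.end) == (l2.start, l2.end):      # condition 3: coincide
        return True
    if contains(l1, l2) or contains(l2, l1):          # condition 1: nested
        return related(l1.type, l2.type)
    return l1.start < l2.end and l2.start < l1.end    # condition 2: partial overlap


royal = Label("Royal Academy", 0, 13, "Organization")
ram = Label("Royal Academy of Music", 0, 22, "University")
myra = Label("Myra", 453, 457, "Settlement")
myra_gail = Label("Myra Gail", 453, 462, "Person")
print(same_span(royal, ram))       # nested, related types
print(same_span(myra, myra_gail))  # nested, unrelated types
```

Under these assumptions "Myra"[Settlement] stays a separate span from "Myra Gail"[Person], which is the behaviour the hybrid strategy is meant to provide.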
The HYBRID-SPAN strategy*
extractor | label | startOffset | endOffset | Type | Aligned label
#1 | Myra Gail | 453 | 462 | Person | Myra Gail#453#462
#2 | Myra | 453 | 457 | Settlement | Myra#453#457
#2 | Myra Gail | 453 | 462 | Person | Myra Gail#453#462
#3 | Myra | 453 | 457 | Band,Artist | Myra#453#457
#3 | Gail | 458 | 462 | Person | Myra Gail#453#462
#4 | Myra Gail | 453 | 462 | Thing | Myra Gail#453#462
*Vocabulary alignment is required as a previous step
Majority-vote for relevance: hybrid-span strategy
Features for Relevance
• F1: nerd(l) -> 1 if label l is extracted by NERD; 0 otherwise
• F2: textrazor(l) -> 1 if label l is extracted by TextRazor; 0 otherwise
• F3: tagme(l) -> 1 if label l is extracted by TAGME; 0 otherwise
• F4: nltk(l) -> 1 if label l is extracted by the NLTK-based method; 0 otherwise
• F5: $\mathrm{abs}(l) = \dfrac{\mathrm{nerd}(l) + \mathrm{textrazor}(l) + \mathrm{tagme}(l) + \mathrm{nltk}(l)}{|EM|}$
Absolute score for l over the set EM of all Extraction Methods (four in this setting)
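As a quick check of F1-F5, the absolute score is just the fraction of extraction methods that found the label. A minimal sketch, with `EXTRACTORS` and `abs_score` as hypothetical names:

```python
EXTRACTORS = ("nerd", "textrazor", "tagme", "nltk")


def abs_score(hits):
    """hits maps extractor name -> 1 if it extracted the label, else 0 (F1-F4)."""
    return sum(hits.get(e, 0) for e in EXTRACTORS) / len(EXTRACTORS)


# "wildlife" is found by NERD, TextRazor and the NLTK-based method, not by TAGME.
wildlife = {"nerd": 1, "textrazor": 1, "tagme": 0, "nltk": 1}
print(abs_score(wildlife))  # 0.75
```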
• F6: $\mathrm{lss}(l) = \dfrac{wc(l)}{wc(l_{LS})}$, where wc is the word count function and $l_{LS}$ is the longest span containing l in the union set of all labels recognized by the extraction methods
Expresses the span overlap between l and the longest span containing l, i.e. the portion of l contained in the longest span $l_{LS}$
• F7: $\mathrm{wlss}(l) = \mathrm{abs}(l) \cdot \mathrm{lss}(l)$
Longest-span score for l, weighted by the absolute score for l
• F8: $\mathrm{sss}(l) = \dfrac{wc(l_{SS})}{wc(l)}$, where $l_{SS}$ is the shortest span contained in l in the union set of all labels recognized by the extraction methods
Expresses the span overlap between l and the shortest span contained in l, i.e. the portion of l containing the shortest span $l_{SS}$
• F9: $\mathrm{wsss}(l) = \mathrm{abs}(l) \cdot \mathrm{sss}(l)$
Shortest-span score for l, weighted by the absolute score for l
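The span scores F6-F9 can be reproduced for the "east africa" / "africa" pair with a short sketch. The function names and the tuple layout `(text, start, end)` are assumptions made for illustration; the absolute scores 0.75 and 0.5 are taken from the F5 values of the two labels.

```python
def wc(text):
    """Word count of a label."""
    return len(text.split())


def contains(outer, inner):
    return outer[1] <= inner[1] and inner[2] <= outer[2]


def lss(label, labels):
    """F6: wc(l) / wc(longest span containing l)."""
    containing = [c for c in labels + [label] if contains(c, label)]
    longest = max(containing, key=lambda c: c[2] - c[1])
    return wc(label[0]) / wc(longest[0])


def sss(label, labels):
    """F8: wc(shortest span contained in l) / wc(l)."""
    contained = [c for c in labels + [label] if contains(label, c)]
    shortest = min(contained, key=lambda c: c[2] - c[1])
    return wc(shortest[0]) / wc(label[0])


east_africa = ("east africa", 361, 372)
africa = ("africa", 366, 372)
pool = [east_africa, africa]  # union of labels from all extraction methods

print(lss(africa, pool), 0.5 * lss(africa, pool))               # F6, F7 for "africa"
print(sss(east_africa, pool), 0.75 * sss(east_africa, pool))    # F8, F9 for "east africa"
```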
• F10: $\mathrm{oss}(l) = \dfrac{1}{|OL|} \sum_{j \in OL} \dfrac{|j \cap l|}{|j \cup l|}$, where |OL| is the number of overlapping labels among the alternative ones.
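A possible reading of F10, sketched below, takes the intersection and union over the words of the two labels and counts as "overlapping" only the alternative labels that overlap l without containing it or being contained in it. Both choices are assumptions of this sketch; they are adopted here because they reproduce the 0.29 reported for "earth: two". The function names are hypothetical.

```python
def overlaps_partially(a, b):
    """True if the spans intersect and neither contains the other."""
    intersect = a[1] < b[2] and b[1] < a[2]
    nested = (a[1] <= b[1] and b[2] <= a[2]) or (b[1] <= a[1] and a[2] <= b[2])
    return intersect and not nested


def oss(label, alternatives):
    """F10: average word-level Jaccard index over the overlapping labels OL."""
    ol = [j for j in alternatives if overlaps_partially(label, j)]
    if not ol:
        return 0.0
    l_words = set(label[0].split())
    total = 0.0
    for j in ol:
        j_words = set(j[0].split())
        total += len(l_words & j_words) / len(l_words | j_words)
    return total / len(ol)


label = ("earth: two", 227, 237)
alternatives = [("two million", 234, 245),
                ("earth", 227, 232),
                ("two million gazelles", 234, 254)]
score = oss(label, alternatives)
print(round(score, 2))         # F10
print(round(score * 0.25, 2))  # F11 = oss(l) * abs(l), with abs(l) = 0.25
```

With this reading, "earth" is excluded from OL because it is nested in "earth: two", and nested pairs such as "east africa" / "africa" get an overlap score of 0.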
• F11: $\mathrm{woss}(l) = \mathrm{oss}(l) \cdot \mathrm{abs}(l)$
label#offset | Alternative labels | F1 | F2 | F3 | F4 | F5 | F6 | F7 | F8 | F9 | F10 | F11
wildlife#912#920 | - | 1 | 1 | 0 | 1 | 0.75 | 1 | 0.75 | 1 | 0.75 | 0 | 0
east africa#361#372 | africa#366#372 | 0 | 1 | 1 | 1 | 0.75 | 1 | 0.75 | 0.5 | 0.375 | 0 | 0
africa#366#372 | east africa#361#372 | 1 | 1 | 0 | 0 | 0.5 | 0.5 | 0.25 | 1 | 0.5 | 0 | 0
earth: two#227#237 | two million#234#245; earth#227#232; two million gazelles#234#254 | 0 | 0 | 1 | 0 | 0.25 | 1 | 0.25 | 0.5 | 0.125 | 0.29 | 0.07
…
Features for Relevance with type
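Label#offset | type | Alternative label | F1 | …
wildlife#912#920 | Thing | - | 1 | …
east africa#361#372 | Thing | africa#366#372 [Place,Continent] | 0 | …
east africa#361#372 | Country | africa#366#372 [Place,Continent] | 0 | …
africa#366#372 | Place | east africa#361#372 [Thing,Country] | 1 | …
africa#366#372 | Continent | east africa#361#372 [Thing,Country] | 1 | …
…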
• F1: nerd(l,t) -> 1 if label l with type t is extracted by NERD; 0 otherwise
• F2: textrazor(l,t) -> 1 if label l with type t is extracted by TextRazor; 0 otherwise
• F3: tagme(l,t) -> 1 if label l with type t is extracted by TAGME; 0 otherwise
• F4: nltk(l,t) -> 1 if label l with type t is extracted by the NLTK-based method; 0 otherwise
• F5a: $\mathrm{abs}(l,t) = \dfrac{\mathrm{nerd}(l,t) + \mathrm{textrazor}(l,t) + \mathrm{tagme}(l,t) + \mathrm{nltk}(l,t)}{|EM|}$
Absolute score for l with type t over the set EM of all Extraction Methods (four in this setting)
• F5b: $\mathrm{rel}(l,t) = \dfrac{\mathrm{abs}(l,t)}{\mathrm{abs}(l)}$
Relative score for label l with type t over the total number of extraction methods recognizing l
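The typed scores F5a and F5b can be checked against the "east africa" rows (Thing: 0.25 and 0.33; Country: 0.5 and 0.67). The sketch below assumes a hypothetical input shape, a dict from extractor name to the set of types that extractor assigned to the label; all function names are illustrative.

```python
EXTRACTORS = ("nerd", "textrazor", "tagme", "nltk")


def abs_label(types_by_extractor):
    """abs(l): share of extraction methods that recognized the label at all."""
    found = sum(1 for e in EXTRACTORS if types_by_extractor.get(e))
    return found / len(EXTRACTORS)


def abs_typed(types_by_extractor, t):
    """F5a: share of extraction methods that recognized the label with type t."""
    found = sum(1 for e in EXTRACTORS if t in types_by_extractor.get(e, ()))
    return found / len(EXTRACTORS)


def rel_typed(types_by_extractor, t):
    """F5b: abs(l,t) / abs(l); assumes at least one method recognized the label."""
    return abs_typed(types_by_extractor, t) / abs_label(types_by_extractor)


# "east africa": TextRazor says Thing, TAGME and the NLTK-based method say Country.
east_africa = {"textrazor": {"Thing"}, "tagme": {"Country"}, "nltk": {"Country"}}

for t in ("Thing", "Country"):
    print(t, abs_typed(east_africa, t), round(rel_typed(east_africa, t), 2))
```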
• F6: $\mathrm{lss}(l,t) = \dfrac{\mathrm{lss}(l)}{n\_cat(l)}$, where n_cat is the number of different types associated with l.
Expresses the span overlap between l and the longest span containing l, weighted by the number of different types associated with the same label l.
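Label#offset | type | Alternative label | F1 | F2 | F3 | F4 | F5a | F5b | F6 | …
east africa#361#372 | Thing | africa#366#372 [Place,Continent] | 0 | 1 | 0 | 0 | 0.25 | 0.33 | 0.5 | …
east africa#361#372 | Country | africa#366#372 [Place,Continent] | 0 | 0 | 1 | 1 | 0.5 | 0.67 | 0.5 | …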
• F7a: $\mathrm{wlss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{lss}(l,t)$
Longest-span score for l with type t, weighted by the absolute score for label l and type t
• F7b: $\mathrm{wrlss}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{lss}(l,t)$
Longest-span score for l with type t, weighted by the relative score for label l and type t
• F8: $\mathrm{sss}(l,t) = \dfrac{\mathrm{sss}(l)}{n\_cat(l)}$
• F9a: $\mathrm{wsss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{sss}(l,t)$
• F9b: $\mathrm{wrsss}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{sss}(l,t)$
• F10: $\mathrm{oss}(l,t) = \dfrac{\mathrm{oss}(l)}{n\_cat(l)}$
• F11a: $\mathrm{woss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{oss}(l,t)$
• F11b: $\mathrm{wross}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{oss}(l,t)$
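Label#offset | type | Alternative label | F1 | F2 | F3 | F4 | … | F12 | F13 | …
east africa#361#372 | Thing | africa#366#372 [Place,Continent] | 0 | 1 | 0 | 0 | … | 1 | 0.375 | …
east africa#361#372 | Country | africa#366#372 [Place,Continent] | 0 | 0 | 1 | 1 | … | 0.5 | 0.17 | …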
• F12: $\mathrm{hss}(l,t) = \dfrac{|inTreeAL(l,t)|}{|AL|}$, where |inTreeAL(l,t)| is the number of Alternative Labels in the set AL with type in a sub(super)-sumption relation with t
• F13: $\mathrm{whss}(l,t) = \dfrac{1}{|AL|} \sum_{j \in inTreeAL(l,t)} \dfrac{1}{d(t_l, t_j) + 1}$, where $d(t_l, t_j)$ is the distance between classes $t_l$ and $t_j$ in the ontology
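F12 and F13 can be checked on the "east africa" rows with a toy class hierarchy. The sketch below assumes that each (alternative label, type) pair counts as one element of AL and that the distance is the number of subclass edges between two classes on the same branch; the `PARENT` fragment and all function names are assumptions made for illustration.

```python
# Toy child -> parent fragment of the type hierarchy (assumption for this sketch).
PARENT = {"Place": "Thing", "PopulatedPlace": "Place",
          "Continent": "PopulatedPlace", "Country": "PopulatedPlace"}


def path_to_root(t):
    path = [t]
    while t in PARENT:
        t = PARENT[t]
        path.append(t)
    return path


def tree_distance(a, b):
    """Number of edges between a and b if one subsumes the other, else None."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    if b in path_a:
        return path_a.index(b)
    if a in path_b:
        return path_b.index(a)
    return None


def hss(t, alt_types):
    """F12: share of alternative-label types on the same branch as t."""
    in_tree = [u for u in alt_types if tree_distance(t, u) is not None]
    return len(in_tree) / len(alt_types)


def whss(t, alt_types):
    """F13: like F12, but each match is weighted by 1 / (distance + 1)."""
    distances = [tree_distance(t, u) for u in alt_types]
    return sum(1 / (d + 1) for d in distances if d is not None) / len(alt_types)


alt_types = ["Place", "Continent"]  # types of the alternative label "africa"
for t in ("Thing", "Country"):
    print(t, hss(t, alt_types), round(whss(t, alt_types), 3))
```

Under these assumptions the sketch gives 1 and 0.375 for Thing, and 0.5 and about 0.17 for Country, matching the values in the table.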
Disagreement for relevance: Humans VS Machine Annotation
Disagreement metrics evaluation on the extractors corner [2]
• Disagreement on the extractors corner (i.e., tools that more systematically disagree with every other tool) could reveal:
- bad quality tools (in recognizing a specific set of labels/types)
- specialized tools able to recognize particular entities better than all the other tools
[2] G. Soberon, L. Aroyo, C. Welty, O. Inel, H. Lin, M. Overmeen, Measuring Crowd Truth: Disagreement Metrics Combined with Worker Behavior Filters, Proc. of CrowdSem2013 Workshop, ISWC2013.
• DISTRIBUTED AGREEMENT
• UNIQUE INFORMATION
• $\mathrm{ela}(e_i, e_j, l) = \dfrac{e_i(l) \cdot e_j(l)}{|L(e_i, p)|}$, where $i \neq j$, $e_i(l)$ is the corresponding extractor score (F1-4) and $|L(e_i, p)|$ is the number of labels recognized by extractor i in program p (the extractor-label agreement operator is not commutative)
• $\mathrm{avg\_ela}(e_i, l) = \dfrac{\sum_{j \neq i} \mathrm{ela}(e_i, e_j, l)}{|EM|}$
Average extractor-label agreement over the set of extraction methods
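The two agreement formulas can be sketched as follows. The extractor scores for "wildlife" are the F1-F4 values from the features table; the per-extractor label counts `n_labels` are made-up numbers used only to show that the operator is not commutative, and all names are hypothetical.

```python
def ela(scores, n_labels, i, j, label):
    """Extractor-label agreement of extractor i with extractor j on a label."""
    return scores[i][label] * scores[j][label] / n_labels[i]


def avg_ela(scores, n_labels, i, label):
    """Average agreement of extractor i with all other extractors, over |EM|."""
    total = sum(ela(scores, n_labels, i, j, label) for j in scores if j != i)
    return total / len(scores)


# F1-F4 for "wildlife": found by NERD, TextRazor and the NLTK-based method.
scores = {"nerd": {"wildlife": 1}, "textrazor": {"wildlife": 1},
          "tagme": {"wildlife": 0}, "nltk": {"wildlife": 1}}
# Hypothetical number of labels each extractor recognized in the program.
n_labels = {"nerd": 10, "textrazor": 20, "tagme": 8, "nltk": 16}

print(ela(scores, n_labels, "nerd", "textrazor", "wildlife"))
print(ela(scores, n_labels, "textrazor", "nerd", "wildlife"))
print(avg_ela(scores, n_labels, "nerd", "wildlife"))
```

Because the denominator is the label count of the first extractor, swapping the two extractors changes the result whenever they recognized a different number of labels.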
• Both extractor-label agreement and the consequent average are evaluated also with reference to the pairs (label,type)
Other possible relevance features
• TF-IDF (with type)
Should the corpus for IDF contain several episodes of the same TV series? Labels referring to characters mentioned in many episodes of the same TV series will gain a higher TF but a lower IDF score -> consider metadata
Animated adventures of Pingu, the clumsy young penguin. Pingu helps his neighbour and is rewarded. Pingu's friend tries to get a reward too, but the neighbour refuses. They decide to play a trick on the neighbour, but it all ends with an innocent passer-by becoming the victim of their prank.
BBC Program: Pingu's Trick (Pid: b0077x84)
• Enhance metadata (words in title and subject)
Labels lemmatization (WordNetLemmatizer)
Dani is understudying the part of a witch in Macbeth: The Musical, which means Jack and Sam get the job of ensuring little brother Max does not cause chaos. Dani's most loyal viewers, the aliens, have got bored of never getting to meet their heroine and her pals, and have decided to teleport down to Earth, where they soon find themselves embroiled in Max's scheme to win the 10,000 pound reward from the UFO Society.
BBC Program: Alien Invasion (Pid: b00ph91v)
State of work
• Dataset: 52 BBC programs
• Realized:
- Span and Type Alignment
- Relevance scores for labels
• To do:
– Computation of relevance score for pairs (label,type)
– Crowdsourcing tasks
– Connecting relevance/relevance-with-type outputs
– Evaluation of results (precision, recall, complementarity, …)
Does the method deal with complementarity?
PID: b0074t2b | Title: Great plains
Synopsis: “The great plains are the vast open spaces of our planet. These immense wilderness areas are seemingly empty. But any feeling of emptiness is an illusion - the plains of our planet support the greatest gatherings of wildlife on earth: two million gazelles on the Mongolian steppes, three million caribou in North America and one and a half million wildebeest in East Africa. […]”
Label: two million gazelles#234#254
Types: Amount; Mammal; Single
Extractors: wikimeta(nerd); textrazor; tagme
URIs: http://dbpedia.org/resource/Gazelle; http://dbpedia.org/resource/Two_in_a_Million/You're_My_Number_One
COMPLEMENTARITY!! (Amount of Mammal)