What is the relevant information in a text?

Silvia Giannini

Visiting PhD student

Politecnico di Bari

Web & Media Group meeting | 27.10.2014

The scenario

• Entertainment domain: BBC TV-programs (TV-series, movies, documentaries, …)

• Aim: Enrich the content description with links to the Web of Data

• Applications: Linked Data patterns for recommendations; multi-domain datasets creation, …

Following the grandeur of Baroque, Rococo art is often dismissed as frivolous and unserious, but Waldemar Januszczak disagrees. The first episode is about travel in the 18th century and how it impacted greatly on some of the finest art ever made. The world was getting smaller and took on new influences shown in the glorious Bavarian pilgrimage architecture, Canaletto's romantic Venice and the blossoming of exotic designs and tastes all over Europe. The Rococo was art expressing itself in new, exciting ways.

How?

• Yet another semantic annotation tool?

• Peculiarities:

- Different formats

- Broad coverage of topics


Multiple annotators integration


text enrichment

• Type mis-classification, e.g. “Canaletto” typed as ontology:Location

• URI mis-annotation, e.g. “Rococo” linked to dbpedia:Rococo_(band)

• Not relevant labels

The NERD framework


• Feature-based solution for entity relevance definition and entity classification

• Majority vote and disagreement metrics

• Extractors can disagree on:

– The existence of a label, e.g. some can identify a label and others cannot

– The span of the label, e.g. ‘Myra’ VS ‘Myra Gail’

– The type of the label, e.g. ‘Building’ VS ‘Organization’

– The URI of the label
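The four kinds of disagreement listed above can be checked pairwise. The following is a minimal Python sketch, not the NERD implementation: the `Annotation` record and the `disagreements` helper are hypothetical names introduced only for illustration. The sample values ('Myra' typed Settlement at 453-457, 'Myra Gail' typed Person at 453-462) are taken from the Kings of Rock and Roll example in this deck.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class Annotation:
    """One extractor's annotation of a text span (hypothetical record)."""
    label: str
    start: int
    end: int
    type: str
    uri: Optional[str] = None


def disagreements(a: Optional[Annotation], b: Optional[Annotation]) -> Set[str]:
    """Return the kinds of disagreement between two extractors' outputs.

    None means that the extractor produced no annotation for the region.
    """
    if a is None and b is None:
        return set()
    if a is None or b is None:
        return {"existence"}
    kinds = set()
    if (a.start, a.end) != (b.start, b.end):
        kinds.add("span")
    if a.type != b.type:
        kinds.add("type")
    if a.uri != b.uri:
        kinds.add("uri")
    return kinds


myra = Annotation("Myra", 453, 457, "Settlement")
myra_gail = Annotation("Myra Gail", 453, 462, "Person")
print(sorted(disagreements(myra, myra_gail)))  # ['span', 'type']
print(sorted(disagreements(myra, None)))       # ['existence']
```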

Proposal


DEFINE THE RELEVANCE OF A LABEL







Proposal


INFLUENCE THE CONFIDENCE OF AN ANNOTATION







Proposal

Workflow for relevance assessment

[Figure: workflow for relevance assessment. Elements shown: the programme synopsis (Rococo example), NLP tools, crowd, metrics, features matrix extraction, classifier, relevant labels.]

• Disagreement between TextRazor annotations in NERD and standalone TextRazor in terms of missing labels, missing types, granularity of types.

PID: b0074t2b    Title: Great plains

Synopsis: “The great plains are the vast open spaces of our planet. […] Close on their heels come an array of plains predators including eagles, wolves and lions. […]”

Label: eagles#439#445

Extractor         Types              URI
textrazor(nerd)   nerd:Thing         http://en.wikipedia.org/wiki/Eagle
textrazor         dbpedia-owl:Bird   http://en.wikipedia.org/wiki/Eagle

Pre-processing

• Alignment of extractors’ results:

- Label: each label has a list of alternative labels contained in or overlapping with the given one

- Type: same vocabulary for all extraction methods (529 classes of the DBpedia ontology, extended with owl:Thing and an Amount type)

- URI: DBpedia resources

• Label • NERD ontology class • sameAs link

• Label • DBpedia ontology class • Wikipedia page

• Label • DBpedia category • Wikipedia page

• Label • DBpedia ontology class • DBpedia URI

Majority-vote for relevance: longest-span strategy*

extractor   label        startOffset   endOffset   Aligned label
-           Rococo       35            41          Rococo art#35#45
-           Rococo art   35            45          Rococo art#35#45
-           Rococo       35            41          Rococo art#35#45
-           Art          42            45          Rococo art#35#45
-           Rococo       35            41          Rococo art#35#45
-           Art          42            45          Rococo art#35#45





Label & span alignment

The LONGEST-SPAN strategy

*Analogously, the shortest-span strategy can be applied
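The alignment in the Rococo example can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the code behind the slides: `align_longest_span` is a hypothetical helper, mentions are plain `(label, startOffset, endOffset)` tuples, and only containment is handled (partial overlaps are left untouched).

```python
from typing import List, Tuple

Mention = Tuple[str, int, int]  # (label, startOffset, endOffset)


def align_longest_span(mentions: List[Mention]) -> List[str]:
    """Map every mention to the longest mention that contains it.

    The result uses the label#start#end key format of the slides.
    """
    aligned = []
    for label, start, end in mentions:
        best = (label, start, end)
        for o_label, o_start, o_end in mentions:
            contains = o_start <= start and end <= o_end
            # Prefer the containing mention with the widest character span.
            if contains and (o_end - o_start) > (best[2] - best[1]):
                best = (o_label, o_start, o_end)
        aligned.append(f"{best[0]}#{best[1]}#{best[2]}")
    return aligned


mentions = [("Rococo", 35, 41), ("Rococo art", 35, 45), ("Art", 42, 45)]
print(align_longest_span(mentions))
# ['Rococo art#35#45', 'Rococo art#35#45', 'Rococo art#35#45']
```

Replacing the `>` comparison with `<` and the containment test with its converse would give the shortest-span variant mentioned in the footnote.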

Issues

• In the previous example, Rococo and Art are related to the same category (Arts). Thus, the longest-span strategy for labels alignment will lead to a consistent conceptual category for the new label (Rococo Art).

• Consider this program description: A journey back to the 1950s for a look at the wildest pop music of all time in a film that tells the stories of Bill Haley, Elvis Presley, Little Richard, Chuck Berry, Jerry Lee Lewis and Buddy Holly, giants from an era when pop music really was mad, bad and dangerous to know. The programme features the artists themselves, alongside people like Bill Haley's original Comets, the Crickets, Buddy Holly's widow Maria Elena, Jerry Lee Lewis's former wife Myra Gail and his sister, Chuck Berry's son and many more, including June Juanico, Elvis' first serious girlfriend. Other contributors include Tom Jones, Jamie Callum, Paul McCartney, Cliff Richard, Joe Brown, Marty Wilde, Green Day, Minnie Driver, Jack White, the Mavericks, Jools Holland, Hank Marvin, Fontella Bass, John Waters and more. Elvis's pelvis was just the start. Who had to change the lyrics to their biggest hit because the originals were too obscene? Who married their 13-year-old cousin? Who used lard to get their hair just right? And what happened on the day the music died?

BBC Program: Kings of Rock and Roll (Pid: b007c95q)

extractor   label       startOffset   endOffset   Type          Aligned label
-           Myra Gail   453           462         Person        Myra Gail#453#462
-           Myra        453           457         Settlement    Myra Gail#453#462
-           Myra Gail   453           462         Person        Myra Gail#453#462
-           Myra        453           457         Band,Artist   Myra Gail#453#462
-           Gail        458           462         Person        Myra Gail#453#462
-           Myra Gail   453           462         Thing         Myra Gail#453#462

The HYBRID-SPAN strategy¹

Given two labels l1 and l2 and an upper ontology O, l1 and l2 belong to the same annotation span if:

1. l1 is contained in l2 or l2 is contained in l1, and type(l1) and type(l2) are in a superclass/subclass relationship (e.g. Royal Academy[Organization] in Royal Academy of Music[University])

OR

2. l1 and l2 overlap, but neither is contained in the other (e.g., Royal Academy[Organization] and Academy of Music[Building])

OR

3. l1 coincides with l2 (e.g., Royal Academy[Organization] and Royal Academy[Museum])
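The three rules can be read as a single predicate over two typed spans. The sketch below is an assumption-laden illustration, not the algorithm of Chen et al.: `PARENT` is a hypothetical toy hierarchy standing in for the upper ontology O, the character offsets for the Royal Academy examples are computed on the string "Royal Academy of Music", identical types are treated as related, and Thing is deliberately left out of the hierarchy because the slide leaves its treatment open.

```python
from typing import Set, Tuple

TypedSpan = Tuple[int, int, str]  # (startOffset, endOffset, type)

# Hypothetical toy hierarchy (child -> parent); a stand-in for the ontology O.
PARENT = {"University": "Organization"}


def ancestors(t: str) -> Set[str]:
    """All superclasses of t in the toy hierarchy."""
    found = set()
    while t in PARENT:
        t = PARENT[t]
        found.add(t)
    return found


def related(t1: str, t2: str) -> bool:
    """True if the types are equal or in a superclass/subclass relationship."""
    return t1 == t2 or t1 in ancestors(t2) or t2 in ancestors(t1)


def same_span(a: TypedSpan, b: TypedSpan) -> bool:
    """Hybrid-span test: do a and b belong to the same annotation span?"""
    (s1, e1, t1), (s2, e2, t2) = a, b
    if (s1, e1) == (s2, e2):          # rule 3: the spans coincide
        return True
    contained = (s2 <= s1 and e1 <= e2) or (s1 <= s2 and e2 <= e1)
    if contained:                      # rule 1: containment plus related types
        return related(t1, t2)
    return s1 < e2 and s2 < e1         # rule 2: partial overlap


# "Royal Academy" = 0-13, "Academy of Music" = 6-22, full name = 0-22
print(same_span((0, 13, "Organization"), (0, 22, "University")))   # True
print(same_span((0, 13, "Organization"), (6, 22, "Building")))     # True
print(same_span((453, 457, "Settlement"), (453, 462, "Person")))   # False
```

The last call mirrors the Myra / Myra Gail case: containment without a type relationship keeps 'Myra' in a span of its own.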

What about Thing type?

¹ Chen, L., Ortona, S., Orsi, G., & Benedikt, M. (2013). Aggregating Semantic Annotators. Proceedings of the VLDB Endowment, Vol. 6, No. 13, pp. 1486-1497. Riva del Garda, Trento, Italy.





Label & span alignment

The HYBRID-SPAN strategy*

extractor   label       startOffset   endOffset   Type          Aligned label
-           Myra Gail   453           462         Person        Myra Gail#453#462
-           Myra        453           457         Settlement    Myra#453#457
-           Myra Gail   453           462         Person        Myra Gail#453#462
-           Myra        453           457         Band,Artist   Myra#453#457
-           Gail        458           462         Person        Myra Gail#453#462
-           Myra Gail   453           462         Thing         Myra Gail#453#462

*The vocabulary alignment is required as previous step

Majority-vote for relevance: hybrid-span strategy

Features for Relevance

• F1: nerd(l) -> 1 if label l is extracted by NERD; 0 otherwise

label#offset          F1  F2  F3  F4  F5    F6   F7    F8   F9     F10   F11   Alternative labels
wildlife#912#920      1   1   0   1   0.75  1    0.75  1    0.75   0     0     -
east africa#361#372   0   1   1   1   0.75  1    0.75  0.5  0.375  0     0     africa#366#372
africa#366#372        1   1   0   0   0.5   0.5  0.25  1    0.5    0     0     east africa#361#372
earth: two#227#237    0   0   1   0   0.25  1    0.25  0.5  0.125  0.29  0.07  two million#234#245; earth#227#232; two million gazelles#234#254
…


• F2: textrazor(l) -> 1 if label l is extracted by TextRazor; 0 otherwise




• F3: tagme(l) -> 1 if label l is extracted by TAGME; 0 otherwise




• F4: nltk(l) -> 1 if label l is extracted by the NLTK-based method; 0 otherwise




• F5: $\mathrm{abs}(l) = \frac{\mathrm{nerd}(l) + \mathrm{textrazor}(l) + \mathrm{tagme}(l) + \mathrm{nltk}(l)}{|EM|}$

Absolute score for l over the set EM of all Extraction Methods (four in this setting)
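As a small illustration of F1-F5, the indicator features and the absolute score can be computed from the set of extractors that found a label. This is a sketch only: `indicator_features` and `abs_score` are hypothetical helper names, and the input for 'wildlife' (found by NERD, TextRazor and the NLTK-based method) is read off the feature table.

```python
from typing import List, Set

# The four extraction methods of this setting, in F1..F4 order.
EXTRACTORS = ("nerd", "textrazor", "tagme", "nltk")


def indicator_features(extracted_by: Set[str]) -> List[int]:
    """F1..F4: 1 if the extractor found the label, 0 otherwise."""
    return [1 if name in extracted_by else 0 for name in EXTRACTORS]


def abs_score(extracted_by: Set[str]) -> float:
    """F5: share of extraction methods that found the label."""
    return sum(indicator_features(extracted_by)) / len(EXTRACTORS)


wildlife = {"nerd", "textrazor", "nltk"}
print(indicator_features(wildlife))  # [1, 1, 0, 1]
print(abs_score(wildlife))           # 0.75
```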




• F6: $\mathrm{lss}(l) = \frac{wc(l)}{wc(l_{LS})}$, where wc is the word count function and $l_{LS}$ is the longest span containing l in the union set of all labels recognized by the extraction methods

Expresses the span overlap between l and the longest span containing l, i.e. the portion of l contained in the longest span $l_{LS}$


• F7: $\mathrm{wlss}(l) = \mathrm{abs}(l) \cdot \mathrm{lss}(l)$

Longest-span score for l, weighted by the absolute score for l



• F8: $\mathrm{sss}(l) = \frac{wc(l_{SS})}{wc(l)}$, where $l_{SS}$ is the shortest span contained in l in the union set of all labels recognized by the extraction methods

Expresses the span overlap between l and the shortest span contained in l, i.e. the portion of l containing the shortest span $l_{SS}$



• F9: $\mathrm{wsss}(l) = \mathrm{abs}(l) \cdot \mathrm{sss}(l)$

Shortest-span score for l, weighted by the absolute score for l
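The span scores F6-F9 can be checked against the feature table with a short sketch. The helpers `wc`, `contains`, `lss` and `sss` are hypothetical names, labels are `(text, start, end)` tuples, and "longest" and "shortest" are measured in words, which is an assumption that happens to reproduce the table values for the Great plains labels.

```python
from typing import List, Tuple

Label = Tuple[str, int, int]  # (text, startOffset, endOffset)


def wc(text: str) -> int:
    """Word count on whitespace."""
    return len(text.split())


def contains(outer: Label, inner: Label) -> bool:
    return outer[1] <= inner[1] and inner[2] <= outer[2]


def lss(l: Label, labels: List[Label]) -> float:
    """F6: wc(l) over the word count of the longest span containing l."""
    longest = max(wc(o[0]) for o in labels + [l] if contains(o, l))
    return wc(l[0]) / longest


def sss(l: Label, labels: List[Label]) -> float:
    """F8: word count of the shortest span contained in l, over wc(l)."""
    shortest = min(wc(o[0]) for o in labels + [l] if contains(l, o))
    return shortest / wc(l[0])


# Union of the labels found by all extractors (subset shown on the slides).
LABELS = [
    ("east africa", 361, 372), ("africa", 366, 372),
    ("earth: two", 227, 237), ("earth", 227, 232),
    ("two million", 234, 245), ("two million gazelles", 234, 254),
]

africa = ("africa", 366, 372)
abs_africa = 0.5  # F5 for 'africa', taken from the table
print(lss(africa, LABELS), abs_africa * lss(africa, LABELS))  # 0.5 0.25
print(sss(africa, LABELS), abs_africa * sss(africa, LABELS))  # 1.0 0.5
```

The weighted variants F7 and F9 are simply the products with the absolute score, as in the two `print` calls.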



• F10: $\mathrm{oss}(l) = \frac{1}{|OL|} \sum_{j \in OL} \frac{|j \cap l|}{|j \cup l|}$, where |OL| is the number of overlapping labels among the alternative ones.



• F11: $\mathrm{woss}(l) = \mathrm{oss}(l) \cdot \mathrm{abs}(l)$
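A worked check of F10 and F11 for 'earth: two' is sketched below. Two points are assumptions rather than statements from the slides: the intersection and union are taken over word sets (which reproduces the 0.29 in the table), and OL keeps only alternatives that partially overlap l, excluding contained or containing ones such as 'earth'. The helper names are hypothetical.

```python
import re
from typing import List, Set, Tuple

Label = Tuple[str, int, int]  # (text, startOffset, endOffset)


def words(text: str) -> Set[str]:
    """Lower-cased word set, punctuation dropped."""
    return set(re.findall(r"\w+", text.lower()))


def partially_overlaps(a: Label, b: Label) -> bool:
    """Spans share characters, but neither contains the other."""
    overlap = a[1] < b[2] and b[1] < a[2]
    a_in_b = b[1] <= a[1] and a[2] <= b[2]
    b_in_a = a[1] <= b[1] and b[2] <= a[2]
    return overlap and not a_in_b and not b_in_a


def oss(l: Label, alternatives: List[Label]) -> float:
    """F10: mean word-level Jaccard between l and its overlapping labels."""
    ol = [j for j in alternatives if partially_overlaps(j, l)]
    if not ol:
        return 0.0
    jaccard = [len(words(j[0]) & words(l[0])) / len(words(j[0]) | words(l[0]))
               for j in ol]
    return sum(jaccard) / len(ol)


earth_two = ("earth: two", 227, 237)
alternatives = [("two million", 234, 245), ("earth", 227, 232),
                ("two million gazelles", 234, 254)]
abs_earth_two = 0.25  # F5 for 'earth: two', taken from the table

print(round(oss(earth_two, alternatives), 2))                  # 0.29
print(round(oss(earth_two, alternatives) * abs_earth_two, 2))  # 0.07
```

Here the two overlapping alternatives contribute 1/3 and 1/4, whose mean is about 0.29.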


Features for Relevance with type

Label#offset          type        Alternative label                     F1  …
wildlife#912#920      Thing       -                                     1
east africa#361#372   Thing       africa#366#372 [Place,Continent]      0
                      Country                                           0
africa#366#372        Place       east africa#361#372 [Thing,Country]   1
                      Continent                                         1
…


• F1: nerd(l,t) -> 1 if label l with type t is extracted by NERD; 0 otherwise

• F2: textrazor(l,t) -> 1 if label l with type t is extracted by TextRazor; 0 otherwise

• F3: tagme(l,t) -> 1 if label l with type t is extracted by TAGME; 0 otherwise

• F4: nltk(l,t) -> 1 if label l with type t is extracted by the NLTK-based method; 0 otherwise


Label#offset          type      F1  F2  F3  F4  F5a   F5b   F6   …
east africa#361#372   Thing     0   1   0   0   0.25  0.33  0.5
                      Country   0   0   1   1   0.5   0.67  0.5

• F5a: $\mathrm{abs}(l,t) = \frac{\mathrm{nerd}(l,t) + \mathrm{textrazor}(l,t) + \mathrm{tagme}(l,t) + \mathrm{nltk}(l,t)}{|EM|}$

Absolute score for l with type t over the set EM of all Extraction Methods (four in this setting)

• F5b: $\mathrm{rel}(l,t) = \frac{\mathrm{abs}(l,t)}{\mathrm{abs}(l)}$

Relative score for label l with type t over the total number of extraction methods recognizing l


• F6: $\mathrm{lss}(l,t) = \frac{\mathrm{lss}(l)}{\mathrm{n\_cat}(l)}$, where n_cat is the number of different types associated with l.

Expresses the span overlap between l and the longest span containing l, weighted by the number of different types associated with the same label l.
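The typed scores F5a, F5b and F6 can be illustrated on the 'east africa' rows (Thing found by TextRazor; Country found by TAGME and the NLTK-based method). This is a sketch with hypothetical helper names; the mapping from extractor to types is read off the F1-F4 columns of the typed table, and lss(l) = 1 comes from the untyped table.

```python
from typing import Dict, Set

EXTRACTORS = ("nerd", "textrazor", "tagme", "nltk")
TypesByExtractor = Dict[str, Set[str]]


def abs_typed(found: TypesByExtractor, t: str) -> float:
    """F5a: share of extractors that assigned type t to the label."""
    hits = sum(1 for e in EXTRACTORS if t in found.get(e, set()))
    return hits / len(EXTRACTORS)


def abs_label(found: TypesByExtractor) -> float:
    """F5: share of extractors that found the label with any type."""
    return sum(1 for e in EXTRACTORS if found.get(e)) / len(EXTRACTORS)


def rel_typed(found: TypesByExtractor, t: str) -> float:
    """F5b: abs(l,t) relative to abs(l)."""
    return abs_typed(found, t) / abs_label(found)


def lss_typed(lss_value: float, found: TypesByExtractor) -> float:
    """F6 with type: lss(l) divided by the number of distinct types of l."""
    n_cat = len(set().union(*found.values()))
    return lss_value / n_cat


east_africa = {"textrazor": {"Thing"}, "tagme": {"Country"}, "nltk": {"Country"}}
for t in ("Thing", "Country"):
    print(t, abs_typed(east_africa, t),
          round(rel_typed(east_africa, t), 2),
          lss_typed(1.0, east_africa))
# Thing 0.25 0.33 0.5
# Country 0.5 0.67 0.5
```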






• F7a: $\mathrm{wlss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{lss}(l,t)$

Longest-span score for l with type t, weighted by the absolute score for label l and type t

• F7b: $\mathrm{wrlss}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{lss}(l,t)$

Longest-span score for l with type t, weighted by the relative score for label l and type t






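• F8: $\mathrm{sss}(l,t) = \frac{\mathrm{sss}(l)}{\mathrm{n\_cat}(l)}$

• F9a: $\mathrm{wsss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{sss}(l,t)$

• F9b: $\mathrm{wrsss}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{sss}(l,t)$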






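• F10: $\mathrm{oss}(l,t) = \frac{\mathrm{oss}(l)}{\mathrm{n\_cat}(l)}$

• F11a: $\mathrm{woss}(l,t) = \mathrm{abs}(l,t) \cdot \mathrm{oss}(l,t)$

• F11b: $\mathrm{wross}(l,t) = \mathrm{rel}(l,t) \cdot \mathrm{oss}(l,t)$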





• Disagreement on the extractors corner (i.e., tools that more systematically disagree with every other tool) could reveal:

- bad-quality tools (in recognizing a specific set of labels/types)

- specialized tools able to recognize particular entities better than all the other tools

Disagreement metrics evaluation on the extractors corner²

Disagreement for relevance: Humans VS Machine Annotation

² G. Soberon, L. Aroyo, C. Welty, O. Inel, H. Lin, M. Overmeen, Measuring Crowd Truth: Disagreement Metrics Combined with Worker Behavior Filters, Proc. of CrowdSem2013 Workshop, ISWC2013.


• DISTRIBUTED AGREEMENT

• UNIQUE INFORMATION


• $\mathrm{ela}(e_i, e_j, l) = \frac{e_i(l) \cdot e_j(l)}{|L(e_i, p)|}$, where $i \neq j$, $e_i(l)$ is the corresponding extractor score (F1-F4) and $|L(e_i, p)|$ is the number of labels recognized by extractor i in program p (the extractor-label agreement operator is not commutative)


• $\mathrm{avg\_ela}(e_i, l) = \frac{\sum_{j \neq i} \mathrm{ela}(e_i, e_j, l)}{|EM|}$

Average extractor-label agreement over the set of extraction methods
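The extractor-label agreement and its average can be sketched as follows. The F1-F4 scores for 'wildlife' come from the feature table; the per-extractor label counts |L(e_i, p)| are not given on the slides, so the numbers in `label_counts` are purely hypothetical, as are the helper names.

```python
import math
from typing import Dict


def ela(scores: Dict[str, int], label_counts: Dict[str, int],
        ei: str, ej: str) -> float:
    """Extractor-label agreement of ei with ej on one label (i != j)."""
    if ei == ej:
        raise ValueError("ela is defined only for two different extractors")
    return scores[ei] * scores[ej] / label_counts[ei]


def avg_ela(scores: Dict[str, int], label_counts: Dict[str, int],
            ei: str) -> float:
    """Sum of ela(ei, ej, l) over j != i, divided by |EM|."""
    total = sum(ela(scores, label_counts, ei, ej) for ej in scores if ej != ei)
    return total / len(scores)


# F1..F4 for 'wildlife' (from the feature table)
wildlife_scores = {"nerd": 1, "textrazor": 1, "tagme": 0, "nltk": 1}
# Hypothetical number of labels each extractor found in the program
label_counts = {"nerd": 10, "textrazor": 20, "tagme": 25, "nltk": 40}

print(ela(wildlife_scores, label_counts, "nerd", "textrazor"))  # 0.1
print(ela(wildlife_scores, label_counts, "textrazor", "nerd"))  # 0.05
print(math.isclose(avg_ela(wildlife_scores, label_counts, "nerd"), 0.05))  # True
```

The first two calls show why the operator is not commutative: the normalisation uses the label count of the first extractor only.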


• Both extractor-label agreement and the consequent average are evaluated also with reference to the pairs (label,type)

Other possible relevance features

• TF-IDF (with type)

Should the corpus for idf contain more episodes of the same TV series? Labels referring to characters mentioned in many episodes of the same TV series will gain a higher tf but a lower idf score -> consider metadata

Animated adventures of Pingu, the clumsy young penguin. Pingu helps his neighbour and is rewarded. Pingu's friend tries to get a reward too, but the neighbour refuses. They decide to play a trick on the neighbour, but it all ends with an innocent passer-by becoming the victim of their prank.

BBC Program: Pingu's Trick (Pid: b0077x84)

• Enhance metadata (words in title and subject)

Labels lemmatization (WordNetLemmatizer)

Dani is understudying the part of a witch in Macbeth: The Musical, which means Jack and Sam get the job of ensuring little brother Max does not cause chaos. Dani's most loyal viewers, the aliens, have got bored of never getting to meet their heroine and her pals, and have decided to teleport down to Earth, where they soon find themselves embroiled in Max's scheme to win the 10,000 pound reward from the UFO Society.

BBC Program: Alien Invasion (Pid: b00ph91v)


State of work

• Dataset: 52 BBC programs

• Realized:

- Span and Type Alignment

- Relevance scores for labels

• To do:

– Computation of relevance score for pairs (label,type)

– Crowdsourcing tasks

– Connecting relevance/relevance-with-type outputs

– Evaluation of results (precision, recall, complementarity, …)

Does the method deal with complementarity?

http://dbpedia.org/resource/Gazelle

PID: b0074t2b Title: Great plains

Synopsis ‘’The great plains are the vast open spaces of our planet. These immense wilderness areas are seemingly empty. But any feeling of emptiness is an illusion - the plains of our planet support the greatest gatherings of wildlife on earth: two million gazelles on the Mongolian steppes, three million caribou in North America and one and a half million wildebeest in East Africa. […]‘’

Label two million gazelles#234#254

Types Amount;Mammal;Single

Extractors wikimeta(nerd);textrazor;tagme;

http://dbpedia.org/resource/Two_in_a_Million/You're_My_Number_One

COMPLEMENTARITY!!

(Amount of Mammal)
