Natural Language Questions for the Web of Data

1 Mohamed Yahya, Klaus Berberich, Gerhard Weikum: Max Planck Institute for Informatics, Germany
2 Shady Elbassuoni: Qatar Computing Research Institute
3 Maya Ramanath: Dept. of CSE, IIT-Delhi, India
4 Volker Tresp: Siemens AG, Corporate Technology, Munich, Germany

EMNLP 2012

Translation: QNL to QFL

QNL: natural language question
“Which female actor played in Casablanca and is married to a writer who was born in Rome?”

QFL: SPARQL 1.0
?x hasGender female
?x isa actor
?x actedIn Casablanca_(film)
?x marriedTo ?w
?w isa writer
?w bornIn Rome

Characteristics of SPARQL:
• Complex query
• Good results
• Difficult for the user

Yago2

Yago2 is a large semantic knowledge base derived from Wikipedia, WordNet, and GeoNames.

Semantic items: relations, classes, entities

Architecture of DEANNA.

Phrase detection

A detected phrase p is a pair <Toks, l>:
– Toks: the phrase
– l: the label, l ∈ {concept, relation}

QNL → Phrase detection → Phrases

Pr: {<*, relation>}
Pc: {<*, concept>}
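The phrase definition above can be sketched as a small data structure. This is only an illustration: the class name `Phrase` and the field names are assumptions, and the example phrases are the ones listed later in the deck for the running question.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Phrase:
    """A detected phrase p = <Toks, l> (hypothetical representation)."""
    toks: Tuple[str, ...]  # Toks: the tokens that make up the phrase
    label: str             # l: either "concept" or "relation"

    def __post_init__(self):
        if self.label not in ("concept", "relation"):
            raise ValueError(f"unknown label: {self.label}")


detected = [
    Phrase(("born",), "relation"),
    Phrase(("was", "born"), "relation"),
    Phrase(("Rome",), "concept"),
    Phrase(("a", "writer"), "concept"),
]

# Pr and Pc are the relation phrases and the concept phrases, respectively.
P_r = {p for p in detected if p.label == "relation"}
P_c = {p for p in detected if p.label == "concept"}
```

Keeping overlapping phrases such as "born" and "was born" side by side is deliberate here, since the choice between them is left to the later disambiguation step.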

Phrase detection

Concept phrase detection: search for instances of the means relation in Yago2.

e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”
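A dictionary lookup of this kind might look roughly as follows. The `MEANS` table is a toy stand-in for Yago2's means relation, and its entries, the function name `detect_concept_phrases`, and the `max_len` window are all assumptions made for illustration.

```python
# Toy stand-in for the `means` relation: surface string -> semantic items.
# The entries are illustrative assumptions, not real Yago2 data.
MEANS = {
    "casablanca": {"Casablanca", "Casablanca_(film)"},
    "actor": {"actor"},
    "writer": {"writer"},
    "a writer": {"writer"},
    "rome": {"Rome"},
}


def detect_concept_phrases(question, means=MEANS, max_len=3):
    """Return every token n-gram (up to max_len tokens) found in `means`."""
    tokens = question.rstrip("?").split()
    found = []
    for start in range(len(tokens)):
        last = min(start + max_len, len(tokens))
        for end in range(start + 1, last + 1):
            phrase = " ".join(tokens[start:end])
            if phrase.lower() in means:
                found.append((phrase, "concept"))
    return found


q = ("Which female actor played in Casablanca and is married to "
     "a writer who was born in Rome?")
phrases = detect_concept_phrases(q)
```

Note that both "a writer" and "writer" are returned; overlapping candidates are kept at this stage.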

Phrase detection

Relation phrase detection: relies on a relation detector based on ReVerb (Fader et al., 2011), extended with additional POS tag patterns.

e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”

Phrase Mapping

To map concept phrases: again search for instances of the means relation in Yago2.

To map relation phrases: rely on a corpus of mappings from textual patterns to relations (textual pattern → relation).

e.g. “Which female actor played in Casablanca and is married to a writer who was born in Rome?”

Q-Unit Generation

Mapping → Q-Unit Generation → Candidate graph

Dependency parsing

A q-unit is a triple of sets of phrases.

Q-Unit Generation

Dependency parsing identifies triples of tokens <trel, targ1, targ2>, where trel, targ1, targ2 ∈ qNL.

e.g. “who was born in Rome?”

nsubjpass(born-3, who-1)
auxpass(born-3, was-2)
root(ROOT-0, born-3)
prep_in(born-3, Rome-5)

Dependency tree: born (root, trel) with children who (nsubjpass, targ1) and Rome (in, targ2)

Resulting triple: <born, who, Rome>
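One way to read a token triple off such a parse is sketched below. The pairing rule used here (a subject dependent plus a direct-object or `prep_*` dependent of the same head) is a simplifying assumption for illustration; the slides do not spell out the actual patterns, and the function names are hypothetical.

```python
import re

# Matches strings such as "nsubjpass(born-3, who-1)".
DEP = re.compile(r"(\w+)\((\S+)-(\d+), (\S+)-(\d+)\)")

SUBJECT_RELS = {"nsubj", "nsubjpass"}


def parse_deps(lines):
    """Parse 'rel(head-i, dep-j)' strings into (rel, head, dep) tuples."""
    deps = []
    for line in lines:
        m = DEP.fullmatch(line.strip())
        if m:
            rel, head, _, dep, _ = m.groups()
            deps.append((rel, head, dep))
    return deps


def token_triples(deps):
    """Pair each subject of a head with each object-like dependent of it."""
    subjects, objects = {}, {}
    for rel, head, dep in deps:
        if rel in SUBJECT_RELS:
            subjects.setdefault(head, []).append(dep)
        elif rel == "dobj" or rel.startswith("prep_"):
            objects.setdefault(head, []).append(dep)
    return [(head, s, o)
            for head, subs in subjects.items()
            for s in subs
            for o in objects.get(head, [])]


lines = [
    "nsubjpass(born-3, who-1)",
    "auxpass(born-3, was-2)",
    "root(ROOT-0, born-3)",
    "prep_in(born-3, Rome-5)",
]
triples = token_triples(parse_deps(lines))
```

For the four dependencies on the slide this yields the single triple (born, who, Rome).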

Q-Unit Generation

A q-unit is a triple of sets of phrases <{prel ∈ Pr}, {parg1 ∈ Pc}, {parg2 ∈ Pc}>, where trel ∈ prel, targ1 ∈ parg1, and targ2 ∈ parg2.

Triple of tokens: <born, writer, Rome>

Phrases:
<born, relation>
<was born, relation>
<Rome, concept>
<a writer, concept>
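The step from a token triple to a q-unit can be sketched as a containment check: each slot collects the phrases of the right label that contain the slot's token. The helper names `covering` and `q_unit` are hypothetical, and splitting phrases on whitespace is a simplification.

```python
def covering(token, phrases, label):
    """Phrases with the given label whose tokens include `token`."""
    return {p for (p, l) in phrases if l == label and token in p.split()}


def q_unit(token_triple, phrases):
    """Build <{prel}, {parg1}, {parg2}> for a triple <trel, targ1, targ2>."""
    t_rel, t_arg1, t_arg2 = token_triple
    return (covering(t_rel, phrases, "relation"),
            covering(t_arg1, phrases, "concept"),
            covering(t_arg2, phrases, "concept"))


phrases = [("born", "relation"), ("was born", "relation"),
           ("Rome", "concept"), ("a writer", "concept")]
unit = q_unit(("born", "writer", "Rome"), phrases)
```

With the phrases from the slide, the relation slot holds both "born" and "was born", which is exactly the boundary ambiguity that joint disambiguation later resolves.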

Joint Disambiguation

Rules:
1. Each phrase is assigned to at most one semantic item.
2. Phrase boundary ambiguity is resolved (only nonoverlapping phrases are mapped).

Joint Disambiguation

Disambiguation Graph
• Joint disambiguation takes place over a disambiguation graph DG = (V, E), where
– V = Vs ∪ Vp ∪ Vq
– E = Esim ∪ Ecoh ∪ Eq

Disambiguation Graph

V = Vs ∪ Vp ∪ Vq

– Vs: the set of s-nodes (an s-node is a semantic item)
– Vp: the set of p-nodes (a p-node is a phrase)
  – Vrp: the set of relation phrases
  – Vrc: the set of concept phrases
– Vq: a set of placeholder nodes for q-units

Disambiguation Graph

E = Esim ∪ Ecoh ∪ Eq

– Esim ⊆ Vp × Vs: a set of weighted similarity edges
– Ecoh ⊆ Vs × Vs: a set of weighted coherence edges
– Eq ⊆ Vq × Vp × d, where d ∈ {rel, arg1, arg2}: the q-edges

Disambiguation Graph

Edge Weights
• Cohsem (semantic coherence)
– defined between two semantic items s1 and s2 as the Jaccard coefficient of their sets of inlinks.
• Three kinds of inlink sets
– InLinks(e)
– InLinks(c)
– InLinks(r)
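As a concrete reading of the coherence weight, the Jaccard coefficient of two inlink sets is the size of their intersection divided by the size of their union. The sketch below uses made-up page identifiers; returning 0.0 for two empty sets is an assumption, since the slides do not cover that case.

```python
def jaccard(a, b):
    """Jaccard coefficient |a ∩ b| / |a ∪ b| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy inlink sets; the page names are placeholders, not real data.
inlinks_s1 = {"page_A", "page_B", "page_C"}
inlinks_s2 = {"page_B", "page_C", "page_D"}

coh_sem = jaccard(inlinks_s1, inlinks_s2)  # 2 shared pages out of 4 -> 0.5
```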

InLinks(e)

• InLinks(e): the set of Yago2 entities whose corresponding Wikipedia pages link to the entity.

• e.g.
– Let e = Casablanca
– InLinks(Casablanca) = {Marwan_al-Shehhi, Ingrid_Bergman, …, Morocco, …}

InLinks(c)
• InLinks(c) = ∪_{e ∈ c} InLinks(e)
• e.g. let c = wikicategory_Metropolitan_areas_of_Morocco
– InLinks(wikicategory_Metropolitan_areas_of_Morocco) = InLinks(Casablanca) ∪ InLinks(Marrakech) ∪ InLinks(Fes) ∪ InLinks(Agadir) ∪ InLinks(Safi,_Morocco) ∪ InLinks(Oujda) ∪ InLinks(Tangier) ∪ InLinks(Rabat)

InLinks(r)

• InLinks(r) = ∪_{(e1, e2) ∈ r} (InLinks(e1) ∩ InLinks(e2))
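The class and relation variants of InLinks both reduce to set operations over the entity-level inlink sets: a union over a class's members, and a union of pairwise intersections over a relation's entity pairs. The sketch below uses placeholder entities and pages; the table `ENTITY_INLINKS` and both function names are assumptions.

```python
# Toy InLinks(e) table; entities and linking pages are placeholders.
ENTITY_INLINKS = {
    "e1": {"a", "b", "c"},
    "e2": {"b", "c", "d"},
    "e3": {"c", "e"},
}


def inlinks_class(members, inlinks=ENTITY_INLINKS):
    """InLinks(c): union of InLinks(e) over all entities e in class c."""
    result = set()
    for e in members:
        result |= inlinks.get(e, set())
    return result


def inlinks_relation(pairs, inlinks=ENTITY_INLINKS):
    """InLinks(r): union over (e1, e2) in r of InLinks(e1) ∩ InLinks(e2)."""
    result = set()
    for e1, e2 in pairs:
        result |= inlinks.get(e1, set()) & inlinks.get(e2, set())
    return result


class_links = inlinks_class(["e1", "e3"])
relation_links = inlinks_relation([("e1", "e2"), ("e2", "e3")])
```

Here `class_links` is {a, b, c, e} and `relation_links` is {b, c}; either set can then be compared with another item's inlinks via the Jaccard coefficient.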

Similarity Weights

• For entities
– how often a phrase refers to a certain entity in Wikipedia.
• For classes
– reflects the number of members in a class.
• For relations
– reflects the maximum n-gram similarity between the phrase and any of the relation’s surface forms.

Disambiguation Graph Processing

• The result of disambiguation is a subgraph of the disambiguation graph, yielding the most coherent mappings.

• The authors employ an integer linear program (ILP) to this end.
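To make the idea of picking the most coherent mappings tangible, the sketch below replaces the ILP with a brute-force search over a tiny toy graph. It is not the authors' formulation: the simplified objective (weighted similarity plus weighted pairwise coherence), the weights, the relation name `playsFor`, and the function name `best_mapping` are all assumptions. It only enforces that each phrase maps to at most one semantic item.

```python
from itertools import product


def best_mapping(candidates, sim, coh, alpha=1.0, beta=1.0):
    """Exhaustively pick at most one semantic item per phrase.

    candidates: phrase -> list of candidate semantic items
    sim:        (phrase, item) -> similarity weight
    coh:        frozenset({item1, item2}) -> coherence weight
    """
    phrases = list(candidates)
    best, best_score = None, float("-inf")
    # None means "leave this phrase unmapped".
    for choice in product(*(candidates[p] + [None] for p in phrases)):
        chosen = [(p, s) for p, s in zip(phrases, choice) if s is not None]
        items = [s for _, s in chosen]
        score = alpha * sum(sim[(p, s)] for p, s in chosen)
        score += beta * sum(coh.get(frozenset((a, b)), 0.0)
                            for i, a in enumerate(items)
                            for b in items[i + 1:])
        if score > best_score:
            best, best_score = dict(chosen), score
    return best, best_score


# Toy weights, chosen only to illustrate the effect of coherence.
candidates = {
    "Casablanca": ["Casablanca", "Casablanca_(film)"],
    "played in": ["actedIn", "playsFor"],
}
sim = {
    ("Casablanca", "Casablanca"): 0.6,
    ("Casablanca", "Casablanca_(film)"): 0.4,
    ("played in", "actedIn"): 0.5,
    ("played in", "playsFor"): 0.5,
}
coh = {frozenset(("Casablanca_(film)", "actedIn")): 0.8}

mapping, score = best_mapping(candidates, sim, coh)
```

Although the city has the higher similarity weight for "Casablanca", the coherence edge between the film and `actedIn` tips the result toward the film. Exhaustive search grows exponentially with the number of phrases, which is one reason a solver-based formulation is attractive.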

Definitions (part1)

Definitions (part2)

objective function

Constraints(1~3)

Constraints(4~7)

Constraints(8~9)

This is not invoked for existential questions

resulting subgraph for the disambiguation graph of Figure 3

Query Generation

• Subject/object roles are not assigned in triploids and q-units.
• Example:
– “Which singer is married to a singer?”
– ?x type singer, ?x marriedTo ?y, and ?y type singer
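The final rendering step can be sketched as string assembly over triple patterns. The helper `to_sparql` is hypothetical, and predicates and classes are written as bare identifiers exactly as on the slide; a real endpoint would expect IRIs or prefixed names instead.

```python
def to_sparql(patterns, target="?x"):
    """Render (subject, predicate, object) patterns as a SELECT query string."""
    body = " .\n  ".join(" ".join(triple) for triple in patterns)
    return f"SELECT {target} WHERE {{\n  {body} .\n}}"


patterns = [
    ("?x", "type", "singer"),
    ("?x", "marriedTo", "?y"),
    ("?y", "type", "singer"),
]
query = to_sparql(patterns)
print(query)
```

The two occurrences of "singer" receive distinct variables (?x and ?y), so the query asks for a singer married to some other singer rather than constraining a single entity twice.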

5 Evaluation

• Datasets
• Evaluation Metrics
• Results & Discussion

Datasets
• The authors' experiments are based on two collections of questions:
– QALD-1
  • 1st Workshop on Question Answering over Linked Data (QALD-1)
  • the context of the NAGA project
– NAGA collection
  • The NAGA collection is based on linking data from the Yago2 knowledge base
• Training set
– 23 QALD-1 questions
– 43 NAGA questions
• Test set
– 27 QALD-1 questions
– 44 NAGA questions
• The training set is used to obtain the hyperparameters (α, β, γ) of the ILP objective function.
• 19 QALD-1 questions in the test set

Evaluation Metrics

• The authors evaluated the output of DEANNA at three stages:
– 1. after the disambiguation of phrases
– 2. after the generation of the SPARQL query
– 3. after obtaining answers from the underlying linked-data sources
• Judgement
– Two human assessors judged whether an output item was good or not.
– If the two disagreed, a third person resolved the judgment.

disambiguation stage

• The task of the judges
– looked at each q-node/s-node pair, in the context of the question and the underlying data schemas,
– determined whether the mapping was correct or not,
– determined whether any expected mappings were missing.

query-generation stage

• The task of the judges
– looked at each triple pattern,
– determined whether the pattern was meaningful for the question or not,
– determined whether any expected triple pattern was missing.

query-answering stage

• The judges were asked to decide whether the result sets for the generated queries were satisfactory.

• Micro-averaging
– aggregates over all assessed items, regardless of the questions to which they belong.
• Macro-averaging
– first aggregates the items for the same question, and then averages the quality measure over all questions.
• For a question q and item set s in one of the stages of evaluation:
– correct(q, s): the number of correct items in s
– ideal(q): the size of the ideal item set
– retrieved(q, s): the number of retrieved items
• Coverage and precision are defined as follows:
cov(q, s) = correct(q, s) / ideal(q)
prec(q, s) = correct(q, s) / retrieved(q, s)
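The difference between the two averaging schemes is easiest to see on numbers. The sketch below applies the cov and prec definitions to two hypothetical questions; the counts are invented for illustration, the function names are assumptions, and all ideal and retrieved counts are assumed to be nonzero.

```python
def micro(counts):
    """counts: list of (correct, ideal, retrieved), one tuple per question.

    Pools all items across questions before dividing.
    """
    correct = sum(c for c, _, _ in counts)
    cov = correct / sum(i for _, i, _ in counts)
    prec = correct / sum(r for _, _, r in counts)
    return cov, prec


def macro(counts):
    """Computes cov and prec per question, then averages over questions."""
    covs = [c / i for c, i, _ in counts]
    precs = [c / r for c, _, r in counts]
    return sum(covs) / len(covs), sum(precs) / len(precs)


# Two made-up questions: (correct, ideal, retrieved)
counts = [(2, 4, 2), (1, 1, 2)]

micro_cov, micro_prec = micro(counts)  # 3/5 and 3/4
macro_cov, macro_prec = macro(counts)  # (0.5 + 1.0)/2 and (1.0 + 0.5)/2
```

Micro-averaged coverage is 0.6 while macro-averaged coverage is 0.75, because the question with the larger ideal set dominates the pooled count but counts only once in the per-question average.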

Conclusions

• The authors presented a method for translating natural language questions into structured queries.

• Although the authors’ model, in principle, leads to high combinatorial complexity, they observed that the Gurobi solver could handle their judiciously designed ILP very efficiently.

• The authors’ experimental studies showed very high precision and good coverage of the query translation, and good results in the actual question answers.
