Efficient Graph-based Document Similarity · 2018-12-05
KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
Institute of Applied Informatics and Formal Description Methods (AIFB)
www.kit.edu
Efficient Graph-based Document Similarity
Christian Paul, Achim Rettinger, Aditya Mogadala, Craig A. Knoblock, Pedro Szekely
2 06/01/2016
Common task: Related-document Search
Apple breaks laptop sales record
He drinks apple juice during half-time break
All-time high in MacBooks sold
U2 record pre-installed on iPhones
...
[Figure labels: Query document, Document Collection]
Matching words do not always indicate similarity
He drinks apple juice during half-time break
Apple breaks laptop sales record
All-time high in MacBooks sold
U2 record pre-installed on iPhones
...
[Figure labels: Document Collection, Query document]
Word co-occurrence can be misleading, too
Apple breaks laptop sales record
All-time high in MacBooks sold
U2 record pre-installed on iPhones
He drinks apple juice during half-time break
...
[Figure labels: Document Collection, Query document]
Semantic Technologies: resolve ambiguity & exploit relational knowledge
Apple breaks laptop sales record
All-time high in MacBooks sold
U2 record pre-installed on iPhones
He drinks apple juice during half-time break
...
[Figure: knowledge-graph entities MacBook, Apple Inc., Laptop, iPhone and Apple Juice, with edge labels "developer" (twice) and "type"; figure labels: Document Collection, Query document]
Expensive graph traversal
Related Work
Distributional:
+ scalable, fast
- No explicit disambiguation and conceptual relations
Explicit Semantic Analysis (ESA) [GM07]
TF-IDF, Vector Space Model
Salient Semantic Analysis (SSA) [HM11]
Knowledge-based:
+ rich semantic knowledge
- expensive graph traversal
PathSim [SHY+11]
HeteSim [SKH+14]
AnnSim: 1-1 matching, hierarchical similarity [PVH+13]
Schuhmacher, Ponzetto: Graph Edit Distance [SP14]
Nunes et al.: Transversal doc. similarity [NKF+13]
Bridging the gap
[Figure: the Related Work overview of distributional and knowledge-based approaches, shown again with "Efficient Graph-based Document Similarity" added]
Core Contributions
➢ Scalable related-document search process
➢ Graph traversal during pre-processing
➢ Light-weight tasks at search time
We achieve computational efficiency similar to that of statistical approaches
➢ Bag-of-entities document model & similarity
➢ Document similarity as combination of pairwise entity similarities
➢ Exploits hierarchical & transversal knowledge graph relations
In our experiments, we achieve higher correlation with the human notion of document similarity than the competition
Related-document Search using Graph-based Similarity
1) Semantic Document Expansion
• Enrich query document with relational knowledge
2) Inclusion in corpus
• Store & index expanded document
3) Pre-search
• Use inverted index to generate candidate set
4) Full search
• Entity-level, path-based similarities
Semantic Document Expansion
• Enrich document annotations
• Hierarchically
- Categories & their ancestors + hierarchical depths
• Transversally
- Weight neighboring entities based on
  • number of paths
  • length of paths
$w(e) = \sum_{l=1}^{L} \beta^{l} \cdot |paths_{a,e}^{(l)}|$
[Figure: example annotation graph for DocA with weights 1.5, 0.75, 0.25, 0.5 and 1]
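
The weighting formula above can be sketched in a few lines. This is only an illustration under stated assumptions: paths are taken to be simple paths (no repeated nodes), the knowledge graph is given as an adjacency dictionary, and the function name `expansion_weights`, the toy graph and the value β = 0.5 are hypothetical choices that do not come from the slides.

```python
from collections import defaultdict


def expansion_weights(graph, source, max_len, beta):
    """Return {entity: w(e)} with w(e) = sum over l of beta**l * (#paths of length l).

    Assumption: only simple paths (no repeated nodes) starting at `source` count.
    """
    weights = defaultdict(float)
    stack = [(source,)]  # each item is one path that starts at `source`
    while stack:
        path = stack.pop()
        length = len(path) - 1
        if length > 0:
            # One more path of this length ends at path[-1].
            weights[path[-1]] += beta ** length
        if length < max_len:
            for neighbor in graph.get(path[-1], ()):
                if neighbor not in path:
                    stack.append(path + (neighbor,))
    return dict(weights)


# Hypothetical toy graph, stored as undirected adjacency lists.
toy_graph = {
    "MacBook": ["Apple_Inc", "Laptop"],
    "Apple_Inc": ["MacBook", "iPhone"],
    "Laptop": ["MacBook"],
    "iPhone": ["Apple_Inc"],
}
weights = expansion_weights(toy_graph, "MacBook", max_len=2, beta=0.5)
print(weights)
```

With these toy values, direct neighbors receive β and entities two hops away receive β² per connecting path, so closer and more densely connected entities end up with larger weights.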
Pre-Search: Generate Candidate Set
• Inverted index from entities to documents
- Retrieve candidates efficiently
• Assumption: Entity overlap → contextual similarity
- Coarse, document-level assessment
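
A minimal sketch of candidate generation with an inverted index follows. It assumes that candidates are ranked by the plain number of shared entities, which is a simplification chosen for illustration; the slides do not specify the scoring used in pre-search. The document ids, entity names and helper names are hypothetical.

```python
from collections import Counter, defaultdict


def build_inverted_index(corpus):
    """Map each entity to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, entities in corpus.items():
        for entity in entities:
            index[entity].add(doc_id)
    return index


def candidate_set(index, query_entities, size):
    """Return up to `size` document ids, ranked by entity overlap with the query."""
    overlap = Counter()
    for entity in set(query_entities):
        for doc_id in index.get(entity, ()):
            overlap[doc_id] += 1
    # Highest overlap first; ties are broken by document id to stay deterministic.
    ranked = sorted(overlap.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:size]]


# Hypothetical corpus of already annotated documents.
corpus = {
    "doc_macbook": {"MacBook", "Apple_Inc", "Laptop"},
    "doc_iphone": {"iPhone", "Apple_Inc", "U2"},
    "doc_juice": {"Apple_juice", "Half-time"},
}
index = build_inverted_index(corpus)
candidates = candidate_set(index, {"Apple_Inc", "Laptop"}, size=2)
print(candidates)
```

Only documents that share at least one entity with the query are ever touched, which is what keeps this step cheap compared with scoring the whole collection.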
Full Search: Graph-based Document Similarity
• For each candidate document, reconstruct query-candidate annotation subgraph (hierarchical & transversal)
➢ Compute all pairwise entity similarity scores
➢ Combine into document score
[Figure: DocA and DocB]
Hierarchical entity similarity
• Using stored ancestors & depths to compute:
$hierSim_{dps}(x, y) = \frac{d(root, lca(x, y))}{d(root, lca(x, y)) + d(lca(x, y), x) + d(lca(x, y), y)}$
• Example:
$hierSim_{dps}(ent_1, ent_2) = \frac{1}{1 + 2 + 2} = 0.2$
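
The depth-based measure can be sketched as below. The data layout is an assumption made for illustration: each entity stores its ancestor categories together with the distance up to them, and a separate table holds each category's depth below the root. The names `hier_sim_dps`, `depth` and the category `"C"` are hypothetical; the numbers reproduce the slide's example (lowest common ancestor at depth 1, both entities at distance 2).

```python
def hier_sim_dps(ancestors_x, ancestors_y, depth):
    """Depth-based hierarchical similarity from stored ancestors and depths.

    ancestors_*: {category: distance from the entity up to that category}
    depth:       {category: distance from the root down to that category}
    """
    common = set(ancestors_x) & set(ancestors_y)
    if not common:
        return 0.0
    # The lowest common ancestor is the deepest category shared by both entities.
    lca = max(common, key=lambda category: depth[category])
    d_root = depth[lca]
    total = d_root + ancestors_x[lca] + ancestors_y[lca]
    # Degenerate case (both entities are the root itself): report 0.0.
    return d_root / total if total else 0.0


depth = {"root": 0, "C": 1}
ancestors_ent1 = {"C": 2, "root": 3}
ancestors_ent2 = {"C": 2, "root": 3}
hier_score = hier_sim_dps(ancestors_ent1, ancestors_ent2, depth)
print(hier_score)
```

Because ancestors and depths are stored during pre-processing, the search-time work reduces to a set intersection and a few lookups rather than a graph traversal.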
Transversal entity similarity
• Use stored neighbors & weights to compute:
$transSim(a, b) = \sum_{l=1}^{L \cdot 2} \beta^{l} \cdot |paths_{a,b}^{(l)}|$
• Example: $transSim(ent_1, ent_2) = 0.5^2 + 2 \cdot 0.25^2 + 0.5 \cdot 0.25 = 0.5$
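
One plausible reading of the example is that the score is accumulated over neighbors that both entities share, multiplying the two stored weights for each shared neighbor. The sketch below follows that reading; it is an assumption, not a confirmed description of the authors' implementation. The neighbor names and the assignment of weights to neighbors are hypothetical, chosen so that the products match the terms of the example.

```python
def trans_sim(weights_a, weights_b):
    """Sum the products of stored weights over neighbors shared by both entities.

    Assumption: a path a -> n -> b contributes w_a(n) * w_b(n), so path lengths
    add up in the exponent of beta (beta**l1 * beta**l2 == beta**(l1 + l2)).
    """
    shared = set(weights_a) & set(weights_b)
    return sum(weights_a[n] * weights_b[n] for n in shared)


# Hypothetical stored expansions of two entities.
weights_ent1 = {"n1": 0.5, "n2": 0.25, "n3": 0.25, "n4": 0.5}
weights_ent2 = {"n1": 0.5, "n2": 0.25, "n3": 0.25, "n4": 0.25, "n5": 1.0}
trans_score = trans_sim(weights_ent1, weights_ent2)
print(trans_score)
```

Under this reading, `n1` contributes 0.5², `n2` and `n3` contribute 0.25² each, and `n4` contributes 0.5 · 0.25, while `n5` is ignored because only one entity reaches it.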
Document similarity: bipartite graph of entity similarities
1. Annotation pair similarity: Combine transversal & hierarchical scores
2. Determine maxGraph: for each annotation, choose max. score edge (bold)
3. Compute document score based on max. edges for each annotation $a_{1i}$ of DocA:
$docSim(docA, docB) = \frac{\sum_{a_{1i} \in A_1} entSim(a_{1i}, matched(a_{1i}))}{|A_1| + |A_2|}$
[Figure: bipartite graph between the annotations of DocA and DocB, with max. edges $(a_{1i}, matched(a_{1i}))$]
Document similarity: DBpedia example
• Example documents score:
$docSim(docA, docB) = \frac{0.53 + 0.92 + 0.43 + 0.53 + 0.58 + 0.81}{3 + 3} \approx 0.63$
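
The maxGraph combination can be sketched as follows. The example score has six terms over a denominator of 3 + 3, which suggests that every annotation of both documents contributes its best edge; the sketch assumes exactly that. The function name `doc_sim` and the pairwise matrix `sim` are hypothetical, and only `slide_edges` uses numbers taken from the example.

```python
def doc_sim(sim):
    """Average the best-edge scores over the annotations of both documents.

    sim[i][j] is the entity similarity between annotation i of DocA and
    annotation j of DocB (already combining hierarchical and transversal scores).
    """
    if not sim or not sim[0]:
        return 0.0
    best_for_a = [max(row) for row in sim]           # best edge per DocA annotation
    best_for_b = [max(col) for col in zip(*sim)]     # best edge per DocB annotation
    return (sum(best_for_a) + sum(best_for_b)) / (len(best_for_a) + len(best_for_b))


# Hypothetical pairwise scores: 3 annotations in DocA, 2 in DocB.
sim = [
    [0.9, 0.1],
    [0.2, 0.6],
    [0.3, 0.4],
]
print(doc_sim(sim))

# Arithmetic check of the example, using its six max-edge scores.
slide_edges = [0.53, 0.92, 0.43, 0.53, 0.58, 0.81]
slide_score = sum(slide_edges) / (3 + 3)
print(round(slide_score, 2))
```

Unlike a strict 1-1 matching, this lets several annotations of one document pick the same partner in the other, so a single strongly related entity is not forced to pair with a weaker one.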
Evaluation
• Task: Measure correlation with human notion of similarity
• Datasets
  • Document similarity: Lee50 [1]
  • Sentence similarity: 2012-MSRvid-Test [2], 2015-Images [3]
• ... using and X-LiSA [ZR14] entity extractor
[1] https://webfiles.uci.edu/mdlee/LeePincombeWelsh.zip
[2] http://research.microsoft.com/en-us/downloads/38cf15fd-b8df-477e-a4e4-a4680caa75af/
[3] http://ixa2.si.ehu.es/stswiki/index.php/
Document Similarity: Lee50 corpus
• 50 short news articles (51 to 126 words)
• Gold standard set of full pairwise document similarity scores
➢ Outperforming baselines & competition:
  • Statistical (LSA, ESA, SSA)
  • Knowledge-based (GED)
Sentence Similarity
• Compared to related unsupervised approaches (on texts with one or more extracted entities)
• 2012-MSRvid-Test: Video descriptions from MSR Video Paraphrase Corpus
• 2015-Images: Flickr image descriptions
➢ Outperforming baselines & competition
  • Statistical (Polyglot)
  • Knowledge-based (Tiantianzhu7, IRIT, WSL)
Related-document Search: Pre-Search, Full Search & Efficiency
➢ Ranking score (nDCG) improves from Pre-Search to Full Search
➢ Computation time grows linearly with candidate set size
➢ Here: candidate set of size ~15 achieves high performance
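
For readers unfamiliar with the ranking score, here is a small sketch of nDCG. It uses one common variant (linear gain with a log2 rank discount); the slides do not say which variant was used, so treat that choice as an assumption. The relevance grades and the two rankings are invented purely to show how a better ordering raises the score.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and a log2 rank discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical relevance grades of the same four candidates in two orderings.
pre_search_ranking = [1, 3, 2, 0]
full_search_ranking = [3, 2, 1, 0]
print(ndcg(pre_search_ranking), ndcg(full_search_ranking))
```

A ranking that already matches the ideal order scores 1.0, so re-ranking a fixed candidate set can only improve nDCG by moving the more relevant documents toward the top.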
Conclusion & Outlook
• Efficient Graph-based Document Similarity
  • … combines hierarchical & transversal relational knowledge
  • … outperforms related distributional & knowledge-based approaches, on both articles and sentences
  • … is computationally efficient: related-document search
• Lessons learned
  ➢ Value of DBpedia for semantic similarity
  ➢ The more entities (at least one) per document, the better:
    ➢ Few entities: disambiguation helps
    ➢ Many entities: maxGraph entity pairing emphasizes meaningful relations
• Resources (code, data, documents): http://people.aifb.kit.edu/amo/eswc2016/
References I
• [TMS08] Thiagarajan, Manjunath, Stumptner. Computing semantic similarity using ontologies. In ISWC 08, the International Semantic Web Conference (ISWC), 2008.
• [LD08] Lemaire, Denhière. Effects of high-order co-occurrences on word semantic similarities.
• [GM07] Gabrilovich, Markovitch. Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJCAI, volume 7, pages 1606–1611, 2007.
• [HM11] Hassan, Mihalcea. Semantic relatedness using salient semantic analysis. In AAAI, 2011.
• [SP14] Schuhmacher, Ponzetto. Knowledge-based graph document modeling. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining, WSDM '14.
• [NKF+13] Nunes, Kawase, Fetahu, Dietze, Casanova, Maynard. Interlinking documents based on semantic graphs. Procedia Computer Science, 22:231–240, 2013.
• [PSA08] Potthast, Stein, Anderka. A Wikipedia-based multilingual retrieval model. In Advances in Information Retrieval, pages 522–530. Springer, 2008.
• [SHY+11] Sun, Han, Yan, Yu, Wu. PathSim: Meta path-based top-k similarity search in heterogeneous information networks. VLDB '11, 2011.
• [SKH+14] Chuan, Xiangnan, Yue, Yu, Bin. HeteSim: A general framework for relevance measure in heterogeneous networks. IEEE Transactions on Knowledge & Data Engineering.
• [PVH+13] Palma, Vidal, Haag, Raschid, Thor. Measuring relatedness between scientific entities in annotation datasets. In Proceedings of the International Conference on Bioinformatics, Computational Biology and Biomedical Informatics, BCB '13.
• [ZR14] Zhang, Rettinger. X-LiSA: Cross-lingual semantic annotation. Proceedings of the VLDB Endowment (PVLDB), the 40th International Conference on Very Large Data Bases (VLDB).
• [KJC+15] Pavan Kapanipathi, Prateek Jain, Chitra Venkataramani, Amit Sheth. Hierarchical interest graph, 21 January 2015. wiki.knoesis.org/index.php/Hierarchical_Interest_Graph, last accessed 07/15/2015
References II
• [LIJ+15] Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., van Kleef, P., Auer, S., et al.: DBpedia - a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6(2), 167-195 (2015)