Generating Linked Data by Inferring the Semantics of Tables
Transcript of Generating Linked Data by Inferring the Semantics of Tables
Generating Linked Data by Inferring the Semantics of Tables
Varish Mulwad, Ph.D. 2015
http://ebiq.org/j/96
Goal: Table => LOD*
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11
http://dbpedia.org/class/yago/NationalBasketballAssociationTeams
http://dbpedia.org/resource/Allen_Iverson
Player height in meters
dbprop:team
* DBpedia
2/49
Goal: Table => LOD*
Name            Team          Position        Height
Michael Jordan  Chicago       Shooting guard  1.98
Allen Iverson   Philadelphia  Point guard     1.83
Yao Ming        Houston       Center          2.29
Tim Duncan      San Antonio   Power forward   2.11
@prefix dbpedia: <http://dbpedia.org/resource/> .
@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix yago: <http://dbpedia.org/class/yago/> .
"Name"@en is rdfs:label of dbo:BasketballPlayer .
"Team"@en is rdfs:label of yago:NationalBasketballAssociationTeams .
"Michael Jordan"@en is rdfs:label of dbpedia:MichaelJordan .
dbpedia:MichaelJordan a dbo:BasketballPlayer .
"Chicago Bulls"@en is rdfs:label of dbpedia:ChicagoBulls .
dbpedia:ChicagoBulls a yago:NationalBasketballAssociationTeams .
RDF Linked Data
All this in a completely automated way
* DBpedia
3/49
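As a rough illustration of the last step on this slide, the sketch below renders column and cell annotations as statements in the same "is rdfs:label of" style. The function names and the input layout are assumptions made for this example; they are not part of the system described in the talk.

```python
# Minimal sketch: turn column/cell annotations into RDF-style statements.
# All function names and the input format are hypothetical.

PREFIXES = {
    "dbpedia": "http://dbpedia.org/resource/",
    "dbo": "http://dbpedia.org/ontology/",
    "yago": "http://dbpedia.org/class/yago/",
}


def label_statement(label, term):
    """Render: "label"@en is rdfs:label of term ."""
    return f'"{label}"@en is rdfs:label of {term} .'


def type_statement(entity, cls):
    """Render: entity a cls ."""
    return f"{entity} a {cls} ."


def table_to_statements(column_classes, cell_links):
    """column_classes maps header -> class.
    cell_links is a list of (cell text, entity, header) tuples."""
    lines = [f"@prefix {prefix}: <{iri}> ." for prefix, iri in PREFIXES.items()]
    for header, cls in column_classes.items():
        lines.append(label_statement(header, cls))
    for text, entity, header in cell_links:
        lines.append(label_statement(text, entity))
        # Each linked cell is typed with the class chosen for its column.
        lines.append(type_statement(entity, column_classes[header]))
    return lines


if __name__ == "__main__":
    statements = table_to_statements(
        {
            "Name": "dbo:BasketballPlayer",
            "Team": "yago:NationalBasketballAssociationTeams",
        },
        [
            ("Michael Jordan", "dbpedia:MichaelJordan", "Name"),
            ("Chicago Bulls", "dbpedia:ChicagoBulls", "Team"),
        ],
    )
    print("\n".join(statements))
```

The output is plain text only; a real pipeline would more likely build the graph with an RDF library and serialize it from there.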
Tables are everywhere!! … yet …
The web – 154 million high quality relational tables
4/49
Evidence-based medicine
Figure: Evidence-Based Medicine - the Essential Role of Systematic Reviews, and the Need for Automated Text Mining Tools, IHI 2010
Evidence-based medicine judges the efficacy of treatments or tests by meta-analyses of clinical trials. Key information is often found in tables in articles.
However, the rate at which meta-analyses are published remains very low … hampers effective healthcare treatment …
# of clinical trials published in 2008
# of meta-analyses published in 2008
5/49
~400,000 datasets
~ <1% in RDF
6/49
2010 Preliminary System
Class prediction for column: 77%
Entity linking for table cells: 66%
Examples of class label prediction results:
Column – Nationality, Prediction – MilitaryConflict
Column – Birth Place, Prediction – PopulatedPlace
Predict Class for Columns
Linking the table cells
Identify and Discover relations
T2LD Framework
Sources of Errors
• The sequential approach let errors percolate from one phase to the next
• The system was biased toward predicting overly general classes over more appropriate specific ones
• Heuristics largely drive the system
• Although we consider multiple sources of evidence, we did not perform joint assignment
8/49
Pre-processing modules: Sampling, Acronym detection
Query and generate initial mappings
Joint Inference/Assignment
Generate Linked RDF
Verify (optional)
Store in a knowledge base & publish as LOD
A Domain Independent Framework
9/49
Query Mechanism
Michael Jordan | Chicago Bulls | Shooting Guard | 1.98
Column header: Team
Possible types: {dbo:Place, dbo:City, yago:WomenArtist, yago:LivingPeople, yago:NationalBasketballAssociationTeams …}
Possible entities: Chicago Bulls, Chicago, Judy Chicago … ………
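The query step can be pictured with a toy lookup: given a cell string, collect every knowledge-base entity whose label shares a token with it, then pool the types of those entities as candidate classes for the column. The in-memory `TOY_KB`, its entity-to-type assignments, and the token-overlap matching rule are all assumptions made for illustration; the talk does not specify how the knowledge base is actually queried.

```python
# Minimal sketch of candidate generation against a hypothetical toy KB.
# The entity -> types assignments below are illustrative assumptions.

TOY_KB = {
    "Chicago Bulls": ["yago:NationalBasketballAssociationTeams"],
    "Chicago": ["dbo:Place", "dbo:City"],
    "Judy Chicago": ["yago:WomenArtist", "yago:LivingPeople"],
    "Houston": ["dbo:Place", "dbo:City"],
}


def query_candidates(cell_text, kb=TOY_KB):
    """Return (possible entities, possible types) for one cell string."""
    tokens = set(cell_text.lower().split())
    # An entity is a candidate if its label shares at least one token.
    entities = [label for label in kb if tokens & set(label.lower().split())]
    types = []
    for entity in entities:
        for type_name in kb[entity]:
            if type_name not in types:  # keep first-seen order, no duplicates
                types.append(type_name)
    return entities, types


if __name__ == "__main__":
    entities, types = query_candidates("Chicago")
    print("possible entities:", entities)
    print("possible types:", types)
```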
10/49
Ranking the candidates
String similarity metrics
String in column header <-> Class from an ontology
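A small sketch of ranking candidate classes against a column header by string similarity follows. The slide does not name the metrics used, so Python's `difflib.SequenceMatcher` ratio is a stand-in here, and the class labels are made up for the example.

```python
# Minimal sketch: rank candidate class labels by similarity to a column header.
# SequenceMatcher is a stand-in; the talk does not say which metrics are used.
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def rank_classes(header, class_labels):
    """Return class labels sorted from most to least similar to the header."""
    return sorted(class_labels, key=lambda label: similarity(header, label), reverse=True)


if __name__ == "__main__":
    # Hypothetical human-readable labels for candidate classes.
    print(rank_classes("Team", ["City", "Basketball team", "Living people"]))
```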
11/49
Ranking the candidates
String similarity metrics
Popularity metrics
String in table cell <-> Entity from the knowledge base (KB)
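For entities, the slide combines string similarity with popularity. One possible way to blend the two is sketched below; the equal weighting, the log-scaled popularity, and the reference counts are assumptions, not values from the talk.

```python
# Minimal sketch: score candidate entities by string similarity plus popularity.
# The weighting scheme and the counts are hypothetical.
import math
from difflib import SequenceMatcher


def string_score(cell_text, label):
    return SequenceMatcher(None, cell_text.lower(), label.lower()).ratio()


def popularity_score(count, max_count):
    """Log-scaled popularity in [0, 1]; 0 when no counts are available."""
    if max_count <= 0:
        return 0.0
    return math.log1p(count) / math.log1p(max_count)


def rank_entities(cell_text, candidates, weight=0.5):
    """candidates maps entity label -> popularity count (e.g. incoming links).
    Returns (label, score) pairs, best first."""
    max_count = max(candidates.values(), default=0)
    scored = [
        (
            label,
            weight * string_score(cell_text, label)
            + (1 - weight) * popularity_score(count, max_count),
        )
        for label, count in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    counts = {"Chicago Bulls": 3000, "Chicago": 9000, "Judy Chicago": 200}
    for label, score in rank_entities("Chicago", counts):
        print(f"{label}: {score:.3f}")
```

Note that ranking each cell in isolation prefers the city for the string "Chicago", which is exactly the kind of local decision the joint inference step is meant to correct.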
12/49
Joint Inference over evidence in a table
✓ Probabilistic Graphical Models
13/49
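A graphical model for tables
Joint inference over evidence in a table
[Figure: column-header variables C1, C2, C3 above row-value variables R11 … R33. Example: the column header "Team" is assigned a Class; the cells Chicago, Philadelphia, Houston and San Antonio are each assigned an Instance.]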
14/49
Parameterized graphical model
[Figure: factor graph over column-header variables C1, C2, C3 and row-value variables R11 … R33, with factor nodes ψ3 (three), ψ4 (three) and ψ5 (one).]
Variable Node: Column header, Row value
Factor Node:
• Function that captures the affinity between the column headers and row values
• Captures interaction between column headers
• Captures interaction between row values
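To make the idea of joint assignment concrete, here is a tiny worked example for a single column. It assumes that ψ3 is the factor linking a column header to the cells in its column (an inference from the figure, where ψ3 appears once per column), and it uses exhaustive search instead of whatever inference procedure the actual system uses. The prior scores, the entity "Houston Rockets", and the 1.0/0.1 compatibility values are all invented for illustration.

```python
# Minimal sketch: joint assignment for one column by exhaustive search.
# psi3, the priors, and the compatibility values are illustrative assumptions.
from itertools import product

TEAMS = "yago:NationalBasketballAssociationTeams"
CITY = "dbo:City"

# Scores from the independent ranking step (hypothetical numbers).
header_prior = {CITY: 0.2, TEAMS: 0.8}
cell_priors = [
    {"Chicago": 0.6, "Chicago Bulls": 0.4},
    {"Houston": 0.6, "Houston Rockets": 0.4},
]
entity_class = {
    "Chicago": CITY,
    "Houston": CITY,
    "Chicago Bulls": TEAMS,
    "Houston Rockets": TEAMS,
}


def psi3(cls, entity):
    """Affinity between a column class and a cell entity."""
    return 1.0 if entity_class[entity] == cls else 0.1


def joint_assign(header_prior, cell_priors):
    """Return the (class, entities) assignment with the highest joint score."""
    best, best_score = None, -1.0
    for cls in header_prior:
        for entities in product(*cell_priors):
            score = header_prior[cls]
            for prior, entity in zip(cell_priors, entities):
                score *= prior[entity] * psi3(cls, entity)
            if score > best_score:
                best, best_score = (cls, entities), score
    return best, best_score


if __name__ == "__main__":
    print(joint_assign(header_prior, cell_priors))
```

Taken cell by cell, the priors favor the cities; scored jointly with the header, the team reading wins. Exhaustive search only works for toy inputs, since the number of assignments grows exponentially with the number of cells.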
15/49
Challenge: Interpreting Literals
Population: 690,000; 345,000; 510,020; 120,000
Age: 75; 65; 50; 25
Population? Profit in $K?
Age in years? Percent?
Many columns have literals, e.g., numbers
• Predict properties based on cell values
• Cyc had hand coded rules: humans don't live past 120
• We extract value distributions from LOD resources
• Differ for subclasses: age of people vs. political leaders vs. athletes
• Represent as measurements: value + units
• Metric: possibility/probability of values given distribution
16/49
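One way to read the "probability of values given distribution" metric: fit a simple distribution to known values of each candidate property, then score a numeric column by its average log-density under each fit. The sketch below uses a normal distribution and made-up reference samples; both are assumptions, and real value distributions extracted from LOD (populations in particular) would likely need a better model.

```python
# Minimal sketch: pick the property whose value distribution best explains a
# numeric column. The reference samples and the normal model are hypothetical.
import math
from statistics import mean, stdev


def log_density(x, mu, sigma):
    """Log of the normal density, computed directly to avoid underflow."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2 * math.pi)


def best_property(column_values, reference):
    """reference maps property name -> sample of known values.
    Returns (best property, {property: average log-density})."""
    scores = {}
    for prop, samples in reference.items():
        mu, sigma = mean(samples), stdev(samples)
        scores[prop] = mean(log_density(v, mu, sigma) for v in column_values)
    return max(scores, key=scores.get), scores


# Hypothetical samples standing in for values extracted from LOD resources.
REFERENCE = {
    "age in years": [22, 35, 41, 58, 63, 70, 84],
    "population": [120000, 250000, 400000, 650000, 900000],
}

if __name__ == "__main__":
    print(best_property([75, 65, 50, 25], REFERENCE)[0])
    print(best_property([690000, 345000, 510020, 120000], REFERENCE)[0])
```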
Other Challenges
• Using table captions and other text in associated documents to provide context
• Size of some data.gov tables (> 400K rows!) makes using full graphical model impractical
  – Sample table and run model on the subset
• Achieving acceptable accuracy may require human input
  – 100% accuracy unattainable automatically
  – How best to let humans offer advice and/or correct interpretations?
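The sampling idea for very large tables can be sketched in a few lines: draw a fixed-size random subset of rows and run the model on that subset only. The sample size and the uniform sampling strategy are assumptions; the talk does not say how rows are chosen.

```python
# Minimal sketch: sample rows from a large table before running the model.
# Uniform random sampling and the default seed are illustrative assumptions.
import random


def sample_rows(rows, k, seed=0):
    """Return up to k rows chosen uniformly at random (all rows if fewer)."""
    if len(rows) <= k:
        return list(rows)
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return rng.sample(rows, k)


if __name__ == "__main__":
    big_table = [("row", i) for i in range(400_000)]
    subset = sample_rows(big_table, 100)
    print(len(subset))
```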
17/49