Type Inference on Noisy RDF Data


Heiko Paulheim, Christian Bizer

The Problem

One promise of the Semantic Web: you can issue structured queries

e.g., List all presidents that graduated from Harvard Law School

SELECT ?x WHERE {
  ?x a dbpedia-owl:President .
  ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School .
}
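
To actually run this, the query can be sent to the public DBpedia SPARQL endpoint, e.g. with Python and the SPARQLWrapper library. A minimal sketch (the endpoint URL and prefix declarations are the commonly used ones; the result depends on the DBpedia release being queried):

from SPARQLWrapper import SPARQLWrapper, JSON

# Public DBpedia endpoint (any SPARQL endpoint hosting DBpedia works)
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery("""
    PREFIX dbpedia-owl: <http://dbpedia.org/ontology/>
    PREFIX dbpedia:     <http://dbpedia.org/resource/>
    SELECT ?x WHERE {
      ?x a dbpedia-owl:President .
      ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School .
    }
""")
sparql.setReturnFormat(JSON)

# Print the URIs of all matching resources
for row in sparql.query().convert()["results"]["bindings"]:
    print(row["x"]["value"])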

The Problem

SELECT ?x WHERE {
  ?x a dbpedia-owl:President .
  ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School .
}

...if we run this against DBpedia, we get one result, i.e., Elwell Stephen Otis

But...

The Problem

So what is going wrong?

SELECT ?x WHERE {
  ?x a dbpedia-owl:President .
  ?x dbpedia-owl:almaMater dbpedia:Harvard_Law_School .
}

In DBpedia, Barack Obama is not of type President!

How can we add missing types?

Is It a Big Problem?

DBpedia has at least 2.7 million missing type statements w.r.t. the DBpedia ontology

found using co-occurrence analysis of matching classes
in YAGO and DBpedia

a very optimistic lower bound

Highly incomplete classes:
Species: >870,000 missing statements

Person: >510,000 missing statements

Event: >150,000 missing statements

A Naive Approach

Idea: exploit properties with domain and range

Pseudo-RDFS reasoning:

CONSTRUCT { ?x a ?t }
WHERE { { ?x ?r ?y . ?r rdfs:domain ?t }
  UNION { ?y ?r ?x . ?r rdfs:range ?t } }
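
The same pseudo-RDFS rule can be sketched in Python over a small in-memory triple list; the toy schema and triples below are purely illustrative, not taken from DBpedia:

# Naive type inference via rdfs:domain and rdfs:range,
# mirroring the CONSTRUCT query above.

# Toy schema (illustrative): property -> declared domain / range class
DOMAIN = {"almaMater": "Person", "award": "Person"}
RANGE = {"almaMater": "EducationalInstitution", "award": "Award"}

triples = [
    ("Barack_Obama", "almaMater", "Harvard_Law_School"),
    ("Kurt_H._Debus", "award", "Germany"),  # one noisy statement
]

def naive_types(triples, domain, range_):
    """Subjects receive the property's declared domain, objects its range."""
    inferred = {}
    for s, p, o in triples:
        if p in domain:
            inferred.setdefault(s, set()).add(domain[p])
        if p in range_:
            inferred.setdefault(o, set()).add(range_[p])
    return inferred

print(naive_types(triples, DOMAIN, RANGE))
# Germany ends up typed as Award because of the single noisy triple,
# which is exactly the failure mode discussed on the next slides.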

A Naive Approach

Experiment with Barack Obama:
Person, PersonFunction, Actor, Organization

Experiment with Germany:
Place, Award, Populated Place, City, SportsTeam, Mountain, Agent, Organisation, Country, Stadium, RecordLabel, MilitaryUnit, Company, EducationalInstitution, PersonFunction, EthnicGroup, Architect, WineRegion, Language, MilitaryConflict, Settlement, RouteOfTransportation

A Naive Approach

What is going on here? DBpedia data is noisy

One wrong statement is enough for a wrong conclusion

e.g.: dbpedia:Kurt_H._Debus dbpedia-owl:award dbpedia:Germany

Germany example: 69,000 statements
20 wrong types can come from 20 wrong statements

i.e., an error rate of 0.03% is enough for a totally screwed result

...but that would be an excellent data quality for a LOD source!

SDType Approach

Idea: outgoing/incoming properties are indicators for a resource's type
e.g.: starring → Movie

e.g.: author⁻¹ → Writer

Basic compiled statistics:
P(C|p) := probability of class C in the presence of property p

e.g.: P(dbpedia:Film|starring) = 0.79

e.g.: P(dbpedia:Writer|author⁻¹) = 0.44
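
A rough sketch of how such conditional probabilities could be compiled from a typed triple set. The toy data, the function name, and the ^-1 marker for incoming properties are assumptions made for illustration, not the authors' implementation:

from collections import Counter, defaultdict

# Toy data (illustrative): known types per resource and a few triples
types = {
    "Pulp_Fiction": {"Film", "Work"},
    "The_Shining_(film)": {"Film", "Work"},
    "Stephen_King": {"Writer", "Person"},
}
triples = [
    ("Pulp_Fiction", "starring", "John_Travolta"),
    ("The_Shining_(film)", "starring", "Jack_Nicholson"),
    ("Stephen_King", "author", "It_(novel)"),
]

def class_given_property(triples, types):
    """Estimate P(C|p): the share of typed resources carrying property p
    (outgoing, or incoming marked as p^-1) that belong to class C."""
    carriers = Counter()              # typed resources carrying p
    by_class = defaultdict(Counter)   # ... of which how many are typed C
    for s, p, o in triples:
        for resource, prop in ((s, p), (o, p + "^-1")):
            if resource in types:
                carriers[prop] += 1
                for c in types[resource]:
                    by_class[prop][c] += 1
    return {p: {c: n / carriers[p] for c, n in cs.items()}
            for p, cs in by_class.items()}

print(class_given_property(triples, types))
# e.g. {'starring': {'Film': 1.0, 'Work': 1.0}, 'author': {...}}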

SDType Approach

Based on the precompiled statistics:
Find types of instances

Using voting

score(C) = avg(all properties p) P(C|p)

Refinement: weight properties by their discriminative power

weight(p) = sum over all classes C of (P(C) - P(C|p))²

i.e., how strongly this property's class distribution
deviates from the overall class distribution
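
A sketch of the weighted vote, assuming the overall class distribution P(C) and the conditional distributions P(C|p) are available as plain dictionaries; the numbers below are made-up toy values (apart from the 0.79 quoted above), and the weight uses the squared-deviation formula from this slide:

def sdtype_scores(props, p_c_given_p, p_c):
    """score(C): weighted average of P(C|p) over the resource's properties,
    where a property's weight measures how far its class distribution
    deviates from the overall class distribution."""
    def weight(p):
        classes = set(p_c) | set(p_c_given_p[p])
        return sum((p_c.get(c, 0.0) - p_c_given_p[p].get(c, 0.0)) ** 2
                   for c in classes)

    weights = {p: weight(p) for p in props if p in p_c_given_p}
    total = sum(weights.values()) or 1.0
    scores = {}
    for p, w in weights.items():
        for c, prob in p_c_given_p[p].items():
            scores[c] = scores.get(c, 0.0) + w * prob
    return {c: s / total for c, s in scores.items()}

# Made-up toy distributions (not the published statistics):
p_c = {"Film": 0.05, "TelevisionShow": 0.02, "Person": 0.30}
p_c_given_p = {
    "starring": {"Film": 0.79, "TelevisionShow": 0.15},
    "director": {"Film": 0.70, "TelevisionShow": 0.20},
}
print(sdtype_scores(["starring", "director"], p_c_given_p, p_c))
# A resource with both outgoing properties is scored clearly as a Film.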

Evaluation

Two-fold evaluation:
On DBpedia and OpenCyc as silver standard
(automatic, 10,000 random instances)

On untyped DBpedia resources (manual, 100 instances)

Using only incoming properties
(using outgoing properties would be trivial, since in DBpedia both types and outgoing properties are generated from the same infoboxes!)

Evaluation Results

On DBpedia

Evaluation Results

On OpenCyc

Evaluation Results

Evaluation on untyped resources:
Random sample of 100 untyped resources

Manual checking of precision

Evaluation Results

DBpedia: works reasonably well (F-measure 0.89)

OpenCyc: harder because of the deeper class hierarchy (F-measure 0.60)

General: having more links increases precision
(in contrast to RDFS reasoning)

more general types (e.g., Band) are easier than specific ones
(e.g., PunkRockBand)

Deployment

Heuristic types have been included in DBpedia 3.9 for previously untyped instances

3.4 million type statements at precision ~0.95

Also includes many resources without a Wikipedia page, i.e., resources generated from red links

Runtime: complexity O(PT)
P: number of property assertions
T: number of type assertions

~24h for processing DBpedia

Conclusion and Outlook

The SDType approach works at high quality:
it outperforms most state-of-the-art approaches on DBpedia

deployed for DBpedia 3.9

The same approach can be used for validating links:

within dataset: deployed for DBpedia 3.9 (removed ~13,000 wrong statements)

across datasets: to be done

Type Inference on Noisy RDF Data

Heiko Paulheim, Christian Bizer
