Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

Entity-Centric Data Management

Philippe Cudré-Mauroux

eXascale Infolab, University of Fribourg Switzerland

MIT CSAIL – DB GROUP March 29, 2012

eXascale Infolab

•  New lab @ U. of Fribourg, Switzerland •  Financed largely by the Swiss Federal State •  Extremely large-scale, non-relational data

management (… mostly)

XI 10.11.2010

ZenCrowd

XI Projects

Entities

Information Management

•  The story so far: – Strict separation between unstructured and structured

data management infrastructures

Inverted Index

Keywords

Information Integration

•  Information integration is still the one of the biggest CS problem out there (according to many e.g., Gartner)

•  Information integration typically requires some sort of mediation 1.  Unstructured Data: keywords, synsets 2.  Structured Data: global schema, transitive closure of

schemas (mostly syntactic)

⇒ nightmarish if 1, and 2. taken separately, horror marathon if considered together

Entities as Mediation

•  Rising paradigm – Store information at the entity granularity –  Integrate information by inter-linking entities

•  Advantages? – Coarser granularity compared to keywords

•  More natural, e.g., brain functions similarly (or is it the other way around?)

– Denormalized information compared to RDBMSs •  Pre-computed joins •  “Semantic” linking

•  Drawbacks?

Example (1): Yahoo! Portal

Image © R. Rabbat

Example (2): Web Data Integration

http://data.semanticweb.org/person/philippe-cudre-mauroux

Focus on 4 Core Problems

1. Extracting entities from (HTML) text (ZenCrowd)

2.  Searching for entities

3.  Accessing entities (DNS^3)

4. Storing entities (Diplodocus[RDF])

Extracting Entities

•  Extracting entities from text is an old problem… – … and is extremely hard, esp. for machines

•  Dozens of approaches have been suggested

•  What if – We want to combine approaches / frameworks? – We want to leverage both human computations &

algorithms?

ZenCrowd

•  Extracts entities from text using state-of-the-art techniques

•  Uses sets of algorithmic matchers to match entities to online concepts

•  Uses dynamic templating to create micro-matching-tasks and publish them on MTurk

•  Combines both algorithmic and human matchers using probabilistic networks

ZenCrowd Architecture

Micro Matching

HTMLPages

HTML+ RDFaPages

LOD Open Data Cloud

CrowdsourcingPlatform

ZenCrowdEntity

Extractors

LOD Index Get Entity

Input Output

Probabilistic Network

Decision Engine

Workers Decisions

AlgorithmicMatchers

“Black-Magic” Component

•  Probabilistic network to integrate a priori & a posteriori information – Agreement of good turkers & algorithms

•  Learning process – Constraints

•  Unicity •  Equality (SameAs)

– Giant probabilistic graph •  Instantiated selectively

pw1( ) pw2( )

lf1( ) lf2( )

pl1( ) pl2( )

lf3( )

pl3( )

c11 c22c12c21 c13 c23

u2-3( )sa1-2( )

Does it Work?

•  Improves avg. prec. by 0.14 on average! – Minimal crowd involvement – Embarrassingly parallel problem

Top$US$Worker$

0$ 250$ 500$

Worker&P

recision

Number&of&Tasks&

US$Workers$

IN$Workers$

0.6$0.62$0.64$0.66$0.68$0.7$

0.72$0.74$0.76$0.78$0.8$

1$ 2$ 3$ 4$ 5$ 6$ 7$ 8$ 9$

Precision)

Top)K)workers)

Searching for Entities

•  How can end-users reach entities? ⇒  Keyword search

•  On their names or attributes

– Obviously not ideal •  BM25 on TREC 2011 AOR: MAP=0.15, P@10=0.20 •  Query extension, query completion or pseudo-relevance

feedback yield comparable (or worse) results

Hybrid Entity Search

The Descendants

TheDescendants

titleGeorgeClooney

George Clooney

nameMay 6, 1961

dateOfBirth

ShaileneW

Shailene Woodley

Nov. 15, 1991

dateOfBirth

playsIn

•  Main idea: combine unstructured and structured search –  Inverted index to locate first candidates – Graph queries to refine the results

•  Graph traversals (queries on object properties) •  Graph neighborhoods (queries on

data type properties)

Inverted Index

Keywords

SPARQL

Architecture

LOD Cloud

index()User

Query Annotation and Expansion

Inverted Index

RDF Store

Ranking FunctionsRanking

FunctionsRanking Functions

query()

Entity SearchKeyword Query

intermediatetop-k resultsGraph-Enriched

Results

Graph Traversals(queries on object

properties)

Neighborhoods(queries on datatype

properties)

Structured Inverted Index

WordNet

3rd partysearch engines

Final Ranking Function

Pseudo-Relevance Feedback

Query Refinement

•  Finding the right graph queries to refine results is a real challenge – Which object property to follow? Transitive closures?

– Which data type property to take into account? Scope?

Does it Work?

•  Up to 25% improvement over best IR (stat. sign.) •  Very modest impact on latency (17%)

0" 0.2" 0.4" 0.6" 0.8" 1"

Precision)

Recall)

BM25"baseline"

SAMEAS"

Whom to Ask for Entities?

•  Void + SPARQL end-point •  DOA •  DNS [DNS^3] •  Same_as service / Okkam IDs •  P2P Mesh of entities [idMesh] •  Downscaled registries?

How to Store Entities?

•  Fundamental impedance mismatch between graphs of entities and… – N-ary / decomposition storage model –  Inverted Indices – Key-value paradigms

Entity Storage

•  Materialize the joins! •  Dense-pack the values •  Provide new indices

•  Co-locate •  Co-locate •  Co-locate

DipLODocus: System Architecture

Main Idea - data structures

Template Matching

Hash Table

lexicographic tree to encode URIs

template based indexing

extremely compact lists of homologous nodes

Molecule Clusters •  extremely compact sub-graphs •  precomputed joins

List of Literals •  extremely compact list of sorted values

Basic operations - queries - triple patterns ?x type Student. ?x takesCourse Course0.

?x type Student. ?x takesCourse Course0. ?x takesCourse Course1.

=> intersection of sorted lists

Basic operations - queries aggregates and analytics

?x type Student. ?x age ?y filter (?y < 21)

Basic operations - queries - molecule queries

?a name 'Student1'. ?a ?b ?c. ?c ?d ?e.

Results - LUBM - 100 Universities

35 x faster for OLTP 300 x faster for OLAP

Current Directions

•  ZenCrowd: Web-scale Integration •  Entity Search: recommendation based on graph

mining •  Entity registry: DOA / Downscaling for ad-hoc •  Diplodocus

– Open-Source – Cloud Storage – Visualization (sensor data)

References

•  ZenCrowd [WWW 2012] •  idMesh [WWW 2009] •  Hybrid Entity Search [SIGIR 2012] •  DNS^3 [ISWC 2011 demo] •  Downscaling Entity Registries [Downscale 2012] •  Diplodocus[RDF] [ISWC 2011] •  Graph Data Management for New Application

Domains [VLDB 2011]

Upcoming Events

Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

Documents

Transcript of Entity-Centric Data Management - unifr.chEntity-Centric Data Management Philippe Cudré-Mauroux...

ODBASE 0426.10.04 A Necessary Condition for Semantic Interoperability in the Large Philippe Cudré-Mauroux and Karl Aberer School of Computer and Communication.

©2004, Philippe Cudré-Mauroux Exploiting Localized Metadata in Decentralized Settings Microsoft Research Asia 09.29.04 Philippe Cudré-Mauroux Distributed.

4 mab peach mauroux jehan baptiste

Lecture Notes in Computer Science 9341 · 10/15/2015 · Claudia d’Amato Università degli Studi di Bari, Italy Philippe Cudré-Mauroux University of Fribourg, Switzerland Challenge

TANDBERG Centric 1700 MXP Data Sheet - Vuports Centric 1700 MXP Title TANDBERG Centric 1700 MXP Data Sheet Author TANDBERG Subject TANDBERG Centric 1700 MXP Keywords TANDBERG Centric

Rearrangement planning using object-centric and robot-centric … · 2018-03-29 · Rearrangement planning using object-centric and robot-centric action spaces Jennifer E. King ⇤,

SEMPEDIA : Sémantisation à partir des documents semi … · 2020-01-22 · méthodes d’apprentissage supervisé et semi-supervisé alors que Smirnova and Cudré-Mauroux [2018],

Centric Relation

A (BRIEF) HISTORY OF COMPUTING By Dane Paschal. BIASES Amero-Euro centric Computer science centric Google centric.

EarlyBridge case from product centric to customer centric eb

People-Centric Collaboration · PC Centric Phone Centric Video Centric Next Generation Virtual Workspace ... Collaboration applications virtualized in the data center 1 ... building

©2005, Karl Aberer and Philippe Cudré-Mauroux Semantic Overlay Networks VLDB 2005 Karl Aberer and Philippe Cudré-Mauroux School of Computer and Communication.

Moving from Device Centric to a User Centric Management

Collaborative Innovation: The Only way to Address … Cudre-Maurou… · Collaborative Innovation: The Only way to Address Critical World’s Challenges Nicolas Cudré-Mauroux, PhD

UnboundedConnected Expressive People centric Content centric App business logic.

Meter-Centric to Customer-Centric: Madrileña Red de Gas ...

User-centric vs. System-centric Evaluation of Recommender ... · PDF fileUser-centric vs. System-centric Evaluation of Recommender Systems . Paolo Cremonesi. 1, Franca Garzotto , Roberto

Stop Talking Customer-Centric, Start Being Customer-Centric

Prevention of natural risks and public expenditures by Amélie Mauroux

7. Female Expatriates: Towards a More Inclusive ViewDavoine, Ravasi, Salamin and Cudré-Mauroux, 2013; Fisher, Hutchings and Pinto, 2015). Thus the research on female expatriates could