Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using...
Transcript of Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using...
![Page 1: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/1.jpg)
1
1
Using CrowdSourcingfor Data Analytics
Hector Garcia-Molina(work with Steven Whang, Peter Lofgren, Aditya Parameswaran and others)
Stanford University
• Big Data Analytics
• CrowdSourcing
![Page 2: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/2.jpg)
2
CrowdSourcing
3
Real World Examples
4
Image MatchingTranslation
Image MatchingTranslation
Categorizing ImagesCategorizing Images
Search RelevanceSearch Relevance
Data GatheringData Gathering
![Page 3: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/3.jpg)
3
Many Crowdsourcing Marketplaces!
Many Research Projects!
6
![Page 4: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/4.jpg)
4
7
analytics
data
results
humans
Example tasks:• get missing data• verify results• analyze data
8
analytics
data
results
humans
Example tasks:• get missing data• verify results• analyze data
Key Point:• use humans judiciously
![Page 5: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/5.jpg)
5
Today will illustrate with
• Entity Resolution
• (may cover another topic briefly)
9
10
Traditional Entity Resolution
cleansing
analysis
System 1 System n...
what matches what??
![Page 6: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/6.jpg)
6
Why is ER Challenging?
• Huge data sets
• No unique identifiers
• Missing data
• Lots of uncertainty
• Many ways to skin the cat
11
Simple ER Example
12
![Page 7: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/7.jpg)
7
Simple ER Example
13
a
c
b
d
sim=0.9
sim=0.8
Simple ER Example
14
a
c
b
d
sim=0.9
sim=0.8
![Page 8: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/8.jpg)
8
15
ER: Exact vs Approximate
products
camerasresolvedcameras
CDs
books
...
resolvedCDs
resolvedbooks
...
ER
ER
ER
Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
16
![Page 9: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/9.jpg)
9
Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
17
0.45
0.630.5
0.9
0.950.9
0.7
0.9
0.87
0.7 0.4
0.6
0.5
0.63
Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
18
0.9
0.950.9
0.7
0.9
0.87
0.7
threshold = 0.7
![Page 10: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/10.jpg)
10
Simple ER Algorithm
• Compute pairwise similarities
• Apply threshold
• Perform transitive closure
19
0.9
0.950.9
0.7
0.9
0.87
0.7
Crowd ER
20
![Page 11: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/11.jpg)
11
Same as this?
21
Crowd ER
• First Cut: For every pair of records,ask workers if they match (i.e., get similarity)
22
![Page 12: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/12.jpg)
12
Crowd ER
• First Cut: For every pair of records,ask workers if they match (i.e., get similarity)
• Too expensive!
23
0.45
0.630.5
0.9
0.950.9
0.7
0.9
0.87
0.7 0.4
0.6
0.5
0.63
Crowd ER
• Second Cut: Compute similarities;workers verify "critical" pairs
24
0.45
0.630.5
0.9
0.950.9
0.7
0.9
0.87
0.7 0.4
0.6
0.5
0.63
critical??
![Page 13: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/13.jpg)
13
Crowd ER
• Second Cut: Compute similarities;workers verify "critical" pairs
25
0.45
0.630.5
0.9
0.950.9
0.7
0.9
0.87
0.7 0.4
0.6
0.5
0.63
critical??
Crowd ER
• Second Cut: Compute similarities;workers verify "critical" pairs
26
0.45
0.630.5
0.9
0.950.9
0.7
0.9
0.87
0.7 0.4
0.6
0.5
0.63
critical??
![Page 14: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/14.jpg)
14
27
pairwise analysis
records
clusters
generate questions
Key Point:• use humans judiciously
global analysis
crowd
new evidence
Key Issue: Semantics of Crowd Answer
28
![Page 15: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/15.jpg)
15
Key Issue: Semantics of Crowd Answer
29
?
EDC
B A
Also issue: Similarities as Probabilities
30
sim(a,b) → prob(a,b)
![Page 16: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/16.jpg)
16
Strategy
31
current state
a
b c
0.9
0.5
0.2
ER result use any given ER algorithm
Strategy
32
current state
a
b c
0.9
0.5
0.2 Q(a,b)
Q(b,c)
Q(a,c)
consider ALL possible questions(three in this example)
![Page 17: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/17.jpg)
17
Strategy
33
current state
a
b c
0.9
0.5
0.2 Q(a,b)
Q(b,c)
Q(a,c)
new state
new state
new state
new state
new state
new state
ER result
ER result
ER result
ER result
ER result
ER result
Y
Y
Y
N
N
N
considerpossibleoutcomes
Strategy
34
current state
a
b c
0.9
0.5
0.2
Q(b,c)
new state ER resultY
example
a
b c
0.9
1.0
0.2a
b c
![Page 18: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/18.jpg)
18
Strategy
35
current state
a
b c
0.9
0.5
0.2 Q(a,b)
Q(b,c)
Q(a,c)
new state
new state
new state
new state
new state
new state
ER result
ER result
ER result
ER result
ER result
ER result
score?
score?
score?
score?
score?
score?
Y
Y
Y
N
N
N
Two Remaining Issues
• How do we score an ER result?
36
• Efficiency?
ER result gold standard
F score
![Page 19: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/19.jpg)
19
Gold Standard?
37
a
b c
0.9
0.5
0.2a
b c
1.0
0.6
0.2
sim to prob
Gold Standard?
38
a
b c
0.9
0.5
0.2a
b c
1.0
0.6
0.2
a
b c
a
b c
a
b c
a
b c
0.12
0.48
0.08
0.32
sim to prob
possible worlds
![Page 20: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/20.jpg)
20
Gold Standard?
39
a
b c
0.9
0.5
0.2a
b c
1.0
0.6
0.2
a
b c
a
b c
a
b c
a
b c
0.12
0.48
0.08
0.32
a
b c
0.68
a
b c
0.32sim to prob
possible worlds possible clustering(via ER algorithm)
Strategy
40
current state
a
b c
0.9
0.5
0.2 Q(a,b)
Q(b,c)
Q(a,c)
new state
new state
new state
new state
new state
new state
ER result
ER result
ER result
ER result
ER result
ER result
scorevs GS?Y
Y
Y
N
N
N
scorevs GS?
scorevs GS?
scorevs GS?
scorevs GS?
scorevs GS?
![Page 21: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/21.jpg)
21
Evaluating Efficiently
• See: Steven E. Whang, Peter Lofgren, and H. Garcia-Molina. Question Selection for Crowd Entity Resolution. To appear in Proc. 39th Int'l Conf. on Very Large Data Bases (PVLDB), Trento, Italy, 2013.
41
Sample Result
42
![Page 22: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/22.jpg)
22
43
analytics
data
results
humans
Example tasks:• get missing data• verify results• analyze data
Key Point:• use humans judiciously
Summary
44
analytics
bigdata
Now for something completely different!
DBMS
![Page 23: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/23.jpg)
23
45
analytics
bigdata
Now for something completely different!
DBMS
humans
46
data
DeCo: Declarative CrowdSourcing
DBMS
humans
Enduser
what is best price forNikon DSLR cameras?
![Page 24: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/24.jpg)
24
47
data
DeCo: Declarative CrowdSourcing
DBMS
humans
Enduser
what is best price forNikon DSLR cameras?
model type brand
D7100 DSLR Nikon
7D DSLR Canon
P5000 comp Nikon
• • • • • • • • •
48
data
DeCo: Declarative CrowdSourcing
DBMS
humans
Enduser
what is best price forNikon DSLR cameras?
model type brand
D7100 DSLR Nikon
7D DSLR Canon
P5000 comp Nikon
• • • • • • • • •
what is best price forNikon D7100 camera?
Crowd
![Page 25: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/25.jpg)
25
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
Example with a bit more detail:
Userview
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
⋈o
50
Userview
restaurant
Chez Panisse
Bytes
• • •
restaurant rating
Chez Panisse 4.8
Chez Panisse 5.0
Chez Panisse 4.9
Bytes 3.6
Bytes 4.0
• • • • • •
restaurant cuisine
Chez Panisse French
Chez Panisse California
Bytes California
Bytes California
• • • • • •
• • • • • •
Anchor
Dependent Dependent
Example with a bit more detail:
![Page 26: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/26.jpg)
26
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
⋈o
51
Userview
restaurant
Chez Panisse
Bytes
• • •
restaurant rating
Chez Panisse 4.8
Chez Panisse 5.0
Chez Panisse 4.9
Bytes 3.6
Bytes 4.0
• • • • • •
restaurant cuisine
Chez Panisse French
Chez Panisse California
Bytes California
Bytes California
• • • • • •
• • • • • •
Anchor
Dependent Dependent
fetch rule
fetch ruleBytes
Chez Panisse
fetch rule
Example with a bit more detail:
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
⋈o
52
Userview
restaurant
Chez Panisse
Bytes
• • •
restaurant rating
Chez Panisse 4.8
Chez Panisse 5.0
Chez Panisse 4.9
Bytes 3.6
Bytes 4.0
• • • • • •
restaurant cuisine
Chez Panisse French
Chez Panisse California
Bytes California
Bytes California
• • • • • •
• • • • • •
Anchor
Dependent Dependent
fetch rule
fetch ruleBytes
Chez Panisse
fetch rulefetch rule
French
Example with a bit more detail:
![Page 27: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/27.jpg)
27
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
⋈o
53
Userview
restaurant
Chez Panisse
Bytes
• • •
restaurant rating
Chez Panisse 4.8
Chez Panisse 5.0
Chez Panisse 4.9
Bytes 3.6
Bytes 4.0
• • • • • •
restaurant cuisine
Chez Panisse French
Chez Panisse California
Bytes California
Bytes California
• • • • • •
• • • • • •
Anchor
Dependent Dependent
resolutionrule
resolutionrule
Bytes
Chez Panisse
Example with a bit more detail:
restaurant rating cuisine
Chez Panisse 4.9 French
Chez Panisse 4.9 California
Bytes 3.8 California
• • • • • • • • •
⋈o
54
Userview
restaurant
Chez Panisse
Bytes
• • •
restaurant rating
Chez Panisse 4.8
Chez Panisse 5.0
Chez Panisse 4.9
Bytes 3.6
Bytes 4.0
• • • • • •
restaurant cuisine
Chez Panisse French
Chez Panisse California
Bytes California
Bytes California
• • • • • •
• • • • • •
Anchor
Dependent Dependent
1. Fetch2. Resolve3. Join
Example with a bit more detail:
![Page 28: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/28.jpg)
28
Fetch[n]Fetch[ln]Fetch
[ln,c]Scan
D1(n,l)ScanA(n)
55
Join
Join
AtLeast [8]SELECT n,l,cFROM countryWHERE l = ‘Spanish’ATLEAST 8
Resolve[m3]Resolve[d.e]
Fetch[nl]
ScanD2(n,c)
Resolve[m3]
Fetch[nl,c]
Many Query Processing Challenges
Filter [l=‘Spanish’]
Fetch[nl,c]
Deco Prototype V1.0
56
![Page 29: Using CrowdSourcing for Data Analytics - Drexel CCIcci.drexel.edu/bigdata/bigdata2013/Using CrowdSourcing for Data Analytics.pdf · Using CrowdSourcing for Data Analytics Hector Garcia-Molina](https://reader030.fdocuments.us/reader030/viewer/2022040208/5e3351bef0a13913233064cf/html5/thumbnails/29.jpg)
29
Conclusion
• Crowdsourcing is important formanaging data!
• Still many challenges ahead!
57
58