CrowdMining: Mining association rules from the crowd
Tova Milo

Transcript of keynote slides, DBCrowd 2013 (dbweb.enst.fr/events/dbcrowd2013/slides/keynote2.pdf)

Page 1

Crowd Mining

Tova Milo

Page 2

Data Everywhere

The amount and diversity of data being generated and collected is exploding:

Web pages, sensor data, satellite pictures, DNA sequences, …


Page 3

From Data to Knowledge


Buried in this flood of data are the keys to

- New economic opportunities

- Discoveries in medicine, science and the humanities

- Improving productivity & efficiency

However, raw data alone is not sufficient!!!

We can only make sense of our world by turning this data into knowledge and insight.

Page 4

The research frontier


• Knowledge representation.

• Knowledge collection, transformation, integration, sharing.

• Knowledge discovery.

We will focus today on human knowledge

Think of humanity and its collective mind expanding…

Page 5

Background – Crowd (Data) sourcing

The engagement of crowds of Web users for data procurement

Page 6

General goals of crowdsourcing

• Main goal – "Outsourcing" a task (data collection & processing) to a crowd of users

• Kinds of tasks:
  – Can be performed by a computer, but inefficiently
  – Can be performed by a computer, but inaccurately
  – Tasks that can't be performed by a computer


So, are we done?

Page 7

MoDaS (Mob Data Sourcing)

• Crowdsourcing has tons of potential, yet limited success so far…
  – Difficulty of managing huge volumes of data and users of questionable quality and reliability

• Every single initiative had to battle, almost from scratch, the same non-trivial challenges

• The ad hoc solutions, even when successful, are application-specific and rarely shareable

=> We need solid scientific foundations for Web-scale data sourcing, with a focus on knowledge management using the crowd

(EU-FP7 ERC project: MoDaS - Mob Data Sourcing)

Page 8

Challenges – (very) brief overview

• What questions to ask? [SIGMOD13, VLDB13]

• How to define & determine correctness of answers? [WWW12, ICDE11]

• Who to ask? How many people? How to best use the resources? [ICDE12, VLDB13, ICDT13, ICDE13]


[Slide graphic: related techniques - data mining, data cleaning, probabilistic data, optimizations and incremental computation]

Page 9

A simple example – crowd data sourcing (Qurk)



[Table "people": picture and name columns - Lucy, Don, Ken, …]

The goal: Find the names of all the women in the people table

SELECT name FROM people p WHERE isFemale(p)

isFemale(%name, %photo):
  Question: "Is %name a female?" (shown with %photo)
  Answers: "Yes" / "No"
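To make the pipeline concrete, here is a minimal Python sketch (not Qurk's actual API) of how such a crowd-powered predicate can be emulated: each row becomes a yes/no question, several workers are asked, and the majority answer decides the filter. ask_worker is a hypothetical stand-in for posting a task on a crowdsourcing platform.

from collections import Counter

def ask_worker(question, photo):
    """Hypothetical stub: post the question (with the photo) on a
    crowdsourcing platform and return one worker's "Yes"/"No" answer."""
    raise NotImplementedError

def is_female(name, photo, num_workers=3):
    # Ask several workers and take the majority vote, smoothing out
    # individual mistakes.
    answers = [ask_worker(f"Is {name} a female?", photo)
               for _ in range(num_workers)]
    return Counter(answers).most_common(1)[0][0] == "Yes"

def women_names(people):
    # people: iterable of (name, photo) rows, i.e. the "people" table.
    # Equivalent to: SELECT name FROM people p WHERE isFemale(p)
    return [name for name, photo in people if is_female(name, photo)]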

Page 10

A simple example – crowd data sourcing


[Table "people": picture and name columns - Lucy, Don, Ken, …]

Page 11

Crowd Mining: Crowdsourcing in an open world

• Human knowledge forms an open world

• Assume we want to find out what is interesting and important in some domain area

Folk medicine, people’s habits, …

• What questions to ask?


Page 12

Back to classic databases...

• Significant data patterns are identified using data mining techniques.

• A useful type of pattern: association rules

– E.g., stomach ache → chamomile

• Queries are dynamically constructed in the learning process

• Is it possible to mine the crowd?


Page 13

Turning to the crowd

Let us model the history of every user as a personal database

• Every case = a transaction consisting of items

• Not recorded anywhere – a hidden DB

– It is hard for people to recall many details about many transactions!

– But … they can often provide summaries, in the form of personal rules

“To treat a sore throat I often use garlic”


Treated a sore throat with garlic and oregano leaves…

Treated a sore throat and low fever with garlic and ginger …

Treated a heartburn with water, baking soda and lemon…

Treated nausea with ginger, the patient experienced sleepiness… …

Page 14

Two types of questions

• Open questions – free recollection (mostly simple, prominent patterns)
  E.g.: "Tell me how you treat a particular illness" – "I typically treat nausea with ginger infusion"

• Closed questions – concrete questions (may be more complex)
  E.g.: "When a patient has both headaches and fever, how often do you use a willow tree bark infusion?"

We use the two types interleaved.


Page 15

Contributions (at a very high level)

• Formal model for crowd mining; allowed questions and the answers interpretation; personal rules and their overall significance.

• A Framework of the generic components required for mining the crowd

• Significance and error estimations. [and, how will this change if we ask more questions…]

• Crowd-mining algorithms

• [Implementation & benchmark. both synthetic & real data.]


Page 16

The model: User support and confidence

• A set of users U

• Each user u U has a (hidden!) transaction database Du

• Each rule X Y is associated with:

16

user support

user confidence

Crowd Mining
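As a sanity check on these definitions, here is a small Python sketch computing one user's support and confidence for a rule over a toy (normally hidden) transaction database; the item names are illustrative, echoing the folk-medicine examples above.

def user_support_confidence(transactions, lhs, rhs):
    """Per-user support/confidence of the rule lhs -> rhs.
    support    = fraction of transactions containing lhs ∪ rhs
    confidence = among transactions containing lhs, the fraction
                 that also contain rhs"""
    lhs, rhs = set(lhs), set(rhs)
    both = sum(1 for t in transactions if lhs | rhs <= set(t))
    lhs_count = sum(1 for t in transactions if lhs <= set(t))
    support = both / len(transactions) if transactions else 0.0
    confidence = both / lhs_count if lhs_count else 0.0
    return support, confidence

# Toy hidden DB for one user (cf. the sore-throat cases above):
du = [{"sore_throat", "garlic", "oregano"},
      {"sore_throat", "low_fever", "garlic", "ginger"},
      {"heartburn", "water", "baking_soda", "lemon"},
      {"nausea", "ginger"}]
print(user_support_confidence(du, {"sore_throat"}, {"garlic"}))  # (0.5, 1.0)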

Page 17

Model for closed and open questions

• Closed questions: X ? Y – Answer: (approximate) user support and confidence

• Open questions: ? ? ? – Answer: an arbitrary rule with its user support and confidence

“I typically have a headache once a week. In 90% of the

times, coffee helps.


Page 18

• Significant rules: rules whose mean user support and confidence (across users) are above some specified thresholds Θs, Θc.

• Goal: estimate rule significance while asking the crowd the smallest possible number of questions


Page 19

Framework components

• One generic framework for crowd-mining

• One particular choice of implementation of all black boxes

• We do not claim any optimality

• But we validate by experiments


Page 20

Estimating the mean distribution

• Treating the current answers as a random sample of a hidden distribution, we can approximate the distribution of the hidden mean of (support, confidence), by the central limit theorem, as a Gaussian N(μ, Σ/K), where:
  – μ – the sample average
  – Σ – the sample covariance
  – K – the number of collected samples

• In a similar manner we estimate the hidden distribution
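A minimal numpy sketch of this step, assuming the central-limit approximation just described: the mean of K i.i.d. (support, confidence) answers is roughly Gaussian with covariance Σ/K.

import numpy as np

def mean_distribution(answers):
    """answers: K x 2 array of (support, confidence) answers for one rule.
    Returns the Gaussian approximation N(mu, Sigma/K) of the hidden mean."""
    a = np.asarray(answers, dtype=float)
    k = len(a)
    mu = a.mean(axis=0)              # sample average
    sigma = np.cov(a, rowvar=False)  # 2x2 sample covariance
    return mu, sigma / k             # mean and covariance of the mean

answers = [(0.6, 0.9), (0.5, 0.8), (0.7, 0.95), (0.55, 0.85)]
mu, cov = mean_distribution(answers)
print(mu, cov)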

Page 21

Rule significance and error probability

• Define Mr as the probability mass (of the estimated mean distribution) above both thresholds for rule r

• r is estimated to be significant if Mr is greater than 0.5

• The error probability is the remaining mass, min(Mr, 1 - Mr)

• Estimate how the error will change if another question is asked

• Choose the rule with the largest expected error reduction
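A sketch of how Mr and the error probability can be estimated from the Gaussian approximation above. The slides do not fix a numeric method; Monte Carlo sampling is used here as one simple option (a closed form via the bivariate normal CDF would also work).

import numpy as np

rng = np.random.default_rng(0)

def significance_mass(mu, cov, theta_s, theta_c, n=100_000):
    """M_r: probability that the hidden mean (support, confidence)
    lies above both thresholds, under N(mu, cov)."""
    samples = rng.multivariate_normal(mu, cov, size=n)
    return np.mean((samples[:, 0] >= theta_s) & (samples[:, 1] >= theta_c))

def error_probability(m_r):
    # r is declared significant iff M_r > 0.5; the error probability
    # is the mass on the other side of that decision.
    return min(m_r, 1.0 - m_r)

m_r = significance_mass(np.array([0.55, 0.8]),
                        np.array([[0.01, 0.0], [0.0, 0.01]]),
                        theta_s=0.5, theta_c=0.7)
print(m_r, error_probability(m_r))
# Question selection (sketch): re-estimate each rule's error as if one
# more answer arrived, and ask about the rule whose expected error
# reduction is largest.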

Page 22

Completing the picture (first attempt…)

• Which rules should be considered?
  Similarly to classic data mining (e.g., Apriori): start with small rules, then expand to rules similar to significant ones

• Should we ask an open or a closed question?
  Similarly to sequential sampling: use some fixed ratio of open/closed questions to balance the tradeoff between precision and recall
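Putting these two heuristics together, a highly simplified Python sketch of the resulting control loop; all four callbacks are placeholders for the components discussed above, and the fixed open/closed ratio is the one mentioned on the slide.

import random

def crowd_mine(ask_open, ask_closed, expand, is_significant,
               budget, open_ratio=0.3):
    """ask_open()     -> (rule, support, confidence) volunteered by a user
    ask_closed(rule)  -> (support, confidence) for a given rule
    expand(rule)      -> candidate rules similar to a significant rule
    is_significant(rule, answers) -> current estimate from the answers"""
    answers = {}     # rule -> list of (support, confidence) answers
    candidates = []  # queue of rules to ask closed questions about
    for _ in range(budget):
        if not candidates or random.random() < open_ratio:
            rule, s, c = ask_open()   # free recollection: may surface new rules
        else:
            rule = candidates.pop(0)  # concrete question about a known rule
            s, c = ask_closed(rule)
        answers.setdefault(rule, []).append((s, c))
        if is_significant(rule, answers[rule]):
            candidates.extend(r for r in expand(rule)
                              if r not in answers and r not in candidates)
    return [r for r, a in answers.items() if is_significant(r, a)]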

Page 23

Completing the picture (second attempt…)

How do we measure the efficiency of crowd-mining algorithms?

• Two distinct cost factors:
  – Crowd complexity: the number of crowd queries used by the algorithm
  – Computational complexity: the complexity of computing the crowd queries and processing the answers
  [A crowd complexity lower bound is a trivial computational complexity lower bound]

• There is a tradeoff between the complexity measures:
  – Naïve question selection -> more crowd questions

Page 24

Semantic knowledge can save work


Page 25

Semantic knowledge can save work

Given a taxonomy of is-a relationships among items, e.g. espresso is a coffee:

frequent({headache, espresso}) ⇒ frequent({headache, coffee})

Advantages:

• Allows inference on itemset frequencies

• Allows avoiding semantically equivalent itemsets: {espresso}, {espresso, coffee}, {espresso, beverage}…
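A small Python sketch of both advantages, assuming the taxonomy is given as a child-to-parent ("is-a") map: frequency inference by monotonicity, and canonicalization that keeps only the most specific items so semantically equivalent itemsets collapse.

# Taxonomy as child -> parent ("is-a") edges, e.g. espresso is a coffee.
PARENT = {"espresso": "coffee", "coffee": "beverage"}

def ancestors(item):
    out = set()
    while item in PARENT:
        item = PARENT[item]
        out.add(item)
    return out

def generalizes(general, specific):
    """Itemset `general` is implied by `specific` if each of its items
    is an item of `specific` or an ancestor of one; so if `specific`
    is frequent, `general` must be frequent too."""
    return all(any(g == s or g in ancestors(s) for s in specific)
               for g in general)

def canonical(itemset):
    # Drop items that are ancestors of other items in the set:
    # {espresso, coffee} is equivalent to {espresso}.
    return frozenset(i for i in itemset
                     if not any(i in ancestors(j) for j in itemset))

# frequent({headache, espresso}) => frequent({headache, coffee}):
print(generalizes({"headache", "coffee"}, {"headache", "espresso"}))  # True
print(canonical({"espresso", "coffee", "beverage"}))  # frozenset({'espresso'})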

Page 26

Complexity boundaries

Given a taxonomy Ψ, we measure complexity in terms of

• the input Ψ: its size, shape, width…

• the output: the frequent itemsets, represented compactly by
  – the Maximal Frequent Itemsets (MFI)
  – the Minimal Infrequent Itemsets (MII)


Page 27

Complexity boundaries

• Notations:

– |Ψ| - the taxonomy size

– |I(Ψ)| - the number of itemsets (modulo equivalences)

– |S(Ψ)| - the number of possible solutions


Page 28

Now, back to the bigger picture…

The user's question in natural language:

"I'm looking for activities to do in a child-friendly attraction in New York, and a good restaurant nearby"

Some of the answers:

"You can go bike riding in Central Park and eat at Maoz Vegetarian. Tips: Rent bikes at the boathouse"

"You can go visit the Bronx Zoo and eat at Pine Restaurant. Tips: Order antipasti at Pine. Skip dessert and go for ice cream across the street"

Page 29

Page 30

ID  Transaction details
1   <Football> doAt <Central_Park>
2   <Biking> doAt <Central_Park>. <BaseBall> doAt <Central_Park>. <Rent_Bikes> doAt <Boathouse>
3   <Falafel> eatAt <Maoz Veg.>
4   <Antipasti> eatAt <Pine>
5   <Visit> doAt <Bronx_Zoo>. <Antipasti> eatAt <Pine>

Page 31

Crowd Mining Query language (based on SPARQL and DMQL)


FIND association rules RELATED TO x, y+, z, u?, p?, v?
WHERE {
  $x instanceOf <Attraction>.
  $x inside <NYC>.
  $x hasLabel "Child Friendly".
  $y subClassOf* <Activity>.
  $z instanceOf <Restaurant>.
  $z nearBy $x. ...}
MATCHING
  ( {} => {([] eatAt $z.)}
    WITH support THRESHOLD = 0.007 )
  AND
  ( {([] doAt $x)} => {($y doAt $x), ($u $p $v)}
    WITH support THRESHOLD = 0.01
    WITH confidence THRESHOLD = 0.2
    RETURN MFI )

Annotations (callouts on the slide):

• RELATED TO: determines the group size for each variable

• SPARQL-like WHERE clause; $x is a variable

• subClassOf* – a path of length 0 or more

• The left-hand and right-hand parts of the rule are defined as SPARQL patterns

• Mining parameters: thresholds, output granularity, etc.

• Several rule patterns can be mined, joined by AND/OR

Page 32

(Closed) Questions to the crowd


"How often do you do something in Central Park?"
  { ( [] doAt <Central_Park> ) }  –  Supp = 0.7

"How often do you eat at Maoz Vegeterian?"
  { ( [] eatAt <Maoz_Vegeterian> ) }  –  Supp = 0.1

"How often do you swim in Central Park?"
  { ( <Swim> doAt <Central_Park> ) }  –  Supp = 0.3

Page 33

(Open) Questions to the crowd


"What do you do in Central Park?"
  { ( $y doAt <Central_Park> ) }  –  $y = Football, supp = 0.6

"What else do you do when biking in Central Park?"
  { ( <Biking> doAt <Central_Park> ), ( $y doAt <Central_Park> ) }  –  $y = Football, supp = 0.6

"Complete: When I go biking in Central Park…"
  { ( <Biking> doAt <Central_Park> ), ( $u $p $v ) }  –  $u = <Rent_Bikes>, $p = doAt, $v = <BoatHouse>, supp = 0.6

Page 34

Example of extracted association rules

• < Ball Games, doAt , Central_Park > (Supp:0.4 Conf:1)

< [ ] , eatAt , Maoz_Vegeterian > (Supp:0.2 Conf:0.2)

• < visit , doAt , Bronx_Zoo > (Supp: 0.2 Conf: 1)

< [ ] , eatAt , Pine_Restaurant > (Supp: 0.2 Conf: 0.2)


Page 35

Can we trust the crowd?


Common solution: ask multiple times

We may get different answers:
• Legitimate diversity
• Wrong answers / lies

Page 36


Can we trust the crowd?


Page 37

Things are non-trivial…

• Different experts for different areas

• "Difficult" questions vs. "simple" questions

• Data is added and updated all the time

• Optimal use of resources… (both machines and humans)

Solutions based on:
- Statistical/mathematical models
- Declarative specifications
- Provenance

Page 38

Summary


The crowd is an incredible resource! Many challenges:
• (Very) interactive computation
• A huge amount of data
• Varying quality and trust

“Computers are useless, they can only give you answers” - Pablo Picasso

But, as it seems, they can also ask us questions!

Page 39

Thanks


Antoine Amarilli, Yael Amsterdamer, Rubi Boim, Susan Davidson, Ohad Greenshpan, Benoit Gross, Yael Grossman, Ezra Levin, Ilia Lotosh, Slava Novgorodov, Neoklis Polyzotis, Sudeepa Roy, Pierre Senellart, Amit Somech, Wang-Chiew Tan…

(EU-FP7 ERC project: MoDaS - Mob Data Sourcing)