Statistical Learning Theory Meets Big Data
Randomized algorithms for frequent itemsets
Eli Upfal
Brown University
Data, data, data
● “In God we trust, all others (must) bring data”
  Prof. W. E. Deming, Statistician (attributed)
● “Every two days now we create as much information as we did from the dawn of civilization up until 2003”
  Eric Schmidt, Google Exec. Chairman, 2009
● “Data explosion is bigger than Moore's Law”
  Marissa Mayer, Yahoo! CEO, 2009
Big Data
● Big data requires big storage and long execution time
● It's great to have big data – but do we need to process all of it?
● Trade off accuracy of results for execution speed
● Can approximation algorithms for data analysis be fast and still guarantee high-quality results?
Frequent Itemsets
● Market Basket Analysis
● You own a grocery store
● You have copies of the receipts from sales
  – For each customer, you have the list of products she bought
● Interested in what groups of products are bought together (correlated items)
  – Business decisions
  – Optimization
Settings
● Transactional dataset D
Figure from Tan et al. - Introduction to Data Mining

Transactions
● Each row of the dataset D is a transaction

Items
● Transactions are built on items from a domain

Itemsets
● Sets of items are called itemsets
Frequency
● Frequency of an itemset X in dataset D:
  – f_D(X): fraction of transactions of D containing X
  – Example: f_D({Milk}) = 4/5, f_D({Bread, Milk}) = 3/5
Figure from Tan et al. - Introduction to Data Mining
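As a concrete illustration of the definition, a small Python sketch (the function name and the toy dataset are mine; the transactions are chosen to be consistent with the slide's Milk and {Bread, Milk} numbers):

```python
# Frequency of an itemset X in a dataset D: the fraction of transactions
# of D that contain X. Function and variable names are illustrative.
def frequency(dataset, itemset):
    itemset = frozenset(itemset)
    return sum(1 for t in dataset if itemset <= t) / len(dataset)

# A toy market-basket dataset matching the slide's numbers.
D = [
    frozenset({"Bread", "Milk"}),
    frozenset({"Bread", "Diapers", "Beer", "Eggs"}),
    frozenset({"Milk", "Diapers", "Beer", "Coke"}),
    frozenset({"Bread", "Milk", "Diapers", "Beer"}),
    frozenset({"Bread", "Milk", "Diapers", "Coke"}),
]
print(frequency(D, {"Milk"}))           # 0.8  (4/5)
print(frequency(D, {"Bread", "Milk"}))  # 0.6  (3/5)
```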
Frequent Itemsets
● Frequent Itemsets mining with respect to a frequency threshold θ:
  – Find all itemsets X with frequency f_D(X) ≥ θ
● Top-K Frequent Itemsets:
  – Find all itemsets at least as frequent as the k-th most frequent
● Association Rules X → Y:
  – The presence of X in a transaction is correlated with the presence of Y in the transaction
Frequent Itemsets mining
● Well-studied classical problem in data analysis
● There are algorithms to extract the exact collection of FI's
  – APriori [AS94], FPGrowth [HPY00], Eclat [Zaki00]
● Running time depends on the number of transactions in the dataset and on the number of frequent itemsets
  – 10^8 transactions is considered a “normal” size
  – 10^4 items → 2^(10^4) itemsets total
  – A transaction with d items contains 2^d − 1 non-empty itemsets
How to speed up mining?
● Decrease the dependency on the number of transactions:
  1. Create a random sample of the dataset
  2. Extract the frequent itemsets from the sample, using a lowered frequency threshold
● Sampling forces us to accept approximate results
● Need to quantify what is acceptable
● Given ε, δ ∈ (0,1), extract a collection W of FI's such that:
  – every itemset with frequency ≥ θ must be in W
  – no itemset with frequency < θ − ε may be in W
  – itemsets with frequency in [θ − ε, θ) may or may not be in W
[Figure: frequency axis from 0 to 1, split into “must not be in W”, “may be in W”, and “must be in W” regions]
Observation
● Data mining is exploratory in nature
● Exact results are usually not needed
  – We want a good, realistic idea of the process
● Fast analysis is more important
  – ... if it still guarantees good-enough results!
Riondato et al., PARMA, CIKM'12
How to speed up mining?
● Given ε, δ, extract from the sample a collection W of itemsets that satisfies the guarantee above
● Problem: choose the right sample size and the right lowered frequency threshold
Choosing the lowered threshold
● If we have, for all itemsets X simultaneously,
  |f_S(X) − f_D(X)| ≤ ε/2
  then mining the sample S at threshold θ − ε/2 does the trick
● Need a sample size |S| s.t., with prob. at least 1 − δ, for all itemsets X, we have |f_S(X) − f_D(X)| ≤ ε/2
Choosing the sample size
● Naïve approach
● Given an itemset X, the number of transactions of the sample S containing X is distributed like a binomial: Bin(|S|, f_D(X))
● Use the Chernoff bound to bound Pr(|f_S(X) − f_D(X)| > ε/2)
● Apply the union bound over all itemsets to find |S| such that the bound holds for all of them simultaneously
● Problem: there is an exponential number of itemsets
  – The sample size would be too large (linear in the number of items)
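To see why the naive approach blows up, here is a sketch of the resulting sample-size bound; the constants follow Hoeffding's inequality, and the function name is mine:

```python
import math

# Hoeffding: Pr(|f_S(X) - f_D(X)| > eps/2) <= 2*exp(-|S| * eps^2 / 2).
# A union bound over all 2^n itemsets (n items) then requires
#   |S| >= (2 / eps^2) * (n * ln 2 + ln(2 / delta)),
# i.e. a sample size linear in the number of items.
def naive_sample_size(n_items, eps, delta):
    return math.ceil((2 / eps**2) * (n_items * math.log(2) + math.log(2 / delta)))

# With 10^4 items, the bound is in the millions of transactions:
print(naive_sample_size(10_000, eps=0.05, delta=0.01))
```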
Vapnik-Chervonenkis Dimension
● Combinatorial property of a collection of subsets from a domain
● Measures the “richness” or “expressivity” of the collection of subsets
● If we know the VC-dim of a collection of subsets, we can compute the minimum sample size needed to approximate the sizes of all the subsets using the sample
Range spaces
● VC-Dimension is defined on range spaces
● (B,R): range space
  – B: domain
  – R: collection of subsets from B (ranges)
● No restrictions:
  – B can be infinite
  – R can be infinite
  – R can contain infinitely-large subsets of B
Vapnik-Chervonenkis Dimension
● Range space (B,R)
● For any C ⊆ B, define the projection P_R(C) = { r ∩ C : r ∈ R }
● C is shattered if P_R(C) = 2^C (every subset of C is obtained)
● The VC-Dimension of (B,R) is the size of the largest shattered subset of B
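The definition can be checked by brute force on tiny range spaces. A sketch (all names mine), using discrete intervals over {1, 2, 3} as a toy example, where a 2-point set is shattered but the full 3-point set is not:

```python
# C is shattered by R iff projecting every range onto C yields all 2^|C|
# subsets of C. Brute force; only feasible for tiny examples.
def is_shattered(C, ranges):
    C = frozenset(C)
    projections = {C & frozenset(r) for r in ranges}
    return len(projections) == 2 ** len(C)

# B = {1, 2, 3}, R = all discrete intervals {i, ..., j}.
intervals = [{1}, {2}, {3}, {1, 2}, {2, 3}, {1, 2, 3}]
print(is_shattered({1, 3}, intervals))     # True:  all 4 subsets of {1,3} appear
print(is_shattered({1, 2, 3}, intervals))  # False: {1, 3} is never a projection
```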
Example of VC-Dimension
● B = ℝ²
● R = all axis-aligned rectangles in ℝ²
● Shattering 4 points: easy
  – Take any 4 points s.t. no 3 of them are aligned
  – Need 16 rectangles, one per subset
● Shattering 5 points: impossible
  – Take any 5 points
  – One of them is contained in every rectangle containing the other four
  – Impossible to find a rectangle containing only the other four
● VC(B,R) = 4
Approximating sizes of ranges
● Sample Theorem:
  – Let (B,R) have VC(B,R) ≤ d. Given ε, δ ∈ (0,1), let S be a collection of points from B sampled uniformly at random. If
    |S| ≥ (c/ε²)(d + ln(1/δ))
    (with c a universal constant), then, with probability at least 1 − δ,
    | |r ∩ S|/|S| − |r|/|B| | ≤ ε for all r ∈ R
● Can approximate the sizes of all ranges r ∈ R simultaneously
  – No need for a union bound
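In code, the sample-size expression reads as below. The theorem leaves the universal constant c unspecified; the default c = 0.5 used here is an empirical estimate from the literature, not a value from the slides:

```python
import math

# |S| >= (c / eps^2) * (d + ln(1 / delta)), with c a universal constant.
# The bound depends only on the VC-dimension d, eps, and delta --
# not on |B| or |R|, which is the whole point.
def vc_sample_size(d, eps, delta, c=0.5):
    return math.ceil((c / eps**2) * (d + math.log(1 / delta)))

print(vc_sample_size(d=10, eps=0.01, delta=0.1))
```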
VC-Dimension for Frequent Itemsets
● B = dataset D (set of transactions)
● For an itemset X, T_D(X) = transactions of D containing X
● R = { T_D(X) : X is an itemset }
● f_D(X) = |T_D(X)| / |D|
● To approximate the frequencies using the sample theorem, we need an upper bound to VC(B,R)
Upper Bound to VC-Dim for FI's
● Theorem (the d-index of a dataset):
  – Let d be the maximum integer such that D contains at least d transactions of length at least d. Then
    VC(B,R) ≤ d
The d-index of the dataset
● d-index d: the maximum integer such that D contains at least d transactions of length at least d
● Example: this dataset has d-index d = 3:
  – bread beer milk coffee
  – chips coke pasta
  – bread coke chips
  – milk coffee
  – pasta milk
● Independent of |D|
● Can be computed with a single scan of the dataset
● d is an upper bound to the VC-dimension
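The single-scan computation can be sketched with a min-heap, the same streaming trick used for the h-index; the code is my illustration, not the paper's:

```python
import heapq

# d-index: the largest d such that at least d transactions have length >= d.
# One pass over the data; the min-heap keeps transaction lengths, maintaining
# the invariant that every kept length is >= the current heap size.
def d_index(transactions):
    heap = []
    for t in transactions:
        heapq.heappush(heap, len(set(t)))
        if heap[0] < len(heap):
            heapq.heappop(heap)
    return len(heap)

# The slide's example: transaction lengths 4, 3, 3, 2, 2 -> d = 3.
D = [
    ["bread", "beer", "milk", "coffee"],
    ["chips", "coke", "pasta"],
    ["bread", "coke", "chips"],
    ["milk", "coffee"],
    ["pasta", "milk"],
]
print(d_index(D))  # 3
```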
Choosing the right sample size
● Theorem:
  – Let ε, δ ∈ (0,1)
  – Let D be a dataset and d be the max integer such that D contains at least d transactions of length at least d
  – Let S be a collection of transactions of D sampled independently and uniformly at random, with
    |S| ≥ (4c/ε²)(d + ln(1/δ))
  – Then, with probability at least 1 − δ, for all itemsets X: |f_S(X) − f_D(X)| ≤ ε/2
Algorithm
● To extract an approximate collection of Frequent Itemsets:
  1. Compute d for the dataset D
  2. Compute the sample size given d (and ε, δ)
  3. Create a random sample S from D
  4. Extract the Frequent Itemsets from S using the lowered threshold θ − ε/2
● Theorem: With probability at least 1 − δ, the returned set W of itemsets satisfies the desired property
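The four steps above can be put together in a self-contained sketch. The constant c = 0.5 and the tiny brute-force miner are illustrative stand-ins; a real implementation would run APriori or FP-Growth on the sample:

```python
import math
import random
from itertools import combinations

def approximate_fis(dataset, theta, eps, delta, c=0.5, max_size=2):
    # Step 1: d-index = max k s.t. the k-th largest transaction length is >= k
    lengths = sorted((len(set(t)) for t in dataset), reverse=True)
    d = max((k for k, l in enumerate(lengths, 1) if l >= k), default=0)
    # Step 2: sample size from the VC-dimension bound
    size = math.ceil((4 * c / eps**2) * (d + math.log(1 / delta)))
    # Step 3: uniform random sample with replacement
    sample = [set(random.choice(dataset)) for _ in range(size)]
    # Step 4: mine the sample at the lowered threshold theta - eps/2
    lowered = theta - eps / 2
    items = sorted({i for t in sample for i in t})
    result = {}
    for k in range(1, max_size + 1):
        for X in combinations(items, k):
            f = sum(1 for t in sample if set(X) <= t) / len(sample)
            if f >= lowered:
                result[frozenset(X)] = f
    return result

# Toy usage on a five-transaction market-basket dataset:
D = [
    {"Bread", "Milk"},
    {"Bread", "Diapers", "Beer", "Eggs"},
    {"Milk", "Diapers", "Beer", "Coke"},
    {"Bread", "Milk", "Diapers", "Beer"},
    {"Bread", "Milk", "Diapers", "Coke"},
]
random.seed(42)
print(approximate_fis(D, theta=0.7, eps=0.4, delta=0.1))
```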
A closer look...
● Expression of the sample size:
  |S| ≥ (4c/ε²)(d + ln(1/δ))
● d: the maximum integer such that D contains at least d transactions of length at least d
  – does not depend on |D|
  – does not depend on the number of items
  – does not depend on the number of itemsets
● The time to mine S does not depend on |D|!
Proof (Intuition)
● For a set of k transactions to be shattered, each transaction must appear in 2^(k−1) different T_D(X)'s, where X is an itemset
● A transaction only appears in the T_D(X)'s of the itemsets X it contains
● A transaction of length w contains 2^w − 1 itemsets
● Need w ≥ k for the transaction to belong to a shattered set of size k
● To shatter k transactions, they must all have length ≥ k
● The max k for which this happens is an upper bound to the VC-Dim
Experiments
● The sample always fits into main memory
● The output always satisfies the required approximation guarantees
  – Frequency accuracy even better than guaranteed
● Mining time significantly improved
Conclusions
We showed how VC-dimension, a highly theoretical concept from statistical learning theory, can help in developing an efficient algorithm to approximate the collection of frequent itemsets, addressing one of the Big Data challenges through sampling.

Riondato & Upfal: Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML/PKDD 2012
PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce
Matteo Riondato, Justin A. DeBrabant, Rodrigo Fonseca, Eli Upfal
Goals
● Extract high quality approximation of FI(D,θ) by mining several small samples of D in parallel
● Adapt to available computational resources● Minimize data replication and communication● Achieve high parallelism
Why parallelize?
● For a very high quality approximation, the sample would still be large
● Sample creation time depends on the size of the dataset
● Parallelization allows us to achieve:
  – higher accuracy (lower ε)
  – higher confidence (lower δ)
  – faster results
● Makes sense when the dataset is stored on a distributed filesystem
MapReduce
● Framework for easy parallelization on large clusters of commodity machines
● Allows very fast analysis of large datasets
● Algorithms expressed through two functions:
  – Map: (k, v) → (k1, v1), (k2, v2), ...
  – Reduce: (r, (w1, w2, ...)) → (k1, v1), (k2, v2), ...
● It is expensive to move data around (shuffling)
  – Design goal: avoid data replication
The optimization framework
● Sample should fit into local memory
● No more samples than the number of processors
● Exploit the available computational resources to maximize the quality of the approximation
● Constraints and objective function expressed in a Mixed Integer Non-Linear Program (MINLP)
  – Convex feasibility region and convex objective function
  – Easy and fast to solve with a global MINLP solver
The Algorithm: PARMA
● Very simple design: 2 MapReduce rounds
● 1st round:
  – Map: create the samples
  – Reduce: mine each sample
● 2nd round:
  – Map: identity function
  – Reduce: aggregation/filtering of results
1st Round: Sampling and Mining
● Parallelization of the sequential algorithm
● Map: sample creation
  – Random sampling with replacement ... in parallel!
  – Input: (transaction id, transaction)
  – Output: (sample id, transaction)
● Reduce: mining of each sample in parallel using the lowered frequency threshold θ − ε/2
  – Can use any sequential exact mining algorithm (APriori, FPGrowth, ...)
  – Input: (sample id, (transaction1, transaction2, ...))
  – Output: (itemset, (frequency, confidence interval))
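One way to sketch the map side of this round in plain Python: each mapper decides independently, per sample, how many copies of its transaction to emit, so sampling with replacement needs no coordination between mappers. The Binomial-per-transaction scheme and all names here are my illustration of the idea, not PARMA's actual code:

```python
import random
from collections import defaultdict

# Map: transaction -> list of (sample_id, transaction) pairs.
# Each transaction goes to sample sid a Binomial(m, 1/n_total) number of
# times, so each sample has expected size m (exact sizes fluctuate).
def round1_map(transaction, n_samples, m, n_total):
    pairs = []
    for sid in range(n_samples):
        copies = sum(1 for _ in range(m) if random.random() < 1 / n_total)
        pairs.extend((sid, transaction) for _ in range(copies))
    return pairs

# Shuffle: group values by key, as the MapReduce framework would.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Each reducer would then mine its groups[sid] at threshold theta - eps/2.
```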
2nd Round: Aggregation
● Map: identity function
● Reduce: filtering and aggregation
  – Input: (itemset, ((freq1, conf_int1), (freq2, conf_int2), ...))
  – Output: (itemset, (frequency, confidence interval))
● An itemset is output only if it was frequent in a sufficient number of samples (determined by the optimization framework)
● Confidence interval: the smallest interval containing a sufficient number of the estimations freq1, freq2, ...
● Frequency: median of the confidence interval
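A sketch of this reducer's logic; the parameter `required` (standing in for the “sufficient number of samples” computed by the optimization framework) and all other names are mine:

```python
# Reduce: (itemset, local frequency estimates) -> (itemset, frequency,
# confidence interval), or None when the itemset is filtered out.
def round2_reduce(itemset, freqs, required):
    if len(freqs) < required:
        return None  # frequent in too few samples: filtered out
    freqs = sorted(freqs)
    # Smallest interval containing `required` of the local estimates
    width, i = min(
        (freqs[j + required - 1] - freqs[j], j)
        for j in range(len(freqs) - required + 1)
    )
    interval = (freqs[i], freqs[i + required - 1])
    # Midpoint of the interval as the point estimate
    # (the slide's "median of the confidence interval")
    frequency = (interval[0] + interval[1]) / 2
    return itemset, frequency, interval

print(round2_reduce(frozenset({"milk"}), [0.50, 0.52, 0.90, 0.51], required=3))
```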
Analysis
● Theorem: With probability at least 1 − δ, the output is an ε-approximation to FI(D, θ)
● Proof intuition: “boosting-like” approach
  – Outputs of the reducers in round 1 are ε-approximations for the local samples, with some probability 1 − φ
  – If an itemset appears in enough ε-approximations, then it is probably frequent in the dataset
● Consequences: better approximation, fewer false positives
Implementation
● PARMA implemented as a Java library for Hadoop
  – Easy to run, possible integration with Apache Mahout
● FP-Growth used for the sample mining phase
  – PARMA is agnostic to the choice of mining algorithm
  – We used an existing Java implementation
● 1400 lines of code
● Available on GitHub
Experiments
● Run on Amazon Elastic MapReduce
  – Up to 16 machines, with 17GB of RAM
● Artificial datasets built using standard benchmark generators
  – Various parameters (transaction size, items, ...)
  – Sizes between 5 and 50 million transactions
● Compared the performance of PARMA with PFP and naive distributed counting
  – PARMA performs significantly better (1.5x to 5x speed improvement) in all tests
Runtime Analysis
● Runtime does not grow with the size of the data
  – Because the individual sample size does not grow!
● Much faster than PFP, and increasingly so as the data grows
● PARMA gets faster with more machines, even with more data
Speedup
● Quasi-linear speedup
● Stage 1: very close to optimal
● Stage 2: no speedup
  – Running time depends on the number of itemsets
  – No way to control the number of itemsets
  – Common in frequent itemsets mining algorithms
[Chart: speedup vs. number of instances (2, 4, 8), with curves for ideal, total, Stage 1, and Stage 2]
Accuracy
● Output is always an ε-approximation to FI(D, θ)
● Output not much larger than the real collection
● Better frequency accuracy than guaranteed
● Smaller confidence intervals than guaranteed
Conclusions
We showed how to efficiently use MapReduce to mine a high-quality approximation of the frequent itemsets and association rules from a large dataset, with minimum data replication and quasi-linear speedup.
Publications
● Riondato, Upfal. Efficient Discovery of Association Rules and Frequent Itemsets through Sampling with Tight Performance Guarantees. ECML PKDD 2012
● Riondato, DeBrabant, Fonseca, Upfal. PARMA: A Parallel Randomized Algorithm for Approximate Association Rules Mining in MapReduce. CIKM 2012