Detecting Novel Associations in Large Data Sets
A pragmatic discussion of "Detecting Novel Associations in Large Data Sets" by David N. Reshef, Yakir Reshef, Hilary Finucane, Sharon Grossman, Gilean McVean, Peter Turnbaugh, Eric Lander, Michael Mitzenmacher, and Pardis Sabeti
Sean Patrick [email protected]
Getting Started
• Blog overview - http://theoreticalecology.wordpress.com/2011/12/16/the-maximal-information-coefficient/
• MINE code (Java-based with Python and R wrappers) - http://www.exploredata.net/Downloads/MINE-Application
• MINE homepage - http://www.exploredata.net/
• Science article and supplemental information - http://www.sciencemag.org/content/334/6062/1518.abstract
• http://andrewgelman.com/2011/12/mr-pearson-meet-mr-mandelbrot-detecting-novel-associations-in-large-data-sets/
So who actually read the paper?
Outline
1. Motivation
2. Explanation
3. Application
The Problem
• 10,000+ variables
• Hundreds, thousands, millions of observations
• Your boss wants you to find all possible relationships between all different variable pairs …
• Where do you start?
Motivation
Scatter Plots?
50 variables means 1,225 different scatter plots (one per variable pair) to examine!
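The figure on this slide is the number of unordered variable pairs, n(n-1)/2. A minimal Python sketch of that arithmetic follows; the helper name `pair_count` is ours, not something from the talk or the MINE package.

```python
import math


def pair_count(num_variables):
    """Number of distinct unordered variable pairs, i.e. scatter plots to inspect."""
    return math.comb(num_variables, 2)


print(pair_count(50))      # 1225, the figure quoted on the slide
print(pair_count(10_000))  # 49995000 pairs for a 10,000-variable data set
```

The quadratic growth is why eyeballing scatter plots stops being practical well before 10,000 variables.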
Other Options?
• Correlation Matrix
• Factor Analysis/Principal Component Analysis
• Audience recommendations?
Possible Problems
• A large number of possible relationships
• Each has a different statistical test
• Need to have a hypothesis about the relationship that might be present in the data
Desired Properties
• Generality – the correlation coefficient should be sensitive to a wide range of possible dependencies, including superpositions of functions.
• Equitability – the score of the coefficient should be influenced by noise, but not by the form of the dependency between variables
Enter the Maximal Information Coefficient (MIC)
Explanation
Algorithm Intuition
We have a dataset D (scatter plot with axes x and y)
Definition of mutual information (for discrete random variables)
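For reference, the standard definition for discrete random variables is I(X;Y) = sum over all (x, y) of p(x,y) * log[ p(x,y) / (p(x) p(y)) ]. The sketch below computes it in bits from a small joint probability table. The function name and the two example tables are illustrative assumptions, not material from the slides.

```python
import math


def mutual_information(joint):
    """Mutual information in bits for a 2-D table of joint probabilities p(x, y)."""
    px = [sum(row) for row in joint]        # marginal distribution of X (rows)
    py = [sum(col) for col in zip(*joint)]  # marginal distribution of Y (columns)
    mi = 0.0
    for i, row in enumerate(joint):
        for j, p in enumerate(row):
            if p > 0:  # empty cells contribute nothing to the sum
                mi += p * math.log2(p / (px[i] * py[j]))
    return mi


# Hypothetical 2x2 tables: one independent, one perfectly dependent.
independent = [[0.25, 0.25], [0.25, 0.25]]
dependent = [[0.5, 0.0], [0.0, 0.5]]

print(mutual_information(independent))  # 0.0
print(mutual_information(dependent))    # 1.0
```

In the MIC setting, the table comes from laying a grid over the scatter plot and counting the fraction of points that fall in each cell.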
MI = 0.5
MI = 0.6
MI = 0.7
Maximum mutual information
Characteristic Matrix
We have to normalize by min {log x, log y} to enable comparison across grids.
2x3
MI = 0.65
MI = 0.56
MI = 0.71
Characteristic Matrix
This highest value is the Maximal Information Coefficient (MIC)
This surface is just a 3D representation of the characteristic matrix.
1. Every entry of the characteristic matrix is between 0 and 1, inclusive
2. MIC(X,Y) = MIC(Y,X) – symmetric
3. MIC is invariant under order-preserving transformations of the axes
How Big is the Characteristic Matrix?
• Technically, infinite in size
• This is unwieldy
• So we set bounds on xy < B(n) = n^0.6, where n = number of data points
• This is an empirically set value
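To get a feel for how few grid resolutions the bound leaves, the sketch below lists every x-by-y grid shape with xy < B(n) = n^0.6. It assumes each axis has at least 2 bins (otherwise the normalizing logarithm would be zero). The function name `grid_shapes` is hypothetical.

```python
def grid_shapes(n, exponent=0.6):
    """All (x, y) grid resolutions with x, y >= 2 and x * y < n ** exponent."""
    bound = n ** exponent
    shapes = []
    x = 2
    while 2 * x < bound:  # need at least y = 2 to fit under the bound
        y = 2
        while x * y < bound:
            shapes.append((x, y))
            y += 1
        x += 1
    return shapes


print(len(grid_shapes(100)))  # 16 shapes, since B(100) is about 15.85
print(grid_shapes(100)[:3])   # [(2, 2), (2, 3), (2, 4)]
```

Each shape corresponds to one entry of the characteristic matrix, so the bound directly controls how much work is done per variable pair.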
How Do We Compute the Maximum Information for a Particular xy Grid?
• Heuristic-based, dynamic programming
• Pseudo-code in supplemental materials
• Only approximate solution, seems to work
• Authors acknowledge better algorithm should be found
• At the moment, mostly irrelevant as the authors have released a Java implementation of the algorithm
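To make the overall search concrete, here is a deliberately simplified sketch. It is not the authors' dynamic-programming heuristic: for each grid shape it tries only one grid, built from equal-frequency bins on each axis, so it can only underestimate the true maximum. All names (`equal_frequency_bins`, `grid_mutual_information`, `naive_mic`) are ours, and ties in the data are split arbitrarily across bins.

```python
import math
from collections import Counter


def equal_frequency_bins(values, k):
    """Assign each value to one of k bins holding roughly equal numbers of points."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def grid_mutual_information(xb, yb):
    """Mutual information (bits) between two sequences of bin labels."""
    n = len(xb)
    joint, px, py = Counter(zip(xb, yb)), Counter(xb), Counter(yb)
    return sum((c / n) * math.log2(c * n / (px[a] * py[b]))
               for (a, b), c in joint.items())


def naive_mic(xs, ys, exponent=0.6):
    """Largest normalized MI over equal-frequency grids with x * y < n ** exponent."""
    bound = len(xs) ** exponent
    best = 0.0
    for x in range(2, int(bound) + 1):
        for y in range(2, int(bound) + 1):
            if x * y >= bound:
                break
            mi = grid_mutual_information(equal_frequency_bins(xs, x),
                                         equal_frequency_bins(ys, y))
            best = max(best, mi / math.log2(min(x, y)))  # normalize across grids
    return best


xs = list(range(200))
print(naive_mic(xs, [v * v for v in xs]))  # noiseless monotone relationship: 1.0 up to rounding
```

For real work, the released MINE implementation is the better choice, since it searches over grid placements rather than fixing them.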
Useful Properties of the MIC Statistic
With probability approaching 1 as sample size grows:
(i) MIC assigns scores that tend to 1 for all never-constant noiseless functional relationships
(ii) MIC assigns scores that tend to 1 for a larger class of noiseless relationships (including superpositions of noiseless functional relationships)
(iii) MIC assigns scores that tend to 0 to statistically independent variables
Application
MIC
So what does the MIC mean?
• Uncorrected p-value tables are available to download for various sample sizes of data
• Null hypothesis is variables are statistically independent
• http://www.exploredata.net/Downloads/P-Value-Tables
MINE = Maximal Information-based Nonparametric Exploration
Hopefully this part is self-explanatory now
Nonparametric vs parametric could be a session unto itself.
Here, we do not rely on assumptions that the data in question are drawn from a specific probability distribution (such as the normal distribution).
MINE statistics leverage the extra information captured by the characteristic matrix to offer more insight into the relationships between variables.
Minimum Cell Number (MCN) - measures the complexity of an association in terms of the number of cells required
Maximum Edge Value (MEV <= MIC) – measures closeness to being a function (vertical line test)
Maximum Asymmetry Score (MAS <= MIC) – measures deviations from monotonicity
MAS – monotonicity
MEV – vertical line test
MCN – complexity
http://www.exploredata.net/Usage-instructions
this takes too long … change it first
R: MINE("MLB2008.csv","one.pair",var1.id=2,var2.id=12)
Java: java -jar MINE.jar MLB2008.csv -onePair 2 12
Seeks relationships between salary and home runs, 338 pairs
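If you want to script the Java call from Python, one option is to build the argument list and hand it to `subprocess.run`. The sketch below only reproduces the `-onePair` invocation shown on the slide. The helper name `mine_one_pair_command` and the assumption that `MINE.jar` sits in the working directory are ours; other flags are described in the usage instructions linked above.

```python
import subprocess


def mine_one_pair_command(csv_path, var1, var2, jar_path="MINE.jar"):
    """Build the argument list for the one-pair MINE call shown on the slide."""
    return ["java", "-jar", jar_path, csv_path, "-onePair", str(var1), str(var2)]


cmd = mine_one_pair_command("MLB2008.csv", 2, 12)
print(" ".join(cmd))  # java -jar MINE.jar MLB2008.csv -onePair 2 12

# Uncomment to actually run it (assumes Java and MINE.jar are available locally):
# subprocess.run(cmd, check=True)
```

Passing a list rather than a shell string avoids quoting problems when the CSV path contains spaces.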
Usage
Notes
• Does not work on textual data (must be numeric)
• Long execution times
• Outputs MIC and other mentioned MINE statistics, not the Characteristic Matrix
• Output is .csv, a row per variable pair
Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License
You are free to:
• copy, distribute and transmit the work
With the following conditions:
• Attribution - You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
• Noncommercial - You may not use this work for commercial purposes.
• No Derivative Works - You may not alter, transform, or build upon this work.
Now What? Data Triage Pipeline
Complex Data Set -> MIC -> Ranked list of variable relationships to examine in more depth with the tool(s) of your choice
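The triage step itself is just a sort. A minimal sketch, assuming you have already loaded one (variable, variable, MIC) triple per pair from the tool's output; the function name `triage`, the cutoff parameters, and the scores below are made up for illustration only.

```python
def triage(pair_scores, top_k=3, min_mic=0.0):
    """Return the top_k variable pairs by MIC, dropping pairs below min_mic."""
    ranked = sorted(pair_scores, key=lambda row: row[2], reverse=True)
    return [row for row in ranked if row[2] >= min_mic][:top_k]


# Hypothetical scores, not results from any real data set.
scores = [("a", "b", 0.21), ("a", "c", 0.87), ("b", "c", 0.05), ("c", "d", 0.64)]

for var1, var2, mic in triage(scores, top_k=2, min_mic=0.1):
    print(var1, var2, mic)
# a c 0.87
# c d 0.64
```

The pairs that survive are the ones worth a closer look with scatter plots, regression, or whatever tool fits the relationship.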
Lingering Questions
• Can this be extended to higher-dimensional relationships?
• Just how approximate is the current MIC algorithm?
• Who wants to develop an open source implementation?
• What other MINE statistics are waiting for discovery?
• Execution time – the algorithm is embarrassingly parallel – easily Hadoop-ified
• Many tests reported by the paper only introduced vertical noise into the data?
• There is also some question as to its power vs Pearson and distance correlation (dcor) (http://www-stat.stanford.edu/~tibs/reshef/comment.pdf)
Comment by N. Simon and R. Tibshirani
http://www-stat.stanford.edu/~tibs/reshef/script.R
[Figure: panels plotting Power against Noise Level]
Backup Slides