Alok Choudhary Henry and Isabel Dever Professor
EECS and Kellogg School of Management Northwestern University
Ankit Agrawal Research Associate Professor
EECS Northwestern University
An Overview of Essential Concepts in Data Mining
Data Driven Science: Strategic View
[Figure: data-driven science cycle. Labels: Instruments, sensors, supercomputers; Transactional: Data Generation; Historical: Data Processing, transformation, approximation; Data Management (Data Reduction, Query, Data Visualization, Data Sharing); Data Mining, analytics, unsupervised learning; Discovery, Insights, Feedback; Historical data; Learning Models; Trigger/questions; Predict]
© Alok Choudhary
Data Driven Science: Thinking about Data Mining?
• The science of extracting useful knowledge from huge data repositories
• Makes use of the wealth of historical observational and simulation data
• Accelerates time-to-discovery and actionable insights
The Unknown
As we know,
There are known knowns. There are things we know we know.
• High humidity results in an outbreak of meningitis
• Customers switch carriers when their contract is over
Conventional Wisdom
• A nuclear reaction happens under these conditions
• Did combustion occur at the expected parameter values?
Validate Hypothesis
e.g., Statistics, Query, Transformation, Viz
The Unknown
We also know there are known unknowns.
That is to say, we know there are some things we do not know.
• Will this hurricane strike the Atlantic coast?
• What is the likelihood that this patient will develop cancer?
• Will this customer buy a new smartphone?
Top-Down Discovery: we know the question to ask
Predictive Modeling...; e.g., SVM, Decision Trees
The Unknown
But there are also unknown unknowns, the ones we don't know we don't know.
• Wow! I found a new galaxy?
• Switch C fails when switch A fails followed by switch B failing
• On Thursdays people buy beer and diapers together.
• The ratio K/P > X is an indicator of the onset of diabetes.
Bottom-up Discovery: we don't know the question to ask
Relationship Mining, Clustering, etc.; e.g., association rule mining (ARM)
A Typical Knowledge Discovery Workflow
Databases/Streams → Data Preparation → Dimensionality Reduction → Modeling + Learning → Model Evaluation → Knowledge
Data Preparation
• Data preprocessing
o Consistency and right representations
o E.g., numeric, ordinal, nominal, binary, interval, ratio
• Incorporate domain knowledge
o Add/derive more attributes
o Possibly a non-linear function of existing attributes
o E.g., Body Mass Index (BMI)
• Handle missing data
o Eliminate instances/attributes with missing values
o Use average/default value
o Predict based on the rest of the data (e.g., reanalysis data in climate)
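The two preparation steps named on this slide, filling in missing values with an average and deriving a new attribute such as BMI, can be sketched in a few lines of Python. This is only an illustration: the records, the field names `height_m` and `weight_kg`, and the helper functions are hypothetical and do not come from the talk.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
records = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": None},
    {"height_m": 1.70, "weight_kg": 57.8},
]

def impute_mean(rows, key):
    """Return copies of rows where missing `key` values are replaced
    by the mean of the observed values (the 'use average' option)."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def add_bmi(rows):
    """Derive BMI = weight / height^2, a non-linear function of two
    existing attributes."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]

prepared = add_bmi(impute_mean(records, "weight_kg"))
for row in prepared:
    print(row)
```

Mean imputation is the simplest of the three options listed; dropping incomplete rows or predicting the missing value from the other attributes would replace `impute_mean` with a different function of the same shape.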
Example Types of data sets
• Record
o Data Matrix
o Document Data
o Transaction Data
• Graph
o World Wide Web
o Molecular Structures
• Ordered
o Spatial Data
o Temporal Data
o Sequential Data
o Genetic Sequence Data
Document data (term counts per document):
            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0
Transaction data:
Tid  Other SVC  Marital Status  Income  Buy
1    Yes        Single          125K    No
2    No         Married         100K    No
3    No         Single          70K     No
4    Yes        Married         120K    No
5    No         Divorced        95K     Yes
6    No         Married         60K     No
7    Yes        Divorced        220K    No
8    No         Single          85K     Yes
9    No         Married         75K     No
10   No         Single          90K     Yes
Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties (distinctness, order, addition & multiplication)
Dimensionality/Data Reduction Methods
• Feature Selection
o To choose an (optimal) subset of features whose predictive power is almost as good as that of the original set of features
• Feature Extraction
o To map a high-dimensional dataset to a lower-dimensional representation
Feature Selection Example: Microstructure Representation
• Microstructure image samples
• Quantitative descriptors: volume fraction, particle number/size, cluster number, pore size, orientation, roundness, elongation ratio, filler surface quantity, matrix surface quantity…
• Hundreds of attributes
• Structure-property optimization: try optimization for 10^3 dimensions
• A small set of key descriptors? Can we reduce it from ~100 to ~10?
[Figure: Experiment Result: Solution found/Performance vs. Number of Variables. x-axis: Number of Variables (0 to 80); y-axes: Optimum Solution (3.5E-6 to 3.64E-6) and Time Consumption (s) (0 to 15000); series: Optimum found, Time consumed. Annotation: Accelerating Time to Insights]
Feature Selection Methods
• Correlation-based
o Evaluates the importance of a subset of attributes by considering the individual predictive ability of each feature together with the degree of redundancy among the features in the subset
• Relief-based
o Evaluates the importance of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same and of a different class
• Information Gain-based
o Evaluates the worth of an attribute by measuring the information gain (an entropy-based metric) with respect to the class
• Wrapper-based
o Evaluates attribute sets by using a learning scheme. Cross validation is used to estimate the accuracy of the learning scheme for a set of attributes.
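As an illustration of the information gain-based approach, the sketch below scores two attributes of the ten-record transaction table shown earlier in the deck (Other SVC and Marital Status) against the class Buy. The function names are illustrative, and the code is a minimal reading of the entropy-based definition rather than any particular tool's implementation.

```python
from collections import Counter
from math import log2

# (Other SVC, Marital Status, Buy) for the ten records in the example table.
rows = [
    ("Yes", "Single", "No"),
    ("No", "Married", "No"),
    ("No", "Single", "No"),
    ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"),
    ("No", "Married", "No"),
    ("Yes", "Divorced", "No"),
    ("No", "Single", "Yes"),
    ("No", "Married", "No"),
    ("No", "Single", "Yes"),
]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, attr_index, class_index=-1):
    """Class entropy minus the weighted entropy after splitting on one attribute."""
    labels = [r[class_index] for r in rows]
    groups = {}
    for r in rows:
        groups.setdefault(r[attr_index], []).append(r[class_index])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

gains = {
    "Other SVC": information_gain(rows, 0),
    "Marital Status": information_gain(rows, 1),
}
for name, gain in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {gain:.4f}")
```

On this small table Marital Status scores higher than Other SVC, so a ranking-based selector would keep it first. Income is left out because a numeric attribute would need to be discretized before this calculation applies.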
Illustrative Feature Extraction Methods
[Figure panels: PCA, LDA, LLE, GE]
Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.
Feature Extraction Methods
• PCA (Principal Component Analysis)
o Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
o Usually unsupervised
• LDA (Linear Discriminant Analysis)
o Finds a linear combination of features which characterizes or separates two or more classes of objects or events
o Usually supervised
• GE (Graph Embedding)
o Partitions the data into clusters by a series of normalized cuts
o Points within a cluster are deemed similar and points belonging to separate clusters are deemed dissimilar
• LLE (Locally Linear Embedding)
o Computes a set of weights for each point that best describe it as a linear combination of its nearest neighbors, and finds a low-dimensional embedding of the points
o Tries to preserve the local geometry of the data
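Of the four methods, PCA is the easiest to show compactly. The following is a minimal sketch using NumPy's eigendecomposition of the covariance matrix. The synthetic data and the `pca` helper are assumptions made for illustration, not the setup used in the cited comparison.

```python
import numpy as np

def pca(X, n_components):
    """Project the rows of X onto the top principal components.

    Returns the projected data and the variance captured by each component.
    """
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)       # features are columns
    eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]               # orthogonal directions
    return X_centered @ components, eigvals[order]

# Synthetic data: the second column is nearly a multiple of the first,
# so most of the variance lies along one direction.
rng = np.random.default_rng(0)
t = rng.normal(size=200)
X = np.column_stack([
    t,
    2 * t + 0.1 * rng.normal(size=200),
    0.1 * rng.normal(size=200),
])

scores, variances = pca(X, n_components=2)
print(scores.shape, variances)
```

The projected columns are linearly uncorrelated, which is the property the slide describes. No class labels are used anywhere, which is why PCA counts as unsupervised.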
Predictive Modeling
[Figure: two labeled groups. Honest: Tridas, Vickie, Mike. Crooked: Barney, Waldo, Wally.]
Which characteristics distinguish the two groups?
Learned Rules in Predictive Modeling
Tridas Vickie Mike
Honest = has round eyes and a smile
Discovering Materials: Simulations → Analytics
Construction of FE prediction database
• Consists of compounds with known formation energy (FE)
• Empirical periodic table information added (e.g., electronegativity, mass, atomic radii, # valence s, p, d, f electrons)
Predictive Modeling
• Construct data mining models to predict formation energy using chemical formula and derivable empirical information
Model Evaluation
• Test model on unseen data
• 10-fold cross validation (data divided into 10 segments, model built on 9 segments and tested on the remaining 1 segment; process repeated 10 times with a different test segment)
Large scale FE prediction
• Run combinatorial list of compounds through the FE model
Screening
• Thermodynamic stability and heuristics
Validation
• Structure prediction
• Quantum mechanical modeling
[Figure, panels (a) and (b). Labels: Combinatorial list of ternary compounds; FE model; List of predictions; Shortlisted high-potential candidates; Stable discovered structures]
Modeling Methods
• Predictive
o Classification: learning a model to classify new records based on training data (e.g., decision trees, NN, SVM, etc.)
o Regression: learning a function to model the data while minimizing the error
o Anomaly Detection: identification of outlier records that might lead to interesting discoveries
• Descriptive
o Clustering: discovering groups of records that have similarities
o Association Rule Mining: discovering relations between different attributes of the dataset
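For the descriptive side, association rule mining rests on two simple quantities, support and confidence. The sketch below computes them for a beer-and-diapers style rule. The transactions are made up for illustration, and the helper names are not taken from the talk.

```python
# Hypothetical market-basket transactions, one set of items per purchase.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Estimated P(consequent | antecedent): support of the union
    divided by support of the antecedent."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

print("support(beer, diapers):", support({"beer", "diapers"}, transactions))
print("confidence(beer -> diapers):", confidence({"beer"}, {"diapers"}, transactions))
```

A full miner such as Apriori would enumerate candidate itemsets and keep those above chosen support and confidence thresholds. Only the scoring step is shown here.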
Model Evaluation
• Cross Validation
o Test every instance in the dataset using a model that has not seen that instance
o Goal is to estimate the performance of the model on future instances while reducing the chance that over-fitting inflates the estimate
o Types: k-fold cross validation; leave-one-out cross-validation (LOOCV)
[Figure: dataset divided into a training split and a testing split]
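The splitting scheme behind k-fold cross validation can be sketched with the standard library alone. The generator below is an illustrative helper, not code from the talk: it yields training and testing index lists so that every instance is tested exactly once by a model that did not see it. Setting k equal to the number of instances gives leave-one-out cross-validation.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists for k-fold cross validation.

    Indices are shuffled once, then dealt into k folds; each fold
    serves as the testing split exactly once.
    """
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, test

splits = list(k_fold_indices(n=10, k=5))
for train, test in splits:
    print("train:", sorted(train), "test:", sorted(test))

# Leave-one-out is the special case k == n.
loocv = list(k_fold_indices(n=4, k=4))
```

In use, a model would be fitted on each `train` list and scored on the matching `test` list, and the k scores averaged.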
Evaluation Metrics - Classification
• Confusion matrix based
o Accuracy, precision, recall, F-score,…
• Receiver operating characteristic (ROC) curve based
o Area under the curve (AUC)
[Figure: Confusion Matrix]
Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
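The confusion-matrix metrics follow directly from the four cell counts. Here is a small sketch for the two-class case. The label vectors are invented for illustration, and returning 0.0 when a denominator is zero is a convention chosen here, not something the slides specify.

```python
def confusion_counts(actual, predicted, positive):
    """Return (TP, FP, FN, TN) for a two-class problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F-score (F1) from the four counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical actual vs. predicted labels.
actual    = ["Yes", "No", "No",  "Yes", "No", "Yes", "No", "No"]
predicted = ["Yes", "No", "Yes", "No",  "No", "Yes", "No", "No"]

counts = confusion_counts(actual, predicted, positive="Yes")
metrics = classification_metrics(*counts)
print(counts, metrics)
```

ROC and AUC need a ranking score per instance rather than hard labels, so they are not covered by this sketch.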
Evaluation Metrics - Regression
• Compare vectors of actual and predicted values
o Coefficient of correlation (R)
o Coefficient of determination (R^2)
o Mean Absolute Error (MAE)
o Root Mean Squared Error (RMSE)
o Standard Deviation of Error (SDE)
o Mean Absolute Error Fraction (MAE_f)
o Root Mean Squared Error Fraction (RMSE_f)
o Standard Deviation of Error Fraction (SDE_f)
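Several of these regression metrics can be computed from the actual and predicted vectors with the standard library. The sketch below covers R, R^2, MAE, and RMSE. The example vectors are made up, and R^2 is computed as 1 - SS_res/SS_tot, which is one common definition (some authors instead report the square of R).

```python
from math import sqrt
from statistics import mean

def regression_metrics(actual, predicted):
    """Return R, R^2, MAE, and RMSE for paired actual/predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = mean(abs(e) for e in errors)
    rmse = sqrt(mean(e * e for e in errors))

    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)      # total sum of squares
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    r = cov / sqrt(ss_a * ss_p)                        # Pearson correlation

    ss_res = sum(e * e for e in errors)                # residual sum of squares
    r2 = 1 - ss_res / ss_a
    return {"R": r, "R2": r2, "MAE": mae, "RMSE": rmse}

# Hypothetical actual and predicted property values.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]

results = regression_metrics(actual, predicted)
print(results)
```

The fractional variants on the slide are not defined in the transcript, so they are left out here.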
Modeling Example: Property Prediction
Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)
Modeling Example: Property Prediction Results
Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)
Example Problems in Materials Science
• Material classification
o Classify zeolite crystals by structure
o Structural information is gathered
o Random forest is used as the classification system
• Structure-property mapping
o Structure-property map of compound semiconductors
o Similar results from data mining and from large-scale simulations
o Data mining is very fast and robust
Carr et al. (2009) MMM, 117(1-2):339-349
Rajan (2005) Materials Today, 8(10):38-45
What Else you may find!
The Unknown Unknown
Strong Affinity
Data-Driven Science: It is a process
[Figure: climate network construction. Climate Data (SLP, SST, VWS) → Anomaly time series → Correlation between anomaly time series → Statistically significant correlations → Climate Network (nodes in the graph: grid points on the globe; edge weights: significant correlations). Variants: Multivariate Networks; Multiphase Networks (Extreme Phase, Normal Phase)]
Thank You! Alok Choudhary and Ankit Agrawal
Dept. of Electrical Engineering and Computer Science Northwestern University
[email protected] 312 515 2562
References
• Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.
• Chakrabarti, S. et al., Data mining curriculum: A proposal (Version 1.0). Intensive Working Group of ACM SIGKDD Curriculum Committee, April 2006, retrieved December 1, 2010, from http://www.sigkdd.org/curriculum/CURMay06.pdf
• Principal component analysis, Wikipedia, http://en.wikipedia.org/wiki/Principal_component_analysis, accessed January 15, 2014.
• Linear discriminant analysis, Wikipedia, http://en.wikipedia.org/wiki/Linear_discriminant_analysis, accessed January 15, 2014.
• Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
• D. Andrew Carr, Mohammed Lach-hab, Shujiang Yang, Iosif I. Vaisman, Estela Blaisten-Barojas, "Machine learning approach for structure-based zeolite classification", Microporous and Mesoporous Materials, Volume 117, Issues 1–2, 1 January 2009, Pages 339–349, http://dx.doi.org/10.1016/j.micromeso.2008.07.027
• Krishna Rajan, "Materials informatics", Materials Today, Volume 8, Issue 10, October 2005, Pages 38–45, http://dx.doi.org/10.1016/S1369-7021(05)71123-8
• Pilania G, Wang C, Jiang X, Rajasekaran S, Ramprasad R (2013). Accelerating materials property predictions using machine learning. Scientific Reports 3, Article number: 2810, doi:10.1038/srep02810