An Overview of Essential Concepts in Data Mining

Alok Choudhary, Henry and Isabel Dever Professor, EECS and Kellogg School of Management, Northwestern University, [email protected]

Ankit Agrawal, Research Associate Professor, EECS, Northwestern University, [email protected]

Page 1: An Overview of Essential Concepts in Data Miningmuri.materials.cmu.edu/data/ReviewMeeting_2014_01/... · Pilania G, Wang C, Jiang X, Rajasekaran S, Ramprasad R. Accelerating materials


Page 2:

Data Driven Science – Strategic View

© Alok Choudhary

[Figure: strategic-view diagram. Recoverable labels: (A) Data Management; Data Reduction, Query; Data Visualization; Data Sharing; Transactional: Data Generation; Historical: Data Processing, transformation, approximation; Data Mining, analytics, unsupervised learning; Discovery, Insights, Feedback; Instruments, sensors, supercomputers; Historical data; Learning Models; Trigger/questions; Predict.]

Page 3:

Data Driven Science: Thinking about Data Mining?

• The science of extracting useful knowledge from huge data repositories
• Makes use of wealth of historical observational and simulation data
• Accelerate Time-to-Discovery and Actionable Insights

Page 4:

The Unknown

As we know,
There are known knowns.
There are things we know we know.

Conventional Wisdom:
• High humidity results in outbreak of meningitis
• Customers switch carriers when contract is over

Validate Hypothesis:
• Nuclear reaction happens under these conditions
• Did combustion occur at the expected parameter values

e.g., Statistics, Query, Transformation, Viz

Page 5:

The Unknown

As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.

Top-Down Discovery - We know the question to ask:
• Will this hurricane strike the Atlantic coast?
• What is the likelihood of this patient developing cancer?
• Will this customer buy a new smart phone?

Predictive Modeling...; e.g., SVM, Decision Trees

Page 6:

The Unknown

As we know,
There are known knowns.
There are things we know we know.
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.

Bottom-Up Discovery - We don't know the question to ask:
• Wow! I found a new galaxy?
• Switch C fails when switch A fails followed by switch B failing
• On Thursday people buy beer and diapers together.
• The ratio K/P > X is an indicator of onset of diabetes.

Relationship Mining, Clustering, etc. - ARM

Page 7:

A Typical Knowledge Discovery Workflow

Databases/Streams → Data Preparation → Dimensionality Reduction → Modeling + Learning → Model Evaluation → Knowledge

Page 8:

Data Preparation

• Data preprocessing
  o Consistency and right representations
  o E.g., numeric, ordinal, nominal, binary, interval, ratio
• Incorporate domain knowledge
  o Add/derive more attributes
  o Possibly a non-linear function of existing attributes
  o E.g., Body Mass Index (BMI)
• Handle missing data
  o Eliminate instances/attributes with missing values
  o Use average/default value
  o Predict based on rest of the data (e.g., reanalysis data in climate)
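The two preparation steps named on this slide, filling a missing value with the average and deriving BMI as a non-linear function of existing attributes, can be sketched as follows. This is only an illustration: the records, the field names `height_m` and `weight_kg`, and the helper functions are hypothetical and do not come from the talk.

```python
from statistics import mean

# Hypothetical records; None marks a missing value.
records = [
    {"height_m": 1.80, "weight_kg": 81.0},
    {"height_m": 1.60, "weight_kg": None},
    {"height_m": 1.75, "weight_kg": 61.25},
]

def impute_mean(rows, key):
    """Replace missing values of `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def add_bmi(rows):
    """Derive BMI = weight / height^2, a non-linear function of two attributes."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]

# Impute first so that every record has the inputs the derived attribute needs.
prepared = add_bmi(impute_mean(records, "weight_kg"))
```

Mean imputation is the simplest of the three options listed; dropping incomplete records or predicting the missing value from the other attributes would replace `impute_mean` in the same position of the pipeline.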

Page 9:

Example Types of data sets

• Record
  o Data Matrix
  o Document Data
  o Transaction Data
• Graph
  o World Wide Web
  o Molecular Structures
• Ordered
  o Spatial Data
  o Temporal Data
  o Sequential Data
  o Genetic Sequence Data

Document data (term counts per document):

            team  coach  play  ball  score  game  win  lost  timeout  season
Document 1     3      0     5     0      2     6    0     2        0       2
Document 2     0      7     0     2      1     0    0     3        0       0
Document 3     0      1     0     0      1     2    2     0        3       0

Transaction data:

Tid  Other SVC  Marital Status  Income  Buy
1    Yes        Single          125K    No
2    No         Married         100K    No
3    No         Single          70K     No
4    Yes        Married         120K    No
5    No         Divorced        95K     Yes
6    No         Married         60K     No
7    Yes        Divorced        220K    No
8    No         Single          85K     Yes
9    No         Married         75K     No
10   No         Single          90K     Yes

Attribute types:
• Nominal attribute: distinctness
• Ordinal attribute: distinctness & order
• Interval attribute: distinctness, order & addition
• Ratio attribute: all 4 properties (distinctness, order, addition & multiplication)
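As a small illustration of the document-data record type, the sketch below turns a piece of text into one row of a term-count matrix. The vocabulary reuses the ten terms shown on the slide; the sample sentence and the `term_vector` helper are made up for the example.

```python
from collections import Counter

# The ten terms from the slide's document-term example.
VOCAB = ["team", "coach", "play", "ball", "score",
         "game", "win", "lost", "timeout", "season"]

def term_vector(text, vocab=VOCAB):
    """Count how often each vocabulary term occurs in a whitespace-tokenized text."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocab]

# Hypothetical document; each position in `vec` lines up with a term in VOCAB.
doc = "team play play game score team win game"
vec = term_vector(doc)
```

Each document becomes a fixed-length numeric record, which is what lets record-oriented methods (classification, clustering) operate on text.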

Page 10:

Dimensionality/Data Reduction Methods

• Feature Selection
  o To choose an (optimal) subset of features whose predictive power is almost as good as that of the original set of features
• Feature Extraction
  o To map a high-dimensional dataset to a lower-dimensional representation

Page 11:

Feature Selection Example: Microstructure Representation

Microstructure image samples → Quantitative descriptors (hundreds of attributes) → A small set of key descriptors?

Quantitative descriptors: volume fraction, particle number/size, cluster number, pore size, orientation, roundness, elongation ratio, filler surface quantity, matrix surface quantity…

Can we reduce it from ~100 to ~10?

Page 12:

Structure-Property Optimization – Try optimization for 10^3 dimensions

Page 13:

Experiment Result: Solution found/Performance vs. Number of Variables

Accelerating Time to Insights

[Figure: plot with two series, Optimum found and Time consumed. x-axis: Number of Variables (0 to 80); y-axes: Optimum Solution (3.5E−6 to 3.64E−6) and Time Consumption (s) (0 to 15000).]

Page 14:

Feature Selection Methods

• Correlation-based
  o Evaluates the importance of a subset of attributes by considering the individual predictive ability of each feature together with the degree of redundancy among the features in the subset
• Relief-based
  o Evaluates the importance of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instance of the same class and of a different class
• Information Gain-based
  o Evaluates the worth of an attribute by measuring the information gain (an entropy-based metric) with respect to the class
• Wrapper-based
  o Evaluates attribute sets by using a learning scheme; cross validation is used to estimate the accuracy of the learning scheme for a set of attributes
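To make the information-gain criterion concrete, the sketch below scores two attributes of the ten-row transaction table from page 9 (Marital Status and Other SVC) against the class Buy. The function names are illustrative; the formula is the standard one, class entropy minus the weighted entropy after splitting on the attribute.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Entropy of the class minus the weighted entropy within each attribute value."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Columns of the transaction table on page 9 (rows 1 to 10).
marital = ["Single", "Married", "Single", "Married", "Divorced",
           "Married", "Divorced", "Single", "Married", "Single"]
other_svc = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
buy = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

ig_marital = information_gain(marital, buy)
ig_svc = information_gain(other_svc, buy)
```

On this table Marital Status scores higher than Other SVC, so a ranking-based selector would keep it first. Information gain scores attributes one at a time, so unlike the correlation-based and wrapper approaches it does not account for redundancy between attributes.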

Page 15:

Illustrative Feature Extraction Methods

[Figure: four panels labeled PCA, LDA, LLE, GE.]

Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.

Page 16:

Feature Extraction Methods

• PCA (Principal Component Analysis)
  o Use orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
  o Usually unsupervised
• LDA (Linear Discriminant Analysis)
  o Find a linear combination of features which characterizes or separates two or more classes of objects or events
  o Usually supervised
• GE (Graph Embedding)
  o Partition the data into clusters by a series of normalized cuts
  o Points within a cluster are deemed similar and points belonging to separate clusters are deemed dissimilar
• LLE (Locally Linear Embedding)
  o Compute a set of weights for each point that best describe it as a linear combination of its nearest neighbors, and find a low-dimensional embedding of points
  o Tries to preserve the local geometry of the data
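A minimal sketch of the idea behind PCA: center the data, form the covariance matrix, and find its dominant eigenvector, which is the first principal component. Power iteration is used here only to keep the example dependency-free; it is an assumption of this sketch, not a method named in the talk, and a real analysis would normally call a linear algebra library. The sample points are made up.

```python
from math import sqrt

def first_principal_component(rows, iterations=100):
    """Estimate the first principal component with power iteration.

    Assumes the data has non-zero variance and that the all-ones start vector
    is not orthogonal to the dominant eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        # Multiply by the covariance matrix, then renormalize to unit length.
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
pc1 = first_principal_component(points)
```

Because the points lie on a single line, one component captures all of the variance, and projecting onto `pc1` reduces the data from two dimensions to one without loss.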

Page 17:

Predictive Modeling

[Figure: two groups of faces. Honest: Tridas, Vickie, Mike. Crooked: Barney, Waldo, Wally.]

Which characteristics distinguish the two groups?

Page 18:

Learned Rules in Predictive Modeling

[Figure: faces of Tridas, Vickie, Mike.]

Honest = has round eyes and a smile

Page 19:

Discovering Materials: Simulations → Analytics

Construction of FE prediction database
• Consists of compounds with known formation energy (FE)
• Empiric periodic table information added (e.g. electronegativity, mass, atomic radii, # valence s, p, d, f electrons)

Predictive Modeling
• Construct data mining models to predict formation energy using chemical formula and derivable empirical information

Model Evaluation
• Test model on unseen data
• 10-fold cross validation (data divided into 10 segments, model built on 9 segments and tested on remaining 1 segment; process repeated 10 times with different test segment)

Large scale FE prediction
• Run combinatorial list of compounds through the FE model

Screening
• Thermodynamic stability and heuristics

Validation
• Structure prediction
• Quantum mechanical modeling

[Figure: panels (a) and (b). Recoverable labels: Combinatorial list of ternary compounds; FE model; List of predictions; Shortlisted high-potential candidates; Stable discovered structures.]

Page 20:

Modeling Methods

• Predictive
  o Classification: learning a model to classify new records based on training data (e.g., decision trees, NN, SVM, etc.)
  o Regression: learning a function to model the data while minimizing the error
  o Anomaly Detection: identification of outlier records that might lead to interesting discoveries
• Descriptive
  o Clustering: discovering groups of records that have similarities
  o Association Rule Mining: discovering relations between different attributes of the dataset
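Association rule mining, the last descriptive method listed, is usually described in terms of two measures: support (how often an itemset occurs) and confidence (how often the consequent occurs when the antecedent does). The sketch below computes both for a rule in the spirit of the beer-and-diapers example from earlier in the talk; the transactions are invented for illustration.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "diapers"},
    {"beer", "chips"},
    {"milk", "bread"},
]

def support(itemset, txns):
    """Fraction of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in txns) / len(txns)

def confidence(antecedent, consequent, txns):
    """Estimated P(consequent | antecedent); assumes the antecedent occurs at least once."""
    return support(antecedent | consequent, txns) / support(antecedent, txns)

rule_support = support({"beer", "diapers"}, transactions)
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)
```

A full miner such as Apriori would enumerate candidate itemsets and keep only those above chosen support and confidence thresholds; this sketch only evaluates one given rule.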

Page 21:

Model Evaluation

• Cross Validation
  o Test every instance in the dataset using a model that has not seen that instance
  o Goal is to estimate the performance of the model on future instances while reducing the risk of an over-optimistic estimate caused by over-fitting
  o Types
    - k-fold cross validation
    - Leave-one-out cross-validation (LOOCV)

[Figure: data partitioned into a training split and a testing split.]
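The k-fold procedure can be sketched in a few lines: split the indices into k contiguous folds, train on k−1 folds, test on the held-out fold, and average the error over all test instances. The baseline "model" (predict the training mean) and the data are hypothetical stand-ins; the sketch also assumes the records are already in random order, since it does not shuffle.

```python
def k_fold_indices(n, k):
    """Yield (train, test) index lists; every index lands in exactly one test fold."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, test
        start += size

def cross_validate(xs, ys, fit, predict, k):
    """Mean absolute error over all held-out predictions."""
    errors = []
    for train, test in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors += [abs(predict(model, xs[i]) - ys[i]) for i in test]
    return sum(errors) / len(errors)

# Hypothetical baseline: ignore the inputs and always predict the training mean.
def fit_mean(xs, ys):
    return sum(ys) / len(ys)

def predict_mean(model, x):
    return model

ys = [1.0, 2.0, 3.0, 4.0]
xs = list(range(len(ys)))
mae = cross_validate(xs, ys, fit_mean, predict_mean, k=len(ys))  # k = n gives LOOCV
```

Setting `k` equal to the number of instances turns the same routine into leave-one-out cross-validation, the second type listed above.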

Page 22:

Evaluation Metrics - Classification

• Confusion matrix based
  o Accuracy, precision, recall, F-score, …
• Receiver operating characteristic curve (ROC) based
  o Area under the curve (AUC)

[Figure: Confusion Matrix.]

Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
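The confusion-matrix metrics listed above follow directly from four counts: true positives, false positives, false negatives, and true negatives. The sketch below derives accuracy, precision, recall, and the F1 variant of the F-score for a binary problem; the label vectors are invented, and returning 0.0 when a denominator is zero is a convention chosen for this example.

```python
def confusion_counts(actual, predicted, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification problem."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn

def classification_metrics(actual, predicted, positive=1):
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Hypothetical ground truth and model output.
actual    = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(actual, predicted)
```

ROC-based metrics such as AUC need predicted scores rather than hard labels, so they are not covered by this sketch.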

Page 23: An Overview of Essential Concepts in Data Miningmuri.materials.cmu.edu/data/ReviewMeeting_2014_01/... · Pilania G, Wang C, Jiang X, Rajasekaran S, Ramprasad R. Accelerating materials

u  Compare vectors of actual and predicted values l Coefficient of correlation (R) l Coefficient of determination (R2) l Mean Absolute Error (MAE) l Root Mean Squared Error (RMSE) l Standard Deviation of Error (SDE) l Mean Absolute Error Fraction (MAE) l Root Mean Squared Error Fraction (RMSE) l Standard Deviation of Error Fraction (SDE)

Evaluation Metrics - Regression
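Several of these metrics can be computed directly from the two vectors. The sketch below covers R, R², MAE, and RMSE; it assumes R² is defined as 1 − SSE/SST (one common convention, which can differ from the square of R) and leaves out SDE and the "fraction" variants because the slides do not define them. The example vectors are made up.

```python
from math import sqrt

def regression_metrics(actual, predicted):
    """Compare actual and predicted values; assumes neither vector is constant."""
    n = len(actual)
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)      # total sum of squares (SST)
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    sse = sum(e * e for e in errors)                   # sum of squared errors
    return {
        "R": cov / sqrt(ss_a * ss_p),                  # Pearson correlation
        "R2": 1 - sse / ss_a,                          # assumed definition: 1 - SSE/SST
        "MAE": sum(abs(e) for e in errors) / n,
        "RMSE": sqrt(sse / n),
    }

# Hypothetical predictions that are consistently 0.5 too high.
m = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.5, 3.5, 4.5])
```

The example shows why more than one metric is worth reporting: a constant offset leaves the correlation at 1 while R², MAE, and RMSE all register the error.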

Page 24:

Modeling Example: Property Prediction

Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)

Page 25:

Modeling Example: Property Prediction Results

Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)

Page 26:

Example Problems in Materials Science

• Material classification (Carr et al. (2009) MMM, 117(1-2):339-349)
  o Classify zeolite crystals by structure
  o Structural information is gathered
  o Random forest is used as the classification system
• Structure-property mapping (Rajan (2005) Materials Today, 8(10):38-45)
  o Structure-property map of compound semiconductors
  o Similar results from
    - data mining
    - large-scale simulations
  o Data mining is very fast and robust

Page 27:

What Else you may find!

The Unknown Unknown

[Figure: Strong Affinity.]

Page 28:

Data-Driven Science: It is a process

[Figure: climate network pipeline. Recoverable labels: Climate Data (SLP, SST, VWS); Anomaly time series; Correlation between anomaly time series; Stat. significant correlations; Climate Network (nodes in the graph: grid points on the globe; edge weights: significant correlations); Multivariate Networks; Multiphase Networks (Extreme Phase, Normal Phase).]
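The network construction named on this slide (grid points as nodes, correlations between their anomaly time series as edge weights) can be sketched as below. The grid-point names and series are invented, and a fixed cutoff on |r| stands in for the statistical significance test mentioned on the slide, whose details the transcript does not give.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length series; assumes neither is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))

def correlation_network(series, threshold):
    """Return {(node_a, node_b): r} for every pair with |r| >= threshold.

    The threshold is a simplification of this sketch; the slide refers to
    statistically significant correlations instead.
    """
    edges = {}
    for a, b in combinations(sorted(series), 2):
        r = pearson(series[a], series[b])
        if abs(r) >= threshold:
            edges[(a, b)] = r
    return edges

# Hypothetical anomaly time series at three grid points.
anomalies = {
    "grid_A": [0.1, 0.4, -0.2, 0.3, -0.5],
    "grid_B": [0.2, 0.8, -0.4, 0.6, -1.0],  # exactly twice grid_A
    "grid_C": [0.3, -0.1, 0.2, -0.4, 0.1],
}
edges = correlation_network(anomalies, threshold=0.9)
```

Only the pair whose series move together survives the cutoff, so the resulting graph has a single weighted edge between `grid_A` and `grid_B`.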

Page 29:

Thank You!

Alok Choudhary and Ankit Agrawal
Dept. of Electrical Engineering and Computer Science
Northwestern University
[email protected] 312 515 2562

Page 30:

References

• Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.
• Chakrabarti, S. et al., Data mining curriculum: A proposal (Version 1.0). Intensive Working Group of ACM SIGKDD Curriculum Committee, April 2006, retrieved December 1, 2010, from http://www.sigkdd.org/curriculum/CURMay06.pdf
• Principal component analysis, Wikipedia, http://en.wikipedia.org/wiki/Principal_component_analysis, accessed January 15, 2014.
• Linear discriminant analysis, Wikipedia, http://en.wikipedia.org/wiki/Linear_discriminant_analysis, accessed January 15, 2014.
• Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.
• D. Andrew Carr, Mohammed Lach-hab, Shujiang Yang, Iosif I. Vaisman, Estela Blaisten-Barojas, "Machine learning approach for structure-based zeolite classification", Microporous and Mesoporous Materials, Volume 117, Issues 1–2, 1 January 2009, Pages 339–349, http://dx.doi.org/10.1016/j.micromeso.2008.07.027
• Krishna Rajan, "Materials informatics", Materials Today, Volume 8, Issue 10, October 2005, Pages 38–45, http://dx.doi.org/10.1016/S1369-7021(05)71123-8
• Pilania G, Wang C, Jiang X, Rajasekaran S, Ramprasad R. Accelerating materials property predictions using machine learning. Scientific Reports 3, Article number: 2810, doi:10.1038/srep02810