An Overview of Essential Concepts in Data...

Alok Choudhary, Henry and Isabel Dever Professor

EECS and Kellogg School of Management, Northwestern University

Ankit Agrawal, Research Associate Professor

EECS, Northwestern University

An Overview of Essential Concepts in Data Mining

Data Driven Science: Strategic View

[Figure: data-driven science cycle. Labels: Instruments, sensors, supercomputers; Transactional: Data Generation; Data Management; Data Reduction, Query; Data Visualization; Data Sharing; Historical: Data Processing, transformation, approximation; Data Mining, analytics, unsupervised learning; Discovery, Insights, Feedback. Inset labels: Historical data, Learning Models, Trigger/questions, Predict.]

© Alok Choudhary

Data Driven Science : Thinking about Data Mining?

•  The science of extracting useful knowledge from huge data repositories
•  Makes use of the wealth of historical observational and simulation data
•  Accelerates Time-to-Discovery and Actionable Insights


The Unknown


As we know,

There are known knowns. There are things we know we know.

• High Humidity results in outbreak of Meningitis
• Customers switch carriers when contract is over

Conventional Wisdom

• Nuclear Reaction happens under these conditions
• Did combustion occur at the expected parameter values?

Validate Hypothesis

e.g., Statistics, Query, Transformation, Viz


The Unknown

As we know,
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.

• Will this hurricane strike the Atlantic coast?
• What is the likelihood of this patient developing cancer?
• Will this customer buy a new smart phone?

Top-Down Discovery - We know the question to ask

Predictive Modeling; e.g., SVM, Decision Trees


The Unknown

As we know,
We also know
There are known unknowns.
That is to say
We know there are some things
We do not know.
But there are also unknown unknowns,
The ones we don't know
We don't know.

• Wow! I found a new galaxy?
• Switch C fails when switch A fails followed by switch B failing
• On Thursdays, people buy beer and diapers together.
• The ratio K/P > X is an indicator of onset of diabetes.

Bottom-Up Discovery - We don't know the question to ask

Relationship Mining, Clustering, etc. - ARM (association rule mining)

A Typical Knowledge Discovery Workflow

[Figure: Databases/Streams → Data Preparation → Dimensionality Reduction → Modeling + Learning → Model Evaluation → Knowledge]

Data Preparation

•  Data preprocessing
   o  Consistency and right representations
   o  E.g., numeric, ordinal, nominal, binary, interval, ratio
•  Incorporate domain knowledge
   o  Add/derive more attributes
   o  Possibly a non-linear function of existing attributes
   o  E.g., Body Mass Index (BMI)
•  Handle missing data
   o  Eliminate instances/attributes with missing values
   o  Use average/default value
   o  Predict based on rest of the data (e.g., reanalysis data in climate)
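The two data-preparation steps named on this slide (filling in missing values with an average, and deriving BMI as a non-linear function of existing attributes) can be sketched as follows. This is a minimal illustration only: the records, the field names `height_m` and `weight_kg`, and the helper names are hypothetical and do not come from the deck.

```python
from statistics import mean

# Hypothetical patient records; None marks a missing value.
records = [
    {"height_m": 1.70, "weight_kg": 65.0},
    {"height_m": 1.80, "weight_kg": None},
    {"height_m": 1.60, "weight_kg": 80.0},
]

def impute_mean(rows, key):
    """Replace missing values of `key` with the mean of the observed values."""
    observed = [r[key] for r in rows if r[key] is not None]
    fill = mean(observed)
    return [{**r, key: fill if r[key] is None else r[key]} for r in rows]

def add_bmi(rows):
    """Derive BMI = weight / height^2, a non-linear function of two attributes."""
    return [{**r, "bmi": r["weight_kg"] / r["height_m"] ** 2} for r in rows]

# Impute first, so that the derived attribute is defined for every record.
prepared = add_bmi(impute_mean(records, "weight_kg"))
```

Mean imputation is the simplest of the three options listed; dropping incomplete records or predicting the missing value from the other attributes would replace `impute_mean` with a different function of the same shape.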

Example Types of data sets

•  Record
   o  Data Matrix
   o  Document Data
   o  Transaction Data
•  Graph
   o  World Wide Web
   o  Molecular Structures
•  Ordered
   o  Spatial Data
   o  Temporal Data
   o  Sequential Data
   o  Genetic Sequence Data

Document data (term counts per document):

              team  coach  play  ball  score  game  win  lost  timeout  season
Document 1       3      0     5     0      2     6    0     2        0       2
Document 2       0      7     0     2      1     0    0     3        0       0
Document 3       0      1     0     0      1     2    2     0        3       0

Transaction data:

Tid  Other SVC  Marital Status  Income  Buy
1    Yes        Single          125K    No
2    No         Married         100K    No
3    No         Single          70K     No
4    Yes        Married         120K    No
5    No         Divorced        95K     Yes
6    No         Married         60K     No
7    Yes        Divorced        220K    No
8    No         Single          85K     Yes
9    No         Married         75K     No
10   No         Single          90K     Yes

Nominal attribute: distinctness
Ordinal attribute: distinctness & order
Interval attribute: distinctness, order & addition
Ratio attribute: all 4 properties (distinctness, order, addition & multiplication)

Dimensionality/Data Reduction Methods

•  Feature Selection
   o  To choose an (ideally optimal) subset of features whose predictive power is almost as good as that of the original set of features
•  Feature Extraction
   o  To map a high-dimensional dataset to a lower-dimensional representation

Feature Selection Example: Microstructure Representation

Microstructure image samples

Quantitative descriptors

Volume fraction, Particle number/size, Cluster number, Pore size, Orientation, Roundness, Elongation ratio, Filler Surface Quantity, Matrix Surface Quantity…

A small set of key descriptors? Can we reduce it from ~100 to ~10?

Hundreds of attributes

Structure-Property Optimization – Try optimization for 10^3 dimensions


[Figure: Experiment Result: Solution found/Performance vs. Number of Variables. X-axis: Number of Variables (0 to 80). Y-axes: Optimum Solution (3.5E−6 to 3.64E−6) and Time Consumption (s) (0 to 15000). Series: Optimum found, Time consumed.]

Accelerating Time to Insights

Feature Selection Methods

•  Correlation-based
   o  Evaluates the importance of a subset of attributes by considering the individual predictive ability of each feature along with the degree of redundancy among them
•  Relief-based
   o  Evaluates the importance of an attribute by repeatedly sampling an instance and considering the value of the given attribute for the nearest instances of the same and of a different class
•  Information Gain-based
   o  Evaluates the worth of an attribute by measuring the information gain (an entropy-based metric) with respect to the class
•  Wrapper-based
   o  Evaluates attribute sets by using a learning scheme; cross validation is used to estimate the accuracy of the learning scheme for each set of attributes
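As a small worked instance of the information gain-based method, the sketch below scores two attributes from the ten-record transaction table shown on the data-types slide (Other SVC and Marital Status) against the class Buy. The function names are illustrative, and this is only the textbook definition of information gain, not necessarily the exact procedure used in the authors' experiments.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(values, labels):
    """Class entropy minus the weighted entropy after splitting on `values`."""
    n = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Columns copied from the transaction table (Tid 1 to 10).
other_svc = ["Yes", "No", "No", "Yes", "No", "No", "Yes", "No", "No", "No"]
marital = ["Single", "Married", "Single", "Married", "Divorced",
           "Married", "Divorced", "Single", "Married", "Single"]
buy = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

ig_svc = information_gain(other_svc, buy)      # about 0.19 bits
ig_marital = information_gain(marital, buy)    # about 0.28 bits
```

On this table Marital Status carries more information about Buy than Other SVC, so a filter that keeps the top-ranked attributes would keep it first.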

Illustrative Feature Extraction Methods

Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.

[Figure panels: PCA, LDA, LLE, GE]

Feature Extraction Methods

•  PCA (Principal Component Analysis)
   o  Uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components
   o  Usually unsupervised
•  LDA (Linear Discriminant Analysis)
   o  Finds a linear combination of features which characterizes or separates two or more classes of objects or events
   o  Usually supervised
•  GE (Graph Embedding)
   o  Partitions the data into clusters by a series of normalized cuts
   o  Points within a cluster are deemed similar and points belonging to separate clusters are deemed dissimilar
•  LLE (Locally Linear Embedding)
   o  Computes a set of weights for each point that best describe it as a linear combination of its nearest neighbors, and finds a low-dimensional embedding of points
   o  Tries to preserve the local geometry of the data
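To make the PCA bullet concrete, here is a minimal pure-Python sketch that finds only the first principal component by power iteration on the sample covariance matrix and projects the data onto it. The function name, the iteration count, and the toy data are assumptions for illustration; a real analysis would normally use a linear-algebra library and keep several components.

```python
from math import sqrt

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) for the leading principal component.

    Assumes the data has non-zero variance and that the all-ones start
    vector is not orthogonal to the leading eigenvector.
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # One-dimensional representation: projection of each centered row on v.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Toy data: the second attribute is exactly twice the first, so one
# component captures all of the variance.
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
direction, scores = first_principal_component(data)
```

Here `direction` points along (1, 2) normalized to unit length, and `scores` is the reduced one-dimensional dataset.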


Predictive Modeling

Honest: Tridas, Vickie, Mike

Crooked: Barney, Waldo, Wally

Which characteristics distinguish the two groups?


Learned Rules in Predictive Modeling

Tridas Vickie Mike

Honest = has round eyes and a smile

Discovering Materials: Simulations → Analytics

Construction of FE prediction database

• Consists of compounds with known formation energy (FE)
• Empirical periodic table information added (e.g. electronegativity, mass, atomic radii, # valence s, p, d, f electrons)

Predictive Modeling

• Construct data mining models to predict formation energy using chemical formula and derivable empirical information

Model Evaluation

• Test model on unseen data
• 10-fold cross validation (data divided into 10 segments, model built on 9 segments and tested on remaining 1 segment; process repeated 10 times with different test segment)

Large scale FE prediction

• Run combinatorial list of compounds through the FE model

Screening

• Thermodynamic stability and heuristics

Validation

• Structure prediction
• Quantum mechanical modeling

[Figure, panels (a) and (b). Labels: Combinatorial list of ternary compounds; FE model; List of predictions; Shortlisted high-potential candidates; Stable discovered structures.]

Modeling Methods

•  Predictive
   o  Classification: learning a model to classify new records based on training data (e.g., decision trees, NN, SVM, etc.)
   o  Regression: learning a function to model the data while minimizing the error
   o  Anomaly Detection: identification of outlier records that might lead to interesting discoveries
•  Descriptive
   o  Clustering: discovering groups of records that have similarities
   o  Association Rule Mining: discovering relations between different attributes of the dataset
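For the descriptive side, the two standard quantities behind association rule mining, support and confidence, can be sketched on a handful of market-basket transactions in the spirit of the beer-and-diapers example earlier in the deck. The transactions and function names below are made up for illustration.

```python
# Hypothetical market-basket transactions (one set of items per purchase).
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "bread"},
    {"diapers", "milk"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    return sum(itemset <= basket for basket in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Estimated P(consequent | antecedent) for the rule antecedent => consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

rule_support = support({"beer", "diapers"}, transactions)          # 2 of 5 baskets
rule_confidence = confidence({"beer"}, {"diapers"}, transactions)  # 2 of 3 beer baskets
```

A rule miner such as Apriori enumerates itemsets whose support exceeds a threshold and then keeps the rules whose confidence is high enough; the sketch only evaluates one candidate rule.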

Model Evaluation

•  Cross Validation
   o  Test every instance in the dataset using a model that has not seen that instance
   o  Goal is to estimate the performance of the model on future instances while guarding against an over-fitted, overly optimistic estimate
   o  Types
      o  k-fold cross validation
      o  Leave-one-out cross-validation (LOOCV)

[Figure: training split and testing split]
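The splitting logic behind k-fold cross validation can be sketched in a few lines. The helper name `k_fold_indices` and the fixed seed are assumptions for illustration; the point is only that every instance lands in the testing split exactly once, and never in the training split of the same round.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Yield (train, test) index lists; each index is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

# 5-fold split of 10 instances; k = n gives leave-one-out (LOOCV).
splits = list(k_fold_indices(10, 5))
loocv = list(k_fold_indices(10, 10))
```

In each round a model would be fitted on `train` and scored on `test`; averaging the k scores gives the cross-validated estimate.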

Evaluation Metrics - Classification

Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.

•  Confusion matrix based
   o  Accuracy, precision, recall, F-score, …
•  Receiver operating characteristic (ROC) curve based
   o  Area under the curve (AUC)

[Figure: Confusion Matrix]
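The confusion-matrix-based metrics can be computed directly from paired actual and predicted labels. The sketch below handles the binary case only; the function name, the choice of 1 as the positive label, and the convention of returning 0.0 when a denominator is zero are assumptions.

```python
def binary_metrics(actual, predicted, positive=1):
    """Confusion-matrix counts and the metrics derived from them."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Toy labels: 3 true positives, 1 false negative, 1 false positive, 3 true negatives.
actual = [1, 1, 1, 0, 0, 0, 0, 1]
predicted = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = binary_metrics(actual, predicted)
```

ROC analysis needs predicted scores rather than hard labels, so it is not covered by this sketch.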

Evaluation Metrics - Regression

•  Compare vectors of actual and predicted values
   o  Coefficient of correlation (R)
   o  Coefficient of determination (R^2)
   o  Mean Absolute Error (MAE)
   o  Root Mean Squared Error (RMSE)
   o  Standard Deviation of Error (SDE)
   o  Mean Absolute Error Fraction (MAE_f)
   o  Root Mean Squared Error Fraction (RMSE_f)
   o  Standard Deviation of Error Fraction (SDE_f)
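A minimal sketch of the first five regression metrics follows. Two definitions here are assumptions, since the slide does not spell them out: R^2 is computed as 1 minus the ratio of residual to total sum of squares, and SDE as the population standard deviation of the errors (predicted minus actual). The fraction-based variants are left out.

```python
from math import sqrt
from statistics import mean, pstdev

def regression_metrics(actual, predicted):
    """Compare a vector of actual values with a vector of predictions."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mean_a, mean_p = mean(actual), mean(predicted)
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    ss_a = sum((a - mean_a) ** 2 for a in actual)      # total sum of squares
    ss_p = sum((p - mean_p) ** 2 for p in predicted)
    ss_res = sum(e * e for e in errors)                # residual sum of squares
    return {
        "R": cov / sqrt(ss_a * ss_p),          # Pearson correlation
        "R2": 1 - ss_res / ss_a,               # assumed definition, see lead-in
        "MAE": mean(abs(e) for e in errors),
        "RMSE": sqrt(ss_res / len(errors)),
        "SDE": pstdev(errors),                 # assumed definition, see lead-in
    }

# Toy example with two predictions off by 0.5 in opposite directions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.0, 2.5, 4.0]
scores = regression_metrics(actual, predicted)
```

Because the mean error in this toy example is zero, RMSE and SDE coincide; they differ whenever the model is biased.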

Modeling Example: Property Prediction

Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)

Modeling Example: Property Prediction Results

Pilania, Wang, Jiang, Rajasekaran and Ramprasad (2013)

Example Problems in Materials Science

•  Material classification
   o  Classify zeolite crystals by structure
   o  Structural information is gathered
   o  Random forest is used as the classification system
•  Structure-property mapping
   o  Structure-property map of compound semiconductors
   o  Similar results from data mining and large-scale simulations
   o  Data mining is very fast and robust

Carr et al. (2009) MMM, 117(1-2):339-349

Rajan (2005) Materials Today, 8(10):38-45

Data-Driven Science: It is a process

[Figure: climate network construction. Climate Data → Anomaly time series → Correlation between anomaly time series → Stat. significant correlations → Climate Network]

•  Nodes in the graph: grid points on the globe
•  Edge weights: significant correlations
•  Multivariate Networks: SLP, SST, VWS
•  Multiphase Networks: Extreme Phase, Normal Phase
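The network construction outlined on this slide can be sketched as: turn each grid point's series into an anomaly series, correlate every pair, and keep an edge where the correlation is strong. Two simplifications are assumptions of this sketch rather than the authors' procedure: anomalies are taken as deviations from each series' own mean, and a fixed absolute-correlation threshold stands in for a proper statistical significance test. The node names and values are made up.

```python
from itertools import combinations
from math import sqrt

def anomalies(series):
    """Deviation of each value from the series mean (assumed anomaly definition)."""
    m = sum(series) / len(series)
    return [x - m for x in series]

def correlation(x, y):
    """Pearson correlation of two zero-mean (anomaly) series."""
    num = sum(a * b for a, b in zip(x, y))
    den = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den

def build_network(grid_series, threshold=0.8):
    """Return {(node_u, node_v): correlation} for strongly correlated pairs."""
    anom = {node: anomalies(s) for node, s in grid_series.items()}
    edges = {}
    for u, v in combinations(sorted(anom), 2):
        r = correlation(anom[u], anom[v])
        if abs(r) >= threshold:  # stand-in for a significance test
            edges[(u, v)] = r
    return edges

# Hypothetical series at three grid points (e.g. one climate variable).
grid_series = {
    "A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "B": [2.0, 4.0, 6.0, 8.0, 10.0],
    "C": [5.0, 1.0, 4.0, 2.0, 3.0],
}
edges = build_network(grid_series)
```

Only the A-B pair survives the threshold here; building one such network per variable or per phase would give the multivariate and multiphase networks named on the slide.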

Thank You! Alok Choudhary and Ankit Agrawal

Dept. of Electrical Engineering and Computer Science Northwestern University

312 515 2562


References

•  Lee, George, Carlos Rodriguez, and Anant Madabhushi. "An empirical comparison of dimensionality reduction methods for classifying gene and protein expression datasets." Bioinformatics Research and Applications. Springer Berlin Heidelberg, 2007. 170-181.

•  Chakrabarti, S. et al., Data mining curriculum: A proposal (Version 1.0). Intensive Working Group of ACM SIGKDD Curriculum Committee, April 2006, retrieved December 1, 2010, from http://www.sigkdd.org/curriculum/CURMay06.pdf

•  Principal component analysis, Wikipedia, http://en.wikipedia.org/wiki/Principal_component_analysis, accessed January 15, 2014.

•  Linear discriminant analysis, Wikipedia, http://en.wikipedia.org/wiki/Linear_discriminant_analysis, accessed January 15, 2014.

•  Fawcett, Tom (2006); An introduction to ROC analysis, Pattern Recognition Letters, 27, 861–874.

•  D. Andrew Carr, Mohammed Lach-hab, Shujiang Yang, Iosif I. Vaisman, Estela Blaisten-Barojas, “Machine learning approach for structure-based zeolite classification”, Microporous and Mesoporous Materials, Volume 117, Issues 1–2, 1 January 2009, Pages 339–349, http://dx.doi.org/10.1016/j.micromeso.2008.07.027

•  Krishna Rajan, “Materials informatics”, Materials Today, Volume 8, Issue 10, October 2005, Pages 38–45, http://dx.doi.org/10.1016/S1369-7021(05)71123-8

•  Pilania G, Wang C, Jiang X, Rajasekaran S, Ramprasad R. Accelerating materials property predictions using machine learning. Scientific Reports 3, Article number: 2810, doi:10.1038/srep02810
