CSE 5243 INTRO. TO DATA MINING

Review Session for Midterm

Yu Su, CSE@The Ohio State University

Notes

¨ Time: 02/26/2020 (Wed), 9:35 – 10:55 AM

¨ Location: Caldwell Lab 171

¨ One-page cheat sheet: both sides allowed

¨ Calculator: allowed

¨ I will update the slides over the weekend to make them cleaner. No major changes.

Agenda

¨ Summary of key concepts and equations

¨ HW1 Discussion (TA)

¨ HW2 Discussion (TA)

Probability and Statistics

¨ Bayes rule: prior, likelihood, marginal probability, posterior

¨ Chain rule

¨ Maximum Likelihood Estimation (MLE)

¤ Obtain the parameter estimates that maximize the probability of observing the sample data under the specified model.
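A minimal sketch of the two ideas above, assuming a binary hypothesis for Bayes rule and i.i.d. Bernoulli (0/1) samples for MLE. The function names and all numbers are hypothetical, chosen only for illustration.

```python
def posterior(prior, likelihood, likelihood_given_not):
    """Bayes rule for a binary hypothesis H and evidence E.

    prior                = P(H)
    likelihood           = P(E | H)
    likelihood_given_not = P(E | not H)
    """
    # Marginal probability P(E), obtained by summing over H and not H.
    marginal = likelihood * prior + likelihood_given_not * (1 - prior)
    # Posterior P(H | E) = P(E | H) P(H) / P(E).
    return likelihood * prior / marginal


def bernoulli_mle(samples):
    """MLE of the success probability p for 0/1 samples: the sample mean."""
    return sum(samples) / len(samples)


# Hypothetical numbers: P(H) = 0.01, P(E | H) = 0.9, P(E | not H) = 0.05.
print(posterior(prior=0.01, likelihood=0.9, likelihood_given_not=0.05))

# Hypothetical sample: 3 successes out of 5 trials.
print(bernoulli_mle([1, 0, 1, 1, 0]))
```

With a small prior, the posterior stays well below the likelihood, which is the usual point of working such an example by hand.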

Data Preprocessing

¨ Major tasks: cleaning, integration, reduction, and transformation

¨ Cleaning: Smoothing noisy data by binning
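Smoothing by binning can be sketched as follows, assuming equal-frequency bins and smoothing by bin means (one of several possible variants). The function name and the sample values are hypothetical.

```python
def smooth_by_bin_means(values, n_bins):
    """Sort values, split into equal-frequency bins, replace each value by its bin mean."""
    if len(values) % n_bins != 0:
        # Simplifying assumption of this sketch: bins must have equal size.
        raise ValueError("len(values) must be divisible by n_bins")
    ordered = sorted(values)
    size = len(ordered) // n_bins
    smoothed = []
    for start in range(0, len(ordered), size):
        bin_values = ordered[start:start + size]
        mean = sum(bin_values) / len(bin_values)
        smoothed.extend([mean] * len(bin_values))
    return smoothed


# Hypothetical data: nine values split into three bins of three.
print(smooth_by_bin_means([4, 8, 15, 21, 21, 24, 25, 28, 34], 3))
```

Each bin here collapses to its mean (9, 22, and 29), which removes the within-bin noise at the cost of detail.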

¨ Integration: Detecting redundant attributes by correlation analysis

¤ χ² test for discrete random variables

¤ Correlation/covariance for continuous random variables
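The two correlation checks above can be sketched in plain Python: the Pearson χ² statistic for a contingency table of two discrete attributes, and the Pearson correlation coefficient for two continuous attributes. Function names and data are hypothetical; turning the χ² statistic into a decision additionally requires the degrees of freedom and a significance level, which this sketch leaves out.

```python
from math import sqrt


def chi_square(observed):
    """Pearson chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count if the two attributes were independent.
            e = row_totals[i] * col_totals[j] / n
            stat += (o - e) ** 2 / e
    return stat


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Hypothetical 2x2 table of counts for two binary attributes.
print(chi_square([[250, 200], [50, 1000]]))

# Hypothetical perfectly linear relationship: r is (numerically) 1.
print(pearson_r([1, 2, 3], [2, 4, 6]))
```

A large χ² value or an |r| close to 1 suggests the two attributes are related, so one of them may be redundant.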

¨ Reduction: Types of data reduction methods

¤ Regression, sampling, histogram, dimensionality reduction, clustering

¨ Transformation

¤ Normalization: min-max, z-score, L2 norm

¤ Discretization: general concept
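The three normalization methods named above can be sketched as follows. The function names are hypothetical, and the z-score version assumes the population standard deviation (dividing by n); some texts divide by n - 1 instead.

```python
from math import sqrt


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Subtract the mean and divide by the (population) standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm is 1."""
    norm = sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Hypothetical inputs.
print(min_max([10, 20, 30]))
print(z_score([2, 4, 6]))
print(l2_normalize([3, 4]))
```

Note the difference in scope: min-max and z-score normalize one attribute across all records, while L2 normalization rescales one record (vector) across its attributes.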

Classification

¨ Decision tree

¤ How to construct a decision tree given a dataset

¤ Attribute selection measures: information gain, gain ratio, Gini index

¤ Categorical attribute vs. continuous attribute

¤ What is pruning and why?
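Two of the attribute selection measures above, information gain and the Gini index, can be sketched for a categorical attribute as follows. The record layout (a list of dicts), the function names, and the toy data are all assumptions made for illustration; gain ratio is omitted to keep the sketch short.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index (impurity) of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(rows, labels, attr):
    """Entropy before the split minus the weighted entropy after splitting on attr."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "windy" separates the two classes perfectly.
rows = [{"windy": "yes"}, {"windy": "yes"}, {"windy": "no"}, {"windy": "no"}]
labels = ["stay", "stay", "play", "play"]
print(entropy(labels), gini(labels), information_gain(rows, labels, "windy"))
```

When building a tree, the attribute with the highest information gain (or the lowest weighted Gini index after the split) would be chosen at each node.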

¨ Classifier evaluation

¤ Metrics: confusion matrix/accuracy/error rate/precision/recall/F-measure/ROC curve

¤ Methods: holdout/cross-validation
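Most of the metrics above follow directly from the four cells of a binary confusion matrix. A minimal sketch, with a hypothetical function name and hypothetical counts; the F-measure shown is F1, the harmonic mean of precision and recall.

```python
def binary_metrics(tp, fp, fn, tn):
    """Accuracy, error rate, precision, recall, and F1 from confusion-matrix counts."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)  # fraction of predicted positives that are correct
    recall = tp / (tp + fn)     # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": accuracy,
        "error_rate": 1 - accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts for an imbalanced problem (120 positives, 880 negatives).
print(binary_metrics(tp=90, fp=10, fn=30, tn=870))
```

With imbalanced classes like these, accuracy can look high even when recall is modest, which is why precision, recall, and F-measure are reported alongside it.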

¨ Practical issues: overfitting/underfitting

¤ Concepts

¤ What could cause that? How to detect? How to fix?

¨ Naïve Bayes classifier (zero-probability problem)

¨ Ensemble methods: general concepts. Why do ensembles often improve performance?

¨ K-nearest neighbor classifier

¨ Neural network and SVM

¤ General concepts, e.g., what are support vectors? What is the maximum marginal hyperplane? What is backpropagation?
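For the zero-probability problem in naïve Bayes, a common remedy is add-one (Laplace) smoothing; whether this is the exact correction used in the course is an assumption. The sketch below uses a hypothetical function name and hypothetical counts.

```python
def smoothed_likelihood(count, class_total, n_values, alpha=1):
    """Estimate P(attribute = value | class) with add-alpha (Laplace) smoothing.

    count       : training tuples of this class having the attribute value
    class_total : training tuples of this class
    n_values    : number of distinct values the attribute can take
    """
    return (count + alpha) / (class_total + alpha * n_values)


# Hypothetical class with 1000 tuples; the attribute takes 3 values
# with counts 0, 990, and 10. Unsmoothed, the first estimate would be 0
# and would zero out the whole product of conditional probabilities.
counts = [0, 990, 10]
probs = [smoothed_likelihood(c, class_total=1000, n_values=3) for c in counts]
print(probs)
```

The smoothed estimates stay close to the raw frequencies, still sum to 1, and none of them is exactly zero.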

Clustering

¨ Distance and similarity measures (in the Statistics review lecture)

¨ Partitioning-based

¤ K-means: algorithm, objective, complexity

¤ K-medoids

¨ Hierarchical clustering

¤ Dendrogram

¤ MIN (single linkage), MAX (complete linkage)

¨ Density-based

¤ DBSCAN: general concepts like core/border/noise, density-reachable/connected

¨ Cluster evaluation

¤ Similarity matrix, silhouette coefficient
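The K-means item above (algorithm and objective) can be sketched as the standard alternation of an assignment step and an update step, with the sum of squared errors (SSE) as the objective. This is a minimal sketch, not a tuned implementation: the function names are hypothetical, and the initial centroids are passed in explicitly rather than chosen at random so the run is reproducible.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop changing."""
    centroids = list(centroids)
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid moves to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


def sse(centroids, clusters):
    """K-means objective: sum of squared distances from points to their centroid."""
    return sum(sq_dist(p, c) for c, cluster in zip(centroids, clusters) for p in cluster)


# Hypothetical 2-D data with two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[(1.0, 1.0), (9.0, 8.0)])
print(centroids)
print(sse(centroids, clusters))
```

Each iteration compares every point with every centroid, so the cost grows with the number of points, clusters, dimensions, and iterations.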

Good luck!