CSE 5243 INTRO. TO DATA MINING
Review Session for Midterm
Yu Su, CSE@The Ohio State University
Notes
¨ Time: 02/26/2020 (Wed), 9:35 – 10:55 AM
¨ Location: Caldwell Lab 171
¨ One-page cheat sheet: both sides allowed
¨ Calculator: allowed
¨ I will update the slides over the weekend to make them cleaner. No major changes.
Agenda
¨ Summary of key concepts and equations
¨ HW1 Discussion (TA)
¨ HW2 Discussion (TA)
Probability and Statistics
¨ Bayes rule: prior, likelihood, marginal probability, posterior
¨ Chain rule
¨ Maximum Likelihood Estimation (MLE)
¤ Obtain the parameter estimates that maximize the likelihood, i.e., the probability of the observed sample data under the assumed model.
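As a rough illustration of the MLE idea, the sketch below estimates the success probability of a Bernoulli model. The sample, the function names, and the grid check are assumptions made for this example, not material from the lecture. For 0/1 data the maximizer has a closed form (the sample mean), and the grid search only confirms that no nearby value has a higher log-likelihood.

```python
import math


def bernoulli_mle(samples):
    """Closed-form MLE of p for i.i.d. 0/1 samples: the sample mean."""
    return sum(samples) / len(samples)


def bernoulli_log_likelihood(p, samples):
    """Log-likelihood of i.i.d. 0/1 samples under Bernoulli(p), for 0 < p < 1."""
    k = sum(samples)          # number of successes
    n = len(samples)          # number of trials
    return k * math.log(p) + (n - k) * math.log(1 - p)


# Hypothetical sample: 7 successes in 10 trials.
samples = [1, 1, 0, 1, 1, 0, 1, 1, 0, 1]
p_hat = bernoulli_mle(samples)

# Sanity check on a coarse grid: the grid point with the highest
# log-likelihood should coincide with the closed-form estimate.
grid = [i / 100 for i in range(1, 100)]
best_on_grid = max(grid, key=lambda p: bernoulli_log_likelihood(p, samples))

print(p_hat, best_on_grid)
```

The same recipe (write the log-likelihood, then maximize it over the parameters) applies to other models; only the closed form changes.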
Data Preprocessing
¨ Major tasks: cleaning, integration, reduction, and transformation
¨ Cleaning: Smoothing noisy data by binning
¨ Integration: Detecting redundant attributes by correlation analysis
¤ χ² test for discrete random variables
¤ Correlation/covariance for continuous random variables
¨ Reduction: Types of data reduction methods
¤ Regression, sampling, histogram, dimensionality reduction, clustering
¨ Transformation
¤ Normalization: min-max, z-score, L2 norm
¤ Discretization: general concept
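The three normalization schemes named above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions: the function names and sample values are made up for illustration, the z-score uses the population standard deviation (dividing by n), and the inputs are assumed to be non-constant and non-zero so that no division by zero occurs.

```python
import math


def min_max(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly so the smallest maps to new_min and the largest to new_max."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]


def z_score(values):
    """Center on the mean and scale by the population standard deviation."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / std for v in values]


def l2_normalize(vector):
    """Scale a vector so that its Euclidean (L2) norm equals 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


# Hypothetical attribute values and feature vector.
values = [2.0, 4.0, 6.0, 8.0]
scaled = min_max(values)
standardized = z_score(values)
unit = l2_normalize([3.0, 4.0])

print(scaled, standardized, unit)
```

Note that min-max and z-score are applied per attribute (down a column), whereas L2 normalization is applied per object (across a row).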
Classification
¨ Decision tree
¤ How to construct a decision tree given a dataset
¤ Attribute selection measures: information gain, gain ratio, Gini index
¤ Categorical attribute vs. continuous attribute
¤ What is pruning and why?
¨ Classifier evaluation
¤ Metrics: confusion matrix/accuracy/error rate/precision/recall/F-measure/ROC curve
¤ Methods: Holdout/cross validation
¨ Practical issues: overfitting/underfitting
¤ Concepts
¤ What could cause them? How to detect them? How to fix them?
¨ Naïve Bayes classifier (zero-probability problem)
¨ Ensemble methods: general concepts. Why do ensembles often improve performance?
¨ K-nearest neighbor classifier
¨ Neural network and SVM
¤ General concepts, e.g., what are support vectors? What is the maximum marginal hyperplane? What is backpropagation?
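For the decision-tree attribute selection measures listed in this section, a small worked example may help when checking hand calculations. The sketch below computes entropy, the Gini index, and the information gain of a categorical split. The toy attributes (`outlook`, `windy`) and labels are hypothetical, and the function names are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def gini(labels):
    """Gini index: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting labels on a categorical attribute."""
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted average entropy of the partitions produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy data: 'outlook' separates the classes perfectly, 'windy' not at all.
outlook = ["sunny", "sunny", "rain", "rain"]
windy = [True, False, True, False]
play = ["no", "no", "yes", "yes"]

print(information_gain(outlook, play), information_gain(windy, play), gini(play))
```

A tree builder would pick `outlook` here, since it has the larger gain. Gain ratio additionally divides the gain by the split information of the attribute, which penalizes attributes with many distinct values.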
Clustering
¨ Distance and similarity measures (in the Statistics review lecture)
¨ Partitioning-based
¤ K-means: algorithm, objective, complexity
¤ K-medoids
¨ Hierarchical clustering
¤ Dendrogram
¤ MIN (single linkage), MAX (complete linkage)
¨ Density-based
¤ DBSCAN: general concepts like core/border/noise, density-reachable/connected
¨ Cluster evaluation
¤ Similarity matrix, silhouette coefficient
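To make the K-means item concrete, here is a minimal sketch of the standard alternating procedure (assign each point to its nearest centroid, then recompute each centroid as the mean of its members), together with the sum-of-squared-errors (SSE) objective it reduces. The points, the initial centroids, and the function names are hypothetical. Initial centroids are passed in explicitly to keep the run deterministic, whereas in practice they are usually chosen at random.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [
        min(range(len(centroids)), key=lambda j: squared_distance(p, centroids[j]))
        for p in points
    ]


def k_means(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = []
        for j in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                # Mean of the members, computed dimension by dimension.
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                # Assumption for this sketch: an empty cluster keeps its old centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break
        centroids = new_centroids
    labels = assign(points, centroids)
    sse = sum(squared_distance(p, centroids[label]) for p, label in zip(points, labels))
    return centroids, labels, sse


# Hypothetical 2-D points forming two well-separated groups.
points = [(1.0, 1.0), (1.0, 2.0), (5.0, 5.0), (6.0, 5.0)]
centroids, labels, sse = k_means(points, centroids=[(1.0, 1.0), (5.0, 5.0)])

print(centroids, labels, sse)
```

Each iteration compares every point with every centroid, so one pass costs O(n·k·d) for n points, k clusters, and d dimensions. The result depends on the initial centroids, and the procedure is only guaranteed to reach a local minimum of the SSE.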
Good luck!