CSE 5243 INTRO. TO DATA MINING

Review Session for MidtermYu Su, CSE@The Ohio State University

¨ Time: 02/26/2020 (Wed), 9:35 – 10:55 AM

¨ Location: Caldwell Lab 171

¨ One-page cheat sheet: both sides allowed

¨ Calculator: allowed

¨ I will update the slides over the weekend to make them cleaner. No major changes.

Agenda

¨ Summary of key concepts and equations

¨ HW1 Discussion (TA)

¨ HW2 Discussion (TA)

Probability and Statistics4

¨ Bayes rule: prior, likelihood, marginal probability, posterior

¨ Chain rule

Probability and Statistics5

¨ Bayes rule: prior, likelihood, marginal probability, posterior

¨ Chain rule¨ Maximum Likelihood Estimation (MLE)

¤ Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.

Data Preprocessing6

¨ Major tasks: cleaning, integration, reduction, and transformation¨ Cleaning: Smoothing noisy data by binning

Data Preprocessing7

¨ Major tasks: cleaning, integration, reduction, and transformation¨ Cleaning: Smoothing noisy data by binning¨ Integration: Detecting redundant attributes by correlation analysis

¤ 𝜒! test for discrete random variables

Data Preprocessing8

¨ Major tasks: cleaning, integration, reduction, and transformation¨ Cleaning: Smoothing noisy data by binning¨ Integration: Detecting redundant attributes by correlation analysis

¤ 𝜒! test for discrete random variables¤ Correlation/covariance for continuous random variables

Data Preprocessing9

¨ Major tasks: cleaning, integration, reduction, and transformation¨ Cleaning: Smoothing noisy data by binning¨ Integration: Detecting redundant attributes by correlation analysis¨ Reduction: Types of data reduction methods

¤ Regression, sampling, histogram, dimensionality reduction, clustering

Data Preprocessing10

¨ Major tasks: cleaning, integration, reduction, and transformation¨ Cleaning: Smoothing noisy data by binning¨ Integration: Detecting redundant attributes by correlation analysis¨ Reduction: Types of data reduction methods¨ Transformation

¤ Normalization: min-max, z-score, L2 norm¤ Discretization: general concept

Classification11

¨ Decision tree¤ How to construct a decision tree given a dataset¤ Attribute selection measures: information gain, gain ratio, Gini index¤ Categorical attribute vs. continuous attribute¤ What is pruning and why?

Classification12

¨ Decision tree¨ Classifier evaluation

¤ Metrics: confusion matrix/accuracy/error rate/precision/recall/F-measure/ROC curve

¤ Methods: Holdout/cross validation

Classification13

¨ Decision tree¨ Classifier evaluation¨ Practical issues: overfitting/underfitting

¤ Concepts¤ What could cause that? How to detect? How to fix?

Classification14

¨ Decision tree¨ Classifier evaluation¨ Practical issues: overfitting/underfitting¨ Naïve Bayes classifier (zero-probability problem)¨ Ensemble methods: general concepts. Why ensemble often improves

performance?¨ K-nearest neighbor classifier¨ Neural network and SVM

¤ general concepts, e.g., what are support vectors? What is maximum marginal hyperplane? What is back propagation?

Clustering15

¨ Distance and similarity measures (in the Statistics review lecture)¨ Partitioning-based

¤ K-means: algorithm, objective, complexity¤ K-medoids

¨ Hierarchical clustering¤ Dendrogram¤ MIN (single linkage), MAX (complete linkage)

¨ Density-based¤ DBSCAN: general concepts like core/border/noise, density-reachable/connected

¨ Cluster evaluation¤ Similarity matrix, silhouette coefficient

Good luck!

CSE 5243 INTRO. TO DATA MINING - GitHub Pages · 15 Clustering 15 ¨Distance and similarity...

Documents

Transcript of CSE 5243 INTRO. TO DATA MINING - GitHub Pages · 15 Clustering 15 ¨Distance and similarity...

Aquifer Storage Properties CVEG 5243 Ground Water Hydrology T. Soerens.

Partitioning& Re Partitioning

CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2019/FPM-basic... · 2019-10-24 · CSE 5243 INTRO. TO DATA MINING Slides adapted from Prof. Jiawei Han @UIUC,

CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2018/...CSE 5243 INTRO. TO DATA MINING Slides adapted from UIUC CS412, Fall 2017, by Prof. Jiawei Han Classification

CSE 5243 INTRO. TO DATA MINING - GitHub Pages · 2020-03-05 · CSE 5243 INTRO. TO DATA MINING Classification (Basic Concepts) Yu Su, CSE@TheOhio State University Slides adapted from

CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2019/FPM-basic...CSE 5243 INTRO. TO DATA MINING Slides adapted from Prof. Jiawei Han @UIUC, Prof. Srinivasan

CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2018/FPM-basic... · 2018-08-18 · CSE 5243 INTRO. TO DATA MINING. Slides adapted from Prof. JiaweiHan @UIUC,

ME 5243: ADVANCED MECHANISM DESIGN · ME 5243: ADVANCED MECHANISM DESIGN ... • Topological Synthesis • Topological Analysis. 7 Type Synthesis Questions 1. How many links and joints

ME 5243: ADVANCED MECHANISM DESIGNME 5243: ADVANCED MECHANISM DESIGN •Quiz • Cam Sizing Exercises – Double Dwell – Conveyor Chase Class #23 Cam Sizing

CSE 5243 INTRO. TO DATA MININGweb.cse.ohio-state.edu/~sun.397/courses/au2017/FPM-basic-osu-11… · CSE 5243 INTRO. TO DATA MINING. Slides adapted from Prof. JiaweiHan @UIUC, Prof.

Partitioning Systems and Doors - RIBA Product Selector Partitioning the SAS Partitioning Range We manufacture and supply a range of demountable, relocatable partitioning systems and

MA/CS 471 Lecture 15, Fall 2002 Introduction to Graph Partitioning.

CSE 5243 INTRO. TO DATA MINING - GitHub Pages · CSE 5243 INTRO. TO DATA MINING Classification (Basic Concepts) Yu Su, CSE@TheOhio State University Slides adapted from UIUC CS412

CS267 L15 Graph Partitioning II.1 Demmel Sp 1999 CS 267 Applications of Parallel Computers Lecture 15: Graph Partitioning - II James Demmel demmel/cs267_Spr99.

Special Methods of Instruction I CIED 5243 Dr. Bowles, Instructor

Tuning small analytics on Big Data Data partitioning and ...aabello/publications/15.IS.pdf · Tuning small analytics on Big Data: Data partitioning and secondary indexes in the Hadoop

CSE 5243 INTRO. TO DATA MINING · 2020-04-15 · CSE 5243 INTRO. TO DATA MINING Classification (Basic Concepts & Advanced Methods) Yu Su, CSE@TheOhio State University Slides adapted

Automatic Memory Partitioning and Scheduling for ...cadlab.cs.ucla.edu/~cong/papers/memPar.pdfAutomatic Memory Partitioning and Scheduling for Throughput and Power Optimization 15:3

Genetic Algorithms for Graph Partitioning and Incremental Graph Partitioning