5. Machine Learning

Prof. Tudor Dumitraș
Assistant Professor, ECE
University of Maryland, College Park
ENEE 759D | ENEE 459D | CMSC 858Z
http://ter.ps/759d
https://www.facebook.com/SDSAtUMD



Page 1: 5 . Machine Learning


Page 2: 5 . Machine Learning

Today’s Lecture

• Where we’ve been
  – Big Data
  – Statistics
  – MapReduce
  – Interpretation of results

• Where we’re going today
  – Machine learning

• Where we’re going next
  – Part 2 of course: Security and InSecurity in the Real World
  – 2 readings each lecture

Page 3: 5 . Machine Learning

Machine Learning

• Supervised learning: have inputs and associated outputs
  – Learn relationships between them using available training data (also called “labeled data”, “ground truth”)
  – Predict future values
  – Classification: The output (learned attribute) is categorical
  – Regression: The output (learned attribute) is numeric

• Unsupervised learning: have only inputs
  – Learn “latent” labels
  – Clustering: Identify natural groups in the data

Systems that automatically learn programs from data (P. Domingos, CACM 2012)

Page 4: 5 . Machine Learning

Rules

outlook   temp  humidity  windy  play
overcast  cool  normal    TRUE   yes
overcast  hot   high      FALSE  yes
overcast  hot   normal    FALSE  yes
overcast  mild  high      TRUE   yes
rainy     cool  normal    TRUE   no
rainy     mild  high      TRUE   no
rainy     cool  normal    FALSE  yes
rainy     mild  high      FALSE  yes
rainy     mild  normal    FALSE  yes
sunny     hot   high      FALSE  no
sunny     hot   high      TRUE   no
sunny     mild  high      FALSE  no
sunny     cool  normal    FALSE  yes
sunny     mild  normal    TRUE   yes

• Want to decide when to play
  – Create rules based on attributes

• Example: 1 attribute
    if (outlook == “rainy”)
    then play = “no”
    else play = “yes”
  – Errors: 6/14

• Can refine rule by adding conditions on other attributes
  – Create a decision tree

Weather and golf
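The error count for the one-attribute rule can be checked mechanically. The following is a minimal sketch in Python (the slides themselves only mention R packages, so the language and the names `ROWS`, `one_rule` and `count_errors` are assumptions made for illustration); the data is the weather-and-golf table from this slide.

```python
# Weather-and-golf table from the slide: (outlook, temp, humidity, windy, play).
ROWS = [
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "high", True, "no"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
]


def one_rule(outlook):
    """The slide's single-attribute rule: do not play when it is rainy."""
    return "no" if outlook == "rainy" else "yes"


def count_errors(rows):
    """Number of rows where the rule disagrees with the recorded decision."""
    return sum(1 for outlook, _temp, _hum, _windy, play in rows
               if one_rule(outlook) != play)


errors = count_errors(ROWS)
print(f"Errors: {errors}/{len(ROWS)}")  # Errors: 6/14
```

Three of the errors come from rainy days on which people did play, and three from sunny days on which they did not, which is why conditions on further attributes are needed.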

Page 5: 5 . Machine Learning

Entropy

• Consider two sequences of coin flips
  – How much information do we get after flipping each coin once?
  – We want some function “Information” that satisfies:
      $\mathrm{Information}_{1\&2}(p_1 p_2) = \mathrm{Information}_1(p_1) + \mathrm{Information}_2(p_2)$
  – Expected Information = “Entropy”

• Examples
  – Flipping a fair coin: in learning the outcome of the coin flip we learn 1 bit of information
  – Rolling a fair die: a die is more unpredictable than a coin
  – Rolling a weighted die with $p_1 = \dots = p_5 = 0.1$, $p_6 = 0.5$: a weighted die is less unpredictable than a fair die

Which attribute do we choose at each level?
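The three examples can be made concrete with a short calculation. This is a sketch under one assumption that the text above does not spell out: that "Information" is the standard Shannon choice, Information(p) = -log2(p), which turns products of probabilities into sums as required. The function names are illustrative.

```python
import math


def information(p):
    """Information, in bits, gained by observing an outcome of probability p.

    Assumes the standard Shannon definition -log2(p).
    """
    return -math.log2(p)


def entropy(probs):
    """Expected information: sum of p * information(p) over all outcomes."""
    return sum(p * information(p) for p in probs if p > 0)


coin = [0.5, 0.5]
fair_die = [1 / 6] * 6
weighted_die = [0.1] * 5 + [0.5]

print(round(entropy(coin), 2))          # 1.0
print(round(entropy(fair_die), 2))      # 2.58
print(round(entropy(weighted_die), 2))  # 2.16
```

The ordering matches the slide: the fair die (about 2.58 bits) is more unpredictable than the coin (1 bit), and the weighted die (about 2.16 bits) is less unpredictable than the fair die.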

Page 6: 5 . Machine Learning

Decision Tree

outlook   temp  humidity  windy  play
overcast  cool  normal    TRUE   yes
overcast  hot   high      FALSE  yes
overcast  hot   normal    FALSE  yes
overcast  mild  high      TRUE   yes
rainy     cool  normal    TRUE   no
rainy     mild  high      TRUE   no
rainy     cool  normal    FALSE  yes
rainy     mild  high      FALSE  yes
rainy     mild  normal    FALSE  yes
sunny     hot   high      FALSE  no
sunny     hot   high      TRUE   no
sunny     mild  high      FALSE  no
sunny     cool  normal    FALSE  yes
sunny     mild  normal    TRUE   yes

• At each level, choose the attribute with the highest information gain
  – The one that reduces the unpredictability the most

• Before: 9/14 “yes” outcomes => H=0.94
• After splitting on each attribute (H is the average entropy of the branches, weighted by branch size):
  – Outlook: H=0.69
    • 4/4 “yes” for overcast (H=0)
    • 3/5 “yes” for rainy (H=0.97)
    • 2/5 “yes” for sunny (H=0.97)
  – Temperature: H=0.91
  – Humidity: H=0.79
  – Windy: H=0.89

• Outlook provides highest information gain: 0.94 – 0.69 = 0.25

Weather and golf
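The attribute selection step can be reproduced directly from the table. The sketch below is an illustration in Python and not the course's code (the names `entropy`, `split_entropy` and `gains` are assumptions); it computes the entropy before the split, the size-weighted entropy after splitting on each attribute, and the resulting gain.

```python
import math
from collections import Counter, defaultdict

ATTRIBUTES = ("outlook", "temp", "humidity", "windy")

# Weather-and-golf table from the slide; the last field is the "play" label.
ROWS = [
    ("overcast", "cool", "normal", True, "yes"),
    ("overcast", "hot", "high", False, "yes"),
    ("overcast", "hot", "normal", False, "yes"),
    ("overcast", "mild", "high", True, "yes"),
    ("rainy", "cool", "normal", True, "no"),
    ("rainy", "mild", "high", True, "no"),
    ("rainy", "cool", "normal", False, "yes"),
    ("rainy", "mild", "high", False, "yes"),
    ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "hot", "high", False, "no"),
    ("sunny", "hot", "high", True, "no"),
    ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),
]


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def split_entropy(rows, index):
    """Entropy after splitting on one attribute, weighted by branch size."""
    branches = defaultdict(list)
    for row in rows:
        branches[row[index]].append(row[-1])
    return sum(len(b) / len(rows) * entropy(b) for b in branches.values())


before = entropy([row[-1] for row in ROWS])
gains = {name: before - split_entropy(ROWS, i)
         for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)

print(round(before, 2))             # 0.94
print(best, round(gains[best], 2))  # outlook 0.25
```

Printing the whole `gains` dictionary shows how far behind the other three attributes are, which is a quick way to sanity-check a hand calculation.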

Page 7: 5 . Machine Learning

Resulting Decision Tree

• Putting the decision tree together
  – Choose the attribute with the highest Information Gain
  – Create branches for each value of attribute
  – Discretize continuous attributes (choose partition with highest gain)
  – R package: rpart

• Not a perfect classification (still makes some incorrect decisions)
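Discretizing a continuous attribute can be sketched as a search over candidate cut points. The example below is a simplified illustration and not what rpart does internally: it tries the midpoint between each pair of adjacent distinct values and keeps the binary split with the highest information gain. The temperature readings in `SAMPLES` are made up for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def best_threshold(samples):
    """Return (threshold, gain) for the binary split with the highest gain.

    samples is a list of (numeric_value, label) pairs.
    """
    samples = sorted(samples)
    labels = [label for _, label in samples]
    before = entropy(labels)
    best = (None, 0.0)
    for i in range(1, len(samples)):
        low, high = samples[i - 1][0], samples[i][0]
        if low == high:
            continue  # no cut point between two equal values
        left, right = labels[:i], labels[i:]
        after = (len(left) * entropy(left)
                 + len(right) * entropy(right)) / len(labels)
        gain = before - after
        if gain > best[1]:
            best = ((low + high) / 2, gain)
    return best


# Hypothetical numeric temperatures with a play / don't-play label.
SAMPLES = [(60, "yes"), (64, "yes"), (68, "yes"), (72, "yes"),
           (75, "no"), (80, "no"), (83, "yes"), (85, "no")]

threshold, gain = best_threshold(SAMPLES)
print(threshold, round(gain, 2))  # 73.5 0.55
```

The continuous attribute is then treated like a two-valued categorical one ("at most 73.5" versus "above 73.5") when it competes with the other attributes for the next split.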

Page 8: 5 . Machine Learning

Overfitting

• Low error on training data and high error on test data
  – “If the knowledge and data we have are not sufficient to completely determine the correct classifier, […] we run the risk of just hallucinating a classifier that […] simply encodes random quirks in the data.” (P. Domingos, CACM’12)

• Some algorithms can prune the tree to avoid overfitting

[Figure panels: Underfitting, Overfitting]
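A toy experiment can show the "low training error, high test error" pattern. It is a deliberately artificial sketch: the data is synthetic, and a nearest-neighbour memorizer stands in for an unpruned decision tree, so none of this comes from the lecture itself. A few training labels are flipped to play the role of random quirks in the data.

```python
def true_label(x):
    """Ground-truth concept for this toy problem: positive when x >= 50."""
    return 1 if x >= 50 else 0


# Training points whose recorded label is wrong (the "random quirks").
NOISY = {10, 30, 70, 90}

# Training set: even x, with the noisy labels flipped. Test set: odd x, clean.
train = [(x, 1 - true_label(x) if x in NOISY else true_label(x))
         for x in range(0, 100, 2)]
test = [(x, true_label(x)) for x in range(1, 100, 2)]


def memorizer(x):
    """Overfit model: copy the label of the nearest training point.

    Ties between two equally near points go to the smaller x.
    """
    nearest = min(train, key=lambda point: (abs(point[0] - x), point[0]))
    return nearest[1]


def threshold_rule(x):
    """Simple model: a single cut at 50."""
    return 1 if x >= 50 else 0


def error_rate(model, data):
    """Fraction of (x, label) pairs that the model gets wrong."""
    return sum(model(x) != y for x, y in data) / len(data)


print(error_rate(memorizer, train), error_rate(memorizer, test))            # 0.0 0.08
print(error_rate(threshold_rule, train), error_rate(threshold_rule, test))  # 0.08 0.0
```

The memorizer reproduces every training label, including the flipped ones, and then repeats those mistakes on nearby test points. The single cut tolerates a few training errors and generalizes better, which is the effect pruning aims for.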

Page 9: 5 . Machine Learning

Confusion Matrix

              True -                True +
Predicted -   True Negative (TN)    False Negative (FN)
              Correct decision      Type 2 error
Predicted +   False Positive (FP)   True Positive (TP)
              Type 1 error          Correct decision

How to determine if the classifier does a good job?

• You need a training set (ground truth) and a testing set
  – Or you can split your ground truth into two data sets
  – Even better: K-fold cross-validation
    • Split the data into K disjoint folds (sampling without replacement) and train the classifier K times, each time testing on the fold that was left out of training

• You can make a mistake in two different ways
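K-fold cross-validation can be sketched in a few lines. The version below is a minimal illustration using only the Python standard library; the helper names (`k_fold_indices`, `cross_validate`) and the majority-class "classifier" are assumptions chosen to keep the example short and are not part of the lecture.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 once and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(samples, k, train_fn, error_fn):
    """Train k times, each time testing on the one fold left out of training."""
    errors = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in held]
        test = [samples[i] for i in held_out]
        model = train_fn(train)
        errors.append(error_fn(model, test))
    return sum(errors) / k


def train_majority(labels):
    """Hypothetical stand-in classifier: always predict the most common label."""
    return Counter(labels).most_common(1)[0][0]


def error_rate(predicted_label, labels):
    """Fraction of held-out labels that differ from the constant prediction."""
    return sum(label != predicted_label for label in labels) / len(labels)


# The 14 "play" labels from the weather-and-golf table: 9 yes, 5 no.
samples = ["yes"] * 9 + ["no"] * 5
mean_error = cross_validate(samples, k=7, train_fn=train_majority,
                            error_fn=error_rate)
print(round(mean_error, 2))
```

Every sample is used for testing exactly once and for training K-1 times, so the averaged error is less dependent on one lucky or unlucky split than a single train/test division.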

Page 10: 5 . Machine Learning

Evaluating Results

• There is usually a trade-off between FPs and FNs
  – Reducing type 1 errors causes more type 2 errors, and vice versa

• Sensitivity = TP / (TP + FN)
  – Ability to identify true positives
  – Also called true positive rate

• Specificity = TN / (FP + TN)
  – Ability to correctly identify true negatives
  – Also called true negative rate

• Can plot a Receiver Operating Characteristic (ROC) curve
  – R package: ROCR

Is it better to have low FPs or low FNs?

[Figure: Evaluating keystroke dynamics [Killourhy & Maxion, DSN’09]. Axes: FP rate (1 – Specificity) vs. TP rate (Sensitivity)]
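The metrics on this slide, and the points that make up a ROC curve, can be computed from a list of true labels and classifier scores. The sketch below is an illustration in plain Python and not the ROCR API; the labels and scores are invented for the example, and it assumes both classes are present so that no denominator is zero.

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) for binary labels where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    return tp, fp, tn, fn


def sensitivity(tp, fn):
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """True negative rate: TN / (FP + TN)."""
    return tn / (fp + tn)


def roc_points(y_true, scores):
    """One (FP rate, TP rate) point per threshold, from strict to lenient."""
    points = []
    for threshold in sorted(set(scores), reverse=True):
        y_pred = [1 if s >= threshold else 0 for s in scores]
        tp, fp, tn, fn = confusion_counts(y_true, y_pred)
        points.append((1 - specificity(tn, fp), sensitivity(tp, fn)))
    return points


# Hypothetical ground truth and classifier scores (higher = more likely positive).
y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

for fp_rate, tp_rate in roc_points(y_true, scores):
    print(f"FP rate {fp_rate:.2f}  TP rate {tp_rate:.2f}")
```

Lowering the threshold moves along the curve from the bottom left (few false positives, many false negatives) to the top right (the reverse), which is the trade-off described in the first bullet.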

Page 11: 5 . Machine Learning

Unsupervised Learning

• Agglomerative hierarchical clustering (R: hclust)
  – No ground truth; goal is to identify patterns that describe the data
  – Start from individual points and progressively merge nearby clusters
  – Distance metric (e.g. Euclidean, rank correlation, Gower)
  – Linkage: how to aggregate pairwise point distances into cluster distances
    • Average? Minimum (single)? Maximum (complete)? Minimum variance increase (Ward)?

→ Choose classification or clustering features carefully

[Figure: Dendrogram of 1970 cars (features: MPG, weight, drive ratio)]
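The merge loop behind agglomerative clustering is short enough to write out. The sketch below is a naive illustration and not how hclust is implemented: it uses Euclidean distance, supports single, complete and average linkage (Ward is omitted), and stops at a requested number of clusters instead of building the full dendrogram. The points are made up.

```python
import math

# Linkage: how pairwise point distances are aggregated into a cluster distance.
LINKAGES = {
    "single": min,
    "complete": max,
    "average": lambda dists: sum(dists) / len(dists),
}


def agglomerate(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until only n_clusters remain."""
    combine = LINKAGES[linkage]
    clusters = [[p] for p in points]  # start with one cluster per point
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                dists = [math.dist(a, b)
                         for a in clusters[i] for b in clusters[j]]
                d = combine(dists)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical 2-D points forming two well-separated groups.
POINTS = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

for cluster in agglomerate(POINTS, n_clusters=2, linkage="single"):
    print(cluster)
```

Because the distance is computed on raw feature values, a feature with a large numeric range (such as vehicle weight next to drive ratio) dominates the result unless features are rescaled first, which is one reason to choose features carefully.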

Page 12: 5 . Machine Learning

Additional Machine Learning Resources

• Classification
  – We saw: decision trees
  – Other classifiers: naïve Bayes, Support Vector Machines (SVM)
  – Natural language processing
    • Text mining (R package: tm)
    • Sentiment analysis (annotated English wordlist: http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010)

• Clustering
  – We saw: hierarchical clustering
  – Other clustering techniques: k-means, k-medoids, time series clustering
  – Dimensionality reduction: principal component analysis (PCA)

• Machine learning tools
  – For R: http://cran.r-project.org/web/views/MachineLearning.html
  – For Hadoop: Mahout (http://mahout.apache.org/)

Page 13: 5 . Machine Learning

Project Peer-Reviews

• Pilot project reports
  – Reports due today
    • Discuss hypothesis (security problem and data analyzed to solve it)
    • Feasibility study
    • Report data volume, velocity, variety and quality
  – Post report on Piazza

• Pilot project peer reviews
  – Review at least 2 project reports from other students
    • Use skills learned from paper reviews
  – Peer reviews are a part of your grade
  – Post reviews on Piazza (as follow-ups to report posts) by Monday

Page 14: 5 . Machine Learning

Review of Lecture

• What did we learn?
  – Classification
  – Clustering

• What’s next?
  – Paper discussion: ‘Sex, Lies and Cyber-crime Surveys’
  – Next lecture: start of part 2 of course
  – 2 readings / lecture

• Deadline reminders
  – Pilot project reports due today
  – Pilot project reviews due Monday
  – Group project proposals due Monday, 09/30