Introduction to Machine Learning & Classification
Machine Learning
Chris Sharkey | @shark2900
What do you think of when we say machine learning?
Big words:
• Hadoop
• Terabyte
• Petabyte
• NoSQL
• Data Science
• D3
• Visualization
• Machine learning
What is machine learning?
“Predictive or descriptive modeling which learns from past experience or data to build models which can predict the future”
Past Data (known outcome) → Machine Learning → Model
Model + New Data (unknown outcome) → Predicted Outcome
Will John play golf?

Date    Weather  Temperature  Sally going?  Did John golf?
Sept 1  Sunny    92°F         Yes           Yes
Sept 2  Cloudy   84°F         No            No
Sept 3  Raining  84°F         No            Yes
Sept 4  Sunny    95°F         Yes           Yes

Date    Weather  Temperature  Sally going?  Will John golf?
Sept 5  Cloudy   87°F         No            ?
We want a model based on John’s past behavior to predict what he will do in the future. Can we use ML?
Yes. This is a classification problem
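The golf table above can be written as records a program could learn from. A minimal sketch (the field names are mine, not from the talk), using a crude lookup on past weather:

```python
# The golf table as Python records.
past = [
    {"date": "Sept 1", "weather": "Sunny",   "temp_f": 92, "sally": "Yes", "golfed": "Yes"},
    {"date": "Sept 2", "weather": "Cloudy",  "temp_f": 84, "sally": "No",  "golfed": "No"},
    {"date": "Sept 3", "weather": "Raining", "temp_f": 84, "sally": "No",  "golfed": "Yes"},
    {"date": "Sept 4", "weather": "Sunny",   "temp_f": 95, "sally": "Yes", "golfed": "Yes"},
]

def lookup_by_weather(weather):
    """Crude 'model': what did John most often do in this weather?"""
    outcomes = [r["golfed"] for r in past if r["weather"] == weather]
    if not outcomes:
        return None  # weather never seen before
    return max(set(outcomes), key=outcomes.count)

# Sept 5 is cloudy; on the one past cloudy day, John stayed home.
print(lookup_by_weather("Cloudy"))  # No
```

The classifiers below are more principled versions of this same idea: learn a rule from the known-outcome rows, then apply it to the unknown-outcome row.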
ZeroR
Establishes a base line
Naïve Bayes
Probabilistic model
OneR
Single Rule
J4.5 / C4.5
Decision Tree
Upgrade our example
age, blood pressure, specific gravity, albumin, sugar, red blood cells, pus cell, pus cell clumps, potassium, blood glucose, blood urea, serum creatinine, sodium, hemoglobin, packed cell volume, white blood cell count, red blood cell count, hypertension, diabetes mellitus, coronary artery disease, appetite, pedal edema, anemia, stage
Data Set • 319 instances or people • 25 attributes or variables
Machine Learning• ZeroR • OneR • Naïve Bayes • J4.5 / C4.5
Model
Blood test data for new individuals with unknown disease status
Predict whether an individual has CKD and, if so, the stage of their disease
ZeroR
Past data (known outcome)
New instance
Classified
Classify new data as the most 'popular' class:
• Build a frequency table
• Choose the 'most popular', or most frequent, class
How did ZeroR do? • Correctly classified 28.2% of the time • Rule: always guess a new instance (person) has stage three kidney disease
• 28.2% correct classification rate is our base line • Correct classification rates above 28.2% are better than guessing
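ZeroR fits in a few lines. The labels below are made-up stand-ins for the CKD classes, not the talk's actual data:

```python
from collections import Counter

def zero_r(train_labels):
    """ZeroR: ignore all attributes and always predict the most frequent class."""
    return Counter(train_labels).most_common(1)[0][0]

# Made-up class labels standing in for the CKD data set:
labels = ["stage 3", "stage 3", "stage 2", "healthy", "stage 3", "stage 5"]
rule = zero_r(labels)
baseline = labels.count(rule) / len(labels)
print(rule, baseline)  # stage 3 0.5
```

The training accuracy of this constant rule is exactly the frequency of the majority class, which is why it makes a natural base line.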
OneR
Past data (known outcome)
New instance
Classified
Build a frequency table for each attribute; this generates a rule for each value of each attribute.
Choose the attribute whose rule has the highest correct classification rate.
How did OneR do? • Correctly classified 80.2% of the time • Rule based on serum creatinine
• < 0.85 is healthy • < 1.15 is stage 2 • < 2.25 is stage 3 • >= 2.25 is stage 5
• Single rule is created and responsible for classification• High classification rate indicates a single value has high influence in predicting class
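The two OneR steps can be sketched directly; here it is run on the small golf table from earlier (my own encoding of that slide, not the CKD data):

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """OneR: for each attribute, map each value to its most frequent class,
    then keep the attribute whose rule classifies the training rows best."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        table = defaultdict(Counter)          # value -> class counts
        for r in rows:
            table[r[attr]][r[target]] += 1
        rule = {v: c.most_common(1)[0][0] for v, c in table.items()}
        correct = sum(1 for r in rows if rule[r[attr]] == r[target])
        if best is None or correct > best[2]:
            best = (attr, rule, correct)
    return best  # (chosen attribute, its rule, training hits)

golf = [
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
    {"weather": "Cloudy",  "sally": "No",  "golf": "No"},
    {"weather": "Raining", "sally": "No",  "golf": "Yes"},
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
]
attr, rule, hits = one_r(golf, "golf")
print(attr, rule)  # weather {'Sunny': 'Yes', 'Cloudy': 'No', 'Raining': 'Yes'}
```

On this toy table the weather rule gets all four training rows right, so OneR picks weather, just as the real run picked serum creatinine on the CKD data.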
Naïve Bayes
Past data (known outcome)
New instance
Classified
Build a frequency table for each attribute.
Determine probabilities for the values of each attribute.
Determine conditional probabilities for the values of each attribute.
For each class, multiply the conditional probabilities of the new instance's values with the probability of the class.
Multiply all of the calculated probabilities together.
Choose the most probable class.
How did Naïve Bayes do? • Correctly classified 56.6% of the time • Conditional and overall probabilities constitute a rule • High classification rate indicates attributes have more equal influence • No iterative process, so it is faster on larger data sets
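The multiply-probabilities steps above can be sketched as a tiny categorical Naïve Bayes, again on my encoding of the golf table rather than the CKD data (note that real implementations smooth the counts to avoid zero probabilities):

```python
from collections import Counter

def naive_bayes(rows, target, new):
    """Tiny categorical Naive Bayes: score(class) = P(class) * prod P(value | class)."""
    classes = Counter(r[target] for r in rows)
    scores = {}
    for c, n in classes.items():
        score = n / len(rows)                      # prior P(class)
        for attr, val in new.items():
            match = sum(1 for r in rows
                        if r[target] == c and r[attr] == val)
            score *= match / n                     # conditional P(value | class)
        scores[c] = score
    return max(scores, key=scores.get)

golf = [
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
    {"weather": "Cloudy",  "sally": "No",  "golf": "No"},
    {"weather": "Raining", "sally": "No",  "golf": "Yes"},
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
]
print(naive_bayes(golf, "golf", {"weather": "Cloudy", "sally": "No"}))  # No
```

Because every attribute contributes a factor to the product, no single attribute dominates, which is the "more equal influence" point above.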
J4.5 / C4.5
Past data (known outcome)
New instance
Classified
Follow decision tree to a leaf or class
Top-down recursive algorithm determining splitting points based on information gain
How did J4.5 do? • Correctly classified 88.4% of the time • Decision tree generated • Balance between discrimination of OneR and fairness of Naïve Bayes
• Decision trees are popular, intuitive, easy to create and easy to interpret
• People like decision trees. They tell a nice story
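The "splitting points based on information gain" step is the heart of C4.5-style trees. A sketch of how the root split would be chosen on the golf table (my encoding; a full C4.5 also handles numeric attributes, gain ratio, and pruning):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    """Entropy of the labels minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    split = 0.0
    for val in set(r[attr] for r in rows):
        subset = [r[target] for r in rows if r[attr] == val]
        split += len(subset) / len(rows) * entropy(subset)
    return base - split

golf = [
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
    {"weather": "Cloudy",  "sally": "No",  "golf": "No"},
    {"weather": "Raining", "sally": "No",  "golf": "Yes"},
    {"weather": "Sunny",   "sally": "Yes", "golf": "Yes"},
]
gains = {a: info_gain(golf, a, "golf") for a in ("weather", "sally")}
print(gains)  # weather has the higher gain, so it becomes the root split
```

The algorithm splits on the highest-gain attribute, then recurses on each branch with the remaining attributes until the leaves are (nearly) pure classes.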
ZeroR • Correct classification rate – 28.2% • Established base line accuracy • Always guess stage 3 CKD
Naïve Bayes • Correct classification rate – 56.6% • Established overall probabilities to pick the most probable class
OneR • Correct classification rate – 80.2% • Serum Creatinine • < 0.85 – Healthy • < 1.15 – Stage 2 • < 2.25 – Stage 3 • >= 2.25 – Stage 5
J4.5 / C4.5 • Correct classification rate – 88.4%
Does this make sense?
Other important concepts in machine learning.
Cross Validation • Hold out one of ten slices and build the model on the other nine slices • Test on the 'held-out' slice • Hold out a different slice, build the model on the other nine slices, and test on the new 'held-out' slice • Repeat until every slice has been held out once
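The slicing scheme above can be sketched in a few lines. This just generates the index splits; each pair would feed one train/test round:

```python
def k_fold_splits(n, k=10):
    """Split range(n) into k contiguous slices; each slice is held out
    exactly once as the test set while the rest form the training set."""
    for i in range(k):
        test = set(range(i * n // k, (i + 1) * n // k))
        train = [j for j in range(n) if j not in test]
        yield train, sorted(test)

folds = list(k_fold_splits(100))
print(len(folds), len(folds[0][0]), len(folds[0][1]))  # 10 90 10
```

Averaging the correct-classification rate across the ten held-out slices gives a fairer estimate than testing on the data the model was trained on (in practice the rows are also shuffled first so the slices are representative).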
Overfitting • Classification rule that is ‘over fit’ or so specific to the training data set that it does not generalize to the broader population
• Limiting the complexity of rules can help prevent overfitting • Large, representative data sets can help fight overfitting • A persistent problem in machine learning • Be a skeptical data scientist
Questions?