Data Science in Industry: Applying Machine Learning to Real-world Challenges
obtained Ph.D. in data mining and machine learning
worked in both academia and industry
Not just a researcher, but a coder & hacker
What is data science?
data is everywhere...
data science helps
extract knowledge from data...
Data scientists investigate complex data problems
find and interpret rich data sources
Visualize the data
get insights from data
from insights….
Questions?
Now is the fun part...
Data Science techniques!
Data science 101
● regression
● classification
● clustering
● ranking (not covered in this lecture)
● recommendation (not covered in this lecture)
Regression
What is regression?
A slightly more formal definition...
models a functional relationship between
an input variable x and
a response variable y
[Figure: axes x and y]
find the equation
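As a minimal sketch of "finding the equation", the snippet below fits a straight line y = a + b*x by ordinary least squares. The choice of a straight line, the function names `fit_line` and `predict`, and the data points are assumptions made for illustration; they are not from the lecture.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x; returns (intercept, slope)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of x around its mean, and co-variation of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def predict(model, x):
    """Apply the fitted equation to a new input x."""
    intercept, slope = model
    return intercept + slope * x


# Made-up data lying exactly on y = 1 + 2x (illustration only)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
model = fit_line(xs, ys)  # intercept 1.0, slope 2.0
```

Once the equation is found, `predict(model, 4.0)` estimates the response for an input that was not in the data.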
What else can regression do?
Predicting who may change jobs!
[Figure: axes x and y]
Recap - regression
Classification
identify to which of a set of categories a new data point belongs
Spam or Not spam?
Credit approve or not?
Optical character recognition
Document classification
SVM
Decision tree
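To make the idea concrete, here is a rough sketch of the smallest possible decision tree: a single split on one numeric feature (often called a decision stump). It is a simplification chosen for illustration, not the full decision tree or SVM algorithm. The feature (number of exclamation marks in an email), the data, and the names `train_stump` and `classify` are all hypothetical.

```python
def train_stump(xs, labels):
    """Pick the threshold on one numeric feature that classifies the most
    training points correctly. Returns (threshold, label_below, label_above)."""
    classes = sorted(set(labels))
    if len(classes) != 2:
        raise ValueError("this sketch handles exactly two classes")
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best = None
    for t in candidates:
        # Try both ways of assigning the two classes to the two sides
        for below, above in (classes, classes[::-1]):
            correct = sum(
                (below if x <= t else above) == y for x, y in zip(xs, labels)
            )
            if best is None or correct > best[0]:
                best = (correct, t, below, above)
    if best is None:
        raise ValueError("need at least two distinct feature values")
    _, t, below, above = best
    return t, below, above


def classify(stump, x):
    """Assign a new data point to one of the two categories."""
    threshold, below, above = stump
    return below if x <= threshold else above


# Hypothetical training data: exclamation marks per email, with labels
counts = [0, 1, 1, 2, 5, 6, 7, 9]
labels = ["not spam"] * 4 + ["spam"] * 4
stump = train_stump(counts, labels)
```

A new email is then categorized with `classify(stump, 8)`. A real decision tree repeats this kind of split recursively over many features.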
Use classification to...
find who you are in social networks
Missing data
Outdated data
Non-standard data
Why do we want to classify?
Understanding users’ social roles is crucial to many
social network applications
including advertising targeting,
marketing, personalization,
recommendation, etc.
Finding out who you really are...
manual labeling is time-consuming
and error-prone
Human learning
Machine learning
SVM
Decision tree
How accurate can we get?
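One common way to answer this question is accuracy: the fraction of predictions that match the true labels. The sketch below assumes the predictions come from some classifier evaluated on labeled examples it was not trained on; the labels shown are made up.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that equal the true labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up labels for four held-out examples (illustration only)
predicted = ["spam", "spam", "not spam", "not spam"]
actual = ["spam", "not spam", "not spam", "not spam"]
score = accuracy(predicted, actual)  # 3 of 4 correct: 0.75
```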
Can we further improve?
Clustering
grouping a set of data points
data points in the same group (cluster) are more similar to each other
than to those in other groups (clusters)
k-means clustering algorithm
k clusters
k = 3
step 1: randomly select k points
as centroids
3 random centroids
step 2: assign every data point to
the nearest centroid
step 3: calculate mean of each cluster
as the new centroid
repeat: assign clusters based on
the new centroids
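The steps above can be sketched in plain Python as follows. This is a minimal illustration, not a production implementation: the stopping rule (stop when assignments no longer change), the `max_iterations` and `seed` parameters, the handling of empty clusters, and the sample points are assumptions added here.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iterations=100, seed=0):
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    rng = random.Random(seed)
    # Step 1: randomly select k points as centroids
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign every data point to the nearest centroid
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so repeating would give the same result
        assignments = new_assignments
        # Step 3: calculate the mean of each cluster as the new centroid
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = mean_point(members)
    return centroids, assignments


# Made-up 2-D points forming three loose groups (illustration only)
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]
k = 3
centroids, assignments = k_means(points, k)
```

Because the initial centroids are random, different seeds can produce different clusterings, and the result is not guaranteed to match the grouping a person would draw by eye.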
How to use clustering to solve big data problems?
Machine data is massive
1 Tb/day is normal
no one has time to read all data...
Clustering comes to the rescue!
clustering algorithm summarizes big data into a few groups
each group represents a number of similar data points
investigating data points one by one
just investigating the clusters!
Things to consider in practice...
scalability
velocity
variety
real-time
What’s next?
Recap
● regression
● classification
● clustering
This presentation was initially created for a guest lecture at Utah State University for teaching and education purposes.
Thanks!