2013 - Andrei Zmievski: Machine learning para datos
Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions: now and later.
WORK
We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
AWESOME MATH
@a
For those of you who don’t know me: I acquired @a in October 2008. I had a different account earlier, but then @k asked if I wanted it. I know many other single-letter Twitterers.
![Page 12: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/12.jpg)
Advantages
FAME, FORTUNE, FOLLOWERS
Wall Street Journal?!
lol, what?!
140 - length(“@a ”) = 137
MAXIMUM REPLY SPACE!
CONS
Disadvantages: visual filtering is next to impossible. The filtering could be a set of hard-coded rules derived empirically.
I hate humanity.
ADD: Annoyance-Driven Development
The best way to learn something is to be annoyed enough to create a solution based on the tech.
Machine Learning to the Rescue!
REPLYCLEANER
Uses the trained model to classify tweets into good/bad, and blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline. Even with false negatives, it reduces the garbage to where visual filtering is possible.
I still hate humanity
Machine Learning
A branch of Artificial Intelligence. No widely accepted definition.
“Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)
Machine learning concerns the construction and study of systems that can learn from data.
SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnosis, detecting credit card fraud, etc.
supervised / unsupervised
Supervised: labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices) and classification (spam filtering).
Unsupervised: no labels in the dataset; the algorithm needs to find structure on its own. Example: clustering.
We will be talking about classification, a supervised learning process.
Feature
An individual measurable property of the phenomenon under observation. Usually numeric.
Feature Vector
A set of features for an observation. Think of it as an array.
features, parameters, prediction
Example: predicting a house price.
‣ features: # of rooms, sq. m, house age, yard?
‣ feature vector, padded with a leading 1: [1 45.7 …]
‣ parameters (weights): [10 2.3 0.94 -10.1 83.0]
‣ prediction (the dot product of the two vectors): 758,013

The 1 is added to pad the vector: it accounts for the initial offset / bias / intercept weight and simplifies the calculation. The dot product produces a linear predictor.
$$X = \begin{bmatrix} 1 & x_1 & x_2 & \dots \end{bmatrix} \qquad \theta = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \dots \end{bmatrix}$$

$$\theta \cdot X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$$

dot product
X is the input feature vector; θ (theta) holds the weights.
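The dot product above can be sketched in a few lines. The talk’s tooling is PHP, but here is a minimal Python sketch; the example numbers are made up for illustration:

```python
# Sketch (not from the talk): the linear predictor as a dot product.
# The leading 1 in X pairs with the bias weight theta_0.

def linear_predictor(theta, x):
    """Compute theta . X = theta_0 + theta_1*x_1 + theta_2*x_2 + ..."""
    assert len(theta) == len(x)
    return sum(t * xi for t, xi in zip(theta, x))

# X is already padded with a leading 1 for the bias weight.
x = [1, 3, 120.0]         # [bias pad, # of rooms, sq. m]
theta = [50.0, 10.0, 2.5]
print(linear_predictor(theta, x))  # 50 + 30 + 300 = 380.0
```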
training data → learning algorithm → hypothesis
Hypothesis (decision function): what the system has learned so far. The hypothesis is applied to new data.
hθ(X): θ are the parameters, X is the input data, and the output y is the prediction.
The task of our algorithm is to determine the parameters of the hypothesis.
LINEAR REGRESSION
[Scatter plot with fitted line: whisky price ($, 40 to 200) vs. whisky age (5 to 35 years)]
Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.
LOGISTIC REGRESSION

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta \cdot X$$

Logistic function (also called the sigmoid function). Asymptotes at 0 and 1, crosses 0.5 at the origin. z is just our old dot product, the linear predictor; the sigmoid transforms the unbounded output into a bounded one.

$$h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$$

This is the probability that y = 1 for input X. If the hypothesis describes spam, then given X = body of email, hθ(X) = 0.7 means there’s a 70% chance it’s spam. Thresholding on that value is up to you.
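A quick Python sketch of the sigmoid and the logistic hypothesis (illustrative, not the talk’s PHP code):

```python
import math

# The logistic hypothesis h_theta(X) = 1 / (1 + e^(-theta . X)).
# Output is bounded to (0, 1) and read as P(y = 1 | X).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))  # the linear predictor
    return sigmoid(z)

print(sigmoid(0))                         # 0.5: the curve crosses 0.5 at the origin
print(hypothesis([0.1, 0.1], [1, 12.0]))  # ~0.786
```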
Building the Tool
Corpus
A collection of source data used for training and testing the model.
Twitter → phirehose → MongoDB
phirehose hooks into the streaming API; collected 8500 tweets.
Feature Identification
independent & discriminant
Independent: feature A should not co-occur (correlate) highly with feature B.
Discriminant: a feature should provide uniquely classifiable data (e.g., which letter a tweet starts with is not a good feature).
possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
feature = extractor(tweet)
For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
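Sketched in Python (the talk’s extractors were PHP; these particular features and function names are illustrative, not the actual implementation):

```python
import re

# Each extractor takes a tweet and returns one numeric (float) feature value.

def ends_with_a(tweet):
    return 1.0 if tweet.rstrip().endswith("@a") else 0.0

def num_hashtags(tweet):
    return float(len(re.findall(r"#\w+", tweet)))

def num_mentions(tweet):
    return float(len(re.findall(r"@\w+", tweet)))

EXTRACTORS = [ends_with_a, num_hashtags, num_mentions]

def feature_vector(tweet):
    # Pad with 1 for the bias/intercept weight, as described earlier.
    return [1.0] + [f(tweet) for f in EXTRACTORS]

print(feature_vector("nice talk #php #ml @a"))  # [1.0, 1.0, 2.0, 1.0]
```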
corpus → extractors → feature vectors
Run the set of extractor functions over the corpus and build up the feature vectors (an array of arrays); save them to the DB.
Language Matters
There is a high correlation between the language of a tweet and its category (good/bad).
Indonesian or Tagalog? Garbage.
Top 12 Languages

| code | language | tweets |
|------|------------|-------:|
| id | Indonesian | 3548 |
| en | English | 1804 |
| tl | Tagalog | 733 |
| es | Spanish | 329 |
| so | Somali | 305 |
| ja | Japanese | 300 |
| pt | Portuguese | 262 |
| ar | Arabic | 256 |
| nl | Dutch | 150 |
| it | Italian | 137 |
| sw | Swahili | 118 |
| fr | French | 92 |

I guarantee you people aren’t tweeting at me in Swahili.
Language Detection
pear / Text_LanguageDetect, pecl / textcat
Can’t trust the language field in the user’s profile data. Used character N-grams and character sets for detection. Detection has its own error rate, so it needs some post-processing.
English or Not English?
✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; otherwise:
✓ If it is not a character-set-determined language, try harder:
✓ Tokenize into words
✓ Diff against the English vocabulary
✓ If words remain, run a part-of-speech tagger on each
✓ For NNS, VBZ, and VBD run a stemming algorithm
✓ If the result is in the English vocabulary, remove it from the remaining set
✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining) / size(words)
✓ If the ratio is < 20%, pretend it’s English

A lot of this is heuristic-based, after some trial and error. It seems to help with my corpus.
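A toy Python sketch of the unusual-word-ratio step only, with a stand-in vocabulary and no POS tagging or stemming (the real pipeline uses both, plus a proper English word list):

```python
# Stand-in vocabulary for illustration; the real check uses a full English lexicon.
ENGLISH_VOCAB = {"this", "is", "a", "great", "talk", "thanks"}

def unusual_word_ratio(text):
    words = text.lower().split()
    if not words:
        return 0.0
    remaining = [w for w in words if w not in ENGLISH_VOCAB]
    return len(remaining) / len(words)

def pretend_english(text, threshold=0.20):
    # If fewer than 20% of the words are unknown, treat the tweet as English.
    return unusual_word_ratio(text) < threshold

print(pretend_english("this is a great talk thanks"))  # True
print(pretend_english("kamusta na kayo diyan"))        # False
```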
BINARY CLASSIFICATION
Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.
INPUT: feature vectors
OUTPUT: labels (good/bad)
Had my input and output.
BIAS CORRECTION
One more thing to address.
BIAS CORRECTION
BAD vs. GOOD: 99% of the corpus was bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; need to adjust for the bias.
OVERSAMPLING
Use multiple copies of the good tweets to equalize with the bad.
Problem: the bias is very high; each good tweet would have to be copied about 100 times and would not contribute any variance to the good category.
UNDERSAMPLING
Drop most of the bad tweets to equalize with the good.
Problem: the total corpus ends up being < 200 tweets, not enough for training.
SYNTHETIC OVERSAMPLING
Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.
| chance | feature |
|--------|---------|
| 90% | “good” language |
| 70% | no hashtags |
| 25% | 1 hashtag |
| 5% | 2 hashtags |
| 2% | @a at the end |
| 85% | rand length > 10 |

The actual synthesis is somewhat more complex and was also arrived at by trial and error. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
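The weighted random selection can be sketched as follows in Python; the feature names, ordering, and length distribution are illustrative assumptions, and the real synthesis was more complex:

```python
import random

# Draw each feature value according to the table's probabilities.

def synthesize_good_tweet(rng):
    good_language = 1.0 if rng.random() < 0.90 else 0.0
    hashtags = rng.choices([0.0, 1.0, 2.0], weights=[70, 25, 5])[0]
    a_at_end = 1.0 if rng.random() < 0.02 else 0.0
    length = rng.uniform(10, 140) if rng.random() < 0.85 else rng.uniform(0, 10)
    return [1.0, good_language, hashtags, a_at_end, length]  # leading 1 = bias pad

rng = random.Random(42)  # seeded for reproducibility
synthetic = [synthesize_good_tweet(rng) for _ in range(500)]
print(len(synthetic), len(synthetic[0]))  # 500 5
```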
Model Training
We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
COST FUNCTION
REALITY vs. PREDICTION
Measures how far the prediction of the system is from reality. The cost depends on the parameters: the lower the cost, the closer we are to the ideal parameters for the model.
COST FUNCTION

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}),\, y^{(i)}\right)$$
LOGISTIC COST

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
LOGISTIC COST
[Plot: cost curves for y = 1 and y = 0; a correct guess gives Cost = 0, an incorrect guess gives a huge Cost]
When y = 1 and h(x) is 1 (a correct guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm. Same for y = 0.
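The piecewise cost is tiny to implement; a Python sketch with illustrative probabilities:

```python
import math

# The piecewise logistic cost from the formula above.

def cost(h, y):
    # h = h_theta(x), the predicted probability; y = the true label (0 or 1)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(round(cost(0.99, 1), 4))  # ~0.0101: confident correct guess, tiny cost
print(round(cost(0.5, 1), 4))   # 0.6931: unsure, moderate cost
print(round(cost(0.01, 1), 4))  # 4.6052: confident wrong guess, huge cost
```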
MINIMIZE COST over θ
Finding the values of θ that minimize the cost.
GRADIENT DESCENT
Random starting point. Pretend you’re standing on a hill: find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down a hill.
✓i ↵= �✓i@J(✓)
@✓i
GRADIENT DESCENTEach step adjusts the parameters according to the slope
![Page 102: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/102.jpg)
↵= �
each parameter
✓i ✓i@J(✓)
@✓i
Have to update them simultaneously (the whole vector at a time).
![Page 103: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/103.jpg)
✓i = �✓i
learning rate
↵@J(✓)
@✓i
Controls how big a step you takeIf α is big have an aggressive gradient descentIf α is small take tiny stepsIf too small, tiny steps, takes too longIf too big, can overshoot the minimum and fail to converge
![Page 104: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/104.jpg)
θi = θi - α · ∂J(θ)/∂θi
(∂J(θ)/∂θi: the derivative, aka “the slope”)
The slope indicates the steepness of the descent step for each weight, i.e. the direction. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to approach 0; past that point are diminishing returns.
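The effect of the learning rate is easy to demonstrate on a toy function (a Python sketch, not part of the talk's PHP tool):

```python
def gradient_descent(grad, theta, alpha, iters):
    """Take `iters` steps against the gradient; alpha is the learning rate."""
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# Minimize J(theta) = theta^2, whose gradient is 2*theta; the minimum is at 0.
grad = lambda t: 2 * t
good = gradient_descent(grad, theta=5.0, alpha=0.1, iters=100)    # converges near 0
tiny = gradient_descent(grad, theta=5.0, alpha=0.001, iters=100)  # too slow: barely moved
huge = gradient_descent(grad, theta=5.0, alpha=1.5, iters=100)    # overshoots: diverges
```

The same three regimes (fine, too timid, too aggressive) apply unchanged to the logistic cost.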
![Page 105: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/105.jpg)
θi = θi - α · Σj=1..m (h(Xj) - yj) · Xji
THE UPDATE ALGORITHM
The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
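One way to get the simultaneous update for free is to build the new parameter vector from the old one in a single expression (a Python sketch; `step` and `h` are illustrative names):

```python
import math

def h(theta, x):
    """Logistic hypothesis: sigmoid of the dot product theta . x."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def step(theta, X, Y, alpha):
    """One gradient-descent step for logistic regression.
    Every component reads the *old* theta; the new vector replaces it at once."""
    return [
        t_i - alpha * sum((h(theta, x) - y) * x[i] for x, y in zip(X, Y))
        for i, t_i in enumerate(theta)
    ]
```

Because the comprehension only reads `theta` and returns a fresh list, no temporaries need to be managed by hand.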
![Page 106: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/106.jpg)
X1 = [1 12.0] X2 = [1 -3.5]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 107: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/107.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0] X2 = [1 -3.5]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 108: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/108.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 109: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/109.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 110: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/110.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 111: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/111.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 112: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/112.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
T0
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 113: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/113.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 114: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/114.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
   = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 115: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/115.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
   = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
   = 0.088
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 116: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/116.jpg)
y1 = 1   y2 = 0
T1 = 0.1 - 0.05 • ((h(X1) - y1) • X11 + (h(X2) - y2) • X21)
   = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5)
   = 0.305
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Note that the hypotheses don’t change within the iteration.
![Page 117: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/117.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]
θ = [T0 T1]
α = 0.05
Replace parameter (weights) vector with the temporaries.
![Page 118: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/118.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]
θ = [0.088 0.305]
α = 0.05
Do next iteration
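The whole worked iteration above can be checked in a few lines (a Python sketch; the slide's 0.088 matches up to rounding):

```python
import math

def h(theta, x):
    """Logistic hypothesis over one data point."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

X = [[1, 12.0], [1, -3.5]]
Y = [1, 0]
theta = [0.1, 0.1]
alpha = 0.05

# Hypotheses are computed once and stay fixed for the whole iteration.
preds = [h(theta, x) for x in X]
theta = [
    t_i - alpha * sum((p - y) * x[i] for p, x, y in zip(preds, X, Y))
    for i, t_i in enumerate(theta)
]
# theta is now approximately [0.089, 0.305], matching the slides.
```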
![Page 119: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/119.jpg)
CROSS TRAINING
Used to assess the results of the training.
![Page 120: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/120.jpg)
DATA
![Page 121: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/121.jpg)
DATA
TRAINING
![Page 122: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/122.jpg)
DATA
TRAINING TEST
Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough". Pick the best parameters and save them (DB or elsewhere).
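A minimal split helper looks like this (a Python sketch; the 70/30 ratio and the `seed` are illustrative, not from the talk):

```python
import random

def split(data, test_fraction=0.3, seed=42):
    """Shuffle, then hold out a fraction of the labeled data for testing."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed makes the split reproducible
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

train_set, test_set = split(range(100))
```

Fitting on `train_set` and scoring only on `test_set` is what keeps the "good enough" judgment honest.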
![Page 123: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/123.jpg)
Putting It All Together
Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, which naturally comes with the need to do certain error handling, etc. Once we get the actual tweet, though...
![Page 124: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/124.jpg)
Load the model
The weights we have calculated via training
Easiest is to load them from DB (can be used to test different models).
![Page 125: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/125.jpg)
HARDCODED RULES
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 126: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/126.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 127: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/127.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
@ mentions of friends
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 128: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/128.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
tweets from friends
@ mentions of friends
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
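The pre-filter amounts to a short predicate like this (a Python sketch; the function and argument names are illustrative):

```python
def skip(tweet_text, author, friends):
    """Hardcoded rules: skip truncated retweets, tweets from friends,
    and @-mentions of friends, before the model ever runs."""
    if tweet_text.startswith("RT @"):          # truncated retweet
        return True
    if author in friends:                      # tweet from a friend
        return True
    if any("@" + f in tweet_text for f in friends):  # @-mention of a friend
        return True
    return False
```

Everything that survives the predicate goes on to the classifier.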
![Page 129: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/129.jpg)
Classifying Tweets
This is the moment we’ve been waiting for.
![Page 130: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/130.jpg)
Classifying Tweets
GOOD
This is the moment we’ve been waiting for.
![Page 131: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/131.jpg)
Classifying Tweets
GOOD BAD
This is the moment we’ve been waiting for.
![Page 132: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/132.jpg)
hθ(X) = 1 / (1 + e^-(θ·X))
Remember this?
First is our hypothesis.
![Page 133: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/133.jpg)
hθ(X) = 1 / (1 + e^-(θ·X))
Remember this?
θ·X = θ0 + θ1X1 + θ2X2 + ...
First is our hypothesis.
![Page 134: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/134.jpg)
Finally
hθ(X) = 1 / (1 + e^-(θ0 + θ1X1 + θ2X2 + ...))
If h > threshold, the tweet is bad; otherwise it is good
Remember that the output of h() is in 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
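The decision rule itself is a one-liner on top of the hypothesis (a Python sketch; `classify` is an illustrative name):

```python
import math

def classify(theta, x, threshold=0.9):
    """Return 'bad' if h exceeds the threshold, 'good' otherwise.
    A high threshold (0.9 here, as in the talk) trades some recall
    for fewer false positives, i.e. fewer wrongly blocked users."""
    z = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-z))
    return "bad" if h > threshold else "good"
```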
![Page 135: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/135.jpg)
extract features
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
![Page 136: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/136.jpg)
extract featuresrun the model
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
![Page 137: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/137.jpg)
extract featuresrun the model
act on the result
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
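The three steps can be sketched as three small functions (Python; the features in `extract_features` are illustrative stand-ins, not the talk's real feature set):

```python
import math

def extract_features(tweet):
    """Step 1: build a feature vector (bias term, word count, @-mention count)."""
    words = tweet.split()
    return [1.0, float(len(words)), float(sum(w.startswith("@") for w in words))]

def run_model(theta, x):
    """Step 2: evaluate the logistic hypothesis over the features."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def act(h, threshold=0.9):
    """Step 3: act on the classifier output."""
    return "block user" if h > threshold else "pass through"
```

In the real tool, `theta` would be the trained weights loaded from the DB.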
![Page 138: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/138.jpg)
BAD? Block user!
Also save the tweet to DB for future analysis.
![Page 139: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/139.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
![Page 140: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/140.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Connection handling, backoff in case of problems, undocumented API errors, etc.
![Page 141: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/141.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-No way for blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
![Page 142: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/142.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Some tweets are shown on the website, but never seen through the API.
![Page 143: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/143.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Lots of room for improvement.
![Page 144: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/144.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
PHP sucks at math-y stuff
-Lots of room for improvement.
![Page 145: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/145.jpg)
NEXT STEPS
★ Realtime feedback
★ More features
★ Grammar analysis
★ Support Vector Machines or decision trees
★ Clockwork Raven for manual classification
★ Other minimization algos: BFGS, conjugate gradient
★ Wish pecl/scikit-learn existed
Click on the tweets that are bad and it immediately incorporates them into the model.Grammar analysis to eliminate the common "@a bar" or "two @a time" occurrences.SVMs more appropriate for biased data sets.Farm out manual classification to Mechanical Turk.May help avoid local minima, no need to pick alpha, often faster than GD.
![Page 146: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/146.jpg)
TOOLS
★ MongoDB
★ pear/Text_LanguageDetect
★ English vocabulary corpus
★ Parts-Of-Speech tagging
★ SplFixedArray
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
MongoDB (great fit for JSON data)English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/SplFixedArray in PHP (memory savings and slightly faster)
![Page 147: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/147.jpg)
LEARN★ Coursera.org ML course★ Ian Barber’s blog★ FastML.com
![Page 148: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/148.jpg)
Questions?