2013 - Andrei Zmievski: Machine learning para datos
Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions: now and later.
WORK
We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
AWESOME MATH
@a
For those of you who don’t know me: I acquired @a in October 2008. I had a different account earlier, but then @k asked if I wanted it. I know many other single-letter Twitterers.
![Page 12: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/12.jpg)
Advantages
FAME, FORTUNE, FOLLOWERS
Wall Street Journal?!
lol, what?!
140 - length(“@a ”) = 137
MAXIMUM REPLY SPACE!
CONS
Disadvantages: visual filtering is next to impossible. The filtering could be a set of hard-coded rules derived empirically.
I hate humanity.
ADD: Annoyance-Driven Development
The best way to learn something is to be annoyed enough to create a solution based on the tech.
Machine Learning to the Rescue!
REPLYCLEANER
Uses the trained model to classify tweets into good/bad, and blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline. Even with false negatives, it reduces the garbage to where visual filtering is possible.
I still hate humanity
Machine Learning
A branch of Artificial Intelligence. No widely accepted definition.
“Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)
Machine learning concerns the construction and study of systems that can learn from data.
SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnosis, detecting credit card fraud, etc.
supervised / unsupervised
Supervised: labeled dataset; training maps inputs to desired outputs. Examples: regression (predicting house prices) and classification (spam filtering).
Unsupervised: no labels in the dataset; the algorithm needs to find structure on its own. Example: clustering.
We will be talking about classification, a supervised learning process.
Feature
An individual measurable property of the phenomenon under observation. Usually numeric.
Feature Vector
A set of features for an observation. Think of it as an array.
features, parameters, prediction
Example: predicting a house price.
‣ features: # of rooms, sq. m, house age, yard?
‣ feature vector, padded with a leading 1: [1 45.7 …]
‣ parameters (weights): [10 2.3 0.94 -10.1 83.0]
‣ prediction (the dot product of the two vectors): 758,013

The 1 is added to pad the vector: it accounts for the initial offset / bias / intercept weight and simplifies the calculation. The dot product produces a linear predictor.
$$X = \begin{bmatrix} 1 & x_1 & x_2 & \dots \end{bmatrix} \qquad \theta = \begin{bmatrix} \theta_0 & \theta_1 & \theta_2 & \dots \end{bmatrix}$$

$$\theta \cdot X = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots$$

dot product
X is the input feature vector; θ (theta) holds the weights.
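The dot product above can be sketched in a few lines. The talk’s tooling is PHP, but here is a minimal Python sketch; the example numbers are made up for illustration:

```python
# Sketch (not from the talk): the linear predictor as a dot product.
# The leading 1 in X pairs with the bias weight theta_0.

def linear_predictor(theta, x):
    """Compute theta . X = theta_0 + theta_1*x_1 + theta_2*x_2 + ..."""
    assert len(theta) == len(x)
    return sum(t * xi for t, xi in zip(theta, x))

# X is already padded with a leading 1 for the bias weight.
x = [1, 3, 120.0]         # [bias pad, # of rooms, sq. m]
theta = [50.0, 10.0, 2.5]
print(linear_predictor(theta, x))  # 50 + 30 + 300 = 380.0
```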
training data → learning algorithm → hypothesis
Hypothesis (decision function): what the system has learned so far. The hypothesis is applied to new data.
hθ(X): θ are the parameters, X is the input data, and the output y is the prediction.
The task of our algorithm is to determine the parameters of the hypothesis.
LINEAR REGRESSION
[Scatter plot with fitted line: whisky price ($, 40 to 200) vs. whisky age (5 to 35 years)]
Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded; thresholding on some value is tricky and does not produce good results.
LOGISTIC REGRESSION

$$g(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta \cdot X$$

Logistic function (also called the sigmoid function). Asymptotes at 0 and 1, crosses 0.5 at the origin. z is just our old dot product, the linear predictor; the sigmoid transforms the unbounded output into a bounded one.

$$h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$$

This is the probability that y = 1 for input X. If the hypothesis describes spam, then given X = body of email, hθ(X) = 0.7 means there’s a 70% chance it’s spam. Thresholding on that value is up to you.
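A quick Python sketch of the sigmoid and the logistic hypothesis (illustrative, not the talk’s PHP code):

```python
import math

# The logistic hypothesis h_theta(X) = 1 / (1 + e^(-theta . X)).
# Output is bounded to (0, 1) and read as P(y = 1 | X).

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def hypothesis(theta, x):
    z = sum(t * xi for t, xi in zip(theta, x))  # the linear predictor
    return sigmoid(z)

print(sigmoid(0))                         # 0.5: the curve crosses 0.5 at the origin
print(hypothesis([0.1, 0.1], [1, 12.0]))  # ~0.786
```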
Building the Tool
Corpus
A collection of source data used for training and testing the model.
Twitter → phirehose → MongoDB
phirehose hooks into the streaming API; collected 8500 tweets.
Feature Identification
independent & discriminant
Independent: feature A should not co-occur (correlate) highly with feature B.
Discriminant: a feature should provide uniquely classifiable data (e.g., which letter a tweet starts with is not a good feature).
possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
feature = extractor(tweet)
For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
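Sketched in Python (the talk’s extractors were PHP; these particular features and function names are illustrative, not the actual implementation):

```python
import re

# Each extractor takes a tweet and returns one numeric (float) feature value.

def ends_with_a(tweet):
    return 1.0 if tweet.rstrip().endswith("@a") else 0.0

def num_hashtags(tweet):
    return float(len(re.findall(r"#\w+", tweet)))

def num_mentions(tweet):
    return float(len(re.findall(r"@\w+", tweet)))

EXTRACTORS = [ends_with_a, num_hashtags, num_mentions]

def feature_vector(tweet):
    # Pad with 1 for the bias/intercept weight, as described earlier.
    return [1.0] + [f(tweet) for f in EXTRACTORS]

print(feature_vector("nice talk #php #ml @a"))  # [1.0, 1.0, 2.0, 1.0]
```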
corpus → extractors → feature vectors
Run the set of extractor functions over the corpus and build up the feature vectors (an array of arrays); save them to the DB.
Language Matters
There is a high correlation between the language of a tweet and its category (good/bad).
Indonesian or Tagalog? Garbage.
Top 12 Languages

| code | language | tweets |
|------|------------|-------:|
| id | Indonesian | 3548 |
| en | English | 1804 |
| tl | Tagalog | 733 |
| es | Spanish | 329 |
| so | Somali | 305 |
| ja | Japanese | 300 |
| pt | Portuguese | 262 |
| ar | Arabic | 256 |
| nl | Dutch | 150 |
| it | Italian | 137 |
| sw | Swahili | 118 |
| fr | French | 92 |

I guarantee you people aren’t tweeting at me in Swahili.
Language Detection
pear / Text_LanguageDetect, pecl / textcat
Can’t trust the language field in the user’s profile data. Used character N-grams and character sets for detection. Detection has its own error rate, so it needs some post-processing.
English or Not English?
✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; otherwise:
✓ If it is not a character-set-determined language, try harder:
✓ Tokenize into words
✓ Diff against the English vocabulary
✓ If words remain, run a part-of-speech tagger on each
✓ For NNS, VBZ, and VBD run a stemming algorithm
✓ If the result is in the English vocabulary, remove it from the remaining set
✓ If the remaining list is not empty, calculate: unusual_word_ratio = size(remaining) / size(words)
✓ If the ratio is < 20%, pretend it’s English

A lot of this is heuristic-based, after some trial and error. It seems to help with my corpus.
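A toy Python sketch of the unusual-word-ratio step only, with a stand-in vocabulary and no POS tagging or stemming (the real pipeline uses both, plus a proper English word list):

```python
# Stand-in vocabulary for illustration; the real check uses a full English lexicon.
ENGLISH_VOCAB = {"this", "is", "a", "great", "talk", "thanks"}

def unusual_word_ratio(text):
    words = text.lower().split()
    if not words:
        return 0.0
    remaining = [w for w in words if w not in ENGLISH_VOCAB]
    return len(remaining) / len(words)

def pretend_english(text, threshold=0.20):
    # If fewer than 20% of the words are unknown, treat the tweet as English.
    return unusual_word_ratio(text) < threshold

print(pretend_english("this is a great talk thanks"))  # True
print(pretend_english("kamusta na kayo diyan"))        # False
```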
BINARY CLASSIFICATION
Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.
INPUT: feature vectors
OUTPUT: labels (good/bad)
Had my input and output.
BIAS CORRECTION
One more thing to address.
BIAS CORRECTION
BAD vs. GOOD: 99% of the corpus was bad (fewer than 100 tweets were good). Training a model as-is would not produce good results; need to adjust for the bias.
OVERSAMPLING
Use multiple copies of the good tweets to equalize with the bad.
Problem: the bias is very high; each good tweet would have to be copied about 100 times and would not contribute any variance to the good category.
UNDERSAMPLING
Drop most of the bad tweets to equalize with the good.
Problem: the total corpus ends up being < 200 tweets, not enough for training.
SYNTHETIC OVERSAMPLING
Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.
| chance | feature |
|--------|---------|
| 90% | “good” language |
| 70% | no hashtags |
| 25% | 1 hashtag |
| 5% | 2 hashtags |
| 2% | @a at the end |
| 85% | rand length > 10 |

The actual synthesis is somewhat more complex and was also arrived at by trial and error. Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
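The weighted random selection can be sketched as follows in Python; the feature names, ordering, and length distribution are illustrative assumptions, and the real synthesis was more complex:

```python
import random

# Draw each feature value according to the table's probabilities.

def synthesize_good_tweet(rng):
    good_language = 1.0 if rng.random() < 0.90 else 0.0
    hashtags = rng.choices([0.0, 1.0, 2.0], weights=[70, 25, 5])[0]
    a_at_end = 1.0 if rng.random() < 0.02 else 0.0
    length = rng.uniform(10, 140) if rng.random() < 0.85 else rng.uniform(0, 10)
    return [1.0, good_language, hashtags, a_at_end, length]  # leading 1 = bias pad

rng = random.Random(42)  # seeded for reproducibility
synthetic = [synthesize_good_tweet(rng) for _ in range(500)]
print(len(synthetic), len(synthetic[0]))  # 500 5
```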
Model Training
We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?
COST FUNCTION
REALITY vs. PREDICTION
Measures how far the prediction of the system is from reality. The cost depends on the parameters: the lower the cost, the closer we are to the ideal parameters for the model.
COST FUNCTION

$$J(\theta) = \frac{1}{m} \sum_{i=1}^{m} \mathrm{Cost}\left(h_\theta(x^{(i)}),\, y^{(i)}\right)$$
LOGISTIC COST

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log\left(h_\theta(x)\right) & \text{if } y = 1 \\ -\log\left(1 - h_\theta(x)\right) & \text{if } y = 0 \end{cases}$$
LOGISTIC COST
[Plot: cost curves for y = 1 and y = 0; a correct guess gives Cost = 0, an incorrect guess gives a huge Cost]
When y = 1 and h(x) is 1 (a correct guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm. Same for y = 0.
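The piecewise cost is tiny to implement; a Python sketch with illustrative probabilities:

```python
import math

# The piecewise logistic cost from the formula above.

def cost(h, y):
    # h = h_theta(x), the predicted probability; y = the true label (0 or 1)
    return -math.log(h) if y == 1 else -math.log(1.0 - h)

print(round(cost(0.99, 1), 4))  # ~0.0101: confident correct guess, tiny cost
print(round(cost(0.5, 1), 4))   # 0.6931: unsure, moderate cost
print(round(cost(0.01, 1), 4))  # 4.6052: confident wrong guess, huge cost
```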
MINIMIZE COST over θ
Finding the values of θ that minimize the cost.
GRADIENT DESCENT
Random starting point. Pretend you’re standing on a hill: find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down a hill.
✓i ↵= �✓i@J(✓)
@✓i
GRADIENT DESCENTEach step adjusts the parameters according to the slope
![Page 102: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/102.jpg)
↵= �
each parameter
✓i ✓i@J(✓)
@✓i
Have to update them simultaneously (the whole vector at a time).
![Page 103: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/103.jpg)
✓i = �✓i
learning rate
↵@J(✓)
@✓i
Controls how big a step you takeIf α is big have an aggressive gradient descentIf α is small take tiny stepsIf too small, tiny steps, takes too longIf too big, can overshoot the minimum and fail to converge
![Page 104: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/104.jpg)
θi = θi - α · ∂J(θ)/∂θi
(∂J(θ)/∂θi: the derivative, aka “the slope”)
The slope indicates the steepness of the descent step for each weight, i.e. the direction. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to approach 0; past that point are diminishing returns.
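The effect of the learning rate is easy to demonstrate on a toy function (a Python sketch, not part of the talk's PHP tool):

```python
def gradient_descent(grad, theta, alpha, iters):
    """Take `iters` steps against the gradient; alpha is the learning rate."""
    for _ in range(iters):
        theta = theta - alpha * grad(theta)
    return theta

# Minimize J(theta) = theta^2, whose gradient is 2*theta; the minimum is at 0.
grad = lambda t: 2 * t
good = gradient_descent(grad, theta=5.0, alpha=0.1, iters=100)    # converges near 0
tiny = gradient_descent(grad, theta=5.0, alpha=0.001, iters=100)  # too slow: barely moved
huge = gradient_descent(grad, theta=5.0, alpha=1.5, iters=100)    # overshoots: diverges
```

The same three regimes (fine, too timid, too aggressive) apply unchanged to the logistic cost.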
![Page 105: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/105.jpg)
θi = θi - α · Σj=1..m (h(Xj) - yj) · Xji
THE UPDATE ALGORITHM
The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
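One way to get the simultaneous update for free is to build the new parameter vector from the old one in a single expression (a Python sketch; `step` and `h` are illustrative names):

```python
import math

def h(theta, x):
    """Logistic hypothesis: sigmoid of the dot product theta . x."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def step(theta, X, Y, alpha):
    """One gradient-descent step for logistic regression.
    Every component reads the *old* theta; the new vector replaces it at once."""
    return [
        t_i - alpha * sum((h(theta, x) - y) * x[i] for x, y in zip(X, Y))
        for i, t_i in enumerate(theta)
    ]
```

Because the comprehension only reads `theta` and returns a fresh list, no temporaries need to be managed by hand.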
![Page 106: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/106.jpg)
X1 = [1 12.0] X2 = [1 -3.5]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 107: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/107.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0] X2 = [1 -3.5]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 108: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/108.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 109: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/109.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 110: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/110.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 111: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/111.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 112: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/112.jpg)
y1 = 1   y2 = 0
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
T0
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 113: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/113.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 114: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/114.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
   = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 115: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/115.jpg)
y1 = 1   y2 = 0
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X10 + (h(X2) - y2) • X20)
   = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1)
   = 0.088
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Hypothesis for each data point based on current parameters.Each parameter is updated in order and result is saved to a temporary.
![Page 116: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/116.jpg)
y1 = 1   y2 = 0
T1 = 0.1 - 0.05 • ((h(X1) - y1) • X11 + (h(X2) - y2) • X21)
   = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5)
   = 0.305
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
X1 = [1 12.0]   X2 = [1 -3.5]   θ = [0.1 0.1]
α = 0.05
Note that the hypotheses don’t change within the iteration.
![Page 117: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/117.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]
θ = [T0 T1]
α = 0.05
Replace parameter (weights) vector with the temporaries.
![Page 118: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/118.jpg)
y1 = 1   y2 = 0
X1 = [1 12.0]   X2 = [1 -3.5]
θ = [0.088 0.305]
α = 0.05
Do next iteration
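The whole worked iteration above can be checked in a few lines (a Python sketch; the slide's 0.088 matches up to rounding):

```python
import math

def h(theta, x):
    """Logistic hypothesis over one data point."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

X = [[1, 12.0], [1, -3.5]]
Y = [1, 0]
theta = [0.1, 0.1]
alpha = 0.05

# Hypotheses are computed once and stay fixed for the whole iteration.
preds = [h(theta, x) for x in X]
theta = [
    t_i - alpha * sum((p - y) * x[i] for p, x, y in zip(preds, X, Y))
    for i, t_i in enumerate(theta)
]
# theta is now approximately [0.089, 0.305], matching the slides.
```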
![Page 119: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/119.jpg)
CROSS TRAINING
Used to assess the results of the training.
![Page 120: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/120.jpg)
DATA
![Page 121: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/121.jpg)
DATA
TRAINING
![Page 122: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/122.jpg)
DATA
TRAINING TEST
Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough". Pick the best parameters and save them (DB or elsewhere).
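A minimal split helper looks like this (a Python sketch; the 70/30 ratio and the `seed` are illustrative, not from the talk):

```python
import random

def split(data, test_fraction=0.3, seed=42):
    """Shuffle, then hold out a fraction of the labeled data for testing."""
    data = list(data)
    random.Random(seed).shuffle(data)  # fixed seed makes the split reproducible
    cut = int(len(data) * (1 - test_fraction))
    return data[:cut], data[cut:]

train_set, test_set = split(range(100))
```

Fitting on `train_set` and scoring only on `test_set` is what keeps the "good enough" judgment honest.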
![Page 123: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/123.jpg)
Putting It All Together
Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, which naturally comes with the need to do certain error handling, etc. Once we get the actual tweet, though...
![Page 124: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/124.jpg)
Load the model
The weights we have calculated via training
Easiest is to load them from DB (can be used to test different models).
![Page 125: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/125.jpg)
HARDCODED RULES
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 126: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/126.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 127: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/127.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
@ mentions of friends
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
![Page 128: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/128.jpg)
HARDCODED RULES
SKIP
truncated retweets: "RT @A ..."
tweets from friends
@ mentions of friends
We apply some hardcoded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
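The pre-filter amounts to a short predicate like this (a Python sketch; the function and argument names are illustrative):

```python
def skip(tweet_text, author, friends):
    """Hardcoded rules: skip truncated retweets, tweets from friends,
    and @-mentions of friends, before the model ever runs."""
    if tweet_text.startswith("RT @"):          # truncated retweet
        return True
    if author in friends:                      # tweet from a friend
        return True
    if any("@" + f in tweet_text for f in friends):  # @-mention of a friend
        return True
    return False
```

Everything that survives the predicate goes on to the classifier.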
![Page 129: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/129.jpg)
Classifying Tweets
This is the moment we’ve been waiting for.
![Page 130: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/130.jpg)
Classifying Tweets
GOOD
This is the moment we’ve been waiting for.
![Page 131: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/131.jpg)
Classifying Tweets
GOOD BAD
This is the moment we’ve been waiting for.
![Page 132: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/132.jpg)
hθ(X) = 1 / (1 + e^-(θ·X))
Remember this?
First is our hypothesis.
![Page 133: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/133.jpg)
hθ(X) = 1 / (1 + e^-(θ·X))
Remember this?
θ·X = θ0 + θ1X1 + θ2X2 + ...
First is our hypothesis.
![Page 134: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/134.jpg)
Finally
hθ(X) = 1 / (1 + e^-(θ0 + θ1X1 + θ2X2 + ...))
If h > threshold, the tweet is bad; otherwise it is good
Remember that the output of h() is in 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
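The decision rule itself is a one-liner on top of the hypothesis (a Python sketch; `classify` is an illustrative name):

```python
import math

def classify(theta, x, threshold=0.9):
    """Return 'bad' if h exceeds the threshold, 'good' otherwise.
    A high threshold (0.9 here, as in the talk) trades some recall
    for fewer false positives, i.e. fewer wrongly blocked users."""
    z = sum(t * xi for t, xi in zip(theta, x))
    h = 1.0 / (1.0 + math.exp(-z))
    return "bad" if h > threshold else "good"
```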
![Page 135: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/135.jpg)
extract features
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
![Page 136: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/136.jpg)
extract featuresrun the model
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
![Page 137: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/137.jpg)
extract featuresrun the model
act on the result
3 simple steps
Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (input the calculated feature values into the equation). Use the output of the classifier.
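The three steps can be sketched as three small functions (Python; the features in `extract_features` are illustrative stand-ins, not the talk's real feature set):

```python
import math

def extract_features(tweet):
    """Step 1: build a feature vector (bias term, word count, @-mention count)."""
    words = tweet.split()
    return [1.0, float(len(words)), float(sum(w.startswith("@") for w in words))]

def run_model(theta, x):
    """Step 2: evaluate the logistic hypothesis over the features."""
    return 1.0 / (1.0 + math.exp(-sum(t * xi for t, xi in zip(theta, x))))

def act(h, threshold=0.9):
    """Step 3: act on the classifier output."""
    return "block user" if h > threshold else "pass through"
```

In the real tool, `theta` would be the trained weights loaded from the DB.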
![Page 138: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/138.jpg)
BAD? Block user!
Also save the tweet to DB for future analysis.
![Page 139: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/139.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
![Page 140: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/140.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Connection handling, backoff in case of problems, undocumented API errors, etc.
![Page 141: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/141.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-No way for blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
![Page 142: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/142.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Some tweets are shown on the website, but never seen through the API.
![Page 143: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/143.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
-Lots of room for improvement.
![Page 144: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/144.jpg)
Lessons Learned
Blocking is the only option (and is final)
Streaming API delivery is incomplete
ReplyCleaner judged to be ~80% effective
Twitter API is a pain in the rear
PHP sucks at math-y stuff
-Lots of room for improvement.
![Page 145: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/145.jpg)
NEXT STEPS
★ Realtime feedback
★ More features
★ Grammar analysis
★ Support Vector Machines or decision trees
★ Clockwork Raven for manual classification
★ Other minimization algos: BFGS, conjugate gradient
★ Wish pecl/scikit-learn existed
Click on the tweets that are bad and it immediately incorporates them into the model.Grammar analysis to eliminate the common "@a bar" or "two @a time" occurrences.SVMs more appropriate for biased data sets.Farm out manual classification to Mechanical Turk.May help avoid local minima, no need to pick alpha, often faster than GD.
![Page 146: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/146.jpg)
TOOLS
★ MongoDB
★ pear/Text_LanguageDetect
★ English vocabulary corpus
★ Parts-Of-Speech tagging
★ SplFixedArray
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
MongoDB (great fit for JSON data)English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/SplFixedArray in PHP (memory savings and slightly faster)
![Page 147: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/147.jpg)
LEARN★ Coursera.org ML course★ Ian Barber’s blog★ FastML.com
![Page 148: 2013 - Andrei Zmievski: Machine learning para datos](https://reader033.fdocuments.us/reader033/viewer/2022052316/558df5df1a28aba7598b47e8/html5/thumbnails/148.jpg)
Questions?