Machine learning overview (with SAS software)

Copyright © 2012, SAS Institute Inc. All rights reserved.

MACHINE LEARNING WITH SAS WORKSHOP

GETTING THE MOST OUT OF YOUR DATA

Longhow Lam


AGENDA AND SOME READING MATERIAL

Intro & positioning of Machine learning

SAS platform for Machine learning

Overview of Specific methods

Some examples

Further reading

An experimental comparison of classification techniques for imbalanced credit scoring data sets using SAS® Enterprise Miner

http://support.sas.com/resources/papers/proceedings12/129-2012.pdf

Benchmarking state-of-the-art classification algorithms for credit scoring: A ten-year update

http://www.business-school.ed.ac.uk/waf/crc_archive/2013/42.pdf

Highly recommended for more detail:

The Elements of Statistical Learning, Hastie, Tibshirani & Friedman

http://www-stat.stanford.edu/~tibs/ElemStatLearn/



LONGHOW LAM SHORT BIO

MSc Mathematics (1995) Vrije Universiteit Amsterdam (Dutch "drs." degree in mathematics)

MTD Applied Statistics (1997) Technical University Delft (two-year postgraduate traineeship in applied statistics)

10+ years of SAS experience (Base / Stat / Guide / Miner / VA / VS)

10+ years of R experience (An Introduction to R)

10+ years of predictive modeling experience

ABNAMRO – Risk modeler

Basel, Credit risk, ALM models

Business&Decision – Quantitative consultant

ING Belgium, Fortis

Leaseplan, Belgium Post

Experian – data miner

Collection Score, Delphi credit score, consulting

Follow me: @longhowlam

http://cran.xl-mirror.nl/manuals.html


INTRO MACHINE LEARNING

Wikipedia:

“Machine learning is a scientific discipline that deals with the construction and study of algorithms that can learn from data. Such algorithms operate by building a model based on inputs and using that to make predictions or decisions, rather than following only explicitly programmed instructions.”


MACHINE LEARNING AND SOME OTHER TERMS YOU OFTEN HEAR

Statistical modeling

Supervised learning

Unsupervised learning

Clustering

Data mining

Machine learning

Dimension reduction

Association rules

Recommender

Autoencoders

Self-organizing maps


SAS SOFTWARE

FOR MACHINE LEARNING (AND DATA MINING)


THE ANALYTICS LIFECYCLE

Stages: identify / formulate problem, data preparation, data exploration, transform & select, build model, validate model, deploy model, evaluate / monitor results

Roles: business manager, business analyst, data miner / data scientist, IT systems / management

SAS tools: SAS Visual Analytics, SAS Visual Statistics, SAS Enterprise Guide, Enterprise Miner / Text Miner, SAS IMSTAT / Recommender, SAS Model Manager, SAS Decision Manager, SAS In-Database Scoring


EASY TO USE GUI FOR MACHINE LEARNING COMBINED WITH CODE LIBRARIES

PROC hpbnet data = creditdata

structure = markovblanket;

model default = x1 LTV income age;

selection = Y;

RUN;


HIGH PERFORMANCE MACHINE LEARNING

Machine learning algorithms designed to run on single-blade or multi-blade distributed-memory environments


EASILY DEPLOYABLE MACHINE LEARNING WITH SAS

Manage rules + data + models

Deployment flexibility: batch, real time, stored process, in database

Drive reuse and consistency


PREDICT SOMEONE’S INCOME

Predict someone's income from his/her age

Collect some data (the analytical base table)

Plot the data (Income against Age)

Fit a line: Income = 15.2 + 1.102 × Age

IS THIS MACHINE LEARNING?
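A line such as the one on this slide can be fitted with ordinary least squares. The sketch below is only an illustration in Python: the ages and incomes are made-up numbers, not the data behind the slide, and the function name fit_line is hypothetical.

```python
# Least-squares fit of Income = a + b * Age on a tiny made-up data set.
ages = [25, 30, 35, 40, 45, 50]                   # hypothetical ages
incomes = [42.0, 48.5, 53.0, 59.5, 64.0, 70.5]    # hypothetical incomes (x 1000)

def fit_line(x, y):
    """Return intercept a and slope b that minimise the squared error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b

a, b = fit_line(ages, incomes)
print(f"Income = {a:.2f} + {b:.3f} x Age")
```

Whether such a fit already counts as machine learning is exactly the question the slide asks; the code only shows that the model is learned from data rather than specified by hand.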


MACHINE LEARNING ADDRESSING SOME MODELING ISSUES

The problem may not be linear: X², X³, log(X), sqrt(X), 1/X, …?

You do not have just one input variable: X1, X2, X3, …, X567

Interactions and correlations between input variables

[Figure: income versus age, shown separately for male and female]

Analytical base table → derived inputs


MACHINE LEARNING WHY IT CAN MATTER € € €

Suppose we have an untargeted direct mailing of 100,000 ‘letters’ to randomly sampled prospects:

Conversion rate is around 1%. Profit per conversion is € 80, cost per mailing is € 0.70.

Total profit (ROI) = 100,000 × 1% × € 80 − 100,000 × € 0.70 = € 10,000

Now we have a targeted mailing with a machine learning predictive model that uses prospect input data to distinguish between high and low responders.



Decile       N  Conversion  Profit (€)  Cumulative (€)
     1  10,000       2.00%       9,000           9,000
     2  10,000       1.50%       5,000          14,000
     3  10,000       1.00%       1,000          15,000
     4  10,000       1.00%       1,000          16,000
     5  10,000       1.00%       1,000          17,000
     6  10,000       1.00%       1,000          18,000
     7  10,000       1.00%       1,000          19,000
     8  10,000       0.80%        -600          18,400
     9  10,000       0.50%      -3,000          15,400
    10  10,000       0.20%      -5,400          10,000

The profit from using a model to send letters only to the first 7 deciles is now: € 19,000 (instead of € 10,000)

If you have 100 such campaigns a year that means an increase of € 0.9 mln!!
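The decile arithmetic can be reproduced in a few lines. This is a sketch in Python using the figures from the slide (€ 80 profit per conversion, € 0.70 per letter, 10,000 letters per decile); the function names are illustrative only.

```python
PROFIT_PER_CONVERSION = 80.0   # euro, from the slide
COST_PER_LETTER = 0.70         # euro, from the slide
LETTERS_PER_DECILE = 10_000

# Conversion rate per decile of the first model, best-scored decile first.
conversion = [0.02, 0.015, 0.01, 0.01, 0.01, 0.01, 0.01, 0.008, 0.005, 0.002]

def decile_profits(rates):
    """Profit per decile: letters * (rate * profit per conversion - cost per letter)."""
    return [LETTERS_PER_DECILE * (r * PROFIT_PER_CONVERSION - COST_PER_LETTER)
            for r in rates]

def best_cutoff(rates):
    """Return (number of deciles to mail, cumulative profit at that cutoff)."""
    best_k, best_total, running = 0, 0.0, 0.0
    for k, profit in enumerate(decile_profits(rates), start=1):
        running += profit
        if running > best_total:
            best_k, best_total = k, running
    return best_k, best_total

k, total = best_cutoff(conversion)
print(f"Mail the first {k} deciles, profit about {round(total)} euro")
```

A decile is worth mailing as long as its conversion rate exceeds 0.70 / 80 = 0.875%, which is why the cumulative profit peaks after the last decile with 1.00% conversion.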




Decile       N  Conversion  Profit (€)  Cumulative (€)
     1  10,000       3.00%      17,000          17,000
     2  10,000       2.00%       9,000          26,000
     3  10,000       1.40%       4,200          30,200
     4  10,000       1.15%       2,200          32,400
     5  10,000       1.00%       1,000          33,400
     6  10,000       0.60%      -2,200          31,200
     7  10,000       0.40%      -3,800          27,400
     8  10,000       0.30%      -4,600          22,800
     9  10,000       0.10%      -6,200          16,600
    10  10,000       0.05%      -6,600          10,000

The profit from using a much better model to send letters only to the first 5 deciles is now: € 33,400 (instead of € 10,000)

With 100 such campaigns a year that means an increase of € 2.34 mln!!


MACHINE LEARNING WHY IT CAN MATTER? € € €


Decile       N  Conversion  Profit (€)  Cumulative (€)
     1  10,000       3.35%      19,800          19,800
     2  10,000       2.23%      10,840          30,640
     3  10,000       1.30%       3,400          34,040
     4  10,000       1.10%       1,800          35,840
     5  10,000       1.00%       1,000          36,840
     6  10,000       0.55%      -2,600          34,240
     7  10,000       0.28%      -4,760          29,480
     8  10,000       0.25%      -5,000          24,480
     9  10,000       0.05%      -6,600          17,880
    10  10,000       0.02%      -6,840          11,040

Now let's suppose we have an even slightly better model than the last one: the profit for the first 5 deciles is € 36,840

With 100 such campaigns a year that means an increase of € 2.68 mln!!


OVERVIEW OF SPECIFIC

MACHINE LEARNING METHODS

Classical regression

Decision trees

Dimension reduction

Bagging & Boosting

Support vector machines

K-Nearest Neighbour

Neural networks / deep learning

Bayesian networks

Text mining

Recommendation engine


“CLASSICAL” REGRESSION


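LINEAR & LOGISTIC REGRESSION

Numeric target variable (linear regression): $\text{Income} = a + b \times \text{Age}$

Binary target variable (logistic regression): $P(\text{Churn}) = \frac{1}{1 + \exp(a + b \times \text{Age})}$

[Figures: Income versus Age with a fitted line; P(Churn) versus Age with a fitted curve between 0 and 1]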


SPLINE REGRESSION MODELING NON LINEARITIES

Often there is a nonlinear relation between Y (or logit(Y)) and X. Options:

• Transformation of inputs: X², X³, log(X) etc.

• Buckets / binning of variables

• Smoothing splines


SPLINE REGRESSION MODELING NON LINEARITIES

Smoothing splines: piecewise polynomials that are glued together at knots. The amount of smoothing is controlled by a penalty parameter λ.

Two special cases for λ:

λ = 0: any function that interpolates the data

λ = ∞: simple least-squares line fit

Choose λ by cross validation
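The formula that defines λ did not survive in this transcript. As a reminder, and assuming the slide used the standard definition (as given in The Elements of Statistical Learning), a smoothing spline is the function f that minimises a penalised residual sum of squares:

```latex
\min_{f} \; \sum_{i=1}^{n} \bigl( y_i - f(x_i) \bigr)^2 \;+\; \lambda \int f''(t)^2 \, dt
```

The first term rewards closeness to the data and the second penalises curvature. With λ = 0 there is no penalty, so any interpolating function attains the minimum; as λ → ∞ no curvature is tolerated, which leaves the least-squares straight line.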


SPLINE REGRESSION OPEL ASTRA CAR EXAMPLE

Extracted data from car sales site. For many cars we have the

kilometres driven and the car price. For the Opel Astra we have 2360 cars:

What is the relation between km driven and car sales price?

Too much smoothing and too little smoothing


SPLINE REGRESSION OPEL ASTRA CAR EXAMPLE

0.2 is the optimal smoothing parameter


Some other car makes/models with spline estimates of car depreciation versus kilometres driven.

Hmmm.. my Renault Clio looks nice, but after 50,000 km I only have 46% of the original value left…


SPLINE REGRESSION MODELING NON LINEARITIES

In SAS we have the TPSPLINE, LOESS and ADAPTIVEREG procedures to fit multivariate regression splines

Supports:

More than one input

Linear, logistic, Poisson and other GLM regressions

Combines both regression splines and model selection methods

Partitioning of data into training, validation, and testing roles


DECISION TREES


DECISION TREES

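How does it work? A simple example

Suppose we have the following group of people: 50% response, 50% no response

We have/know Age and Marital Status

[Tree diagram: the root node (50% / 50%) is split on Age ≤ 45 versus Age > 45, giving nodes with 30% / 70% and 60% / 40%; a further split on marital status (Married versus Divorced / Unmarried) gives nodes with 20% / 80% and 60% / 40%]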


DECISION TREES REGRESSION & CLASSIFICATION

Target X1 X2 X3 X4 X5

Y 12 A 456 1.2 X

N 21 B 456 1.5 X

Y 32 A 545 1.3 U

Y 34 C 443 1.1 U

N 23 A 345 1.7 U

N 13 B 567 1.2 X

N 45 A 654 1.9 X

… … … … … …

… … … … … …

Y 46 A 657 2.1 X

A recursive splitting algorithm:

1. Loop through all inputs

2. Determine per input how to split

3. Take the best input to split

4. On the two new data sets apply 1, 2, 3 again…

5. Stop somewhere…

• How to split X1 or X2?

• When to stop?
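Step 2 of the algorithm above (how to split one numeric input) can be sketched as an exhaustive search over thresholds. The Python code below is an illustration only: it uses the weighted mean squared error of the two child nodes as the criterion, and the names mean_squared_error and best_split are hypothetical. The X1 values and the target come from the complete rows of the example table, with the target recoded as Y = 1, N = 0.

```python
def mean_squared_error(values):
    """Mean squared deviation from the node mean (0 for an empty node)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)

def best_split(x, y):
    """Try every threshold on one numeric input; return (threshold, weighted MSE)."""
    best = (None, float("inf"))
    for t in sorted(set(x))[1:]:          # candidate splits: x < t versus x >= t
        left = [yi for xi, yi in zip(x, y) if xi < t]
        right = [yi for xi, yi in zip(x, y) if xi >= t]
        score = (len(left) * mean_squared_error(left)
                 + len(right) * mean_squared_error(right)) / len(y)
        if score < best[1]:
            best = (t, score)
    return best

x1 = [12, 21, 32, 34, 23, 13, 45, 46]    # X1 column of the example table
target = [1, 0, 1, 1, 0, 0, 0, 1]        # Target recoded: Y -> 1, N -> 0
print(best_split(x1, target))
```

A full tree repeats this search over all inputs (step 1), keeps the input with the lowest score (step 3) and recurses on the two resulting subsets (step 4).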


DECISION TREES REGRESSION & CLASSIFICATION

How to split? The number of branches per split is usually 2 or 3. More splits will exhaust the data too fast.

Why is the split X1 < t1 better than X1 < s1?

Regression: mean squared error

Classification: misclassification rate, cross-entropy, chi-squared

[Figure: regression tree, comparing split s1 with split t1 of Y against x by mean squared error]


DECISION TREES REGRESSION & CLASSIFICATION

[Figure: classification tree, comparing split s1 with split t1 on x by misclassification rate]


Decision trees (regression & classification)

When to stop?

Not too early, not too late!

Pruning: remove parts of the tree


DECISION TREES SOME COMMON TYPES

CHAID (chi-squared automatic interaction detection)

C4.5 / C5.0

CART (Classification and Regression Trees)

The difference is mainly in the different splitting options


Decision trees pros and cons

Pros

Interaction between variables

Interpretable rules

Missing values are easy to incorporate

Cons

Unstable

“Lack of smoothness”: poor fit of obvious (non)linear relations

[Figure: response rate by tree segment (male / female, income < 45K, age < 33)]

[Figure: tree fit for the Opel Astras]


DIMENSION REDUCTION


PRINCIPAL COMPONENTS ANALYSIS

Linear transformation of data to uncorrelated data

The transformation W is such that

The largest variance is in the first coordinate

The second largest variance is in the second coordinate

Etc.


PRINCIPAL COMPONENTS ANALYSIS

[Figure: scatter plot of data points in the (X1, X2) plane]




PRINCIPAL COMPONENTS ANALYSIS

The math behind

With two dimensions:

$$\begin{pmatrix} p_{11} & p_{21} \\ \vdots & \vdots \\ p_{1n} & p_{2n} \end{pmatrix} = \begin{pmatrix} x_{11} & x_{21} \\ \vdots & \vdots \\ x_{1n} & x_{2n} \end{pmatrix} \begin{pmatrix} w_{11} & w_{21} \\ w_{12} & w_{22} \end{pmatrix}$$

In general: $P = X W$

w11 and w12 are the loadings corresponding to the first principal component.

w21 and w22 are the loadings corresponding to the second principal component.

It turns out that the columns of W are the eigenvectors of the matrix $X^T X$
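For the two-dimensional case the eigenvectors of XᵀX can be written down in closed form, which makes the transformation easy to check by hand. The Python sketch below assumes centred data and uses the rotation angle that diagonalises the symmetric 2 by 2 matrix; the function name pca_2d and the sample points are illustrative only.

```python
import math

def pca_2d(points):
    """Return ((w1, w2), P): the two loading vectors and the scores P = X W."""
    n = len(points)
    mean_1 = sum(p[0] for p in points) / n
    mean_2 = sum(p[1] for p in points) / n
    X = [(p[0] - mean_1, p[1] - mean_2) for p in points]   # centre the data
    # Entries of the symmetric matrix X^T X = [[a, b], [b, d]]
    a = sum(x1 * x1 for x1, _ in X)
    b = sum(x1 * x2 for x1, x2 in X)
    d = sum(x2 * x2 for _, x2 in X)
    # Rotation angle of the eigenvector that belongs to the largest eigenvalue
    theta = 0.5 * math.atan2(2 * b, a - d)
    w1 = (math.cos(theta), math.sin(theta))     # loadings (w11, w12)
    w2 = (-math.sin(theta), math.cos(theta))    # loadings (w21, w22)
    P = [(x1 * w1[0] + x2 * w1[1], x1 * w2[0] + x2 * w2[1]) for x1, x2 in X]
    return (w1, w2), P

(w1, w2), scores = pca_2d([(2, 1), (3, 4), (5, 4), (7, 8), (8, 7)])
print(w1, w2)
```

The resulting score columns are uncorrelated and the first column carries the largest variance, which is the defining property listed on the earlier slide.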


PRINCIPAL COMPONENTS ANALYSIS

Scaling the inputs is important here

Applications of PCA

Dimension reduction

Visualisation

Outlier / anomaly detection

PCA regression

Use the principal components instead of the original inputs


PRINCIPAL COMPONENTS DIMENSION REDUCTION

$P = X W$, with dimensions (10000 by 100) = (10000 by 100)(100 by 100)

Now only take the first L columns of W:

$P_L = X W_L$, with dimensions (10000 by 2) = (10000 by 100)(100 by 2)

For example, for visualization only use the first 2 or 3 columns, so that $P_L$ only has 2 or 3 columns that can be visualized in scatter or contour plots


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: $A = U \Sigma V^T$

Σ: diagonal with r singular values [could be a large number]


SINGULAR VALUE DECOMPOSITION

Matrix SVD decomposition: $A = U \Sigma V^T$ [could be a large number]

Take only k << r singular values: $A_k = U_k \Sigma_k V_k^T$

A data point d can now be represented by a k-dimensional point


SVD EXAMPLE USING MY SON AS AN EXPERIMENT

Original: 2448 × 3264 ~ 8 mln numbers

SVD: 15 largest SV's, 1% of the data

SVD: 75 largest SV's, 5% of the data
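The percentages on this slide follow from counting how many numbers a rank-k approximation needs. The sketch below is a back-of-the-envelope check in Python; it assumes the picture is stored as a single 2448 by 3264 matrix (for example greyscale), and the function name is illustrative.

```python
def truncated_svd_fraction(rows, cols, k):
    """Share of the original rows * cols numbers needed to store U_k, Sigma_k and V_k."""
    stored = k * (rows + cols + 1)   # k columns of U, k rows of V^T, k singular values
    return stored / (rows * cols)

for k in (15, 75):
    print(k, f"{truncated_svd_fraction(2448, 3264, k):.1%}")
```

Keeping 15 singular values needs 15 × (2448 + 3264 + 1) = 85,695 numbers out of roughly 8 million, which is about 1%; 75 singular values need about 5%.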


VARIABLE CLUSTERING TO REDUCE THE DIMENSION

Variable selection

I have 500 inputs but maybe there are only ten clusters of inputs

Within one cluster the variables are (strongly) correlated.

Then use only one input per cluster for predictive modeling

All inputs: X1, X2, X3, …, X500

Cluster: X1, X21, X35, X430, … → keep X35

Cluster: X17, X29, X353, X490, … → keep X29

Cluster: X37, X95, X251, X393, … → keep X251




BAGGING & BOOSTING


COMBINE MODELS BAGGING & BOOSTING

If one model is not good enough: let multiple models vote for a prediction

Bootstrap Aggregation (Bagging)

This makes only sense if underlying models are different enough and have some predictive power

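[Diagram: random samples are drawn from the data, a model is fitted on each sample, and the models are combined into a final model]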


Bagging & Boosting: Random Forests

Random forests ≈ Bagging with trees

Apply underlying steps repeatedly

1. Generate a bootstrap sample

2. Choose randomly m inputs m << P

3. Fit a tree on the bootstrap sample with the m inputs (do not prune)

In case of a classification tree:

The random forest prediction is the majority vote of all trees

In case of a regression tree:

The random forest prediction is the average of all trees


FOREST VS TREE EXAMPLE ON SIMULATED DATA

Decision tree and Random forest (100

sub trees) fitted on the simulated data


FOREST VS TREE EXAMPLE ON SIMULATED DATA

It is clear to see that the forest can produce much smoother predictions.


GRADIENT BOOSTING DON’T LET THE FORMULAS INTIMIDATE YOU


GRADIENT BOOSTING SCHEMATIC OVERVIEW

Gradient boosting, M iterations m = 1, 2, …, M

[Diagram: at each iteration the inputs x and the pseudo residuals r1, r2, …, rM feed a base learner; the result after M iterations is the final model FM]

At each successive iteration a base learner $h_m$ (which is a decision tree) is fit on the pseudo residuals, using inputs x, to “correct” the previous learner.

Pseudo residuals $r_{im}$ at each step

$F_m = F_{m-1} + \gamma \cdot h_m$
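The update rule can be made concrete for squared-error loss, where the pseudo residuals are simply the ordinary residuals. The Python sketch below is a simplified illustration, not the SAS implementation: the base learner is a one-split regression tree (a stump) on a single input, γ is a fixed shrinkage factor, the input must contain at least two distinct values, and all names are hypothetical.

```python
def fit_stump(x, r):
    """Least-squares regression stump on one input; returns a predict function."""
    best = None
    for t in sorted(set(x))[1:]:
        left = [ri for xi, ri in zip(x, r) if xi < t]
        right = [ri for xi, ri in zip(x, r) if xi >= t]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((v - left_mean) ** 2 for v in left)
               + sum((v - right_mean) ** 2 for v in right))
        if best is None or sse < best[0]:
            best = (sse, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda xi: left_mean if xi < t else right_mean

def gradient_boost(x, y, M=20, gamma=0.5):
    """Squared-error boosting: F_m = F_{m-1} + gamma * h_m, starting from the mean."""
    f0 = sum(y) / len(y)
    learners = []
    pred = [f0] * len(y)
    for _ in range(M):
        residuals = [yi - pi for yi, pi in zip(y, pred)]   # pseudo residuals r_im
        h = fit_stump(x, residuals)
        learners.append(h)
        pred = [pi + gamma * h(xi) for pi, xi in zip(pred, x)]
    return lambda xi: f0 + gamma * sum(h(xi) for h in learners)

def mse(model, x, y):
    return sum((yi - model(xi)) ** 2 for xi, yi in zip(x, y)) / len(y)

x = list(range(10))
y = [xi ** 2 for xi in x]
for M in (0, 1, 20):
    print(M, round(mse(gradient_boost(x, y, M=M), x, y), 2))
```

Each extra iteration corrects part of what the previous learners left unexplained, so the training error shrinks as M grows; in practice M and γ are tuned on validation data to avoid overfitting.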


SUPPORT VECTOR MACHINES


Support vector machines (SVM)

Suppose we have a separable classification problem.

Find a linear decision boundary between the two groups with maximum margin M. So the green line would be better than the blue line.

If the groups are not separable you have to allow some points to be on the wrong side. These points are penalized. SVM still maximizes the margin M, but with the constraint that the total penalty is smaller than C.

The decision boundary might not be linear in the inputs. We could apply nonlinear mappings to the inputs, e.g. x², x³, or spline(x).

The beauty of SVM is that in the calculation of the decision boundary we do not need to use these transformations explicitly: “the kernel trick”
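The kernel trick can be checked numerically on a small case. The Python sketch below uses the homogeneous polynomial kernel of degree 2 on two-dimensional inputs as an example (a choice made here for illustration, not taken from the slides) and shows that the kernel value equals the inner product of the explicitly transformed inputs.

```python
import math

def phi(x):
    """Explicit degree-2 feature map for a 2-dimensional input."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def poly_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2

x, z = (1.0, 2.0), (3.0, -1.0)
print(poly_kernel(x, z), dot(phi(x), phi(z)))
```

Both numbers agree (up to rounding), so an algorithm that only needs inner products can work in the transformed space without ever computing phi.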


SVM UNDERLYING MATHEMATICAL OPTIMIZATION PROBLEMS

Separable classification

Non Separable classification

Non Separable classification rewritten using

Lagrange Dual problem

Kernels to model nonlinear behaviour


Not linearly separable in 2D, but in 3D space they are!

https://www.youtube.com/watch?v=3liCbRZPrZA


K – NEAREST NEIGHBOUR


K-NN METHOD

• No model is fitted. Given a query point x0 , find the k points x1, x2,..., xk that are

closest in distance to x0.

• Classify x0 using the majority vote among the k neighbours

Example: of the 5 nearest neighbours of x0, 3 are red and 2 are green, so we predict x0 to be red
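The voting rule fits in a few lines of Python. The sketch below uses Euclidean distance and a made-up set of seven labelled points arranged so that, as on the slide, three of the five nearest neighbours of the query point are red and two are green.

```python
import math
from collections import Counter

def knn_classify(query, points, labels, k):
    """Majority vote among the k training points closest to the query."""
    order = sorted(range(len(points)), key=lambda i: math.dist(query, points[i]))
    nearest = [labels[i] for i in order[:k]]
    return Counter(nearest).most_common(1)[0][0]

# Hypothetical training data
points = [(1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (6, 6), (7, 6)]
labels = ["red", "red", "red", "green", "green", "green", "green"]

print(knn_classify((2, 2), points, labels, k=5))
```

Note that the answer depends on k: with all seven points voting, the same query point would be classified as green.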


K-NN METHOD

[Figures: decision boundaries for 1 nearest neighbour and for 15 nearest neighbours]


K-NN METHOD

Use different numbers k of nearest neighbours and compare test and training errors

Despite its simplicity, k-nearest-neighbours has been used successfully in problems like

• Handwritten digits

• Satellite image scenes

• EKG patterns


K-NN EXAMPLE DUTCH HOUSE PRICES

Extract house-for-sale prices from a Dutch housing site

For 108K Dutch postal codes (out of 463K) there are one or more houses for sale.

How can we estimate the house value for the postal codes without a house price?

For a postal code with no price, estimate the price by taking the k closest house-for-sale prices.


Comparing different nearest neighbours in SAS Enterprise Miner


K-NN EXAMPLE DUTCH HOUSE PRICES

30% of the data was used as validation set

In Enterprise Miner different values for k were used

k=5 nearest neighbours has the lowest average squared error


NEURAL NETWORKS

DEEP LEARNING


NEURAL NETWORK LINEAR REGRESSION

$Y = f(X, w) = w_1 + w_2 X_2 + w_3 X_3 + w_4 X_4$

[Diagram: neural network compute node with inputs 1, X2, X3, X4, weights w1, w2, w3, w4 and activation function f]

f is the so-called activation function. This could be the logistic function, but other choices are possible

There are four weights w that have to be determined


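NEURAL NETWORKS MATHEMATICAL FORMULATION

[Diagram: X inputs X1 (Age), X2 (Income), X3 (Region), X4 (Gender); hidden layer Z1, Z2, Z3; outputs Y and N; weights α from the inputs to the hidden layer and β from the hidden layer to the outputs]

The prediction formula for a neural network is given by

$P(Y \mid X) = g(T_Y)$

$T_Y = \beta_{0Y} + \beta_Y^T Z$

$Z_m = \sigma(\alpha_{0m} + \alpha_m^T X)$

The functions g and σ are defined as

$g(T_Y) = \frac{e^{T_Y}}{e^{T_N} + e^{T_Y}}, \quad \sigma(x) = \frac{1}{1 + e^{-x}}$

In case of a binary classifier $P(N \mid X) = 1 - P(Y \mid X)$

The model weights α and β have to be estimated from the data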
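The prediction formula can be evaluated directly once the weights are known. The Python sketch below follows the formulas on this slide for four inputs, three hidden units and the two outputs Y and N; the weight values are arbitrary numbers chosen for illustration, not estimates from any data.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, alpha0, alpha, beta0, beta):
    """Return {"Y": P(Y|X), "N": P(N|X)} for one input vector x.

    alpha0[m], alpha[m]: bias and weight vector of hidden unit Z_m.
    beta0[c], beta[c]:   bias and weight vector of output class c ("Y" or "N").
    """
    z = [sigmoid(a0 + sum(w * xi for w, xi in zip(a, x)))
         for a0, a in zip(alpha0, alpha)]
    t = {c: beta0[c] + sum(w * zm for w, zm in zip(beta[c], z)) for c in ("Y", "N")}
    denominator = math.exp(t["Y"]) + math.exp(t["N"])
    return {c: math.exp(t[c]) / denominator for c in ("Y", "N")}

# Hypothetical weights: 4 inputs -> 3 hidden units -> 2 outputs
alpha0 = [0.1, -0.2, 0.0]
alpha = [[0.5, -0.3, 0.2, 0.1], [-0.4, 0.1, 0.3, -0.2], [0.2, 0.2, -0.1, 0.4]]
beta0 = {"Y": 0.05, "N": -0.05}
beta = {"Y": [0.6, -0.4, 0.3], "N": [-0.6, 0.4, -0.3]}

print(forward([0.2, 1.5, -0.7, 1.0], alpha0, alpha, beta0, beta))
```

The two output probabilities always add up to one, which is the binary-classifier identity stated on the slide.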


NEURAL NETWORKS ESTIMATING THE WEIGHTS

Back propagation algorithm

Randomly choose small values for all $w_i$

For each data point (observation)

1. Calculate the neural net prediction

2. Calculate the error E (for example: E = (actual − prediction)²)

3. Adjust the weights w according to: $w_i^{new} = w_i + \Delta w_i$, with $\Delta w_i = -\alpha \frac{\partial E}{\partial w_i}$

4. Stop if the error E is small enough.
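The weight update can be tried out on the simplest possible network: one linear node with a bias weight and one input weight. The Python sketch below is an illustration under those assumptions; the starting weights are fixed small numbers rather than random ones so that the run is reproducible, and the learning rate α and the data are made up.

```python
def train(data, alpha=0.01, epochs=200):
    """Per-observation gradient descent for prediction = w0 + w1 * x,
    with error E = (actual - prediction)^2."""
    w0, w1 = 0.01, -0.01          # small starting weights (fixed for reproducibility)
    for _ in range(epochs):
        for x, actual in data:
            error = actual - (w0 + w1 * x)
            # Partial derivatives: dE/dw0 = -2 * error, dE/dw1 = -2 * error * x
            w0 += -alpha * (-2 * error)
            w1 += -alpha * (-2 * error * x)
    return w0, w1

data = [(0, 1), (1, 3), (2, 5), (3, 7)]   # generated from actual = 1 + 2 * x
print(train(data, epochs=2000))
```

In a multi-layer network the same update is applied to every weight, and back propagation is the bookkeeping that computes all the partial derivatives efficiently.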


DEEP LEARNING NEURAL NETWORK WITH MORE THAN 2 HIDDEN LAYERS


NEURAL NETS AUTOENCODERS

http://support.sas.com/resources/papers/proceedings14/SAS313-2014.pdf

Neural networks that use the inputs to predict the inputs

[Diagram: inputs X1, X2, X3, X4 → ENCODE → 2 dimensional middle layer → DECODE → outputs X1, X2, X3, X4]

A 2 dimensional middle layer can be used for visualisation

With linear activation functions this corresponds to 2 dimensional principal components analysis



NEURAL NETS AUTOENCODERS


Often more hidden layers with many nodes

ENCODE DECODE

INPUT OUTPUT = INPUT



NEURAL NET CARS EXAMPLE

[Figures: 2 dimensional PCA versus autoencoder network 25 – 15 – 2 – 15 – 25]


NEURAL NETS AUTOENCODER EXAMPLE

• 1000 images of digits

• Each image has 400 pixels

• So a 400 dimensional input vector X = (x1,…,x400)

• Compare two dimensional PCA with a neural net autoencoder


NEURAL NETS AUTOENCODER EXAMPLE

proc neural
   data= autoencoderTraining
   dmdbcat= work.autoencoderTrainingCat;
   performance compile details cpucount= 12 threads= yes;
   /* DEFAULTS: ACT= TANH COMBINE= LINEAR */
   /* IDS ARE USED AS LAYER INDICATORS – SEE FIGURE 6 */
   /* INPUTS AND TARGETS SHOULD BE STANDARDIZED */
   archi MLP hidden= 5;
   hidden 300 / id= h1;

   hidden 2 / id= h3 act= linear;

   input corruptedPixel1 - corruptedPixel400 / id= i level= int std= std;
   target pixel1-pixel400 / act= identity id= t level= int std= std;
   /* BEFORE PRELIMINARY TRAINING WEIGHTS WILL BE RANDOM */
   initial random= 123;
   prelim 10 preiter= 10;
run;


Two dimensional representation of 400 dimensional ‘digit’ data


BAYESIAN NETWORKS


BAYESIAN NETWORKS -- ACYCLIC GRAPHICAL MODELS

• Nodes represent random variables

• Links between nodes represent conditional dependencies

• Conditional probability tables are derived from training data for each node

• Random variables are typically binary or discrete

• The graph structure can be learned from the data


TEXT MINING


TEXT MINING BASICS

“Advanced” word counting

Parse & filter:

Part of speech

Entity detection

Mixed / numeric / abbreviations

Stemming

Spell checks, stop list, synonym list

Multi-term words

Apply traditional data mining:

Clustering

Prediction / machine learning


TEXT MINING BASICS

Document 1: “Ik loop over straat in Amsterdam, 1057DK, met mijn fiets” (I walk down the street in Amsterdam, 1057DK, with my bike)

Document 2: “Zij liep niet maar fietste met haar blauwe fieets, //bitly.com/sdrtw” (She did not walk but cycled on her blue bike [misspelled], //bitly.com/sdrtw)

Document 3: “Mijn tweewieler is kapot, wat een slecht stuk ijzer, @#$%$@!” (My two-wheeler is broken, what a bad piece of iron, @#$%$@!)

TERM DOCUMENT MATRIX: A

Term                          Doc 1  Doc 2  Doc 3
+Fiets (bike, noun)               1      1      1
Fietsen (to cycle, verb)          0      1      0
Blauwe (blue, adjective)          0      1      0
Amsterdam (location)              1      0      0
+Lopen (to walk, verb)            1      1      0
Straat (street, noun)             1      0      0
Kapot (broken, adverb)            0      0      1
Slecht (bad)                      0      0      1
Stuk ijzer (piece of iron)        0      0      1
1057DK (postal code)              1      0      0
//bitly.com/sdrtw (internet)      0      1      0

• Each text document is a (very) long vector of word counts (often with many zeros!)

• Apply further mining on this matrix A.
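Once the documents have been parsed, stemmed and mapped to synonyms, building the term document matrix is plain counting. The Python sketch below assumes that this preprocessing has already happened and uses simplified English stand-ins for the terms; the token lists and the function name are illustrative only.

```python
from collections import Counter

# Hypothetical, already parsed and stemmed terms per document (English stand-ins).
documents = {
    "Doc 1": ["walk", "street", "amsterdam", "bike"],
    "Doc 2": ["walk", "cycle", "blue", "bike"],
    "Doc 3": ["bike", "broken", "bad"],
}

def term_document_matrix(docs):
    """Return (terms, matrix) where matrix[i][j] counts term i in document j."""
    counts = {name: Counter(tokens) for name, tokens in docs.items()}
    terms = sorted({t for tokens in docs.values() for t in tokens})
    matrix = [[counts[name][t] for name in docs] for t in terms]
    return terms, matrix

terms, A = term_document_matrix(documents)
for term, row in zip(terms, A):
    print(f"{term:10s}", row)
```

The hard part in practice is the preprocessing that decides that "fiets", "fieets" and "tweewieler" are the same term; the counting itself is trivial.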


TEXT MINING TERM DOCUMENT MATRIX A

It is not useful to apply data mining techniques directly on the term document

matrix

• Often more terms than documents

• Rows could be strongly correlated

• Matrix is often very sparse

Apply Singular value decomposition first.


TEXT MINING SVD ON THE TERM DOCUMENT MATRIX A

Matrix SVD decomposition: $A = U \Sigma V^T$ [could be many thousands]

Take only the first k << r singular values: $A_k = U_k \Sigma_k V_k^T$

A document d is then no longer a long vector of m word counts but a much shorter vector $\hat{d}$, say of length 300.


TEXT MINING APPLICATIONS

Combine customer structured data and unstructured data to better predict behaviour (churn / fraud)

Apply machine learning to create

a model f to predict the target

Automatically generate topics within large document collections

Apply clustering techniques to classify

documents into clusters (topics)

Topic 1 Topic 2 Topic 3


RECOMMENDATION ENGINE

Which product should I recommend my customers?


RECOMMENDATION ENGINE USER – ITEM MATRIX EXPLICIT RECOMMENDATIONS

Users rated items (products) explicitly

Matrix is often very sparse

1 mln users × 100K items ~ 0.01%??

User - Item Matrix – Data

          Item 1  Item 2  Item 3  Item 4  Item 5
User 1         3       2       5       4       5
User 2         -       -       -       1       1
User 3         1       -       2       5       -
User 4         -       -       1       2       5
User 5         2       1       4       2       3
User 6         2       3       -       5       1
User 7         5       1       -       3       4
User 8         -       1       -       4       1
User 9         2       3       2       4       2
User 10        -       1       3       -       1

User 4's item ratings:

User 4         -       -       1       2       5

After some math…. recommendations are:

User 4      3.21    4.82       1       2       5

Recommend item 2!


RECOMMENDATION ENGINE ALGORITHMS IN PROC RECOMMEND

Memory-based algorithms

Slope one (slope1)

K nearest neighbors (knn)

Model-based algorithms

Matrix factorization (SVD - LBFGS)

Market basket analysis

Association rules mining (arm)

Mixture of different methods

Clustering(cluster)

Ensemble


RE METHODS SLOPE ONE

Y = x + b with slope equal to 1 (see notes)

Item-item based

$r_{ui} = \frac{\sum_j w_{ij} r_{uj}}{\sum_j w_{ij}}$

Weight $w_{ij}$: the number of users having rated both items i and j

Rating $r_{uj}$: the average rating computed from item j

Sample rating database

Customer  Item A  Item B  Item C
John           5       3       2
Mark           3       4      ??
Lucy          ??       2       5
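The two missing ratings in the sample database can be filled in with a few lines of Python. The sketch below implements the commonly used weighted slope one scheme, in which the prediction derived from item j is the user's rating of j plus the average difference between item i and item j; this reading of "the average rating computed from item j" is an assumption, and the function name is illustrative.

```python
ratings = {
    "John": {"A": 5, "B": 3, "C": 2},
    "Mark": {"A": 3, "B": 4},
    "Lucy": {"B": 2, "C": 5},
}

def slope_one_predict(ratings, user, item):
    """Weighted slope one prediction of `item` for `user` (None if not computable)."""
    numerator, denominator = 0.0, 0
    for other_item, user_rating in ratings[user].items():
        # Users who rated both `item` and `other_item`
        both = [r for r in ratings.values() if item in r and other_item in r]
        if not both:
            continue
        deviation = sum(r[item] - r[other_item] for r in both) / len(both)
        numerator += len(both) * (user_rating + deviation)
        denominator += len(both)
    return numerator / denominator if denominator else None

print(round(slope_one_predict(ratings, "Lucy", "A"), 2))
print(round(slope_one_predict(ratings, "Mark", "C"), 2))
```

For Lucy and item A, item B contributes 2 + 0.5 with weight 2 and item C contributes 5 + 3 with weight 1, giving 13 / 3; for Mark and item C the same calculation gives 10 / 3.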


RE METHODS K NEAREST NEIGHBORS

The rating $r_{ui}$ is determined by the ratings “in the neighborhood”

$r_{ui} = \frac{\sum_{j \in N(i;u)} sim_{ij} \, r_{uj}}{\sum_{j \in N(i;u)} sim_{ij}}$

How to determine the neighbors N and how many (k) to use?

How to compute the similarity/distance measure $sim_{ij}$?

• Pearson’s correlation coefficient

• Cosine distance

• Other adjustments


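RE METHODS PEARSON CORRELATION

$a, b$: users

$r_{a,p}$: rating of user $a$ for item $p$

$\bar{r}_a$: mean rating of user $a$

$P$: set of items rated by both $a$ and $b$

• Possible similarity values between −1 and 1

$sim(a, b) = \frac{\sum_{p \in P} (r_{a,p} - \bar{r}_a)(r_{b,p} - \bar{r}_b)}{\sqrt{\sum_{p \in P} (r_{a,p} - \bar{r}_a)^2} \, \sqrt{\sum_{p \in P} (r_{b,p} - \bar{r}_b)^2}}$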
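The similarity can be computed directly from two users' rating dictionaries. The Python sketch below is one possible implementation; it takes the mean ratings over the co-rated items only, which is an assumption (some variants use each user's overall mean), and it returns 0 when fewer than two items are shared or when a user's ratings do not vary.

```python
import math

def pearson_similarity(ratings_a, ratings_b):
    """Pearson correlation over the items rated by both users (dicts item -> rating)."""
    common = [p for p in ratings_a if p in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[p] for p in common) / len(common)
    mean_b = sum(ratings_b[p] for p in common) / len(common)
    numerator = sum((ratings_a[p] - mean_a) * (ratings_b[p] - mean_b) for p in common)
    spread_a = math.sqrt(sum((ratings_a[p] - mean_a) ** 2 for p in common))
    spread_b = math.sqrt(sum((ratings_b[p] - mean_b) ** 2 for p in common))
    denominator = spread_a * spread_b
    return numerator / denominator if denominator else 0.0

# Hypothetical users
anna = {"item1": 1, "item2": 2, "item3": 3}
bert = {"item1": 2, "item2": 4, "item3": 6, "item4": 5}
carl = {"item1": 3, "item2": 2, "item3": 1}

print(pearson_similarity(anna, bert), pearson_similarity(anna, carl))
```

Users with the same taste but a different rating scale (anna and bert) get similarity 1, while users with opposite tastes (anna and carl) get −1.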


RE METHODS K NEAREST NEIGHBORS METHOD


RE METHODS MATRIX FACTORIZATION

How do we fill in the missing data?

$R = U V$, with R (m × n, users by items), U (m × k) and V (k × n)

Select loss function (squared error)

Select the number of hidden factors k

Predict new rating: $R_{ij} = U_i^T V_j$

Minimize prediction error: $\min_{U,V} \sum_{i,j} \left[ (R_{ij} - U_i^T V_j)^2 + \lambda \left( \lVert U_i \rVert^2 + \lVert V_j \rVert^2 \right) \right]$

Optimization problem, solved with L-BFGS or ALS


RE METHODS CLUSTER

[Diagram: user/item profiles and user/item ratings → clustering → k-NN within one subgroup → predictions]


RE METHOD ASSOCIATION RULE MINING (MARKET BASKET ANALYSIS)

Basic steps for association rule mining

Identify frequent itemsets (rules) in the transaction data:

IF item A and B THEN item C

IF item X THEN item Y

Not all rules are interesting; use ‘support’ and ‘lift’ to judge the importance of a rule

$\text{Support}(X, Y) = \frac{\#\,\text{transactions containing } X \text{ and } Y}{\text{total } \#\,\text{transactions}}$

$\text{Lift} = \frac{\text{Support}(X, Y)}{\text{Support}(X) \times \text{Support}(Y)}$

Support & Lift examples: Diapers, Beer: 0.8%; Diapers, Candles: 0.018%

For example a lift of 2.5 means: people who have X are 2.5 times as likely to buy Y as customers in general
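Support and lift are simple ratios of transaction counts. The Python sketch below computes them for a small made-up list of shopping baskets; the baskets and function names are illustrative and have nothing to do with the percentages on the slide.

```python
# Hypothetical transactions (one set of items per basket)
transactions = [
    {"diapers", "beer"},
    {"diapers", "beer", "candles"},
    {"diapers"},
    {"beer"},
    {"candles"},
    {"milk"},
    {"milk", "beer"},
    {"milk", "diapers", "beer"},
]

def support(itemset, transactions):
    """Share of transactions that contain every item in `itemset`."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def lift(x, y, transactions):
    """Support(X, Y) / (Support(X) * Support(Y))."""
    return support(x | y, transactions) / (support(x, transactions) * support(y, transactions))

print(support({"diapers", "beer"}, transactions))
print(lift({"diapers"}, {"beer"}, transactions))
```

Here diapers appear in 4 of 8 baskets, beer in 5 of 8 and both together in 3 of 8, so the lift is 0.375 / (0.5 × 0.625) = 1.2: diaper buyers are somewhat more likely than average to buy beer.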


METHOD ENSEMBLE

Linear combination of previous methods

Achieve better performance


PROC RECOMMEND recom = rs.IENS;
   * Add a recommendation system;
   ADD rs.IENS / item = item user = user rating = rating;
   * Add tables;
   ADDTABLE LHL1209.IENS_UIR / recom = rs.IENS type = rating vars=(item user rating);
   * Method SVD LBFGS with 20 factors;
   METHOD svd /
      factors = 20
      label = "svd" fconv = 1e-3
      gconv = 1e-3 maxiter = 100
      MAXFEVAL = 5000 function = L2
      lamda = 0.2
      technique = lbfgs;
RUN;
   METHOD ARM /
      label = "ARM" ;
RUN;
   /* information on the recommender system */
   INFO;
QUIT;


/** prediction with the SVD method ***/
PROC RECOMMEND recom = rs.IENS;
   PREDICT /
      method = svd
      label = "svd"
      Num = 3
      users = ("Longhow Lam");
run;
QUIT;


LAST SLIDE

PROS AND CONS OF MORE MODERN MACHINE LEARNING (compared to traditional linear / logistic regression)

CONS

Unfamiliar to a broader audience, (more) difficult to explain

Black box approach (you are rejected: the computer says NO)

Often relations can already be modeled with classical regression models

It allows you to not think about the business problem

PROS

Often less data prep (manual tuning) necessary (just throw it in the algorithm…)

Interactions often “automatically” taken into account

Superior for text mining, image & speech recognition

Better lift possible (a few percent “for free”)

It allows you to not think about the business problem


WHY SAS FOR MACHINE LEARNING

• Many different techniques

• Easy to use GUI’s combined with flexible coding

• High performance scalability

• Easily deployable


SOME MACHINE LEARNING EXAMPLES

Text mining

Image recognition

Sound recognition

Strange faces

So can a machine read, see and hear?


PREDICTING SENTIMENT FROM

RESTAURANT REVIEWS


IENS REVIEWS COLLECTED AROUND 16,000 REVIEWS AND THEIR SCORES

Used text miner to parse and filter reviews,

and transform reviews to data points in SVD space.


Predicted review score vs. Given review score

USE MACHINE LEARNING TO PREDICT TARGET WITH THE 300 INPUTS

R² linear regression = 0.5

R² neural net = 0.6


IENS REVIEWS APPLY MODEL ON ‘NEW REVIEWS’


MNIST DATA IN SAS

MODIFIED NATIONAL INSTITUTE OF STANDARDS AND TECHNOLOGY


MNIST TRAINING DATA

42,000 pictures of hand-written digits

Each digit is a picture of 28 by 28 pixels

So a 784 dimensional vector

First 100 digits of the MNIST data and their known labels in red


MNIST DATA TRYING DIFFERENT LEARNING TECHNIQUES

8 – Nearest Neighbour has the lowest misclassification rate. 3.6% of the digits in the validation set are misclassified.

70/30 training/validation split

PCA regression on 50 largest PC’s

Seven single layer neural nets: 3, 6, 12, 24, 48, 100, 200 neurons

Seven multi layer neural nets

Three random forests: 100, 500 and 1000 trees

8, 16 and 24 nearest neighbors


MNIST DATA APPLY MODEL ON TEST SET

28,000 digits without known labels.

Our best model predicted the label for these digits.

The first 100 predicted digits, together with the handwritten digits, are displayed here.

Red numbers are predicted labels. We see some obvious mistakes…


SPEECH RECOGNITION

DIGITS RECORDED WITH IPHONE

1 2


SPEECH RECOGNITION

WAV files consist of ~ 30,000 points: too much redundancy

Use spectral analysis to convert the signal to the frequency domain

Still too much: apply principal components

TRAIN DATA

8 spoken ‘ones’ in wav files

8 spoken ‘twos’ in wav files


SPEECH RECOGNITION


SPEECH RECOGNITION

Zero errors on training data

Zero errors on test data

Also 8 ‘ones’ and 8 ‘twos’

In Enterprise Miner:

Neural network with 9 neurons in one hidden layer


STRANGE FACE DETECTION COMBO OF OPEN API / R & SAS

Little joke on my colleagues….


STRANGE FACE DETECTION

Get a free API key for Face++

Their API returns 83 facial landmarks (in JSON format)

http://www.faceplusplus.com/demo-landmark/

Create an R script to:

Retrieve the SAS faces from our site

Put them through the Face++ API

Collect the JSON results and store them in an ABT

Apply advanced analytics on the ABT:

Which faces are look-alikes? proc cluster (hierarchical clustering)

Sales faces? Predictive modeling / machine learning

Who is the Brad Pitt? Nearest neighbor

Strange faces? proc neural / autoencoder

https://drive.google.com/open?id=0B3u5HLwJy75FfkdqdTZyVUY1ZXE4RVJKcHlOZld2d2FKWElSb01kZGdHR2ktY3k4ZFU4QWM&authuser=0


STRANGE FACE DETECTION LOOK-ALIKE FACES


STRANGE FACE DETECTION BRAD PITT LOOK-ALIKES


STRANGE FACE DETECTION STRANGE FACES

SAS Faces, Actors Faces

Read more on my blog

D:/R_Projects/FaceTest/HTML/SASGEZICHTEN/gezichten.html

https://6f4b40676db23f61461217d665eccbfbde17e1fc.googledrive.com/host/0B3u5HLwJy75FWFR1WUJDS2dfdjQ

https://longhowlam.wordpress.com/2015/05/28/some-simple-facial-analytics-on-actors-and-my-manager/


