Data-Intensive Distributed Computing - University of Waterloo


Transcript of Data-Intensive Distributed Computing - University of Waterloo

Page 1: Data-Intensive Distributed Computing - University of Waterloo

Data-Intensive Distributed Computing

Part 6: Data Mining (1/4)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details

CS 431/631 451/651 (Winter 2022)

Ali Abedi

These slides are available at https://www.student.cs.uwaterloo.ca/~cs451


Page 2: Data-Intensive Distributed Computing - University of Waterloo

Structure of the Course

“Core” framework features and algorithm design

Analyzing Text
Analyzing Graphs
Analyzing Relational Data
Data Mining


Page 3: Data-Intensive Distributed Computing - University of Waterloo

Supervised Machine Learning

The generic problem of function induction given sample instances of input and output

Classification: output draws from finite discrete labels

Regression: output is a continuous value

This is not meant to be an exhaustive treatment of machine learning!


Page 4: Data-Intensive Distributed Computing - University of Waterloo

Source: Wikipedia (Sorting)

Classification


Page 5: Data-Intensive Distributed Computing - University of Waterloo

Applications

Spam detection

Sentiment analysis

Content (e.g., topic) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection


Page 6: Data-Intensive Distributed Computing - University of Waterloo

Applications

Spam detection

Sentiment analysis

Content (e.g., topic) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection


Page 7: Data-Intensive Distributed Computing - University of Waterloo

Supervised Machine Learning

[Diagram: training: labeled examples feed a machine learning algorithm, which produces a model; testing/deployment: the model assigns a label (“?”) to new, unseen inputs.]


Page 8: Data-Intensive Distributed Computing - University of Waterloo

Feature Representations

Objects are represented in terms of features:

“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.

“Sparse” features: contains the term “Viagra” in message, contains “URGENT” in subject, etc.
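A minimal sketch (not from the slides) of what dense vs. sparse feature vectors can look like in code; the feature indices and values below are made up for illustration:

object FeatureSketch {
  // “Dense” features: every feature has a slot, stored positionally
  // (e.g., sender-IP bucket, timestamp, # of recipients, message length; values made up).
  val dense: Array[Double] = Array(42.0, 1.6e9, 12.0, 3400.0)

  // “Sparse” features: only the features that “fire” are stored, as index -> value
  // (e.g., index 10432 = contains “Viagra”, index 20871 = “URGENT” in subject; indices made up).
  val sparse: Map[Int, Double] = Map(10432 -> 1.0, 20871 -> 1.0)

  // A linear model scores a sparse vector with a dot product against a weight vector w,
  // touching only the features that are actually present.
  def scoreSparse(w: Array[Double], x: Map[Int, Double]): Double =
    x.iterator.map { case (i, v) => w(i) * v }.sum
}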


Page 9: Data-Intensive Distributed Computing - University of Waterloo

Applications

Spam detection

Sentiment analysis

Content (e.g., genre) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection


Page 10: Data-Intensive Distributed Computing - University of Waterloo

Components of a ML Solution

Data
Features
Model
Optimization


Data matters the most: if we throw a lot of data at almost any algorithm, it performs well.


Page 11: Data-Intensive Distributed Computing - University of Waterloo

(Banko and Brill, ACL 2001)

(Brants et al., EMNLP 2007)

No data like more data!


We have seen these examples before. For example, Stupid Backoff outperforms other algorithms when it is trained on a lot of data.


Page 12: Data-Intensive Distributed Computing - University of Waterloo

Limits of Supervised Classification?

Why is this a big data problem? Isn’t gathering labels a serious bottleneck?

Solutions:
Crowdsourcing
Bootstrapping, semi-supervised techniques
Exploiting user behavior logs


Page 13: Data-Intensive Distributed Computing - University of Waterloo

Supervised Binary Classification

Restrict output label to be binary:
Yes/No
1/0

Binary classifiers form primitive building blocks for multi-class problems…


Page 14: Data-Intensive Distributed Computing - University of Waterloo

Binary Classifiers as Building Blocks

Example: four-way classification

One vs. rest classifiers (applied independently):
A or not?  B or not?  C or not?  D or not?

Classifier cascades (applied in sequence):
A or not? → B or not? → C or not? → D or not?
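As a rough illustration (not from the slides), here is how one vs. rest prediction composes binary scorers; the class list, scoring function, and feature type are hypothetical names used only for this sketch:

object OneVsRestSketch {
  type Features = Map[Int, Double]   // sparse feature vector: index -> value

  // One binary "X or not?" scorer per class; a higher score means a more confident "yes".
  // One vs. rest simply picks the class whose binary classifier is most confident.
  def predict(classes: Seq[String], score: (String, Features) => Double, x: Features): String =
    classes.maxBy(c => score(c, x))

  def main(args: Array[String]): Unit = {
    // Toy scorer just to exercise the wrapper: "B or not?" always says yes.
    val toyScore: (String, Features) => Double = (c, feats) => if (c == "B") 1.0 else 0.0
    println(predict(Seq("A", "B", "C", "D"), toyScore, Map(0 -> 1.0)))   // prints B
  }
}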


Page 15: Data-Intensive Distributed Computing - University of Waterloo

The Task

Given: training examples, each a (sparse) feature vector $\mathbf{x}_i$ with a label $y_i$

Induce a function $f : X \to Y$ such that the loss is minimized over the training data:

$\min \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i), y_i\big)$

where $\ell$ is a loss function.

Typically, we consider functions of a parametric form, with model parameters $\boldsymbol{\theta}$:

$\min_{\boldsymbol{\theta}} \frac{1}{n} \sum_{i=1}^{n} \ell\big(f(\mathbf{x}_i; \boldsymbol{\theta}), y_i\big)$


Page 16: Data-Intensive Distributed Computing - University of Waterloo

Key insight: machine learning as an optimization problem!


Page 17: Data-Intensive Distributed Computing - University of Waterloo

Because closed-form solutions are generally not possible…

Gradient Descent


Page 18: Data-Intensive Distributed Computing - University of Waterloo

Gradient Descent: Preliminaries

Intuition behind the math: new weights = old weights minus an update based on the gradient

$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\, \nabla L(\boldsymbol{\theta}^{(t)})$

Page 19: Data-Intensive Distributed Computing - University of Waterloo

Linear Regression + Gradient Descent


Page 20: Data-Intensive Distributed Computing - University of Waterloo

Gradient Descent: Preliminaries

Rewrite:

Compute the gradient: it “points” to the fastest-increasing “direction”

So, at any point:


Page 21: Data-Intensive Distributed Computing - University of Waterloo

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:

$\boldsymbol{\theta}^{(t+1)} \leftarrow \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\, \nabla L(\boldsymbol{\theta}^{(t)})$

We have (for a suitably chosen step size):

$L(\boldsymbol{\theta}^{(0)}) \ge L(\boldsymbol{\theta}^{(1)}) \ge L(\boldsymbol{\theta}^{(2)}) \ge \dots$


Page 22: Data-Intensive Distributed Computing - University of Waterloo

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:

$\boldsymbol{\theta}^{(t+1)} \leftarrow \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\, \nabla L(\boldsymbol{\theta}^{(t)})$

We have (for a suitably chosen step size):

$L(\boldsymbol{\theta}^{(0)}) \ge L(\boldsymbol{\theta}^{(1)}) \ge L(\boldsymbol{\theta}^{(2)}) \ge \dots$

Lots of details:
Figuring out the step size
Getting stuck in local minima
Convergence rate
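To make the update concrete, here is a small, self-contained sketch (not taken from the slides) of batch gradient descent for least-squares linear regression; the toy data, step size, and iteration count are arbitrary choices for illustration:

object GradientDescentSketch {
  // One training example: feature vector x and label y.
  case class Point(x: Array[Double], y: Double)

  def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (u, v) => u * v }.sum

  def main(args: Array[String]): Unit = {
    // Toy data generated from y = 2*x1 + 1*x2, just to show the mechanics.
    val data = Seq(
      Point(Array(1.0, 0.0), 2.0),
      Point(Array(0.0, 1.0), 1.0),
      Point(Array(1.0, 1.0), 3.0)
    )
    var w = Array(0.0, 0.0)   // start at an arbitrary point
    val gamma = 0.1           // fixed step size (choosing it well is one of the "details")
    for (_ <- 1 to 100) {
      // Gradient of the squared loss: sum over points of 2 * (w·x - y) * x
      val gradient = data.map { p =>
        val err = dot(w, p.x) - p.y
        p.x.map(_ * 2 * err)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
      // The update rule: new weights = old weights - step size * gradient
      w = w.zip(gradient).map { case (wi, gi) => wi - gamma * gi }
    }
    println(w.mkString(", "))   // approaches 2.0, 1.0
  }
}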


Page 23: Data-Intensive Distributed Computing - University of Waterloo

Gradient Descent

Repeat until convergence:

$\boldsymbol{\theta}^{(t+1)} \leftarrow \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\, \nabla L(\boldsymbol{\theta}^{(t)})$

Note: sometimes formulated as ascent, but entirely equivalent.


Page 24: Data-Intensive Distributed Computing - University of Waterloo

Even More Details…

Gradient descent is a “first order” optimization technique
Often, slow convergence

Newton and quasi-Newton methods:
Intuition: Taylor expansion
Requires the Hessian (square matrix of second-order partial derivatives): impractical to fully compute
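For reference, the second-order (Newton) update that the Taylor-expansion intuition leads to is, in standard form (not transcribed from the slides):

\[
\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \big[ H(\boldsymbol{\theta}^{(t)}) \big]^{-1} \nabla L(\boldsymbol{\theta}^{(t)}),
\qquad
H_{ij}(\boldsymbol{\theta}) = \frac{\partial^2 L(\boldsymbol{\theta})}{\partial \theta_i \, \partial \theta_j}
\]

Computing and inverting the full Hessian is exactly what is impractical at scale, which is why quasi-Newton methods such as L-BFGS (mentioned later) approximate it instead.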


Page 25: Data-Intensive Distributed Computing - University of Waterloo

Logistic Regression: Preliminaries


Page 26: Data-Intensive Distributed Computing - University of Waterloo

Logistic Regression: Preliminaries

Given: training examples, each a feature vector $\mathbf{x}$ with a binary label $y \in \{0, 1\}$

Define (in terms of a weight vector $\mathbf{w}$):

$\ln \frac{\Pr(y = 1 \mid \mathbf{x})}{\Pr(y = 0 \mid \mathbf{x})} = \mathbf{w} \cdot \mathbf{x}$

Interpretation: the log odds of the positive class is a linear function of the features.


Page 27: Data-Intensive Distributed Computing - University of Waterloo

Relation to the Logistic Function

After some algebra:

$\Pr(y = 1 \mid \mathbf{x}) = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}} \qquad \Pr(y = 0 \mid \mathbf{x}) = \frac{e^{-\mathbf{w} \cdot \mathbf{x}}}{1 + e^{-\mathbf{w} \cdot \mathbf{x}}}$

The logistic function:

$\mathrm{logistic}(z) = \frac{1}{1 + e^{-z}}$

[Plot of logistic(z) for z from −8 to 8: an S-shaped curve rising from 0 to 1.]
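As a tiny sketch (not from the slides) of how this looks in code, the logistic function turns a linear score into a probability:

// squashes any real-valued score z into (0, 1)
def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
// Pr(y = 1 | x) for a linear model is then logistic of the dot product of w and x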


Page 28: Data-Intensive Distributed Computing - University of Waterloo

Training an LR Classifier

Maximize the conditional likelihood:

Define the objective in terms of conditional log likelihood:

We know:

So:

Substituting:
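The equations these steps refer to were carried by the slide images; in the standard formulation (assuming labels $y_i \in \{0, 1\}$, not transcribed from the slides), the derivation is:

\[
\begin{aligned}
\text{Maximize the conditional likelihood:}\quad & \prod_{i=1}^{n} \Pr(y_i \mid \mathbf{x}_i, \mathbf{w}) \\
\text{Conditional log likelihood:}\quad & L(\mathbf{w}) = \sum_{i=1}^{n} \ln \Pr(y_i \mid \mathbf{x}_i, \mathbf{w}) \\
\text{We know:}\quad & \Pr(y_i \mid \mathbf{x}_i, \mathbf{w}) = p_i^{\,y_i}\,(1 - p_i)^{1 - y_i},
\qquad p_i = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}_i}} \\
\text{Substituting:}\quad & L(\mathbf{w}) = \sum_{i=1}^{n} \big[\, y_i \ln p_i + (1 - y_i) \ln (1 - p_i) \,\big]
\end{aligned}
\]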


Page 29: Data-Intensive Distributed Computing - University of Waterloo

LR Classifier Update Rule

Take the derivative:

General form of update rule:

Final update rule:
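In the same standard formulation (assuming $y_i \in \{0, 1\}$ as above; these equations are not transcribed from the slide images), the derivative of the conditional log likelihood and the resulting gradient-ascent update are:

\[
\frac{\partial L(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{n} x_{i,j}\, (y_i - p_i),
\qquad p_i = \frac{1}{1 + e^{-\mathbf{w} \cdot \mathbf{x}_i}}
\]

\[
\text{General form:}\quad \mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \gamma^{(t)}\, \nabla L(\mathbf{w}^{(t)})
\qquad\Longrightarrow\qquad
w_j^{(t+1)} = w_j^{(t)} + \gamma^{(t)} \sum_{i=1}^{n} x_{i,j}\, (y_i - p_i)
\]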


Page 30: Data-Intensive Distributed Computing - University of Waterloo

Lots more details…

Regularization

Different loss functions…

Want more details? Take a real machine-learning course!


Page 31: Data-Intensive Distributed Computing - University of Waterloo


Page 32: Data-Intensive Distributed Computing - University of Waterloo

MapReduce Implementation

[Diagram: mappers compute partial gradients; a single reducer updates the model; iterate until convergence.]


Page 33: Data-Intensive Distributed Computing - University of Waterloo

Shortcomings

Hadoop is bad at iterative algorithms:
High job startup costs
Awkward to retain state across iterations

High sensitivity to skew:
Iteration speed bounded by slowest task

Potentially poor cluster utilization:
Must shuffle all data to a single reducer

Some possible tradeoffs:
Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute


Page 34: Data-Intensive Distributed Computing - University of Waterloo

Spark Implementation

val points = spark.textFile(...).map(parsePoint).persist()  // keep the parsed points cached across iterations

var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  // each partition computes partial gradients; reduce sums them on the driver
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  w -= gradient  // update the model, then iterate
}

[Diagram: mappers compute partial gradients; a reducer updates the model.]
