
Data-Intensive Distributed Computing

Part 6: Data Mining (1/4)

This work is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/ for details.

CS 451/651 431/631 (Winter 2018)

Jimmy Lin
David R. Cheriton School of Computer Science

University of Waterloo

February 27, 2018

These slides are available at http://lintool.github.io/bigdata-2018w/

Structure of the Course

“Core” framework features and algorithm design

[Course-map figure: the modules Analyzing Text, Analyzing Graphs, Analyzing Relational Data, and Data Mining all build on the “core” framework layer.]

Descriptive vs. Predictive Analytics

[Architecture figure: user-facing frontends and backends (plus external APIs) write to OLTP databases; an ETL (Extract, Transform, and Load) process loads that data into a data warehouse / “data lake”, which data scientists query with “traditional” BI tools, SQL on Hadoop, and other tools.]

Supervised Machine Learning

The generic problem of function induction given sample instances of input and output

Classification: output draws from finite discrete labels

Regression: output is a continuous value

This is not meant to be an exhaustive treatment of machine learning!

Source: Wikipedia (Sorting)

Classification

Applications

Spam detection

Sentiment analysis

Content (e.g., topic) classification

Link prediction

Document ranking

Object recognition

And much much more!

Fraud detection

Supervised Machine Learning

[Figure: during training, labeled examples are fed to a machine learning algorithm that produces a model; at testing/deployment time, the model is applied to new, unlabeled inputs (“?”) to predict their labels.]

Feature Representations

Objects are represented in terms of features:
“Dense” features: sender IP, timestamp, # of recipients, length of message, etc.
“Sparse” features: contains the term “viagra” in message, contains “URGENT” in subject, etc.

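To make the dense vs. sparse distinction concrete, here is a minimal Scala sketch (not from the slides; the case-class fields and feature names are illustrative) encoding an email as a fixed-length dense vector plus a sparse map of indicator features:

object SpamFeatures {
  // Illustrative only: a toy representation of an email for spam detection.
  case class Email(senderIp: String, timestamp: Long, numRecipients: Int,
                   subject: String, body: String)

  // "Dense" features: a fixed-length numeric vector, one slot per feature.
  def denseFeatures(e: Email): Array[Double] =
    Array(e.timestamp.toDouble, e.numRecipients.toDouble, e.body.length.toDouble)

  // "Sparse" features: only the indicators that actually fire are materialized.
  def sparseFeatures(e: Email): Map[String, Double] =
    Seq(
      "body_contains_viagra"    -> e.body.toLowerCase.contains("viagra"),
      "subject_contains_URGENT" -> e.subject.contains("URGENT")
    ).collect { case (name, true) => name -> 1.0 }.toMap
}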

Components of a ML Solution

Data
Features
Model
Optimization

(Banko and Brill, ACL 2001; Brants et al., EMNLP 2007)

No data like more data!

Limits of Supervised Classification?

Why is this a big data problem?
Isn’t gathering labels a serious bottleneck?

Solutions:
Crowdsourcing
Bootstrapping, semi-supervised techniques
Exploiting user behavior logs

The virtuous cycle of data-driven products

Virtuous Product Cycle

[Cycle figure: a useful service → analyze user behavior to extract insights (data science) → transform insights into action (data products) → back into the service → $ (hopefully). Think Google, Facebook, Twitter, Amazon, Uber.]

What’s the deal with neural networks?

Data
Features
Model
Optimization

Supervised Binary Classification

Restrict output label to be binary:
Yes/No
1/0

Binary classifiers form primitive building blocks for multi-class problems…

Binary Classifiers as Building Blocks

Example: four-way classification

One vs. rest classifiers: run four binary classifiers in parallel (A or not? B or not? C or not? D or not?) and take the most confident positive prediction.

Classifier cascades: chain the binary classifiers (A or not? If not, B or not? If not, C or not? If not, D or not?), so each stage handles only what earlier stages rejected.
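A minimal Scala sketch of the one-vs.-rest idea (not from the slides; the trait shape and scoring convention are assumptions for illustration): each binary classifier produces a confidence score, and the multi-class prediction is the label whose classifier scores highest.

object OneVsRest {
  // Each binary classifier maps a feature vector to a confidence score
  // for "this class vs. everything else".
  type BinaryClassifier = Array[Double] => Double

  // Predict the label whose binary classifier is most confident.
  def predict(classifiers: Map[String, BinaryClassifier],
              features: Array[Double]): String =
    classifiers.map { case (label, clf) => (label, clf(features)) }
               .maxBy(_._2)._1
}

// Usage (with hypothetical classifiers clfA..clfD):
// OneVsRest.predict(Map("A" -> clfA, "B" -> clfB, "C" -> clfC, "D" -> clfD), x)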

The Task

Given: $D = \{(x_i, y_i)\}_{i=1}^{n}$, where each $x_i = [x_1, x_2, x_3, \ldots, x_d]$ is a (sparse) feature vector and $y_i \in \{0, 1\}$ is its label.

Induce: $f : X \to Y$ such that the loss is minimized:

$$\frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i), y_i)$$

where $\ell$ is the loss function.

Typically, we consider functions of a parametric form:

$$\arg\min_{\theta} \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$$

where $\theta$ are the model parameters.

Key insight: machine learning as an optimization problem!*
(closed-form solutions generally not possible)

* caveats
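As a concrete rendering of the objective above, here is a small Scala sketch (not from the slides; the function signatures and the squared-error example loss are choices made for illustration) that computes the average loss of a parametric model over a dataset:

object EmpiricalLoss {
  // A labeled example: feature vector x and label y in {0, 1}.
  case class Example(x: Array[Double], y: Double)

  // Average loss of model f (with parameters theta) over dataset D:
  //   (1/n) * sum_i loss(f(x_i; theta), y_i)
  def averageLoss(data: Seq[Example],
                  f: (Array[Double], Array[Double]) => Double,  // f(x; theta)
                  theta: Array[Double],
                  loss: (Double, Double) => Double): Double =
    data.map(ex => loss(f(ex.x, theta), ex.y)).sum / data.size

  // One possible choice of the loss function: squared error.
  val squaredError: (Double, Double) => Double =
    (pred, y) => (pred - y) * (pred - y)
}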

Gradient Descent: Preliminaries

Rewrite:

$$\arg\min_{\theta} L(\theta) \quad \text{where} \quad L(\theta) = \frac{1}{n} \sum_{i=1}^{n} \ell(f(x_i; \theta), y_i)$$

Compute the gradient: it “points” to the fastest-increasing “direction”:

$$\nabla L(\theta) = \left[ \frac{\partial L(\theta)}{\partial w_0}, \frac{\partial L(\theta)}{\partial w_1}, \ldots, \frac{\partial L(\theta)}{\partial w_d} \right]$$

So, at any point $a$:

$$b = a - \gamma \nabla L(a) \quad \Rightarrow \quad L(a) \geq L(b)^{*}$$

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})$$

We have:

$$L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots$$

New weights = old weights, updated based on the gradient:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$$

[Figure: a one-dimensional loss curve $\ell(x)$ with its derivative $d\ell/dx$, giving the intuition behind the math: step against the slope to decrease the loss.]

Gradient Descent: Iterative Update

Start at an arbitrary point, iteratively update:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \nabla L(\theta^{(t)})$$

We have:

$$L(\theta^{(0)}) \geq L(\theta^{(1)}) \geq L(\theta^{(2)}) \geq \ldots$$

Lots of details:
Figuring out the step size
Getting stuck in local minima
Convergence rate

Repeat until convergence:

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$$
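A minimal Scala sketch of the batch update above (not from the slides; the stopping criterion, step size, and vector helpers are assumptions for illustration):

object GradientDescent {
  type Vec = Array[Double]
  case class Example(x: Vec, y: Double)

  // Repeat until convergence:
  //   theta <- theta - gamma * (1/n) * sum_i grad_loss(f(x_i; theta), y_i)
  def fit(data: Seq[Example],
          gradLoss: (Vec, Example) => Vec,  // gradient of the per-example loss at theta
          theta0: Vec,
          gamma: Double = 0.1,
          maxIters: Int = 100,
          tol: Double = 1e-6): Vec = {
    var theta = theta0
    var converged = false
    var t = 0
    while (!converged && t < maxIters) {
      // Average the per-example gradients over the whole dataset (batch update).
      val grad = data.map(ex => gradLoss(theta, ex))
                     .reduce((a, b) => a.zip(b).map { case (u, v) => u + v })
                     .map(_ / data.size)
      theta = theta.zip(grad).map { case (w, g) => w - gamma * g }
      converged = math.sqrt(grad.map(g => g * g).sum) < tol  // crude convergence test
      t += 1
    }
    theta
  }
}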

Gradient Descent

Note: sometimes formulated as ascent, but the two are entirely equivalent.

Gradient Descent

Source: Wikipedia (Hills)

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$$

Even More Details…

Gradient descent is a “first order” optimization technique
Often, slow convergence

Newton and quasi-Newton methods:
Intuition: Taylor expansion

$$f(x + \Delta x) = f(x) + f'(x)\,\Delta x + \frac{1}{2} f''(x)\,\Delta x^2 + \ldots$$

Requires the Hessian (square matrix of second-order partial derivatives): impractical to fully compute
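The formula below is not on the slide, but it is the textbook second-order step the Taylor-expansion intuition leads to: minimizing the quadratic approximation of $L$ around $\theta^{(t)}$ gives

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \left[ H_L(\theta^{(t)}) \right]^{-1} \nabla L(\theta^{(t)})$$

where $H_L$ is the Hessian of $L$. Quasi-Newton methods such as L-BFGS (mentioned later in this deck) approximate the inverse Hessian from successive gradients instead of computing it fully.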

Source: Wikipedia (Hammer)

Logistic Regression

Logistic Regression: Preliminaries

Given:

$$D = \{(x_i, y_i)\}_{i=1}^{n}, \quad x_i = [x_1, x_2, x_3, \ldots, x_d], \quad y \in \{0, 1\}$$

Define:

$$f(x; w) : \mathbb{R}^d \to \{0, 1\} \qquad f(x; w) = \begin{cases} 1 & \text{if } w \cdot x \geq t \\ 0 & \text{if } w \cdot x < t \end{cases}$$

Interpretation:

$$\ln\left[\frac{\Pr(y = 1 \mid x)}{\Pr(y = 0 \mid x)}\right] = w \cdot x \qquad \Longleftrightarrow \qquad \ln\left[\frac{\Pr(y = 1 \mid x)}{1 - \Pr(y = 1 \mid x)}\right] = w \cdot x$$

Relation to the Logistic Function

After some algebra:

$$\Pr(y = 1 \mid x) = \frac{e^{w \cdot x}}{1 + e^{w \cdot x}} \qquad \Pr(y = 0 \mid x) = \frac{1}{1 + e^{w \cdot x}}$$

The logistic function:

$$f(z) = \frac{e^z}{e^z + 1}$$

[Figure: plot of logistic(z) for z from -8 to 8; the curve rises smoothly from 0 to 1, crossing 0.5 at z = 0.]
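A tiny Scala sketch of these two formulas (not from the slides; the names are illustrative): the logistic function, the resulting class-1 probability for a weight vector w and feature vector x, and the thresholded prediction.

object Logistic {
  // The logistic function: f(z) = e^z / (e^z + 1) = 1 / (1 + e^{-z}).
  def logistic(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  // Pr(y = 1 | x) = e^{w.x} / (1 + e^{w.x}) = logistic(w . x).
  def prY1(w: Array[Double], x: Array[Double]): Double =
    logistic(w.zip(x).map { case (wi, xi) => wi * xi }.sum)

  // Threshold at 0.5 (i.e., w . x >= 0) to get a hard 0/1 prediction.
  def predict(w: Array[Double], x: Array[Double]): Int =
    if (prY1(w, x) >= 0.5) 1 else 0
}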

Training an LR Classifier

Maximize the conditional likelihood:

$$\arg\max_{w} \prod_{i=1}^{n} \Pr(y_i \mid x_i, w)$$

Define the objective in terms of the conditional log likelihood:

$$L(w) = \sum_{i=1}^{n} \ln \Pr(y_i \mid x_i, w)$$

We know: $y \in \{0, 1\}$

So:

$$\Pr(y \mid x, w) = \Pr(y = 1 \mid x, w)^{y} \, \Pr(y = 0 \mid x, w)^{(1 - y)}$$

Substituting:

$$L(w) = \sum_{i=1}^{n} \Big( y_i \ln \Pr(y_i = 1 \mid x_i, w) + (1 - y_i) \ln \Pr(y_i = 0 \mid x_i, w) \Big)$$
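One intermediate step, not shown on the slides but following directly from the probabilities above, makes the derivative on the next slide less mysterious: plugging $\Pr(y = 1 \mid x, w) = e^{w \cdot x} / (1 + e^{w \cdot x})$ into the substituted objective and simplifying gives

$$L(w) = \sum_{i=1}^{n} \Big( y_i \, (w \cdot x_i) - \ln\left(1 + e^{w \cdot x_i}\right) \Big)$$

Differentiating this form with respect to $w$ yields exactly the gradient that follows.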

LR Classifier Update Rule

Take the derivative of the objective:

$$L(w) = \sum_{i=1}^{n} \Big( y_i \ln \Pr(y_i = 1 \mid x_i, w) + (1 - y_i) \ln \Pr(y_i = 0 \mid x_i, w) \Big)$$

$$\frac{\partial}{\partial w} L(w) = \sum_{i=1}^{n} x_i \Big( y_i - \Pr(y_i = 1 \mid x_i, w) \Big)$$

General form of the update rule (gradient ascent, since we are maximizing):

$$w^{(t+1)} \leftarrow w^{(t)} + \gamma^{(t)} \nabla_w L(w^{(t)}) \qquad \nabla L(w) = \left[ \frac{\partial L(w)}{\partial w_0}, \frac{\partial L(w)}{\partial w_1}, \ldots, \frac{\partial L(w)}{\partial w_d} \right]$$

Final update rule (per weight $w_i$):

$$w_i^{(t+1)} \leftarrow w_i^{(t)} + \gamma^{(t)} \sum_{j=1}^{n} x_{j,i} \Big( y_j - \Pr(y_j = 1 \mid x_j, w^{(t)}) \Big)$$
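Putting the final update rule into code: a minimal Scala sketch of batch gradient ascent for LR (not from the slides; the step size, iteration count, and helper names are illustrative), reusing the logistic idea from the earlier sketch:

object TrainLR {
  type Vec = Array[Double]
  case class Example(x: Vec, y: Double)  // y in {0.0, 1.0}

  // Pr(y = 1 | x, w) via the logistic function.
  def prY1(w: Vec, x: Vec): Double =
    1.0 / (1.0 + math.exp(-w.zip(x).map { case (wi, xi) => wi * xi }.sum))

  // w_i^(t+1) <- w_i^(t) + gamma * sum_j x_{j,i} * (y_j - Pr(y_j = 1 | x_j, w^(t)))
  def train(data: Seq[Example], d: Int,
            gamma: Double = 0.01, iterations: Int = 100): Vec = {
    var w = Array.fill(d)(0.0)
    for (_ <- 1 to iterations) {
      val grad = Array.fill(d)(0.0)
      for (ex <- data) {
        val err = ex.y - prY1(w, ex.x)
        for (i <- 0 until d) grad(i) += ex.x(i) * err
      }
      w = w.zip(grad).map { case (wi, gi) => wi + gamma * gi }  // ascent: add the gradient
    }
    w
  }
}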

Want more details? Take a real machine-learning course!

Lots more details…

Regularization
Different loss functions

MapReduce Implementation

$$\theta^{(t+1)} \leftarrow \theta^{(t)} - \gamma^{(t)} \frac{1}{n} \sum_{i=1}^{n} \nabla \ell(f(x_i; \theta^{(t)}), y_i)$$

[Figure: mappers compute partial gradients over their splits of the training data; a single reducer sums them and updates the model; iterate (launching a new job) until convergence.]
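A conceptual sketch of one such iteration in plain Scala (not Hadoop API code and not from the slides; it simply simulates the map and reduce phases over in-memory splits, with the per-example gradient helper as an assumption):

object MapReduceGD {
  type Vec = Array[Double]
  case class Example(x: Vec, y: Double)

  def addVecs(a: Vec, b: Vec): Vec = a.zip(b).map { case (u, v) => u + v }

  // One iteration: each "mapper" emits a partial gradient for its split,
  // and the single "reducer" sums them and produces the updated model.
  def iteration(splits: Seq[Seq[Example]],            // one split per mapper
                theta: Vec,
                gradLoss: (Vec, Example) => Vec,
                gamma: Double): Vec = {
    val n = splits.map(_.size).sum
    // Map phase: partial gradient per split.
    val partials = splits.map(split =>
      split.map(ex => gradLoss(theta, ex)).reduce(addVecs))
    // Reduce phase (single reducer): sum partials, average, update the model.
    val grad = partials.reduce(addVecs).map(_ / n)
    theta.zip(grad).map { case (w, g) => w - gamma * g }
  }
  // A driver would write the new model back and launch another job,
  // iterating until convergence.
}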

Shortcomings

Hadoop is bad at iterative algorithms:
High job startup costs
Awkward to retain state across iterations

High sensitivity to skew:
Iteration speed bounded by the slowest task

Potentially poor cluster utilization:
Must shuffle all data to a single reducer

Some possible tradeoffs:
Number of iterations vs. complexity of computation per iteration
E.g., L-BFGS: faster convergence, but more to compute

// Load and parse the training points once, then cache them in memory
// so that every iteration reuses them instead of re-reading from disk.
val points = spark.textFile(...).map(parsePoint).persist()

var w = // random initial vector
for (i <- 1 to ITERATIONS) {
  // Compute the gradient in parallel: map over the points, then sum.
  val gradient = points.map { p =>
    p.x * (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y
  }.reduce((a, b) => a + b)
  // Update the model on the driver.
  w -= gradient
}

Spark Implementation

[Figure: the same pattern as before: mappers compute partial gradients and a reducer updates the model, expressed here as the map and reduce steps inside the Spark loop above.]

Source: Wikipedia (Japanese rock garden)

Questions?