Boston hug

Machine Learning with Hadoop

description

A talk on why and how machine learning works with Hadoop, including recent developments for real-time operation.

Transcript of Boston hug

Page 1: Boston hug

Machine Learning with Hadoop

Page 2: Boston hug

Agenda

• Why Big Data? Why now?

• What can you do with big data?

• How does it work?

Page 3: Boston hug

Slow Motion Explosion

Page 4: Boston hug

Why Now?

• But Moore’s law has applied for a long time

• Why is Hadoop/Big Data exploding now?

• Why not 10 years ago?

• Why not 20?

2/15/12

Page 5: Boston hug

Size Matters, but …

• If it were just availability of data then existing big companies would adopt big data technology first

Page 6: Boston hug

Size Matters, but …

• If it were just availability of data then existing big companies would adopt big data technology first

They didn’t

Page 7: Boston hug

Or Maybe Cost

• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

Page 8: Boston hug

Or Maybe Cost

• If it were just a net positive value then finance companies should adopt first because they have higher opportunity value / byte

They didn’t

Page 9: Boston hug

Backwards adoption

• Under almost any threshold argument startups would not adopt big data technology first

Page 10: Boston hug

Backwards adoption

• Under almost any threshold argument startups would not adopt big data technology first

They did

Page 11: Boston hug

Everywhere at Once?

• Something very strange is happening
  – Big data is being applied at many different scales
  – At many value scales
  – By large companies and small

Page 12: Boston hug

Everywhere at Once?

• Something very strange is happening
  – Big data is being applied at many different scales
  – At many value scales
  – By large companies and small

Why?

Page 13: Boston hug

Analytics Scaling Laws

• Analytics scaling is all about the 80-20 rule
  – Big gains for little initial effort
  – Rapidly diminishing returns

• The key to net value is how costs scale
  – Old school: exponential scaling
  – Big data: linear scaling, low constant

• Cost/performance has changed radically
  – IF you can use many commodity boxes
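The interplay between diminishing returns and cost scaling can be illustrated with a toy model. This is only a sketch: the value curve, both cost curves, and every constant (`scale`, `rate`, `max_effort`) are hypothetical choices made for illustration and do not come from the talk.

```python
import math

def value(effort):
    # Diminishing returns: most of the value arrives early (80-20 shape).
    return 1.0 - math.exp(-effort)

def exponential_cost(effort, scale=0.01):
    # "Old school" assumption: cost grows exponentially with effort.
    return scale * (math.exp(effort) - 1.0)

def linear_cost(effort, rate=0.02):
    # "Big data" assumption: cost grows linearly with a low constant.
    return rate * effort

def best_effort(cost, max_effort=10.0, steps=1000):
    # Grid search for the effort level that maximizes value minus cost.
    grid = [max_effort * i / steps for i in range(steps + 1)]
    return max(grid, key=lambda e: value(e) - cost(e))

exp_effort = best_effort(exponential_cost)
lin_effort = best_effort(linear_cost)
exp_net = value(exp_effort) - exponential_cost(exp_effort)
lin_net = value(lin_effort) - linear_cost(lin_effort)

print(f"exponential cost: best effort {exp_effort:.2f}, net value {exp_net:.2f}")
print(f"linear cost:      best effort {lin_effort:.2f}, net value {lin_net:.2f}")
```

With these made-up curves, the net-value optimum under linear cost sits at a higher effort level and yields more net value than under exponential cost, which is the qualitative point of the scaling argument.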

Page 14: Boston hug

We knew that

We should have known that

We didn’t know that!

You’re kidding, people do that?

Page 15: Boston hug

Anybody with eyes

Intern with a spreadsheet

In-house analytics

Industry-wide data consortium

NSA, non-proliferation

Page 16: Boston hug

Net value optimum has a sharp peak well before maximum effort

Page 17: Boston hug

But scaling laws are changing both slope and shape

Page 18: Boston hug

More than just a little

Page 19: Boston hug

They are changing a LOT!

Page 20: Boston hug
Page 21: Boston hug
Page 22: Boston hug
Page 23: Boston hug
Page 24: Boston hug

Initially, linear cost scaling actually makes things worse

A tipping point is reached and things change radically …

Page 25: Boston hug

Pre-requisites for Tipping

To reach the tipping point:

• Algorithms must scale out horizontally
  – On commodity hardware
  – That can and will fail

• Data practice must change
  – Denormalized is the new black
  – Flexible data dictionaries are the rule
  – Structured data becomes rare

Page 26: Boston hug

So that is why, and why now

Page 27: Boston hug

So that is why, and why now

What can you do with it? And how?

Page 28: Boston hug

Agenda

• Mahout outline
  – Recommendations
  – Clustering
  – Classification

• Hybrid Parallel/Sequential Systems

• Real-time learning

Page 29: Boston hug

Agenda

• Mahout outline
  – Recommendations
  – Clustering
  – Classification
    • Supervised on-line learning
    • Feature hashing

• Hybrid Parallel/Sequential Systems

• Real-time learning

Page 30: Boston hug

Classification in Detail

• Naive Bayes Family
  – Hadoop-based training

• Decision Forests
  – Hadoop-based training

• Logistic Regression (aka SGD)
  – Fast on-line (sequential) training

Page 31: Boston hug

Page 32: Boston hug

Classification in Detail

• Naive Bayes Family
  – Hadoop-based training

• Decision Forests
  – Hadoop-based training

• Logistic Regression (aka SGD)
  – Fast on-line (sequential) training
  – Now with MORE topping!

Page 33: Boston hug

How it Works

• We are given “features”
  – Often binary values in a vector

• The algorithm learns weights
  – The weighted sum of feature × weight is the key

• Each weight is a single real value
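A minimal sketch of this idea, assuming plain logistic regression trained one example at a time with stochastic gradient descent. The function names, the learning rate, and the tiny data set are hypothetical; this is not Mahout's implementation.

```python
import math

def score(weights, features):
    # Weighted sum of feature * weight, squashed into a probability.
    z = sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def sgd_step(weights, features, label, rate=0.1):
    # On-line update: nudge each weight in proportion to the prediction error.
    error = label - score(weights, features)
    return [w + rate * error * x for w, x in zip(weights, features)]

# Hypothetical binary feature vectors: slot 0 marks positives, slot 2 negatives.
examples = [
    ([1, 1, 0, 0], 1),
    ([1, 0, 0, 1], 1),
    ([0, 1, 1, 0], 0),
    ([0, 0, 1, 1], 0),
]

weights = [0.0, 0.0, 0.0, 0.0]
for _ in range(200):                      # sequential passes over the data
    for features, label in examples:
        weights = sgd_step(weights, features, label)

print([round(w, 2) for w in weights])
print(round(score(weights, [1, 0, 0, 0]), 2), round(score(weights, [0, 0, 1, 0]), 2))
```

Each weight stays a single real number, and training only ever touches one example at a time, which is what makes the sequential approach fast.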

Page 34: Boston hug

An Example

Page 35: Boston hug

Features

From: Dr. Paul Acquah
Dear Sir,
Re: Proposal for over-invoice Contract Benevolence

Based on information gathered from the India hospital directory, I am pleased to propose a confidential business deal for our mutual benefit. I have in my possession, instruments (documentation) to transfer the sum of 33,100,000.00 eur (thirty-three million one hundred thousand euros, only) into a foreign company's bank account for our favor....

Date: Thu, May 20, 2010 at 10:51 AM
From: George <[email protected]>

Hi Ted, was a pleasure talking to you last night at the Hadoop User Group. I liked the idea of going for lunch together. Are you available tomorrow (Friday) at noon?

Page 36: Boston hug

But …

• Text and words aren’t suitable features

• We need a numerical vector

• So we use binary vectors with lots of slots

Page 37: Boston hug

Feature Encoding

Page 38: Boston hug

Hashed Encoding

Page 39: Boston hug

Feature Collisions
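A minimal sketch of hashed encoding, assuming each word is hashed into a fixed number of slots and, to soften the effect of collisions, is placed in more than one slot. The function name, the slot count, the number of probes, and the use of MD5 are illustrative assumptions, not Mahout's encoder.

```python
import hashlib

def hashed_encode(words, slots=32, probes=2):
    # Binary vector with a fixed number of slots, regardless of vocabulary size.
    vector = [0] * slots
    for word in words:
        for probe in range(probes):
            # A stable hash, so the same word always lands in the same slots.
            digest = hashlib.md5(f"{probe}:{word}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % slots
            vector[index] = 1
    return vector

spam_words = ["confidential", "business", "deal", "transfer", "euros"]
ham_words = ["pleasure", "talking", "hadoop", "lunch", "friday"]

spam_vector = hashed_encode(spam_words)
ham_vector = hashed_encode(ham_words)
print(spam_vector)
print(ham_vector)
```

Two different words can land in the same slot. With several probes per word, a collision in one slot still leaves the word's other slots available to the learner.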

Page 40: Boston hug

Training Data

Page 41: Boston hug

Training Data

Page 42: Boston hug

Training Data

Page 43: Boston hug

Full Scale Training

[Diagram labels: Input; Side-data; Data join; Feature extraction and down sampling; Map-reduce; Sequential SGD learning; Now via NFS]

Page 44: Boston hug

Hybrid Model Development

[Diagram labels: Logs; User sessions; Group by user; Count transaction patterns; Training data; Account info; Merge; Shared filesystem; PROC LOGISTIC; Model. Regions: Big-data cluster; Legacy modeling]

Page 45: Boston hug

Enter the Pig Vector

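• Pig UDFs for
  – Vector encoding
  – Model training

-- Define the vector-encoding UDF.
define EncodeVector org.apache.mahout.pig.encoders.EncodeVector('10', 'x+y+1', 'x:numeric, y:numeric, z:numeric');

-- Encode each document, then train one model over all vectors.
-- train is assumed to be a model-training UDF defined elsewhere.
vectors = foreach docs generate newsgroup, EncodeVector(*) as v;
grouped = group vectors all;
model = foreach grouped generate 1 as key, train(vectors) as model;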

Page 46: Boston hug

Real-time Developments

• Storm + Hadoop + MapR
  – Real-time with Storm
  – Long-term with Hadoop
  – State checkpoints with MapR

• Add the Bayesian Bandit for on-line learning

Page 47: Boston hug

Aggregate Splicing

Hadoop handles the past

Storm handles the present
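One way to picture splicing is a query that adds a batch aggregate covering everything up to a cutoff to a count of the real-time events seen since that cutoff. The sketch below is an assumption about how such a splice might look; the function, the data layout, and the timestamps are hypothetical and use neither Storm nor Hadoop APIs.

```python
def spliced_count(batch_totals, realtime_events, key, batch_cutoff):
    # Past: totals already computed by the batch job, up to batch_cutoff.
    past = batch_totals.get(key, 0)
    # Present: events at or after the cutoff, counted from the real-time stream.
    present = sum(
        1 for timestamp, event_key in realtime_events
        if event_key == key and timestamp >= batch_cutoff
    )
    return past + present

batch_totals = {"page_a": 1000}      # produced by the batch side
batch_cutoff = 100                   # last timestamp covered by the batch job
realtime_events = [                  # (timestamp, key) seen by the real-time side
    (95, "page_a"),                  # before the cutoff, so already in the batch total
    (101, "page_a"),
    (102, "page_b"),
    (105, "page_a"),
]

total_a = spliced_count(batch_totals, realtime_events, "page_a", batch_cutoff)
total_b = spliced_count(batch_totals, realtime_events, "page_b", batch_cutoff)
print(total_a, total_b)
```

The cutoff is what keeps an event from being counted twice: anything before it belongs to the batch total, anything at or after it belongs to the real-time count.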

Page 48: Boston hug

Mobile Network Monitor

[Diagram labels: Transaction data; Geo-dispersed ingest servers; Batch aggregation; HBase; Real-time dashboard and alerts; Retro-analysis interface]

Page 49: Boston hug

A Quick Diversion

• You see a coin
  – What is the probability of heads?
  – Could it be larger or smaller than that?

• I flip the coin and, while it is in the air, ask again

• I catch the coin and ask again

• I look at the coin (and you don’t) and ask again

• Why does the answer change?
  – And did it ever have a single value?

Page 50: Boston hug

A First Conclusion

• Probability as expressed by humans is subjective and depends on information and experience

Page 51: Boston hug

A Second Conclusion

• A single number is a bad way to express uncertain knowledge

• A distribution of values might be better

Page 52: Boston hug

I Dunno

Page 53: Boston hug

5 and 5

Page 54: Boston hug

2 and 10

Page 55: Boston hug

Bayesian Bandit

• Compute distributions based on data

• Sample p1 and p2 from these distributions

• Put a coin in bandit 1 if p1 > p2

• Else, put the coin in bandit 2
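The steps above can be sketched as a small simulation. It assumes each bandit pays out with a fixed unknown probability and that belief about that probability is a Beta distribution starting from a uniform prior. The payoff rates, the seed, and the function names are hypothetical.

```python
import random

def choose_bandit(wins, losses, rng):
    # Sample a plausible payoff rate for each bandit from its Beta distribution
    # (uniform prior assumed, hence the +1 on each count), then play the largest.
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=lambda i: samples[i])

def simulate(true_rates, pulls, seed=0):
    rng = random.Random(seed)
    wins = [0] * len(true_rates)
    losses = [0] * len(true_rates)
    for _ in range(pulls):
        i = choose_bandit(wins, losses, rng)
        # Put a coin in the chosen bandit and record whether it paid out.
        if rng.random() < true_rates[i]:
            wins[i] += 1
        else:
            losses[i] += 1
    return wins, losses

wins, losses = simulate([0.2, 0.6], pulls=2000)
plays = [w + l for w, l in zip(wins, losses)]
print("plays per bandit:", plays)
```

Early on the sampled values overlap heavily, so both bandits get tried. As evidence accumulates, the distributions narrow and the better bandit receives nearly all of the coins.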

Page 56: Boston hug
Page 57: Boston hug
Page 58: Boston hug

The Basic Idea

• We can encode a distribution by sampling

• Sampling allows unification of exploration and exploitation

• Can be extended to more general response models

Page 59: Boston hug

Deployment with Storm/MapR

All state managed transactionally in MapR file system

Page 60: Boston hug

Service Architecture

MapR Lockless Storage Services

MapR Pluggable Service Management

Storm

Hadoop

Page 61: Boston hug

Find Out More

• Me: [email protected] [email protected] [email protected]

• MapR: http://www.mapr.com

• Mahout: http://mahout.apache.org

• Code: https://github.com/tdunning