Gl conference2014 deployment_rajat

GraphLab in Production: Data Pipelines Rajat Arya Software Engineer July 21, 2014

description

Using GraphLab to build big data analytics pipelines and manage them in production. A presentation from the 3rd annual GraphLab Conference.

Transcript of Gl conference2014 deployment_rajat

Page 1: Gl conference2014 deployment_rajat

GraphLab in Production: Data Pipelines

Rajat Arya Software Engineer July 21, 2014

Page 2: Gl conference2014 deployment_rajat

Reusable components

Runs on Hadoop CDH5 now; Pivotal, Spark coming…

Runs on Cloud EC2 now; Azure, Google coming…

Data pipelines & Predictive services

Clean Learn Deploy

GraphLab Data Pipeline

Beyond batch & stream processing

Predictive applications require real-time service

Deployed directly from data pipeline

GraphLab Predictive Service

Monitor from GraphLab Canvas

   

Page 3: Gl conference2014 deployment_rajat

Sample Data Pipeline

A Simple Recommender System

Train Model, Recommend, Persist

•  Source: Raw data from CSV
•  Tasks: Train Model, Produce Recommendations, Persist
•  Destination: Write to Database
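The source / tasks / destination shape of this slide can be sketched in plain Python. This is only an illustration under stated assumptions: it uses popularity counts as a stand-in for a trained recommender and an in-memory SQLite database as the destination. Neither comes from the talk, and every function, column, and table name below is hypothetical.

```python
import csv
import io
import sqlite3
from collections import Counter, defaultdict


def load_ratings(csv_text):
    """Source: parse (user, item) pairs from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["user"], row["item"]) for row in reader]


def train_model(ratings):
    """Train: count how often each item appears (popularity stand-in)."""
    return Counter(item for _, item in ratings)


def recommend(model, ratings, k=2):
    """Recommend: up to k of the most popular items each user has not seen."""
    seen = defaultdict(set)
    for user, item in ratings:
        seen[user].add(item)
    ranked = [item for item, _ in model.most_common()]
    return {
        user: [item for item in ranked if item not in items][:k]
        for user, items in seen.items()
    }


def persist(conn, recs):
    """Destination: write one row per (user, item, position)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS recs (user_id TEXT, item_id TEXT, pos INTEGER)"
    )
    rows = [
        (user, item, pos)
        for user, items in recs.items()
        for pos, item in enumerate(items, start=1)
    ]
    conn.executemany("INSERT INTO recs VALUES (?, ?, ?)", rows)
    conn.commit()


# Hypothetical sample data standing in for the raw CSV source.
SAMPLE = "user,item\nann,a\nann,b\nbob,a\ncat,c\nbob,b\n"

ratings = load_ratings(SAMPLE)
model = train_model(ratings)
recs = recommend(model, ratings)
conn = sqlite3.connect(":memory:")
persist(conn, recs)
```

Each stage only sees the previous stage's result, which is the property the later slides rely on when the stages become separate tasks.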

Page 4: Gl conference2014 deployment_rajat

Sample Prototype

Page 5: Gl conference2014 deployment_rajat

Sample Prototype

MESSY, NOT MODULAR

FILE PATHS, NOT PORTABLE

Page 6: Gl conference2014 deployment_rajat

Typical Challenges to Production

•  Refactor code to remove magic numbers and file paths, and to support dynamic config

•  Rewrite entire prototype in ‘production’ language

•  Build / integrate workflow support tools

•  Build / integrate monitoring & management tools
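The first challenge, pulling magic numbers and file paths out of the code, might look like the following in plain Python. This is a minimal sketch, not something shown in the talk: the `PipelineConfig` class, its fields, and the sample values are all hypothetical.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineConfig:
    """Hypothetical settings that a prototype would otherwise hard-code."""
    csv_path: str                        # was a literal file path in the script
    num_recs: int = 10                   # was a magic number
    db_url: str = "sqlite:///:memory:"   # was an inline connection string


def load_config(text):
    """Build a config from JSON text.

    Unknown or missing keys raise TypeError, so a bad config fails
    before any pipeline work starts.
    """
    return PipelineConfig(**json.loads(text))


# Hypothetical config content; in practice this would come from a file.
config = load_config('{"csv_path": "ratings.csv", "num_recs": 5}')
```

The pipeline code then reads `config.csv_path` and `config.num_recs` instead of literals, so moving between machines means changing the config, not the code.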

Page 7: Gl conference2014 deployment_rajat

Typical Challenges to Production

•  Refactor code to remove magic numbers and file paths, and to support dynamic config

•  Rewrite entire prototype in ‘production’ language

•  Build / integrate workflow support tools

•  Build / integrate monitoring & management tools

GraphLab Create provides a better way …

Page 8: Gl conference2014 deployment_rajat

Sample Data Pipeline

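[Pipeline diagram: TRAIN, RECOMMEND, and PERSIST tasks; labels: csv:, model:, users:]

    import graphlab as gl

    def train_model(task):
        # Read the CSV location from the task's parameters
        csv = task.params['csv']
        data = gl.SFrame.read_csv(csv)
        model = gl.recommender.create(data)
        # Publish results by name for downstream tasks
        task.outputs['model'] = model
        task.outputs['users'] = data

§  Code can be Python functions or file(s)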

Page 9: Gl conference2014 deployment_rajat

Sample Data Pipeline

[Pipeline diagram: TRAIN, RECOMMEND, and PERSIST tasks; labels: csv:, model:, users:, recs:]

    def gen_recs(task):
        model = task.inputs['model']
        users = task.inputs['users']
        recs = model.recommend(users)
        task.outputs['recs'] = recs

§  Code can be Python functions or file(s)

§  Dependencies managed logically by name
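To make "dependencies managed logically by name" concrete, here is a minimal sketch of the idea in plain Python. It is not the GraphLab Create API: the `Task` class, `run_pipeline`, and the toy training logic are hypothetical stand-ins that only mimic the `task.params` / `task.inputs` / `task.outputs` shape used on the slides.

```python
class Task:
    """Hypothetical stand-in for a pipeline task (not the GraphLab Create API)."""

    def __init__(self, name, code, params=None, inputs=(), outputs=()):
        self.name = name
        self.code = code                    # function that takes the task itself
        self.params = dict(params or {})
        self.input_names = tuple(inputs)    # names this task expects upstream
        self.output_names = tuple(outputs)  # names this task promises to produce
        self.inputs = {}
        self.outputs = {}


def run_pipeline(tasks):
    """Run tasks in order, wiring each task's inputs to earlier outputs by name."""
    store = {}
    for task in tasks:
        missing = [n for n in task.input_names if n not in store]
        if missing:
            raise KeyError(f"{task.name}: missing inputs {missing}")
        task.inputs = {n: store[n] for n in task.input_names}
        task.outputs = {}
        task.code(task)
        unset = [n for n in task.output_names if n not in task.outputs]
        if unset:
            raise KeyError(f"{task.name}: outputs not set {unset}")
        store.update(task.outputs)
    return store


def train_model(task):
    rows = task.params["rows"]  # stand-in for reading a CSV
    counts = {}
    for _, item in rows:
        counts[item] = counts.get(item, 0) + 1
    task.outputs["model"] = counts
    task.outputs["users"] = sorted({user for user, _ in rows})


def gen_recs(task):
    model = task.inputs["model"]
    users = task.inputs["users"]
    best = max(model, key=model.get)  # most frequent item overall
    task.outputs["recs"] = {user: best for user in users}


train = Task(
    "train",
    train_model,
    params={"rows": [("ann", "a"), ("bob", "a"), ("bob", "b")]},
    outputs=("model", "users"),
)
recommend = Task("recommend", gen_recs, inputs=("model", "users"), outputs=("recs",))

store = run_pipeline([train, recommend])
```

Because `recommend` only asks for `model` and `users` by name, any upstream task that produces those two names can replace `train` without touching `gen_recs`.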

Page 10: Gl conference2014 deployment_rajat

Sample Data Pipeline

[Pipeline diagram: TRAIN, RECOMMEND, and PERSIST tasks; labels: csv:, model:, users:, recs:]

    def persist_db(task):
        recs = task.inputs['recs']
        conn = task.params['conn']
        import mysqlconnector
        # save_to_db is not defined on the slide; save() arguments are elided there
        save_to_db(conn, recs.save(format…))

§  Code can be Python functions or file(s)

§  Dependencies managed logically by name

§  Set required Python packages so Task is portable

§  Automatic installation and configuration prior to execution
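One way to picture the "required packages" idea is a pre-flight check that runs before the task code. The sketch below is an assumption-laden illustration, not GraphLab Create's mechanism: it only detects missing top-level packages with the standard library and does not install anything, and both function names are hypothetical.

```python
import importlib.util


def missing_packages(required):
    """Return the top-level package names in `required` that cannot be imported here.

    Only top-level names are supported; dotted names are out of scope
    for this sketch.
    """
    return [name for name in required if importlib.util.find_spec(name) is None]


def run_if_portable(code, required):
    """Run `code` only when every declared package is importable."""
    missing = missing_packages(required)
    if missing:
        raise RuntimeError("install before running: " + ", ".join(missing))
    return code()


# Usage: declare the dependency next to the task, then run through the check.
result = run_if_portable(lambda: "persisted", required=["sqlite3", "json"])
```

Declaring the packages alongside the task is what makes it portable: the execution environment, rather than the author's laptop, becomes responsible for satisfying them.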

Page 11: Gl conference2014 deployment_rajat

Sample Data Pipeline

[Pipeline diagram: TRAIN, RECOMMEND, and PERSIST tasks, plus an INTERN TRAIN task; labels: csv:, model:, users:, recs:]

§  Tasks are modular and reusable, enabling incremental development and rapid iterations

Page 12: Gl conference2014 deployment_rajat

Sample Data Pipeline

[Pipeline diagram: TRAIN, RECOMMEND, and PERSIST tasks, plus an INTERN TRAIN task; labels: csv:, model:, users:, recs:]

§  Tasks are modular and reusable, enabling incremental development and rapid iterations

Page 13: Gl conference2014 deployment_rajat

Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

•  One way to create Jobs (with task bindings)

Page 14: Gl conference2014 deployment_rajat

Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

•  One way to create Jobs (with task bindings)
•  One way to monitor Jobs

Page 15: Gl conference2014 deployment_rajat

Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='ec2-prod')

•  One way to create Jobs (with task bindings)
•  One way to monitor Jobs
•  Run on Hadoop, EC2, or locally without changing code

Page 16: Gl conference2014 deployment_rajat

Executing Data Pipelines

    job = gl.deploy.job.create(
        [train, recommend, persist],
        environment='cdh5-prod')

•  One way to create Jobs (with task bindings)
•  One way to monitor Jobs
•  Run on Hadoop, EC2, or locally without changing code
•  Recall previous Jobs and Tasks, maintain workbench
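The points on this slide (one creation call, per-task status, swappable environments, recallable jobs) can be sketched in plain Python. Everything here is a hypothetical stand-in rather than the `gl.deploy` API: the environment classes, the `ENVIRONMENTS` registry, `JOB_HISTORY`, and `create_job` are invented for illustration, and a worker thread stands in for a remote cluster.

```python
from concurrent.futures import ThreadPoolExecutor


class InlineEnvironment:
    """Runs each step in the calling thread."""

    def run(self, step):
        return step()


class WorkerEnvironment:
    """Runs each step on a worker thread; stands in for a remote cluster."""

    def run(self, step):
        with ThreadPoolExecutor(max_workers=1) as pool:
            return pool.submit(step).result()


# Hypothetical registry: the environment is chosen by name, not by code changes.
ENVIRONMENTS = {"local": InlineEnvironment(), "worker": WorkerEnvironment()}
JOB_HISTORY = []  # lets earlier jobs be recalled later


class Job:
    def __init__(self, steps, environment):
        self.environment = environment
        self.status = {step.__name__: "pending" for step in steps}


def create_job(steps, environment):
    """Run steps in order in the named environment, recording per-step status."""
    env = ENVIRONMENTS[environment]
    job = Job(steps, environment)
    JOB_HISTORY.append(job)
    for step in steps:
        try:
            env.run(step)
        except Exception:
            job.status[step.__name__] = "failed"
            break  # later steps stay "pending"
        job.status[step.__name__] = "completed"
    return job


log = []


def train():
    log.append("train")


def recommend():
    log.append("recommend")


def persist():
    log.append("persist")


# The same step list runs in either environment; only the name changes.
local_job = create_job([train, recommend, persist], environment="local")
worker_job = create_job([train, recommend, persist], environment="worker")
```

The step functions never mention where they run, which is the design choice that allows switching between environments by changing a single argument.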

Page 17: Gl conference2014 deployment_rajat

GraphLab Data Pipeline Demo

Page 18: Gl conference2014 deployment_rajat

GraphLab Data Pipeline Recap

Define it Once

Run & Monitor it anywhere

All in GraphLab Create

Page 19: Gl conference2014 deployment_rajat

Thank you.

@rajatarya