
Using GraphLab to build big data analytics pipelines and manage them in production. A presentation from the 3rd annual GraphLab Conference.


GraphLab in Production: Data Pipelines

Rajat Arya, Software Engineer, July 21, 2014

Reusable components

Runs on Hadoop: CDH5 now; Pivotal, Spark coming…

Runs in the cloud: EC2 now; Azure, Google coming…

Data pipelines & Predictive services

Clean → Learn → Deploy

GraphLab Data Pipeline

Beyond batch & stream processing

Predictive applications require real-time service

Deployed directly from data pipeline

GraphLab Predictive Service

Monitor from GraphLab Canvas


Sample Data Pipeline

A Simple Recommender System

Train Model → Recommend → Persist

•  Source: raw data from CSV
•  Tasks: Train Model, Produce Recommendations, Persist
•  Destination: write to database

Sample Prototype


MESSY, NOT MODULAR
FILE PATHS, NOT PORTABLE
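The prototype code itself is not legible in the transcript; below is a representative sketch of the kind of script these callouts are describing, with made-up paths and column names, so the refactoring steps on the next slide have something concrete to point at.

import graphlab as gl

# everything inlined: hard-coded local path, implicit column names, output
# written to another fixed path; fine for exploration, hard to rerun anywhere else
data = gl.SFrame.read_csv('/Users/rajat/Desktop/ratings_final_v3.csv')
model = gl.recommender.create(data)
recs = model.recommend(data['user_id'].unique())
recs.save('/tmp/recs.csv', format='csv')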

Typical Challenges to Production

•  Refactor code to remove magic numbers, file paths, support dynamic config
•  Rewrite entire prototype in 'production' language
•  Build / integrate workflow support tools
•  Build / integrate monitoring & management tools


GraphLab Create provides a better way …

Sample Data Pipeline

[Pipeline diagram: TRAIN → RECOMMEND → PERSIST; TRAIN reads the csv parameter and produces users and model]

def train_model(task):
    csv = task.params['csv']
    data = gl.SFrame.read_csv(csv)
    model = gl.recommender.create(data)
    task.outputs['model'] = model
    task.outputs['users'] = data

§  Code can be Python functions or file(s)
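The slide shows the body of the TRAIN step but not how a plain Python function becomes a named task. A minimal sketch of that registration step, assuming a graphlab.deploy Task object with set_code / set_params / set_outputs style methods; the exact names and the CSV path here are assumptions, not taken from the deck.

import graphlab as gl

# hypothetical registration of train_model as a named, parameterized task
train = gl.deploy.Task('train')
train.set_code(train_model)                               # the function shown above
train.set_params({'csv': 's3://my-bucket/ratings.csv'})   # placeholder path
train.set_outputs(['model', 'users'])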

Sample Data Pipeline

[Pipeline diagram: TRAIN → RECOMMEND → PERSIST; RECOMMEND consumes model and users and produces recs]

§  Code can be Python functions or file(s)

def gen_recs(task):
    model = task.inputs['model']
    users = task.inputs['users']
    recs = model.recommend(users)
    task.outputs['recs'] = recs

§  Dependencies managed logically by name

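To make "dependencies managed logically by name" concrete: gen_recs never references the TRAIN task directly; it only asks for inputs called model and users, and the framework wires those to whichever upstream task declared outputs with the same names. Another sketch against the same assumed Task API as above.

# hypothetical wiring of the RECOMMEND step; input names match TRAIN's output names
recommend = gl.deploy.Task('recommend')
recommend.set_code(gen_recs)
recommend.set_inputs(['model', 'users'])
recommend.set_outputs(['recs'])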

Sample Data Pipeline

[Pipeline diagram: TRAIN → RECOMMEND → PERSIST; PERSIST consumes recs and writes to the database]

§  Code can be Python functions or file(s)

§  Dependencies managed logically by name

def persist_db(task):
    recs = task.inputs['recs']
    conn = task.params['conn']
    import mysqlconnector
    save_to_db(conn, recs.save(format…)


§  Set required python packages so Task is portable

§  Automatic installation and configuration prior to execution
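A sketch of the required-packages idea for the PERSIST step, again using the assumed Task API; the set_required_packages method name, the package spec, and the connection string are assumptions.

persist = gl.deploy.Task('persist')
persist.set_code(persist_db)
persist.set_inputs(['recs'])
persist.set_params({'conn': 'mysql://prod-db/recs'})        # placeholder connection string
# declared pip packages are installed and configured on the execution host
# before the task runs, so the task stays portable
persist.set_required_packages(['mysql-connector-python'])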

Sample Data Pipeline

[Pipeline diagram: the TRAIN task swapped out for an INTERN TRAIN task; RECOMMEND and PERSIST unchanged]

§  Tasks are modular and reusable, enabling incremental development and rapid iterations
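Because tasks are modular, trying out an alternative training step is just a change to the task list passed to gl.deploy.job.create (shown on the "Executing Data Pipelines" slides below); RECOMMEND and PERSIST are untouched. Here intern_train stands in for a hypothetical replacement TRAIN task.

# same RECOMMEND and PERSIST tasks, different training step
job = gl.deploy.job.create(
    [intern_train, recommend, persist],
    environment='cdh5-prod')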


Executing Data Pipelines

job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')

•  One way to create Jobs (with task bindings)
•  One way to monitor Jobs
•  Run on Hadoop, EC2, or locally without changing code, e.g. by passing environment='ec2-prod' instead (environments are sketched below)
•  Recall previous Jobs and Tasks, maintain workbench
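The environment names 'cdh5-prod' and 'ec2-prod' refer to execution environments defined ahead of time. A sketch of how such environments might be declared, assuming graphlab.deploy.environment helpers named Local, Hadoop, and EC2; the class names and constructor arguments are assumptions, and real definitions would also carry cluster and credential configuration.

import graphlab as gl

# define once, then refer to them by name when creating jobs
local = gl.deploy.environment.Local('local-dev')
hadoop = gl.deploy.environment.Hadoop('cdh5-prod')
ec2 = gl.deploy.environment.EC2('ec2-prod')

# the same task list runs anywhere by swapping the environment name
job = gl.deploy.job.create([train, recommend, persist], environment='ec2-prod')
# job progress and results can then be monitored from GraphLab Canvas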

GraphLab Data Pipeline Demo

GraphLab Data Pipeline Recap

Define it once
Run & monitor it anywhere

All in GraphLab Create

Thank you.

rajat@graphlab.com @rajatarya