GraphLab in Production: Data Pipelines
Rajat Arya Software Engineer July 21, 2014
Reusable components
Runs on Hadoop CDH5 now; Pivotal, Spark coming…
Runs on Cloud EC2 now; Azure, Google coming…
Data pipelines & Predictive services
Clean Learn Deploy
GraphLab Data Pipeline
Beyond batch & stream processing
Predictive applications require real-time service
Deployed directly from data pipeline
GraphLab Predictive Service
Monitor from GraphLab Canvas
Sample Data Pipeline
A Simple Recommender System
Train Model → Recommend → Persist
• Source: Raw data from CSV
• Tasks: Train Model, Produce Recommendations, Persist
• Destination: Write to Database
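Before the GraphLab version on the following slides, the shape of this source → tasks → destination pipeline can be sketched in plain Python. Everything here (the function names, the toy "model", the dict standing in for a database) is an illustrative stand-in, not the GraphLab Create API:

```python
# Toy sketch of the recommender pipeline: Source (CSV) -> Train ->
# Recommend -> Persist. All names here are hypothetical stand-ins.
import csv, io

RAW = "user,item\na,x\na,y\nb,x\n"

def load_source(text):
    # Source: parse raw CSV into rows
    return list(csv.DictReader(io.StringIO(text)))

def train(rows):
    # Toy "model": the set of items each user has already seen
    seen = {}
    for r in rows:
        seen.setdefault(r["user"], set()).add(r["item"])
    return seen

def recommend(model):
    # Recommend the items a user has not seen yet
    all_items = set().union(*model.values())
    return {u: sorted(all_items - items) for u, items in model.items()}

def persist(recs, destination):
    # Destination: stand-in for a database write
    destination.update(recs)

db = {}
persist(recommend(train(load_source(RAW))), db)
# db now holds {"a": [], "b": ["y"]}
```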
Sample Prototype
MESSY, NOT MODULAR
FILE PATHS NOT PORTABLE
Typical Challenges to Production
• Refactor code to remove magic numbers, file paths, support dynamic config
• Rewrite entire prototype in 'production' language
• Build / integrate workflow support tools
• Build / integrate monitoring & management tools
GraphLab Create provides a better way …
Sample Data Pipeline
TRAIN → RECOMMEND → PERSIST
[Diagram: TRAIN task reads csv:, writes users: and model: to disk]

def train_model(task):
    csv = task.params['csv']
    data = gl.SFrame.read_csv(csv)
    model = gl.recommender.create(data)
    task.outputs['model'] = model
    task.outputs['users'] = data

§ Code can be Python functions or file(s)
Sample Data Pipeline
TRAIN → RECOMMEND → PERSIST
[Diagram: RECOMMEND task takes model: and users: from TRAIN, writes recs: to disk]

def gen_recs(task):
    model = task.inputs['model']
    users = task.inputs['users']
    recs = model.recommend(users)
    task.outputs['recs'] = recs

§ Code can be Python functions or file(s)
§ Dependencies managed logically by name
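The name-based wiring between tasks can be sketched with a toy Task class. This is an illustration of the idea on the slide, not the GraphLab Create deployment API; `Task`, `run_pipeline`, and the bindings dict are all hypothetical:

```python
# Toy illustration of "dependencies managed logically by name":
# a consumer task's named input is bound to a producer task's named output.
class Task:
    def __init__(self, name, fn, params=None):
        self.name, self.fn = name, fn
        self.params = params or {}
        self.inputs, self.outputs = {}, {}

    def run(self):
        self.fn(self)

def run_pipeline(tasks, bindings):
    """bindings: {consumer_name: {input_name: (producer_name, output_name)}}"""
    done = {}
    for t in tasks:
        # Wire each declared input to the named output of an earlier task
        for key, (producer, out) in bindings.get(t.name, {}).items():
            t.inputs[key] = done[producer].outputs[out]
        t.run()
        done[t.name] = t
    return done

def make(task):
    task.outputs['data'] = [1, 2, 3]

def double(task):
    task.outputs['result'] = [x * 2 for x in task.inputs['data']]

done = run_pipeline(
    [Task('make', make), Task('double', double)],
    {'double': {'data': ('make', 'data')}})
# done['double'].outputs['result'] is [2, 4, 6]
```

Because the wiring is by name rather than by file path, either side can be swapped out without editing the other.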
Sample Data Pipeline
TRAIN → RECOMMEND → PERSIST
[Diagram: PERSIST task takes recs: from RECOMMEND and writes to the database]

def persist_db(task):
    recs = task.inputs['recs']
    conn = task.params['conn']
    import mysqlconnector
    save_to_db(conn, recs.save(format…))

§ Code can be Python functions or file(s)
§ Dependencies managed logically by name
§ Set required Python packages so Task is portable
§ Automatic installation and configuration prior to execution
Sample Data Pipeline
TRAIN → RECOMMEND → PERSIST
[Diagram: INTERN TRAIN task swapped in for TRAIN, with the same csv:, users:, and model: bindings]

§ Tasks are modular and reusable, enabling incremental development and rapid iterations
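The swap shown on the diagram (an intern's experimental trainer replacing the baseline TRAIN task) can be sketched as follows. Because tasks connect only through named outputs, the downstream task never changes. These functions are illustrative stand-ins, not the GraphLab Create API:

```python
# Sketch of task modularity: two trainers produce the same named output
# ("model"), so either can feed the downstream task unchanged.
def train(inputs):
    # Baseline trainer: toy "model" is the sorted data
    return {"model": sorted(inputs["data"])}

def intern_train(inputs):
    # Drop-in experimental variant with the same output name
    return {"model": sorted(inputs["data"], reverse=True)}

def downstream(inputs):
    # Depends only on the name "model", not on which trainer made it
    return {"top": inputs["model"][0]}

def run(trainer, data):
    model_out = trainer({"data": data})
    return downstream({"model": model_out["model"]})

baseline = run(train, [3, 1, 2])        # {"top": 1}
variant = run(intern_train, [3, 1, 2])  # {"top": 3}
```

Swapping implementations this way is what makes the incremental development and rapid iteration on the slide possible.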
Executing Data Pipelines

job = gl.deploy.job.create(
    [train, recommend, persist],
    environment='cdh5-prod')

• One way to create Jobs (with task bindings)
• One way to monitor Jobs
• Run on Hadoop, EC2, or locally without changing code (e.g. environment='ec2-prod')
• Recall previous Jobs and Tasks, maintain workbench
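The job.create idea above can be sketched as a small Job object that runs an ordered list of tasks and records a status for later monitoring. The `environment` argument here is only a stand-in; the point of the slide is that GraphLab Create targets Hadoop, EC2, or local with the same call, which this toy version does not implement:

```python
# Toy sketch of a Job: run tasks in order, track status for monitoring.
# Not the GraphLab Create API; class and field names are hypothetical.
class Job:
    def __init__(self, tasks, environment):
        self.tasks = tasks
        self.environment = environment
        self.status = "Pending"
        self.results = []

    def run(self):
        self.status = "Running"
        try:
            for t in self.tasks:
                self.results.append(t())
            self.status = "Completed"
        except Exception:
            self.status = "Failed"
            raise
        return self

job = Job([lambda: "trained", lambda: "recommended", lambda: "persisted"],
          environment="local").run()
# job.status is "Completed"; job.results lists each task's result
```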
GraphLab Data Pipeline Demo
GraphLab Data Pipeline Recap
Define it Once, Run & Monitor it Anywhere
All in GraphLab Create
Thank you.
[email protected] @rajatarya