Open Source Framework for Deploying Data Science Models and Cloud Based Applications by Noelle Sio...

Post on 19-Jul-2015

Open Source Framework for Deploying Data Science Models and Cloud Based Applications

Pivotal Data Science Team

What happened?

What should I do about it? This is where Data Science comes in

What will happen next?

What Thought Leaders Have In Common

• Large amounts of structured and unstructured data
• Deep personal knowledge of their audience
• Quantified understanding of their products
• Data-driven culture
• User experience optimized by data science

Internal Data Sources: Viewership, Advertisements, Merchandise, Sales & Finance

Typical External Sources: Market Research & Competitive Information, Audience Demographics

Semi/Unstructured Data: Clickstream, Social Media, Content

Data Science Impact Business Motivation

Increase Demand

Build Brand Equity

Increase Production Efficiency

Optimize Ad Spend Efficiency

Increase Customer Engagement

• Campaign Optimization

• Marketing Mix Models

Data Science Opportunities

• Customer segmentation

• Affinity analysis

• Social media analytics

• Supply/Demand forecasting

Increase Revenue

Reduce Cost

Example Use Case: Ratings Prediction

Use Case: Increase ratings across viewer demographics

How:
• Data: Viewership, transcripts and show data combined in big data platform
• Model: Machine learning used to identify the impact of production decisions on viewership
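
As a rough illustration of the "Model" bullet above, the sketch below fits an ordinary least squares model relating production decisions to ratings. The deck does not say which algorithm or features were used; the feature names (`on_screen_argument`, `guest_star`), the toy episode data, and the function `fit_linear_model` are all hypothetical and chosen only to show the idea.

```python
from typing import List, Sequence

# Hypothetical toy data, NOT from the deck: each row is one episode with
# two binary production decisions [on_screen_argument, guest_star].
EPISODE_FEATURES = [[0, 0], [1, 0], [0, 1], [1, 1]]
EPISODE_RATINGS = [5.0, 4.0, 5.5, 4.5]  # made-up ratings for illustration


def fit_linear_model(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    Returns [intercept, w_1, ..., w_k]. The system is solved with
    Gauss-Jordan elimination and partial pivoting (fine for tiny examples).
    """
    rows = [[1.0] + [float(v) for v in r] for r in X]  # prepend intercept column
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y]
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; cannot fit")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(n):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][n] for i in range(n)]


if __name__ == "__main__":
    intercept, w_argument, w_guest = fit_linear_model(EPISODE_FEATURES, EPISODE_RATINGS)
    print(f"baseline rating: {intercept:.2f}")
    print(f"effect of on-screen argument: {w_argument:+.2f}")
    print(f"effect of guest star: {w_guest:+.2f}")
```

The fitted coefficients are what a data scientist would then interpret: a negative weight on a production decision suggests it is associated with lower viewership in the data.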

Insights

Models Insights Actions

Models are built to answer business questions, e.g. what makes viewers tune-in and tune-out?

Data Scientists interpret models for answers, e.g. on-screen arguments make viewers tune out

Report

Dashboard

BI Tool

Email

Presentation

Cloud App

End User

A good insight drives action that will generate value for stakeholders

Revisiting Rating Prediction Use Case

Model exposed to end users via cloud application allowing what-if scenario building
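
A what-if scenario can be sketched as re-scoring a fitted model with some business levers changed. The snippet below is a minimal illustration under stated assumptions: the coefficients in `MODEL_WEIGHTS`, the lever names, and the functions `predict_rating` and `what_if` are hypothetical and are not taken from the deck's application.

```python
from typing import Dict, Mapping

# Hypothetical coefficients of an already-fitted linear ratings model.
MODEL_WEIGHTS: Dict[str, float] = {
    "intercept": 5.0,
    "on_screen_argument": -1.0,
    "guest_star": 0.5,
}


def predict_rating(levers: Mapping[str, float]) -> float:
    """Score one scenario: intercept plus weight * value for each lever."""
    return MODEL_WEIGHTS["intercept"] + sum(
        MODEL_WEIGHTS[name] * value for name, value in levers.items()
    )


def what_if(baseline: Mapping[str, float], changes: Mapping[str, float]) -> Dict[str, float]:
    """Compare the baseline prediction with a scenario where some levers change."""
    scenario = {**baseline, **changes}
    unknown = set(scenario) - (set(MODEL_WEIGHTS) - {"intercept"})
    if unknown:
        raise KeyError(f"unknown levers: {sorted(unknown)}")
    base = predict_rating(baseline)
    new = predict_rating(scenario)
    return {"baseline": base, "scenario": new, "delta": new - base}


if __name__ == "__main__":
    current_episode = {"on_screen_argument": 1, "guest_star": 0}
    # What if the on-screen argument were cut from the episode?
    print(what_if(current_episode, {"on_screen_argument": 0}))
```

In a cloud application, a function like `what_if` would sit behind the user interface so that end users can adjust levers without touching the model itself.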

Characteristics Of Actionable Insights

Real-time

Scalable

Social

Relevant

Accessible

Open

Benefits Of Cloud Based Applications

Service failure or data loss at scale → Resilient, scale-out messaging and processing

Long innovation cycles → Agile development with cloud based data services

Poor experience at scale → Low-latency, in-memory computing

Open Source Analytics Ecosystem

Media companies benefit from algorithmic breadth and scalability for building and socializing data science models

MLlib

PL/X

Algorithms Visualization

Best of breed in-memory and in-database tools for an MPP platform

Example Scalable Open Source Platform

Hadoop++: Data Science modeling tools complement the Hadoop platform, e.g. SQL on Hadoop (such as HAWQ), Python/R interfaces to SQL, and Apache Spark.

http://opendataplatform.org/

Apps

Data

Analytics

Leading Media companies are moving towards a platform with Hadoop at the core.

Data Science Pipeline On Hadoop++

MLlib

PL/X

Data Lake

Hadoop++

Structured + Unstructured

Data

Open Source Framework For Ratings Prediction

Data Lake: Contains structured + unstructured data

Insights and Model Results

Ratings Predictions

Business Levers

What-if Scenario Application

Hosted on

MLlib

PL/X

Expanding The Framework To Include Impression Forecasting Modeling

Simulate Ad Server Behavior

Gather video ads impression stats

Message Broker

Ingest

Data Lake

Impression Forecasts

Business Levers

Business Metrics Dashboard

Hosted on

MLlib

PL/X
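
The flow on this slide (simulated ad server, message broker, data lake ingest, impression stats) can be sketched end to end with the standard library. This is only a stand-in: `queue.Queue` plays the role of the message broker and a Python list plays the role of the data lake, and all function names and event fields are hypothetical. The deck does not name the broker or the event schema.

```python
import queue
import random
from collections import Counter
from typing import Dict, Iterator, List, Sequence


def simulate_ad_server(n_events: int, ad_ids: Sequence[str], seed: int = 0) -> Iterator[Dict]:
    """Yield synthetic impression events (hypothetical schema)."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    for event_id in range(n_events):
        yield {
            "event_id": event_id,
            "ad_id": rng.choice(ad_ids),
            "hour": rng.randrange(24),
        }


def publish(broker: "queue.Queue", events: Iterator[Dict]) -> None:
    """Push every event onto the broker (here an in-process queue)."""
    for event in events:
        broker.put(event)


def ingest(broker: "queue.Queue", data_lake: List[Dict]) -> int:
    """Drain the broker into the data lake; return how many events were ingested."""
    ingested = 0
    while True:
        try:
            event = broker.get_nowait()
        except queue.Empty:
            break
        data_lake.append(event)
        ingested += 1
    return ingested


def impressions_per_ad(data_lake: List[Dict]) -> Counter:
    """Aggregate impression counts per ad, e.g. as input to a forecast or dashboard."""
    return Counter(event["ad_id"] for event in data_lake)


if __name__ == "__main__":
    broker: "queue.Queue" = queue.Queue()
    lake: List[Dict] = []
    publish(broker, simulate_ad_server(1000, ["ad_a", "ad_b", "ad_c"]))
    print("ingested:", ingest(broker, lake))
    print("impressions per ad:", dict(impressions_per_ad(lake)))
```

Decoupling the producer from ingestion through a broker is what lets each stage scale independently; the aggregation step is where a forecasting model would pick up.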

Measuring Audience Engagement: Workflow

Parallel Parsing of JSON (PL/Python)

Twitter Decahose (~55 million tweets/day)

Source: http Sink: hdfs

HDFS

External Tables

PXF

Nightly Cron Jobs

Topic Analysis through MADlib pLDA

Unsupervised Sentiment Analysis (PL/Python)

Hosted on
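
Two steps of this workflow, JSON parsing and unsupervised sentiment scoring, are the kind of logic that would run inside a PL/Python function, one row at a time in parallel across segments. The sketch below shows plain Python versions of both. It is an assumption-laden illustration: the deck does not say which sentiment technique was used, so a simple word-lexicon score is swapped in here, and the word lists, `parse_tweet`, and `sentiment_score` are hypothetical.

```python
import json
import re
from typing import Dict, Optional

# Tiny hypothetical lexicon; a real one would be far larger.
POSITIVE_WORDS = {"love", "great", "amazing"}
NEGATIVE_WORDS = {"hate", "boring", "awful"}


def parse_tweet(line: str) -> Optional[Dict]:
    """Parse one raw JSON line; return None for malformed or text-less records."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    if not isinstance(record, dict) or "text" not in record:
        return None
    return {"id": record.get("id"), "text": record["text"]}


def sentiment_score(text: str) -> int:
    """Lexicon score: positive word count minus negative word count."""
    tokens = re.findall(r"[a-z']+", text.lower())
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return positives - negatives


if __name__ == "__main__":
    raw_lines = [
        '{"id": 1, "text": "I love this show, the finale was amazing"}',
        '{"id": 2, "text": "So boring tonight"}',
        "not valid json",
    ]
    for line in raw_lines:
        tweet = parse_tweet(line)
        if tweet is not None:
            print(tweet["id"], sentiment_score(tweet["text"]))
```

Because each function depends only on its own input row, the same logic parallelizes naturally when wrapped as a database function over an external table.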

Key Takeaways

• Blended data sets lead to richer models and more valuable insights
• Turn Data Science models and insights into value-generating actions through data-driven applications
• Open source = power and flexibility
• Platform extensibility is key to supporting Data Science
• Turnkey PaaS is available through Cloud Foundry, including infrastructure monitoring, server configuration and scalability

THANK YOU!