Post on 19-Jul-2015
Open Source Framework for Deploying Data Science Models and
Cloud Based Applications
Pivotal Data Science Team
What happened?
What should I do about it? This is where Data Science comes in
What will happen next?
What Thought Leaders Have In Common
• Large amounts of structured and unstructured data
• Deep personal knowledge of their audience
• Quantified understanding of their products
• Data-driven culture
• User experience optimized by data science
Internal Data Sources: Viewership, Advertisements, Merchandise, Sales & Finance
Typical External Sources: Market Research & Competitive Information, Audience Demographics
Semi/Unstructured Data: Clickstream, Social Media, Content
Data Science Impact
Business Motivation
Increase Demand
Build Brand Equity
Increase Production Efficiency
Optimize Ad Spend Efficiency
Increase Customer Engagement
Data Science Opportunities
• Campaign Optimization
• Marketing Mix Models
• Customer segmentation
• Affinity analysis
• Social media analytics
• Supply/Demand forecasting
Increase Revenue
Reduce Cost
Example Use Case: Ratings Prediction
Use Case: Increase ratings across viewer demographics
How:
• Data: Viewership, transcripts and show data combined in a big data platform
• Model: Machine learning used to identify the impact of production decisions on viewership
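The modeling step can be illustrated with a minimal sketch. It assumes a single hypothetical production feature (minutes of on-screen arguments per episode) and made-up viewership figures, and it fits an ordinary least-squares line in plain Python. The deck does not say which machine learning method or features were actually used, so this is only one simple way to quantify the impact of a production decision.

```python
from typing import List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of the feature
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("feature has no variance")
    # Sum of cross-deviations between feature and target
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical, illustrative data (not taken from the deck)
argument_minutes = [0.0, 2.0, 4.0, 6.0, 8.0]
viewers_millions = [5.0, 4.6, 4.2, 3.8, 3.4]

intercept, slope = fit_line(argument_minutes, viewers_millions)
print(f"viewers ~ {intercept:.2f} + ({slope:.2f}) * argument_minutes")
```

A negative slope in this toy data corresponds to the kind of finding a data scientist would then interpret, such as a production choice that makes viewers tune out.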
Insights
Models Insights Actions
Models are built to answer business questions, e.g. what makes viewers tune in and tune out?
Data Scientists interpret models for answers, e.g. on-screen arguments make viewers tune out
Report
Dashboard
BI Tool
Presentation
Cloud App
End User
A good insight drives action that will generate value for stakeholders
Revisiting Rating Prediction Use Case
Model exposed to end users via cloud application allowing what-if scenario building
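A what-if scenario application of this kind can be sketched as a function that takes business-lever settings and returns the model's prediction. The lever names, baseline, and coefficients below are hypothetical placeholders, not values from the deck, and a linear model is assumed purely for illustration.

```python
from typing import Dict

# Hypothetical parameters of an already-fitted linear model
BASELINE_RATING = 5.0
LEVER_EFFECTS: Dict[str, float] = {
    "argument_minutes": -0.2,   # assumed effect per minute of on-screen arguments
    "guest_appearances": 0.3,   # assumed effect per guest appearance
}


def predict_rating(levers: Dict[str, float]) -> float:
    """Predict a rating from lever settings; unspecified levers count as zero."""
    unknown = set(levers) - set(LEVER_EFFECTS)
    if unknown:
        raise KeyError(f"unknown levers: {sorted(unknown)}")
    return BASELINE_RATING + sum(
        LEVER_EFFECTS[name] * value for name, value in levers.items()
    )


def what_if(current: Dict[str, float], proposed: Dict[str, float]) -> float:
    """Return the predicted change in rating when moving from current to proposed."""
    return predict_rating(proposed) - predict_rating(current)


change = what_if({"argument_minutes": 5.0}, {"argument_minutes": 2.0})
print(f"predicted rating change: {change:+.2f}")
```

In a deployed application the same function would sit behind a web endpoint, so that end users can adjust levers and see the predicted effect without touching the model code.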
Characteristics Of Actionable Insights
Real-time
Scalable
Social
Relevant
Accessible
Open
Benefits Of Cloud Based Applications
Service failure or data loss at scale → Resilient, scale-out messaging and processing
Long innovation cycles → Agile development with cloud based data services
Poor experience at scale → Low-latency, in-memory computing
Open Source Analytics Ecosystem
Media companies benefit from algorithmic breadth and scalability for building and socializing data science models
MLlib
PL/X
Algorithms Visualization
Best of breed in-memory and in-database tools for an MPP platform
Example Scalable Open Source Platform
Hadoop++: Data Science modeling tools complement the Hadoop platform: SQL on Hadoop (e.g. HAWQ), Python/R interfaces to SQL, Apache Spark, etc.
http://opendataplatform.org/
Apps
Data
Analytics
Leading media companies are moving towards a platform with Hadoop at the core.
Data Science Pipeline On Hadoop++
MLlib
PL/X
Data Lake
Hadoop++
Structured + Unstructured
Data
Open Source Framework For Ratings Prediction
Data Lake: contains structured + unstructured data
Insights and Model Results
Ratings Predictions
Business Levers
What-if Scenario Application
Hosted on
MLlib
PL/X
Expanding The Framework To Include Impression Forecasting Modeling
Gather video ads impression stats
Data Lake
Ingest
Message Broker
Simulate Ad Server Behavior
Impression Forecasts
Business Levers
Hosted on
Business Metrics Dashboard
MLlib
PL/X
Measuring Audience Engagement: Workflow
Parallel Parsing of JSON (PL/Python)
Twitter Decahose (~55 million tweets/day)
Source: http Sink: hdfs
HDFS
External Tables
PXF
Nightly Cron Jobs
Topic Analysis through MADlib pLDA
Unsupervised Sentiment Analysis (PL/Python)
Hosted on
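Two of the workflow steps, JSON parsing and unsupervised sentiment scoring, can be sketched in plain Python. In the workflow above this logic runs as PL/Python inside the database; the sketch below runs standalone instead. The tiny word lists, the sample records, and the lexicon-based scoring rule are illustrative assumptions, since the deck does not describe the actual sentiment method.

```python
import json
import re
from typing import Iterable, Iterator, Optional, Tuple

# Hypothetical miniature sentiment lexicon (a real one would be far larger)
POSITIVE = {"love", "great", "awesome"}
NEGATIVE = {"hate", "boring", "awful"}


def parse_tweet_text(line: str) -> Optional[str]:
    """Return the 'text' field of one JSON-encoded tweet, or None if unusable."""
    try:
        record = json.loads(line)
    except json.JSONDecodeError:
        return None
    text = record.get("text") if isinstance(record, dict) else None
    return text if isinstance(text, str) else None


def sentiment_score(text: str) -> int:
    """Count positive words minus negative words (no labeled training data needed)."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)


def score_stream(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Parse each line and yield (text, score), skipping malformed records."""
    for line in lines:
        text = parse_tweet_text(line)
        if text is not None:
            yield text, sentiment_score(text)


# Made-up sample records for illustration only
sample_lines = [
    '{"id": 1, "text": "I love this show, great finale!"}',
    '{"id": 2, "text": "So boring tonight"}',
    'not valid json',
]
results = list(score_stream(sample_lines))
for text, score in results:
    print(score, text)
```

Because each record is parsed and scored independently, the same functions can be applied row by row in parallel, which is what makes this step a good fit for an MPP database.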
Key Takeaways
• Blended data sets lead to richer models and more valuable insights
• Turn Data Science models and insights into value generating actions through data driven applications.
• Open source = power and flexibility
• Platform extensibility is key to supporting Data Science
• Turnkey PaaS is available through CloudFoundry, including infrastructure monitoring, server configuration and scalability.
THANK YOU!