Driving Real Insights Through Data Science
Transcript of Driving Real Insights Through Data Science
![Page 1: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/1.jpg)
The Journey to Becoming a Data-Driven Enterprise
Pivotal Big Data Roadshow 2015
![Page 2: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/2.jpg)
© Copyright 2015 Pivotal. All rights reserved.
Where we’re going today…
3 Great Keynotes
• Journey to a Data-driven Enterprise
• Data Science Use Cases
• Streaming Data and Predictive Analytics
Stock Inference Demo and Architecture Overview
Intensive hands-on training sessions
![Page 3: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/3.jpg)
Today’s Agenda
12:00 PM – 12:45 PM – Check-In & Lunch
12:45 PM – 1:00 PM – Welcome and Agenda Review
1:00 PM – 1:20 PM – How Pivotal’s Tools Help Drive Value from Data Science
1:20 PM – 2:20 PM – Accelerating the Generation of New Insights – R&D Use Case Review and Demo
2:20 PM – 2:30 PM – Coffee Break
2:30 PM – 3:00 PM – Manufacturing Use Case Review and Demo
3:00 PM – 3:15 PM – Closing Remarks
![Page 4: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/4.jpg)
“MASHING BIG DATA WITH BIG MACHINES IS ‘BEAUTIFUL, DESIRABLE, INVESTABLE’ - IT COULD TRANSFORM GE'S BUSINESS - AND THE ECONOMY.”
JEFF IMMELT, CEO, GE
![Page 5: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/5.jpg)
THE POWER OF 1
Driving Outcomes That Matter
One Percent Improvement Equals:
• Rail (Increasing Freight Utilization): $27B Industry Value by Reducing System Inefficiency
• Healthcare (Predictive Maintenance): $63B Industry Value by Reducing Process Inefficiency
• Power (Predictive Diagnostics): $66B Industry Value with Efficiency Improvements in Gas-fired Power Plant Fleets
Source: General Electric
![Page 6: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/6.jpg)
DATA-DRIVEN ENTERPRISE JOURNEY
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advanced Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, and Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Microservices
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
![Page 7: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/7.jpg)
LARGE ENTERPRISE BIG DATA TROUBLE
80% of CEOs think data mining and analysis are strategically important (1)
But…
4% of companies use analytics effectively (2)
0% of CIOs think their IT infrastructure is fully prepared for big data (3)
30% of companies have deployed advanced analytics, 11% big data analysis (4)
44% of new applications failed to meet performance expectations (5)
90% of companies allocate at least 2X more cloud capacity than needed to ensure performance (6)
(1) 2015 PwC CEO Survey; (2) 2013 Bain & Company - The Value of Big Data; (3) 2014 IT Infrastructure Conversation - IBM; (4) Ernst & Young - 2014 Enterprise IT Trends and Investments; (5) 2014 Riverbed Technology - The Transformers; (6) 2014 ElasticHosts CIO Study
![Page 8: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/8.jpg)
THE DATA DIVIDE
BIG DATA CHASM
• 70% of data generated by customers
• 80% of data stored
• 3% prepared for analysis
• 0.5% being analyzed
• <0.5% being operationalized
![Page 9: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/9.jpg)
Software Is Eating The World
Data Is Fueling Software
SOFTWARE IS EATING THE WORLD
![Page 10: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/10.jpg)
“WE CHOSE PIVOTAL BECAUSE WE BELIEVE IT PROVIDES A 360-DEGREE VIEW OF THE PROCESS. FROM A DATA SCIENCE AND DATA TECHNOLOGY PERSPECTIVE, IT MEANS DELIVERING BEST-IN-CLASS DATA TECHNOLOGIES AND ENABLING THEM ON THEIR PLATFORM.”
![Page 11: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/11.jpg)
ACROSS INDUSTRIES
![Page 12: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/12.jpg)
THE NEW DATA IMPERATIVES
Converged Data & Cloud
Open
Data-Driven Apps
![Page 13: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/13.jpg)
THE BIG DATA PROBLEM
Fragmentation, Constraints, Complexity
![Page 14: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/14.jpg)
GUIDING PRINCIPLES IN THE NEW ERA
OPEN
• Remove Lock-in
• Leverage Ecosystem
• Co-innovate
AGILE
• Shorten innovation cycles
• Reduce TCO
• Improve TTM
CLOUD-READY
• Solve business problems
• Avoid lock-in
• Appropriate security
![Page 15: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/15.jpg)
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and automate at scale
Perform advanced analytics
Discover insights
Modernize data infrastructure
![Page 16: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/16.jpg)
Deploy analytic apps and automate at scale
Perform advanced analytics
Discover insights
Modernize data infrastructure
DATA-DRIVEN COMPANIES:
USE MODERN DATA INFRASTRUCTURE
![Page 17: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/17.jpg)
MODERNIZE DATA INFRASTRUCTURE
REQUIREMENTS
• Elastic, scale-out storage and processing
• Flexible data types and pipelining
• Cloud friendly and open-source based
BENEFITS
• ETL on demand: low operational cost
• Expanded use cases
• Higher quality analytics
• Lowered storage/processing cost
• Less fragmented ecosystem
• Reduced vendor lock-in
![Page 18: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/18.jpg)
Modernize data infrastructure
Deploy analytic apps and automate at scale
Perform advanced analytics
Discover insights
DATA-DRIVEN COMPANIES:
STRATEGICALLY USE ADVANCED ANALYTICS
![Page 19: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/19.jpg)
ADVANCED ANALYTICS
REQUIREMENTS
• Machine learning and advanced analytics
• SQL-compliant batch and interactive queries
• Massive stream processing
BENEFITS
• Leverage existing skills and tools
• Rapid time to insights
• Internet of Things use cases
• Solve business problems
• Predictive insights: proactive execution
![Page 20: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/20.jpg)
Modernize data infrastructure
Perform advanced analytics
Discover insights
DATA-DRIVEN COMPANIES:
INNOVATE AT SCALE
Deploy analytic apps and automate at scale
![Page 21: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/21.jpg)
ANALYTIC APPS AND AUTOMATION AT SCALE
REQUIREMENTS
• Low-latency, distributed in-memory transactions
• Resilient, scale-out messaging and object storage
• Agile analytic app-dev with enterprise PaaS
BENEFITS
• Reduced time to action
• Low ‘analytics app-dev’ integration cost
• Reduced time to insights
• Flexible ingestion: low operating cost
• High performance: low operating cost
• Transactional safety: business critical ops
![Page 22: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/22.jpg)
JOURNEY TO A DATA-DRIVEN ENTERPRISE
Deploy analytic apps and automate at scale
Perform advanced analytics
Discover insights
Modernize data infrastructure
Pivotal Data Science helps you move from BI to Data Science
Pivotal Labs helps you move to agile development of apps at scale
Pivotal Data Engineering helps you move from data administration to data engineering
![Page 23: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/23.jpg)
PIVOTAL BIG DATA SUITE
![Page 24: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/24.jpg)
WORLD’S FIRST OPEN SOURCED BIG DATA PORTFOLIO
BUILDING ON SUCCESS OF CLOUD FOUNDRY FOUNDATION
BUILT FOR ENTERPRISES
Open sourcing all Pivotal Big Data Suite components including:
• Pivotal GemFire (Apache Geode)
• Pivotal HDB (Apache HAWQ)
• Pivotal Greenplum Database
![Page 25: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/25.jpg)
BUILT FOR ENTERPRISES
Value added features: enterprise grade performance + robustness without lock-in
• Advanced Query Optimization in analytics
• WAN replication and continuous query in transactional processing
Flexible Deployment models: align to business objectives and needs
• Balance cost objectives with policy and compliance requirements
• Leverage Pivotal’s pre-integration + certification on supported configurations
Enterprise grade support: one throat to choke for the suite
• Focus on business problems – not on lifecycle management
• Expert support on Big Data Suite means reduced business risk
![Page 26: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/26.jpg)
• Common core for Hadoop ecosystem
• Rapidly accelerated certifications, ecosystem development and enterprise-grade quality
OpenDataPlatform.org
OPEN
![Page 27: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/27.jpg)
AGILE
Deploy analytic apps and automate at scale
Perform advanced analytics
Discover insights
Modernize data infrastructure
Spring XD Spark Pivotal HD & Open Data Platform
Pivotal Greenplum Database
Pivotal HDB Rabbit MQ
Redis
Pivotal GemFire Pivotal BDS on PCF
Pivotal Cloud Foundry
![Page 28: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/28.jpg)
CLOUD-READY
COMMODITY HARDWARE, APPLIANCE, HYBRID CLOUD, CLOUD
IaaS IaaS
PAAS
![Page 29: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/29.jpg)
DATA-DRIVEN ENTERPRISE JOURNEY WITH PIVOTAL BIG DATA SUITE
STORE
• Structured
• Unstructured
• High Volume
• High Velocity
ANALYZE
• Predictive Analytics
• Machine Learning
• Advanced Data Science
• Realtime Analytics
DEVELOP
• Advanced Analytic Pipelines
• Realtime Analytical Applications
• Global Scale Data-Driven Applications
• Enterprise, Consumer, IoT, and Mobile
INNOVATE
• Agile Dev Expertise
• DevOps
• Microservices
• Continuous Delivery
• Closed Loop Applications
AGILE DEVELOPMENT
BIG DATA PREDICTIVE ANALYTICS
CLOUD NATIVE PLATFORM
Spring XD
Spark
Pivotal HD & Open Data Platform
Spring XD
Pivotal Greenplum Database
Pivotal HDB
Spring XD
Pivotal GemFire
Rabbit MQ
Spring Cloud
Pivotal BDS on PCF
Pivotal Cloud Foundry
Pivotal Labs, Data Science, Data Engineering
![Page 30: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/30.jpg)
FOR FURTHER INFO, CHECK OUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
Or reach out to your local Pivotal Account Executive…
![Page 31: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/31.jpg)
Pivotal Data Science Overview and Use Cases
Pivotal Big Data Roadshow
![Page 32: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/32.jpg)
DATA SCIENCE?
App Development, Analytics, Business Intelligence, Reporting, Visualization, Dashboards, Insights, Big Data, Machine Learning, Statistics, Mathematics, Time Series, Algorithms, Databases, Software, Modeling, Queries, Real-Time, Sensors, Predictive Models, ETL, Research, Hadoop, Distributed Computing, MapReduce, SQL, In-Memory, OLAP, Text Mining, Unstructured Data, Open Source, Decision Science, Ad Hoc Queries, Hacking, In-Database Analytics, Internet of Things, Data Cleansing, Sentiment
![Page 33: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/33.jpg)
Data Related
• ETL
• Unstructured
• Data Cleansing
• Sensors
Fields of Study & Techniques
• Algorithms
• Mathematics
• Statistics
• Econometrics
• Predictive Modeling
• Machine Learning
• Text Mining
• Sentiment
• Map Reduce
Business Intelligence
• Dashboards
• Insights
• Visualization
• Ad Hoc Queries
• Reporting
Implementation
• Software
• In-Database Analysis
• Distributed Computing
• Hadoop
• Open Source
Industry Buzzwords
• Big Data
• Decision Science
• Internet of Things
• Real-Time
• Hacking
• In-Memory
![Page 34: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/34.jpg)
What is Data Science?
The use of statistical and machine learning techniques on big multi-structured data in a distributed computing environment to identify correlations and causal relationships, classify and predict events, identify patterns and anomalies, and infer probabilities, interest, and sentiment.
DRIVE AUTOMATED, LOW-LATENCY ACTIONS IN RESPONSE TO EVENTS OF INTEREST
![Page 35: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/35.jpg)
Billions of Data Points
• Gene Sequencing: cost to sequence one genome has fallen from $100M in 2001 to $10K in 2011 to $1K in 2014
• Smart Grids: reading smart meters every 15 minutes is 3000X more data intensive
• Social Media: Facebook uploads 250 million photos each day
• Oil Exploration: oil rigs generate 25,000 data points per second
• Stock Market
• Video Surveillance
• Medical Imaging
• Mobile Sensors
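The "3000X" smart-meter figure can be sanity-checked with simple arithmetic. The sketch below assumes the baseline is one manual reading per month; the slide does not state its baseline, so treat that as an assumption.

```python
# Assumption (not stated on the slide): the baseline is one meter reading per month.
MINUTES_PER_DAY = 24 * 60
INTERVAL_MINUTES = 15
DAYS_PER_MONTH = 365 / 12  # average month length

# Readings produced by one meter at a 15-minute interval.
readings_per_day = MINUTES_PER_DAY // INTERVAL_MINUTES
readings_per_month = readings_per_day * DAYS_PER_MONTH

print(readings_per_day)           # 96
print(round(readings_per_month))  # 2920
```

Under that assumption, a single meter produces roughly 2,900 readings where it used to produce one, which is in line with the slide's rounded figure.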
![Page 36: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/36.jpg)
What is Big Data Analytics?
• Descriptive Analytics: WHAT HAPPENED?
• Diagnostic Analytics: WHY DID IT HAPPEN?
• Predictive Analytics: WHAT WILL HAPPEN?
• Prescriptive Analytics: HOW CAN WE MAKE IT HAPPEN?
[Figure labels: Hindsight, Insight, Foresight; BI, Data Science. Axes: Complexity, Value of Analytics ($)]
![Page 37: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/37.jpg)
P L A T F O R M
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
![Page 38: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/38.jpg)
Scalable, In-Database ML
• Open Source: https://github.com/madlib/madlib
• Works on Greenplum DB, HAWQ and PostgreSQL
• In active development by Pivotal
• Downloads and Docs: http://madlib.net/
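One common way in-database machine learning scales is by having each data segment compute a small summary of its local rows, which a coordinator then merges. The slides do not describe the library's internals, so the following is only an illustrative sketch of that general idea for simple linear regression; the `Partial` class, the `fit` helper, and the sample data are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Partial:
    """Sufficient statistics for simple linear regression y = a + b*x."""
    n: int = 0
    sx: float = 0.0
    sy: float = 0.0
    sxx: float = 0.0
    sxy: float = 0.0

    def add(self, x: float, y: float) -> None:
        """Fold one row into the running sums."""
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def merge(self, other: "Partial") -> "Partial":
        """Combine two summaries; the result equals a summary of all rows."""
        return Partial(self.n + other.n, self.sx + other.sx, self.sy + other.sy,
                       self.sxx + other.sxx, self.sxy + other.sxy)


def fit(p: Partial) -> Tuple[float, float]:
    """Return (intercept, slope) from merged statistics via least squares."""
    slope = (p.n * p.sxy - p.sx * p.sy) / (p.n * p.sxx - p.sx ** 2)
    intercept = (p.sy - slope * p.sx) / p.n
    return intercept, slope


# Hypothetical data split across two segments; rows follow y = 1 + 2x exactly.
segments = [[(0.0, 1.0), (1.0, 3.0)], [(2.0, 5.0), (3.0, 7.0)]]

partials = []
for rows in segments:      # each segment scans only its local rows
    local = Partial()
    for x, y in rows:
        local.add(x, y)
    partials.append(local)

total = Partial()
for local in partials:     # the coordinator merges the small summaries
    total = total.merge(local)

intercept, slope = fit(total)
print(intercept, slope)
```

Only five numbers per segment cross the network, regardless of how many rows each segment holds, which is why this pattern suits MPP databases.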
![Page 39: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/39.jpg)
Predictive Analytics Library
Functions (Aug 2015)
Supervised Learning
Regression Models
• Cox Proportional Hazards Regression
• Elastic Net Regularization
• Generalized Linear Models
• Linear Regression
• Logistic Regression
• Marginal Effects
• Multinomial Regression
• Ordinal Regression
• Robust Variance, Clustered Variance
• Support Vector Machines
Tree Methods
• Decision Tree
• Random Forest
Other Methods
• Conditional Random Field
• Naïve Bayes
Unsupervised Learning
• Association Rules (Apriori)
• Clustering (K-means)
• Topic Modeling (LDA)
Statistics
Descriptive
• Cardinality Estimators
• Correlation
• Summary
Inferential
• Hypothesis Tests
Other Statistics
• Probability Functions
Time Series
• ARIMA
Data Types and Transformations
• Array Operations
• Dimensionality Reduction (PCA)
• Encoding Categorical Variables
• Matrix Operations
• Matrix Factorization (SVD, Low Rank)
• Norms and Distance Functions
• Sparse Vectors
Model Evaluation
• Cross Validation
Other Modules
• Conjugate Gradient
• Linear Solvers
• PMML Export
• Random Sampling
• Term Frequency for Text
![Page 40: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/40.jpg)
Analytics with Pivotal
A single address for everything analytics
Time-to-Insights
FORECASTING CLUSTERING
REGRESSION
CLASSIFICATION
OPTIMIZATION
![Page 41: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/41.jpg)
Smart Systems = Sensors + Digital Brain + Actuators
Problem Formulation
Modeling Step
Data Step
Application Step
Data Science for Building Models
Sensors & Actuators
Data Lake
![Page 42: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/42.jpg)
Data Science Use Cases
![Page 43: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/43.jpg)
Financial Services
![Page 44: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/44.jpg)
Identifying and Pricing Cross-Sell Opportunities
CUSTOMER
A global financial services provider
BUSINESS PROBLEM
Identify cross-sell opportunities between two business arms of a financial institution.
CHALLENGES
Integration of large-scale data originating from multiple data warehouses. Developing predictive models to identify novel cross-sell opportunities within the financial institution. Evaluate the identified cross-sell opportunities by their revenue potential.
SOLUTIONS
Fast integration of data in Pivotal Greenplum Database.
Predictive models and evaluation of profitability:
– Association rule.
– Logistic regression for each product offered.
– Estimation of revenue opportunity.
On-demand reporting and visualization via custom dashboards connected to in-database models.
Identified multi-million dollar opportunities for the bank.
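The association-rule step in this use case can be illustrated with the standard support, confidence, and lift measures. This is a minimal sketch on made-up product holdings; the product names, the data, and the `rule_stats` helper are hypothetical and are not taken from the engagement described above.

```python
from collections import Counter
from itertools import combinations

# Hypothetical product holdings per customer (illustrative data only).
holdings = [
    {"checking", "mortgage"},
    {"checking", "mortgage", "brokerage"},
    {"checking", "brokerage"},
    {"checking"},
    {"mortgage", "brokerage"},
]

# Count how often each product, and each pair of products, is held.
item_counts = Counter()
pair_counts = Counter()
for basket in holdings:
    item_counts.update(basket)
    pair_counts.update(frozenset(p) for p in combinations(sorted(basket), 2))


def rule_stats(antecedent, consequent):
    """Support, confidence and lift of the rule antecedent -> consequent."""
    n = len(holdings)
    both = pair_counts[frozenset((antecedent, consequent))]
    support = both / n                              # share of customers holding both
    confidence = both / item_counts[antecedent]     # P(consequent | antecedent)
    lift = confidence / (item_counts[consequent] / n)
    return support, confidence, lift


support, confidence, lift = rule_stats("mortgage", "brokerage")
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

A lift above 1 suggests that customers holding the first product hold the second more often than the customer base as a whole, which is the kind of signal a cross-sell model would then weigh against revenue potential.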
![Page 45: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/45.jpg)
Financial Compliance
BUSINESS PROBLEM
Ensure compliance with Dodd-Frank and Basel Committee regulations
Identify underlying risk and fraud while reducing the compliance department’s overburdened
DATA SOURCES
Emails, Chats, Trades, Transactions, Policy, Securities, Phone Calls, Watch Lists, …
FINANCIAL COMPLIANCE DATA LAKE
Analytics stages: Data integration, Data clean up, Modeling, Classification and ranking, Analyst user interfaces, Feedback
• Data integration: e.g., append trade information with email and chat communications
• Data cleanup: e.g., identify newsletters and spam emails
• Modeling: predictive modeling to flag messages and trades; graph and cohort analysis
• Analyst feedback: reviewed fraud instances included in periodic model refreshes
SOLUTION
• A data lake platform coupled with cutting edge data science techniques
• Flexible user interface to promote an adaptive, continuously learning compliance framework
![Page 46: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/46.jpg)
Telco & Mobile
![Page 47: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/47.jpg)
Subscriber Micro-Segmentation
CUSTOMER
A major telco with cable & VOD, internet, and phone business units
BUSINESS PROBLEM
Better understand aggregated subscriber behavior to drive business strategy using newly available data sources
CHALLENGES
▪ Large quantities of deep packet inspection data and set top box data that had not been analyzed before
▪ Needed to incorporate internet usage and TV consumption information into pre-existing subscriber segments
SOLUTION
▪ Generated new subscriber segments that incorporated features based on consumption of TV and internet services across a variety of devices
▪ Crossed new segments with existing segments to generate new micro-segments for cross-sell/upsell and new product development opportunities
Customized Micro-Segments
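Crossing new behavior-based segments with existing segments amounts to grouping subscribers by the pair of labels they carry. The sketch below shows that idea on made-up records; the subscriber IDs and the existing segment names ("Family", "Young Professional") are hypothetical.

```python
from collections import Counter

# Hypothetical subscribers: (id, existing segment, behavior-based segment).
subscribers = [
    ("s1", "Family", "VOD Heavy"),
    ("s2", "Family", "TV Heavy"),
    ("s3", "Young Professional", "iPhone Heavy"),
    ("s4", "Family", "VOD Heavy"),
    ("s5", "Young Professional", "VOD Heavy"),
]

# Each distinct (existing, behavior) pair is one micro-segment.
micro_segments = Counter(
    (existing, behavior) for _, existing, behavior in subscribers
)

for (existing, behavior), size in sorted(micro_segments.items()):
    print(f"{existing} x {behavior}: {size}")
```

In practice the counts would come from a group-by over the full subscriber table, and very small micro-segments would likely be merged or dropped before being used for cross-sell or upsell targeting.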
![Page 48: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/48.jpg)
Newly Identified Behavior-Based Segments
[Chart axis: Subscribers]
Moderates
OTT & Data Heavyweights
Portable OTT Entertainment Seekers
iPhone Heavy
Android Heavy
iPad Heavy
In-Home OTT Entertainment Seekers
In-Home Native Content Seekers
VOD Heavy
TV Heavy
![Page 49: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/49.jpg)
Opportunities for Data-Driven Decisions in Pharma
![Page 50: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/50.jpg)
Data driven drugs: From discovery to delivery
Drug discovery + development, Clinical Trials, Manufacturing, Marketing, Distribution and surveillance
RICH DATA SOURCES
• Molecular data
• Cellular drug screens
• Animal models
• Clinical data including notes, images, markers (e.g. genomics, lab results)
• Sensor and assay data
• Internal and partner/purchased external data
• Contact center data
• Patient registries, public and federal data, clinical partnerships
![Page 51: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/51.jpg)
Internet of Things in Manufacturing
A pipeline of sensors and opportunities for optimizing output
Input materials, Mix, Incubate, Filter, Centrifuge, Final Product
[Figure: sensor charts along the pipeline. Recoverable labels: Sensors, High-Content Screens, Automated raw materials mixing; axis names: Temp, Time, Absorbance, Elution volume, Velocity]
![Page 52: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/52.jpg)
Vaccine Potency Prediction
CUSTOMER
A major pharmaceutical company
BUSINESS PROBLEM
Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.
CHALLENGES
• Customer’s data model was not optimal for running analytical queries
• Manual data quality issues
• Data capture was performed with varying consistency due to high cost associated with manual data collection
SOLUTION
• Introduced a new data model to make data accessible and enable analytics (including LIMS and DeltaV)
• Built automated outlier detection/correction methods to address manual data entry quality issues
• Devised imputation methods to deal with data completeness issues
• Built predictive models with high accuracy
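The slide does not say which outlier or imputation methods were used. As one plausible illustration, the sketch below flags outliers with a median-absolute-deviation (modified z-score) rule and fills both outliers and missing values with the median. The `clean_series` helper, the 3.5 threshold, and the sample readings are assumptions for illustration only.

```python
from statistics import median


def clean_series(values, threshold=3.5):
    """Replace missing values (None) and MAD-based outliers with the median."""
    observed = [v for v in values if v is not None]
    med = median(observed)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(v - med) for v in observed)
    cleaned = []
    for v in values:
        if v is None:
            cleaned.append(med)    # imputation of a missing entry
        elif mad > 0 and 0.6745 * abs(v - med) / mad > threshold:
            cleaned.append(med)    # correction of a suspected entry error
        else:
            cleaned.append(v)
    return cleaned


# Hypothetical manually entered readings: one gap and one misplaced decimal point.
readings = [7.1, 7.0, None, 7.2, 71.0, 6.9]
cleaned = clean_series(readings)
print(cleaned)
```

A median-based rule is used here because a single bad manual entry can distort a mean and standard deviation enough to hide itself; a production pipeline would more likely impute from related process variables than from a single global median.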
![Page 53: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/53.jpg)
Check out the Pivotal Data Science Blog!
http://blog.pivotal.io/data-science-pivotal
![Page 54: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/54.jpg)
FOR FURTHER INFO…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
![Page 55: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/55.jpg)
Pivotal Data Science Labs: Packaged Services
• Analytics Roadmap
• Prioritized Opportunities
• Architectural Recommendations
• Hands-on training
• Hosted data on Pivotal Data stack
• Results review & assessment
• On-site MPP analytics training
• Analytics tool-kit
• Quick insight (2 weeks)
• Prof. services
• Data science model building
• Ready-to-deploy model(s)
• Prof. services
• Data science model building
• Ready-to-deploy model(s)
LAB PRIMER (2-Week Roadmapping)
LAB 600 (6-Week Lab)
LAB 1200 (12-Week Lab)
LAB 100 (Analytics Bundle)
DATA JAM (Internal DS Contest)
![Page 56: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/56.jpg)
Data Streaming and Predictive Analytics
Using Pivotal Big Data Suite
![Page 57: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/57.jpg)
Converging Trends
• Big Data
• IoT, Mobile Apps, Social Media
• Data Science and Machine Learning
Innovation
• New Data
• New Processes
• New Insights
The Journey to the Data-Driven Enterprise
![Page 58: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/58.jpg)
Migrating from a Reactive, Static and Constrained Model…
HDFS Data Lake
Ingest, Store, Analytics
• Hard to change
• Labor intensive
• Inefficient
• Coding based
• No real-time information
• Based on expensive ETL
![Page 59: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/59.jpg)
To Pro-Active, Self-Improving, Machine Learning Systems
Data Stream Pipeline
• Multiple Data Sources
• Real-Time Processing
• Store Everything
HDFS Data Lake
Expert System / Machine Learning
In-Memory Real-Time Data
• Continuous Learning
• Continuous Improvement
• Continuous Adapting
![Page 60: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/60.jpg)
“50-80% OF THE TIME ON DATA SCIENCE PROJECTS IS SPENT ON DATA WRANGLING”
New York Times Research: http://www.nytimes.com/2014/08/18/technology/for-big-data-scientists-hurdle-to-insights-is-janitor-work.html
![Page 61: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/61.jpg)
Still…
Data Feeds
Stream Processing
Expert Systems
Machine Learning
Historical Data
HDFS Data Lake
Business Value
Smart Decisions
![Page 62: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/62.jpg)
Data Stream Needs an Agile, Scalable and Fast Solution
Spring XD: Ingest, Transform, Sink
GemFire
HAWQ, GPDB
Data Lake
![Page 63: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/63.jpg)
Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining
Spring XD: Ingest, Transform, Sink
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
Distributed Computing
In-Memory Real-Time Data
Expert System / Machine Learning
HAWQ, GPDB
Data Lake
![Page 64: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/64.jpg)
Spring XD: State of the Art Data Pipeline Automation
INGEST / SINK
• No coding required
• Dozens of built-in connectors
• Seamless integration with Kafka, Sqoop
• Create new connectors easily using Spring
PROCESS
• Call Spark, Reactor or RxJava
• Built-in configurable filtering, splitting and transformation
• Out-of-box configurable jobs for batch processing
ANALYZE
• Import and invoke PMML jobs easily
• Call Python, R, MADlib and other tools
• Built-in configurable counters and gauges
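The ingest, process and sink stages listed for Spring XD can be pictured as a chain of small steps. The sketch below is only an analogy written in plain Python generators; it is not Spring XD's API or stream DSL, and the feed format, function names and the per-symbol counter are assumptions made for illustration.

```python
from collections import Counter

def ingest(lines):
    """Source stage: read raw lines from a feed."""
    for line in lines:
        yield line.strip()

def keep_valid(records):
    """Filter stage: drop blank lines and comment/heartbeat lines."""
    for rec in records:
        if rec and not rec.startswith("#"):
            yield rec

def split_fields(records):
    """Transform stage: split 'SYMBOL,PRICE' text into a structured record."""
    for rec in records:
        symbol, price = rec.split(",")
        yield {"symbol": symbol, "price": float(price)}

def sink(records, store, counter):
    """Sink stage: store every record and update a simple per-symbol counter."""
    for rec in records:
        store.append(rec)
        counter[rec["symbol"]] += 1

# Hypothetical stock feed, loosely modelled on the stock demo in this deck.
feed = ["AAPL,119.5\n", "# heartbeat\n", "\n", "GE,29.1\n", "AAPL,120.0\n"]
store, counter = [], Counter()
sink(split_fields(keep_valid(ingest(feed))), store, counter)
```

Each stage only knows about the records flowing through it, which is the property that lets a pipeline tool rewire sources, processors and sinks without changing the stages themselves.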
![Page 65: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/65.jpg)
GemFire Provides Scalable, Low-Latency Data Access, Storage and Event Processing
Spring XD: Ingest, Transform, Sink
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
Distributed Computing
GemFire
Expert System / Machine Learning
HAWQ, GPDB
Data Lake
![Page 66: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/66.jpg)
GemFire
In-memory Real Time Data
• In-Memory Enterprise Data Grid
• Horizontally Scalable, Consistent, Highly Available
• Event handling
• Continuous Queries
• Enterprise Data Geo Distribution
![Page 67: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/67.jpg)
Pivotal Provides SQL Based Advanced Analytics
Spring XD: Ingest, Transform, Sink
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
Distributed Computing
GemFire
Expert System / Machine Learning
HAWQ, GPDB
Data Lake
![Page 68: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/68.jpg)
HAWQ
Advanced SQL analytics in Hadoop
• Massively Parallel Processing RDBMS on Hadoop
• ANSI SQL on Hadoop
• Extremely high performance for analytics (not like Hive)
• Stores all data directly on HDFS
• Open-Source
Combining SQL with Hadoop is key for analytics
SQL remains #1 choice for Data Science
![Page 69: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/69.jpg)
Developers and Data Scientists Can Focus on the Business Value of Data
Spring XD: Ingest, Transform, Sink
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
GemFire
HAWQ, GPDB
Data Lake
![Page 70: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/70.jpg)
Data Streaming Reference Architecture
Data Feeds, Transactional Apps, Analytic Apps
Data Stream Pipeline
Distributed Computing
Real-Time Data
Expert Systems & Machine Learning
Advanced Analytics
HDFS Data Lake
![Page 71: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/71.jpg)
Data Streaming Reference Architecture
Data Feeds, Transactional Apps, Analytic Apps
Data Stream Pipeline
Spring XD
GemFire, HAWQ, GPDB
HDFS Data Lake
![Page 72: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/72.jpg)
“SO WE ARE MOVING TO A WORLD WHERE THE MACHINES WE WORK WITH ARE NOT JUST INTELLIGENT; THEY ARE BRILLIANT. THEY ARE SELF-AWARE, THEY ARE PREDICTIVE, REACTIVE AND SOCIAL. IT'S A WORLD WHERE INFORMATION ITSELF BECOMES INTELLIGENT AND COMES TO US AUTOMATICALLY WHEN WE NEED IT WITHOUT HAVING TO LOOK FOR IT.”
MARCO ANNUNZIATA, GE
![Page 73: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/73.jpg)
Demo
Powered by Pivotal Big Data Suite
![Page 74: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/74.jpg)
It's all about DATA
Data Sources
Look for patterns
Prediction
![Page 75: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/75.jpg)
Transform, Sink
Spring XD
• Extensible
• Open-Source
• Fault-Tolerant
• Horizontally Scalable
• Cloud-Native
Machine Learning
Enrich Filter
Split
Dashboard
Indicators
1
2
Predict
3
Real data
Simulator
/Stocks
/TechIndicators
/Predictions
![Page 76: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/76.jpg)
![Page 77: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/77.jpg)
![Page 78: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/78.jpg)
“THE REAL OPPORTUNITY FOR CHANGE...SURPASSING THE MAGNITUDE OF THE CONSUMER INTERNET...IS THE INDUSTRIAL INTERNET, AN OPEN, GLOBAL NETWORK THAT CONNECTS PEOPLE, DATA AND MACHINES.”
JEFF IMMELT, CEO, GE
![Page 79: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/79.jpg)
FOR FURTHER INFO, CHECK OUT…
• Pivotal Data Product Info, Docs and Downloads @ http://pivotal.io/big-data
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Data Science Blog @ http://blog.pivotal.io/data-science-pivotal
• Pivotal Academy @ https://pivotal.biglms.com
• Or reach out to your local Pivotal Account Executive…
![Page 80: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/80.jpg)
BUILT FOR THE SPEED OF BUSINESS
![Page 81: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/81.jpg)
Accelerating the Generation of New Insights
October 27, 2015
Sarah Aerni, Data Science
Antonio Petrole, Data Engineering
![Page 82: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/82.jpg)
BUILT FOR THE SPEED OF BUSINESS
![Page 83: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/83.jpg)
Technology to process and store data is needed in all industries
• Gene Sequencing: COST TO SEQUENCE ONE GENOME HAS FALLEN FROM $100M IN 2001 TO $10K IN 2011 TO $1K IN 2014
• Smart Grids: READING SMART METERS EVERY 15 MINUTES IS 3000X MORE DATA INTENSIVE
• Stock Market
• Social Media: FACEBOOK UPLOADS 250 MILLION PHOTOS EACH DAY
• Oil Exploration: OIL RIGS GENERATE 25000 DATA POINTS PER SECOND
• Video Surveillance
• Medical Imaging
• Mobile Sensors
![Page 84: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/84.jpg)
What is Big Data Analytics?
• Descriptive Analytics: WHAT HAPPENED?
• Diagnostic Analytics: WHY DID IT HAPPEN?
• Predictive Analytics: WHAT WILL HAPPEN?
• Prescriptive Analytics: HOW CAN WE MAKE IT HAPPEN?
BI, Data Science
Hindsight, Insight, Foresight
[Figure axes: Complexity vs. Value of Analytics ($)]
![Page 85: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/85.jpg)
Opportunities for Data-Driven Decisions in Pharma
![Page 86: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/86.jpg)
Data driven drugs: From discovery to delivery
Drug discovery + development
Clinical Trials
Manufacturing
Marketing
Distribution and surveillance
RICH DATA SOURCES
• Molecular data
• Cellular drug screens
• Animal models
• Clinical data including notes, images, markers (e.g. genomics, lab results)
• Sensor and assay data
• Internal and partner/purchased external data
• Contact center data
• Patient registries, public and federal data, clinical partnerships
![Page 87: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/87.jpg)
A pipeline of sensors and opportunities for optimizing output
Internet of Things in Manufacturing
Input materials, Mix, Incubate, Filter, Centrifuge, Final Product
Sensors
High-Content Screens
Automated raw materials mixing
[Figure: sensor plots with axis labels TEMP, TIME, Absorbance, Elution volume, Velocity, Time]
![Page 88: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/88.jpg)
Vaccine Potency Prediction
CUSTOMER
A major pharmaceutical company
BUSINESS PROBLEM
Predict potency and antigen levels of live virus vaccines based on manufacturing sensor data and manual data collected throughout the process.
CHALLENGES
• Customer’s data model was not optimal for running analytical queries
• Manual data quality issues
• Data capture was performed with varying consistency due to high cost associated with manual data collection
SOLUTION
• Introduced a new data model to make data accessible and enable analytics (including LIMS and DeltaV)
• Built automated outlier detection/correction methods to address manual data entry quality issues
• Devised imputation methods to deal with data completeness issues
• Built predictive models with high accuracy
![Page 89: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/89.jpg)
Interpreting the utility of a measure obtained during manufacturing based on model outcomes
Sample model insights
• Some features may reveal tunable parameters to alter potency, others may simply be markers
• Features consistently absent from models may be uninformative for predicting potency
• Opportunities to provide real-time feedback on data entry errors and predicted potency outcomes
[Figure: Potency vs. assayed value (Correlation=0.45); Potency vs. duration of a step (Correlation=0.38)]
![Page 90: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/90.jpg)
Need for new environments to process big data?
• FLEXIBLE: VARIETY/VELOCITY. HDFS STORAGE AND MPP ARCHITECTURES DISTRIBUTE STORAGE AND PREVENT DATA MOVEMENT
• SCALABLE: PETABYTES OF DATA. DISTRIBUTED COMPUTATION FOR PARALLELIZATION
• ENABLING: RAPIDLY EVOLVING FIELD OF DATA SCIENCE AND TOOLS. OPEN-SOURCE LIBRARY FOR MACHINE LEARNING AT SCALE AND FRAMEWORK TO ACCESS COMMON LANGUAGES
• ACCESSIBLE: MANY EXISTING LIBRARIES, TOOLS AND EXPERTISE. SQL ENGINE AND ODBC/JDBC CONNECTIONS TO HADOOP
![Page 91: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/91.jpg)
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Pivotal Hadoop
Pivotal Greenplum Database
HAWQ
![Page 92: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/92.jpg)
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Pivotal Hadoop
Pivotal Greenplum Database
• Think of it as multiple PostgreSQL servers
• Master and Segments/Workers
• Rows are distributed across segments by a particular field (or randomly)
HAWQ
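To make the idea of distributing rows across segments by a particular field concrete, here is a minimal sketch. It uses a CRC32 hash purely for illustration; the real hash function and placement policy of Greenplum or HAWQ are not described on the slide, and the function names, the `locus` key and the segment count are assumptions.

```python
import zlib
from collections import defaultdict

def segment_for(key, num_segments):
    """Map a distribution-key value to a segment index with a stable hash."""
    return zlib.crc32(str(key).encode("utf-8")) % num_segments

def distribute(rows, key_field, num_segments):
    """Group rows by the segment that their distribution key hashes to."""
    segments = defaultdict(list)
    for row in rows:
        segments[segment_for(row[key_field], num_segments)].append(row)
    return segments

# Hypothetical rows keyed by 'locus'; equal keys always land on the same segment.
rows = [{"locus": locus, "read_id": i}
        for i, locus in enumerate([101, 101, 205, 309, 309, 309])]
segments = distribute(rows, "locus", num_segments=4)
```

Because rows with the same key value end up on the same segment, a grouping or join on that key can be computed on each segment without moving rows between machines.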
![Page 93: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/93.jpg)
Identifying duplicates: counting with grouping
Opportunities for performance improvements
– Sorting and re-sorting is required in many pipelines
– Single-threaded processes create bottlenecks in speed and need to move data
https://www.broadinstitute.org/gatk//events/2038/GATKwh0-BP-1-Map_and_Dedup.pdf
Solution: Leverage Pivotal’s distributed MPP environment by using common database functions
![Page 94: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/94.jpg)
Identifying duplicates: counting with grouping
Reference genome
Mapped reads
![Page 95: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/95.jpg)
Identifying duplicates: counting with grouping
select locus, count(*) from reads group by locus
[Figure: reference genome with mapped reads; per-locus read counts highlight duplicates]
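The grouping query above can be tried on a toy table. The sketch below runs it in SQLite through Python's `sqlite3` module rather than in Greenplum or HAWQ; the `reads` rows and the added `HAVING` filter (which keeps only loci with more than one read) are illustrative assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (read_id INTEGER, locus INTEGER)")
# Hypothetical mapped reads: three at locus 100, one at 250, two at 400.
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 100), (4, 250), (5, 400), (6, 400)],
)

# Same count-with-grouping idea as the slide, restricted to duplicated loci.
duplicate_loci = conn.execute(
    "SELECT locus, count(*) AS n FROM reads "
    "GROUP BY locus HAVING count(*) > 1 ORDER BY locus"
).fetchall()
print(duplicate_loci)  # [(100, 3), (400, 2)]
```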
![Page 96: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/96.jpg)
Counting numbers of reads mapped to features
select exon, count(*) from reads JOIN refseq ON (<reads overlap exon>) group by exon
[Figure: reference genome with mapped reads; per-exon read counts of 5, 17 and 12]
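The `<reads overlap exon>` placeholder in the query above is left open on the slide. One common way to express it is an interval-overlap condition, sketched here in SQLite via Python; the column names (`read_start`, `read_end`, `exon_start`, `exon_end`) and the sample rows are assumptions, not the presenters' schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE refseq (exon TEXT, exon_start INTEGER, exon_end INTEGER);
CREATE TABLE reads (read_id INTEGER, read_start INTEGER, read_end INTEGER);
INSERT INTO refseq VALUES ('exon1', 100, 200), ('exon2', 300, 400);
INSERT INTO reads VALUES
    (1, 90, 120), (2, 150, 180), (3, 195, 240), (4, 310, 350), (5, 500, 540);
""")

# Two closed intervals overlap when each one starts before the other ends.
exon_counts = conn.execute("""
    SELECT exon, count(*)
    FROM reads
    JOIN refseq ON (read_start <= exon_end AND read_end >= exon_start)
    GROUP BY exon
    ORDER BY exon
""").fetchall()
print(exon_counts)  # [('exon1', 3), ('exon2', 1)]
```

Read 5 overlaps no exon, so it drops out of the inner join and is not counted.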
![Page 97: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/97.jpg)
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Pivotal Hadoop
• Think of it as a distributed file system with very large blocks of data
• Schema on read allows flexibility for a variety of datasets
• Compute using a variety of paradigms (e.g. MapReduce)
[Figure: a Name Node and Data Nodes 1 to 4 holding copies of blocks 1, 2 and 3]
Pivotal Greenplum Database
HAWQ
![Page 98: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/98.jpg)
Multiple tools with a single, simple goal: Distributed storage with in-place computation
Pivotal Hadoop
Pivotal Greenplum Database
HAWQ
Think of it as distributed PostgreSQL (GPDB) on Hadoop
• SQL compliant
• World-class query optimizer
• Interactive query
• Horizontal scalability
• Robust data management
• Common Hadoop formats
• Deep analytics
![Page 99: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/99.jpg)
A single address for everything analytics
Analytics with Pivotal
Time-to-Insights
FORECASTING, CLUSTERING, REGRESSION, CLASSIFICATION, OPTIMIZATION
![Page 100: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/100.jpg)
PLATFORM
Data Science Toolkit
KEY TOOLS KEY LANGUAGES
SQL
![Page 101: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/101.jpg)
Historically data was studied in silos
BRCA dataset
• Treatments
• Protein Assays
• Imaging
• Variants
• Patient History & Follow-Ups
• Gene Expression
• Copy Number Variation
• miRNA
![Page 102: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/102.jpg)
Need for new environment: Data movement
Genomics Data Center
Researcher Computing Cluster
• Unnecessary data movement
• Network usage
Computation and storage in a single location
![Page 103: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/103.jpg)
In-database genome-wide association study
Network Interconnect
Master Servers
Segment Servers
SQL & R

COVARIATES
| Indiv | Covariate 1 | Covariate 2 | Covariate 10 |
|---|---|---|---|
| 1 | F | 23 | 18 |
| 2 | M | 39 | 41 |
| 3 | M | 50 | 23 |
| N | F | 19 | 24 |

GENOTYPES
| Indiv | SNP 1 | SNP 2 | SNP M |
|---|---|---|---|
| 1 | AA | CC | TT |
| 2 | AT | CG | TT |
| 3 | AA | GG | TC |
| N | TT | CG | TC |
![Page 109: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/109.jpg)
In-database genome-wide association study
Network Interconnect
Master Servers
Segment Servers
SQL & R per SNP: SNP1, SNP2, SNPM yield Pval1, Pval2, PvalM and LOR1, LOR2, LORM

COVARIATES
| Indiv | Covariate 1 | Covariate 2 | Covariate 10 |
|---|---|---|---|
| 1 | F | 23 | 18 |
| 2 | M | 39 | 41 |
| 3 | M | 50 | 23 |
| N | F | 19 | 24 |

GENOTYPES
| Indiv | SNP | Geno |
|---|---|---|
| 1 | 1 | AA |
| 2 | 1 | AT |
| 3 | 1 | AA |
| 1 | 2 | CC |
| 2 | 2 | CG |
| 3 | 2 | GG |
| N | M | TC |

RESULTS
| SNP | P-value |
|---|---|
| 1 | 2.34 × 10^-21 |
| 2 | 0.395 |
| 3 | 7.15 × 10^-17 |
| M | 0.000142 |

• In-database computation of 1 million loci for thousands of individuals in seconds
• Results are easily manipulated and explored
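The slides do not say which statistical test produces the per-SNP p-values and log odds ratios (LOR). As one possible illustration, the sketch below runs a simple allelic chi-square test on a 2x2 table of allele counts for cases versus controls, working from the long (Indiv, SNP, Geno) layout shown above. The case/control phenotype, the choice of effect allele, the Haldane 0.5 correction and all function names are assumptions.

```python
import math
from collections import defaultdict

def allelic_test(table):
    """Return (log odds ratio, p-value) for a 2x2 allele-count table [[a, b], [c, d]]."""
    (a, b), (c, d) = table
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    chi2 = n * (a * d - b * c) ** 2 / denom if denom else 0.0
    # Survival function of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    # Haldane correction (+0.5 per cell) keeps the ratio finite for zero cells.
    log_odds = math.log(((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5)))
    return log_odds, p_value

def gwas(genotypes, phenotype):
    """genotypes: (indiv, snp, geno) rows; phenotype: indiv -> 1 (case) or 0 (control)."""
    by_snp = defaultdict(list)
    for indiv, snp, geno in genotypes:
        by_snp[snp].append((indiv, geno))
    results = {}
    for snp, rows in by_snp.items():  # every SNP is tested independently of the others
        effect = min(allele for _, geno in rows for allele in geno)
        table = [[0, 0], [0, 0]]  # rows: cases, controls; columns: effect allele, other
        for indiv, geno in rows:
            group = 0 if phenotype[indiv] == 1 else 1
            for allele in geno:
                table[group][0 if allele == effect else 1] += 1
        results[snp] = allelic_test(table)
    return results

# Toy data: SNP 1 separates cases from controls, SNP 2 carries no signal.
genotypes = [(1, 1, "AA"), (2, 1, "AA"), (3, 1, "TT"), (4, 1, "TT"),
             (1, 2, "AT"), (2, 2, "AT"), (3, 2, "AT"), (4, 2, "AT")]
phenotype = {1: 1, 2: 1, 3: 0, 4: 0}
results = gwas(genotypes, phenotype)
```

Since no SNP's test depends on any other SNP, the loop over `by_snp` is the part that an MPP database can spread across segments.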
![Page 110: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/110.jpg)
Procedural Languages in Big Data Science
• HAWQ & PL/X can take advantage of “data parallel” tasks by performing analyses in parallel – embarrassingly parallel tasks
  – Little/no effort required to break up the problem into parallel tasks
  – No dependency (or communication) between tasks
• Examples of ‘data parallel’ problems:
  – Counting words in documents
  – Genome-Wide Association Study
  – Studying network anomalies
• Sample Implementations by PDL
  – Digital image processing
  – Bayesian Inference with MCMC
  – Parallel Bagged Decision Trees
[Figure: Doc1, Doc2, DocM are stemmed (Stem1, Stem2, StemM) and counted (Count1, Count2, CountM) with SQL & R across the Network Interconnect, Master Servers and Segment Servers]
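The word-count example of a data-parallel task can be sketched outside the database as well. The version below uses a Python thread pool as a stand-in for segment servers; it is not PL/X code, the stemming step from the figure is skipped, and the documents and function name are assumptions.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_words(doc):
    """Count words in one document; needs nothing from any other document."""
    return Counter(doc.lower().split())

# Hypothetical documents; each one is an independent unit of work.
docs = ["big data big machines", "data science", "machines learn from data"]

with ThreadPoolExecutor(max_workers=3) as pool:
    per_doc_counts = list(pool.map(count_words, docs))

# Partial results are only combined at the end, after the parallel step.
total = sum(per_doc_counts, Counter())
```

The tasks never communicate, so the work splits cleanly however many workers are available; only the final merge touches more than one document's result.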
![Page 111: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/111.jpg)
Finding Causal Variants in Lupus
Customer
Biotech Company
Business Problem
The customer wants to establish internal data science capabilities: building a culture and acquiring hardware and people to support it
Challenges
• Customer needs to establish a culture around sharing and analyzing datasets for value
• Current in-house technology is unable to support large-scale analysis (e.g. unable to analyze genomics datasets)
• Need to learn new paradigms for analyzing data at-scale
Solution
• Train customer employees on our solution stack, provide one-on-one consulting and run a hackathon
• Greatly reduced computation time enabling implementation and results during a 30 hour period
  – Module previously requiring 30 min took only 5 sec using SQL in database
• Novel scientific discovery on untouched data
  – Dataset previously untouched for 2 years due to limited resources
  – Mine for statistically significant associations between ~400,000 variants for 1000 phenotypes
![Page 112: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/112.jpg)
Processing images and building integrated models at scale
![Page 116: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/116.jpg)
Image Computation Framework
Input: Thousands of Images, stored as one sequence file (Hadoop SequenceFile)
HDFS (Map reduce)
• Image Pre-Processing
• Feature Generation
• Features: img1 [x1-xM], img2 [x1-xM], imgN [x1-xM]
HAWQ/GPDB (Map reduce, PL/X, SQL)
• Raw Pixels: img1 [rgb1-rgbK], img2 [rgb1-rgbK], imgN [rgb1-rgbK]
• Feature Generation
• Features: img1 [x1-xM], img2 [x1-xM], imgN [x1-xM]
• Join to additional datasets
• Additional Datasets: Proteomics, Medical History, Variants
• Build Models at-Scale
![Page 117: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/117.jpg)
Representing an image in HAWQ
HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations
Source Image: a 3×3 pixel grid with columns 0 to 2 and rows 0 to 2
Structured: one table row per pixel with fields col, row, intsy, covering coordinates (0,0), (0,1), (0,2), (1,0), (1,1), (1,2), (2,0), (2,1), (2,2)
![Page 118: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/118.jpg)
Translating image processing to simple SQL
Function: Distribution of pixel intensities
SQL: SELECT intsy, count(*) FROM tbl GROUP BY intsy
Output: 150, 5; 215, 4
HAWQ enables rapid processing of multiple or extremely large images in parallel without memory limitations
• No data movement required
• Simple SQL queries for data exploration
[Figure: the 3×3 source image and its structured form with fields col, row, intsy]
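The intensity-distribution query can be reproduced on a toy image. This sketch stores a 3×3 image as one row per pixel in SQLite (used here as a stand-in for HAWQ) and runs the query from the slide. The pixel arrangement is invented so that the counts match the slide's output, and the columns are named `row_idx` and `col_idx` instead of `row` and `col` to sidestep any keyword clash.

```python
import sqlite3

# Hypothetical 3x3 image: five pixels of intensity 150 and four of 215.
image = [[150, 215, 150],
         [215, 150, 215],
         [150, 215, 150]]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (row_idx INTEGER, col_idx INTEGER, intsy INTEGER)")
# One table row per pixel: (row, column, intensity).
pixels = [(r, c, v) for r, line in enumerate(image) for c, v in enumerate(line)]
conn.executemany("INSERT INTO tbl VALUES (?, ?, ?)", pixels)

histogram = conn.execute(
    "SELECT intsy, count(*) FROM tbl GROUP BY intsy ORDER BY intsy"
).fetchall()
print(histogram)  # [(150, 5), (215, 4)]
```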
![Page 119: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/119.jpg)
Image Processing Pipeline For Object Counting
• Original
• Smoothing: Average over window of pixels
• Thresholding: Select pixels under intensity threshold
• Cleanup: Min/max over window of pixels
• Object Detection: Connected components
• Object Counting: Select components with size filter

| Image name | # Cells |
|---|---|
| Tma_001.jpg | 359 |
| Tma_002.jpg | 1892 |
| Tma_003.jpg | 871 |
| … | … |
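The pipeline steps above can be sketched end to end on a tiny synthetic image. This is a plain-Python illustration, not the in-database implementation from the talk: the 3×3 window, the threshold of 128, the minimum component size, 4-connectivity and the test image are all assumptions.

```python
def window(img, r, c):
    """Values in the 3x3 neighbourhood of (r, c), clipped at the image border."""
    rows, cols = len(img), len(img[0])
    return [img[i][j]
            for i in range(max(0, r - 1), min(rows, r + 2))
            for j in range(max(0, c - 1), min(cols, c + 2))]

def apply_window(img, fn):
    """Apply fn to the neighbourhood of every pixel."""
    return [[fn(window(img, r, c)) for c in range(len(img[0]))]
            for r in range(len(img))]

def smooth(img):
    """Smoothing: average over a window of pixels."""
    return apply_window(img, lambda values: sum(values) / len(values))

def threshold(img, limit):
    """Thresholding: mark pixels under the intensity threshold."""
    return [[1 if v < limit else 0 for v in row] for row in img]

def cleanup(mask):
    """Cleanup: min over the window (erode), then max over the window (dilate)."""
    return apply_window(apply_window(mask, min), max)

def components(mask):
    """Object detection: sizes of 4-connected components of marked pixels."""
    seen, sizes = set(), []
    for r in range(len(mask)):
        for c in range(len(mask[0])):
            if mask[r][c] and (r, c) not in seen:
                stack, size = [(r, c)], 0
                seen.add((r, c))
                while stack:
                    i, j = stack.pop()
                    size += 1
                    for ni, nj in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                        if (0 <= ni < len(mask) and 0 <= nj < len(mask[0])
                                and mask[ni][nj] and (ni, nj) not in seen):
                            seen.add((ni, nj))
                            stack.append((ni, nj))
                sizes.append(size)
    return sizes

def count_objects(img, limit=128, min_size=10):
    """Object counting: keep components that pass the size filter."""
    mask = cleanup(threshold(smooth(img), limit))
    return sum(1 for size in components(mask) if size >= min_size)

def make_image():
    """Bright background with two dark 6x6 'cells' and one dark 3x3 speck."""
    img = [[255] * 19 for _ in range(16)]
    for r0, r1, c0, c1 in ((2, 7, 2, 7), (2, 7, 11, 16), (11, 13, 5, 7)):
        for r in range(r0, r1 + 1):
            for c in range(c0, c1 + 1):
                img[r][c] = 0
    return img

img = make_image()
n_cells = count_objects(img)
```

On this image the small speck survives smoothing and thresholding as a five-pixel blob, is removed by the cleanup step, and the two larger cells remain to be counted.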
![Page 121: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/121.jpg)
A Drug-Centric Data Lake to Enable Drug Discovery
Customer
A major pharmaceutical company
Business Problem
Identifying promising drug targets by leveraging and integrating the vast datasets available will reduce the time and cost to bring a new product to market
Challenges
• Data for drug screens across multiple modalities cannot be easily integrated
• Current environments cannot support the growing data from high-content screens
• Researchers are unable to leverage the entirety of datasets, and instead work with aggregates or summaries
Solution
• Proved that current customer models can be ported, sped up and scaled in the Pivotal environment
• Created richer models integrating multiple types of data (genomics, images, etc.)
• Improved models using the raw, most-granular data
• Demonstrated how the availability of tools and data enables scientists to interrogate models and derive deeper, actionable insight
![Page 122: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/122.jpg)
Check out the Pivotal Data Science Blog!
http://blog.pivotal.io/data-science-pivotal
![Page 123: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/123.jpg)
FOR FURTHER INFO, CHECK OUT…
• Pivotal Blog @ http://blog.pivotal.io
• Pivotal Academy @ https://pivotal.biglms.com
![Page 124: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/124.jpg)
Driving Insights from Data Lakes Manufacturing Demo
Matthew Ross & Antonio Petrole
Pivotal Data Engineering
![Page 125: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/125.jpg)
Internet of What????
![Page 126: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/126.jpg)
Industrial Internet of Things?
![Page 127: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/127.jpg)
IoT Goes Mainstream
According to Gartner, Inc. (a technology research and advisory corporation), there will be nearly 26 billion devices on the Internet of Things by 2020.
![Page 128: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/128.jpg)
GE Doubles Down
GE invests in IIoT cloud and creates Predix cloud built upon Pivotal Cloud Foundry. GE estimates that connecting these industrial machines to the IoT could boost global GDP by $10 trillion to $15 trillion in 20 years. McKinsey Global Institute research holds that IoT in general could add as much as $6.2 trillion a year to the global economy by 2025.
![Page 129: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/129.jpg)
Converging Trends
● Data Science and Machine Learning
● Big Data
● Internet of Things
Innovation: New Data, New Processes, New Insights
The Journey to the Data-Driven Enterprise
![Page 130: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/130.jpg)
IoT Key Workflows
Data Flow Management
● Data Overload
● Normalization
● Multiple Sources and Destinations
Reliable Infrastructure
● High Availability
● Fault Tolerant
● Scalable
Enterprise Level Tooling
● Workflow Orchestration
● Admin Tooling
● Developer Enablement
![Page 132: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/132.jpg)
Data Flow Management-Data Overload
● The ability to stream and process massive amounts of data
● Must have a platform that can handle that much data without losing or corrupting any of it
![Page 133: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/133.jpg)
Data Flow Management-Data Normalization and Cleansing
● Organizing Fields to fit into a relational structure
● Adding extra fields or removing unneeded ones
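A minimal Python sketch of the two bullets above, assuming a hypothetical four-column target table and made-up raw field names (`id`, `type`, `val`, `debug`); none of these names come from the deck. It fits each reading into a fixed column order, adds one field on ingest, and drops fields the table does not need.

```python
from datetime import datetime, timezone

# Hypothetical target schema: the column order a relational table might use.
SCHEMA = ("device_id", "metric", "value", "recorded_at")


def normalize(raw: dict, received_at: datetime) -> tuple:
    """Map one raw sensor reading onto the fixed SCHEMA column order."""
    row = {
        "device_id": str(raw["id"]).strip().lower(),  # tidy the identifier
        "metric": raw["type"],
        "value": float(raw["val"]),                   # cast text to a number
        "recorded_at": received_at.isoformat(),       # extra field added on ingest
    }
    # Raw fields that SCHEMA does not name (for example "debug") are dropped.
    return tuple(row[column] for column in SCHEMA)


raw_reading = {"id": " Robot-7 ", "type": "temp_c", "val": "71.5", "debug": "x"}
row = normalize(raw_reading, datetime(2015, 6, 1, 12, 0, tzinfo=timezone.utc))
print(row)
```

Keeping the schema in one tuple means every row is emitted in the same column order, which is what a relational load expects.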
![Page 134: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/134.jpg)
Data Flow Management-Multiple Sources and Destinations
● Stream data from multiple different sources
● Persist it to multiple different destinations
● Process multiple different data formats
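The three requirements above can be sketched in plain Python. This is an illustration only: the parser table, the CSV column order, and the two in-memory destinations are assumptions, not part of any Pivotal product.

```python
import csv
import io
import json


def parse_json(payload: str) -> dict:
    return json.loads(payload)


def parse_csv(payload: str) -> dict:
    # Assumed column order for the CSV source: device, metric, value.
    device, metric, value = next(csv.reader(io.StringIO(payload)))
    return {"device": device, "metric": metric, "value": float(value)}


# One parser per incoming data format.
PARSERS = {"json": parse_json, "csv": parse_csv}


def route(messages, destinations):
    """Parse each (format, payload) pair and hand the record to every destination."""
    for fmt, payload in messages:
        record = PARSERS[fmt](payload)
        for write in destinations:
            write(record)


data_lake, dashboard = [], []
route(
    [
        ("json", '{"device": "press-1", "metric": "psi", "value": 88.0}'),
        ("csv", "robot-7,temp_c,71.5"),
    ],
    destinations=[data_lake.append, dashboard.append],
)
print(data_lake)
```

Adding a new source format is then a matter of registering one more parser, and adding a destination is one more callable in the list.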
![Page 136: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/136.jpg)
Reliable Infrastructure-High Availability and Fault Tolerance
● Resources are available under any conditions
● System stays up even if some resources go down
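One common way to keep a system answering while some resources are down is to try redundant replicas in turn. The sketch below assumes two hypothetical replica functions and is not tied to any specific platform.

```python
def read_with_failover(replicas, key):
    """Try each replica in order and return the first successful answer."""
    failures = 0
    for replica in replicas:
        try:
            return replica(key)
        except ConnectionError:
            failures += 1  # this replica is down, move on to the next one
    raise ConnectionError(f"all {failures} replicas failed for {key!r}")


def down_node(key):
    # Hypothetical replica that is currently unreachable.
    raise ConnectionError("node unreachable")


def healthy_node(key):
    # Hypothetical replica holding the latest reading per device.
    return {"robot-7": 71.5}[key]


reading = read_with_failover([down_node, healthy_node], "robot-7")
print(reading)
```

The caller sees a normal result as long as at least one replica responds; only a total outage surfaces as an error.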
![Page 137: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/137.jpg)
Reliable Infrastructure-Scalability
● Must be able to handle more traffic demand at any time
● Easily process big data workloads
![Page 139: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/139.jpg)
Enterprise Level Tooling-Workflows and Tooling
● Manage complex flows of data
● Provide rich User Interface Applications
![Page 140: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/140.jpg)
Enterprise Level Tooling-Developer Enablement
● Full Featured APIs
● Extreme module customization if needed
![Page 141: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/141.jpg)
Are you making the most out of your data?
![Page 142: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/142.jpg)
Bringing it all together
![Page 143: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/143.jpg)
Reporting is nice, but being able to take action is what drives the value of a platform
![Page 144: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/144.jpg)
Predictive Analytics
Proactive Monitoring
Reactive Maintenance
![Page 145: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/145.jpg)
ETL vs Streaming
● Data is loaded in large batches
● Typically happens once a day
● Analysis can only be done once data is transformed and persisted to data warehouse
![Page 146: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/146.jpg)
ETL vs Streaming
● Continuous data streams are “listening” for data being emitted from sensors
● Data can be analyzed in stream
● Can be integrated into data-driven applications
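The contrast between the two approaches can be sketched in a few lines of Python. The sensor values are made up for illustration: the batch function can answer only after the whole load is in place, while the streaming function produces an updated answer after every reading.

```python
def batch_average(readings):
    """ETL style: the answer exists only after the whole batch has been loaded."""
    loaded = list(readings)
    return sum(loaded) / len(loaded)


def streaming_average(readings):
    """Streaming style: yield an updated running average after every reading."""
    count, total = 0, 0.0
    for value in readings:
        count += 1
        total += value
        yield total / count


sensor_values = [70.0, 72.0, 74.0, 90.0]  # hypothetical temperature readings
running = list(streaming_average(sensor_values))
print(running)
print(batch_average(sensor_values))
```

Both paths end at the same final figure, but the streaming version exposes every intermediate value, so an application can react to the jump at the fourth reading as soon as it arrives.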
![Page 147: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/147.jpg)
Reactive Maintenance
● Alert is sent out to someone on factory floor
● Worker must receive alert, and be able to react
● Equipment is down for however long it takes for worker to receive notification and complete repairs.
![Page 148: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/148.jpg)
Proactive Maintenance Workflow
● Manager has Dashboard with all gauges of the system
● Historical Log of Recent Alerts is visible on the Dashboard
● Has the ability to dispatch a worker to investigate a specific line or robot
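As a rough sketch of the workflow above, the state behind such a dashboard can be modeled as current gauge values, a bounded log of recent alerts, and a list of dispatch orders. The class name, the 80 °C limit, and the robot and worker identifiers are all hypothetical.

```python
from collections import deque


class Dashboard:
    """Holds the latest gauge values, a short alert history, and dispatch orders."""

    def __init__(self, history: int = 3, limit_c: float = 80.0):
        self.gauges = {}
        self.recent_alerts = deque(maxlen=history)  # oldest alerts fall off
        self.dispatched = []
        self.limit_c = limit_c  # assumed alert threshold

    def update(self, robot: str, temp_c: float) -> None:
        self.gauges[robot] = temp_c
        if temp_c > self.limit_c:
            self.recent_alerts.append((robot, temp_c))

    def dispatch(self, worker: str, robot: str) -> None:
        self.dispatched.append((worker, robot))


board = Dashboard()
for robot, temp_c in [("robot-7", 71.5), ("robot-9", 84.0), ("robot-7", 86.5)]:
    board.update(robot, temp_c)

# The manager sends a worker to the robot named in the most recent alert.
board.dispatch("worker-12", board.recent_alerts[-1][0])
print(board.gauges, list(board.recent_alerts), board.dispatched)
```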
![Page 149: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/149.jpg)
Predictive Analytics Workflow
● Run Machine Learning and Data Science models
● Incorporate Business Intelligence Tools
● Persist data to Data Lake and run advanced queries
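To give a flavor of what a predictive model adds over alerting, here is a deliberately simple stand-in: an ordinary least-squares trend line fitted to hypothetical hourly temperatures, used to estimate how long until a limit is crossed. Real models in this workflow would be far richer; the data, the limit, and the function names are assumptions.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return slope, mean_y - slope * mean_x


def hours_until(limit, hours, temps):
    """Estimate hours from the last reading until the trend reaches the limit."""
    slope, intercept = fit_line(hours, temps)
    if slope <= 0:
        return None  # no upward trend, so no crossing is predicted
    return (limit - intercept) / slope - hours[-1]


hours = [0, 1, 2, 3]              # hypothetical hourly sample times
temps = [70.0, 72.0, 74.0, 76.0]  # hypothetical readings in degrees Celsius
remaining = hours_until(80.0, hours, temps)
print(remaining)
```

A forecast like this lets maintenance be scheduled before the limit is reached, rather than after an alert fires.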
![Page 150: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/150.jpg)
Data Streaming Needs an Agile, Scalable and Fast Solution
Data Lake
Data Ingestion
Business Intelligence
Real Time Analytics
Mobile App
![Page 151: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/151.jpg)
Spring XD Orchestrates and Automates all the Steps on Data Stream Pipelining
Spring XD: Ingest, Transform, Sink
Data Lake: HDB, PHD
● Extensible
● Open-Source
● Fault-Tolerant
● Horizontally Scalable
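The ingest, transform, sink shape can be illustrated with chained Python generators. This is a conceptual analogy only, not Spring XD code or its API: the record format and the in-memory list standing in for the data lake are assumptions.

```python
def ingest(lines):
    """Source stage: emit raw records one at a time."""
    for line in lines:
        yield line.strip()


def transform(records):
    """Processing stage: turn 'device,value' text into structured records."""
    for record in records:
        device, value = record.split(",")
        yield {"device": device, "temp_c": float(value)}


def sink(records, store):
    """Sink stage: persist each record (here, into an in-memory list)."""
    for record in records:
        store.append(record)
    return store


# Stages are chained so each record flows through all three steps in turn.
data_lake = sink(transform(ingest(["robot-7,71.5\n", "robot-9,68.0\n"])), [])
print(data_lake)
```

Because each stage only consumes an iterator and produces one, stages can be swapped or extended without touching the others, which is the property the pipeline diagram is pointing at.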
![Page 152: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/152.jpg)
Spring XD
State of the Art Data Pipeline Automation
INGEST / SINK
• No coding required
• Dozens of built-in connectors
• Seamless integration with Kafka, Sqoop
• Create new connectors easily using Spring
PROCESS
• Call Spark, Reactor or RxJava
• Built-in configurable filtering, splitting and transformation
• Out-of-box configurable jobs for batch processing
ANALYZE
• Import and invoke PMML jobs easily
• Call Python, R, MADlib and other tools
• Built-in configurable counters and gauges
![Page 153: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/153.jpg)
Pivotal HDB
Hadoop Native SQL
• Exceptional Hadoop Native SQL Performance
• No compatibility risks to SQL developers or SQL BI tools and applications
• Support query roll-ups, dynamic partitions and joins
• Massive MPP scalability to petabytes
• On premise or on the cloud
• Scale your cluster out, not up
• World class parallel loading and unloading
• Fast performance for complex and advanced data analytics
• Integrated with MADlib for advanced machine learning
• Powerful Cost-based Query Optimizer
![Page 154: Driving Real Insights Through Data Science](https://reader037.fdocuments.us/reader037/viewer/2022102322/5873d3461a28ab9d168b6a13/html5/thumbnails/154.jpg)
BUILT FOR THE SPEED OF BUSINESS