An Ecosystem for Competitiveness and Innovation Hervé Mouren, TERATEC Managing Director
Philippe Bricard Emmanuel Lecerf - Teratec
Transcript of Philippe Bricard Emmanuel Lecerf - Teratec
![Page 1: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/1.jpg)
© 2011 IBM Corporation
Smarter Decisions for Optimized Performance
Pursuing Big Data with IBM Platform Symphony
Philippe BricardEmmanuel Lecerf
An IBM Company
![Page 2: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/2.jpg)
© 2011 IBM Corporation2
![Page 3: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/3.jpg)
© 2011 IBM Corporation3
IBM Big Data = Volume, Variety and Velocity
33
Scale from terabytes to zettabytes
Relational and non-relational data types from an ever-expanding variety of sources.
Streaming data and large volume data movement
Volume:
Variety:
Velocity:
![Page 4: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/4.jpg)
© 2011 IBM Corporation4 4 4
Glo
bal
Dat
a Vo
lum
e in
Exa
byt
es
Multiple sources: IDC,Cisco
100
90
80
70
60
50
40
30
20
10
Agg
rega
te U
ncer
tain
ty %
9000
8000
7000
6000
5000
4000
3000
2000
1000
0
2005 2010 2015
By 2015, 80% of all available data will be uncertain: Veracity
Data quality solutions exist for enterprise data like customer, product, and address data, but
this is only a fraction of the total enterprise data.
By 2015 the number of networked devices will be double the entire global population. All
sensor data has uncertainty.
The total number of social media accounts exceeds the entire global
population. This data is highly uncertain in both its expression and content.
![Page 5: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/5.jpg)
© 2011 IBM Corporation
Big Data Customer Examples
Big Data for the CMO
5.8 terabytes of Internet and Social Media
Fix negative opinions and build on positive ones
””””Listen to the voice of clients””””
Big Data for Smarter Planet
2.8 petabytes of public and private weather data
Modeling time reduced from weeks to hrs.
Reduced modeling time by 97%
Big Data for Telco
6 billion Call Detail Records per day
Personalized marketing to individual customers
Latency reduced from 12 hrs to 1 sec
![Page 6: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/6.jpg)
© 2011 IBM Corporation
Traditional Analytics Solution Architecture
� Analytics to examine past events and existing conditions
� Web servers provide minimal reference data, if any
� Data Warehouses and databases provide almost all of the information
Analytics Engine
WebServers
Database Servers
SelectedUsers
![Page 7: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/7.jpg)
© 2011 IBM Corporation
IBM Big Data Reference Architecture
� Multiple Analytics Enginespredict outcomes and behavior
� Web Servers with crawlers and stream services gather vast amounts of external data
� Data Warehouses, Databases, and File Servers provide internal data
� Map/Reduce Clusters process unstructured data quickly
� Event Servers manage streams from devices and networks
Analytics Engines
WebServers
Database and File Servers
MapReduce Clusters Masses of
Users
EventServers
![Page 8: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/8.jpg)
© 2011 IBM Corporation8
A new infrastructure point of view
8
Text Analytics /Mining Streams Hadoop
Content Analytics UIMA Based high volume unstructured management
GPFS
Committed to Open Source Committed to Innovation
![Page 9: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/9.jpg)
© 2011 IBM Corporation9
IBM Confidential
Industry-grade Distributed File SystemGPFS SNC (General Parallel File System for Shared Nothing Clusters)
• GPFS-SNC - scalable, high performance, and highly reliable file system• Support applications on MapReduce and standard POSIX interfaces • Optimized for online and batch processing applications
Beta
![Page 10: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/10.jpg)
© 2011 IBM Corporation
Map Reduce paradigm to support innovation
(<a, > , <o, > , <p, > , …)
Each input to a map is a list of <key, value> pairs
Each output of a map is a list of <key, value> pairs
(<a’, > , <o’, > , <p’, > , …)
Grouped by key
Each input to a reduce is a <key, value-list> (possibly a list of these, depending on the grouping/hashing mechanism)e.g. <a’, ( …)>
Reduced into a list of values
![Page 11: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/11.jpg)
© 2011 IBM Corporation11
Applications or End User Access
Distributed Parallel File Systems / Data Storage
MapReduce Workload Management
Hadoop Technology
Three Logical Layers
![Page 12: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/12.jpg)
© 2011 IBM Corporation12
Resource Orchestration
C
Workload Manager
C C C C C
C C C C C C
D
D
D
D
D
D
D
D
D
D
D
D
C C C C C C
Metadata generation,File classification,
Batch analysis
Search, Analysis, Concept Recognition
Data Intensive Apps
A A A A
A A A A
A A A A
A A A A
B
B
B
B
B
B
B
B
B
B
B
BB B B B B B
A B C DRisk analytics,
Sensitivity analysis,Monte Carlo simulation
Platform Symphony - Support for Diverse Workloads & Platforms
![Page 13: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/13.jpg)
© 2011 IBM Corporation
BigInsights & Platform SymphonyBenefits of an Integrated Solution
� Low Latency Requirements– Many simultaneous short running jobs– Job cycle measured in seconds or a few minutes– Thousands or millions of tasksPlatform Symphony provides: A Service Oriented (SOA), low latency architecture and a
sophisticated scheduling engine.
� Heterogeneous Application Support on a Multi Tenant Grid– Existing Platform Symphony customers extending grid for MapReduce– Combined compute & data intensive applications on a single grid– Support for multiple applications & job types
• C#, C++, MapReduce, .NET, Python, etc.• SOA & MPI workloads
– Lines of business sharing a common grid infrastructure
Platform Symphony provides: Dynamic resource orchestration & sharing across application boundaries.
![Page 14: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/14.jpg)
© 2011 IBM Corporation
Application Development / End User Access
File System / Data Store Connectors(Distributed parallel fault-tolerant file systems / Relational & MPP Databases)
Hadoop MapReduce Processing Framework
MR Java Pig Hive MR Apps
Hadoop Applications Technical Computing Applications
R, C/C++, Python, Java, Binaries Jaql
Platform Resource Orchestrator
Distributed Runtime Scheduling Engine - Platform Symphony
Other
Managem
ent Console
SOA Framework
Application & Data Integration Architecture
HDFSHBase
DistributedFile System
Scale-outFile System
RelationalDatabase
MPPDatabase
![Page 15: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/15.jpg)
© 2011 IBM Corporation
Application Development / End User Access
File System / Data Store Connectors(Distributed parallel fault-tolerant file systems / Relational & MPP Databases)
Hadoop MapReduce Processing Framework
MR Java Pig Hive MR Apps
Hadoop Applications Technical Computing Applications
R, C/C++, Python, Java, Binaries Jaql
Platform Resource Orchestrator
Distributed Runtime Scheduling Engine - Platform Symphony
Other
Managem
ent Console
SOA Framework
Application & Data Integration Architecture
HDFSHBase
DistributedFile System
Scale-outFile System
RelationalDatabase
MPPDatabase
Infosphere BigInsights
(no JobTracker)
Infosphere BigInsights
(HDFS/GPFS SNC)
![Page 16: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/16.jpg)
© 2011 IBM Corporation
Heterogeneous Applications SupportPlatform Technology – (Green Boxes)
File System / Data Store Connectors(Distributed parallel fault-tolerant file systems / Relational & MPP Databases)
In-house Risk& TradingHbase
MapReduce Framework Low Latency SOA
MPIDDTData
AffinityVB6 / COM
Platform Resource Orchestrator (EGO)
MPPDatabase
RelationalDatabase
DistributedCache (Gemstone)
DistributedFile Systems
HDFSGPFS-SNC
Financial Modeling & Simulation
Policy-based, Distributed, Highly Available Runtime Infrastructure
Scalable Batch Distributed Job Flow
Executables
ETL & BIStatistical Analysis
Design simulation & test
Internet ScaleAnalytics
Traditional HPCData MiningMachine Learning
In-house HPC apps
In-house app Hive, Pig, Jaql
Text Analytics, IndexingPage RankingClick Stream
BigInsight Apps
Future: IBM Streams, SPSS
Big Insight Connectors
Netezza DB2
![Page 17: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/17.jpg)
© 2011 IBM Corporation17
� Higher quality results faster– Starts & runs jobs the fastest– Scales the highest
� Lower cost– Uses infrastructure more efficiently– Easier to manage– Simplifies application integration
� Better resource sharing– Integrated compute + data services– Sophisticated hierarchical sharing model– Harvesting & multi-site sharing options
� Smarter data handling– Full MapReduce & HDFS implementation– Considers data locality when scheduling tasks– Adapts to multiple data sources
� World-wide support & services– Consulting, Customer Education, Comprehensive support services
What makes Symphony unique?
![Page 18: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/18.jpg)
© 2011 IBM Corporation
� Fair Share Proportional Scheduling – 10,000 Level of Prioritization
� Priority Based Scheduling– Higher priority consumes all resources
� Pre-emptive Scheduling– Interruptive or non-interruptive
� Threshold Based Scheduling– Resources dynamically monitored– Dynamic Open/Close Logic– Administrator sets limits
� Task Reclaim Logic– Automatic when resources fail or ‘hang’
� Resource Draining– Maintenance mode
� Administrative Control of Running Jobs– Suspend, Resume, Change Priority, Kill Jobs/Tasks, Monitor
Sophisticated Scheduling Engine
![Page 19: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/19.jpg)
© 2011 IBM Corporation
Platform Symphony MapReduce versus Hadoop
Time Elapsed (seconds)
Test Number
E.Coli (K-12 MG1655, 10Kbase subset) Assembly Times
PMR
Hadoop
Performance Comparison
![Page 20: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/20.jpg)
© 2011 IBM Corporation
FSS Use cases� Customer Correspondence Analytics Approaches
– Call Center Call Data and Agent Notes Analytics– Background Quality of Service Screening / Risk Mitigation– Cross-channel Behavior Correlation– Call Center Conversation “Health” Monitoring
� Risk Platform and Analytics– Trade Manipulation Monitoring– Faster and Expanded Trade History Analytics– Corporate Risk / Exposure Analytics– Debit and credit card fraud detection– Increased Automation of account opening and “know your customer” due diligence
� IT enablement– Holistic SOA Environment Analytics– Application Security and Action Auditing on DB2 on Z– Application & Server Log Clearinghouse– Low Latency Combined DB2 on Z and Distributed System Data – Cyber-Security Packet Investigation
� Social Media Analytics– Social Media Analytics Focused on High Value Retail Banking: – Social Media Monetization *– Social Media Analytics Focused on Private Banking: – Advisor Social Media Monitoring
![Page 21: Philippe Bricard Emmanuel Lecerf - Teratec](https://reader031.fdocuments.us/reader031/viewer/2022012410/616a541d11a7b741a3514d81/html5/thumbnails/21.jpg)
© 2011 IBM Corporation
A large US Investment Banking
� Challenge: – Expanding compute capacity while reducing costs
� Needs– Low latency (< 1 ms) pre-trade application– Homegrown overnight batch risk application– Trading anomalies detection
� Solution:– Built a grid infrastructure combining compute and data
intensive risk systems
• Several million dollars of savings• 20% reduction in costs• A new credit trading application
was built in just 10 weeks instead of the 5 months
• $0.56 /CPU hr