Post on 08-Jan-2017
DIRECTIONREDACTOR DATE
BIG DATA PLATFORM INDUSTRIALIZATION
BIG DATA INITIATIVES @Renault
2014: Big Data Sandbox on old HPC infrastructure. Site: Innovation Lab. POC: Quality Data Exploration.
2015: DataLab implementation. New HP infrastructure. Data Protection: NO. 1st level of industrialization.
2016: Big Data Platform industrialization to host both POCs and projects in production. Data Protection: YES.
Big Data Deployment Production Stakes
• One Hadoop cluster with 24/7, always-on visibility of data (instead of siloing it)
• Many possibilities for crossing data
• Simplify operations
• Design simplicity
• Charge-back model
• Scalability and isolation
• Isolate experimental applications from production
Deployment of Quality Projects on the DataLake
[Architecture diagram]
Components: Big Data Developers; Client Server (Hadoop clients installed); DMZ; Load Balancer; KNOX Gateway; Edge Node; Name Node; Search Nodes; DataStore Master; Data Nodes (DN); Data Sources
Flow labels: Web service; submit jobs; Web Applications; Import; Access GUI
Big Data Ecosystem @ Renault
Consumers: Quality, Sales and Marketing, Supply Chain Engineering
Open Data, Internet of Things
Producers: Batch (RDBMS, Files); Messages, Logs; Streaming, Data Flow
NFS Gateway, Sqoop, Spark SQL
Flume, Logstash
Kafka Producers, Kafka Broker (Topics)
Spark Streaming
Elasticsearch, HBase, Hive, HDFS
Spark SQL, Spark RDD
YARN + HDFS
Data Ingestion Scenarios : RDBMS
Source: RDBMS
Ingestion tools:
• Sqoop: basic data import based on a column id (integer) or timestamp. Example: SOPHIA. ELT architecture: Extract, Load and Transform.
• Spark SQL: non-standard data import, specific schema. Example: BLMS. Supports ETL architecture: Extract, Transform and Load. Supports ingestion directly to Elasticsearch.
Targets:
• Hive: insert only, nocturnal batch, SQL queries, high latency
• Flat Files: insert only, nocturnal batch, data processing. File formats: CSV, Parquet, Avro
• HBase: insert and update, NoSQL DB (key-value schema), low latency, for scaling out relational DBs on Hadoop, interactive analytics (Spotfire)
• Elasticsearch: very low latency (SSD disks), near-real-time analytics (watching alerts), text search (log analysis), nested and parent/child relationships
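The Sqoop scenario above (import based on an integer column id or a timestamp) can be illustrated with a small sketch. It is not Sqoop itself: it uses Python's built-in sqlite3 as a stand-in for the source RDBMS, and the table name `defects` and function name `incremental_import` are hypothetical. The idea is the same as Sqoop's incremental append mode (`--check-column`, `--incremental append`, `--last-value`): each nightly run fetches only rows whose check column is greater than the value saved by the previous run.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check column exceeds last_value.

    Returns (rows, new_last_value) so the caller can store the new
    high-water mark for the next batch run.
    """
    # Identifiers cannot be bound as SQL parameters, so validate them.
    if not (table.isidentifier() and check_column.isidentifier()):
        raise ValueError("table and column must be plain identifiers")
    cursor = conn.execute(
        f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}",
        (last_value,),
    )
    column_names = [d[0] for d in cursor.description]
    rows = cursor.fetchall()
    idx = column_names.index(check_column)
    new_last_value = rows[-1][idx] if rows else last_value
    return rows, new_last_value


# Hypothetical source table standing in for an RDBMS data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (id INTEGER PRIMARY KEY, label TEXT)")
conn.executemany(
    "INSERT INTO defects (label) VALUES (?)",
    [("paint",), ("brake",), ("seal",)],
)

# First run: nothing imported yet, so start from 0.
first_batch, mark = incremental_import(conn, "defects", "id", 0)

# A new row arrives during the day; the next run picks up only that row.
conn.execute("INSERT INTO defects (label) VALUES (?)", ("door",))
second_batch, mark = incremental_import(conn, "defects", "id", mark)
```

This approach only sees inserted rows, which fits the insert-only targets listed above; sources with updates would need a timestamp column as the check column instead.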
Interactive SQL Data Analytics
Main Objectives
• Speed up BI queries on Big Data stores
• Hide the complexity of the Big Data architecture from the end user and provide only one data connector for Spotfire
• Provide an interactive user experience
• Data virtualization (no need to systematically import RDBMS data for crossing data)
Spark SQL (1.6) : The emerging solution for Interactive SQL Data Analytics with the Data Source API.
Interactive SQL Data Analytics
[Diagram]
Only One Data Connector
Spark SQL with the DataSource API: Add In-Memory Capability
Data Processing Applications
Load/Insert: HBase, Hive, Elasticsearch, Parquet Files, RDBMS
Big Data In Action
Multitenancy
High Availability
Security
Data Governance Policy
Continuous Delivery
Data Protection Hadoop
Organization
POC#0
POC#1
POC#2
POC#n
Release Management
PRODUCTION: SLA & Monitoring
Tenant 1
App 1
Tenant 2
App 1
App 2
Tenant n
App 1
Management and Monitoring
DataLab, Open to Data Exploration
Bi-modal Big Data Platform
One physical, cost-effective platform
Hadoop Global Data Life Cycle
Data Sources
Load or archive batch data
Stream real time data
Mask sensitive data with an automated process
Renault Big Data Platform
Refine, curate, process, query data
Big Data Web Services
INGESTION
Scheduled and monitored by AITS in PROD
Policies pre-defined by Security Officer
ADA ARCA
subscription request to datasets
Data Access
AITS: Groups Management
DIRx Data Owner Validation
Sync
Users
Defined in Ranger and Protegrity
Defined in Falcon
Active / Active Hadoop Platform
Rack Room 1 C2, Rack Room 2 C2
Data Nodes for Block Storage
HDFS PROD HDFS POC
High Availability Architecture
Hadoop Security Levels
OS Security
Authorization
Perimeter Level Security: Protected Zone
Data Protection: Selected Solution: Tokenization (Protegrity)
Tokenization Definition
Selected solution: Tokenization
• Tokenization is a form of data protection that converts sensitive data into fake data.
• The real data can be retrieved by authorized users.
• Protegrity: the only available solution for Hadoop (also supports traditional data systems)
Data Protection Key Requirements:
• Ability to de-identify Personally Identifiable Data
• Restrict data access (financial data, data residency obligations, …)
• Provide central management and control of all data security operations
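The definition above (sensitive data replaced by fake data, real data retrievable by authorized users) can be sketched with a toy token vault. This is an illustration only and does not represent Protegrity's product or API: the class `TokenVault`, its methods and the `authorized` flag are all hypothetical, and the choice to keep the shape of the value (digits stay digits, letters stay letters) is an assumption made for the example.

```python
import secrets
import string


class TokenVault:
    """Toy vault: maps clear values to random look-alike tokens and back."""

    def __init__(self):
        self._to_token = {}
        self._to_clear = {}

    @staticmethod
    def _substitute(ch):
        # Keep the shape of the value: digit -> digit, letter -> letter.
        if ch.isdigit():
            return secrets.choice(string.digits)
        if ch.isalpha():
            pool = string.ascii_lowercase if ch.islower() else string.ascii_uppercase
            return secrets.choice(pool)
        return ch  # separators such as '-', '@', ' ' are left untouched

    def tokenize(self, value):
        if value in self._to_token:
            return self._to_token[value]  # same input, same token
        if not any(c.isdigit() or c.isalpha() for c in value):
            raise ValueError("nothing to tokenize in this value")
        while True:
            token = "".join(self._substitute(c) for c in value)
            # Retry on the rare collision with the input or an existing token.
            if token != value and token not in self._to_clear:
                break
        self._to_token[value] = token
        self._to_clear[token] = value
        return token

    def detokenize(self, token, authorized):
        # Only authorized users may retrieve the real data.
        if not authorized:
            raise PermissionError("user is not authorized to see clear data")
        return self._to_clear[token]


vault = TokenVault()
vin_token = vault.tokenize("VF1112C0000724284")
vin_clear = vault.detokenize(vin_token, authorized=True)
```

Because the token has the same length and character classes as the original, downstream schemas and joins on the tokenized column keep working without exposing the real value.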
DATA PROTECTION Example 1: Fine Grained Protection
Identifier | Clear | Protected | Authorized Role 1* | Authorized Role 2*
Name | Joe Smith | csu wusoj | Joe Smith | Joe Smith
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA | 100 Main Street, Pleasantville, CA | “No Access”
Date of Birth | 12/25/1966 | 01/02/1966 | 12/25/1966 | 01/02/1966
VIN | VF1112C0000724284 | AB9875R8467364752 | VF1112C0000724284 | “No Access”
Credit Card Number | 3678 2289 3907 3378 | 3846 2290 3371 3890 | xxxx xxxx xxxx 3378 | 3846 2290 3371 3890
E-mail Address | joe.smith@surferdude.org | eoe.nwuer@beusorpdqo.aku | joe.smith@surferdude.org | joe.smith@surferdude.org
Telephone Number | 760-278-3389 | 998-389-2289 | 760-278-3389 | 998-389-2289
* Authorized Role 1 can see most data in the clear. Authorized Role 2 can see limited data in the clear.
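The per-role columns of the table above can be read as a policy that assigns one action to each field: show the clear value, show the protected token, mask all but the last four digits, or deny access. A minimal sketch of that reading follows; the role keys, action names and the `view` function are hypothetical, and only four of the table's fields are included to keep it short.

```python
CLEAR = {
    "Name": "Joe Smith",
    "Date of Birth": "12/25/1966",
    "VIN": "VF1112C0000724284",
    "Credit Card Number": "3678 2289 3907 3378",
}
PROTECTED = {
    "Name": "csu wusoj",
    "Date of Birth": "01/02/1966",
    "VIN": "AB9875R8467364752",
    "Credit Card Number": "3846 2290 3371 3890",
}

# Hypothetical policy reproducing the two role columns of the table.
POLICY = {
    "role_1": {
        "Name": "clear",
        "Date of Birth": "clear",
        "VIN": "clear",
        "Credit Card Number": "mask_last4",
    },
    "role_2": {
        "Name": "clear",
        "Date of Birth": "token",
        "VIN": "no_access",
        "Credit Card Number": "token",
    },
}


def mask_last4(value):
    """Replace every digit except the last four characters with 'x'."""
    head, tail = value[:-4], value[-4:]
    return "".join("x" if c.isdigit() else c for c in head) + tail


def view(role, field):
    """Return what a given role sees for a field; unknown fields are denied."""
    action = POLICY[role].get(field, "no_access")
    if action == "clear":
        return CLEAR[field]
    if action == "token":
        return PROTECTED[field]
    if action == "mask_last4":
        return mask_last4(CLEAR[field])
    return "No Access"


role_2_record = {field: view("role_2", field) for field in CLEAR}
```

Defaulting to "No Access" for any field missing from the policy is a design choice of this sketch, so that a newly added column is hidden until a policy is defined for it.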
DATA PROTECTION Example 2: Data Residency
The Next Step: HDFS FEDERATION
Name Service 1 /store1/
Name Service 2 /store2/
NameNode 1
Name Node 2
NameNode 3
Name Node 4
DataNode 1
DataNode 2
DataNode 3
DataNode 4
DataNode 5
DataNode 6
DataNode 7
DataNode 8
Federation
Scale-out Data Nodes
Resource usage orchestrated by YARN
HA HA
• can see and access only /store2
• cannot access /store1
• cannot access Name Nodes 1 and 2
• can see and access only /store1
• cannot access /store2
• cannot access Name Nodes 3 and 4
PHYSICAL ISOLATION
Privileged Users can access /store1 and /store2 for Crossing Data
Block Storage: Unreadable Data
DataNode 9
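The access rules on this slide (each tenant confined to one of /store1 or /store2, privileged users allowed on both) can be sketched as a mount table plus a grant list. This is a plain-Python illustration of the routing idea, not HDFS configuration: the name service identifiers, user names and the functions `mount_point` and `resolve` are hypothetical.

```python
from pathlib import PurePosixPath

# Hypothetical mount table: top-level directory -> name service.
MOUNT_TABLE = {"/store1": "nameservice1", "/store2": "nameservice2"}

# Hypothetical grants: which mount points each user may reach.
GRANTS = {
    "tenant_store1": {"/store1"},
    "tenant_store2": {"/store2"},
    "privileged": {"/store1", "/store2"},  # may cross data from both stores
}


def mount_point(path):
    """Return the mount point (first path component) of an absolute path."""
    parts = PurePosixPath(path).parts
    if len(parts) < 2 or parts[0] != "/" or ".." in parts:
        raise ValueError(f"not a valid absolute path: {path}")
    mount = "/" + parts[1]
    if mount not in MOUNT_TABLE:
        raise ValueError(f"no name service is mounted at {mount}")
    return mount


def resolve(user, path):
    """Return the name service serving path, or raise if user has no grant."""
    mount = mount_point(path)
    if mount not in GRANTS.get(user, set()):
        raise PermissionError(f"{user} cannot access {mount}")
    return MOUNT_TABLE[mount]


service = resolve("tenant_store1", "/store1/quality/part-0001.parquet")
```

A user absent from the grant list gets an empty set and is therefore denied everywhere, which matches the isolation goal stated on the slide.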
BIG DATA PROJECT PROCESS
Business Use Case DIRx
POC PROD Project
Data Exploration
AT
Data Sources Ingestion
Security Policies
Data Processing development
User Access Authorization
CPT
DIA - Innovation
Front End deployment
Admin
@BICC
architecture
development
MEP
Implementation
Deployment management
DAT
DAT
DAT
DATA CHANGE MANAGEMENT
• Publish real time dashboard of all data flows.
• For each data source, the producer and the consumers are displayed:
  • Producer: the DIRx generating the data
  • Consumers: all the Big Data applications using this data source
• If the producer decides to change the data source schema, it must send a notification from the dashboard to all consumers.
• Monitoring the logs of data ingestion jobs in real time makes failure detection faster and more efficient.
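The producer/consumer bookkeeping described above can be sketched as a small registry: each data source has one producer and a set of consuming applications, and a schema change by the producer yields one notification per consumer. The class `DataFlowRegistry`, its method names and the sample source, producer and application names are hypothetical and only illustrate the rule.

```python
from collections import defaultdict


class DataFlowRegistry:
    """Tracks, per data source, the producer and the consuming applications."""

    def __init__(self):
        self.producers = {}                # source -> producing DIRx
        self.consumers = defaultdict(set)  # source -> consuming applications

    def register_source(self, source, producer):
        self.producers[source] = producer

    def subscribe(self, source, application):
        if source not in self.producers:
            raise KeyError(f"unknown data source: {source}")
        self.consumers[source].add(application)

    def notify_schema_change(self, source, requested_by, description):
        """Return one notification message per consumer of the source."""
        # Only the registered producer may announce a schema change.
        if self.producers.get(source) != requested_by:
            raise PermissionError(f"{requested_by} does not produce {source}")
        return [
            f"To {app}: {requested_by} is changing the schema of {source}: {description}"
            for app in sorted(self.consumers[source])
        ]


registry = DataFlowRegistry()
registry.register_source("source_a", "DIR-Quality")
registry.subscribe("source_a", "warranty-dashboard")
registry.subscribe("source_a", "alerting-app")
messages = registry.notify_schema_change(
    "source_a", "DIR-Quality", "column 'vin' renamed to 'vehicle_id'"
)
```

In a real dashboard the returned messages would be delivered by e-mail or a messaging channel; returning them as a list keeps the sketch free of any delivery mechanism.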
COLLECT LOGS FOR (NRT) CARTO
[Diagram]
Sqoop Metastore: Storing Jobs
Hive Metastore: Storing SQL Schema
Oozie Metastore: Storing Workflows
Hue Database: Storing Projects
Oozie Logs
Yarn Job History Logs
JDBC Plugin, Logstash, Elasticsearch, Kibana
Dynamic Relationship Portal