Big Data Platform Industrialization


BIG DATA INITIATIVES @ Renault

2014: Big Data Sandbox on old HPC infrastructure. Site: Innovation Lab. POC: Quality Data Exploration.

2015: DataLab implementation. New HP infrastructure. Data Protection: NO. 1st level of industrialization.

2016: Big Data Platform industrialization to host both POCs and projects in production. Data Protection: YES.


Big Data Deployment Production Stakes

• One Hadoop cluster with 24/7 always-on visibility of data (instead of siloing it)
• Many possibilities for crossing data
• Simplified operations
• Design simplicity
• Charge-back model
• Scalability and isolation
• Isolation of experimental applications from production

Deployment of Quality Projects on the Data Lake

[Architecture diagram. Components: Big Data Developers; Client Server (Hadoop clients installed); Edge Node; Search Nodes; Name Node; DataStore (Master and Data Nodes); Data Sources; Web Applications. Labelled flows: submit jobs, import, web service calls, GUI access.]

Big Data Ecosystem @ Renault

Consumers: Quality, Sales and Marketing, Supply Chain, Engineering

Producers (including Open Data and Internet of Things):
• Batch (RDBMS, Files): NFS Gateway, Sqoop, Spark SQL
• Messages, Logs: Flume, Logstash
• Streaming, Data Flow: Kafka producers, Kafka broker (topics), Spark Streaming

Storage and search: HBase, Hive, HDFS, Elasticsearch

Processing: Spark SQL, Spark RDD

Foundation: YARN + HDFS

Data Ingestion Scenarios: RDBMS

Import paths:
• Basic data import based on a column id (integer) or timestamp. Example: SOPHIA. ELT architecture: Extract, Load and Transform.
• Spark SQL: non-standard data import with a specific schema. Example: BLMS. Supports ETL architecture (Extract, Transform and Load). Supports ingestion directly to Elasticsearch.

Target profiles:
• Insert only, nocturnal batch, SQL queries, high latency.
• Flat Files: insert only, nocturnal batch, data processing. File formats: CSV, Parquet, Avro.
• HBase: insert and update, NoSQL DB (key-value schema), low latency, for scaling out a relational DB on Hadoop, interactive analytics (Spotfire).
• Elasticsearch: very low latency (SSD disks), near-real-time analytics (watching alerts), text search (log analysis), nested and parent/child relationships.
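The "import based on column id or timestamp" scenario can be illustrated with a small sketch. It keeps a high-water mark (the last imported value of a check column) and fetches only newer rows on each nightly run, which is similar in spirit to the incremental import mode offered by tools such as Sqoop. The table name `defects`, its columns, and the function `incremental_import` are hypothetical and chosen only for illustration; an in-memory SQLite database stands in for the source RDBMS.

```python
import sqlite3


def incremental_import(conn, table, check_column, last_value):
    """Fetch rows whose check_column is greater than last_value.

    Returns the new rows and the updated high-water mark.
    table and check_column are trusted identifiers in this sketch.
    """
    query = f"SELECT * FROM {table} WHERE {check_column} > ? ORDER BY {check_column}"
    cur = conn.execute(query, (last_value,))
    columns = [d[0] for d in cur.description]
    rows = [dict(zip(columns, r)) for r in cur.fetchall()]
    new_last = rows[-1][check_column] if rows else last_value
    return rows, new_last


# Hypothetical source table standing in for an RDBMS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE defects (id INTEGER PRIMARY KEY, label TEXT, updated_at TEXT)")
conn.executemany(
    "INSERT INTO defects VALUES (?, ?, ?)",
    [
        (1, "paint", "2016-01-01 02:00:00"),
        (2, "brake", "2016-01-02 02:00:00"),
        (3, "seat", "2016-01-03 02:00:00"),
    ],
)

# First nightly run: everything is new.
first_batch, mark1 = incremental_import(conn, "defects", "id", 0)

# A row arrives during the day; the next run picks up only that row.
conn.execute("INSERT INTO defects VALUES (4, 'door', '2016-01-04 02:00:00')")
second_batch, mark2 = incremental_import(conn, "defects", "id", mark1)

# The same logic works with a timestamp column as the check column.
by_time, _ = incremental_import(conn, "defects", "updated_at", "2016-01-02 02:00:00")
```

An integer id suits insert-only sources; a timestamp column is the usual choice when existing rows can also be updated.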

Interactive SQL Data Analytics

Main Objectives

• Speed up BI queries on Big Data stores
• Hide the complexity of the Big Data architecture from the end user and provide only one data connector for Spotfire
• Provide an interactive user experience
• Data virtualization (no need to systematically import RDBMS data for crossing data)

Spark SQL (1.6): the emerging solution for interactive SQL data analytics with the Data Source API.

[Diagram: interactive SQL data analytics and data processing applications reach HBase, Hive, Elasticsearch and Parquet files through Spark SQL and its DataSource API (load/insert). Annotations: only one data connector; add in-memory capability.]

Big Data In Action

Bi-modal Big Data Platform: one physical, cost-effective platform
• PRODUCTION: SLA & Monitoring
• DataLab: open to data exploration

Tenants: Tenant 1, Tenant 2, ... Tenant n

Platform topics: Multitenancy, High Availability, Security, Data Governance Policy, Continuous Delivery, Data Protection Hadoop, Organization, Release Management

Hadoop Global Data Life Cycle

Data Sources

Load or archive batch data

Stream real time data

Mask sensitive data with automated process

Renault Big Data Platform

Refine, curate, process, query data

Big Data Web Services

INGESTION

Scheduled and monitored by AITS in PROD

Policies pre-defined by Security Officer

ADA ARCA

subscription request to datasets

Data Access

AITS: Groups Management

DIRx Data Owner Validation

Defined in Ranger and Protegrity

Defined in Falcon

Active / Active Hadoop Platform

Rack Room 1 C2, Rack Room 2 C2

Data Nodes for Block Storage

HDFS PROD, HDFS POC

High Availability Architecture

Hadoop Security Levels

OS Security

Authorization

Perimeter Level Security: Protected Zone

Data Protection. Selected solution: Tokenization (Protegrity)

Tokenization Definition

Selected solution: Tokenization
• Tokenization is a form of data protection that converts sensitive data into fake data.
• The real data can be retrieved by authorized users.
• Protegrity: the only available solution for Hadoop (also supports traditional data systems).

Data Protection Key Requirements:
• Ability to de-identify personally identifiable data
• Restrict data access (financial data, data residency obligations, …)
• Provide central management and control of all data security operations

Identifier | Clear | Protected | Authorized Role 1* | Authorized Role 2*
Name | Joe Smith | csu wusoj | Joe Smith | Joe Smith
Address | 100 Main Street, Pleasantville, CA | 476 srta coetse, cysieondusbak, HA | 100 Main Street, Pleasantville, CA | "No Access"
Date of Birth | 12/25/1966 | 01/02/1966 | 12/25/1966 | 01/02/1966
VIN | VF1112C0000724284 | AB9875R8467364752 | VF1112C0000724284 | "No Access"
Credit Card Number | 3678 2289 3907 3378 | 3846 2290 3371 3890 | xxxx xxxx xxxx 3378 | 3846 2290 3371 3890
E-mail Address | joe.smith@surferdude.org | eoe.nwuer@beusorpdqo.aku | joe.smith@surferdude.org | joe.smith@surferdude.org
Telephone Number | 760-278-3389 | 998-389-2289 | 760-278-3389 | 998-389-2289

* Authorized Role 1 can see most data in the clear. Authorized Role 2 can see limited data in the clear.
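The role-based views in the table can be illustrated with a minimal sketch of vault-based tokenization. This is not Protegrity's implementation or API; it is a toy model under stated assumptions: the class `TokenVault`, the `POLICY` dictionary, the role names and the `view` function are all hypothetical. Tokens keep the shape of the original value (digits stay digits, letters stay letters), and a per-role, per-field rule decides whether a user sees the clear value, the token, or "No Access".

```python
import secrets
import string


def _random_like(ch):
    """Replace a character with a random one of the same class."""
    if ch in string.digits:
        return secrets.choice(string.digits)
    if ch in string.ascii_lowercase:
        return secrets.choice(string.ascii_lowercase)
    if ch in string.ascii_uppercase:
        return secrets.choice(string.ascii_uppercase)
    return ch  # keep separators such as spaces, dashes, '@'


class TokenVault:
    """Toy vault mapping clear values to shape-preserving tokens."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value):
        if value in self._token_by_value:
            return self._token_by_value[value]  # same input, same token
        for _ in range(100):
            token = "".join(_random_like(ch) for ch in value)
            if token != value and token not in self._value_by_token:
                break
        else:
            raise ValueError("could not generate a distinct token")
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token):
        return self._value_by_token[token]


# Hypothetical policy: rule per role and field; anything unlisted stays protected.
POLICY = {
    "role_1": {"name": "clear", "vin": "clear", "date_of_birth": "clear"},
    "role_2": {"name": "clear", "vin": "no_access"},
}


def view(vault, role, field, token):
    """Return what a given role is allowed to see for a tokenized field."""
    rule = POLICY.get(role, {}).get(field, "protected")
    if rule == "clear":
        return vault.detokenize(token)
    if rule == "no_access":
        return "No Access"
    return token


vault = TokenVault()
vin = "VF1112C0000724284"
vin_token = vault.tokenize(vin)
dob_token = vault.tokenize("12/25/1966")
```

With this policy, `view(vault, "role_1", "vin", vin_token)` returns the clear VIN, while role 2 gets "No Access" for the VIN and only the token for the date of birth.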

DATA PROTECTION Example 1: Fine-Grained Protection

DATA PROTECTION Example 2: Data Residency

The Next Step: HDFS FEDERATION

Federation
• Name Service 1 (/store1/): NameNode 1, NameNode 2
• Name Service 2 (/store2/): NameNode 3, NameNode 4
• DataNodes: DataNode 1 to DataNode 8

Scale-out Data Nodes

Resource usage orchestrated by YARN

Users of /store2:
• can see and access only /store2
• cannot access /store1
• cannot access NameNode 1 and 2

Users of /store1:
• can see and access only /store1
• cannot access /store2
• cannot access NameNode 3 and 4

Privileged users can access both /store1 and /store2 for crossing data.
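The access rules above can be modelled in a few lines. This sketch only illustrates the rules as stated on the slide; it is not how HDFS enforces them, since a real deployment would rely on mount tables, authorization policies and network controls. The user names and the `USER_MOUNTS` mapping are hypothetical.

```python
# Namespace layout taken from the slide: each store is served by two NameNodes.
NAMESPACES = {
    "/store1": {"NameNode 1", "NameNode 2"},
    "/store2": {"NameNode 3", "NameNode 4"},
}

# Hypothetical users: alice and bob are each limited to one store,
# carol is a privileged user allowed to cross data from both.
USER_MOUNTS = {
    "alice": {"/store1"},
    "bob": {"/store2"},
    "carol": {"/store1", "/store2"},
}


def namespace_of(path):
    """Return the store that owns a path, or None if no store matches."""
    for mount in NAMESPACES:
        if path == mount or path.startswith(mount + "/"):
            return mount
    return None


def can_access(user, path):
    """A user may access a path only if its store is in the user's mounts."""
    mount = namespace_of(path)
    return mount is not None and mount in USER_MOUNTS.get(user, set())


def reachable_namenodes(user):
    """NameNodes a user may contact, derived from the user's stores."""
    nodes = set()
    for mount in USER_MOUNTS.get(user, set()):
        nodes |= NAMESPACES[mount]
    return nodes
```

For example, `can_access("alice", "/store2/sales/2016.parquet")` is false, while the privileged user `carol` can read from both stores.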

Block Storage: Unreadable Data

DataNode 9

BIG DATA PROJECT PROCESS

Business Use Case DIRx

POC PROD Project

Data Exploration

Data Sources Ingestion

Security Policies

Data Processing development

User Access Authorization

DIA - Innovation

Front End deployment

Admin

architecture

development

Implementation

Deployment management

DATA CHANGE MANAGEMENT

• Publish a real-time dashboard of all data flows.

• For each data source, the producer and the consumers are displayed:
  • Producer: the DIRx generating the data
  • Consumers: all the Big Data applications using this data source

• If the producer decides to change the data source schema, they must send a notification from the dashboard to all consumers.

• Monitoring the logs of data ingestion jobs in real time makes failure detection faster and more efficient.
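The producer/consumer bookkeeping behind such a dashboard can be sketched as a small registry. This is an assumption about one possible design, not a description of the actual Renault tooling: the class `DataFlowRegistry`, its method names, and the producer and application names are hypothetical. SOPHIA is used only as an example data source name.

```python
from collections import defaultdict


class DataFlowRegistry:
    """Tracks, per data source, its producer and the applications consuming it."""

    def __init__(self):
        self.producer_of = {}
        self.consumers_of = defaultdict(set)
        self.inbox = defaultdict(list)  # notifications received per application

    def register_source(self, source, producer):
        self.producer_of[source] = producer

    def subscribe(self, source, application):
        if source not in self.producer_of:
            raise KeyError(f"unknown data source: {source}")
        self.consumers_of[source].add(application)

    def notify_schema_change(self, source, producer, message):
        """Only the registered producer may announce a schema change."""
        if self.producer_of.get(source) != producer:
            raise PermissionError(f"{producer} is not the producer of {source}")
        notified = sorted(self.consumers_of[source])
        for application in notified:
            self.inbox[application].append((source, message))
        return notified


registry = DataFlowRegistry()
registry.register_source("SOPHIA", producer="quality_department")
registry.subscribe("SOPHIA", "quality_dashboard")
registry.subscribe("SOPHIA", "warranty_alerts")

notified = registry.notify_schema_change(
    "SOPHIA", "quality_department", "column defect_code renamed to defect_id"
)
```

Keeping the subscription list next to the producer record is what lets a single notification reach every consuming application.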

COLLECT LOGS FOR (NRT) CARTO

Log and metadata sources:
• Sqoop Metastore (storing jobs)
• Hive Metastore (storing SQL schema)
• Oozie Metastore (storing workflows)
• Oozie logs
• YARN job history logs
• Hue database (storing projects)

Collection and display components: Logstash, plugin, Elasticsearch, Kibana, Dynamic Relationship Portal
