A company of Daimler AG

LECTURE @DHBW: DATA WAREHOUSE

PART VIII: DATA LAKE

ANDREAS BUCKENHOFER, DAIMLER TSS

ABOUT ME

https://de.linkedin.com/in/buckenhofer

https://twitter.com/ABuckenhofer

https://www.doag.org/de/themen/datenbank/in-memory/

http://wwwlehre.dhbw-stuttgart.de/~buckenhofer/

https://www.xing.com/profile/Andreas_Buckenhofer2

Andreas Buckenhofer

Senior DB [email protected]

Since 2009 at Daimler TSS

Department: Big Data

Business Unit: Analytics

ANDREAS BUCKENHOFER, DAIMLER TSS GMBH

Data Warehouse / DHBW, Daimler TSS

“Forming good abstractions and avoiding complexity is an essential part of a successful data architecture”

Data has always been my main focus throughout my long career in data integration. I work for Daimler TSS as a Database Professional and Data Architect with over 20 years of experience in Data Warehouse projects. I have been working with Hadoop and NoSQL since 2013. I keep my knowledge up to date, and I learn new things, experiment, and program every day.

I share my knowledge in internal presentations and as a speaker at international conferences. I regularly give a full lecture on Data Warehousing and a seminar on modern data architectures at Baden-Wuerttemberg Cooperative State University (DHBW). I also gained international experience through a two-year project in Greater London and several business trips to Asia.

I'm responsible for In-Memory DB Computing at the independent German Oracle User Group (DOAG) and was honored by Oracle as ACE Associate. I hold current certifications such as "Certified Data Vault 2.0 Practitioner (CDVP2)", "Big Data Architect", "Oracle Database 12c Administrator Certified Professional", "IBM InfoSphere Change Data Capture Technical Professional", etc.

As a 100% Daimler subsidiary, we give 100 percent, always and never less. We love IT and pull out all the stops to aid Daimler's development with our expertise on its journey into the future.

Our objective: We make Daimler the most innovative and digital mobility company.

NOT JUST AVERAGE: OUTSTANDING.

Daimler TSS

INTERNAL IT PARTNER FOR DAIMLER

+ Holistic solutions according to the Daimler guidelines

+ IT strategy

+ Security

+ Architecture

+ Developing and securing know-how

+ TSS is a partner who can be trusted with sensitive data

As subsidiary: maximum added value for Daimler

+ Market closeness

+ Independence

+ Flexibility (short decision-making process, ability to react quickly)

Daimler TSS

LOCATIONS

Daimler TSS China: Hub Beijing, 10 employees

Daimler TSS Malaysia: Hub Kuala Lumpur, 42 employees

Daimler TSS India: Hub Bangalore, 22 employees

Daimler TSS Germany: 7 locations, 1000 employees*

Ulm (Headquarters), Stuttgart, Berlin, Karlsruhe

* as of August 2017

WHAT YOU WILL LEARN TODAY

• By the end of this lecture you will be able to

• Understand the idea behind Data Lakes

LOGICAL STANDARD DATA WAREHOUSE ARCHITECTURE

[Diagram: Data Warehouse with backend and frontend. Internal data sources (OLTP) and external data sources feed the layers Staging Layer (Input Layer), Integration Layer (Cleansing Layer), Core Warehouse Layer (Storage Layer), Aggregation Layer, and Mart Layer (Output Layer / Reporting Layer). Further components: Metadata Management, Security, DWH Manager incl. Monitor.]

WHAT IS A DATA LAKE?

Dump anything in and wait?

Hoard hundreds of petabytes in HDFS?

Data Lake on Hadoop

Data Swamp

Data Reservoir

Landing Zone

Data Library

Data Repository

Data Archive

Data Lake on Spark

Data Lake 3.0

IT'S VERY HARD TO GET SPEED AND QUALITY

Schema-on-write

• RDBMS: create data model first

Schema-on-read

• Hadoop HDFS / NoSQL: create data model later (when reading data)

RDBMS can also work with schema-on-read. Hadoop can also work with schema-on-write.
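
The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustration, not tied to any RDBMS or Hadoop API: the field names (`vehicle_id`, `capacity_ah`) and all function names are hypothetical, and plain lists stand in for the storage systems.

```python
import json

# Hypothetical schema: field name -> expected Python type (an assumption for illustration)
SCHEMA = {"vehicle_id": str, "capacity_ah": float}


def conforms(record, schema):
    """Return True if every schema field is present with the expected type."""
    return all(isinstance(record.get(name), typ) for name, typ in schema.items())


def write_with_schema(store, record, schema=SCHEMA):
    """Schema-on-write: validate before storing and reject what does not fit."""
    if not conforms(record, schema):
        raise ValueError(f"record does not match schema: {record}")
    store.append(record)


def write_raw(store, line):
    """Schema-on-read, write side: keep the raw text untouched."""
    store.append(line)


def read_with_schema(store, schema=SCHEMA):
    """Schema-on-read, read side: parse and apply the schema only now."""
    rows = []
    for line in store:
        record = json.loads(line)
        if conforms(record, schema):
            # Project onto the schema; extra fields stay in the raw store
            rows.append({name: record[name] for name in schema})
    return rows


warehouse, lake = [], []
write_with_schema(warehouse, {"vehicle_id": "v1", "capacity_ah": 55.0})
write_raw(lake, '{"vehicle_id": "v1", "capacity_ah": 55.0, "firmware": "2.1"}')
write_raw(lake, '{"vin": "unknown format"}')
print(read_with_schema(lake))
```

In this sketch the cost of a bad record moves from the writer (an exception at load time) to the reader (a record that is silently skipped or has to be handled later), which is one way to see the speed versus quality trade-off.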

WHAT ARE THE DIFFERENCES BETWEEN HADOOP AND DWH?

Dumb question, but there are actually many comparisons like that at the moment

• Hadoop is a tool / technology (or even many tools), like an RDBMS

• DWH is an architecture and concept

• Architecture is abstraction and defines a goal

• Architecture vs tools / technology

DATA LAKE VS HADOOP

• Data Lake: architecture, concept

• Hadoop, Spark, Elastic Stack: tools (that can be used to implement a Lake)

DWH AND DATA LAKE

DWH on RDBMS (methods, concepts, techniques): Slowly Changing Dimension, ELT vs ETL, 3-Layer vs 2-Layer, Kimball Approach, Inmon Definition, Star Schema, Data Vault, Anchor Modeling, etc.

Data Lake on Hadoop (tools, tools, tools): Schema-on-Read, Agility, Parquet, Hive, HBase, SQL-on-Hadoop, Impala, Oozie, ZooKeeper

ASSUMPTIONS + REQUIREMENTS HAVE CHANGED

• The data is not always known in advance, so it can’t be modeled in advance. [data can be anywhere → collect everything approach]

• The data architecture must be read-write from both the back and front, not a one-way data flow. [vs Inmon]

• The data written back may be repeatedly used, persistent data, or it may be temporary.

• The data may arrive with any frequency, and the rate may not be under your control.

HOW DOES A DATA LAKE DIFFER FROM A DATA SWAMP?

• Naive idea: dump everything in (“landing zone”)

• Data hoarding is not a data management strategy

• A Data Lake brings in structure

• e.g. create directories in HDFS if Hadoop is used
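
As a minimal sketch of "bringing in structure" through directories, the following Python snippet builds a zone / source / date layout. Everything here is an assumption for illustration: the zone names, the source name `battery_sensors`, and the `ingest_date=` naming convention are hypothetical, and the local filesystem stands in for HDFS (on a Hadoop cluster the same paths could be created with `hdfs dfs -mkdir -p`).

```python
import tempfile
from datetime import date
from pathlib import Path

# Hypothetical zone names; a real lake defines its own
ZONES = ("raw", "enhanced", "consumption")


def zone_path(root, zone, source, day):
    """Build <root>/<zone>/<source>/ingest_date=<YYYY-MM-DD>."""
    if zone not in ZONES:
        raise ValueError(f"unknown zone: {zone}")
    return Path(root) / zone / source / f"ingest_date={day.isoformat()}"


def create_layout(root, sources, day):
    """Create one directory per zone and source; return the created paths."""
    created = []
    for zone in ZONES:
        for source in sources:
            path = zone_path(root, zone, source, day)
            path.mkdir(parents=True, exist_ok=True)
            created.append(path)
    return created


# Usage against a temporary directory, so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    for path in create_layout(tmp, ["battery_sensors"], date(2018, 1, 15)):
        print(path.relative_to(tmp))
```

A fixed, predictable layout like this is what lets later jobs find data by zone, source, and load date instead of searching an undifferentiated dump.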

WHAT IS A DATA LAKE?

SEPARATE COLLECT / MANAGE / DELIVERY

ZONES INSTEAD OF LAYERS

New data of unknown value, simple requests for new data can land here first, with little work by IT. Typically schema-on-read.

More effort applied to management, slower.

Optimized for specific uses / workloads. Generally the slowest change. Typically schema-on-write.

WHAT IS A DATA LAKE?

• No agreed, standardized definition

• Additionally, there are many more buzzwords like Landing Zone, Data Repository, and Data Swamp

• Characteristics of a Data Lake architecture according to Madsen:

• Deals with data and schema change easily

• Does not always require up-front modeling

• Does not limit the format or structure of data

• Assumes a full range of data latencies, from streaming to one-time bulk loads, both in and out including write-back

• Supports different uses of the same data

SCHEMA-ON-WRITE VS SCHEMA-ON-READ REVISITED

Old approach: Model, then Collect, then Analyze

New approach: Collect, then Model, then Analyze

Promote

DATA LAKE (MARTIN FOWLER)

Source: https://martinfowler.com/bliki/DataLake.html

WHAT IS A DATA LAKE?

A Data Lake acquires data from multiple sources in an enterprise in its native form and may also have internal, modeled forms of this same data for various purposes. The information thus handled could be any type of information, ranging from structured or semi-structured data to completely unstructured data. A Data Lake is expected to be able to derive enterprise-relevant meanings and insights from this information using various analysis and machine learning algorithms.

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises, Packt 2017

DATA LAKE LIFE CYCLE

Source: Pankaj Misra, Tomcy John: Data Lake for Enterprises Packt 2017

USE CASE: ANALYSIS BATTERY AGING

[Chart legend: Max capacity, Current capacity]

• JSON data ingested into HDFS, Hive tables on JSON files

• Identify breaks (“> 8h”) and compute current drain
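
The break detection could look roughly like the sketch below. It rests on several assumptions that the slide does not state: readings are taken as `(timestamp, charge in Ah)` pairs already sorted by time, a "break" is read as a gap of more than 8 hours between consecutive readings, and "current drain" is interpreted as the charge lost during the break divided by its duration, which gives an average current in amperes. All names are hypothetical.

```python
from datetime import datetime, timedelta

# Threshold taken from the slide ("> 8h")
BREAK_THRESHOLD = timedelta(hours=8)


def find_breaks(readings, threshold=BREAK_THRESHOLD):
    """Find gaps longer than `threshold` between consecutive readings.

    `readings` is a list of (timestamp, charge_ah) tuples sorted by timestamp.
    """
    breaks = []
    for (t0, charge0), (t1, charge1) in zip(readings, readings[1:]):
        gap = t1 - t0
        if gap > threshold:
            hours = gap.total_seconds() / 3600
            drain_ah = charge0 - charge1
            breaks.append({
                "start": t0,
                "end": t1,
                "hours": hours,
                "drain_ah": drain_ah,
                # Ah divided by h gives the average current in A
                "avg_current_a": drain_ah / hours,
            })
    return breaks


# Made-up sample data: one 12-hour break followed by a 30-minute gap
readings = [
    (datetime(2018, 1, 1, 18, 0), 60.0),
    (datetime(2018, 1, 2, 6, 0), 59.4),
    (datetime(2018, 1, 2, 6, 30), 59.3),
]
for b in find_breaks(readings):
    print(b["start"], b["end"], round(b["avg_current_a"], 3))
```

In the setup described on the slide, the same logic would more likely be expressed as a query over the Hive tables; the Python version only shows the idea.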

STRUCTURING THE DATA LAKE

NEW DATA SOURCES – SENSOR DATA

• Sensor data formats change without notice

• Sensors are regularly updated with new versions

• Names of metrics may change

• Sensors with various versions are in the field

• Sensors come from different suppliers

• Often many fields (>> 100), increasing with new sensor versions

• Storing data in HDFS is easy; the schema is applied later

• Data from robots, vehicles, …
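
One way to cope with metric names that differ between sensor versions and suppliers is an alias table applied at read time. The sketch below is an assumption-laden illustration: the canonical names, the aliases, and the sample record are all made up, and unknown fields are kept aside rather than dropped so that new sensor versions do not lose data.

```python
import json

# Hypothetical alias table: canonical metric name -> names seen in the field
ALIASES = {
    "capacity_ah": ("capacity_ah", "cap", "batteryCapacity"),
    "temperature_c": ("temperature_c", "temp", "tempCelsius"),
}


def normalize(raw_line, aliases=ALIASES):
    """Map a raw JSON record onto canonical names.

    Returns (known, unknown): fields with a canonical name, and fields
    that no alias covers yet.
    """
    lookup = {alias: canonical
              for canonical, names in aliases.items()
              for alias in names}
    known, unknown = {}, {}
    for key, value in json.loads(raw_line).items():
        if key in lookup:
            known[lookup[key]] = value
        else:
            unknown[key] = value
    return known, unknown


known, unknown = normalize('{"cap": 55.0, "tempCelsius": 21.5, "fw": "2.1"}')
print(known)
print(unknown)
```

Because the raw JSON stays untouched in storage, the alias table can be extended when a new sensor version appears and older data can be re-read with the updated mapping.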

STRUCTURING THE DATA LAKE

“SCHEMA-ON-READ”

• Sensor data formats change without notice

• Data integration into the Data Lake is time-consuming and error-prone

• Therefore data must be prepared for usage in the Data Reservoir: “Data Engineer”

[Diagram labels: Raw data, Enhanced data, Consumption; json, Structure, Hive tables, Sampling / filter, Hive tables, R, Python; Data Governance, Metadata Management, Data Archival, Data Security]

DATA VAULT 2.0 ARCHITECTURE – TODAY’S WORLD (DAN LINSTEDT)

https://www.youtube.com/watch?v=tDNjI1Yvqxw

DEFINING A DATA LAKE … BY DAN LINSTEDT

DATA LAKE TURNED INTO DATA SWAMP

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.12

DATA LAKE

Source: Ungerer: Cleaning Up the Data Lake with an Operational Data Hub, O’Reilly Media 2018, p.11

DATA LAKE (ECKERSON GROUP)

Source: https://www.eckerson.com/articles/ten-characteristics-of-a-modern-data-architecture

THANK YOU

Daimler TSS GmbH

Wilhelm-Runge-Straße 11, 89081 Ulm / Phone +49 731 505-06 / Fax +49 731 505-65 99

[email protected] / Internet: www.daimler-tss.com / Intranet-Portal-Code: @TSS

Domicile and Court of Registry: Ulm / HRB-Nr.: 3844 / Management: Christoph Röger (CEO), Steffen Bäuerle