Milos Milovanovic, Co-Founder & Data Engineer @ Things [email protected] @datascience.rs
Planning and Optimizing Data Lake Architecture
Agenda
Introduction - Business Data Requirements
What Is a Data Lake?
A Common Data Lake Architecture
When Problems Start To Show Up - Optimizing Data Lake
Expanding a Data Lake
How To Plan a Data Lake - Success Factors
Introduction - Business Data Requirements
The main goal for organizations is to adapt and put all of their data to use.
It’s not an easy task - it may require mindset and structural changes.
Flexibility and agility are required for success.
Various trends and buzzwords make it hard to stay on track.
The Challenge of Transforming Enterprise Data Management
“The data lake is a foundational component and common denominator of the modern data architecture, enabling and complementing specialized components, such as enterprise data warehouses, discovery-oriented environments, and highly-specialized analytic or operational data technologies…”
- John O’Brien, CEO @ Radiant Advisors
Data Lake - The Very First Definition
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, CTO @ Pentaho
A More Formal Definition
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
Data Warehouse & Data Lake by Example
Social media streaming analytics can be implemented using a traditional Data Warehouse
… but such an application would be too restricted and inflexible (e.g., when extending the number of columns analyzed).
Using a Data Lake for this purpose gives us the flexibility to adapt and test new metrics
… and we can easily add new applications on top.
A Common Data Lake Implementation Architecture
❏ In general, the architecture of a data lake is simple: a Hadoop File System (HDFS) with lots of directories and files on it.
❏ Hadoop is usually in the center of Data Lake Architecture, although the concept is broader than Hadoop.
❏ Hadoop’s scalable, low-cost persistence layer and its ability to perform big data processing and analytics is a great toolset to achieve measurable business value opportunities at speed and low cost.
❏ Hive and Spark provide rich analytics on top of the data that is persisted at low cost.
This Architecture:
❏ Acts like SQL
❏ Is efficient and scalable
❏ Connects to basically anything
❏ Supports different processing modes (real-time, batch, pipelines, machine learning, ad hoc analysis, …)
[Diagram: Data Sources → Hadoop Distributed File System → Hive and Spark]
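The HDFS + Hive/Spark pattern above can be sketched in HiveQL. This is a minimal, hypothetical example (the path, table, and column names are assumptions, not from the talk): raw files already sitting on HDFS are exposed as a queryable table without being moved or converted, which is the schema-on-read idea from the data lake definition.

```sql
-- Hypothetical sketch: expose raw tab-delimited files on HDFS as a table.
-- The files stay in their native format; the schema is applied at read time.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/lake/raw/events/';

-- Rich SQL analytics on top, via Hive or Spark SQL:
SELECT user_id, count(*) AS events
FROM raw_events
GROUP BY user_id;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the raw files remain in the lake for other applications.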
When Problems Show Up
Hadoop + Spark/Hive != Database
- Searching for a row within TBs of data:
select * from my_table where some_column like '%123asd%';
- No updates and deletes
- Too many concurrent requests from BI tools
Spark best practice: http://go.databricks.com/not-your-fathers-database
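The LIKE query above forces a full scan of every file in the table. One common mitigation is to lay the data out in partitions so the engine can prune most of it before scanning. A hedged sketch, with hypothetical table and column names:

```sql
-- Hypothetical partitioned layout: one directory per day on HDFS.
CREATE TABLE events_by_day (
  user_id STRING,
  text    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- A date-bounded query now reads only the matching partition's files,
-- instead of scanning TBs of data across the whole table:
SELECT *
FROM events_by_day
WHERE event_date = '2016-05-01'
  AND text LIKE '%123asd%';
```

Partitioning does not make Hadoop a database - updates, deletes, and high concurrency remain hard - but it turns many full scans into scans of a small slice of the data.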
How Do We Optimize Such a Solution?
❏ Use the ORC file format
❏ File compaction (small files, deduplication)
❏ Run Spark on YARN
❏ Use Spark DataFrames
❏ Data caching
❏ Use traditional databases
❏ Extend the toolset (Solr, ES, Kafka, Redis, …)
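The first two items can be combined in a single pass. A minimal sketch (the table names are hypothetical): rewriting a raw text table as ORC converts it to a compressed, columnar format, and because the rewrite is one bulk job it naturally compacts many small files into fewer, larger ones; a DISTINCT in the same statement handles deduplication.

```sql
-- Hypothetical sketch: convert raw text data to compressed ORC,
-- compacting small files and deduplicating rows in one pass.
CREATE TABLE events_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS
SELECT DISTINCT *
FROM raw_events;
```

Downstream queries then read events_orc instead of the raw table, benefiting from columnar reads and predicate pushdown.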
How To Start With The Data Lake?
❏ Think of the use cases (don’t plan all the use cases - have some in mind)
❏ Master the technology
❏ Go agile and flexible
❏ Do not forget about data governance, data quality, and security (but do not drown in this)
❏ Integrate with BI and DWH