Milos Milovanovic, Co-Founder & Data Engineer @ Things [email protected] @datascience.rs
Planning and Optimizing Data Lake Architecture
Agenda
Introduction - Business Data Requirements
What Is a Data Lake?
A Common Data Lake Architecture
When Problems Start To Show Up - Optimizing Data Lake
Expanding a Data Lake
How To Plan a Data Lake - Success Factors
Introduction - Business Data Requirements
The main goal for organizations is to adapt and put all of their data to use.
It’s not an easy task - it may require mindset and structural changes.
Flexibility and agility are required for success.
Various trends and buzzwords make it hard to stay on track.
The Challenge of Transforming Enterprise Data Management
“The data lake is a foundational component and common denominator of the modern data architecture, enabling and complementing specialized components, such as enterprise data warehouses, discovery-oriented environments, and highly-specialized analytic or operational data technologies…”
- John O’Brien, CEO @ Radiant Advisors
Data Lake - The Very First Definition
“If you think of a datamart as a store of bottled water – cleansed and packaged and structured for easy consumption – the data lake is a large body of water in a more natural state. The contents of the data lake stream in from a source to fill the lake, and various users of the lake can come to examine, dive in, or take samples.”
- James Dixon, CTO @ Pentaho
A More Formal Definition
“A data lake is a storage repository that holds a vast amount of raw data in its native format, including structured, semi-structured, and unstructured data. The data structure and requirements are not defined until the data is needed.”
Data Warehouse & Data Lake by Example
Social media streaming analytics can be implemented using a traditional Data Warehouse
… but such an application would be too restricted and inflexible (e.g., when extending the number of columns analyzed).
Using a Data Lake for this purpose gives us the flexibility to adapt and test new metrics
… and we can easily add new applications on top.
A Common Data Lake Implementation Architecture
❏ In general, the architecture of a data lake is simple: a Hadoop File System (HDFS) with lots of directories and files on it.
❏ Hadoop is usually in the center of Data Lake Architecture, although the concept is broader than Hadoop.
❏ Hadoop’s scalable, low-cost persistence layer and its ability to perform big data processing and analytics is a great toolset to achieve measurable business value opportunities at speed and low cost.
❏ Hive and Spark provide rich analytics on top of the data that is persisted at low cost.
This Architecture:
❏ Acts like SQL
❏ Is efficient and scalable
❏ Connects to basically anything
❏ Supports different processing modes (real-time, batch, pipelines, machine learning, ad hoc analysis, …)
[Diagram: Data Sources → Hadoop Distributed File System → Hive and Spark]
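The HDFS + Hive/Spark pattern above can be sketched in HiveQL. This is a minimal, hypothetical example (the path, table, and column names are assumptions, not from the talk): raw files already sitting on HDFS are exposed as a queryable table without being moved or converted, which is the schema-on-read idea from the data lake definition.

```sql
-- Hypothetical sketch: expose raw tab-delimited files on HDFS as a table.
-- The files stay in their native format; the schema is applied at read time.
CREATE EXTERNAL TABLE raw_events (
  event_time STRING,
  user_id    STRING,
  payload    STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 'hdfs:///data/lake/raw/events/';

-- Rich SQL analytics on top, via Hive or Spark SQL:
SELECT user_id, count(*) AS events
FROM raw_events
GROUP BY user_id;
```

Because the table is EXTERNAL, dropping it removes only the metadata; the raw files remain in the lake for other applications.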
When Problems Show Up
Hadoop + Spark/Hive != Database
- Searching for a row within TBs of data:
select * from my_table where some_column like '%123asd%';
- No updates and deletes
- Too many concurrent requests from BI tools
Spark best practice: http://go.databricks.com/not-your-fathers-database
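The LIKE query above forces a full scan of every file in the table. One common mitigation is to lay the data out in partitions so the engine can prune most of it before scanning. A hedged sketch, with hypothetical table and column names:

```sql
-- Hypothetical partitioned layout: one directory per day on HDFS.
CREATE TABLE events_by_day (
  user_id STRING,
  text    STRING
)
PARTITIONED BY (event_date STRING)
STORED AS ORC;

-- A date-bounded query now reads only the matching partition's files,
-- instead of scanning TBs of data across the whole table:
SELECT *
FROM events_by_day
WHERE event_date = '2016-05-01'
  AND text LIKE '%123asd%';
```

Partitioning does not make Hadoop a database - updates, deletes, and high concurrency remain hard - but it turns many full scans into scans of a small slice of the data.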
How Do We Optimize Such a Solution?
❏ Use the ORC file format
❏ File compaction (small files, deduplication)
❏ Run Spark on YARN
❏ Use Spark DataFrames
❏ Data caching
❏ Use traditional databases
❏ Extend the toolset (Solr, ES, Kafka, Redis, …)
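The first two items can be combined in a single pass. A minimal sketch (the table names are hypothetical): rewriting a raw text table as ORC converts it to a compressed, columnar format, and because the rewrite is one bulk job it naturally compacts many small files into fewer, larger ones; a DISTINCT in the same statement handles deduplication.

```sql
-- Hypothetical sketch: convert raw text data to compressed ORC,
-- compacting small files and deduplicating rows in one pass.
CREATE TABLE events_orc
STORED AS ORC
TBLPROPERTIES ('orc.compress' = 'SNAPPY')
AS
SELECT DISTINCT *
FROM raw_events;
```

Downstream queries then read events_orc instead of the raw table, benefiting from columnar reads and predicate pushdown.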
How To Start With The Data Lake?
❏ Think of the use cases (don’t plan all the use cases - have some in mind)
❏ Master the technology
❏ Go agile and flexible
❏ Do not forget about data governance, data quality, and security (but do not drown in this)
❏ Integrate with BI and DWH