Immersion Day Building a Data Lake on AWS


Transcript of Immersion Day Building a Data Lake on AWS

© 2021, Amazon Web Services, Inc. or its Affiliates.

Jason Moldan, Solutions Architect

Norman Owens, Solutions Architect

Arun Shanmugam, Data Architect

Sreeji Gopal, Data Architect

Matt Atwater, Principal Architect, ClearScale

Immersion Day – Building a Data Lake on AWS


Workshop Agenda

1. Overview of the Data Lake
2. Hydrating the Data Lake
Lab1 – Hydrating the Data Lake with DMS
3. Working Within the Data Lake
Lab2 – ETL with AWS Glue
Lab3 – Querying the Data Lake with Amazon Athena and Amazon QuickSight
4. Consuming the Data Lake – Reporting, Analytics, Machine Learning
5. Introduction to DataBrew


What is a Data Lake?

• A centralized repository for both structured and unstructured data
• Store data as-is in open-source file formats to enable direct analytics


Why a Data Lake?

• Decouple storage from compute, allowing you to scale each independently
• Enable advanced analytics across all of your data sources
• Reduce complexity in ETL and operational overhead
• Future extensibility as new database and analytics technologies are invented


Traditionally, Analytics Looked Like This

Sources: OLTP, ERP, CRM, LOB
Data Warehouse
Business Intelligence

• Relational Data
• TBs-PBs Scale
• Schema Defined Prior to Data Load
• Operational and Ad Hoc Reporting
• Large Initial Capex + $$K / TB / Year


Data Lakes Extend the Traditional Approach

Sources: OLTP, ERP, CRM, LOB, Web, Sensors, Social, Devices
Data Lake: Catalog, DW Queries, Big Data Processing, Interactive, Real-Time
Business Intelligence, Machine Learning

• All Data in one place, a Single Source of Truth
• Relational and Non-Relational Data
• TB-EBs Scale
• Decouples (low cost) Storage and Compute
• Schema on Read
• Diverse Analytical Engines


Store Data in the Format You Want – Open and comprehensive

Amazon S3, Amazon Glacier, AWS Glue

• Store data in the format you want:
• Text files like CSV
• Columnar like Apache Parquet and Apache ORC
• Logstash patterns like Grok
• JSON (simple, nested), Avro
• And more…
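The difference between row-oriented text files and columnar formats can be sketched in plain Python. This is only a toy model: the records and field names are made up, and real Parquet or ORC files add encoding, compression, and metadata that the sketch ignores.

```python
# Toy comparison of row-oriented and column-oriented layouts.
# The records and field names are hypothetical, for illustration only.

rows = [
    {"order_id": 1, "region": "east", "amount": 20.0},
    {"order_id": 2, "region": "west", "amount": 35.5},
    {"order_id": 3, "region": "east", "amount": 12.25},
]


def to_columns(records):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


def values_touched_row_layout(records):
    """A row layout reads every field of every row, whatever column is wanted."""
    return sum(len(record) for record in records)


def values_touched_column_layout(columns, wanted):
    """A column layout reads only the values of the requested column."""
    return len(columns[wanted])


columns = to_columns(rows)
total_amount = sum(columns["amount"])
row_reads = values_touched_row_layout(rows)
column_reads = values_touched_column_layout(columns, "amount")
print(total_amount, row_reads, column_reads)
```

Summing one field touches nine values in the row layout but only three in the column layout, which is the intuition behind choosing a columnar format for analytic scans.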


Any Scale – Scalable and durable

• S3 holds trillions of objects and exabytes of data
• Built to store any amount of data
• Run analytics engines at the largest scale by spinning up any amount of compute resources in minutes
• Runs on the world’s largest global cloud infrastructure


Why Amazon S3 for a Data Lake?

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance
• Multipart upload
• Range GET

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
• Amazon SageMaker
• Many more

Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
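The durability and availability design targets listed for Amazon S3 can be turned into rough numbers. The sketch below is plain arithmetic, assuming the 11 9s figure is read as an annual per-object durability and that the bucket size (ten million objects) is a hypothetical example, not a figure from this deck.

```python
from fractions import Fraction

# Design targets quoted on the slide. Exact fractions avoid float rounding.
DURABILITY = 1 - Fraction(1, 10**11)   # 11 nines: 99.999999999%
AVAILABILITY = Fraction(9999, 10000)   # 99.99%


def expected_objects_lost_per_year(object_count):
    """Expected objects lost in one year, assuming the durability figure
    applies to each object over a given year."""
    return object_count * (1 - DURABILITY)


def downtime_minutes_per_year(days=365):
    """Minutes per year that fall outside the availability design target."""
    return float((1 - AVAILABILITY) * days * 24 * 60)


# Hypothetical bucket holding ten million objects.
loss_rate = expected_objects_lost_per_year(10_000_000)
years_per_lost_object = 1 / loss_rate
print(years_per_lost_object)
print(downtime_minutes_per_year())
```

Under these assumptions, ten million objects correspond to an expected loss of one object every 10,000 years, and 99.99% availability corresponds to about 52.56 minutes per year. These are design targets, not measured results.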


Pay Only for the Resources You Use as You Scale – Lowest Cost

• Pay-as-you-go for the resources you consume

• As low as $5 per TB scanned with Athena

• EMR and Athena can automatically scale down resources after a job completes, reducing your costs
• Commit to a set term and save up to 75% with Reserved Instances
• Run on spare compute capacity with EMR and save up to 90% with Spot
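As a rough illustration of the "up to" savings figures above, the following sketch applies them to a made-up workload. The node count, hours, and the $1.00 per node-hour On-Demand rate are assumptions chosen for easy arithmetic, not AWS prices, and the percentages are ceilings rather than guaranteed discounts.

```python
# Savings ceilings quoted on the slide, as whole percentages.
MAX_RESERVED_SAVINGS_PCT = 75
MAX_SPOT_SAVINGS_PCT = 90


def discounted_cost(on_demand_cost, savings_pct):
    """Cost after applying a percentage saving to an On-Demand cost."""
    if not 0 <= savings_pct <= 100:
        raise ValueError("savings_pct must be between 0 and 100")
    return on_demand_cost * (100 - savings_pct) / 100


# Hypothetical workload: 10 nodes for 100 hours at an assumed
# $1.00 per node-hour On-Demand rate (not a real AWS price).
on_demand = 10 * 100 * 1.00
best_case_reserved = discounted_cost(on_demand, MAX_RESERVED_SAVINGS_PCT)
best_case_spot = discounted_cost(on_demand, MAX_SPOT_SAVINGS_PCT)
print(on_demand, best_case_reserved, best_case_spot)
```

With these assumed inputs, the $1,000 On-Demand bill drops to $250 in the best Reserved case and $100 in the best Spot case.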

Traditional: Rigid
• Traditional approach leads to wasted capacity
• Excess capacity: wasted $$$
• Unmet demand: upset players, missed revenue

AWS: Elastic
• AWS approach: pay for the capacity you use
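The rigid-versus-elastic comparison can be put into numbers with a small sketch. The hourly demand series and the fixed fleet size below are hypothetical, and the elastic model is idealized: it assumes capacity tracks demand exactly, with no scaling delay.

```python
# Hypothetical hourly demand, in servers needed. Not real workload data.
demand = [2, 3, 8, 12, 6, 3]
FIXED_CAPACITY = 8  # servers provisioned up front in the rigid model


def rigid_model(demand, capacity):
    """Return (excess, unmet) server-hours for a fixed-size fleet."""
    excess = sum(max(capacity - d, 0) for d in demand)
    unmet = sum(max(d - capacity, 0) for d in demand)
    return excess, unmet


def server_hours_paid(demand, capacity=None):
    """Server-hours billed: fixed fleet if capacity is given, else elastic."""
    if capacity is None:
        return sum(demand)           # elastic: capacity tracks demand
    return capacity * len(demand)    # rigid: pay for the fleet every hour


excess, unmet = rigid_model(demand, FIXED_CAPACITY)
rigid_hours = server_hours_paid(demand, FIXED_CAPACITY)
elastic_hours = server_hours_paid(demand)
print(excess, unmet, rigid_hours, elastic_hours)
```

For this series the fixed fleet pays for 48 server-hours, wastes 18 of them, and still leaves 4 server-hours of demand unmet, while the idealized elastic model pays for exactly the 34 server-hours used.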


Lowest Total Cost of Ownership (TCO) – Cost-effective

• Less admin time to manage and support
• No up-front costs: hardware acquisition, installation
• Save on operating costs: data center space, power, cooling
• Business value: cost of delays, risk premium, competitive abilities, governance, etc.


Typical steps for building a data lake

Implementing a Data Lake architecture requires a broad set of tools and technologies to serve an increasingly diverse set of applications and use cases.

1. Set up storage
2. Move data
3. Cleanse, prep, and catalog data
4. Configure and enforce security and compliance policies
5. Make data available for analytics

Processing & Analytics

Real-time Analytics, AI & Predictive, BI & Data Visualization, Transactional & RDBMS

• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Data Analytics, Kinesis Data Streams
• DynamoDB (NoSQL DB)
• Aurora (Relational Database)
• EMR (Hadoop, Spark, Presto)
• Redshift (Data Warehouse)
• Athena (Query Service)
• Amazon Lex (Speech recognition)
• Amazon Rekognition
• Amazon Polly (Text to speech)
• Machine Learning (Predictive analytics)
• SageMaker


Data Lake on AWS

Data Ingestion
• AWS Snowball
• AWS Storage Gateway
• Amazon Kinesis Data Firehose
• AWS Direct Connect
• AWS Database Migration Service

Central Storage
• S3
• Scalable, secure, cost-effective
• Diagram labels: Temp, SST, SVT, ETL, Enrich


Data Lake on AWS

Data Ingestion
• AWS Snowball
• AWS Storage Gateway
• Amazon Kinesis Data Firehose
• AWS Direct Connect
• AWS Database Migration Service

Central Storage
• S3
• Scalable, secure, cost-effective

Catalog & Search
• Amazon DynamoDB
• Amazon Elasticsearch Service
• AWS Glue

Analytics & Serving
• Amazon Athena
• Amazon EMR
• AWS Glue
• Amazon Redshift
• Amazon DynamoDB
• Amazon QuickSight
• Amazon Kinesis
• Amazon Elasticsearch Service
• Amazon Neptune
• Amazon RDS

Access & User Interfaces
• AWS AppSync
• Amazon API Gateway
• Amazon Cognito

Manage & Secure
• AWS KMS
• AWS CloudTrail
• AWS IAM
• Amazon CloudWatch




Data Lake Architectures on AWS

• Scalable data lakes
• Purpose-built data services
• Seamless data movement
• Unified governance
• Performant and cost-effective

• Amazon S3
• Amazon Athena
• Amazon DynamoDB: Non-relational databases
• Amazon SageMaker: Machine learning
• Amazon Redshift: Data warehousing
• Amazon Elasticsearch Service: Log analytics
• Amazon EMR: Big data processing
• Amazon Aurora: Relational databases


Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data, from all of your sources, in one centralized location.

“Why is the data distributed in many locations? Where is the single source of truth?”


Benefits of a Data Lake – Quick Ingest

Quickly ingest data without needing to force it into a pre-defined schema.

“How can I collect data quickly from various sources and store it efficiently?”
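Ingesting without a pre-defined schema can be sketched as appending records exactly as they arrive. The example below is a minimal illustration, assuming a local temporary directory as a stand-in for an S3 bucket; the `raw/<source>/dt=<day>/` layout, the file name, and the sample events are hypothetical choices, not part of the workshop.

```python
import json
import tempfile
from pathlib import Path


def ingest_raw(records, lake_root, source, day):
    """Append records as-is, one JSON document per line, under a
    source/day prefix. No schema is checked or enforced on write."""
    target = Path(lake_root) / "raw" / source / f"dt={day}" / "part-0000.jsonl"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("a", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record) + "\n")
    return target


# Hypothetical events with differing fields, as they might arrive.
events = [
    {"device": "sensor-1", "temp_c": 21.5},
    {"device": "sensor-2", "humidity": 40, "firmware": "1.2"},
    {"user": "u-17", "action": "click"},
]

lake_root = tempfile.mkdtemp()  # local stand-in for an S3 bucket
path = ingest_raw(events, lake_root, source="events", day="2021-01-01")
stored = [json.loads(line)
          for line in path.read_text(encoding="utf-8").splitlines()]
print(len(stored), path.parent.name)
```

All three events land in the same file even though no two share the same fields; deciding how to interpret them is deferred to whoever reads the data.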


Benefits of a Data Lake – Storage vs Compute

Separating your storage and compute allows you to scale each component as required.

“How can I scale up with the volume of data being generated?”


Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
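Schema on read can be illustrated with a short sketch in which two consumers apply different schemas to the same stored text. The CSV content, the column names, and the two schemas are hypothetical; engines such as Athena or EMR do this at far larger scale and with their own table definitions.

```python
import csv
import io

# Raw data stored as-is; on disk every value is just text.
RAW = "id,ts,amount\n1,2021-01-01,19.99\n2,2021-01-02,5.00\n"


def read_with_schema(raw_text, schema):
    """Parse raw CSV text, keeping only the fields named in schema and
    casting each value with the function the schema supplies."""
    reader = csv.DictReader(io.StringIO(raw_text))
    return [{name: cast(row[name]) for name, cast in schema.items()}
            for row in reader]


# Two hypothetical consumers apply different schemas to the same bytes.
finance_schema = {"id": int, "amount": float}
audit_schema = {"id": str, "ts": str}

finance_rows = read_with_schema(RAW, finance_schema)
audit_rows = read_with_schema(RAW, audit_schema)
print(finance_rows)
print(audit_rows)
```

The stored data never changes; only the interpretation applied at read time differs, which is what lets several analytics frameworks share one copy of the data.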


What can you do with a Data Lake?


Query Directly with Amazon Athena


Analyze with Hadoop on Amazon EMR


Create Visualizations with Amazon QuickSight


Train ML Models with Amazon SageMaker


Create a Central Data Catalog with AWS Glue


Load into Downstream Services

• Amazon Redshift: Run complex analytic queries against petabytes of structured data.
• Amazon DynamoDB: A NoSQL database service that delivers consistent, single-digit millisecond latency at any scale.
• Amazon Aurora: A MySQL and PostgreSQL compatible relational database built for the cloud.
• Amazon Elasticsearch Service: Delivers Elasticsearch’s real-time analytics capabilities alongside the availability, scalability, and security that production workloads require.
• Amazon SageMaker: A fully managed service that provides every developer and data scientist with the ability to build, train, and deploy machine learning (ML) models quickly.


Thank You!