(BDT317) Building A Data Lake On AWS


Transcript of (BDT317) Building A Data Lake On AWS

Page 1: (BDT317) Building A Data Lake On AWS

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Ian Meyers, Principal Solution Architect, AWS

October 2015

BDT317

Building Your Data Lake on AWS

Page 2: (BDT317) Building A Data Lake On AWS
Page 3: (BDT317) Building A Data Lake On AWS

Benefits of the Enterprise Data Warehouse

• Self-documenting schema

• Enforced data types

• Ubiquitous and common security model

• Simple tools to access, robust ecosystem

• Transactionality

Page 4: (BDT317) Building A Data Lake On AWS

[Diagram: STORAGE and COMPUTE coupled in a single system]

Page 5: (BDT317) Building A Data Lake On AWS

But customers have additional requirements…

Page 6: (BDT317) Building A Data Lake On AWS

The Rise of “Big Data”

[Diagram: the enterprise data warehouse alongside Amazon EMR and Amazon S3]

Page 7: (BDT317) Building A Data Lake On AWS

[Diagram: one shared STORAGE layer serving many independent COMPUTE clusters]

Page 8: (BDT317) Building A Data Lake On AWS

Benefits of Separation of Compute & Storage

• All your data, without paying for unused cores

• Independent cost attribution per dataset

• Use the right tool for a job, at the right time

• Increased durability without operations

• Common model for data, without enforcing an access method

Page 9: (BDT317) Building A Data Lake On AWS

Comparison of a Data Lake to an Enterprise Data Warehouse

Data Lake (EMR + S3) | Enterprise Data Warehouse
Complementary to the EDW (not a replacement) | The data lake can be a source for the EDW
Schema on read (no predefined schemas) | Schema on write (predefined schemas)
Structured/semi-structured/unstructured data | Structured data only
Fast ingestion of new data/content | Time-consuming to introduce new content
Data science + prediction/advanced analytics + BI use cases | BI use cases only (no prediction/advanced analytics)
Data at a low level of detail/granularity | Data at a summary/aggregated level of detail
Loosely defined SLAs | Tight SLAs (production schedules)
Flexibility in tools (open source/tools for advanced analytics) | Limited flexibility in tools (SQL only)

Page 10: (BDT317) Building A Data Lake On AWS

The New Problem

[Diagram: the enterprise data warehouse alongside EMR + S3, with users asking:]

• "Which system has my data?"
• "How can I do machine learning against the DW?"
• "I built this in Hive, can we get it into the Finance reports?"
• "These sources are giving different results…"
• "But I implemented the algorithm in Anaconda…"

Page 11: (BDT317) Building A Data Lake On AWS

Dive Into The Data Lake

EMR + S3 ≠ Enterprise data warehouse

Page 12: (BDT317) Building A Data Lake On AWS

Dive Into The Data Lake

[Diagram: the data lake (EMR + S3) loads cleansed data into the enterprise data warehouse, which exports computed aggregates back to the lake]

Data lake (EMR + S3):
• Ingest any data
• Data cleansing
• Data catalogue
• Trend analysis
• Machine learning

Enterprise data warehouse:
• Structured analysis
• Common access tools
• Efficient aggregation
• Structured business rules

Page 13: (BDT317) Building A Data Lake On AWS

Components of a Data Lake

Data Storage

• High durability

• Stores raw data from input sources

• Support for any type of data

• Low cost

Streaming

• Streaming ingest of feed data

• Provides the ability to consume any dataset as a stream

• Facilitates low-latency analytics

[Component stack: Storage & Streams · Catalogue & Search · Entitlements · API & UI]

Page 14: (BDT317) Building A Data Lake On AWS

Components of a Data Lake

Catalogue

• Metadata lake

• Used for summary statistics and data classification management

Search

• Simplified access model for data discovery

Page 15: (BDT317) Building A Data Lake On AWS

Components of a Data Lake

Entitlements system

• Encryption

• Authentication

• Authorisation

• Chargeback

• Quotas

• Data masking

• Regional restrictions

Page 16: (BDT317) Building A Data Lake On AWS

Components of a Data Lake

API & User Interface

• Exposes the data lake to customers

• Programmatically query catalogue

• Expose search API

• Ensures that entitlements are respected

Page 17: (BDT317) Building A Data Lake On AWS

STORAGE

High durability

Stores raw data from input sources

Support for any type of data

Low cost


Page 18: (BDT317) Building A Data Lake On AWS

Amazon Simple Storage Service

Highly scalable object storage for the Internet

1 byte to 5 TB in size

Designed for 99.999999999% durability, 99.99% availability

Regional service, no single points of failure

Server side encryption


Page 19: (BDT317) Building A Data Lake On AWS

Storage Lifecycle Integration

S3 Standard → S3 Standard-Infrequent Access → Amazon Glacier
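These transitions can be automated per bucket. A minimal sketch with boto3, assuming a hypothetical bucket named mydatalake and illustrative 30/90-day thresholds:

import boto3

s3 = boto3.client("s3")

# Illustrative tiering: Standard -> Standard-IA after 30 days,
# Glacier after 90 days. Bucket name and prefix are hypothetical.
s3.put_bucket_lifecycle_configuration(
    Bucket="mydatalake",
    LifecycleConfiguration={
        "Rules": [{
            "ID": "tier-raw-data",
            "Filter": {"Prefix": "raw/"},
            "Status": "Enabled",
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        }]
    },
)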

Page 20: (BDT317) Building A Data Lake On AWS

Data Storage Format

• Not all data formats are created equally

• Unstructured vs. semi-structured vs. structured

• Store a copy of raw input

• Data standardisation as a workflow following ingest

• Use a format that supports your data, rather than force your data into a format

• Consider how data will change over time

• Apply common compression

Page 21: (BDT317) Building A Data Lake On AWS

Consider Different Types of Data

Unstructured
• Store the native file format (logs, dump files, whatever)
• Compress with a streaming codec (LZO, Snappy)

Semi-structured (JSON, XML files, etc.)
• Consider the evolution ability of the data schema (Avro)
• Store the schema for the data as a file attribute (metadata/tag) – see the sketch after this list

Structured
• Lots of data is CSV!
• Columnar storage (ORC, Parquet)

Page 22: (BDT317) Building A Data Lake On AWS

Where to Store Data

• Amazon S3 storage uses a flat keyspace

• Separate data storage by business unit, application, type, and time

• Natural data partitioning is very useful

• Paths should be self-documenting and intuitive

• Changing the prefix structure later is hard and costly
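For example, a self-documenting, naturally partitioned layout (hypothetical names, following the convention on the next slide) might look like:

s3://mydatalake/finance/ledger/2015-10-08/transactions/part-00001.csv.gz
s3://mydatalake/marketing/clickstream/2015-10-08/clicks/part-00001.snappy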

Page 23: (BDT317) Building A Data Lake On AWS

Resource Oriented Architecture

Metadata services: CRUD API, Query API, Analytics API

Systems of reference return URLs: these act as deep links to applications, as file exchanges via S3 (RESTful file services), or as manifests for big data analytics / HPC.

Integration layer:
• System to system via Amazon SNS/Amazon SQS
• System to user via mobile push
• Amazon Simple Workflow for high-level system integration/orchestration

s3://${system}/${application}/${YYYY-MM-DD}/${resource}/${resourceID}#appliedSecurity/${entitlementGroupApplied}

http://en.wikipedia.org/wiki/Resource-oriented_architecture

Page 24: (BDT317) Building A Data Lake On AWS

STREAMING

Streaming ingest of feed data

Provides the ability to consume any dataset as a stream

Facilitates low-latency analytics

Page 25: (BDT317) Building A Data Lake On AWS

Why Do Streams Matter?

• Latency between event & action

• Most BI systems target an event-to-action latency of 1 hour

• Streaming analytics would expect an event-to-action latency < 2 seconds

• Stream orientation simplifies architecture, but can increase operational complexity

• The increase in complexity needs to be justified by the business value of reduced latency

Page 26: (BDT317) Building A Data Lake On AWS

Amazon Kinesis

Managed service for real-time big data processing

Create streams to produce & consume data

Elastically add and remove shards for performance

Use the Amazon Kinesis Client Library to process data

Integration with S3, Amazon Redshift, and DynamoDB
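A minimal producer/consumer sketch with boto3, assuming a pre-created stream named datalake-ingest (hypothetical); at scale, the Kinesis Client Library handles shard iteration and checkpointing:

import json
import boto3

kinesis = boto3.client("kinesis")

# Produce: the partition key determines the target shard.
kinesis.put_record(
    StreamName="datalake-ingest",
    Data=json.dumps({"event": "click", "url": "/home"}),
    PartitionKey="user-42",
)

# Consume: read one batch from the first shard.
shard_id = kinesis.describe_stream(
    StreamName="datalake-ingest")["StreamDescription"]["Shards"][0]["ShardId"]
iterator = kinesis.get_shard_iterator(
    StreamName="datalake-ingest",
    ShardId=shard_id,
    ShardIteratorType="TRIM_HORIZON",
)["ShardIterator"]
for record in kinesis.get_records(ShardIterator=iterator)["Records"]:
    print(record["Data"])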


Page 27: (BDT317) Building A Data Lake On AWS

Amazon Kinesis Architecture

[Diagram: data sources write through the AWS endpoint into a stream of shards 1..N, replicated across three Availability Zones; consuming applications include App.1 (archive/ingestion to S3), App.2 (sliding-window analysis), App.3 (data loading into Amazon Redshift), and App.4 (event processing systems backed by DynamoDB)]

Page 28: (BDT317) Building A Data Lake On AWS

Streaming Storage Integration

[Diagram: analytics applications read & write file data in the object store (Amazon S3) and read & write to streams in the streaming store (Amazon Kinesis); streams are archived to S3, and history can be replayed from S3 back into streams]

Page 29: (BDT317) Building A Data Lake On AWS

CATALOGUE & SEARCH

Metadata lake

Used for summary statistics and data classification management

Simplified model for data discovery & governance

Page 30: (BDT317) Building A Data Lake On AWS

Building a Data Catalogue

• Aggregated information about your storage & streaming layer

• Storage service for metadata: ownership, data lineage

• Data abstraction layer: customer data = a collection of prefixes

• Enabling data discovery

• API for use by the entitlements service

Page 31: (BDT317) Building A Data Lake On AWS

Data Catalogue – Metadata Index

• Stores data about your Amazon S3 storage environment

• Total size & count of objects by prefix, data classification, refresh schedule, object version information

• Amazon S3 events are processed by a Lambda function

• DynamoDB metadata tables store the required attributes
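A sketch of such a handler (in Python for brevity; the slide's Lambda runtimes were Node.js and Java), assuming a hypothetical DynamoDB table named metadata-index with hash key "prefix":

import boto3

table = boto3.resource("dynamodb").Table("metadata-index")  # hypothetical

def handler(event, context):
    # One Lambda invocation can carry several S3 object-created events.
    for record in event["Records"]:
        obj = record["s3"]["object"]
        prefix = obj["key"].rsplit("/", 1)[0] if "/" in obj["key"] else ""
        # Maintain running size/count totals per prefix.
        table.update_item(
            Key={"prefix": prefix},
            UpdateExpression="ADD total_size :s, object_count :one",
            ExpressionAttributeValues={":s": obj.get("size", 0), ":one": 1},
        )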

Page 32: (BDT317) Building A Data Lake On AWS

http://amzn.to/1LSSbFp

Page 33: (BDT317) Building A Data Lake On AWS

Amazon DynamoDB

Provisioned-throughput NoSQL database

Fast, predictable, configurable performance

Fully distributed, fault tolerant HA architecture

Integration with Amazon EMR & Hive


Page 34: (BDT317) Building A Data Lake On AWS

AWS Lambda

Fully managed event processor

Node.js or Java, integrated AWS SDK

Natively compile & install any Node.js modules

Specify runtime RAM & timeout

Automatically scaled to support event volume

Events from Amazon S3, Amazon SNS, Amazon DynamoDB, Amazon Kinesis, & AWS Lambda

Integrated CloudWatch logging


Page 35: (BDT317) Building A Data Lake On AWS

Data Catalogue – Search

Ingestion and pre-processing

Text processing (normalization)

• Tokenization

• Downcasing

• Stemming

• Stopword removal

• Synonym addition

Indexing

Matching

Ranking and relevance

• TF-IDF

• Additional criteria (rating, user behavior, freshness, etc.)

[Diagram: documents from NoSQL, RDBMS, files, or any source flow through a processor into the search index]
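A toy, dependency-free illustration of the normalization steps listed above (the stopword list, suffix-stripping "stemmer", and synonym table are all hypothetical):

# Toy pipeline: tokenize, downcase, drop stopwords, naive stemming, synonyms.
STOPWORDS = {"the", "a", "of", "on"}
SYNONYMS = {"lake": ["reservoir"]}  # hypothetical synonym table

def normalize(text):
    tokens = [t.lower().strip(".,") for t in text.split()]   # tokenize + downcase
    tokens = [t for t in tokens if t not in STOPWORDS]       # stopword removal
    tokens = [t[:-1] if t.endswith("s") else t for t in tokens]  # naive stemming
    out = []
    for t in tokens:                                         # synonym addition
        out.append(t)
        out.extend(SYNONYMS.get(t, []))
    return out

print(normalize("Building a Data Lake on AWS"))
# -> ['building', 'data', 'lake', 'reservoir', 'aw']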

Page 36: (BDT317) Building A Data Lake On AWS

Features and Benefits

Amazon CloudSearch & Amazon Elasticsearch

• Easy to set up and operate: AWS Management Console, SDK, CLI

• Scalable: automatic scaling with data size and traffic

• Reliable: automatic recovery of instances, multi-AZ, etc.

• High performance: low latency and high throughput through in-memory caching

• Fully managed: no capacity guessing

• Rich features: faceted search, suggestions, relevance ranking, geospatial search, multi-language support, etc.

• Cost effective: pay as you go

Page 37: (BDT317) Building A Data Lake On AWS

Data Catalogue – Building Search Index

• Enable the DynamoDB Update Stream for the metadata index table

• An additional AWS Lambda function reads the Update Stream and extracts index fields from the S3 object

• Update to the search domain
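A sketch of that second function (again Python for brevity), assuming a hypothetical Amazon CloudSearch document endpoint and stream records whose NewImage carries the indexed attributes:

import json
import boto3

# Hypothetical CloudSearch domain document endpoint.
search = boto3.client(
    "cloudsearchdomain",
    endpoint_url="https://doc-datalake-example.us-east-1.cloudsearch.amazonaws.com",
)

def handler(event, context):
    docs = []
    for record in event["Records"]:
        if record["eventName"] not in ("INSERT", "MODIFY"):
            continue
        image = record["dynamodb"]["NewImage"]
        docs.append({
            "type": "add",
            "id": image["prefix"]["S"].replace("/", "_"),
            "fields": {"prefix": image["prefix"]["S"]},
        })
    if docs:
        # Push the document batch to the search domain.
        search.upload_documents(
            documents=json.dumps(docs), contentType="application/json")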

Page 38: (BDT317) Building A Data Lake On AWS

Catalogue & Search Architecture

Page 39: (BDT317) Building A Data Lake On AWS

ENTITLEMENTS

Encryption

Authentication

Authorisation

Chargeback

Quotas

Data masking

Regional restrictions


Page 40: (BDT317) Building A Data Lake On AWS

Data Lake != Open Access

Page 41: (BDT317) Building A Data Lake On AWS

Identity & Access Management

• Manage users, groups, and roles

• Identity federation with OpenID

• Temporary credentials with the AWS Security Token Service (STS)

• Stored policy templates

• Powerful policy language

• Amazon S3 bucket policies

Page 42: (BDT317) Building A Data Lake On AWS

IAM Policy Language

• JSON documents

• Can include variables which extract information from the request context

aws:CurrentTime | For date/time conditions
aws:EpochTime | The date in epoch (UNIX) time, for use with date/time conditions
aws:TokenIssueTime | The date/time that temporary security credentials were issued, for use with date/time conditions
aws:principaltype | Whether the principal is an account, user, federated identity, or assumed role
aws:SecureTransport | Boolean representing whether the request was sent using SSL
aws:SourceIp | The requester's IP address, for use with IP address conditions
aws:UserAgent | Information about the requester's client application, for use with string conditions
aws:userid | The unique ID for the current user
aws:username | The friendly name of the current user

Page 43: (BDT317) Building A Data Lake On AWS

IAM Policy Language

Example: Allow a user to access a private part of the data lake

{"Version": "2012-10-17","Statement": [

{"Action": ["s3:ListBucket"],"Effect": "Allow","Resource": ["arn:aws:s3:::mydatalake"],"Condition": {"StringLike": {"s3:prefix": ["${aws:username}/*"]}}

},{

"Action": ["s3:GetObject","s3:PutObject"

],"Effect": "Allow","Resource": ["arn:aws:s3:::mydatalake/${aws:username}/*"]

}]

}

Page 44: (BDT317) Building A Data Lake On AWS

IAM Federation

• IAM allows federation to Active Directory and other OpenID providers (Amazon, Facebook, Google)

• AWS Directory Service provides an AD Connector which can automate federated connectivity to ADFS

[Diagram: IAM users federated through AWS Directory Service's AD Connector, reaching on-premises ADFS over AWS Direct Connect or a hardware VPN]

Page 45: (BDT317) Building A Data Lake On AWS

Extended user-defined security

Page 46: (BDT317) Building A Data Lake On AWS

Entitlements Engine: Amazon STS Token Vending Machine

http://amzn.to/1FMPrTF
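A minimal token-vending sketch: scope a temporary credential to the caller's own prefix via an inline policy (names are hypothetical; a real TVM would derive the policy from the entitlements catalogue):

import json
import boto3

sts = boto3.client("sts")

def vend_token(username):
    # Mirror the bucket policy shown earlier: the user may only
    # touch objects under their own prefix.
    policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject"],
            "Resource": ["arn:aws:s3:::mydatalake/%s/*" % username],
        }],
    }
    resp = sts.get_federation_token(
        Name=username, Policy=json.dumps(policy), DurationSeconds=3600)
    return resp["Credentials"]  # AccessKeyId, SecretAccessKey, SessionToken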

Page 47: (BDT317) Building A Data Lake On AWS

Data Encryption

AWS CloudHSM

Dedicated-tenancy SafeNet Luna SA HSM device

Common Criteria EAL4+, NIST FIPS 140-2

AWS Key Management Service

Automated key rotation & auditing

Integration with other AWS services

AWS server-side encryption

AWS-managed key infrastructure

Page 48: (BDT317) Building A Data Lake On AWS

Entitlements – Access to Encryption Keys

[Diagram: envelope encryption flow – the Security Token Service issues an IAM temporary credential; the customer master key generates a customer data key, returned as a plaintext key plus a ciphertext key; the plaintext key encrypts MyData, and the object lands in S3 as (Name: MyData, Key: Ciphertext Key)]
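A sketch of that envelope-encryption flow (key alias and bucket are hypothetical; local encryption uses AES-GCM from the third-party cryptography package):

import os
import boto3
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

kms = boto3.client("kms")
s3 = boto3.client("s3")

# 1. Generate a data key under the customer master key.
key = kms.generate_data_key(KeyId="alias/datalake-cmk", KeySpec="AES_256")

# 2. Encrypt locally with the plaintext key, then let it go out of scope.
nonce = os.urandom(12)
ciphertext = AESGCM(key["Plaintext"]).encrypt(nonce, b"MyData", None)

# 3. Store the object with the *ciphertext* data key as metadata;
#    reading it back requires kms:Decrypt on that key material.
s3.put_object(
    Bucket="mydatalake",
    Key="secure/mydata",
    Body=nonce + ciphertext,
    Metadata={"x-data-key": key["CiphertextBlob"].hex()},
)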

Page 49: (BDT317) Building A Data Lake On AWS

Secure Data Flow

[Diagram: users call the API Gateway with a temporary credential; the TVM (Elastic Beanstalk) consults the metadata index in DynamoDB and obtains temporary credentials from IAM via the Security Token Service; the user then accesses encrypted data directly at s3://mydatalake/${YYYY-MM-DD}/${resource}/${resourceID}]

Page 50: (BDT317) Building A Data Lake On AWS

API & UI

Exposes the data lake to customers

Programmatically query catalogue

Expose search API

Ensures that entitlements are respected


Page 51: (BDT317) Building A Data Lake On AWS

Data Lake API & UI

• Exposes the metadata API, search, and Amazon S3 storage services to customers

• Can be based on TVM/STS temporary access for many services, and a bespoke API for metadata

• Drive all UI operations from the API?

Page 52: (BDT317) Building A Data Lake On AWS

AMAZON API GATEWAY

Page 53: (BDT317) Building A Data Lake On AWS

Amazon API Gateway

Host multiple versions and stages of APIs

Create and distribute API keys to developers

Leverage AWS Sigv4 to authorize access to APIs

Throttle and monitor requests to protect the backend

Leverages AWS Lambda
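To illustrate the SigV4 point, a client-side sketch that signs a request to a hypothetical API Gateway endpoint using botocore's signer:

import urllib.request
import boto3
from botocore.auth import SigV4Auth
from botocore.awsrequest import AWSRequest

creds = boto3.Session().get_credentials().get_frozen_credentials()

# Hypothetical endpoint; "execute-api" is the service name used for signing.
url = "https://abc123.execute-api.us-east-1.amazonaws.com/prod/catalogue"
request = AWSRequest(method="GET", url=url)
SigV4Auth(creds, "execute-api", "us-east-1").add_auth(request)

# Replay the signed headers (Authorization, X-Amz-Date, ...) on a real call.
response = urllib.request.urlopen(
    urllib.request.Request(url, headers=dict(request.headers)))
print(response.status)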

Page 54: (BDT317) Building A Data Lake On AWS

Additional Features

Managed cache to store API responses

Reduced latency and DDoS protection through Amazon CloudFront

SDK generation for iOS, Android, and JavaScript

Swagger support

Request / response data transformation and API mocking

Page 55: (BDT317) Building A Data Lake On AWS

An API Call Flow

[Diagram: mobile apps, websites, and services call Amazon API Gateway over the Internet; Amazon CloudFront and the API Gateway cache front the requests; the gateway invokes AWS Lambda functions, endpoints on Amazon EC2, or any other publicly accessible endpoint, with Amazon CloudWatch monitoring throughout]

Page 56: (BDT317) Building A Data Lake On AWS

API & UI Architecture

[Diagram: users reach the UI (Elastic Beanstalk) and the API Gateway; the gateway invokes AWS Lambda to query the metadata index; the TVM (Elastic Beanstalk) works with IAM to issue credentials]

Page 57: (BDT317) Building A Data Lake On AWS

Putting It All Together

Page 58: (BDT317) Building A Data Lake On AWS

A Data Lake Is…

• A foundation of highly durable data storage and streaming of any type of data

• A metadata index and workflow which helps us categorise and govern data stored in the data lake

• A search index and workflow which enables data discovery

• A robust set of security controls – governance through technology, not policy

• An API and user interface that expose these features to internal and external users

Page 59: (BDT317) Building A Data Lake On AWS

[Diagram: the complete data lake architecture –
Storage & Streams: Amazon Kinesis, Amazon S3, Amazon Glacier;
Data Catalogue & Search: AWS Lambda maintaining the metadata index and the search index;
Entitlements: IAM, the Security Token Service, KMS, and encrypted data;
API & UI: API Gateway, a UI on Elastic Beanstalk, and the TVM on Elastic Beanstalk, serving users]

Page 60: (BDT317) Building A Data Lake On AWS

Remember to complete your evaluations!

Page 61: (BDT317) Building A Data Lake On AWS

Thank you!

Ian Meyers, Principal Solution Architect