Ingest, Transform & Visualize w Amazon Web Services

Cloudwick © 2015 Cloudwick. All rights reserved. Confidential and Proprietary. Data Ingestion, Transformation, and Visualization with Amazon Web Services Sponsored by Amazon Web Services and Intel (Venue ) Presented by Arun Kumar Palathumpattu (Cloudwick)

Transcript of Ingest, Transform & Visualize w Amazon Web Services

Page 1: Ingest, Transform & Visualize w Amazon Web Services

Data Ingestion, Transformation, and Visualization with Amazon Web Services
Sponsored by Amazon Web Services and Intel (Venue)
Presented by Arun Kumar Palathumpattu (Cloudwick)

Page 2: Ingest, Transform & Visualize w Amazon Web Services

Cloudwick

Page 3: Ingest, Transform & Visualize w Amazon Web Services

Agenda

• Introduction
• Common Challenges in Data Analytics Environments
• Creating Data Lake – Single source of truth
• Build Effective Data Workflow
• Demo
• Q&A

Page 4: Ingest, Transform & Visualize w Amazon Web Services

Top 5 Challenges in Building Data Analysis Environments

Page 5: Ingest, Transform & Visualize w Amazon Web Services

# 1 Market Landscape

Page 6: Ingest, Transform & Visualize w Amazon Web Services

# 2 Issues of Storing All the Data

Image:http://mwicorp.com/lake-okeechobee-water-transfer/

Structured

Page 7: Ingest, Transform & Visualize w Amazon Web Services

# 3 Time it Takes to Find the Useful Information

Image:https://img.clipartfest.com/0ffb2c38607970437e20a6fdd2872eb9_4-benefits-from-taking-guitar-time-value-of-money-clipart_500-500.png

Page 8: Ingest, Transform & Visualize w Amazon Web Services

# 4 Security

Image:https://insights.ubuntu.com/2017/03/20/three-flaws-at-the-heart-of-iot-security/

Page 9: Ingest, Transform & Visualize w Amazon Web Services

# 5 Skill Gap

https://www.linkedin.com/pulse/wonder-three-years-analytics-big-data-skills-most-demand-hosseini

Page 10: Ingest, Transform & Visualize w Amazon Web Services

Evolution of Data Architectures
1985: Data Warehouse Appliances

Benefits
• Consolidated multiple decision support environments (i.e. databases) into a single architecture
• Best performance available at time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top

Diagram: Shared Storage Tier (NAS Appliance); four Compute Nodes

Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing

Page 11: Ingest, Transform & Visualize w Amazon Web Services

Evolution of Data Architectures
2006: Hadoop Clusters

Diagram: Hadoop Master Node; three nodes, each with CPU, Memory, and HDFS Storage

Improvements
• Open source based software license!!!
• Commodity white box servers!!!!
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)

Constraints
• HDFS 3X replication to protect against node failure gets expensive at scale
• 500 TB data set = 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General purpose, monolithic cluster with many different apps on same hardware
• Still a data silo
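
The replication arithmetic on this slide can be checked with a few lines of Python. This is only a sketch of the multiplication behind the 500 TB = 1.5 PB figure; the function name is made up for illustration, and real clusters also need headroom for temporary and intermediate data, which is not modeled here.

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int = 3) -> float:
    """Raw disk needed to hold a data set at a given HDFS replication factor."""
    return dataset_tb * replication_factor


dataset_tb = 500
raw_tb = raw_capacity_tb(dataset_tb)  # 3 copies of every block
print(f"{dataset_tb} TB data set -> {raw_tb / 1000:.1f} PB of raw cluster storage")
# prints: 500 TB data set -> 1.5 PB of raw cluster storage
```

Because HDFS storage is local to the nodes, every extra terabyte of raw capacity in this model also brings CPU and memory that must be paid for, which is the constraint the slide points out.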

Page 12: Ingest, Transform & Visualize w Amazon Web Services

Legacy Data Architectures Exist as Isolated Data Silos

Hadoop Cluster

SQL Database

Data Warehouse Appliance

Page 13: Ingest, Transform & Visualize w Amazon Web Services

Evolution of Data Architectures
2009: Decoupled EMR Architecture

Diagram: Hadoop Master Node; three CPU/Memory nodes; S3 as HDFS

Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently and up & down
• Only pay for the 500 TB data set (not 3X)
• Multi-physical facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, i.e. Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, etc.

Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via Hive, Presto, etc.
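
The idea that each job gets its own optimized cluster while all jobs share the same data in S3 can be sketched as a small planning function. Everything here is illustrative: the profile labels simply restate the slide's pairing and are not EC2 instance types, the bucket name is hypothetical, and no AWS API is called.

```python
# Pairing taken from the slide; the values are plain labels, not instance types.
CLUSTER_PROFILE = {
    "spark": "memory-intensive",
    "hive": "cpu-intensive",
    "hbase": "io-intensive",
}


def plan_clusters(jobs, shared_data="s3://example-data-lake/"):
    """Give each job its own cluster profile, all reading the same S3 location.

    `shared_data` is a hypothetical bucket used only for illustration.
    """
    plans = []
    for job in jobs:
        framework = job.lower()
        if framework not in CLUSTER_PROFILE:
            raise ValueError(f"no profile defined for {job!r}")
        plans.append(
            {
                "framework": framework,
                "profile": CLUSTER_PROFILE[framework],
                "input": shared_data,
            }
        )
    return plans


for plan in plan_clusters(["Spark", "Hive", "HBase"]):
    print(plan)
```

Each plan differs only in its compute profile; the input location is identical, which is what decoupling storage from compute makes possible.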

Page 14: Ingest, Transform & Visualize w Amazon Web Services

Evolution of Data Architectures
2012: Amazon Redshift – Cloud DW

Diagram labels: BI tools, SQL clients, Analytics tools; JDBC/ODBC; Customer VPC; Leader node; Internal VPC; Compute nodes; 10 GigE (HPC); Ingestion, Backup, Restore

Improvements
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP columnar relational database
• $1,000/TB/Year
• Accessible to any ODBC or JDBC BI tool

Constraints
• Still have to load data into a schema

Page 15: Ingest, Transform & Visualize w Amazon Web Services

Evolution of Data Architectures
Today: Clusterless

Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero administration
• Process data where it lives

Constraints
• Limited to SQL, Hive and Spark jobs today. More frameworks to come!

Diagram labels:
• Athena for SQL: SQL interface in web browser; S3 Data Lake
• Glue for ETL: Spark & Hive interface in web browser; S3 Data Lake
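
To make "pay by the query" concrete, here is a rough cost sketch. It assumes billing is proportional to the bytes a query scans and uses an illustrative price of 5 USD per TB; both the billing model and the price are assumptions for this example, not figures from the slides, and should be checked against current Athena pricing.

```python
TB = 10**12  # decimal terabyte; an assumption about how scanned data is metered


def query_cost_usd(bytes_scanned: int, price_per_tb_usd: float = 5.0) -> float:
    """Cost of one query when billing is proportional to data scanned.

    The default price is an illustrative assumption, not a quoted rate.
    """
    return bytes_scanned / TB * price_per_tb_usd


full_scan = query_cost_usd(500 * 10**9)   # query reads all 500 GB
narrow_scan = query_cost_usd(50 * 10**9)  # query reads only 50 GB of it
print(f"full scan:   ${full_scan:.2f}")
print(f"narrow scan: ${narrow_scan:.2f}")
```

Under this model there is no charge while no queries run, and anything that reduces the data scanned (partitioning, compression, columnar formats) lowers the bill directly.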

Page 16: Ingest, Transform & Visualize w Amazon Web Services

Building a Flexible Data Lake Architecture on AWS

Page 17: Ingest, Transform & Visualize w Amazon Web Services

Enter Data Lake Architectures

Data Lake is a new and increasingly popular architecture to store and analyze massive volumes and heterogeneous types of data.

Separating your storage and compute allows you to scale each component as required.

“How can I scale up with the volume of data being generated?”

Page 18: Ingest, Transform & Visualize w Amazon Web Services

Benefits of a Data Lake – All Data in One Place

Store and analyze all of your data, from all of your sources, in one centralized location.

“Why is the data distributed in many locations? Where is the single source of truth?”

Quickly ingest data without needing to force it into a pre-defined schema.

“How can I collect data quickly from various sources and store it efficiently?”

Page 19: Ingest, Transform & Visualize w Amazon Web Services

Benefits of a Data Lake – Schema on Read

“Is there a way I can apply multiple analytics and processing frameworks to the same data?”

A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
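
Schema on read can be illustrated with a minimal Python sketch: the raw records are stored untouched, and each consumer projects them onto its own schema when reading. The sample records, field names, and the `read_with_schema` helper are all invented for this example.

```python
import json

# Raw events are stored exactly as they arrived; no schema was enforced on write.
RAW_LINES = [
    '{"user": "a1", "ts": "2017-03-20T10:00:00", "amount": "19.99", "page": "/cart"}',
    '{"user": "b2", "ts": "2017-03-20T10:05:00", "amount": "5.00"}',
]


def read_with_schema(lines, schema):
    """Project each raw record onto a schema of {field: converter} at read time."""
    for line in lines:
        record = json.loads(line)
        yield {
            field: (convert(record[field]) if field in record else None)
            for field, convert in schema.items()
        }


# Two consumers apply different schemas to the same stored data.
billing_schema = {"user": str, "amount": float}
clickstream_schema = {"user": str, "page": str}

billing = list(read_with_schema(RAW_LINES, billing_schema))
clicks = list(read_with_schema(RAW_LINES, clickstream_schema))
print(billing)
print(clicks)
```

The second record has no `page` field, yet it was still ingested; the missing value only surfaces, as `None`, for the consumer whose schema asks for it.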

Page 20: Ingest, Transform & Visualize w Amazon Web Services

Benefits of an AWS S3 Data Lake

Fixed Cluster Data Lake
• Limited to only the single tool contained on the cluster (i.e. Hadoop or data warehouse or Cassandra, etc.). Use cases & ecosystem tools change rapidly
• Expensive to add nodes to add storage capacity
• Expensive to replicate data against node loss
• Complexity in scaling local storage capacity
• Long refresh cycles to add additional storage equipment

AWS S3 Data Lake
• Decouple storage and compute by making S3 object based storage, not a fixed tool cluster, the data lake
• Flexibility to use any and all tools in the ecosystem. The right tool for the job
• Future proof your architecture. As new use cases and new tools emerge you can plug and play current best of breed.

Page 21: Ingest, Transform & Visualize w Amazon Web Services

S3 for data lake

Durable
• Designed for 11 9s of durability

Available
• Designed for 99.99% availability

High performance
• Multipart upload
• Range GET

Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments

Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB

Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
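
Range GET lets a client request part of an object instead of the whole thing, so a large object can be fetched as several parts in parallel. The sketch below only computes the standard HTTP `Range` header values for each part; the function name and part size are illustrative, and issuing the actual requests is left out.

```python
def byte_ranges(object_size: int, part_size: int):
    """Split an object into HTTP Range header values, one per part."""
    if object_size <= 0 or part_size <= 0:
        raise ValueError("sizes must be positive")
    ranges = []
    for start in range(0, object_size, part_size):
        end = min(start + part_size, object_size) - 1  # Range end is inclusive
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(10, 4))
# prints: ['bytes=0-3', 'bytes=4-7', 'bytes=8-9']
```

Each value could be sent as the `Range` header of a separate GET request, and the responses concatenated in order to rebuild the object.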

Page 22: Ingest, Transform & Visualize w Amazon Web Services

Building a Data Lake on AWS

Kinesis Firehose

Athena Query Service

Page 23: Ingest, Transform & Visualize w Amazon Web Services

Processing & Analytics

Categories: Real-time; Batch; AI & Predictive; BI & Data Visualization; Transactional & RDBMS

Services shown:
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB: NoSQL DB
• Aurora: Relational Database
• EMR: Hadoop, Spark, Presto
• Redshift: Data Warehouse
• Athena: Query Service
• Amazon Lex: Speech recognition
• Amazon Rekognition
• Amazon Polly: Text to speech
• Machine Learning: Predictive analytics
• Kinesis Streams & Firehose

Page 24: Ingest, Transform & Visualize w Amazon Web Services

Summary of AWS Analytics, Database & AI Tools

• Amazon Redshift: Enterprise Data Warehouse
• Amazon EMR: Hadoop/Spark
• Amazon Athena: Clusterless SQL
• Amazon Glue: Clusterless ETL
• Amazon Aurora: Managed Relational Database
• Amazon Machine Learning: Predictive Analytics
• Amazon QuickSight: Business Intelligence/Visualization
• Amazon Elasticsearch Service: Elasticsearch
• Amazon ElastiCache: Redis In-memory Datastore
• Amazon DynamoDB: Managed NoSQL Database
• Amazon Rekognition: Deep Learning-based Image Recognition
• Amazon Lex: Voice or Text Chatbots

Page 25: Ingest, Transform & Visualize w Amazon Web Services

Implement the right cloud security controls

Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3

Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption

Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3 etc.
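
As one way of combining two of these controls (bucket policies and SSE-S3 over SSL endpoints), the sketch below builds a bucket policy document that denies plain-HTTP requests and denies uploads that do not request SSE-S3 encryption. The bucket name and function name are hypothetical, and the policy is an illustration to be reviewed against the S3 documentation before use, not a recommended baseline.

```python
import json


def encryption_policy(bucket: str) -> dict:
    """Bucket policy that rejects plain-HTTP requests and unencrypted uploads."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Deny any request that did not arrive over SSL/TLS.
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            },
            {
                # Deny uploads that do not ask for SSE-S3 (AES256) encryption.
                "Sid": "DenyUnencryptedUploads",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:PutObject",
                "Resource": f"arn:aws:s3:::{bucket}/*",
                "Condition": {
                    "StringNotEquals": {"s3:x-amz-server-side-encryption": "AES256"}
                },
            },
        ],
    }


# "example-data-lake" is a hypothetical bucket name.
policy = encryption_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```

The printed JSON is the document that would be attached to the bucket; IAM policies, ACLs, and VPC endpoints from the list above would be configured separately.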

Page 26: Ingest, Transform & Visualize w Amazon Web Services

More Efficient Data Lake Architectures

Page 27: Ingest, Transform & Visualize w Amazon Web Services

What is a Modern Enterprise Data Warehouse?

A modern Enterprise Data Warehouse (EDW) is designed to support rapid data growth and quick analytics over relational, non-relational, and streaming data, with a single, easy interface for consuming all of these data types.

Page 28: Ingest, Transform & Visualize w Amazon Web Services

Building a Modern EDW on AWS

Diagram labels:
• Stages: Data Ingestion; Data Consumption
• Source types: Streaming; Un-structured; Relational
• Services: Amazon Kinesis Firehose; Amazon S3; Amazon EMR; Amazon Redshift; Amazon RDS; Amazon Athena; Amazon QuickSight; Amazon Machine Learning
• Other labels: Data Mart; EDW; Ad-hoc query; Analyze; Predict

Page 29: Ingest, Transform & Visualize w Amazon Web Services

What is a Real-time Analytics Platform?

A real-time analytics platform provides the ability to load and analyze streaming data and to build custom streaming data applications for specialized needs. Such a platform can handle staggering amounts of streaming data, sometimes TBs per hour, that need to be collected, stored, and processed continuously.
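
To get a feel for what "TBs per hour" means for a streaming service, the sketch below converts an hourly volume into a sustained rate and a shard count. It assumes a stream made of shards that each accept about 1 MB per second of writes, which is how Kinesis Streams capacity is commonly described; treat that per-shard figure, the decimal units, and the function name as assumptions and check current service limits before sizing anything.

```python
import math


def shards_needed(tb_per_hour: float, mb_per_sec_per_shard: float = 1.0) -> int:
    """Shards required to absorb a sustained ingest rate.

    The 1 MB/s per-shard default is an assumption for illustration.
    """
    mb_per_sec = tb_per_hour * 1_000_000 / 3600  # 1 TB = 1,000,000 MB (decimal)
    return math.ceil(mb_per_sec / mb_per_sec_per_shard)


print(shards_needed(1))  # 1 TB/hour is roughly 278 MB/s
# prints: 278
```

The calculation assumes perfectly even traffic; bursts and uneven partition keys would call for extra headroom.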

Page 30: Ingest, Transform & Visualize w Amazon Web Services

Building a Real-time analytics platform (Streaming vs batching)

Stream Delivery

Stream Analytics

Stream processing

Event Driven

Kinesis Streaming

Page 31: Ingest, Transform & Visualize w Amazon Web Services

Simplify Big Data Processing

Stages, from data to answers:
• ingest / collect
• store
• process / analyze
• consume / visualize

Trade-offs: Time to Answer (Latency); Throughput; Cost

Page 32: Ingest, Transform & Visualize w Amazon Web Services

Amazon QuickSight is a Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data.

Page 33: Ingest, Transform & Visualize w Amazon Web Services

Amazon S3 Data Lake

Visualizing the Data Lake

Amazon Athena

Page 34: Ingest, Transform & Visualize w Amazon Web Services

DEMO

Page 35: Ingest, Transform & Visualize w Amazon Web Services

Credits

Event sponsored by Amazon Web Services
Venue by Intel
Presented by Arun Kumar Palathumpattu (Cloudwick)
Slides credit: AWS + Cloudwick