Ingest, Transform & Visualize w Amazon Web Services
Transcript of Ingest, Transform & Visualize w Amazon Web Services
Cloudwick© 2015 Cloudwick. All rights reserved. Confidential and Proprietary.
Data Ingestion, Transformation, and Visualization with Amazon Web Services
Sponsored by Amazon Web Services and Intel (Venue)
Presented by Arun Kumar Palathumpattu (Cloudwick)
Agenda
• Introduction
• Common Challenges in Data Analytics Environments
• Creating Data Lake – Single source of truth
• Build Effective Data workflow
• Demo
• Q&A
Top 5 Challenges in Building Data Analysis Environments
# 1 Market Landscape
Image:http://mwicorp.com/lake-okeechobee-water-transfer/
# 2 Issues of Storing All the Data
# 3 Time it Takes to Find the Useful Information
Image:https://img.clipartfest.com/0ffb2c38607970437e20a6fdd2872eb9_4-benefits-from-taking-guitar-time-value-of-money-clipart_500-500.png
# 4 Security
Image:https://insights.ubuntu.com/2017/03/20/three-flaws-at-the-heart-of-iot-security/
# 5 Skill Gap
https://www.linkedin.com/pulse/wonder-three-years-analytics-big-data-skills-most-demand-hosseini
Evolution of Data Architectures
1985: Data Warehouse Appliances
[Diagram: Shared Storage Tier (NAS Appliance) and four Compute Nodes]
Benefits
• Consolidated multiple decision support environments (i.e. databases) into a single architecture
• Best performance available at time of conception, hence the expensive licenses
• Worked well with structured, columnar data
• Could build customized data marts on top
Constraints
• Proprietary software license paid per node per year
• Gold-plated hardware available only from the vendor with per node per year cost
• Could not handle unstructured data sets
• Heavy ETL & data cleansing
Evolution of Data Architectures
2006: Hadoop Clusters
[Diagram: Hadoop Master Node plus nodes that each pair CPU and Memory with HDFS Storage]
Improvements
• Open source based software license!!!
• Commodity white box servers!!!!
• Could handle structured & unstructured data sets
• Many different applications within the framework (MapReduce, Spark, Hive, Pig, HBase, Presto, etc.)
Constraints
• HDFS 3X replication to protect against node failure gets expensive at scale
• 500 TB data set = 1.5 PB cluster
• Local storage means you must scale and pay for CPU & memory resources when adding data capacity
• General purpose, monolithic cluster with many different apps on same hardware
• Still a data silo
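The 500 TB figure in the constraints above follows directly from the replication factor. A minimal sketch of that arithmetic, assuming decimal units (1 PB = 1000 TB) and ignoring any free-space headroom a real cluster would also need; the function name is hypothetical:

```python
def raw_capacity_tb(dataset_tb: float, replication_factor: int = 3) -> float:
    """Raw disk needed to hold a data set at a given HDFS replication factor."""
    return dataset_tb * replication_factor


# 500 TB data set stored with 3X replication
raw_tb = raw_capacity_tb(500)
print(f"{raw_tb:.0f} TB raw = {raw_tb / 1000:.1f} PB cluster")
```

Because storage is local to each node, every extra terabyte of raw capacity also brings CPU and memory that must be paid for, which is the coupling the slide calls out.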
Legacy Data Architectures Exist as Isolated Data Silos
Hadoop Cluster
SQL Database
Data Warehouse Appliance
Evolution of Data Architectures
2009: Decoupled EMR Architecture
[Diagram: Hadoop Master Node and nodes with CPU and Memory only; S3 as HDFS]
Improvements
• Decoupled storage & compute
• Scale CPU and memory resources independently and up & down
• Only pay for the 500 TB data set (not 3X)
• Multi-physical facility replication via S3
• Multiple clusters can run in parallel against shared data in S3
• Each job gets its own optimized cluster, i.e. Spark on memory intensive, Hive on CPU intensive, HBase on I/O intensive, etc.
Constraints
• Still have a cluster to provision and manage
• Must expose EMR cluster to SQL users via Hive, Presto, etc.
Evolution of Data Architectures
2012: Amazon Redshift – Cloud DW
[Diagram labels: BI tools, SQL clients, Analytics tools; JDBC/ODBC; Customer VPC; Leader node; Internal VPC; Compute nodes; 10 GigE (HPC); Ingestion, Backup, Restore]
Improvements
• Automated installation, patching, backups
• No servers to manage and maintain
• MPP Columnar relational database
• $1,000/TB/Year
• Accessible to any ODBC or JDBC BI Tool
Constraints
• Still have to load data into a schema
Evolution of Data Architectures
Today: Clusterless
[Diagram: Athena for SQL (SQL Interface in web browser) on the S3 Data Lake; Glue for ETL (Spark & Hive Interface in web browser) on the S3 Data Lake]
Improvements
• No cluster/infrastructure to manage
• Business users and analysts can write SQL without having to provision a cluster or touch infrastructure
• Pay by the query
• Zero Administration
• Process data where it lives
Constraints
• Limited to SQL, Hive and Spark jobs today. More frameworks to come!
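"Pay by the query" means cost tracks the data a query scans rather than the hours a cluster is running. A rough sketch of that billing model follows. The rate of 5 USD per TB scanned is an illustrative assumption, not a figure from these slides, and a TB is taken as 10**12 bytes; check current pricing before relying on either.

```python
PRICE_PER_TB_USD = 5.0   # assumption: illustrative per-TB-scanned rate
BYTES_PER_TB = 10**12    # assumption: decimal terabyte


def query_cost_usd(bytes_scanned: int, price_per_tb: float = PRICE_PER_TB_USD) -> float:
    """Cost of one query under a simple pay-per-data-scanned model."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


# A query that scans 200 GB of data in the lake
cost = query_cost_usd(200 * 10**9)
print(f"${cost:.2f}")
```

Under this model an idle day costs nothing, whereas a provisioned cluster bills for every hour it is up regardless of query volume.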
© 2015, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Building a Flexible Data Lake Architecture on AWS
Enter Data Lake Architectures
A data lake is a new and increasingly popular architecture for storing and analyzing massive volumes of heterogeneous data.
Separating your storage and compute allows you to scale each component as required.
“How can I scale up with the volume of data being generated?”
Benefits of a Data Lake – All Data in One Place
Store and analyze all of your data, from all of your sources, in one centralized location.
“Why is the data distributed in many locations? Where is the single source of truth?”
Quickly ingest data without needing to force it into a pre-defined schema.
“How can I collect data quickly from various sources and store it efficiently?”
Benefits of a Data Lake – Schema on Read
“Is there a way I can apply multiple analytics and processing frameworks to the same data?”
A Data Lake enables ad-hoc analysis by applying schemas on read, not write.
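Schema on read can be illustrated without any AWS service: the raw records are stored exactly as they arrived, and each consumer projects and casts the fields it needs at read time. The sketch below is an assumption-laden toy; the records, field names, and both schemas are hypothetical and not taken from the talk.

```python
import json

# Raw events kept as they arrived; nothing was validated or cast at write time.
RAW_LINES = [
    '{"user": "a1", "ts": "2017-03-01T10:00:00", "amount": "19.99", "country": "US"}',
    '{"user": "b2", "ts": "2017-03-01T10:05:00", "amount": "5.00"}',
]

# Two hypothetical schemas over the same data: field name -> cast function.
SALES_SCHEMA = {"user": str, "amount": float}
GEO_SCHEMA = {"user": str, "country": str}


def read_with_schema(lines, schema):
    """Apply a schema while reading: select fields, cast them, use None if absent."""
    rows = []
    for line in lines:
        record = json.loads(line)
        rows.append({
            field: cast(record[field]) if field in record else None
            for field, cast in schema.items()
        })
    return rows


sales = read_with_schema(RAW_LINES, SALES_SCHEMA)
geo = read_with_schema(RAW_LINES, GEO_SCHEMA)
print(sales)
print(geo)
```

The second record has no `country` field, yet it was ingested without trouble; only the reader that cares about geography has to decide how to treat the gap.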
Benefits of an AWS S3 Data Lake
Fixed Cluster Data Lake
• Limited to only the single tool contained on the cluster (i.e. Hadoop or data warehouse or Cassandra, etc.). Use cases & ecosystem tools change rapidly
• Expensive to add nodes to add storage capacity
• Expensive to replicate data against node loss
• Complexity in scaling local storage capacity
• Long refresh cycles to add additional storage equipment
AWS S3 Data Lake
• Decouple storage and compute by making S3 object based storage, not a fixed tool cluster, the data lake
• Flexibility to use any and all tools in the ecosystem. The right tool for the job
• Future proof your architecture. As new use cases and new tools emerge you can plug and play current best of breed.
S3 for data lake
Durable
• Designed for 11 9s of durability
Available
• Designed for 99.99% availability
High performance
• Multiple upload
• Range GET
Scalable
• Store as much as you need
• Scale storage and compute independently
• No minimum usage commitments
Integrated
• Amazon EMR
• Amazon Redshift
• Amazon DynamoDB
Easy to use
• Simple REST API
• AWS SDKs
• Read-after-create consistency
• Event notification
• Lifecycle policies
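Range GET helps performance because a large object can be fetched as several byte ranges in parallel. The sketch below only computes the HTTP `Range` header values for such a split and makes no network calls; the function name and the part size are hypothetical choices for illustration.

```python
from typing import List


def byte_ranges(object_size: int, part_size: int) -> List[str]:
    """Split an object of object_size bytes into HTTP Range header values."""
    if object_size < 0 or part_size <= 0:
        raise ValueError("object_size must be >= 0 and part_size must be > 0")
    ranges = []
    for start in range(0, object_size, part_size):
        # The end offset of an HTTP byte range is inclusive.
        end = min(start + part_size, object_size) - 1
        ranges.append(f"bytes={start}-{end}")
    return ranges


print(byte_ranges(25, 10))  # ['bytes=0-9', 'bytes=10-19', 'bytes=20-24']
```

Each value could be sent as the `Range` header of a separate GET request, with the parts reassembled in order on the client.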
Building a Data Lake on AWS
Kinesis Firehose
Athena Query Service
Processing & Analytics
Categories: Real-time; Batch; AI & Predictive; BI & Data Visualization; Transactional & RDBMS
• AWS Lambda
• Apache Storm on EMR
• Apache Flink on EMR
• Spark Streaming on EMR
• Elasticsearch Service
• Kinesis Analytics, Kinesis Streams
• DynamoDB: NoSQL DB
• Aurora: Relational Database
• EMR: Hadoop, Spark, Presto
• Redshift: Data Warehouse
• Athena: Query Service
• Amazon Lex: Speech recognition
• Amazon Rekognition
• Amazon Polly: Text to speech
• Machine Learning: Predictive analytics
• Kinesis Streams & Firehose
Summary of AWS Analytics, Database & AI Tools
• Amazon Redshift: Enterprise Data Warehouse
• Amazon EMR: Hadoop/Spark
• Amazon Athena: Clusterless SQL
• Amazon Glue: Clusterless ETL
• Amazon Aurora: Managed Relational Database
• Amazon Machine Learning: Predictive Analytics
• Amazon QuickSight: Business Intelligence/Visualization
• Amazon Elasticsearch Service: Elasticsearch
• Amazon ElastiCache: Redis In-memory Datastore
• Amazon DynamoDB: Managed NoSQL Database
• Amazon Rekognition: Deep Learning-based Image Recognition
• Amazon Lex: Voice or Text Chatbots
Implement the right cloud security controls
Security
• Identity and Access Management (IAM) policies
• Bucket policies
• Access Control Lists (ACLs)
• Private VPC endpoints to Amazon S3
Encryption
• SSL endpoints
• Server Side Encryption (SSE-S3)
• S3 Server Side Encryption with provided keys (SSE-C, SSE-KMS)
• Client-side Encryption
Compliance
• Bucket access logs
• Lifecycle Management Policies
• Access Control Lists (ACLs)
• Versioning & MFA deletes
• Certifications – HIPAA, PCI, SOC 1/2/3 etc.
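Two of the controls listed on this slide, bucket policies and SSL endpoints, can be combined in a single policy document. The sketch below builds a bucket policy that denies any request not made over SSL, using the `aws:SecureTransport` condition key. The bucket name and function name are hypothetical, and the code only constructs and prints the JSON; it does not apply the policy to anything.

```python
import json


def deny_insecure_transport_policy(bucket: str) -> dict:
    """Build a bucket policy that rejects requests not sent over SSL/TLS."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": "s3:*",
                # Cover both the bucket itself and every object in it.
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


# "example-data-lake" is a placeholder bucket name
policy = deny_insecure_transport_policy("example-data-lake")
print(json.dumps(policy, indent=2))
```

An explicit Deny takes precedence over any Allow granted elsewhere, so a statement like this acts as a guardrail alongside IAM policies and ACLs rather than a replacement for them.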
More Efficient Data Lake Architectures
What is a Modern Enterprise Data Warehouse?
A modern enterprise data warehouse (EDW) is designed to support rapid data growth and quick analytics over relational, non-relational, and streaming data, with a single, easy interface for consuming all of these data types.
Building a Modern EDW on AWS
[Diagram. Stages: Data Ingestion, Data Consumption. Data types: Streaming, Un-structured, Relational. Services: Amazon Kinesis Firehose, Amazon S3, Amazon EMR, Amazon Redshift, Amazon RDS, Amazon Athena, Amazon QuickSight, Amazon Machine Learning. Other labels: Analyze, Predict, Data Mart, EDW, Ad-hoc query]
What is a Real-time Analytics Platform?
A real-time analytics platform provides the ability to load and analyze streaming data and to build custom streaming data applications for specialized needs. Such a platform can handle staggering amounts of streaming data, sometimes TBs per hour, that need to be collected, stored, and processed continuously.
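To put "TBs per hour" in perspective, the sketch below estimates how many Kinesis Streams shards a steady ingest rate would need. It assumes a write limit of roughly 1 MB per second per shard and decimal units, and it ignores the per-shard record-count limit and any headroom for traffic peaks, so treat the result as a lower bound. The function name is hypothetical.

```python
import math

SHARD_WRITE_BYTES_PER_SEC = 10**6  # assumption: about 1 MB/s ingest per shard
BYTES_PER_TB = 10**12              # assumption: decimal terabyte


def shards_needed(tb_per_hour: float) -> int:
    """Minimum shard count for a steady ingest rate, by bandwidth alone."""
    bytes_per_sec = tb_per_hour * BYTES_PER_TB / 3600
    return math.ceil(bytes_per_sec / SHARD_WRITE_BYTES_PER_SEC)


for rate in (1, 3):
    print(f"{rate} TB/hour -> at least {shards_needed(rate)} shards")
```

One TB per hour is about 278 MB every second, which is why collection, storage, and processing all have to run continuously rather than in occasional batches.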
Building a Real-time analytics platform (Streaming vs batching)
Stream Delivery
Stream Analytics
Stream processing
Event Driven
Kinesis Streaming
Simplify Big Data Processing
data → ingest / collect → store → process / analyze → consume / visualize → answers
Time to Answer (Latency)
Throughput
Cost
Amazon QuickSight is a Business Analytics Service that lets business users quickly and easily visualize, explore, and share insights from their data.
Amazon S3 Data Lake
Visualizing the Data Lake
Amazon Athena
DEMO
Credits
Event Sponsored by Amazon Web Services
Venue by Intel
Presented by Arun Kumar Palathumpattu (Cloudwick)
Slides credit: AWS + Cloudwick