Getting Started with Big Data and HPC in the Cloud - August 2015
Getting Started with Big Data and HPC in the Cloud KD Singh AWS Solutions Architect [email protected]
- 1. Getting Started with Big Data and HPC in the Cloud KD Singh AWS Solutions Architect [email protected]
- 2. BIG DATA. Challenges: capacity planning and scalability; the high cost and commitment of traditional DWHs; IT complexity; data variety, volume, and velocity; old answers to old questions. What the cloud offers: lower cost and OpEx, so you can experiment and learn more; managed services (S3, EMR, Kinesis, DynamoDB) that are fully managed, secured, and automated, bringing agility and focus; collect all data from sensors (IoT), social, images, videos, enterprise apps, documents, and web logs, and run complex computations and processing in both real time and batch; big value with Redshift, a super-fast, MPP, petabyte-ready analytical data warehouse available in minutes; virtually unlimited, elastic resources; no heavy lifting and reduced time to market with parallel processing on demand; new answers, new questions, and new business ideas: extract the meaning from all your data and focus on new business ideas, models, etc.
- 3. Plethora of Tools Glacier S3 DynamoDB RDS EMR Redshift Data Pipeline Kinesis Cassandra CloudSearch
- 4. Ingest Store Analyze Visualize Data Answers Time Simplify Data Analytics Flow Multiple stages Storage decoupled from processing
- 5. Glacier S3 DynamoDB RDS Kinesis Spark Streaming EMR Ingest Store Process/Analyze Visualize Data Pipeline Storm Kafka Redshift Cassandra CloudSearch Kinesis Connector Kinesis enabled app App Server Web Server Devices
- 6. AWS Big Data Portfolio. Collect / Ingest: Kinesis, Import Export, Direct Connect, Amazon SQS. Store: Glacier, S3, DynamoDB, RDS. Process / Analyze: EMR, EC2, Redshift, Data Pipeline. Visualize / Report.
- 7. Industries using AWS for data analysis Mobile / Cable Telecom Oil and Gas Industrial Manufacturing Retail/Consumer Entertainment Hospitality Life Sciences Scientific Exploration Financial Services Publishing Media Advertising Online Media Social Network Gaming
- 8. Ingest: The act of collecting and storing data
- 9. Types of Data Ingest: Transactional (database reads/writes), File (media files, log files), Stream (click-stream logs, i.e., sets of events). Destinations: Database, Cloud Storage, Stream Storage. Sources: logging frameworks, devices, apps.
- 10. Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Redshift, DynamoDB Amazon Kinesis
- 11. Kinesis Stream: managed ability to capture and store data. Streams are made of Shards. Each Shard ingests up to 1 MB/sec and up to 1,000 TPS; each Shard emits up to 2 MB/sec. All data is stored for 24 hours. Scale Kinesis streams by adding or removing Shards; replay data inside the 24-hour window. Kinesis Client Library (KCL): Java client library, available on GitHub.
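The per-shard limits above (1 MB/sec or 1,000 records/sec in, 2 MB/sec out) determine how many shards a stream needs. A minimal sizing sketch in Python; the workload numbers in the example are illustrative, not from the slides:

```python
import math

def shards_needed(in_mb_per_sec, records_per_sec, out_mb_per_sec):
    """Smallest shard count satisfying all three per-shard limits:
    1 MB/s ingest, 1,000 records/s ingest, 2 MB/s egress."""
    return max(
        math.ceil(in_mb_per_sec / 1.0),
        math.ceil(records_per_sec / 1000.0),
        math.ceil(out_mb_per_sec / 2.0),
    )

# Hypothetical workload: 5 MB/s in, 3,500 records/s, 8 MB/s out.
# Ingest bandwidth is the binding limit here.
print(shards_needed(5, 3500, 8))  # -> 5
```

Resharding (adding or removing shards) is how a stream tracks a changing workload, since the per-shard limits themselves are fixed.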
- 12. Sending and Reading Data from Kinesis Streams HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Sending Reading Write Read
- 13. AWS Partners for Ingest, Data Load and Transformation: HParser (Big Data Edition), Flume, Sqoop
- 14. Storage
- 15. Cloud Database and Storage Tier Anti-pattern App/Web Tier Client Tier Database & Storage Tier
- 16. Cloud Database and Storage Tier Use the Right Tool for the Job! App/Web Tier Client Tier Data Tier Database & Storage Tier Search Hadoop/HDFS Cache Blob Store SQL NoSQL
- 17. Cloud Database and Storage Tier: Use the Right Tool for the Job! Database & Storage Tier: Amazon RDS, Amazon DynamoDB, Amazon ElastiCache, Amazon S3, Amazon Glacier, Amazon CloudSearch, HDFS on Amazon EMR
- 18. What Database and Storage Should I Use? Plot data structure complexity against query structure complexity. Structured, simple query: NoSQL (Amazon DynamoDB), cache (Amazon ElastiCache: Memcached, Redis). Structured, complex query: SQL (Amazon RDS), data warehouse (Amazon Redshift), search (Amazon CloudSearch). Unstructured, no query: cloud storage (Amazon S3, Amazon Glacier). Unstructured, custom query: Hadoop/HDFS (Amazon Elastic MapReduce).
- 19. What Data Store Should I Use? (Columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier)
Average latency: ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume: GB | GB-TBs (no limit) | GB-TB (3 TB max) | GB-TB | GB-PB (~nodes) | GB-PB (no limit) | GB-PB (no limit)
Item size: B-KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB-GB | KB-GB (5 TB max) | GB (40 TB max)
Request rate: Very High | Very High | High | High | Low-Very High | Low-Very High (no limit) | Very Low (no limit)
Storage cost ($/GB/month): $$ | $
Durability: Low-Moderate | Very High | High | High | High | Very High | Very High
Hot Data / Warm Data / Cold Data
- 20. Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
- 21. Aggregate All Data in S3 Surrounded by a collection of the right tools EMR Kinesis Redshift DynamoDB RDS Data Pipeline Spark Streaming Cassandra Storm S3 No limit on the number of objects Object size up to 5TB Central data storage for all systems High bandwidth 99.999999999% durability Versioning, Lifecycle Policies Glacier Integration
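The Glacier integration mentioned above is driven by S3 lifecycle rules. A sketch that builds the rule document in the shape the S3 PutBucketLifecycleConfiguration API accepts; the bucket name, prefix, and 90-day threshold are hypothetical:

```python
def glacier_lifecycle_rule(prefix, days):
    """Lifecycle configuration that transitions objects under `prefix`
    to Glacier after `days` days (shape of the S3
    PutBucketLifecycleConfiguration request body)."""
    return {
        "Rules": [
            {
                "ID": f"archive-{prefix.strip('/')}",
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }

config = glacier_lifecycle_rule("logs/", 90)
# With boto3 this would be applied roughly as (bucket name hypothetical):
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket="my-data-lake", LifecycleConfiguration=config)
```

Rules like this are what let S3 act as the hot/warm tier while Glacier holds the cold tail, with no application code moving objects.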
- 22. Fully-managed NoSQL database service Built on solid-state drives (SSDs) Consistent low latency performance Any throughput rate No storage limits Amazon DynamoDB
- 23. DynamoDB: Managed High Availability and Durability Scaling without down-time Automatic sharding Security inspections, patches, upgrades Automatic hardware failover Multi-AZ replication Hardware configuration designed specifically for DynamoDB Performance tuning
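DynamoDB's low-level API represents every attribute as a typed value ({"S": ...} for strings, {"N": ...} for numbers). A minimal conversion sketch, covering only those two types; the table and item names are hypothetical:

```python
def to_dynamodb_item(item):
    """Convert a flat Python dict to DynamoDB's low-level attribute-value
    format (strings and numbers only, for brevity)."""
    out = {}
    for key, value in item.items():
        if isinstance(value, (int, float)):
            out[key] = {"N": str(value)}  # numbers travel as strings
        else:
            out[key] = {"S": str(value)}
    return out

wire = to_dynamodb_item({"user_id": "u42", "score": 1250})
# boto3 usage (table name hypothetical):
#   boto3.client("dynamodb").put_item(TableName="game_scores", Item=wire)
```

The higher-level boto3 `Table` resource hides this encoding, but the wire format is what the service itself stores and shards.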
- 24. Relational Databases Fully managed; zero admin MySQL, PostgreSQL, Oracle, SQL Server and Aurora Amazon RDS
- 25. Process and Analyze
- 26. Characterizing HPC. Loosely coupled: embarrassingly parallel, elastic, batch workloads. Tightly coupled: interconnected jobs, network sensitivity, job-specific algorithms. Common needs: data management, task distribution, workflow management. Supporting services: EC2 instance types, Auto Scaling, CLI, user data, SQL/NoSQL/object stores, Simple Queue Service, Simple Workflow Service, MPI implementations, GPU programming.
- 27. Tightly coupled: compute- and memory-optimized instances, HVM virtualization, Intel Xeon processors, 10 Gigabit Ethernet with Enhanced Networking (SR-IOV). c4.8xlarge: 36 vCPUs, 2.9 GHz Intel Xeon E5-2666 v3 (Haswell), 60 GB RAM, 4,000 Mbps dedicated EBS throughput. r3.8xlarge: 32 vCPUs, 2.5 GHz Intel Xeon E5-2670 v2 (Ivy Bridge), 244 GB RAM, 2 x 320 GB local SSD.
- 28. GPU Processing CG1 instances Intel Xeon X5570 processors 2 x NVIDIA Tesla M2050 GPUs CUDA, OpenCL G2 instances Intel Xeon E5-2670 processors 4 x NVIDIA GRID K520 GPUs Each GPU with 1536 CUDA cores and 4GB of video memory DirectX, OpenGL
- 29. Network Placement Groups: cluster instances deployed in a Placement Group enjoy low-latency, full-bisection 10 Gbps bandwidth. Connect multiple Placement Groups to create very large clusters. Enhanced networking with SR-IOV provides higher packet-per-second (PPS) performance, lower inter-instance latencies, and very low network jitter.
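Placement is requested at launch time by naming the group in the RunInstances call. A sketch of the request parameters for launching a tightly coupled cluster into one group; the AMI ID, group name, and counts are placeholders:

```python
def cluster_instance_params(group_name, instance_type, count):
    """Parameters for launching instances into a cluster placement group
    (shape of the EC2 RunInstances call; AMI ID is a placeholder)."""
    return {
        "ImageId": "ami-xxxxxxxx",        # placeholder AMI
        "InstanceType": instance_type,
        "MinCount": count,                # all-or-nothing launch keeps
        "MaxCount": count,                # the cluster co-located
        "Placement": {"GroupName": group_name},
    }

params = cluster_instance_params("hpc-cluster", "c4.8xlarge", 16)
# With boto3: boto3.client("ec2").run_instances(**params)
```

Setting MinCount equal to MaxCount avoids a partial cluster when capacity in the group is tight.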
- 30. Low cost with flexible pricing Efficient clusters Unlimited infrastructure Faster time to results Concurrent Clusters on-demand Increased collaboration Why AWS for HPC?
- 31. Customers running HPC workloads on AWS
- 32. Popular HPC workloads on AWS Genome processing Modeling and Simulation Government and Educational Research Monte Carlo Simulations Transcoding and Encoding Computational Chemistry
- 33. Across several key industry verticals Utilities Biopharma Materials Design Manufacturing Academic Research Auto & Aerospace
- 34. TOP500: 76th fastest supercomputer, on demand. June 2014 Top 500 list: 484.2 TFlop/s on 26,496 cores in a cluster of EC2 C3 instances (Intel Xeon E5-2680 v2 10C 2.800 GHz processors), LINPACK benchmark.
- 35. Three types of data-driven development Retrospective/ Batch analysis and reporting Here-and-now/ Stream real-time processing and dashboards Predictions to enable smart applications Amazon Kinesis Amazon EC2 AWS Lambda Amazon Redshift Amazon RDS Amazon EMR Amazon Machine Learning
- 36. Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully-managed Very cost-effective Amazon Redshift
- 37. Amazon Redshift architecture. Leader Node: SQL endpoint, stores metadata, coordinates query execution. Compute Nodes: local, columnar storage; execute queries in parallel; load, backup, restore via Amazon S3; parallel load from Amazon DynamoDB. Hardware optimized for data processing, with two hardware platforms: DW1 (HDD, scale from 2 TB to 1.6 PB) and DW2 (SSD, scale from 160 GB to 256 TB). 10 GigE (HPC) interconnect; ingestion, backup, restore; JDBC/ODBC.
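The parallel load from S3 mentioned above is issued as a single COPY statement, which the compute nodes execute in parallel across the input files. A sketch that composes such a statement; the table name, S3 path, role ARN, and CSV/GZIP format options are illustrative, not from the slides:

```python
def redshift_copy_statement(table, s3_path, iam_role):
    """Compose a Redshift COPY statement loading `table` in parallel
    from Amazon S3 (all arguments are illustrative placeholders)."""
    return (
        f"COPY {table} FROM '{s3_path}' "
        f"IAM_ROLE '{iam_role}' "
        "FORMAT AS CSV GZIP;"
    )

sql = redshift_copy_statement(
    "listing",
    "s3://my-bucket/listing/",
    "arn:aws:iam::123456789012:role/redshift-load",
)
```

Because COPY reads every file under the prefix, splitting input into roughly one file per slice keeps all compute nodes busy during the load.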
- 38. Amazon Redshift dramatically reduces I/O: data compression, zone maps, direct-attached storage, large data block sizes. With row storage you do unnecessary I/O: to get the total amount, you have to read everything.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
- 39. Amazon Redshift dramatically reduces I/O: data compression, zone maps, direct-attached storage, large data block sizes. With column storage, you only read the data you need.
ID  | Age | State | Amount
123 | 20  | CA    | 500
345 | 25  | WA    | 250
678 | 40  | FL    | 125
957 | 37  | WA    | 375
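The row-versus-column trade-off can be made concrete: summing the Amount column touches every field in row layout but only one array in columnar layout. A sketch using the slides' sample table:

```python
# The sample table in both physical layouts.
rows = [
    (123, 20, "CA", 500),
    (345, 25, "WA", 250),
    (678, 40, "FL", 125),
    (957, 37, "WA", 375),
]
columns = {
    "id":     [123, 345, 678, 957],
    "age":    [20, 25, 40, 37],
    "state":  ["CA", "WA", "FL", "WA"],
    "amount": [500, 250, 125, 375],
}

# Row storage: totaling one column still reads every field of every row.
fields_read_row = sum(len(r) for r in rows)   # 16 values touched
total_row = sum(r[3] for r in rows)

# Column storage: only the one column is read.
fields_read_col = len(columns["amount"])      # 4 values touched
total_col = sum(columns["amount"])
```

On four short rows the 4x difference is trivial; on a petabyte table with wide rows it is the difference between scanning the table and scanning one column.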
- 40. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage, large data block sizes. Columnar compression saves space and reduces I/O; Amazon Redshift analyzes and compresses your data:
analyze compression listing;
Table   | Column         | Encoding
listing | listid         | delta
listing | sellerid       | delta32k
listing | eventid        | delta32k
listing | dateid         | bytedict
listing | numtickets     | bytedict
listing | priceperticket | delta32k
listing | totalprice     | mostly32
listing | listtime       | raw
- 41. Amazon Redshift dramatically reduces I/O: column storage, data compression, zone maps, direct-attached storage, large data block sizes. Zone maps track the minimum and maximum value for each block, so Redshift can skip over blocks that don't contain the data needed for a given query, minimizing unnecessary I/O.
- 42. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Use direct-attached storage to maximize throughput Hardware optimized for high performance data processing Large block sizes to make the most of each read Amazon Redshift manages durability for you
- 43. Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
- 44. How Does EMR Work? 1. Put the data into S3. 2. Choose: Hadoop distribution, number of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase. 3. Launch the cluster using the EMR console, CLI, SDK, or APIs. 4. Get the output from S3.
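Steps 2 and 3 above correspond to one API call, RunJobFlow, whose request bundles the application list and cluster shape. A sketch of that request body; the release label, instance types, node count, and log bucket are illustrative choices, not from the slides:

```python
def emr_cluster_params(name, node_count, log_uri):
    """Shape of an EMR RunJobFlow request for a Hive/Pig cluster
    (release label, instance types, and log bucket are illustrative)."""
    return {
        "Name": name,
        "LogUri": log_uri,
        "ReleaseLabel": "emr-4.0.0",
        "Applications": [{"Name": "Hive"}, {"Name": "Pig"}],
        "Instances": {
            "MasterInstanceType": "m3.xlarge",
            "SlaveInstanceType": "m3.xlarge",
            "InstanceCount": node_count,
        },
    }

params = emr_cluster_params("log-analysis", 5, "s3://my-logs/emr/")
# With boto3: boto3.client("emr").run_job_flow(**params)
```

Resizing the running cluster, or launching a second one over the same S3 data, reuses the same request shape with a different InstanceCount.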
- 45. EMR EMR Cluster S3 You can easily resize the cluster And launch parallel clusters using the same data How Does EMR Work?
- 46. EMR EMR Cluster S3 Use Spot nodes to save time and money How Does EMR Work?
- 47. The Hadoop Ecosystem works inside of EMR
- 48. Easy-to-use, managed machine learning service built for developers. Robust, powerful machine learning technology based on Amazon's internal systems. Create models using your data already stored in the AWS cloud. Deploy models to production in seconds. Amazon Machine Learning
- 49. Three Supported Types of Predictions Binary Classification Predict the answer to a Yes/No question Multi-class classification Predict the correct category from a list Regression Predict the value of a numeric variable
- 50. Partners: Advanced Analytics (scientific, algorithmic, predictive, etc.)
- 51. Visualize
- 52. AWS Partners for BI & Data Visualization
- 53. Putting it all together
- 54. Data Temperature vs. Query Latency: Real-Time and Batch Answers. As data cools from hot to cold, query latency rises from low to high. Hot: Amazon Kinesis/Kafka and Amazon DynamoDB, queried via Spark Streaming, Apache Storm, or a native client. Warm: Amazon Redshift, queried via Spark, Impala, Presto, or Hive. Cold: Amazon S3 and HDFS, queried via Hive.
- 55. Demo: Amazon EMR as ETL Grid and Analysis. Logs and traffic statistics flow through Amazon EMR into an Amazon Redshift production DWH for visualization.
- 56. Video Recording of Demo
- 57. https://aws.amazon.com/big-data/ https://aws.amazon.com/testdrive/bigdata/ https://aws.amazon.com/training/course-descriptions/bigdata/ https://aws.amazon.com/hpc/ Thank You!