Getting Started with Big Data and HPC in the Cloud - August 2015


Transcript of Getting Started with Big Data and HPC in the Cloud - August 2015

  1. 1. Getting Started with Big Data and HPC in the Cloud KD Singh AWS Solutions Architect [email protected]
  2. 2. BIG DATA: Big Data challenges: capacity planning & scalability, high cost & commitment, traditional DWHs, IT complexity, data variety (volume, velocity), old answers & old questions. The cloud answer: lower cost (OpEx), experiment & learn more. Managed services: fully managed, secured & automated services that bring agility & focus. S3, EMR, Kinesis, DynamoDB: collect all data (sensors/IoT, social, images, videos, enterprise apps, documents, web logs) and run complex computations and processing on it, in both real time and batch. Redshift: super-fast, MPP, petabyte-ready analytical data warehouse, available in minutes. Virtually unlimited & elastic resources: no heavy lifting, reduced time to market, parallel processing on demand. Big value: new answers, new questions & business ideas; extract the meaning from all your data & focus on new business ideas, models, etc.
  3. 3. Plethora of Tools Glacier S3 DynamoDB RDS EMR Redshift Data Pipeline Kinesis Cassandra CloudSearch
  4. 4. Ingest Store Analyze Visualize Data Answers Time Simplify Data Analytics Flow Multiple stages Storage decoupled from processing
  5. 5. Glacier S3 DynamoDB RDS Kinesis Spark Streaming EMR Ingest Store Process/Analyze Visualize Data Pipeline Storm Kafka Redshift Cassandra CloudSearch Kinesis Connector Kinesis enabled app App Server Web Server Devices
  6. 6. AWS Big Data Portfolio Collect / Ingest: Kinesis, Import Export, Direct Connect, Amazon SQS. Store: Glacier, S3, DynamoDB, RDS. Process / Analyze: EMR, EC2, Redshift, Data Pipeline. Visualize / Report.
  7. 7. Industries using AWS for data analysis Mobile / Cable Telecom Oil and Gas Industrial Manufacturing Retail/Consumer Entertainment Hospitality Life Sciences Scientific Exploration Financial Services Publishing Media Advertising Online Media Social Network Gaming
  8. 8. Ingest: The act of collecting and storing data
  9. 9. Types of Data Ingest Transactional: database reads/writes, stored in a database. File: media files, log files, stored in cloud storage. Stream: click-stream logs (sets of events), stored in stream storage. Sources: logging frameworks, devices, apps.
  10. 10. Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Redshift, DynamoDB Amazon Kinesis
  11. 11. Kinesis Stream: Managed ability to capture and store data Streams are made of Shards Each Shard ingests data at up to 1 MB/sec and up to 1,000 TPS Each Shard emits up to 2 MB/sec All data is stored for 24 hours Scale Kinesis streams by adding or removing Shards Replay data inside the 24-hour window Kinesis Client Library (KCL): Java client library, available on GitHub
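The per-shard limits quoted above determine how many shards a stream needs. A minimal sizing sketch (the function name is ours; the figures are the 2015-era limits from the slide: 1 MB/sec or 1,000 records/sec in, 2 MB/sec out, per shard):

```python
import math

# Per-shard limits from the slide (2015-era Kinesis defaults).
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0

def shards_needed(in_mb_per_sec, in_records_per_sec, out_mb_per_sec):
    """Return the smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(in_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(in_records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(out_mb_per_sec / EGRESS_MB_PER_SEC),
        1,  # a stream always has at least one shard
    )

# Example: 5 MB/sec in, 3,500 records/sec, consumers reading 10 MB/sec total.
print(shards_needed(5, 3500, 10))  # -> 5 (egress and ingest MB/sec both demand 5)
```

Resharding (adding or removing shards, as the slide notes) is how you track these numbers as throughput changes.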
  12. 12. Sending and Reading Data from Kinesis Streams HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Sending Reading Write Read
  13. 13. AWS Partners for Ingest, Data Load and Transformation: HParser, Big Data Edition, Flume, Sqoop
  14. 14. Storage
  15. 15. Cloud Database and Storage Tier Anti-pattern App/Web Tier Client Tier Database & Storage Tier
  16. 16. Cloud Database and Storage Tier Use the Right Tool for the Job! App/Web Tier Client Tier Data Tier Database & Storage Tier Search Hadoop/HDFS Cache Blob Store SQL NoSQL
  17. 17. Database & Storage Tier Amazon RDSAmazon DynamoDB Amazon ElastiCache Amazon S3 Amazon Glacier Amazon CloudSearch HDFS on Amazon EMR Cloud Database and Storage Tier Use the Right Tool for the Job!
  18. 18. Structured Simple Query NoSQL Amazon DynamoDB Cache Amazon ElastiCache (Memcached, Redis) Structured Complex Query SQL Amazon RDS Data Warehouse Amazon Redshift Search Amazon CloudSearch Unstructured No Query Cloud Storage Amazon S3 Amazon Glacier Unstructured Custom Query Hadoop/HDFS Amazon Elastic Map Reduce DataStructureComplexity Query Structure Complexity What Database and Storage Should I Use?
  19. 19. What Data Store Should I Use? (Hot data to warm data to cold data; columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier)
Average latency: ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume: GB | GB-TBs (no limit) | GB-TB (3 TB max) | GB-TB | GB-PB (~nodes) | GB-PB (no limit) | GB-PB (no limit)
Item size: B-KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB-GB | KB-GB (5 TB max) | GB (40 TB max)
Request rate: Very High | Very High | High | High | Low-Very High | Low-Very High (no limit) | Very Low (no limit)
Storage cost ($/GB/month): ranges from $$ for the hot stores down to $ for the cold stores
Durability: Low-Moderate | Very High | High | High | High | Very High | Very High
  20. 20. Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
  21. 21. Aggregate All Data in S3 Surrounded by a collection of the right tools EMR Kinesis Redshift DynamoDB RDS Data Pipeline Spark Streaming Cassandra Storm S3 No limit on the number of objects Object size up to 5TB Central data storage for all systems High bandwidth 99.999999999% durability Versioning, Lifecycle Policies Glacier Integration
  22. 22. Fully-managed NoSQL database service Built on solid-state drives (SSDs) Consistent low latency performance Any throughput rate No storage limits Amazon DynamoDB
  23. 23. DynamoDB: Managed High Availability and Durability Scaling without down-time Automatic sharding Security inspections, patches, upgrades Automatic hardware failover Multi-AZ replication Hardware configuration designed specifically for DynamoDB Performance tuning
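The automatic sharding mentioned above works by hashing each item's partition key to pick a storage partition. A rough, hypothetical illustration of the idea (this is not DynamoDB's actual hash function, just the general technique):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Hash a partition key and map it onto one of num_partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Different keys spread across partitions; the same key always lands on the
# same partition, which is what makes single-item reads consistently fast.
keys = ["user#1001", "user#1002", "user#1003"]
print([partition_for(k, 4) for k in keys])
print(partition_for("user#1001", 4) == partition_for("user#1001", 4))  # -> True
```

Because items are spread this way, adding partitions behind the scenes lets the service scale without downtime, as the slide describes.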
  24. 24. Relational Databases Fully managed; zero admin MySQL, PostgreSQL, Oracle, SQL Server and Aurora Amazon RDS
  25. 25. Process and Analyze
  26. 26. Characterizing HPC Loosely coupled (embarrassingly parallel): elastic, batch workloads. Tightly coupled: interconnected jobs, network sensitivity, job-specific algorithms; implement MPI, GPU programming. Common concerns: data management, task distribution, workflow management. Supporting services: EC2 instance types, Auto Scaling, CLI, user data, SQL/NoSQL/object stores, Simple Queue Service, Simple Workflow Service.
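The loosely coupled, embarrassingly parallel class above is exactly what elastic compute suits: independent tasks with no inter-node communication. A minimal local sketch using a Monte Carlo estimate of pi, where each worker process is a stand-in for an EC2 instance:

```python
import random
from multiprocessing import Pool

def count_hits(args):
    """One independent task: sample random points in the unit square and
    count those falling inside the quarter circle. No coordination needed."""
    seed, samples = args
    rng = random.Random(seed)  # per-task seed keeps the run reproducible
    return sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

if __name__ == "__main__":
    workers, samples_each = 4, 100_000
    with Pool(workers) as pool:
        hits = sum(pool.map(count_hits,
                            [(seed, samples_each) for seed in range(workers)]))
    # hits / total samples approximates pi / 4
    pi_estimate = 4.0 * hits / (workers * samples_each)
    print(round(pi_estimate, 2))  # close to 3.14
```

Because tasks share nothing, doubling the workers roughly halves the wall-clock time, which is why batch workloads like these map cleanly onto Auto Scaling groups.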
  27. 27. Tightly coupled Compute and Memory optimized instances Implement HVM process execution Intel Xeon processors 10 Gigabit Ethernet with Enhanced Networking, SR-IOV c4.8xlarge 36 vCPUs 2.9 GHz Intel Xeon E5-2666v3 Haswell 60 GB RAM 4000 Mbps dedicated EBS throughput r3.8xlarge 32 vCPUs 2.8 GHz Intel Xeon E5-2670v2 Ivy Bridge 244GB RAM 2 x 320 GB Local SSD
  28. 28. GPU Processing CG1 instances Intel Xeon X5570 processors 2 x NVIDIA Tesla M2050 GPUs CUDA, OpenCL G2 instances Intel Xeon E5-2670 processors 4 x NVIDIA GRID K520 GPUs Each GPU with 1536 CUDA cores and 4GB of video memory DirectX, OpenGL
  29. 29. Network Placement Groups Cluster instances deployed in a Placement Group enjoy low latency, full bisection 10 Gbps bandwidth Connect multiple Placement Groups to create very large clusters Enhanced networking with SR-IOV provides higher packet per second (PPS) performance, lower inter-instance latencies, and very low network jitter 10Gbps
  30. 30. Low cost with flexible pricing Efficient clusters Unlimited infrastructure Faster time to results Concurrent Clusters on-demand Increased collaboration Why AWS for HPC?
  31. 31. Customers running HPC workloads on AWS
  32. 32. Popular HPC workloads on AWS Genome processing Modeling and Simulation Government and Educational Research Monte Carlo Simulations Transcoding and Encoding Computational Chemistry
  33. 33. Across several key industry verticals Utilities Biopharma Materials Design Manufacturing Academic Research Auto & Aerospace
  34. 34. TOP500: 76th fastest supercomputer on demand June 2014 Top 500 list: 484.2 TFlop/s, 26,496 cores in a cluster of EC2 C3 instances (Intel Xeon E5-2680v2 10C 2.800 GHz processors), LINPACK benchmark
  35. 35. Three types of data-driven development Retrospective/ Batch analysis and reporting Here-and-now/ Stream real-time processing and dashboards Predictions to enable smart applications Amazon Kinesis Amazon EC2 AWS Lambda Amazon Redshift Amazon RDS Amazon EMR Amazon Machine Learning
  36. 36. Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully-managed Very cost-effective Amazon Redshift
  37. 37. Amazon Redshift architecture Leader Node SQL endpoint Stores metadata Coordinates query execution Compute Nodes Local, columnar storage Execute queries in parallel Load, backup, restore via Amazon S3 Parallel load from Amazon DynamoDB Hardware optimized for data processing Two hardware platforms DW1: HDD; scale from 2TB to 1.6PB DW2: SSD; scale from 160GB to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  38. 38. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes With row storage you do unnecessary I/O: to get the total amount, you have to read everything.
ID | Age | State | Amount
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375
  39. 39. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes With column storage, you only read the data you need.
ID | Age | State | Amount
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375
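The claim above can be made concrete: to sum one column, a row store must touch every field of every row, while a column store reads only that column. A small sketch counting bytes touched, using the four-row table from the slides:

```python
# The slide's table: each row holds (id, age, state, amount).
rows = [(123, 20, "CA", 500), (345, 25, "WA", 250),
        (678, 40, "FL", 125), (957, 37, "WA", 375)]

def field_bytes(value):
    """Crude stand-in for on-disk size: length of the textual value."""
    return len(str(value))

# Row storage: summing `amount` still scans every field of every row.
row_store_bytes = sum(field_bytes(v) for row in rows for v in row)

# Column storage: only the `amount` column is read.
amount_column = [row[3] for row in rows]
col_store_bytes = sum(field_bytes(v) for v in amount_column)

total = sum(amount_column)
print(total, row_store_bytes, col_store_bytes)  # same answer, far fewer bytes read
```

With real tables holding dozens of columns, the I/O gap is proportionally larger, and columnar compression (next slide) widens it further.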
  40. 40. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Columnar compression saves space & reduces I/O Amazon Redshift analyzes and compresses your data analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw
  41. 41. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Zone maps track the minimum and maximum value for each block, so Redshift can skip over blocks that don't contain the data needed for a given query, minimizing unnecessary I/O.
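Zone maps can be sketched in a few lines: keep a per-block (min, max), and skip any block whose range cannot contain the predicate value. A hypothetical illustration (the block layout and values are invented for the example):

```python
# Hypothetical blocks of a roughly sorted column, each with its zone map.
blocks = [
    {"min": 1,   "max": 100, "values": [5, 40, 99]},
    {"min": 101, "max": 200, "values": [110, 150]},
    {"min": 201, "max": 300, "values": [250, 299]},
]

def scan_equals(target):
    """Read only blocks whose [min, max] range could contain target."""
    blocks_read, hits = 0, []
    for block in blocks:
        if block["min"] <= target <= block["max"]:
            blocks_read += 1
            hits.extend(v for v in block["values"] if v == target)
        # else: the zone map proves target cannot be here -- no I/O at all
    return blocks_read, hits

print(scan_equals(150))  # -> (1, [150]): two of the three blocks are skipped
```

The better sorted the data, the tighter each block's range and the more blocks a selective query can skip.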
  42. 42. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Use direct-attached storage to maximize throughput Hardware optimized for high performance data processing Large block sizes to make the most of each read Amazon Redshift manages durability for you
  43. 43. Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  44. 44. How Does EMR Work? 1. Put the data into S3. 2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase. 3. Launch the cluster using the EMR console, CLI, SDK, or APIs. 4. Get the output from S3.
  45. 45. EMR EMR Cluster S3 You can easily resize the cluster And launch parallel clusters using the same data How Does EMR Work?
  46. 46. EMR EMR Cluster S3 Use Spot nodes to save time and money How Does EMR Work?
  47. 47. The Hadoop Ecosystem works inside of EMR
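The Hadoop engine under EMR runs the map/shuffle/reduce pattern, and tools like Hive and Pig compile queries down to it. A local sketch of that pattern in the shape of a streaming-style word count (no Hadoop required; the function names are ours):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(pairs):
    """Shuffle + reduce phase: group pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data on AWS", "big data and HPC"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = reduce_counts(pairs)
print(counts)  # {'big': 2, 'data': 2, 'on': 1, 'aws': 1, 'and': 1, 'hpc': 1}
```

On a real cluster the map tasks run on many nodes against S3 or HDFS splits, and the shuffle moves pairs over the network, but the contract is the same.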
  48. 48. Easy to use, managed machine learning service built for developers Robust, powerful machine learning technology based on Amazon's internal systems Create models using your data already stored in the AWS cloud Deploy models to production in seconds Amazon Machine Learning
  49. 49. Three Supported Types of Predictions Binary Classification Predict the answer to a Yes/No question Multi-class classification Predict the correct category from a list Regression Predict the value of a numeric variable
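The regression type above predicts a numeric variable by fitting a model to labeled examples. A minimal self-contained sketch of the idea using ordinary least squares (the data and variable names are illustrative only, not from any AWS service):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Illustrative training data, e.g. ad spend (x) vs. revenue (y).
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.0, 8.1]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)          # roughly 2.01 and 0.0
print(slope * 5.0 + intercept)   # predicted numeric value for a new x = 5
```

Binary and multiclass classification follow the same train-then-predict shape; only the output type (a yes/no label or a category instead of a number) differs.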
  50. 50. Partners: Advanced Analytics (scientific, algorithmic, predictive, etc.)
  51. 51. Visualize
  52. 52. AWS Partners for BI & Data Visualization
  53. 53. Putting it all together
  54. 54. Data Temperature vs Query Latency Hot data, real-time answers (low query latency): Amazon Kinesis / Kafka via Spark Streaming, Apache Storm, or a native client; Amazon DynamoDB via Spark, Impala, Presto, or a native client. Warm data: HDFS via Hive, Spark, or Presto; Amazon Redshift. Cold data, batch answers (higher query latency): Amazon S3 via Hive or Amazon Redshift. As data moves from hot to cold, acceptable query latency moves from low to high.
  55. 55. Demo: logs flow through Amazon EMR as an ETL grid and analysis layer into an Amazon Redshift production DWH, feeding visualization of traffic statistics.
  56. 56. Video Recording of Demo
  57. 57. https://aws.amazon.com/big-data/ https://aws.amazon.com/testdrive/bigdata/ https://aws.amazon.com/training/course-descriptions/bigdata/ https://aws.amazon.com/hpc/ Thank You!