Getting Started with Big Data and HPC in the Cloud - August 2015


Transcript of Getting Started with Big Data and HPC in the Cloud - August 2015

  1. 1. Getting Started with Big Data and HPC in the Cloud KD Singh AWS Solutions Architect [email protected]
  2. 2. BIG DATA: Big Data challenges: capacity planning & scalability, high cost & commitment, traditional DWHs, IT complexity, data variety (volume, velocity), old answers & old questions. The cloud answer: lower cost (OpEx), experiment & learn more. Managed services: fully managed, secured & automated services that bring agility & focus. S3, EMR, Kinesis, DynamoDB: collect all data (sensors/IoT, social, images, videos, enterprise apps, documents, web logs) and run complex computations and processing on it, in both real time and batch. Redshift: super-fast, MPP, petabyte-ready analytical data warehouse, available in minutes. Virtually unlimited & elastic resources: no heavy lifting, reduced time to market, parallel processing on demand. Big value: new answers, new questions & business ideas; extract the meaning from all your data & focus on new business ideas, models, etc.
  3. 3. Plethora of Tools Glacier S3 DynamoDB RDS EMR Redshift Data Pipeline Kinesis Cassandra CloudSearch
  4. 4. Ingest Store Analyze Visualize Data Answers Time Simplify Data Analytics Flow Multiple stages Storage decoupled from processing
  5. 5. Glacier S3 DynamoDB RDS Kinesis Spark Streaming EMR Ingest Store Process/Analyze Visualize Data Pipeline Storm Kafka Redshift Cassandra CloudSearch Kinesis Connector Kinesis enabled app App Server Web Server Devices
  6. 6. AWS Big Data Portfolio Collect / Ingest: Kinesis, Import Export, Direct Connect, Amazon SQS. Store: Glacier, S3, DynamoDB, RDS. Process / Analyze: EMR, EC2, Redshift, Data Pipeline. Visualize / Report.
  7. 7. Industries using AWS for data analysis Mobile / Cable Telecom Oil and Gas Industrial Manufacturing Retail/Consumer Entertainment Hospitality Life Sciences Scientific Exploration Financial Services Publishing Media Advertising Online Media Social Network Gaming
  8. 8. Ingest: The act of collecting and storing data
  9. 9. Types of Data Ingest Transactional: database reads/writes, stored in a database. File: media files, log files, stored in cloud storage. Stream: click-stream logs (sets of events), stored in stream storage. Sources: logging frameworks, devices, apps.
  10. 10. Real-time processing of streaming data High throughput Elastic Easy to use Connectors for EMR, S3, Redshift, DynamoDB Amazon Kinesis
  11. 11. Kinesis Stream: Managed ability to capture and store data Streams are made of Shards Each Shard ingests data at up to 1 MB/sec and up to 1,000 TPS Each Shard emits up to 2 MB/sec All data is stored for 24 hours Scale Kinesis streams by adding or removing Shards Replay data inside the 24-hour window Kinesis Client Library (KCL): Java client library, available on GitHub
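The per-shard limits quoted above determine how many shards a stream needs. A minimal sizing sketch (the function name is ours; the figures are the 2015-era limits from the slide: 1 MB/sec or 1,000 records/sec in, 2 MB/sec out, per shard):

```python
import math

# Per-shard limits from the slide (2015-era Kinesis defaults).
INGEST_MB_PER_SEC = 1.0
INGEST_RECORDS_PER_SEC = 1000
EGRESS_MB_PER_SEC = 2.0

def shards_needed(in_mb_per_sec, in_records_per_sec, out_mb_per_sec):
    """Return the smallest shard count satisfying all three per-shard limits."""
    return max(
        math.ceil(in_mb_per_sec / INGEST_MB_PER_SEC),
        math.ceil(in_records_per_sec / INGEST_RECORDS_PER_SEC),
        math.ceil(out_mb_per_sec / EGRESS_MB_PER_SEC),
        1,  # a stream always has at least one shard
    )

# Example: 5 MB/sec in, 3,500 records/sec, consumers reading 10 MB/sec total.
print(shards_needed(5, 3500, 10))  # -> 5 (egress and ingest MB/sec both demand 5)
```

Resharding (adding or removing shards, as the slide notes) is how you track these numbers as throughput changes.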
  12. 12. Sending and Reading Data from Kinesis Streams HTTP Post AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Sending Reading Write Read
  13. 13. AWS Partners for Ingest, Data Load and Transformation: HParser, Big Data Edition, Flume, Sqoop
  14. 14. Storage
  15. 15. Cloud Database and Storage Tier Anti-pattern App/Web Tier Client Tier Database & Storage Tier
  16. 16. Cloud Database and Storage Tier Use the Right Tool for the Job! App/Web Tier Client Tier Data Tier Database & Storage Tier Search Hadoop/HDFS Cache Blob Store SQL NoSQL
  17. 17. Database & Storage Tier Amazon RDSAmazon DynamoDB Amazon ElastiCache Amazon S3 Amazon Glacier Amazon CloudSearch HDFS on Amazon EMR Cloud Database and Storage Tier Use the Right Tool for the Job!
  18. 18. Structured Simple Query NoSQL Amazon DynamoDB Cache Amazon ElastiCache (Memcached, Redis) Structured Complex Query SQL Amazon RDS Data Warehouse Amazon Redshift Search Amazon CloudSearch Unstructured No Query Cloud Storage Amazon S3 Amazon Glacier Unstructured Custom Query Hadoop/HDFS Amazon Elastic Map Reduce DataStructureComplexity Query Structure Complexity What Database and Storage Should I Use?
  19. 19. What Data Store Should I Use? (Hot data to warm data to cold data; columns: Amazon ElastiCache | Amazon DynamoDB | Amazon RDS | Amazon CloudSearch | Amazon EMR (HDFS) | Amazon S3 | Amazon Glacier)
Average latency: ms | ms | ms, sec | ms, sec | sec, min, hrs | ms, sec, min (~size) | hrs
Data volume: GB | GB-TBs (no limit) | GB-TB (3 TB max) | GB-TB | GB-PB (~nodes) | GB-PB (no limit) | GB-PB (no limit)
Item size: B-KB | KB (64 KB max) | KB (~row size) | KB (1 MB max) | MB-GB | KB-GB (5 TB max) | GB (40 TB max)
Request rate: Very High | Very High | High | High | Low-Very High | Low-Very High (no limit) | Very Low (no limit)
Storage cost ($/GB/month): ranges from $$ for the hot stores down to $ for the cold stores
Durability: Low-Moderate | Very High | High | High | High | Very High | Very High
  20. 20. Store anything Object storage Scalable Designed for 99.999999999% durability Amazon S3
  21. 21. Aggregate All Data in S3 Surrounded by a collection of the right tools EMR Kinesis Redshift DynamoDB RDS Data Pipeline Spark Streaming Cassandra Storm S3 No limit on the number of objects Object size up to 5TB Central data storage for all systems High bandwidth 99.999999999% durability Versioning, Lifecycle Policies Glacier Integration
  22. 22. Fully-managed NoSQL database service Built on solid-state drives (SSDs) Consistent low latency performance Any throughput rate No storage limits Amazon DynamoDB
  23. 23. DynamoDB: Managed High Availability and Durability Scaling without down-time Automatic sharding Security inspections, patches, upgrades Automatic hardware failover Multi-AZ replication Hardware configuration designed specifically for DynamoDB Performance tuning
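The automatic sharding mentioned above works by hashing each item's partition key to pick a storage partition. A rough, hypothetical illustration of the idea (this is not DynamoDB's actual hash function, just the general technique):

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Hash a partition key and map it onto one of num_partitions."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Different keys spread across partitions; the same key always lands on the
# same partition, which is what makes single-item reads consistently fast.
keys = ["user#1001", "user#1002", "user#1003"]
print([partition_for(k, 4) for k in keys])
print(partition_for("user#1001", 4) == partition_for("user#1001", 4))  # -> True
```

Because items are spread this way, adding partitions behind the scenes lets the service scale without downtime, as the slide describes.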
  24. 24. Relational Databases Fully managed; zero admin MySQL, PostgreSQL, Oracle, SQL Server and Aurora Amazon RDS
  25. 25. Process and Analyze
  26. 26. Characterizing HPC Loosely coupled (embarrassingly parallel): elastic, batch workloads. Tightly coupled: interconnected jobs, network sensitivity, job-specific algorithms; implement MPI, GPU programming. Common concerns: data management, task distribution, workflow management. Supporting services: EC2 instance types, Auto Scaling, CLI, user data, SQL/NoSQL/object stores, Simple Queue Service, Simple Workflow Service.
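The loosely coupled, embarrassingly parallel class above is exactly what elastic compute suits: independent tasks with no inter-node communication. A minimal local sketch using a Monte Carlo estimate of pi, where each worker process is a stand-in for an EC2 instance:

```python
import random
from multiprocessing import Pool

def count_hits(args):
    """One independent task: sample random points in the unit square and
    count those falling inside the quarter circle. No coordination needed."""
    seed, samples = args
    rng = random.Random(seed)  # per-task seed keeps the run reproducible
    return sum(1 for _ in range(samples)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

if __name__ == "__main__":
    workers, samples_each = 4, 100_000
    with Pool(workers) as pool:
        hits = sum(pool.map(count_hits,
                            [(seed, samples_each) for seed in range(workers)]))
    # hits / total samples approximates pi / 4
    pi_estimate = 4.0 * hits / (workers * samples_each)
    print(round(pi_estimate, 2))  # close to 3.14
```

Because tasks share nothing, doubling the workers roughly halves the wall-clock time, which is why batch workloads like these map cleanly onto Auto Scaling groups.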
  27. 27. Tightly coupled Compute and Memory optimized instances Implement HVM process execution Intel Xeon processors 10 Gigabit Ethernet with Enhanced Networking, SR-IOV c4.8xlarge 36 vCPUs 2.9 GHz Intel Xeon E5-2666v3 Haswell 60 GB RAM 4000 Mbps dedicated EBS throughput r3.8xlarge 32 vCPUs 2.8 GHz Intel Xeon E5-2670v2 Ivy Bridge 244GB RAM 2 x 320 GB Local SSD
  28. 28. GPU Processing CG1 instances Intel Xeon X5570 processors 2 x NVIDIA Tesla M2050 GPUs CUDA, OpenCL G2 instances Intel Xeon E5-2670 processors 4 x NVIDIA GRID K520 GPUs Each GPU with 1536 CUDA cores and 4GB of video memory DirectX, OpenGL
  29. 29. Network Placement Groups Cluster instances deployed in a Placement Group enjoy low latency, full bisection 10 Gbps bandwidth Connect multiple Placement Groups to create very large clusters Enhanced networking with SR-IOV provides higher packet per second (PPS) performance, lower inter-instance latencies, and very low network jitter 10Gbps
  30. 30. Low cost with flexible pricing Efficient clusters Unlimited infrastructure Faster time to results Concurrent Clusters on-demand Increased collaboration Why AWS for HPC?
  31. 31. Customers running HPC workloads on AWS
  32. 32. Popular HPC workloads on AWS Genome processing Modeling and Simulation Government and Educational Research Monte Carlo Simulations Transcoding and Encoding Computational Chemistry
  33. 33. Across several key industry verticals Utilities Biopharma Materials Design Manufacturing Academic Research Auto & Aerospace
  34. 34. TOP500: 76th fastest supercomputer on demand June 2014 Top 500 list: 484.2 TFlop/s, 26,496 cores in a cluster of EC2 C3 instances (Intel Xeon E5-2680v2 10C 2.800 GHz processors), LINPACK benchmark
  35. 35. Three types of data-driven development Retrospective/ Batch analysis and reporting Here-and-now/ Stream real-time processing and dashboards Predictions to enable smart applications Amazon Kinesis Amazon EC2 AWS Lambda Amazon Redshift Amazon RDS Amazon EMR Amazon Machine Learning
  36. 36. Columnar data warehouse ANSI SQL compatible Massively parallel Petabyte scale Fully-managed Very cost-effective Amazon Redshift
  37. 37. Amazon Redshift architecture Leader Node SQL endpoint Stores metadata Coordinates query execution Compute Nodes Local, columnar storage Execute queries in parallel Load, backup, restore via Amazon S3 Parallel load from Amazon DynamoDB Hardware optimized for data processing Two hardware platforms DW1: HDD; scale from 2TB to 1.6PB DW2: SSD; scale from 160GB to 256TB 10 GigE (HPC) Ingestion Backup Restore JDBC/ODBC
  38. 38. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes With row storage you do unnecessary I/O: to get the total amount, you have to read everything.
ID | Age | State | Amount
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375
  39. 39. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes With column storage, you only read the data you need.
ID | Age | State | Amount
123 | 20 | CA | 500
345 | 25 | WA | 250
678 | 40 | FL | 125
957 | 37 | WA | 375
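The claim above can be made concrete: to sum one column, a row store must touch every field of every row, while a column store reads only that column. A small sketch counting bytes touched, using the four-row table from the slides:

```python
# The slide's table: each row holds (id, age, state, amount).
rows = [(123, 20, "CA", 500), (345, 25, "WA", 250),
        (678, 40, "FL", 125), (957, 37, "WA", 375)]

def field_bytes(value):
    """Crude stand-in for on-disk size: length of the textual value."""
    return len(str(value))

# Row storage: summing `amount` still scans every field of every row.
row_store_bytes = sum(field_bytes(v) for row in rows for v in row)

# Column storage: only the `amount` column is read.
amount_column = [row[3] for row in rows]
col_store_bytes = sum(field_bytes(v) for v in amount_column)

total = sum(amount_column)
print(total, row_store_bytes, col_store_bytes)  # same answer, far fewer bytes read
```

With real tables holding dozens of columns, the I/O gap is proportionally larger, and columnar compression (next slide) widens it further.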
  40. 40. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Columnar compression saves space & reduces I/O Amazon Redshift analyzes and compresses your data analyze compression listing; Table | Column | Encoding ---------+----------------+---------- listing | listid | delta listing | sellerid | delta32k listing | eventid | delta32k listing | dateid | bytedict listing | numtickets | bytedict listing | priceperticket | delta32k listing | totalprice | mostly32 listing | listtime | raw
  41. 41. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Zone maps track the minimum and maximum value for each block, so Redshift can skip over blocks that don't contain the data needed for a given query, minimizing unnecessary I/O.
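Zone maps can be sketched in a few lines: keep a per-block (min, max), and skip any block whose range cannot contain the predicate value. A hypothetical illustration (the block layout and values are invented for the example):

```python
# Hypothetical blocks of a roughly sorted column, each with its zone map.
blocks = [
    {"min": 1,   "max": 100, "values": [5, 40, 99]},
    {"min": 101, "max": 200, "values": [110, 150]},
    {"min": 201, "max": 300, "values": [250, 299]},
]

def scan_equals(target):
    """Read only blocks whose [min, max] range could contain target."""
    blocks_read, hits = 0, []
    for block in blocks:
        if block["min"] <= target <= block["max"]:
            blocks_read += 1
            hits.extend(v for v in block["values"] if v == target)
        # else: the zone map proves target cannot be here -- no I/O at all
    return blocks_read, hits

print(scan_equals(150))  # -> (1, [150]): two of the three blocks are skipped
```

The better sorted the data, the tighter each block's range and the more blocks a selective query can skip.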
  42. 42. Amazon Redshift dramatically reduces I/O Column storage Data compression Zone maps Direct-attached storage Large data block sizes Use direct-attached storage to maximize throughput Hardware optimized for high performance data processing Large block sizes to make the most of each read Amazon Redshift manages durability for you
  43. 43. Hadoop/HDFS clusters Hive, Pig, Impala, HBase Easy to use; fully managed On-demand and spot pricing Tight integration with S3, DynamoDB, and Kinesis Amazon Elastic MapReduce
  44. 44. How Does EMR Work? 1. Put the data into S3. 2. Choose: Hadoop distribution, # of nodes, types of nodes, Hadoop apps like Hive/Pig/HBase. 3. Launch the cluster using the EMR console, CLI, SDK, or APIs. 4. Get the output from S3.
  45. 45. EMR EMR Cluster S3 You can easily resize the cluster And launch parallel clusters using the same data How Does EMR Work?
  46. 46. EMR EMR Cluster S3 Use Spot nodes to save time and money How Does EMR Work?
  47. 47. The Hadoop Ecosystem works inside of EMR
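The Hadoop engine under EMR runs the map/shuffle/reduce pattern, and tools like Hive and Pig compile queries down to it. A local sketch of that pattern in the shape of a streaming-style word count (no Hadoop required; the function names are ours):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in the input line."""
    for word in line.lower().split():
        yield word, 1

def reduce_counts(pairs):
    """Shuffle + reduce phase: group pairs by key, then sum each group."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return {key: sum(values) for key, values in grouped.items()}

lines = ["big data on AWS", "big data and HPC"]
pairs = [pair for line in lines for pair in mapper(line)]
counts = reduce_counts(pairs)
print(counts)  # {'big': 2, 'data': 2, 'on': 1, 'aws': 1, 'and': 1, 'hpc': 1}
```

On a real cluster the map tasks run on many nodes against S3 or HDFS splits, and the shuffle moves pairs over the network, but the contract is the same.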
  48. 48. Easy to use, managed machine learning service built for developers Robust, powerful machine learning technology based on Amazon's internal systems Create models using your data already stored in the AWS cloud Deploy models to production in seconds Amazon Machine Learning
  49. 49. Three Supported Types of Predictions Binary Classification Predict the answer to a Yes/No question Multi-class classification Predict the correct category from a list Regression Predict the value of a numeric variable
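The regression type above predicts a numeric variable by fitting a model to labeled examples. A minimal self-contained sketch of the idea using ordinary least squares (the data and variable names are illustrative only, not from any AWS service):

```python
def fit_line(xs, ys):
    """Ordinary least squares fit for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

# Illustrative training data, e.g. ad spend (x) vs. revenue (y).
xs, ys = [1.0, 2.0, 3.0, 4.0], [2.1, 3.9, 6.0, 8.1]
slope, intercept = fit_line(xs, ys)
print(slope, intercept)          # roughly 2.01 and 0.0
print(slope * 5.0 + intercept)   # predicted numeric value for a new x = 5
```

Binary and multiclass classification follow the same train-then-predict shape; only the output type (a yes/no label or a category instead of a number) differs.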
  50. 50. Partners: Advanced Analytics (scientific, algorithmic, predictive, etc.)
  51. 51. Visualize
  52. 52. AWS Partners for BI & Data Visualization
  53. 53. Putting it all together
  54. 54. Data Temperature vs Query Latency Hot data, real-time answers (low query latency): Amazon Kinesis / Kafka via Spark Streaming, Apache Storm, or a native client; Amazon DynamoDB via Spark, Impala, Presto, or a native client. Warm data: HDFS via Hive, Spark, or Presto; Amazon Redshift. Cold data, batch answers (higher query latency): Amazon S3 via Hive or Amazon Redshift. As data moves from hot to cold, acceptable query latency moves from low to high.
  55. 55. Demo: logs flow through Amazon EMR as an ETL grid and analysis layer into an Amazon Redshift production DWH, feeding visualization of traffic statistics.
  56. 56. Video Recording of Demo
  57. 57. https://aws.amazon.com/big-data/ https://aws.amazon.com/testdrive/bigdata/ https://aws.amazon.com/training/course-descriptions/bigdata/ https://aws.amazon.com/hpc/ Thank You!