Building your first Big Data application on AWS

Rahul Bhartia, Ecosystem Solution Architect
15th September 2015

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Big Data ecosystem on AWS

[Diagram: Data → Answers across the Collect, Store, Process, and Analyze stages, built from Amazon Kinesis, Amazon S3, Amazon DynamoDB, Amazon RDS (Aurora/MySQL), AWS Lambda, KCL apps, Amazon EMR, Amazon Redshift, and Amazon Machine Learning for data collection and storage, event processing, data processing, and data analysis]
Your first Big Data application on AWS?
Big Data ecosystem on AWS - Collect

[Diagram: the Collect stage highlighted in the Data → Collect → Store → Process → Analyze → Answers pipeline]

Big Data ecosystem on AWS - Process

[Diagram: the Process stage highlighted in the same pipeline]

Big Data ecosystem on AWS - Analyze

[Diagram: the Analyze stage highlighted in the same pipeline, with SQL as the interface]
Setup
Resources
1. AWS Command Line Interface (aws-cli) configured
2. Amazon Kinesis stream with a single shard
3. Amazon S3 bucket to hold the files
4. Amazon EMR cluster (two nodes) with Spark and Hive
5. Amazon Redshift data warehouse cluster (single node)
Amazon Kinesis
Create an Amazon Kinesis stream to hold incoming data:
aws kinesis create-stream \
--stream-name AccessLogStream \
--shard-count 1
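The stream takes a few moments to become ACTIVE. A quick way to check before sending data (not on the original slide, just the standard CLI call):

aws kinesis describe-stream \
--stream-name AccessLogStream \
--query 'StreamDescription.StreamStatus'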
Amazon S3

Create an Amazon S3 bucket (YOUR-BUCKET-NAME) to hold the files.
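The exact command isn't captured in this transcript; a minimal sketch using the AWS CLI (bucket name and region are placeholders):

aws s3 mb s3://YOUR-BUCKET-NAME \
--region YOUR-AWS-REGION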
Amazon EMR
Launch a 3-instance cluster with Spark and Hive installed (instance type m3.xlarge, region YOUR-AWS-REGION, key pair YOUR-AWS-SSH-KEY).
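The full command isn't captured here; a minimal sketch of an equivalent aws emr create-cluster call (the cluster name, release label, and default-roles choice are assumptions):

aws emr create-cluster \
--name demo \
--release-label emr-4.0.0 \
--applications Name=Hive Name=Spark \
--instance-type m3.xlarge \
--instance-count 3 \
--use-default-roles \
--ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
--region YOUR-AWS-REGION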
Amazon Redshift

Launch a single-node data warehouse cluster with a master password of your choosing (CHOOSE-A-REDSHIFT-PASSWORD).
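The command itself isn't in the transcript; a minimal sketch (cluster identifier, database name, node type, and master user name are assumptions):

aws redshift create-cluster \
--cluster-identifier demo \
--db-name demo \
--node-type dc1.large \
--cluster-type single-node \
--master-username master \
--master-user-password CHOOSE-A-REDSHIFT-PASSWORD \
--region YOUR-AWS-REGION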
Your first Big Data application on AWS
1. COLLECT: Stream data into Kinesis with Log4J
2. PROCESS: Process data with EMR using Spark & Hive
3. ANALYZE: Analyze data in Redshift using SQL
STORE: Amazon S3 holds the data between each step
1. Collect
Amazon Kinesis Log4J Appender

The appender publishes each line of an Apache access log file as a record on the AccessLogStream Kinesis stream.
Log file format
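The sample line on this slide isn't captured; the fields, inferred from the Hive table defined later, follow the Apache combined log format:

host identity user [request_time] "request" status size "referrer" "agent"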
Spark
• Fast and general engine for large-scale data processing
• Write applications quickly in Java, Scala, or Python
• Combine SQL, streaming, and complex analytics
Using Spark on EMR

Connect to the cluster's master node over SSH (using YOUR-AWS-SSH-KEY and YOUR-EMR-HOSTNAME).
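The SSH command itself isn't captured; a minimal sketch, assuming EMR's default hadoop login user:

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME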
Start Spark shell:

spark-shell --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar,amazon-kinesis-client-1.5.1.jar
Amazon Kinesis and Spark Streaming

[Diagram: Producer → Amazon Kinesis → Amazon EMR → Amazon S3, with the KCL checkpointing to DynamoDB]

Spark-Streaming uses the KCL for Kinesis

Spark-Streaming application to read from Kinesis and write to S3
Spark-streaming - Reading from Kinesis
/* Setup the KinesisClient */
/* Determine the number of shards from the stream */
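Only the comments above survive in this transcript; a minimal Scala sketch of those steps in the Spark shell with the Spark 1.x Kinesis connector (endpoint, region, batch interval, and app name are assumptions):

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

val endpointUrl = "https://kinesis.us-east-1.amazonaws.com"  // assumed region

/* Setup the KinesisClient */
val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
kinesisClient.setEndpoint(endpointUrl)

/* Determine the number of shards from the stream */
val numShards = kinesisClient.describeStream("AccessLogStream")
  .getStreamDescription().getShards().size

/* Create one Kinesis receiver (DStream) per shard, using the shell's SparkContext sc */
val batchInterval = Seconds(10)
val ssc = new StreamingContext(sc, batchInterval)
val kinesisStreams = (0 until numShards).map { i =>
  KinesisUtils.createStream(ssc, "AccessLogApp", "AccessLogStream", endpointUrl,
    "us-east-1", InitialPositionInStream.TRIM_HORIZON, batchInterval,
    StorageLevel.MEMORY_AND_DISK_2)
}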
Spark-streaming – Writing to S3

/* Merge the worker Dstreams and translate the byteArray to string */

/* Write each RDD to Amazon S3 */
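Again only the comments survive; a sketch of what those two steps could look like, continuing the shell session above (the output prefix layout is an assumption, chosen as year=/month=/day=/hour=/min= directories so that the partitioned Hive table defined later can discover the files):

/* Merge the worker DStreams and translate the byte array to a string */
val unionStreams = ssc.union(kinesisStreams)
val accessLogs = unionStreams.map(byteArray => new String(byteArray))

/* Write each RDD to Amazon S3 */
accessLogs.foreachRDD { (rdd, time) =>
  if (rdd.count > 0) {
    // build a partition path from the batch time
    val partitionPath = new java.text.SimpleDateFormat(
      "'/year='yyyy'/month='MM'/day='dd'/hour='HH'/min='mm")
      .format(new java.util.Date(time.milliseconds))
    rdd.saveAsTextFile("s3://YOUR-S3-BUCKET/access-log-raw" + partitionPath)
  }
}
ssc.start()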
View the output files in Amazon S3

The streaming application writes the log records into YOUR-S3-BUCKET, organized under yyyy/mm/dd/HH prefixes.
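To check from the CLI (the access-log-raw prefix comes from the external Hive table location used below):

aws s3 ls s3://YOUR-S3-BUCKET/access-log-raw/ --recursive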
2. Process
Amazon EMR’s Hive

• Runs SQL-like (HiveQL) queries on Hadoop
• Schema on read: map the table to the input data
• Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis
• Query complex input formats using a SerDe
• Transform data with User Defined Functions (UDFs)
Using Hive on Amazon EMR

Connect to the master node over SSH as before (YOUR-AWS-SSH-KEY, YOUR-EMR-HOSTNAME).

Start Hive:

hive
Create a table that points to your Amazon S3 bucket
CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING, user STRING, request_time STRING,
  request STRING, status STRING, size STRING, referrer STRING, agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

msck repair table access_log_raw;
Process data using Hive
We will transform the data returned by the query before writing it to our Amazon S3-stored external Hive table.

Hive User Defined Functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour.

The “hour” value is important: it is used to split and organize the output files before they are written to Amazon S3. These splits will let us load the data into Amazon Redshift more efficiently later in the lab, using the parallel “COPY” command.
Create an external Hive table in Amazon S3
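The CREATE statement on this slide isn't captured; a minimal sketch consistent with the INSERT and COPY steps that follow (column types and the tab delimiter are assumptions; the location matches the COPY command later):

CREATE EXTERNAL TABLE access_log_processed (
  request_time STRING,
  host STRING,
  request STRING,
  status STRING,
  referrer STRING,
  agent STRING
)
PARTITIONED BY (hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';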
Configure partition and compression

-- setup Hive's "dynamic partitioning"
-- this will split output files when writing to Amazon S3

-- compress output files on Amazon S3 using Gzip
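The SET statements themselves aren't in the transcript; a sketch using standard Hive/Hadoop settings that match the comments above (treat the exact set used in the session as an assumption):

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;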
Query Hive and write output to Amazon S3
-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
FROM access_log_raw;
Viewing job status: http://127.0.0.1:9026
View the output files in Amazon S3

Browse s3://YOUR-S3-BUCKET/access-log-processed to see the hour-partitioned, Gzip-compressed output files.
Spark SQL
• Spark's module for working with structured data using SQL
• Run unmodified Hive queries on existing data
Using Spark-SQL on Amazon EMR

Connect to the master node over SSH as before (YOUR-AWS-SSH-KEY, YOUR-EMR-HOSTNAME).

Start Spark-SQL:

spark-sql
Query the data with Spark
-- return the first row in the stream
-- count all items in the stream
-- find the top 10 hosts
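The queries themselves aren't captured; a sketch of HiveQL that matches the comments, run against the raw table (the table choice is an assumption):

-- first row
SELECT * FROM access_log_raw LIMIT 1;

-- total count
SELECT COUNT(1) FROM access_log_raw;

-- top 10 hosts
SELECT host, COUNT(1) AS requests
FROM access_log_raw
GROUP BY host
ORDER BY requests DESC
LIMIT 10;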
3. Analyze
Connect to Amazon Redshift

Using the PostgreSQL CLI (psql), connect to YOUR-REDSHIFT-ENDPOINT.
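The exact command isn't captured; a minimal sketch, assuming the default Redshift port and the master user and database names from the cluster-creation sketch earlier:

psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master -d demo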
Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Amazon Redshift support:

• Aginity Workbench for Amazon Redshift
• SQL Workbench/J
Create an Amazon Redshift table to hold your data
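The DDL on this slide isn't captured; a sketch with columns matching the processed Hive table (names, types, and the distribution/sort keys are assumptions):

CREATE TABLE accesslogs (
  request_time timestamp,
  host varchar(50),
  request varchar(2048),
  status int,
  referrer varchar(2048),
  agent varchar(2048)
)
DISTKEY (host)
SORTKEY (request_time);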
Loading data into Amazon Redshift
“COPY” command loads files in parallel
COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS
'aws_access_key_id=YOUR-IAM-ACCESS_KEY;
aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' IGNOREHEADER 0
MAXERROR 0
GZIP;
Amazon Redshift test queries
-- find distribution of status codes over days
-- find the 404 status codes
-- show the requests that returned PAGE NOT FOUND (404)
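The SQL isn't captured; a sketch matching the comments, using the column names from the table sketch above:

-- distribution of status codes over days
SELECT TRUNC(request_time) AS day, status, COUNT(1) AS requests
FROM accesslogs
GROUP BY 1, 2
ORDER BY 1, 2;

-- count the 404 status codes
SELECT COUNT(1) FROM accesslogs WHERE status = 404;

-- which requests returned PAGE NOT FOUND (404)
SELECT request, COUNT(1) AS times
FROM accesslogs
WHERE status = 404
GROUP BY request
ORDER BY times DESC;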
Your first Big Data application on AWS
A favicon would fix 398 of the 977 total PAGE NOT FOUND (404) errors.
Try it yourself on the AWS Cloud…

Service                  Est. Cost*
Amazon Kinesis           $1.00
Amazon S3 (free tier)    $0.00
Amazon EMR               $0.44
Amazon Redshift          $1.00
Est. Total               $2.44

…around the same cost as a cup of coffee ($3.50)

*Estimated costs assume: use of free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.
Thank you

AWS Big Data blog: blogs.aws.amazon.com/bigdata