Building your first Big Data application on AWS


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Bhartia, Ecosystem Solution Architect

15th September 2015

Building your first Big Data application on AWS

Big Data ecosystem on AWS

[Diagram: Data flows in, Answers flow out, across the stages Collect, Process, Analyze, and Store. Stage labels: Data Collection and Storage, Data Processing, Event Processing, Data Analysis. Services shown: Amazon S3, Amazon Kinesis, Amazon DynamoDB, Amazon RDS (Aurora, MySQL), AWS Lambda, KCL Apps, Amazon EMR, Amazon Redshift, Amazon Machine Learning.]

Your first Big Data application on AWS?


Big Data ecosystem on AWS - Collect

[Same pipeline diagram with the Collect stage highlighted: Data in, Answers out.]

Big Data ecosystem on AWS - Process

[Same pipeline diagram with the Process stage highlighted.]

Big Data ecosystem on AWS - Analyze

[Same pipeline diagram with the Analyze stage highlighted; SQL appears at the Analyze stage.]


Setup


Resources

1. AWS Command Line Interface (aws-cli) configured

2. Amazon Kinesis stream with a single shard

3. Amazon S3 bucket to hold the files

4. Amazon EMR cluster (two nodes) with Spark and Hive

5. Amazon Redshift data warehouse cluster (single node)


Amazon Kinesis

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \

--stream-name AccessLogStream \

--shard-count 1



Amazon S3

Create an Amazon S3 bucket (YOUR-BUCKET-NAME) to hold the files.
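The slide's command didn't survive extraction; a minimal aws-cli sketch, assuming the bucket is created with s3 mb:

aws s3 mb s3://YOUR-BUCKET-NAME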

Amazon EMR

Launch a 3-instance cluster with Spark and Hive installed:

(Instance type m3.xlarge, region YOUR-AWS-REGION, SSH key YOUR-AWS-SSH-KEY.)
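The original command was lost; a plausible aws-cli sketch (the cluster name is hypothetical, and the release label assumes an EMR 4.x release that bundles Spark and Hive):

aws emr create-cluster \
  --name "FirstBigDataApp" \
  --release-label emr-4.0.0 \
  --applications Name=Hive Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles \
  --region YOUR-AWS-REGION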

Amazon Redshift

Launch a single-node data warehouse cluster (master password: CHOOSE-A-REDSHIFT-PASSWORD).
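Only the password placeholder survived; a sketch of the likely aws-cli command (the cluster identifier, database name, master username, and node type are assumptions):

aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dc1.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password CHOOSE-A-REDSHIFT-PASSWORD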


Your first Big Data application on AWS

1. COLLECT: Stream data into Kinesis with Log4J

2. PROCESS: Process data with EMR using Spark & Hive

3. ANALYZE: Analyze data in Redshift using SQL

[Diagram: Amazon S3 provides the STORE step between the stages; SQL is used at the ANALYZE step.]

1. Collect


Amazon Kinesis Log4J Appender

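The configuration itself didn't survive extraction; a sketch of a log4j.properties using the Kinesis Log4J Appender (property names follow the awslabs kinesis-log4j-appender sample; the logger name and tuning values are assumptions):

# send log events from the KinesisLogger logger to the AccessLogStream Kinesis stream
log4j.logger.KinesisLogger=INFO, KINESIS
log4j.appender.KINESIS=com.amazonaws.services.kinesis.log4j.KinesisAppender
log4j.appender.KINESIS.layout=org.apache.log4j.PatternLayout
log4j.appender.KINESIS.layout.ConversionPattern=%m
log4j.appender.KINESIS.streamName=AccessLogStream
log4j.appender.KINESIS.encoding=UTF-8
log4j.appender.KINESIS.maxRetries=3
log4j.appender.KINESIS.bufferSize=1000
log4j.appender.KINESIS.threadCount=20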


Log file format
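Each record is an Apache access log line in combined format, e.g. (the canonical example from the Apache documentation):

127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "GET /apache_pb.gif HTTP/1.0" 200 2326 "http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"

The fields map to the columns parsed later in Hive: host, identity, user, request_time, request, status, size, referrer, and agent.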


Spark

• Fast and general engine for large-scale data processing
• Write applications quickly in Java, Scala, or Python
• Combine SQL, streaming, and complex analytics.



Using Spark on EMR

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

Start Spark shell:

spark-shell --jars /usr/lib/spark/extras/lib/spark-streaming-kinesis-asl.jar,amazon-kinesis-client-1.5.1.jar


Amazon Kinesis and Spark Streaming

[Diagram: Producer → Amazon Kinesis → Spark Streaming (via the KCL) on Amazon EMR → Amazon S3; the KCL checkpoints its position in DynamoDB.]

Spark Streaming uses the KCL to read from Kinesis; the Spark Streaming application reads from Kinesis and writes to Amazon S3.

Spark-streaming - Reading from Kinesis

/* Setup the KinesisClient */

/* Determine the number of shards from the stream */
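Only the comments survived extraction; a minimal Scala sketch of the corresponding spark-shell code, assuming the Spark 1.x Kinesis connector (KinesisUtils), a us-east-1 endpoint, and a 10-second batch interval:

import com.amazonaws.auth.DefaultAWSCredentialsProviderChain
import com.amazonaws.services.kinesis.AmazonKinesisClient
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils

/* Setup the KinesisClient */
val endpointUrl = "https://kinesis.us-east-1.amazonaws.com"  // assumed region
val kinesisClient = new AmazonKinesisClient(new DefaultAWSCredentialsProviderChain())
kinesisClient.setEndpoint(endpointUrl)

/* Determine the number of shards from the stream */
val numShards = kinesisClient.describeStream("AccessLogStream")
  .getStreamDescription().getShards().size

/* Create one Kinesis receiver (DStream) per shard */
val ssc = new StreamingContext(sc, Seconds(10))
val kinesisStreams = (0 until numShards).map { _ =>
  KinesisUtils.createStream(ssc, "AccessLogStream", endpointUrl,
    Seconds(10), InitialPositionInStream.TRIM_HORIZON, StorageLevel.MEMORY_ONLY)
}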



Spark-streaming – Writing to S3

/* Merge the worker DStreams and translate the byteArray to string */

/* Write each RDD to Amazon S3*/
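Likewise only the comments remain; a sketch continuing the code above, writing each micro-batch into Hive-style partition folders under the access-log-raw prefix (that prefix matches the external table created later; the exact date-format string is an assumption):

/* Merge the worker DStreams and translate the byteArray to string */
val unionStreams = ssc.union(kinesisStreams)
val accessLogs = unionStreams.map(byteArray => new String(byteArray))

/* Write each RDD to Amazon S3, one folder per year/month/day/hour/min partition */
val fmt = new java.text.SimpleDateFormat("'year='yyyy/'month='MM/'day='dd/'hour='HH/'min='mm")
accessLogs.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    val partitionDir = fmt.format(new java.util.Date(time.milliseconds))
    rdd.saveAsTextFile("s3://YOUR-S3-BUCKET/access-log-raw/" + partitionDir)
  }
}
ssc.start()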


View the output files in Amazon S3

The output lands under YOUR-S3-BUCKET, organized into yyyy/mm/dd/HH partition prefixes.
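A listing sketch with the aws-cli (the prefix is assumed from the write step above):

aws s3 ls s3://YOUR-S3-BUCKET/access-log-raw/ --recursive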


2. Process


Amazon EMR’s Hive

Adapts SQL-like (HiveQL) queries to run on Hadoop

Schema on read: maps the table to the input data

Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis

Query complex input formats using SerDe

Transform data with User Defined Functions (UDF)



Using Hive on Amazon EMR

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

Start Hive:

hive


Create a table that points to your Amazon S3 bucket

CREATE EXTERNAL TABLE access_log_raw(
  host STRING, identity STRING, user STRING, request_time STRING,
  request STRING, status STRING, size STRING, referrer STRING, agent STRING
)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?"
)
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

msck repair table access_log_raw;



Process data using Hive

We will transform the data returned by the query before writing it to our Amazon S3-stored external Hive table.

Hive User Defined Functions (UDFs) handle the text transformations: from_unixtime, unix_timestamp, and hour.

The "hour" value is important: it is what's used to split and organize the output files before writing to Amazon S3. These splits will let us load the data into Amazon Redshift more efficiently later in the lab using the parallel "COPY" command.


Create an external Hive table in Amazon S3

The table's LOCATION is s3://YOUR-S3-BUCKET/access-log-processed (the prefix the Redshift COPY reads from later).
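The DDL itself didn't survive; a sketch consistent with the INSERT on the next slides and with the tab-delimited, Gzip-compressed files the Redshift COPY expects (column types are assumptions):

CREATE EXTERNAL TABLE access_log_processed (
  request_time STRING,
  host STRING,
  request STRING,
  status STRING,
  referrer STRING,
  agent STRING
)
PARTITIONED BY (hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';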

Configure partitioning and compression

-- set up Hive's "dynamic partitioning"

-- this will split output files when writing to Amazon S3

-- compress output files on Amazon S3 using Gzip

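Only the comments survived; the settings behind them are likely the standard Hive ones (a sketch, assuming Gzip via the stock Hadoop codec):

-- set up Hive's "dynamic partitioning"; this will split output files when writing to Amazon S3
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- compress output files on Amazon S3 using Gzip
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;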


Query Hive and write output to Amazon S3

-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
FROM access_log_raw;


Viewing job status: http://127.0.0.1:9026 (via an SSH tunnel to the EMR master node)



View the output files in Amazon S3

List the processed files under s3://YOUR-S3-BUCKET/access-log-processed, as with the raw output earlier.

Spark SQL

Spark's module for working with structured data using SQL

Run unmodified Hive queries on existing data.



Using Spark-SQL on Amazon EMR

ssh -i YOUR-AWS-SSH-KEY hadoop@YOUR-EMR-HOSTNAME

Start the Spark SQL shell:

spark-sql


Query the data with Spark

-- return the first row in the stream

-- count all items in the stream

-- find the top 10 hosts
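The queries themselves were lost; plausible HiveQL equivalents run in spark-sql (the table name is taken from the earlier CREATE EXTERNAL TABLE):

-- return the first row in the stream
SELECT * FROM access_log_raw LIMIT 1;

-- count all items in the stream
SELECT COUNT(1) FROM access_log_raw;

-- find the top 10 hosts
SELECT host, COUNT(1) AS total
FROM access_log_raw
GROUP BY host
ORDER BY total DESC
LIMIT 10;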



3. Analyze


Connect to Amazon Redshift

# using the PostgreSQL CLI
# connect to YOUR-REDSHIFT-ENDPOINT on port 5439 (the Redshift default)
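A psql sketch (the master username and database name follow the create-cluster assumptions above):

psql -h YOUR-REDSHIFT-ENDPOINT -p 5439 -U master demo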

Or use any JDBC or ODBC SQL client with the PostgreSQL 8.x drivers or native Redshift support

• Aginity Workbench for Amazon Redshift
• SQL Workbench/J


Create an Amazon Redshift table to hold your data

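The DDL was lost; a sketch with columns mirroring the processed Hive table (types, distribution key, and sort key are assumptions):

CREATE TABLE accesslogs (
  request_timestamp TIMESTAMP,
  host VARCHAR(50),
  request VARCHAR(2048),
  status INT,
  referrer VARCHAR(2048),
  agent VARCHAR(2048)
)
DISTKEY (host)
SORTKEY (request_timestamp);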


Loading data into Amazon Redshift

“COPY” command loads files in parallel

COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t'
IGNOREHEADER 0
MAXERROR 0
GZIP;



Amazon Redshift test queries

-- find distribution of status codes over days

-- find the 404 status codes

-- show all requests whose status is PAGE NOT FOUND (404)
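The SQL was lost; plausible Redshift queries against the accesslogs sketch above (column names are assumptions):

-- find distribution of status codes over days
SELECT TRUNC(request_timestamp) AS day, status, COUNT(1) AS total
FROM accesslogs
GROUP BY TRUNC(request_timestamp), status
ORDER BY day, total DESC;

-- find the 404 status codes
SELECT COUNT(1) FROM accesslogs WHERE status = 404;

-- show all requests whose status is PAGE NOT FOUND (404)
SELECT request, COUNT(1) AS total
FROM accesslogs
WHERE status = 404
GROUP BY request
ORDER BY total DESC;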


Your first Big Data application on AWS

A favicon would fix 398 of the total 977 PAGE NOT FOUND (404) errors


Try it yourself on the AWS Cloud…

Service                  Est. Cost*
Amazon Kinesis           $1.00
Amazon S3 (free tier)    $0
Amazon EMR               $0.44
Amazon Redshift          $1.00
Est. Total               $2.44

*Estimated costs assume use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on options selected, size of dataset, and usage.

…around the same cost as a cup of coffee ($3.50).


Thank you

AWS Big Data blog: blogs.aws.amazon.com/bigdata