AWS September Webinar Series - Building Your First Big Data Application on AWS


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Rahul Bhartia, Ecosystem Solution Architect

15th September 2015

Building your first Big Data application on AWS

Amazon S3

Amazon Kinesis

Amazon DynamoDB

Amazon RDS (Aurora)

AWS Lambda

KCL Apps

Amazon EMR

Amazon Redshift

Amazon Machine Learning

Collect | Process | Analyze | Store

Data Collection and Storage

Data Processing

Event Processing

Data Analysis

Data → Answers

Big Data ecosystem on AWS

Your first Big Data application on AWS

?

Big Data ecosystem on AWS - Collect


Big Data ecosystem on AWS - Process


Big Data ecosystem on AWS - Analyze


SQL

Setup

Resources

1. AWS Command Line Interface (aws-cli) configured

2. Amazon Kinesis stream with a single shard

3. Amazon S3 bucket to hold the files

4. Amazon EMR cluster (two nodes) with Spark and Hive

5. Amazon Redshift data warehouse cluster (single node)

Amazon Kinesis

Create an Amazon Kinesis stream to hold incoming data:

aws kinesis create-stream \
  --stream-name AccessLogStream \
  --shard-count 1

Amazon S3
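
Create an Amazon S3 bucket to hold the files. A minimal aws-cli sketch (YOUR-S3-BUCKET is a placeholder you must replace):

aws s3 mb s3://YOUR-S3-BUCKET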

Amazon EMR
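
Launch the two-node EMR cluster with Spark and Hive. A hedged aws-cli sketch; the cluster name, release label, instance type, and key name below are illustrative assumptions, not values from the slides:

aws emr create-cluster \
  --name "first-bigdata-app" \
  --release-label emr-4.0.0 \
  --applications Name=Hive Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 2 \
  --ec2-attributes KeyName=YOUR-AWS-SSH-KEY \
  --use-default-roles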

Amazon Redshift
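
Create the single-node Redshift data warehouse cluster. Again a sketch; the identifier, node type, and credentials are placeholders:

aws redshift create-cluster \
  --cluster-identifier demo \
  --db-name demo \
  --node-type dc1.large \
  --cluster-type single-node \
  --master-username master \
  --master-user-password YOUR-REDSHIFT-PASSWORD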

Your first Big Data application on AWS

1. COLLECT: Stream data into Kinesis with Log4J

2. PROCESS: Process data with EMR using Spark & Hive

3. ANALYZE: Analyze data in Redshift using SQL

STORE

SQL

1. Collect

Amazon Kinesis Log4J Appender

Log file format
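
The records pushed into the stream are Apache access log lines in the combined log format, which is what the regular expression on the Hive table later in the lab expects. A representative, made-up line:

127.0.0.1 - frank [15/Sep/2015:13:55:36 -0700] "GET /index.html HTTP/1.1" 200 2326 "http://www.example.com/start.html" "Mozilla/5.0"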

Spark

• Fast and general engine for large-scale data processing

• Write applications quickly in Java, Scala, or Python

• Combine SQL, streaming, and complex analytics

Using Spark on EMR
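
Spark is already installed on the EMR cluster created above; a typical way to use it interactively is to SSH to the master node and start the Spark shell (key file and host name are placeholders):

ssh -i YOUR-AWS-SSH-KEY.pem hadoop@YOUR-EMR-MASTER-PUBLIC-DNS
spark-shell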

Amazon Kinesis and Spark Streaming

Producer → Amazon Kinesis → KCL (Spark Streaming on Amazon EMR) → Amazon S3

Spark Streaming uses the KCL to read from Kinesis; the KCL tracks its position in the stream in an Amazon DynamoDB table.

Spark-Streaming application to read from Kinesis and write to S3

Spark Streaming – Reading from Kinesis

Spark Streaming – Writing to S3
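
A minimal reconstruction of those two steps in Scala, run from spark-shell with Spark's Kinesis connector on the classpath. This is a sketch, not the exact slide code; the application name, 10-second batch interval, endpoint/region, and output-path layout are assumptions chosen so the files line up with the partitioned Hive table defined in the next section:

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kinesis.KinesisUtils
import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream

// One micro-batch every 10 seconds
val ssc = new StreamingContext(sc, Seconds(10))

// Read raw records (byte arrays) from the single-shard AccessLogStream
val stream = KinesisUtils.createStream(
  ssc, "FirstBigDataApp", "AccessLogStream",
  "https://kinesis.us-east-1.amazonaws.com", "us-east-1",
  InitialPositionInStream.TRIM_HORIZON, Seconds(10),
  StorageLevel.MEMORY_AND_DISK_2)

// Turn each record back into a log line
val lines = stream.map(bytes => new String(bytes))

// Write each non-empty batch to S3 under year=/month=/day=/hour=/min= folders,
// so that "msck repair table" can later discover them as Hive partitions
val fmt = new SimpleDateFormat("'year='yyyy'/month='MM'/day='dd'/hour='HH'/min='mm")
lines.foreachRDD { (rdd, time) =>
  if (rdd.count() > 0) {
    val partitionPath = fmt.format(new Date(time.milliseconds))
    rdd.saveAsTextFile("s3://YOUR-S3-BUCKET/access-log-raw/" + partitionPath)
  }
}

// Start the streaming job
ssc.start()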

View the output files in Amazon S3
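
For example, with the aws-cli (the prefix is a placeholder):

aws s3 ls s3://YOUR-S3-BUCKET/access-log-raw/ --recursive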

2. Process

Amazon EMR’s Hive

Adapts a SQL-like (HiveQL) query to run on Hadoop

Schema on read: map table to the input data

Access data in Amazon S3, Amazon DynamoDB, and Amazon Kinesis

Query complex input formats using SerDe

Transform data with User Defined Functions (UDF)

Using Hive on Amazon EMR

Create a table that points to your Amazon S3 bucket

CREATE EXTERNAL TABLE access_log_raw (
  host STRING, identity STRING, user STRING,
  request_time STRING, request STRING, status STRING,
  size STRING, referrer STRING, agent STRING)
PARTITIONED BY (year INT, month INT, day INT, hour INT, min INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\"[^\"]*\") ([^ \"]*|\"[^\"]*\"))?")
LOCATION 's3://YOUR-S3-BUCKET/access-log-raw';

msck repair table access_log_raw;

Process data using Hive

We will transform the data that is returned by the query before writing it to our Amazon S3-stored external Hive table

Hive user-defined functions (UDFs) used for the text transformations: from_unixtime, unix_timestamp, and hour

The “hour” value is important: it is used to split and organize the output files before they are written to Amazon S3. These splits will let us load the data into Amazon Redshift more efficiently later in the lab using the parallel “COPY” command.

Create an external Hive table in Amazon S3
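
A sketch of that DDL; the column list mirrors the SELECT below, and the tab delimiter matches what the Redshift COPY in the next section expects (the exact statement on the original slide may differ):

CREATE EXTERNAL TABLE access_log_processed (
  request_time STRING,
  host STRING,
  request STRING,
  status STRING,
  referrer STRING,
  agent STRING)
PARTITIONED BY (hour INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3://YOUR-S3-BUCKET/access-log-processed';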

Configure partition and compression
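
Typical Hive session settings for this step, assuming dynamic partitioning on “hour” and gzip-compressed output (which is why the Redshift COPY later uses the GZIP option); exact values are assumptions:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;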

Query Hive and write output to Amazon S3

-- convert the Apache log timestamp to a UNIX timestamp
-- split files in Amazon S3 by the hour in the log lines
INSERT OVERWRITE TABLE access_log_processed PARTITION (hour)
SELECT
  from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]')),
  host,
  request,
  status,
  referrer,
  agent,
  hour(from_unixtime(unix_timestamp(request_time, '[dd/MMM/yyyy:HH:mm:ss Z]'))) AS hour
FROM access_log_raw;

Viewing job status: http://127.0.0.1/9026

View the output files in Amazon S3
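
Again with the aws-cli, this time under the processed prefix:

aws s3 ls s3://YOUR-S3-BUCKET/access-log-processed/ --recursive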

Spark SQL

Spark's module for working with structured data using SQL

Run unmodified Hive queries on existing data.

Using Spark-SQL on Amazon EMR
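
On the EMR master node, the Spark SQL command-line interface can be started directly; it reads the same table definitions created with Hive above:

spark-sql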

Query the data with Spark
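
An illustrative query (not from the slides) run in the spark-sql shell against the processed table:

SELECT host, COUNT(*) AS requests
FROM access_log_processed
GROUP BY host
ORDER BY requests DESC
LIMIT 10;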

3. Analyze

Connect to Amazon Redshift
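
Any PostgreSQL-compatible client works; for example with psql (the endpoint, database, and user below are placeholders, and 5439 is Redshift's default port):

psql -h YOUR-REDSHIFT-ENDPOINT.redshift.amazonaws.com -p 5439 -U master -d demo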

Create an Amazon Redshift table to hold your data
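
A sketch of the table; the columns mirror the tab-delimited files written by the Hive step, while the distribution key, sort key, and varchar sizes are illustrative choices:

CREATE TABLE accesslogs (
  request_time timestamp,
  host varchar(50),
  request varchar(2048),
  status int,
  referrer varchar(2048),
  agent varchar(2048))
DISTKEY (host)
SORTKEY (request_time);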

Loading data into Amazon Redshift

“COPY” command loads files in parallel

COPY accesslogs
FROM 's3://YOUR-S3-BUCKET/access-log-processed'
CREDENTIALS 'aws_access_key_id=YOUR-IAM-ACCESS_KEY;aws_secret_access_key=YOUR-IAM-SECRET-KEY'
DELIMITER '\t' IGNOREHEADER 0 MAXERROR 0 GZIP;

Amazon Redshift test queries
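
Illustrative checks behind the finding on the next slide: count the 404 responses, then see how many of them are requests for a missing favicon:

SELECT COUNT(*) FROM accesslogs WHERE status = 404;

SELECT COUNT(*) FROM accesslogs
WHERE status = 404 AND request LIKE '%favicon.ico%';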

Your first Big Data application on AWS

Adding a favicon would fix 398 of the 977 total PAGE NOT FOUND (404) errors

…around the same cost as a cup of coffee

Try it yourself on the AWS Cloud…

Service Est. Cost*

Amazon Kinesis $1.00

Amazon S3 (free tier) $0

Amazon EMR $0.44

Amazon Redshift $1.00

Est. Total $2.44

*Estimated cost assumes use of the free tier where available, lower-cost instances, a dataset no bigger than 10 MB, and instances running for less than 4 hours. Costs may vary depending on the options selected, the size of the dataset, and usage.

(A cup of coffee: about $3.50)

Thank you

AWS Big Data blog: blogs.aws.amazon.com/bigdata