AWS re:Invent 2016: ↑↑↓↓←→←→ BA Lambda Start (SVR305)

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Vyom Nagrani - Manager, Product Management, Amazon Web Services

Richard McFarland - VP Data Services and Chief Data Scientist, Hearst Corp

November 2016

What to Expect from the Session

Working with AWS Lambda

Customer example

Hearst clickstream and data pipeline

Best practices and hacks across the lifecycle

Development and testing

Deployment and ALM

Security and scaling

Debugging and operations

Questions & answers


EVENT SOURCE FUNCTION SERVICES (ANYTHING)

Changes in

data state

Requests to

endpoints

Changes in

resource state

Node

Python

Java

C#

Cost-effective and

efficient

No infrastructure

to manage

Pay only for what you use

Bring your

own code

Productivity-focused compute platform to build powerful, dynamic, modular

applications in the cloud

Run code in standard

languages

Focus on business logic

Benefits of AWS Lambda

Amazon

S3

Amazon

DynamoDB

Amazon

Kinesis

AWS

CloudFormation

AWS

CloudTrail

Amazon

CloudWatch

Amazon

SNS

Amazon

SES

Amazon

API Gateway

Amazon

Cognito

Amazon

Alexa

Cron events

DATA STORES ENDPOINTS

CONFIGURATION REPOSITORIES EVENT/MESSAGE SERVICES

Event sources that trigger AWS Lambda

… and the list will continue to grow!

AWS

CodeCommit

AWS

IoT

Key scenarios and use cases for AWS Lambda

Data processing

Stateless processing of

discrete or streaming

updates to your data-store

or message bus

Control systems

Customize responses and

response workflows to state

and data changes within

AWS

App backend development

Execute server side

backend logic for web,

mobile, device, or voice user

interactions

Customer example:

Hearst clickstream and data

pipeline

Cron-ified Clickstream

Lambda-fy! Lessons Learned

What I will be talking about

What business is Hearst in?

Magazines: 20 U.S. titles & nearly 300 international titles

Newspapers: 15 daily & 34 weekly titles

Broadcasting: 30 television & 2 radio stations

Business Media: Operates more than 20 business-to-business businesses with significant holdings in the auto, electronic, medical and financial industries

Hearst has over 300 websites worldwide, which results in 1TB of data per day and over 20 billion pageviews per year.

“Hearst is in the Data Creation Business”

VARIETY

Structured Data

Unstructured

Data

VELOCITY

Batches

Streaming

VALUE

EXTRACTION

DBA and

Analysts

Cloud Engineering

And

Machine Learning

“Managing our clickstream is necessary for Hearst to extract

business value from our big data”

VOLUME

Single Source

Many Sources

Normal Data

Big Data

Clickstream

Hearst’s Cron-based Clickstream

Buzzing API

API

Ready

Data

Amazon

Kinesis

Node.js

App - Proxy

Clickstream

Data Science

Application

Amazon Redshift

ETL on EMR

Models

Agg Data

Amazon

S3

Users to

Hearst

Properties

Hearst’s data pipeline: cron-based

LATENCY

THROUGHPUT

Milliseconds

100GB/Day

30 Seconds

5GB/Day

100 Seconds

1GB/Day

5 Seconds

1GB/Day

DynamoDB API

Gateway

5 min

cron

5 min

cron

5 min

cron

5 min

cron

Lambda-fy it!

Code must execute in 5 minutes or less

Lambda

Limit

For every Lambda process, create a “watchdog” that checks for failures and fills in the gaps

Lambda

Tip

Lambda

etl_main

etl_watchdog

Lambda

ds_main

ds_watchdog

Lambda

translate

Lambda

push_to_DynamoDB

Lambda

api_integration
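The watchdog tip above can be illustrated with a minimal sketch. It assumes runs are scheduled on a fixed 5-minute grid and that a list of completed run times is available from somewhere (a table, S3 markers, logs); the names expected_slots, find_gaps, watchdog and the rerun callback are hypothetical and are not taken from the Hearst code.

```python
from datetime import datetime, timedelta


def expected_slots(start, end, step_minutes=5):
    """Return every scheduled run time in [start, end) on a fixed interval."""
    slots = []
    current = start
    while current < end:
        slots.append(current)
        current += timedelta(minutes=step_minutes)
    return slots


def find_gaps(start, end, completed, step_minutes=5):
    """Return the scheduled run times that have no completed run recorded."""
    done = set(completed)
    return [s for s in expected_slots(start, end, step_minutes) if s not in done]


def watchdog(start, end, completed, rerun, step_minutes=5):
    """Re-run the main job for each missing slot and return those slots.

    `rerun` is a hypothetical callback, for example one that invokes
    the main function again for the given time slot.
    """
    gaps = find_gaps(start, end, completed, step_minutes)
    for slot in gaps:
        rerun(slot)
    return gaps
```

In this sketch the watchdog only decides what is missing; how a slot is re-processed is left to the callback, so the same check could sit next to etl_main or ds_main.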

Add “triggers” in S3 that are 0 byte files with the name of the Lambda function

Lambda

Tip

trigger trigger trigger
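One way the 0-byte trigger files could be interpreted is sketched below. It assumes the standard S3 event notification layout (Records, s3.object.key, s3.object.size) and assumes that the base name of the object key is the name of the Lambda function to run; the "triggers/" prefix and the function triggered_functions are hypothetical. The import targets Python 3.

```python
import posixpath
from urllib.parse import unquote_plus


def triggered_functions(event):
    """Return function names encoded in 0-byte trigger objects of an S3 event.

    Assumption: the object key's base name is the function name,
    e.g. 'triggers/ds_main' means "run ds_main".
    """
    names = []
    for record in event.get("Records", []):
        obj = record["s3"]["object"]
        if obj.get("size", 0) != 0:
            continue  # only 0-byte objects count as triggers in this sketch
        key = unquote_plus(obj["key"])  # keys arrive URL-encoded in S3 events
        names.append(posixpath.basename(key))
    return names
```

A dispatcher could then invoke each returned name, which is what replaces the fixed 5-minute cron schedule with an event-driven chain.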

Convert existing cron-driven process into trigger-based process

Buzzing API

API

Ready

Data

Data Science

Application

Amazon Redshift

ETL on EMR

DynamoDB API

Gateway

Amazon

Kinesis

Lambda

Kinesis Firehose_to_S3

Deep dive: Python frameworks

What really “exploded” the use of Lambda functions at Hearst was the

introduction of Frameworks

Problem: Using Lambda functions to access multiple AWS tools and perform data

science requires access credentials and database frameworks

psycopg2

boto3

gzip

pgpasslib

pandas

pytz

numpy

httplib2

Programmers have to configure Python modules not in the standard Python 2.7

library set

So Hearst created a standard set of Python frameworks that make this easy

hearst_frameworks.zip

from redshift_framework.redshift_session import RedshiftSession

# initiate Redshift session
rs = RedshiftSession(pgpass_key='HOSTNAME:PORT:DB:USERNAME')

# read table into pandas dataframe
df = rs.get_df(query='select url,title from {tbl} limit 10', tbl='tmp_fbinst')

# execute sql stored in S3, replace {dt} values in file with 2016/02/21
rs.execute_file(file_name='s3://hearstdataservices/code/FBINST22.sql', dt='2016/02/21')

# execute query and save to tsv in S3
rs.save_query_to_csv(query='select * from tmp_fbinst where url is not null order by 12 desc;',
                     file_name='s3://hearstdataservices/report/test.csv', sep='\t')

# execute sql and save table to json file in S3
rs.save_query_to_json(query='select * from tmp_fbinst where url is not null order by 12 desc;',
                      file_name='s3://hearstdataservices/report/test.json')

Deep dive: Redshift framework

Redshift Framework is our core framework that makes it easy to create Lambda functions that communicate with Amazon Redshift

Lambda

Tip

Load framework

No password needed

“macro”

variables!

Easily write

query results

S3
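The "macro" variables in the calls above ({tbl}, {dt}) behave like named placeholders in a SQL template. How the Hearst framework implements this is not shown on the slide; a minimal sketch, assuming plain Python string formatting and a hypothetical helper named render_sql, could look like this:

```python
def render_sql(template, **macros):
    """Fill {name} placeholders in a SQL template with keyword values.

    Sketch only: str.format does no escaping, so values must come from
    trusted code, not from user input.
    """
    return template.format(**macros)
```

For example, render_sql('select url,title from {tbl} limit 10', tbl='tmp_fbinst') yields the query with the table name filled in, and a missing macro raises KeyError rather than silently running an incomplete statement.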

Helpers framework

import redshift_framework.helpers as helpers

# write a data frame to a csv/json
helpers.df_to_csv(df1, 's3://hearst/df1.csv')
helpers.df_to_json(df1, 's3://hearst/df1.json')

# download/upload files to S3
helpers.download_s3_file('s3://my-bucket/prefix/sub-prefix/file-name', '/path/to/file-name')
helpers.upload_s3_file('/path/to/file-name', 's3://my-bucket/prefix/sub-prefix/file-name')

# file exists in S3
file_exists = helpers.file_exists_in_s3('my-bucket', 'prefix/sub-prefix/my-file')

# get file from S3 and read into data frame
df = helpers.get_df_from_csv('s3://prefix/sub-prefix/my-file.csv', sep='\t')

# get gzip file from S3 and read into string
content = helpers.get_file_content('s3://prefix/sub-prefix/my-file.csv.gz', compression='gzip')

Create Helpers Framework to make it easier to perform frequently executed actions as well as reading and writing to S3

Lambda

Tip

Load framework

Simpler packaging of the pandas

function with direct connection to

S3

Common task

Quickly get data in

S3 into a data

frame

Hearst’s serverless data pipeline

Amazon S3

Amazon

DynamoDB

Amazon

Kinesis

Amazon

API Gateway

Amazon Redshift

Lambda

etl_main

etl_watchdog

Lambda

ds_main

ds_watchdog

Lambda

translate

Lambda

push_to_DynamoDB

Lambda

Kinesis Firehose_to_S3

DATA API

DATA STORAGE

DATA

PROCESSING

A look at our lessons learned

Amazon

Kinesis

Spark-

Scala

Amazon

Redshift

S3

Dynamo

DB &

API

Gateway

<

5min

$$$$ $$$

Lambda

Amazon

Kinesis

Amazon

Redshift

S3

Dynamo

DB &

API

Gateway

<

2min

$$$ $

AWS Lambda allows you to manage your clickstream with less

You can actually “Do More With

Less”

You don’t need a big team: With

the right frameworks in

place, this can all be done with a

team of 2-3 FTEs

…Or one very rare individual

Best practices and hacks

across the lifecycle

Getting started on AWS Lambda

Development

and Testing

Deployment

and ALM

Security

and Scaling

Debugging

and Operations

Bring your own code

• Node.js 4.3, Java 8,

Python 2.7, C#

Simple resource model

• Select power rating from

128 MB to 1.5 GB

• CPU and network

allocated proportionately

Stateless

• Persist data using

external storage

• No affinity or access to

underlying infrastructure

Flexible use

• Synchronous or

asynchronous

• Integrated with other

AWS services

NEW !

Anatomy of a Lambda function

Handler() function

• The method in your

code where AWS

Lambda begins

execution

Event object

• Pre-defined object

format for AWS

integrations & events

• Java & C# support

simple data types,

POJOs/POCOs, and

Stream input/output

Context object

• Use methods and

properties like

getRemainingTimeInMillis(), identity,

awsRequestId,

invokedFunctionArn,

clientContext,

logStreamName
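A minimal sketch of a Python handler that uses the context object to stop before the time limit is hit. In the Python runtime the corresponding members are get_remaining_time_in_millis() and aws_request_id; the FakeContext class, the SAFETY_MARGIN_MS value, and the event shape ({"items": [...]}) are assumptions made only so the sketch can run locally.

```python
SAFETY_MARGIN_MS = 10000  # assumed margin; tune for the real workload


class FakeContext:
    """Local stand-in for the Lambda context object (testing only)."""

    def __init__(self, remaining_ms, step_ms, request_id="local-test"):
        self._remaining_ms = remaining_ms
        self._step_ms = step_ms
        self.aws_request_id = request_id

    def get_remaining_time_in_millis(self):
        # Simulate time passing: each call reports less remaining time.
        value = self._remaining_ms
        self._remaining_ms -= self._step_ms
        return value


def handler(event, context):
    """Process items until the remaining time drops below the margin."""
    items = event.get("items", [])
    processed = []
    for item in items:
        if context.get_remaining_time_in_millis() < SAFETY_MARGIN_MS:
            break  # leave the rest for a later invocation
        processed.append(item.upper())
    return {
        "requestId": context.aws_request_id,
        "processed": processed,
        "remaining": items[len(processed):],
    }
```

Returning the unprocessed items lets a caller, or a watchdog, pick up where the invocation stopped.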

FunctionConfiguration metadata

VpcConfig

• Enables private

communication with

other resources

within your VPC

• Provide EC2 security

group and subnets,

auto-creates ENIs

• Internet access can

be added through

NAT Gateway

DeadLetterConfig

• Failed events sent to

your SQS queue /

SNS topic

• Redrive messages

that Lambda could

not process

• Currently available

for asynchronous

invocations only

Environment

• Add custom

key/value pairs as

part of configuration

• Reuse code across

different setups or

passwords

• Encrypted with

specified KMS key

on server, decrypted

at container init

NEW ! NEW !

AWS Lambda limits

Resource Limits Default Limit

Ephemeral disk capacity ("/tmp" space) 512 MB

Number of file descriptors 1024

Number of processes and threads (combined total) 1024

Maximum execution duration per request 300 seconds

Invoke request body payload size (RequestResponse) 6 MB

Invoke request body payload size (Event) 128 KB

Invoke response body payload size (RequestResponse) 6 MB

Dead-letter payload size (Event) 128 KB

Deployment Limits Default Limit

Lambda function deployment package size (.zip/.jar file) 50 MB

Size of code/dependencies that you can zip into a deployment package (uncompressed zip/jar size) 250 MB

Total size of all the deployment packages that can be uploaded per region 75 GB

Total size of environment variables set 4 KB

Throttling Limits (can request service limit increase) Default Limit

Concurrent executions 100
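The payload limits in the table can be checked on the client before invoking. The sketch below assumes the payload is JSON and that the limits use binary units (128 KB = 128 * 1024 bytes, 6 MB = 6 * 1024 * 1024 bytes); the function names are hypothetical, and the limit values are the ones listed above, which may change over time.

```python
import json

# Limits as listed in the table above; binary units are an assumption.
PAYLOAD_LIMITS_BYTES = {
    "Event": 128 * 1024,                  # asynchronous invocation
    "RequestResponse": 6 * 1024 * 1024,   # synchronous invocation
}


def payload_size(payload):
    """Size in bytes of the payload once serialized as UTF-8 JSON."""
    return len(json.dumps(payload).encode("utf-8"))


def fits_invocation_type(payload, invocation_type):
    """True if the serialized payload is within the limit for this type."""
    return payload_size(payload) <= PAYLOAD_LIMITS_BYTES[invocation_type]
```

When a payload is too large for an Event invocation, a common alternative is to store the data in S3 and pass only its location.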

The container model

Container reuse

• Declarations in your Lambda function

code outside handler()

• Disk content in /tmp

• Background processes or callbacks

• Make use of container reuse

opportunistically, e.g.

• Load additional libraries

• Cache static data

• Database connections

Cold starts

• Time to set up a new container and do

necessary bootstrapping when a

Lambda function is invoked for the first

time or after it has been updated

• Ways to reduce cold start latency

• More memory = faster

performance, lower start up time

• Smaller function ZIP loads faster

• Node.js and Python start execution

faster than Java and C#
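Container reuse can be exploited by keeping expensive setup outside the per-request path. A minimal sketch follows; the _load_static_data function stands in for any costly initialization (loading libraries, opening a database connection), and all names are hypothetical. Reuse is opportunistic, so the code must still work when the cache is empty.

```python
_cache = None        # survives between invocations while the container is reused
_init_calls = 0      # counts how often the expensive setup actually ran


def _load_static_data():
    """Placeholder for expensive setup such as loading reference data."""
    global _init_calls
    _init_calls += 1
    return {"greeting": "hello"}


def init_count():
    """Expose the setup counter for inspection."""
    return _init_calls


def handler(event, context):
    """Use cached data when the container is warm, load it when cold."""
    global _cache
    if _cache is None:
        _cache = _load_static_data()
    return "{} {}".format(_cache["greeting"], event["name"])
```

Calling the handler twice in the same process runs the setup once, which mirrors what happens on a warm container.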

The execution environment

Underlying OS

• Public Amazon Linux AMI version

(amzn-ami-hvm-2016.03.3.x86_64-gp2)

• Linux kernel version (4.4.23-31.54.amzn1.x86_64)

• Compile native binaries against this

environment – can be used to bring

your own runtime!

• Changes over time, always check the

latest versions supported here

Available libraries

• ImageMagick (nodejs wrapper and

native binary)

• OpenJDK 1.8, .NET Core 1.0.1

• AWS SDK for JavaScript version 2.6.9

• AWS SDK for Python (Boto 3) version

1.4.1, Botocore version 1.4.61

• Embed your own SDK/libraries if you

depend on a specific version

http://docs.aws.amazon.com/lambda/latest/dg/current-supported-versions.html

Building a deployment package

Node.js & Python

• .zip file consisting of

your code and any

dependencies

• Use npm/pip to

install libraries

• All dependencies

must be at root level

Java

• Either .zip file with all

code/dependencies,

or standalone .jar

• Use Maven / Eclipse

IDE plugins

• Compiled class &

resource files at root

level, required jars in

/lib directory

C# (.NET Core)

• Either .zip file with all

code/dependencies,

or a standalone .dll

• Use Nuget /

VisualStudio plugins

• All assemblies (.dll)

at root level, platform

specific libraries

managed by VS

tooling

NEW !

Managing continuous delivery

Source Build Test Deploy

Amazon S3 AWS Lambda (DIY)

AWS CodeCommit

GitHub

AWS CodePipeline

Codeship

Jenkins

AWS CodeBuild

NEW !

… OR …

Deployment tools and frameworks available

CloudFormation

• AWS Serverless

Application Model -

extension optimized

for Serverless

• New Serverless

resources – APIs,

Functions, Tables

• Open specification

(Apache 2.0)

Chalice

• Python serverless

micro-framework

• Quickly create and

deploy applications

• Set up AWS Lambda

and Amazon API

Gateway endpoint

• https://github.com/awslabs/chalice

Third-party tools

• Serverless

Framework

(https://serverless.com/)

• Apex Serverless

Architecture

(http://apex.run/)

• DEEP Framework by

Mitoc Group

(https://github.com/MitocGroup/deep-framework)

NEW !

https://github.com/awslabs/chalice

https://serverless.com/

http://apex.run/

https://github.com/MitocGroup/deep-framework

Function versioning and aliases

• Versions = immutable copies of code +

configuration

• Aliases = mutable pointers to versions

• Development against $LATEST version

• Each version/alias gets its own ARN

• Enables rollbacks, staged promotions,

“locked” behavior for client
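Because each version and alias gets its own ARN, a client can pin itself to a qualifier. The sketch below builds a qualified function ARN and models an alias as a mutable pointer to immutable versions; the region, account ID, function name, and alias names in the usage are placeholders, and the in-memory alias table only illustrates the idea, it does not call AWS.

```python
def qualified_arn(region, account_id, function_name, qualifier=None):
    """Build a Lambda function ARN, optionally qualified by version or alias."""
    base = "arn:aws:lambda:{}:{}:function:{}".format(region, account_id, function_name)
    return base if qualifier is None else "{}:{}".format(base, qualifier)


class AliasTable:
    """Toy model: aliases are mutable pointers to immutable version numbers."""

    def __init__(self):
        self._aliases = {}

    def point(self, alias, version):
        """Point an alias at a version and return the previous target."""
        previous = self._aliases.get(alias)
        self._aliases[alias] = version
        return previous

    def resolve(self, alias):
        return self._aliases[alias]
```

Promotion is pointing PROD at a newer version; rollback is pointing it back at the previous one, while clients keep calling the same alias ARN.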

The push model and resource policies

Function (resource) policy

• Permissions you grant to your Lambda

function determine which service or

event source can invoke your function

• Resource policies make it easy to

grant cross-account permissions to

invoke your Lambda function

The pull model and IAM roles

IAM (execution) role

• Permissions you grant to this role

determine what your AWS Lambda

function can do

• If event source is Amazon DynamoDB

or Amazon Kinesis, then add read

permissions in IAM role

Concurrent executions and throttling

Determining concurrency

• For stream-based event sources:

Number of shards per stream is the

unit of concurrency

• For all other event sources: Request

rate and duration drives concurrency

(concurrency = requests per second *

function duration)
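The concurrency formula above is simple enough to check with arithmetic. A small sketch, with hypothetical function names and the default limit of 100 concurrent executions taken from the limits slide:

```python
import math


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """concurrency = requests per second * function duration (in seconds)."""
    return requests_per_second * avg_duration_seconds


def exceeds_limit(requests_per_second, avg_duration_seconds, limit=100):
    """True if the estimated concurrency would be throttled at this limit."""
    needed = math.ceil(estimated_concurrency(requests_per_second, avg_duration_seconds))
    return needed > limit
```

For example, 100 requests per second at 0.5 seconds each needs about 50 concurrent executions, while the same rate at 2 seconds each needs about 200 and would exceed the default limit.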

Throttle behavior


• For stream-based event sources:

Automatically retried until data expires

• For Asynchronous invocations:

Automatically retried for up to six

hours, with delays between retries

• For Synchronous invocations: Invoking

application receives a 429 error and is

responsible for retries

Other scaling considerations

For Lambda

• Remember, a throttle is NOT an error!

• If you expect sudden large spikes in

demand, consider Asynchronous

invocations to Lambda

• Proactively engage AWS Support to

increase your throttling limits

For upstream/downstream services

• Build retries/backoff in client

applications and upstream setup

• Make sure your downstream setup

“keeps up” with Lambda scaling

• Limit concurrency when connecting to

relational databases
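The advice to build retries and backoff into client applications could be sketched as follows. ThrottledError is a hypothetical exception standing in for whatever the client library raises on a throttle (a 429 on synchronous invocations); the attempt count and delays are illustrative defaults, and the sleep function is injectable so the logic can be tested without waiting.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical marker for a throttled (429) response."""


def invoke_with_backoff(invoke, max_attempts=5, base_delay=0.1,
                        max_delay=5.0, sleep=time.sleep):
    """Call `invoke`, retrying throttles with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return invoke()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            cap = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, cap))  # full jitter spreads out retries
```

Randomizing the delay keeps many throttled clients from retrying at the same instant, which would otherwise recreate the spike.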

Errors and retries

Types of errors

• 4xx Client Error: Can be fixed by

developer, e.g. InvalidParameterValue

(400), ResourceNotFound (404),

RequestTooLarge (413), etc.

• 5xx Server Error: Most can be fixed by

admin, e.g. EC2 ENI management

errors (502)

Retry policy


• For stream-based event sources:

Automatically retried until data expires

• For Asynchronous invocations:

Automatically retried 2 extra times,

then published to dead-letter queue

• For Synchronous invocations: Invoking

application receives an error code and

is responsible for retries

Tracing and tracking

Integration with AWS X-Ray

• Collects data about requests that your

application serves

• Visibility into the AWS Lambda service

(dwell time, number of retries, latency

and errors)

• Detailed breakdown of your function’s

performance, including calls made to

downstream services and endpoints

Integration with AWS CloudTrail

• Captures calls made to AWS Lambda

API; delivers log files to Amazon S3

• Tracks the request made to AWS

Lambda, the source IP address from

which the request was made, who

made the request, when it was made

• All control plane APIs can be tracked

(except the versioning/aliasing and invoke APIs)

COMING

SOON!

Troubleshooting and monitoring

Logs

• Every invocation generates START, END,

and REPORT entries in CloudWatch Logs

• User logs included

• Node.js – console.log(), console.error(),

console.warn(), console.info()

• Java – log4j.*, LambdaLogger.log(),

System.out, System.err

• Python – print, logging.*

• C# – LambdaLogger.Log(),

ILambdaContext.Logger.Log(),

Console.Write(), Console.WriteLine()

Metrics

• Default (Free) Metrics: Invocations,

Duration, Throttles, Errors – available as

CloudWatch Metrics

• Additional Metrics: Create custom

metrics for tracking health/status

• Function code vs log-filters

• Ops-centric vs. business-centric

Conclusion and next steps

Key takeaway

AWS Lambda is one of the core components of the

platform AWS provides to develop serverless applications

Next steps

1. Stay up to date with AWS Lambda on the Compute blog

and check out our detail page for more scenarios.

2. Send us your questions, comments, and feedback on

the AWS Lambda Forums.

https://aws.amazon.com/blogs/compute/

aws.amazon.com/lambda

https://forums.aws.amazon.com/forum.jspa?forumID=186

Questions & Answers

Thank you!

Follow us on Twitter

@vyomnagrani

@statsrick

Remember to complete

your evaluations!

Related Sessions

SVR202 – What’s New with AWS Lambda

SVR301 – Real-time Data Processing Using AWS Lambda

SVR302 – Optimizing the Data Tier in Serverless Web Applications

SVR304 – bots + serverless = ❤

SVR307 – Application Lifecycle Management in a Serverless World

SVR311 – The State of Serverless Computing

SVR401 – Using AWS Lambda to Build Control Systems for Your AWS Infrastructure

SVR402 – Operating Your Production API

CMP211 – Getting Started with Serverless Architectures

DEV205 – Monitoring, Hold the Infrastructure: Getting the Most from AWS Lambda

DEV301 – Amazon CloudWatch Logs and AWS Lambda: A Match Made in Heaven

DEV308 – Chalice: A Serverless Microframework for Python

https://www.portal.reinvent.awsevents.com/connect/sessionDetail.ww?SESSION_ID=8538












See you at the re:Play party!
