AWS re:Invent 2016: ↑↑↓↓←→←→ BA Lambda Start (SVR305)
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Vyom Nagrani - Manager, Product Management, Amazon Web Services
Richard McFarland - VP Data Services and Chief Data Scientist, Hearst Corp
November 2016
What to Expect from the Session
Working with AWS Lambda
Customer example
Hearst clickstream and data pipeline
Best practices and hacks across the lifecycle
Development and testing
Deployment and ALM
Security and scaling
Debugging and operations
Questions & answers
Working with AWS Lambda
Event source → Function → Services (anything)
Event sources: changes in data state, requests to endpoints, changes in resource state
Function runtimes: Node, Python, Java, C#
Benefits of AWS Lambda
Productivity-focused compute platform to build powerful, dynamic, modular applications in the cloud
1. No infrastructure to manage: focus on business logic
2. Cost-effective and efficient: pay only for what you use
3. Bring your own code: run code in standard languages
Event sources that trigger AWS Lambda
Categories: data stores, endpoints, configuration repositories, event/message services
Sources shown: Amazon S3, Amazon DynamoDB, Amazon Kinesis, AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, Amazon SNS, Amazon SES, Amazon API Gateway, Amazon Cognito, Amazon Alexa, cron events, AWS CodeCommit, AWS IoT
… and the list will continue to grow!
Key scenarios and use cases for AWS Lambda
Data processing: Stateless processing of discrete or streaming updates to your data store or message bus
Control systems: Customize responses and response workflows to state and data changes within AWS
App backend development: Execute server-side backend logic for web, mobile, device, or voice user interactions
Customer example: Hearst clickstream and data pipeline
What I will be talking about
Cron-ified clickstream
Lambda-fy!
Lessons learned
What business is Hearst in?
Magazines: 20 U.S. titles & nearly 300 international titles
Newspapers: 15 daily & 34 weekly titles
Broadcasting: 30 television & 2 radio stations
Business Media: Operates more than 20 business-to-business companies with significant holdings in the auto, electronic, medical and financial industries
Hearst has over 300 websites worldwide, which results in 1 TB of data per day and over 20 billion pageviews per year.
“Hearst is in the Data Creation Business”
“Managing our clickstream is necessary for Hearst to extract business value from our big data”
[Diagram: clickstream placed along four dimensions. Volume, Variety (structured data, unstructured data), Velocity (batches, streaming), and Value extraction (DBA and analysts, cloud engineering and machine learning). Other labels: single source, many sources, normal data, big data.]
Hearst’s cron-based clickstream
Hearst’s data pipeline: cron-based
[Diagram components: users to Hearst properties, clickstream, Node.js app proxy, Amazon Kinesis, Amazon S3, ETL on EMR, Amazon Redshift, data science application (models, agg data), API-ready data, Buzzing API (DynamoDB, API Gateway). Four 5-minute cron jobs appear between stages.]
Latency and throughput, as listed on the slide: milliseconds at 100 GB/day; 30 seconds at 5 GB/day; 100 seconds at 1 GB/day; 5 seconds at 1 GB/day
Lambda-fy it!
Lambda limit: Code must execute in 5 minutes or less
Lambda tip: For every Lambda process, create a “watchdog” that checks for failures and fills in the gaps
Lambda functions: etl_main, etl_watchdog, ds_main, ds_watchdog, translate, push_to_DynamoDB, api_integration
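The watchdog tip can be read as a gap check. The following is a minimal sketch, not Hearst's implementation: it assumes each main run records the start time of the 5-minute window it processed, and the helper names `expected_windows` and `find_gaps` are hypothetical. The watchdog recomputes which windows have no record so they can be re-run.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=5)  # assumed batch interval, matching the old 5-minute cron


def expected_windows(start, end, step=WINDOW):
    """Return the start time of every window in [start, end)."""
    windows = []
    current = start
    while current < end:
        windows.append(current)
        current += step
    return windows


def find_gaps(start, end, completed):
    """Return the windows in [start, end) that have no completion record."""
    done = set(completed)
    return [w for w in expected_windows(start, end) if w not in done]


# Hypothetical data: only the 12:00 and 12:10 runs succeeded.
start = datetime(2016, 11, 30, 12, 0)
end = datetime(2016, 11, 30, 12, 20)
completed = [start, start + 2 * WINDOW]
gaps = find_gaps(start, end, completed)
print([g.strftime("%H:%M") for g in gaps])
```

A real watchdog would then re-invoke the main function for each gap; where the completion records live (S3, DynamoDB, Redshift) is left open here.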
Lambda tip: Add “triggers” in S3 that are 0-byte files with the name of the Lambda function
Convert existing cron-driven process into trigger-based process
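One way to picture the zero-byte trigger files is a small dispatcher that reads the S3 event notification and treats each empty object's file name as the function to run. This is a sketch under assumptions, written for Python 3: the `triggers/` prefix and bucket name are hypothetical, and the actual invocation is left as a comment so the block runs without AWS access.

```python
from urllib.parse import unquote_plus


def functions_to_trigger(event):
    """Return the function names named by zero-byte trigger objects in an S3 event."""
    names = []
    for record in event.get("Records", []):
        obj = record["s3"]["object"]
        if obj.get("size", 0) != 0:
            continue  # only empty marker files count as triggers
        key = unquote_plus(obj["key"])  # keys in S3 event notifications are URL-encoded
        names.append(key.rsplit("/", 1)[-1])  # file name doubles as function name
    return names


def handler(event, context):
    names = functions_to_trigger(event)
    # A real dispatcher could call, for each name:
    #   boto3.client("lambda").invoke(FunctionName=name, InvocationType="Event")
    return names


# Hypothetical event: one empty trigger file and one ordinary data file.
sample_event = {
    "Records": [
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "triggers/ds_main", "size": 0}}},
        {"s3": {"bucket": {"name": "example-bucket"},
                "object": {"key": "data/part-0001.gz", "size": 2048}}},
    ]
}
print(handler(sample_event, None))
```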
[Diagram components, trigger-based pipeline: Amazon Kinesis, Lambda (Kinesis Firehose_to_S3), ETL on EMR, Amazon Redshift, data science application, API-ready data, Buzzing API (DynamoDB, API Gateway).]
Deep dive: Python frameworks
What really “exploded” the use of Lambda functions at Hearst was the introduction of frameworks
Problem: Using Lambda functions to access multiple AWS tools and perform data science requires access credentials and database frameworks
Modules shown: psycopg2, boto3, gzip, pgpasslib, pandas, pytz, numpy, httplib2
Programmers have to configure Python modules not in the standard Python 2.7 library set
So Hearst created a standard set of Python frameworks that make this easy: hearst_frameworks.zip
Deep dive: Redshift framework
Redshift Framework is our core framework that makes it easy to create Lambda functions that communicate with Amazon Redshift
Lambda tip callouts: load framework; no password needed; “macro” variables; easily write query results to S3

# load framework
from redshift_framework.redshift_session import RedshiftSession

# initiate Redshift session (no password needed)
rs = RedshiftSession(pgpass_key='HOSTNAME:PORT:DB:USERNAME')

# read table into pandas dataframe; {tbl} is a "macro" variable
df = rs.get_df(query='select url,title from {tbl} limit 10', tbl='tmp_fbinst')

# execute sql stored in S3, replace {dt} values in file with 2016/02/21
rs.execute_file(file_name='s3://hearstdataservices/code/FBINST22.sql', dt='2016/02/21')

# execute query and save to tsv in S3
rs.save_query_to_csv(query='select * from tmp_fbinst where url is not null order by 12 desc;',
                     file_name='s3://hearstdataservices/report/test.csv', sep='\t')

# execute sql and save table to json file in S3
rs.save_query_to_json(query='select * from tmp_fbinst where url is not null order by 12 desc;',
                      file_name='s3://hearstdataservices/report/test.json')
Helpers framework
Create Helpers Framework to make it easier to perform frequently executed actions as well as reading and writing to S3
Lambda tip callouts: load framework; simpler packaging of the pandas function with direct connection to S3; common task; quickly get data in S3 into a data frame

# load framework
import redshift_framework.helpers as helpers

# write a data frame to a csv/json
helpers.df_to_csv(df1, 's3://hearst/df1.csv')
helpers.df_to_json(df1, 's3://hearst/df1.json')

# download/upload files to S3
helpers.download_s3_file('s3://my-bucket/prefix/sub-prefix/file-name', '/path/to/file-name')
helpers.upload_s3_file('/path/to/file-name', 's3://my-bucket/prefix/sub-prefix/file-name')

# file exists in S3
file_exists = helpers.file_exists_in_s3('my-bucket', 'prefix/sub-prefix/my-file')

# get file from S3 and read into data frame
df = helpers.get_df_from_csv('s3://prefix/sub-prefix/my-file.csv', sep='\t')

# get gzip file from S3 and read into string
content = helpers.get_file_content('s3://prefix/sub-prefix/my-file.csv.gz', compression='gzip')
Hearst’s serverless data pipeline
[Diagram: three layers labeled data API, data storage, and data processing. Services: Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon API Gateway, Amazon Redshift. Lambda functions: etl_main, etl_watchdog, ds_main, ds_watchdog, translate, push_to_DynamoDB, Kinesis Firehose_to_S3.]
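A look at our lessons learned
[Diagram: two pipelines compared. First: Amazon Kinesis, Spark-Scala, S3, Amazon Redshift, DynamoDB & API Gateway; latency < 5 min; cost markers $$$$ and $$$. Second: Amazon Kinesis, Lambda, S3, Amazon Redshift, DynamoDB & API Gateway; latency < 2 min; cost markers $$$ and $.]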
AWS Lambda allows you to manage your clickstream with less
You can actually “Do More With Less”
You don’t need a big team: With the right frameworks in place, this can all be done with a team of 2-3 FTEs
…Or one very rare individual
Best practices and hacks across the lifecycle
Getting started on AWS Lambda
Lifecycle stages: Development and Testing, Deployment and ALM, Security and Scaling, Debugging and Operations
Bring your own code
• Node.js 4.3, Java 8, Python 2.7, C# (new)
Simple resource model
• Select power rating from 128 MB to 1.5 GB
• CPU and network allocated proportionately
Stateless
• Persist data using external storage
• No affinity or access to underlying infrastructure
Flexible use
• Synchronous or asynchronous
• Integrated with other AWS services
Anatomy of a Lambda function
Handler() function
• The method in your code where AWS Lambda begins execution
Event object
• Pre-defined object format for AWS integrations & events
• Java & C# support simple data types, POJOs/POCOs, and Stream input/output
Context object
• Use methods and properties like getRemainingTimeInMillis(), identity, awsRequestId, invokedFunctionArn, clientContext, logStreamName
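The three parts above map onto a Python handler roughly as follows. This is a minimal sketch: the Python runtime exposes the context members in snake_case (`get_remaining_time_in_millis()`, `aws_request_id`, `invoked_function_arn`), and `FakeContext`, with its placeholder values, is a hypothetical stand-in so the handler can be exercised outside Lambda.

```python
import json


def handler(event, context):
    """Entry point: Lambda passes the event payload and a context object."""
    return {
        "request_id": context.aws_request_id,
        "function_arn": context.invoked_function_arn,
        "remaining_ms": context.get_remaining_time_in_millis(),
        "echo": event,
    }


class FakeContext:
    """Local stand-in for the Lambda context object (values are made up)."""
    aws_request_id = "local-test"
    invoked_function_arn = "arn:aws:lambda:us-east-1:123456789012:function:demo"
    log_stream_name = "local"

    def get_remaining_time_in_millis(self):
        return 300000  # pretend the full 300-second budget remains


result = handler({"hello": "world"}, FakeContext())
print(json.dumps(result, sort_keys=True))
```

Checking the remaining time inside long loops is one way to stop cleanly before the execution limit is reached.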
FunctionConfiguration metadata
VpcConfig
• Enables private communication with other resources within your VPC
• Provide EC2 security group and subnets, auto-creates ENIs
• Internet access can be added through NAT Gateway
DeadLetterConfig (new)
• Failed events sent to your SQS queue / SNS topic
• Redrive messages that Lambda could not process
• Currently available for asynchronous invocations only
Environment (new)
• Add custom key/value pairs as part of configuration
• Reuse code across different setups or passwords
• Encrypted with specified KMS key on server, decrypted at container init
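Reusing the same code across setups usually comes down to reading the Environment key/value pairs at startup. A minimal sketch, assuming two hypothetical variable names (`TABLE_NAME`, `DOWNSTREAM_TIMEOUT_S`) and hypothetical defaults:

```python
import os


def load_config(env=os.environ):
    """Read settings from environment variables, falling back to dev defaults."""
    return {
        "table": env.get("TABLE_NAME", "dev-table"),  # hypothetical variable
        "timeout_s": int(env.get("DOWNSTREAM_TIMEOUT_S", "3")),  # hypothetical variable
    }


CONFIG = load_config()  # evaluated once per container, not once per invocation


def handler(event, context):
    return "writing to " + CONFIG["table"]


print(load_config({"TABLE_NAME": "prod-table", "DOWNSTREAM_TIMEOUT_S": "10"}))
```

Passing the mapping in as a parameter keeps the function testable without touching the real process environment.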
AWS Lambda limits
Resource limits (default limit)
• Ephemeral disk capacity ("/tmp" space): 512 MB
• Number of file descriptors: 1024
• Number of processes and threads (combined total): 1024
• Maximum execution duration per request: 300 seconds
• Invoke request body payload size (RequestResponse): 6 MB
• Invoke request body payload size (Event): 128 KB
• Invoke response body payload size (RequestResponse): 6 MB
• Dead-letter payload size (Event): 128 KB
Deployment limits (default limit)
• Lambda function deployment package size (.zip/.jar file): 50 MB
• Size of code/dependencies that you can zip into a deployment package (uncompressed zip/jar size): 250 MB
• Total size of all the deployment packages that can be uploaded per region: 75 GB
• Total size of environment variables set: 4 KB
Throttling limits (can request service limit increase; default limit)
• Concurrent executions: 100
The container model
Container reuse
• Declarations in your Lambda function code outside handler()
• Disk content in /tmp
• Background processes or callbacks
• Make use of container reuse opportunistically, e.g.
  • Load additional libraries
  • Cache static data
  • Database connections
Cold starts
• Time to set up a new container and do necessary bootstrapping when a Lambda function is invoked for the first time or after it has been updated
• Ways to reduce cold start latency
  • More memory = faster performance, lower start-up time
  • Smaller function ZIP loads faster
  • Node.js and Python start execution faster than Java and C#
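Opportunistic container reuse typically means keeping expensive state at module scope and filling it lazily. The sketch below is illustrative only: `load_static_data` is a hypothetical stand-in for a library load, an S3 download, or a database connection.

```python
import time

_CACHE = {}  # module scope: kept between invocations while the container is reused


def load_static_data():
    """Hypothetical expensive load; simulated here with a short sleep."""
    time.sleep(0.01)
    return {"loaded_at": time.time()}


def handler(event, context):
    cold = "static" not in _CACHE
    if cold:
        _CACHE["static"] = load_static_data()  # paid only on a cold start
    return {"cold_start": cold, "loaded_at": _CACHE["static"]["loaded_at"]}


first = handler({}, None)   # simulates the first call in a fresh container
second = handler({}, None)  # simulates a later call in the same container
print(first["cold_start"], second["cold_start"])
```

Because reuse is not guaranteed, the handler still has to work when the cache is empty.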
The execution environment
Underlying OS
• Public Amazon Linux AMI version (amzn-ami-hvm-2016.03.3.x86_64-gp2)
• Linux kernel version (4.4.23-31.54.amzn1.x86_64)
• Compile native binaries against this environment – can be used to bring your own runtime!
• Changes over time, always check the latest versions supported here
Available libraries
• ImageMagick (nodejs wrapper and native binary)
• OpenJDK 1.8, .NET Core 1.0.1
• AWS SDK for JavaScript version 2.6.9
• AWS SDK for Python (Boto 3) version 1.4.1, Botocore version 1.4.61
• Embed your own SDK/libraries if you depend on a specific version
Building a deployment package
Node.js & Python
• .zip file consisting of your code and any dependencies
• Use npm/pip to install libraries
• All dependencies must be at root level
Java
• Either .zip file with all code/dependencies, or standalone .jar
• Use Maven / Eclipse IDE plugins
• Compiled class & resource files at root level, required jars in /lib directory
C# (.NET Core) (new)
• Either .zip file with all code/dependencies, or a standalone .dll
• Use NuGet / Visual Studio plugins
• All assemblies (.dll) at root level, platform-specific libraries managed by VS tooling
Managing continuous delivery
Stages: Source, Build, Test, Deploy
Tools shown: Amazon S3, AWS Lambda (DIY), AWS CodeCommit, GitHub, AWS CodePipeline, Codeship, Jenkins, AWS CodeBuild (new)
… OR …
Deployment tools and frameworks available
CloudFormation
• AWS Serverless Application Model - extension optimized for Serverless
• New Serverless resources – APIs, Functions, Tables
• Open specification (Apache 2.0)
Chalice
• Python serverless micro-framework
• Quickly create and deploy applications
• Set up AWS Lambda and Amazon API Gateway endpoint
• https://github.com/awslabs/chalice
Third-party tools
• Serverless Framework (https://serverless.com/)
• Apex Serverless Architecture (http://apex.run/)
• DEEP Framework by Mitoc Group (https://github.com/MitocGroup/deep-framework)
NEW !
Function versioning and aliases
• Versions = immutable copies of code + configuration
• Aliases = mutable pointers to versions
• Development against $LATEST version
• Each version/alias gets its own ARN
• Enables rollbacks, staged promotions, “locked” behavior for client
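Since each version and alias has its own ARN, a client can pin itself to one by appending a qualifier to the function ARN. A small sketch of that naming scheme; the account ID, function name, and alias name are placeholders:

```python
def qualified_arn(region, account_id, function_name, qualifier=None):
    """Build a Lambda function ARN, optionally pinned to a version or alias."""
    base = "arn:aws:lambda:{}:{}:function:{}".format(region, account_id, function_name)
    return base if qualifier is None else base + ":" + qualifier


unqualified = qualified_arn("us-east-1", "123456789012", "etl_main")
version_3 = qualified_arn("us-east-1", "123456789012", "etl_main", "3")
prod_alias = qualified_arn("us-east-1", "123456789012", "etl_main", "PROD")  # hypothetical alias

for arn in (unqualified, version_3, prod_alias):
    print(arn)
```

Pointing clients at the alias ARN lets a rollback be done by moving the alias back to an earlier version, with no client change.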
The push model and resource policies
Function (resource) policy
• Permissions you grant to your Lambda function determine which service or event source can invoke your function
• Resource policies make it easy to grant cross-account permissions to invoke your Lambda function
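For the push model, a resource policy statement that lets Amazon S3 invoke a function might look like the structure built below. This is a sketch of the policy shape only: the statement ID, bucket, and function ARN are hypothetical, and in practice the statement would be attached through the Lambda permissions API rather than assembled by hand.

```python
import json


def invoke_permission_statement(sid, function_arn, service_principal, source_arn=None):
    """Build one resource-policy statement allowing a service to invoke a function."""
    statement = {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"Service": service_principal},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
    }
    if source_arn is not None:
        # Narrow the grant to a single event source, e.g. one bucket.
        statement["Condition"] = {"ArnLike": {"AWS:SourceArn": source_arn}}
    return statement


statement = invoke_permission_statement(
    sid="allow-s3-invoke",  # hypothetical statement id
    function_arn="arn:aws:lambda:us-east-1:123456789012:function:etl_main",
    service_principal="s3.amazonaws.com",
    source_arn="arn:aws:s3:::example-bucket",  # hypothetical bucket
)
print(json.dumps({"Version": "2012-10-17", "Statement": [statement]}, indent=2))
```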
The pull model and IAM roles
IAM (execution) role
• Permissions you grant to this role determine what your AWS Lambda function can do
• If event source is Amazon DynamoDB or Amazon Kinesis, then add read permissions in IAM role
Concurrent executions and throttling
Determining concurrency
• For stream-based event sources: Number of shards per stream is the unit of concurrency
• For all other event sources: Request rate and duration drive concurrency (concurrency = requests per second * function duration)
Throttle behavior
• For stream-based event sources: Automatically retried until data expires
• For asynchronous invocations: Automatically retried for up to six hours, with delays between retries
• For synchronous invocations: Invoking application receives a 429 error and is responsible for retries
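The concurrency formula is easy to check with a couple of numbers. The request rates and durations below are made-up examples; the limit of 100 is the default concurrent-executions limit quoted in this deck.

```python
DEFAULT_CONCURRENCY_LIMIT = 100  # default limit stated on the limits slide


def estimated_concurrency(requests_per_second, avg_duration_seconds):
    """concurrency = requests per second * function duration (non-stream sources)."""
    return requests_per_second * avg_duration_seconds


def stream_concurrency(shard_count):
    """For stream-based sources, the number of shards is the unit of concurrency."""
    return shard_count


steady = estimated_concurrency(50, 0.5)   # 50 req/s, 500 ms each
spike = estimated_concurrency(300, 0.5)   # 300 req/s, 500 ms each

for label, value in (("steady", steady), ("spike", spike)):
    status = "within" if value <= DEFAULT_CONCURRENCY_LIMIT else "over"
    print(label, value, status, "the default limit")
```

At 50 requests per second the estimate is 25 concurrent executions; at 300 it is 150, which would be throttled under the default limit.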
Other scaling considerations
For Lambda
• Remember, a throttle is NOT an error!
• If you expect sudden large spikes in demand, consider asynchronous invocations to Lambda
• Proactively engage AWS Support to increase your throttling limits
For upstream/downstream services
• Build retries/backoff in client applications and upstream setup
• Make sure your downstream setup “keeps up” with Lambda scaling
• Limit concurrency when connecting to relational databases
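Client-side retries with backoff can be sketched as exponential delays with jitter. This is one common approach, not something the slides prescribe: the `Throttled` exception, the delay values, and the attempt count are all assumptions, and `flaky` simulates a call that is throttled twice before succeeding.

```python
import random
import time


class Throttled(Exception):
    """Raised by the wrapped call when the service answers with a throttle (429)."""


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, max_delay=5.0, sleep=time.sleep):
    """Call fn, retrying on Throttled with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Throttled:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            delay = min(max_delay, base_delay * (2 ** attempt))
            sleep(random.uniform(0, delay))


attempts = {"n": 0}


def flaky():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise Throttled()
    return "ok"


result = call_with_backoff(flaky, sleep=lambda seconds: None)  # skip real sleeping in the demo
print(result, attempts["n"])
```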
Errors and retries
Types of errors
• 4xx Client Error: Can be fixed by developer, e.g. InvalidParameterValue (400), ResourceNotFound (404), RequestTooLarge (413), etc.
• 5xx Server Error: Most can be fixed by admin, e.g. EC2 ENI management errors (502)
Retry policy
• For stream-based event sources: Automatically retried until data expires
• For asynchronous invocations: Automatically retried 2 extra times, then published to dead-letter queue
• For synchronous invocations: Invoking application receives an error code and is responsible for retries
Tracing and tracking
Integration with AWS X-Ray (coming soon)
• Collects data about requests that your application serves
• Visibility into the AWS Lambda service (dwell time, number of retries, latency and errors)
• Detailed breakdown of your function’s performance, including calls made to downstream services and endpoints
Integration with AWS CloudTrail
• Captures calls made to AWS Lambda API; delivers log files to Amazon S3
• Tracks the request made to AWS Lambda, the source IP address from which the request was made, who made the request, when it was made
• All control plane APIs can be tracked (no versioning/aliasing and invoke API)
Troubleshooting and monitoring
Logs
• Every invocation generates START, END, and REPORT entries in CloudWatch Logs
• User logs included
  • Node.js – console.log(), console.error(), console.warn(), console.info()
  • Java – log4j.*, LambdaLogger.log(), System.out, System.err
  • Python – print, logging.*
  • C# – LambdaLogger.Log(), ILambdaContext.Logger.Log(), Console.Write(), Console.WriteLine()
Metrics
• Default (free) metrics: Invocations, Duration, Throttles, Errors – available as CloudWatch Metrics
• Additional metrics: Create custom metrics for tracking health/status
  • Function code vs. log filters
  • Ops-centric vs. business-centric
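As an illustration of the log-filter route to custom metrics, the sketch below pulls duration and memory figures out of a REPORT entry. The sample line and its field layout are assumptions based on typical REPORT entries, and the layout can change, so the parser returns None when a line does not match.

```python
import re

# Assumed REPORT layout; fields are tab- or space-separated.
REPORT_RE = re.compile(
    r"REPORT RequestId:\s*(?P<request_id>\S+)\s+"
    r"Duration:\s*(?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration:\s*(?P<billed_ms>[\d.]+) ms\s+"
    r"Memory Size:\s*(?P<memory_mb>\d+) MB\s+"
    r"Max Memory Used:\s*(?P<max_used_mb>\d+) MB"
)


def parse_report(line):
    """Return the numeric fields of a REPORT log line, or None if it does not match."""
    match = REPORT_RE.search(line)
    if match is None:
        return None
    return {
        "request_id": match.group("request_id"),
        "duration_ms": float(match.group("duration_ms")),
        "billed_ms": float(match.group("billed_ms")),
        "memory_mb": int(match.group("memory_mb")),
        "max_used_mb": int(match.group("max_used_mb")),
    }


# Made-up sample line for illustration.
sample = ("REPORT RequestId: abc-123\tDuration: 102.52 ms\tBilled Duration: 200 ms"
          "\tMemory Size: 128 MB\tMax Memory Used: 23 MB")
print(parse_report(sample))
```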
Conclusion and next steps
Key takeaway
AWS Lambda is one of the core components of the platform AWS provides to develop serverless applications
Next steps
1. Stay up to date with AWS Lambda on the Compute blog and check out our detail page for more scenarios.
2. Send us your questions, comments, and feedback on the AWS Lambda Forums.
Questions & Answers
Thank you!
Follow us on Twitter: @vyomnagrani, @statsrick
Remember to complete your evaluations!
Related Sessions
SVR202 – What’s New with AWS Lambda
SVR301 – Real-time Data Processing Using AWS Lambda
SVR302 – Optimizing the Data Tier in Serverless Web Applications
SVR304 – bots + serverless = ❤
SVR307 – Application Lifecycle Management in a Serverless World
SVR311 – The State of Serverless Computing
SVR401 – Using AWS Lambda to Build Control Systems for Your AWS Infrastructure
SVR402 – Operating Your Production API
CMP211 – Getting Started with Serverless Architectures
DEV205 – Monitoring, Hold the Infrastructure: Getting the Most from AWS Lambda
DEV301 – Amazon CloudWatch Logs and AWS Lambda: A Match Made in Heaven
DEV308 – Chalice: A Serverless Microframework for Python
See you at the re:Play party!