Data Processing without Servers | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Jed Sundwall, Global Open Data Lead June 21, 2016 Data Processing Without Servers: Serverless Processing of Landsat 8 Imagery Using AWS Lambda with Landsat on AWS

Transcript of Data Processing without Servers | AWS Public Sector Summit 2016

Page 1: Data Processing without Servers | AWS Public Sector Summit 2016

© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Jed Sundwall, Global Open Data Lead

June 21, 2016

Data Processing Without Servers: Serverless Processing of Landsat 8 Imagery

Using AWS Lambda with Landsat on AWS

Page 2: Data Processing without Servers | AWS Public Sector Summit 2016

What is Landsat?

Page 3: Data Processing without Servers | AWS Public Sector Summit 2016

Landsat

The Landsat program is a joint effort of the U.S. Geological Survey and NASA. It is the longest running program to gather Earth imagery from space and is considered the gold standard for natural resources satellite imagery.

Page 4: Data Processing without Servers | AWS Public Sector Summit 2016

Landsat—not just pretty pictures

Landsat scenes are made up of multiple files, each of which includes data about different kinds of light reflected off of Earth.

Each pixel of each Landsat 8 file represents a 12-bit measurement of light reflected off a 30 m by 30 m part of our planet. Each Landsat 8 scene contains about 840 million pixels and takes up about 800 MB.

We currently host over 400,000 Landsat 8 scenes and make about 700 new scenes available on Amazon S3 every day.

That’s 588 billion pixels a day.
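The daily figure follows directly from the per-scene numbers on this slide. A quick check of the arithmetic, using only the approximate values quoted above (the decimal-gigabyte conversion is an assumption):

```python
PIXELS_PER_SCENE = 840_000_000  # "about 840 million pixels" per scene
NEW_SCENES_PER_DAY = 700        # "about 700 new scenes" per day
MB_PER_SCENE = 800              # "about 800 MB" per scene

pixels_per_day = PIXELS_PER_SCENE * NEW_SCENES_PER_DAY
# Assumes decimal units (1 GB = 1000 MB); the slide does not say which is meant.
gb_per_day = MB_PER_SCENE * NEW_SCENES_PER_DAY / 1000

print(f"{pixels_per_day / 1e9:.0f} billion pixels per day")
print(f"about {gb_per_day:.0f} GB of new imagery per day")
```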

Page 5: Data Processing without Servers | AWS Public Sector Summit 2016

RGB: visible light

Infrared: vegetation

Shortwave infrared: urban areas

Wellington, New Zealand

Page 6: Data Processing without Servers | AWS Public Sector Summit 2016

What does “serverless” mean?

Page 7: Data Processing without Servers | AWS Public Sector Summit 2016

“Serverless” is an approach to software development that eliminates the need for maintaining and administering servers.

What does “serverless” mean?

Page 8: Data Processing without Servers | AWS Public Sector Summit 2016

Application design is facilitated through interaction with third-party APIs/services and self-created non-server based APIs.

What does “serverless” mean?

Page 9: Data Processing without Servers | AWS Public Sector Summit 2016

AWS Lambda

AWS Lambda

Serverless compute service that runs code in response to events and automatically manages the underlying compute resources

Page 10: Data Processing without Servers | AWS Public Sector Summit 2016

AWS Lambda

COMPUTE SERVICE: Run code at any scale without thinking about servers

EVENT DRIVEN: Code only runs when it needs to run, charged on execution time

Page 11: Data Processing without Servers | AWS Public Sector Summit 2016

AWS Lambda + Landsat

Page 12: Data Processing without Servers | AWS Public Sector Summit 2016

Landsat on AWS

Landsat on AWS makes each band of each scene readily available as objects on Amazon S3.

Data can be accessed programmatically via HTTP and quickly deployed to any of our products for analysis and processing.

An Amazon SNS topic publishes a notification whenever a new scene is available.

Page 13: Data Processing without Servers | AWS Public Sector Summit 2016

Landsat on AWS

Landsat TIFFs represent individual wavelengths of light, and need to be combined to be interpretable by most people.

Using image processing tools, we can combine multiple bands into one “true color” image.

Page 14: Data Processing without Servers | AWS Public Sector Summit 2016

Our goal is to create true color images automatically as each scene is made publicly available.

AWS Lambda

Amazon DynamoDB

Amazon S3

Amazon SNS

Page 15: Data Processing without Servers | AWS Public Sector Summit 2016

We can seamlessly integrate various Amazon Web Services products to create a serverless architecture that will achieve this quickly and cost-effectively.

AWS Lambda

Amazon DynamoDB

Amazon S3

Amazon SNS

Page 16: Data Processing without Servers | AWS Public Sector Summit 2016

Serverless architecture

AWS Lambda

Landsat 8 bucket

Amazon SNS

Target bucket

Amazon DynamoDB

Page 17: Data Processing without Servers | AWS Public Sector Summit 2016

{ "Records": [ { "EventVersion": "1.0", "EventSubscriptionArn": "arn:aws:sns:EXAMPLE", "EventSource": "aws:sns", "Sns": { "SignatureVersion": "1", "Timestamp": "1970-01-01T00:00:00.000Z", "Signature": "EXAMPLE", "SigningCertUrl": "EXAMPLE", "MessageId": "95df01b4-ee98-5cb9-9903-4c221d41eb5e", "Message": "{\"Records\":[{\"eventVersion\":\"2.0\",\"eventSource\":\"aws:s3\",\"awsRegion\":\"us-west-2\",\"eventTime\":\"2016-01-16T01:36:55.014Z\",\"eventName\":\"ObjectCreated:Put\",\"userIdentity\":{\"principalId\":\"AWS:AIDAILHHXPNIKSGVUGOZK\"},\"requestParameters\":{\"sourceIPAddress\":\"52.27.39.85\"},\"responseElements\":{\"x-amz-request-id\":\"078952E6C7CC52B4\",\"x-amz-id-2\":\"Xboo1ULzd7PxY27iIaGXjUStV8TmG52JAbiWQpiRJWuRqfaBhLcc0XMUKNmXgd5fbIfRd1IcrgE=\"},\"s3\":{\"s3SchemaVersion\":\"1.0\",\"configurationId\":\"NewHTML\",\"bucket\":{\"name\":\"landsat-pds\",\"ownerIdentity\":{\"principalId\":\"A3LZTVCZQ87CNW\"},\"arn\":\"arn:aws:s3:::landsat-pds\"},\"object\":{\"key\":\"L8/169/060/LC81690602016015LGN00/index.html\",\"size\":3780,\"eTag\":\"736e4e5a36cb8a1c6cbfc58659126ff1\",\"sequencer\":\"0056999EB6F8BDBB8D\"}}}]}", "Type": "Notification", "UnsubscribeUrl": "EXAMPLE", "TopicArn": "arn:aws:sns:EXAMPLE", "Subject": "TestInvoke" } } ] }

An Amazon SNS topic publishes a notification whenever a new scene is available.

This is what a notification looks like. It’s a JavaScript Object Notation (JSON) object.
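The useful part of the notification is nested: the SNS "Message" field is itself a JSON string containing the S3 event. A minimal Python sketch of unwrapping it follows. The function name is hypothetical, and the sample event is a trimmed-down version of the notification above rather than a full payload.

```python
import json


def extract_s3_objects(event):
    """Return (bucket, key) pairs from an SNS-wrapped S3 notification."""
    results = []
    for record in event["Records"]:
        # The SNS "Message" field is a JSON string holding the S3 event.
        s3_event = json.loads(record["Sns"]["Message"])
        for s3_record in s3_event["Records"]:
            bucket = s3_record["s3"]["bucket"]["name"]
            key = s3_record["s3"]["object"]["key"]
            results.append((bucket, key))
    return results


# Trimmed-down stand-in for the notification shown above.
inner_message = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "landsat-pds"},
                "object": {"key": "L8/169/060/LC81690602016015LGN00/index.html"},
            }
        }
    ]
}
sample_event = {
    "Records": [{"EventSource": "aws:sns", "Sns": {"Message": json.dumps(inner_message)}}]
}

print(extract_s3_objects(sample_event))
```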


Page 19: Data Processing without Servers | AWS Public Sector Summit 2016

Programmatic access to data

L8/169/060/LC81690602016015LGN00/index.html
→ LC81690602016015LGN00_B1.TIF
→ LC81690602016015LGN00_B2.TIF
→ LC81690602016015LGN00_B3.TIF
…
→ LC81690602016015LGN00_MTL.txt

The notification has given us everything we need to find the data for our task. AWS Lambda can do all of this automatically.
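The mapping from the index.html key to the band and metadata files can be sketched as a small helper. This assumes every file in a scene shares the scene's key prefix and follows the naming pattern shown above. The function name is hypothetical, and the default of bands 1 through 11 reflects Landsat 8's eleven spectral bands.

```python
import posixpath


def scene_file_keys(index_key, bands=range(1, 12)):
    """Derive band and metadata object keys from a scene's index.html key."""
    prefix = posixpath.dirname(index_key)   # L8/169/060/LC81690602016015LGN00
    scene_id = posixpath.basename(prefix)   # LC81690602016015LGN00
    keys = [f"{prefix}/{scene_id}_B{band}.TIF" for band in bands]
    keys.append(f"{prefix}/{scene_id}_MTL.txt")
    return keys


for key in scene_file_keys("L8/169/060/LC81690602016015LGN00/index.html", bands=[1, 2, 3]):
    print(key)
```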

Page 20: Data Processing without Servers | AWS Public Sector Summit 2016

Serverless architecture

AWS Lambda

Landsat 8 bucket

Amazon SNS

Target bucket

Amazon DynamoDB

Page 21: Data Processing without Servers | AWS Public Sector Summit 2016

The SNS message object is available to the Lambda function on execution.

Page 22: Data Processing without Servers | AWS Public Sector Summit 2016

From this object, we obtain the base Landsat scene information (Path, Row, Scene ID), as well as the MTL text file containing the detailed metadata for the scene.
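As a rough illustration, the path, row, and scene ID can be read straight from the object key in the notification. The sketch below assumes keys always follow the L8/path/row/scene ID/file layout seen in the example. The SceneInfo type and function name are hypothetical.

```python
from collections import namedtuple

SceneInfo = namedtuple("SceneInfo", "path row scene_id mtl_key")


def parse_scene_key(index_key):
    """Split an index.html key into path, row, scene ID, and the MTL file key."""
    parts = index_key.split("/")
    if len(parts) != 5 or parts[0] != "L8":
        raise ValueError(f"unexpected key layout: {index_key!r}")
    _, path, row, scene_id, _ = parts
    mtl_key = f"L8/{path}/{row}/{scene_id}/{scene_id}_MTL.txt"
    return SceneInfo(path, row, scene_id, mtl_key)


print(parse_scene_key("L8/169/060/LC81690602016015LGN00/index.html"))
```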

Page 23: Data Processing without Servers | AWS Public Sector Summit 2016

Native JSON

Next, the Lambda function retrieves the text file containing the scene metadata.

The metadata is parsed and converted to JSON.
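The talk does not show its parser, so the following is only a sketch of one way to do the conversion. It assumes the MTL file uses nested GROUP / END_GROUP blocks of KEY = VALUE lines terminated by END. The sample text is illustrative and its values are not taken from a real scene.

```python
import json


def _convert(value):
    """Turn an MTL value into a str, int, or float."""
    if value.startswith('"') and value.endswith('"'):
        return value[1:-1]
    for cast in (int, float):
        try:
            return cast(value)
        except ValueError:
            pass
    return value  # dates, times, and other bare tokens stay as strings


def parse_mtl(text):
    """Parse MTL text into nested dictionaries, one per GROUP."""
    root = {}
    stack = [root]
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line == "END":
            continue
        name, _, value = line.partition("=")
        name, value = name.strip(), value.strip()
        if name == "GROUP":
            group = {}
            stack[-1][value] = group
            stack.append(group)
        elif name == "END_GROUP":
            stack.pop()
        else:
            stack[-1][name] = _convert(value)
    return root


# Illustrative sample only; values are made up.
SAMPLE_MTL = """\
GROUP = L1_METADATA_FILE
  GROUP = METADATA_FILE_INFO
    LANDSAT_SCENE_ID = "LC81690602016015LGN00"
  END_GROUP = METADATA_FILE_INFO
  GROUP = PRODUCT_METADATA
    WRS_PATH = 169
    WRS_ROW = 60
  END_GROUP = PRODUCT_METADATA
  GROUP = IMAGE_ATTRIBUTES
    CLOUD_COVER = 12.5
  END_GROUP = IMAGE_ATTRIBUTES
END_GROUP = L1_METADATA_FILE
END
"""

print(json.dumps(parse_mtl(SAMPLE_MTL), indent=2))
```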

Page 24: Data Processing without Servers | AWS Public Sector Summit 2016

Native JSON

Having the metadata available in JSON will allow for much easier storage of the metadata in DynamoDB.
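One practical detail when storing parsed metadata from Python: the boto3 DynamoDB resource API rejects Python floats and expects Decimal values. The sketch below prepares a nested metadata dictionary for storage. The "scene_id" key attribute and the table name in the comment are hypothetical, since the talk does not describe its table schema.

```python
from decimal import Decimal


def to_dynamodb_item(scene_id, metadata):
    """Copy nested metadata, converting floats to Decimal for DynamoDB."""

    def convert(value):
        if isinstance(value, float):
            # Go through str() so 12.5 becomes Decimal("12.5"), not a binary expansion.
            return Decimal(str(value))
        if isinstance(value, dict):
            return {key: convert(inner) for key, inner in value.items()}
        if isinstance(value, list):
            return [convert(inner) for inner in value]
        return value

    item = convert(metadata)
    item["scene_id"] = scene_id  # hypothetical partition key name
    return item


item = to_dynamodb_item(
    "LC81690602016015LGN00",
    {"IMAGE_ATTRIBUTES": {"CLOUD_COVER": 12.5}, "PRODUCT_METADATA": {"WRS_PATH": 169}},
)
print(item)

# With boto3 installed and a table already created (table name is hypothetical):
#   import boto3
#   boto3.resource("dynamodb").Table("landsat-scenes").put_item(Item=item)
```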

Page 25: Data Processing without Servers | AWS Public Sector Summit 2016

After storing the scene metadata, the function then invokes an additional fleet of Lambda functions.

Page 26: Data Processing without Servers | AWS Public Sector Summit 2016

Each function is tasked with downloading the .TIF for one of the three bands needed to generate a true color image, converting it to a .JPG, and uploading it back to S3 to make it available to the parent Lambda function.
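The fan-out step might look roughly like the sketch below, which invokes one worker per band through the Lambda invoke API. The worker function name, payload fields, and target key layout are all hypothetical. Bands 4, 3, and 2 are Landsat 8's red, green, and blue bands. Asynchronous ("Event") invocation is an assumption, since the talk does not say how the parent waits for its workers.

```python
import json

# Landsat 8 band numbers for the visible channels.
TRUE_COLOR_BANDS = {"red": 4, "green": 3, "blue": 2}


def invoke_band_workers(lambda_client, scene_prefix, scene_id, target_bucket,
                        worker_name="landsat-band-worker", invocation_type="Event"):
    """Invoke one worker Lambda per true color band and return the payloads sent."""
    payloads = []
    for channel, band in TRUE_COLOR_BANDS.items():
        payload = {
            "source_bucket": "landsat-pds",
            "source_key": f"{scene_prefix}/{scene_id}_B{band}.TIF",
            "target_bucket": target_bucket,
            "target_key": f"{scene_id}/{channel}.jpg",  # hypothetical layout
        }
        lambda_client.invoke(
            FunctionName=worker_name,          # hypothetical function name
            InvocationType=invocation_type,    # "Event" returns without waiting
            Payload=json.dumps(payload).encode("utf-8"),
        )
        payloads.append(payload)
    return payloads

# With boto3 installed, lambda_client would be boto3.client("lambda").
```

With asynchronous invocation, the parent would still need some way to learn that all three .JPG files exist, for example by polling the target bucket.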

Page 27: Data Processing without Servers | AWS Public Sector Summit 2016

Lambda functions natively include the open source image processing library ImageMagick.

Page 28: Data Processing without Servers | AWS Public Sector Summit 2016

We call this library to retrieve the three compressed .JPG bands, assemble them into a single .JPG, and then make color/contrast adjustments.
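The talk does not list the exact ImageMagick options it uses. As a hedged sketch, the three single-band images can be merged with ImageMagick's -combine operator, which treats its inputs as the red, green, and blue channels in that order. The -normalize contrast stretch here is only a stand-in for the color/contrast adjustments mentioned above, and the function names are hypothetical.

```python
import shutil
import subprocess


def build_combine_command(red_jpg, green_jpg, blue_jpg, output_jpg):
    """Build an ImageMagick command that merges three grayscale bands into RGB."""
    return [
        "convert",
        red_jpg, green_jpg, blue_jpg,
        "-combine",     # inputs become the R, G, and B channels, in order
        "-normalize",   # simple contrast stretch; stand-in for the talk's adjustments
        output_jpg,
    ]


def combine_bands(red_jpg, green_jpg, blue_jpg, output_jpg):
    """Run the command; requires ImageMagick's convert on the PATH."""
    command = build_combine_command(red_jpg, green_jpg, blue_jpg, output_jpg)
    if shutil.which(command[0]) is None:
        raise RuntimeError("ImageMagick 'convert' was not found on PATH")
    subprocess.run(command, check=True)
    return output_jpg


print(" ".join(build_combine_command("red.jpg", "green.jpg", "blue.jpg", "truecolor.jpg")))
```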

Page 29: Data Processing without Servers | AWS Public Sector Summit 2016

The parent Lambda function uploads the converted bands and the processed true color image to S3.

Page 30: Data Processing without Servers | AWS Public Sector Summit 2016

We can then make these finished .JPGs publicly available, or available only to a specific application, depending on the use case.

Page 31: Data Processing without Servers | AWS Public Sector Summit 2016

Thank you!

Jed Sundwall, Global Open Data Lead

Opsitos, Solutions Architect

Page 32: Data Processing without Servers | AWS Public Sector Summit 2016


Matthew Hanson, Development Seed, @geoskeptic

June 21, 2016

OSM-STATS: Gamification for Humanitarian Mapping

Page 33: Data Processing without Servers | AWS Public Sector Summit 2016

OpenStreetMap

Open map data: roads, rivers, buildings (e.g., hospitals)

Crowd-sourced mapping platform: users create vectors from satellite imagery; OSM tasking manager identifies critical areas

Page 34: Data Processing without Servers | AWS Public Sector Summit 2016

Missing Maps

An initiative to map out areas most in need: humanitarian response; third-world regions with poor coverage

Organize mapathons: events where groups of volunteers focus on a region

Website of statistics from mapathons: keeps track of contributions by hashtags users include in commits

Page 35: Data Processing without Servers | AWS Public Sector Summit 2016
Page 36: Data Processing without Servers | AWS Public Sector Summit 2016

OSM-Stats

Website of statistics by users and hashtags: track different groups, different mapathons; offer a reward mechanism to encourage contributions

Users earn badges for different statistics, e.g., km of roads, # of buildings

Leaderboards for users and hashtags: produce stats in real time for added fun at mapathons

Page 37: Data Processing without Servers | AWS Public Sector Summit 2016

missingmaps.org

Page 38: Data Processing without Servers | AWS Public Sector Summit 2016

OSM infrastructure

Commits (changesets) by users published every minute: include metadata, but not geometries (http://planet.osm.org/replication/changesets/)

Geometries made available by minute via ‘overpass’ (http://overpass-api.de/)

Page 39: Data Processing without Servers | AWS Public Sector Summit 2016

OSM-Stats Architecture

Page 40: Data Processing without Servers | AWS Public Sector Summit 2016

planet-stream

Node app

Streams metadata and geometries from sources

Combines them using Redis

Pushes augmented changesets to Amazon Kinesis stream

Docker container running on Amazon EC2
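planet-stream itself is a Node app backed by Redis; the sketch below only illustrates the join-then-push idea in Python, with plain dictionaries standing in for Redis. The record layout (metadata with an "id", a list of "elements") is hypothetical. put_record is the Kinesis API call for writing a single record.

```python
import json


def augment_changesets(metadata_by_id, geometries_by_id):
    """Join changeset metadata with the geometries that share its changeset ID."""
    augmented = []
    for changeset_id, metadata in metadata_by_id.items():
        elements = geometries_by_id.get(changeset_id)
        if elements is None:
            continue  # geometries have not arrived yet
        augmented.append({"metadata": metadata, "elements": elements})
    return augmented


def push_to_stream(kinesis_client, stream_name, changesets):
    """Write each augmented changeset to the Kinesis stream as one record."""
    for changeset in changesets:
        kinesis_client.put_record(
            StreamName=stream_name,
            Data=json.dumps(changeset).encode("utf-8"),
            PartitionKey=str(changeset["metadata"]["id"]),
        )

# With boto3 installed, kinesis_client would be boto3.client("kinesis").
```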

Page 41: Data Processing without Servers | AWS Public Sector Summit 2016

osm-stats-workers

AWS Lambda with Node v4.3.2

Event mapping to Amazon Kinesis stream

Calculates metrics from each changeset: geometry calculations from vector data; determination of countries edited; ancillary data (user, editor used)

Add to Amazon RDS database
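The real workers run on Node; the following Python sketch only illustrates the shape of such a function. Lambda delivers Kinesis records with base64-encoded data under record["kinesis"]["data"]. The changeset layout (elements with "tags" and "coordinates" as longitude/latitude pairs) is hypothetical, and road length is approximated with the haversine formula on a spherical Earth.

```python
import base64
import json
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(a, b):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    lon1, lat1 = map(math.radians, a)
    lon2, lat2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def line_length_km(coordinates):
    """Sum the segment lengths of a polyline."""
    return sum(haversine_km(p, q) for p, q in zip(coordinates, coordinates[1:]))


def changeset_metrics(changeset):
    """Compute km of roads and number of buildings for one augmented changeset."""
    road_km = 0.0
    buildings = 0
    for element in changeset["elements"]:
        tags = element.get("tags", {})
        if "highway" in tags:       # OSM tag used for roads and paths
            road_km += line_length_km(element["coordinates"])
        if "building" in tags:
            buildings += 1
    return {"user": changeset["metadata"]["user"], "road_km": road_km, "buildings": buildings}


def handler(event, context):
    """Lambda entry point: decode each Kinesis record and compute its metrics."""
    results = []
    for record in event["Records"]:
        payload = base64.b64decode(record["kinesis"]["data"])
        results.append(changeset_metrics(json.loads(payload)))
    return results
```

A real worker would then write these metrics to the database rather than return them.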

Page 42: Data Processing without Servers | AWS Public Sector Summit 2016

Deployment

Use Python script and boto3

Deploy database: create Amazon RDS and osm-stats database, with inbound rules; migrate and populate

Create Amazon Kinesis stream

Create AWS Lambda: create with appropriate permissions (Amazon Kinesis, Amazon RDS security group pair); create event mapping

Deploy Amazon EC2: create instance, create security groups; use fabric to upload .env file (with URLs and names of above services) and Dockerfiles; docker-compose up -d starts pushing to stream as soon as augmented changesets are created
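The stream and event-mapping steps can be sketched with boto3 client calls, as below. This is not the project's actual deployment script. The stream name, function name, shard count, batch size, and starting position are all assumptions, and the clients are passed in so the function can be exercised without AWS access.

```python
def deploy_stream_and_mapping(kinesis, lambda_client, stream_name, function_name,
                              shard_count=1, batch_size=100):
    """Create a Kinesis stream and map it to an existing Lambda function."""
    kinesis.create_stream(StreamName=stream_name, ShardCount=shard_count)
    # Stream creation is asynchronous; wait until the stream is usable.
    kinesis.get_waiter("stream_exists").wait(StreamName=stream_name)
    description = kinesis.describe_stream(StreamName=stream_name)
    stream_arn = description["StreamDescription"]["StreamARN"]

    mapping = lambda_client.create_event_source_mapping(
        EventSourceArn=stream_arn,
        FunctionName=function_name,
        StartingPosition="TRIM_HORIZON",  # start from the oldest available record
        BatchSize=batch_size,
    )
    return stream_arn, mapping["UUID"]

# With boto3 installed and credentials configured (names are hypothetical):
#   import boto3
#   deploy_stream_and_mapping(boto3.client("kinesis"), boto3.client("lambda"),
#                             "osm-stats-stream", "osm-stats-workers")
```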

Page 43: Data Processing without Servers | AWS Public Sector Summit 2016

Why Lambda and Amazon Kinesis?

Microservices architecture: smaller replaceable components; easier to scale pieces

Lambda provides low-cost solution at scale: activity can vary from a few to 100 changesets/min

Amazon Kinesis stream allows flexible input for historical processing

Page 44: Data Processing without Servers | AWS Public Sector Summit 2016

Lambda Invocations and Durations

Plots using librato
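Invocation counts and durations can also be recovered from the REPORT line that Lambda writes to CloudWatch Logs at the end of each invocation. A minimal sketch follows, assuming the log lines have already been fetched by some other means. The function name and the sample lines are made up for illustration.

```python
import re

REPORT_RE = re.compile(
    r"REPORT RequestId: (?P<request_id>\S+)\s+"
    r"Duration: (?P<duration_ms>[\d.]+) ms\s+"
    r"Billed Duration: (?P<billed_ms>\d+) ms"
)


def summarize_reports(log_lines):
    """Count invocations and summarize durations from Lambda REPORT lines."""
    durations = []
    for line in log_lines:
        match = REPORT_RE.search(line)
        if match:
            durations.append(float(match.group("duration_ms")))
    if not durations:
        return {"invocations": 0, "mean_ms": None, "max_ms": None}
    return {
        "invocations": len(durations),
        "mean_ms": sum(durations) / len(durations),
        "max_ms": max(durations),
    }


# Made-up sample lines in the REPORT format.
sample_lines = [
    "START RequestId: aaa-111 Version: $LATEST",
    "REPORT RequestId: aaa-111\tDuration: 100.00 ms\tBilled Duration: 100 ms\tMemory Size: 128 MB\tMax Memory Used: 30 MB",
    "REPORT RequestId: bbb-222\tDuration: 300.00 ms\tBilled Duration: 300 ms\tMemory Size: 128 MB\tMax Memory Used: 31 MB",
]
print(summarize_reports(sample_lines))
```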

Page 45: Data Processing without Servers | AWS Public Sector Summit 2016

Lambda lessons

Local testing framework would have been useful

Lambda logs take some work: aws-cli, combined with Python or Bash scripts, can be useful to parse logs; awslogs, Amazon CloudWatch logs for Humans (https://github.com/jorgebastida/awslogs)

Error handling: Lambda function design should handle all errors, so don’t let it return a failure; include a top-level catch to catch any errors, log, and return success

Database connections using Knex: database pools and Lambda container reuse (pool min=0!)
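The error-handling lesson matters most with a Kinesis event source, where a failed invocation causes the same batch to be retried. The project's workers are written for Node; the sketch below shows the same top-level catch pattern in Python. The decorator name and the returned dictionaries are hypothetical.

```python
import functools
import logging

logger = logging.getLogger(__name__)


def never_fail(handler):
    """Wrap a Lambda handler so any exception is logged and success is returned."""

    @functools.wraps(handler)
    def wrapper(event, context):
        try:
            return handler(event, context)
        except Exception:
            # Log the full traceback, then report success so the batch is not retried.
            logger.exception("Unhandled error while processing event")
            return {"status": "error-logged"}

    return wrapper


@never_fail
def handler(event, context):
    return {"status": "ok", "records": len(event["Records"])}
```

The trade-off is that a failing batch is dropped rather than retried, so the log entry is the only record of the problem.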

Page 46: Data Processing without Servers | AWS Public Sector Summit 2016

Lambda Security and VPCs

Initially configured closed RDS with Lambda access: paired security groups for RDS and Lambda

As part of VPC, Lambda is in a bubble: osm-stats-workers makes requests elsewhere (OSM API for tasking manager data)

Ended up opening up RDS to the world: security groups also seem to cause intermittent pool errors