(CMP403) AWS Lambda: Simplifying Big Data Workloads
© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Martin Holste, FireEye
October 2015
CMP403
AWS Lambda: Simplifying Big Data Workloads
What to Expect from the Session
This is a deep-dive on general computing uses for
AWS Lambda.
• You will understand what makes Lambda a big deal for
big data.
• You will not learn about asynchronously triggered
workloads (see related sessions for that).
• You will see interactive, data-driven user experiences
that work with minimal ops overhead and at any scale.
Problem: Big data, little time
At FireEye, one of the ways we protect customers is by
analyzing mountains of event data to find “evil.”
Some of it we have online in indexes, some of it we have in
cold storage on Amazon S3.
We needed to be able to take advantage of the rich history
in our archived data without hurting our user experience.
Our app creates questions and finds answers
[Architecture diagram: Lambda-driven search and analytics sits between Amazon EMR analytic output, EC2-based proprietary detection, and EC2-based indexed search. Amazon EMR triggers investigations (questions); AWS Lambda provides context (answers).]
What analysis are we doing?
Amazon EMR
Scheduled jobs that process all data for anomaly detection:
• K-means
• Linear regression
• Geographic time-lining
AWS Lambda
Free-form searching to drive ad hoc:
• Reports
• Visualizations
• Analytical statistics (clustering, correlation, linear regression, etc.)
Visualize search results analytically
User-defined analytics based on ad hoc features of the search result set draw attention to otherwise uninteresting facets of the data.
How big is our Big?
For an average customer:
The average security event is about 3 KB, arriving at 20k events/sec ~= 60 MB/sec, which is about 5 TB/day.
One week = 35 TB, 12 billion events.
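The slide's figures can be sanity-checked with a quick back-of-envelope calculation (event size and rate taken from above; the variable names are illustrative):

```javascript
// Back-of-envelope check of the data volumes quoted above.
const bytesPerEvent = 3 * 1024;   // ~3 KB per security event
const eventsPerSec = 20000;       // 20k events/sec

const mbPerSec = bytesPerEvent * eventsPerSec / (1024 * 1024); // ~58.6 MB/sec
const tbPerDay = mbPerSec * 86400 / (1024 * 1024);             // ~4.8 TB/day
const tbPerWeek = tbPerDay * 7;                                // ~34 TB
const eventsPerWeek = eventsPerSec * 86400 * 7;                // ~12.1 billion

console.log(mbPerSec.toFixed(1), tbPerDay.toFixed(1), tbPerWeek.toFixed(1), eventsPerWeek);
```

The results land within rounding distance of the slide's 60 MB/sec, 5 TB/day, 35 TB, and 12 billion events.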
How long does this take?
A single process downloads, decompresses, greps, and
processes at about 35k events/sec (105 MB/sec).
To process a week of data:
Processes   Time
1           ~4 days
10          ~6 hours
100         ~1 hour
1,000       ~5 minutes
10,000      seconds
[Chart: total processing time in seconds vs. number of processes, 1 to 1000]
Lambda FTW
What if you could spin up 10k
processes in 100 ms?
Standard map-reduce pattern
without the startup time or hassle
of map-reduce frameworks.
Write your simple worker code,
and let cascading Lambda
functions handle the heavy lifting.
Lambda cascade
AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”
Code components
Basic web app: handles the UI request, invokes cascade functions, streams results.
Cascade function: invokes workers, aggregates and returns results. Can be made recursive.
Worker function: performs atomic work, returns results to the invoker.
Basic web app
// List matching S3 keys, fan out to Lambda workers,
// and stream results back to the browser as they arrive.
var listStream = new S3KeyListStream(searchParams);
var lambdaStream = new LambdaStream(maxWorkers);
listStream
  .pipe(lambdaStream, { end: false })
  .pipe(serverSentStream)
  .pipe(httpResponse);
Basic web app key points
• Batched async execution within an async pipeline is very
unintuitive.
• The trick is to pass end: false and manually call end() in the
pipeline code when all work is done.
• Pipeline will naturally queue up batches to stay under
configured Lambda provisioning limits.
Lambda cascade function
// Chop our given list of keys up into batches
var batches = [];
var batch = [];
for (var i = 0, len = allKeys.length; i < len; i++) {
  batch.push(allKeys[i]);
  if (batch.length >= batchSize) {
    batches.push(batch.slice());
    batch = [];
  }
}
// Don't drop a trailing partial batch
if (batch.length) {
  batches.push(batch);
}
Lambda cascade function (continued)
// Invoke each batch in parallel, returning the aggregated result when all are finished.
async.map(batches, invoke, function (err, results) {
  if (err) {
    context.fail('async.map error: ' + err.toString());
    return;
  }
  context.succeed(results);
});
Lambda cascade function key points
• The nature of the data and workload dictates the correct batch
size to give a cascade function. You need to avoid running out
of memory while aggregating results.
• A 100:1 ratio seems to work well: a good balance between low
cascade overhead and manageable intermediate result size.
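The 100:1 fan-out can be turned into a small sizing helper. This is an illustrative sketch (cascadePlan is a hypothetical name, not from the talk):

```javascript
// How many worker and cascade invocations a search needs, assuming each
// worker handles `keysPerWorker` S3 keys and each cascade function fans
// out to `batchSize` workers (the 100:1 ratio above).
function cascadePlan(totalKeys, keysPerWorker, batchSize) {
  const workers = Math.ceil(totalKeys / keysPerWorker);
  const cascades = Math.ceil(workers / batchSize);
  return { workers, cascades };
}

// e.g. 100,000 S3 keys, 10 keys per worker, 100 workers per cascade:
console.log(cascadePlan(100000, 10, 100)); // { workers: 10000, cascades: 100 }
```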
Worker function
// Split the decompressed stream into one event per line
var lineSplitter = eventstream.split();
lineSplitter.on('data', process).on('end', cb);
// Create our pipeline: S3 object -> gunzip -> line splitter
s3.getObject({
  Bucket: srcBucket,
  Key: srcKey
})
.createReadStream()
.pipe(zlib.createGunzip())
.pipe(lineSplitter);
Worker function key points
• Use the full 1.5 GB of memory.
• Download Amazon S3 keys concurrently.
• 5 seems to be the magic number for files in the 2-3 MB
range.
• Use a faster decompression algorithm like LZ4 high-compression,
which is up to 32x faster than zlib.
• Make sure warnings and failures percolate up with
results.
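The "download concurrently, 5 at a time" advice can be sketched with a simple promise pool. This is not the talk's code; downloadAll and fetchKey are hypothetical names, with fetchKey standing in for the getObject/gunzip/split pipeline shown above:

```javascript
// Download up to `limit` S3 objects at once using a shared work queue:
// `limit` runners each pull the next key until the list is exhausted.
async function downloadAll(keys, fetchKey, limit = 5) {
  const results = [];
  let next = 0;
  async function runner() {
    while (next < keys.length) {
      const i = next++;           // claim a key before awaiting
      results[i] = await fetchKey(keys[i]);
    }
  }
  await Promise.all(Array.from({ length: Math.min(limit, keys.length) }, runner));
  return results;                 // results stay in input order
}
```

The single-threaded event loop makes the `next++` claim safe without locks, and results land at their original index regardless of completion order.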
Non–Amazon S3–sourced workloads
Lambda can source from anything:
Amazon DynamoDB
Amazon RDS
Amazon Kinesis
Amazon EC2 endpoints
The Internet
Example Twitter App
How do my followers feel about _____
1. Enter a keyword in the UI.
2. A Lambda worker executes for each follower.
3. Sentiment is reviewed (positive/negative/neutral).
4. Results are aggregated.
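The aggregation step (4) might look like the following sketch. This is illustrative only; aggregateSentiment is a hypothetical name, and each worker is assumed to return a single sentiment label per follower:

```javascript
// Reduce per-follower worker results into overall sentiment counts.
function aggregateSentiment(workerResults) {
  const counts = { positive: 0, negative: 0, neutral: 0 };
  for (const sentiment of workerResults) {
    if (sentiment in counts) counts[sentiment]++;
  }
  return counts;
}

aggregateSentiment(['positive', 'neutral', 'positive', 'negative']);
// → { positive: 2, negative: 1, neutral: 1 }
```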
Streaming Results
Progressive results
Thirty seconds is an eternity in UX time.
Go beyond a progress bar: return streaming, progressive results.
Show something meaningful in 3-5 seconds and the final result in 30.
Graphically represent the updating data.
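One way to deliver progressive results (the web app above pipes through a serverSentStream) is the server-sent events wire format. A minimal sketch, with sseMessage as a hypothetical helper name:

```javascript
// Format one partial result set as a server-sent event so the UI can
// redraw as each cascade batch completes.
function sseMessage(eventName, data) {
  return `event: ${eventName}\ndata: ${JSON.stringify(data)}\n\n`;
}

// As each batch returns, push a partial update, then a final event:
// response.write(sseMessage('partial', { done: 10, total: 100 }));
// response.write(sseMessage('complete', finalResult));
```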
Mechanical sympathy
Visualizing the result stream as it matures communicates
the magnitude of the work being performed and shows
value.
Lambda Use Cases
Lambda is the future (and past)
It demonstrates the essence of AWS: capability through
simplicity.
These things are no longer needed:
• Servers
• Operating systems
• Networking
Dev effort focuses only on core competencies, not
infrastructure.
Dev advantages
• If the code works once, it works
at any scale.
• Unit and integration testing are
easy (no cluster setup required).
• Any failures are due to faulty
code or bad input, which are
caught by good unit tests.
Beyond containers
• No patching, all upgrades are core
competency updates
• No instance monitoring, only app
monitoring
• Goes beyond containers: devs get an
ultra-consistent environment
Remember mainframes?
1970s: Mainframes offer an attractive operating model but
unattractive graphical capabilities.
1990s: PCs take over by bringing the compute to the people
for a rich, graphical experience.
2010s: Ubiquitous mobile broadband centralizes the compute
again by allowing the best of both worlds.
Related Sessions
ARC308 - The Serverless Company Using AWS Lambda:
Streamlining Architecture with AWS
CMP301 - AWS Lambda: Event-Driven Code in the Cloud
Remember to complete
your evaluations!
Thank you!