Not Your Father’s Web App: The Cloud-Native Architecture of images.nasa.gov

Chris Shenton, CTO at V! Studios, NASA WESTPrime, [email protected]


Presentation Overview

● Evolution of webapps: simple to cloud

○ Problems with typical webapps: fault-intolerant, unscalable

○ Plan for failure, then plan to scale

○ Scalability of SQL vs NoSQL databases like DynamoDB

○ Cloud-native application design patterns

● images.nasa.gov architecture

○ Front-end decoupled from API

○ Dataflow of asset from upload to publishing

○ Fault-tolerant cloud network architecture

● DevOps

○ Infrastructure as Code

○ CI/CD

NIEP’s Problem: Users Can’t Find Content

● In surveys, the public says “great images” when they think of NASA

● 60 different collections across Agency

● Uneven content quality and user interface

● No API for reuse and integration of content across apps

● Must be mobile friendly

● Shutterstock.com functionality -- too ambitious?

● Video and audio too

● We believed this functionality was possible for NASA

○ Cloud services for compute, storage, search

○ Modern, responsive web front-end

○ API for front-end and reuse by other applications

Your Father’s WebApps: they’ve got problems

● Resilience to failure

● Scalability

Your Father’s WebApps: they’ve got problems [1]

[Diagram: one server hosting both the app and the DB]

#1: Single Point of Failure (SPoF): if the server dies, everything’s toast.

Your Father’s WebApps: they’ve got problems [2]

[Diagram: the app and the DB split onto separate servers]

#2: Better performance (maybe), but now two SPoFs.

Your Father’s WebApps: they’ve got problems [3]

[Diagram: a load balancer in front of redundant app servers and DB servers]

#3: Good, we’ve eliminated the SPoFs, but database synchronization and failover are difficult. It’s still not scalable.

Cloud Architecture: Plan for Outage, then Plan to Scale [1]

● Use Elastic Load Balancers

○ redundant

○ fault-tolerant

○ globally distributed

● Use Auto-Scaling Servers (EC2 instances)

○ scale out under load

○ scale in when quiescent to save money

○ pay only for what you eat

● Use Managed Relational Database Service (RDS)

○ automatically performs synchronization

○ automatically performs failover

○ PostgreSQL, MySQL, MariaDB, MS SQL, Oracle, AWS Aurora

Cloud Architecture: Plan for Outage, then Plan to Scale [2]

[Diagram: Elastic Load Balancer in front of one EC2 app instance, backed by RDS with two DB instances]

#1: Minimal cost with single EC2 instance. Fault-tolerant, automatically syncing database with fail-over.

Cloud Architecture: Plan for Outage, then Plan to Scale [3]

[Diagram: Elastic Load Balancer in front of several EC2 app instances, backed by RDS with two DB instances]

#2: Auto-scale based on load triggers to handle increased load. Scales down when load subsides to contain cost. You may have load balancers in your datacenter, but you probably can’t add hundreds of servers effortlessly.

Cost for 8 hours on 1 server is same as for 1 hour on 8 servers; for compute-intensive tasks, this gets you home sooner. And you still pay only for what you eat.
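
The cost claim above is plain arithmetic. A minimal sketch, assuming a hypothetical flat hourly rate (`HOURLY_RATE` is illustrative, not an AWS price) and a task that splits evenly across servers:

```python
HOURLY_RATE = 0.10  # hypothetical price per instance-hour, not a real AWS rate


def run_cost(servers, hours_each, rate=HOURLY_RATE):
    """Cost of running `servers` instances for `hours_each` hours apiece."""
    return servers * hours_each * rate


def wall_clock_hours(total_work_hours, servers):
    """Elapsed time, assuming the work parallelizes perfectly."""
    return total_work_hours / servers


# 8 server-hours of work, done on 1 server or spread over 8 servers.
one_server = run_cost(servers=1, hours_each=wall_clock_hours(8, 1))
eight_servers = run_cost(servers=8, hours_each=wall_clock_hours(8, 8))
```

Both runs consume the same 8 instance-hours, so the bill matches while the elapsed time drops from 8 hours to 1. Real workloads rarely parallelize perfectly, so treat this as an upper bound on the speedup.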

SQL vs NoSQL Cloud-Scale Databases, e.g., DynamoDB

● SQL databases require forklift upgrade

○ when storage capacity reached

○ when I/O capacity reached

● NoSQL databases designed for “web-scale”

○ schemaless

○ expect faults, work around them

○ replicate, add partitions/shards as needed

○ do not support SQL features like JOIN

○ require good app design to leverage effectively

● AWS DynamoDB is a Cloud Scale NoSQL DB

○ < 10 millisecond latency at any scale

○ unlimited storage

○ partitions grow as data grows

○ hash key and optional sort key

○ other attributes are schemaless, store anything

○ throughput is provisioned, adjustable by a knob or API call

DynamoDB and Partitions

Hash Key (id)   Sort Key (year)   Other Data (schemaless)
chris           1994              job=GSFC, state=MD
chris           2001              job=Koansys, title=Founder
chris           2012              job=VStudios, title=CTO, beer=SierraNevada
charles         2013              job=VStudios, title=FullStackEng
moe             1999              job=VStudios, title=Founder
victor          2012              job=VStudios, title=StreamEng
tim             2013              job=VStudios, title=COO, state=VA
earl            2012              job=VStudios, title=CloudEng
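
The hash key / sort key model in the table above can be imitated in a few lines of plain Python. This is a toy sketch, not the DynamoDB API: `ToyTable`, `partition_for` and the fixed `NUM_PARTITIONS` are hypothetical, and the real service chooses and grows partitions itself.

```python
import hashlib
from collections import defaultdict

NUM_PARTITIONS = 4  # hypothetical; DynamoDB manages the partition count itself


def partition_for(hash_key, partitions=NUM_PARTITIONS):
    """Map a hash key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.md5(hash_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % partitions


class ToyTable:
    def __init__(self):
        # partition number -> {(hash_key, sort_key): schemaless attributes}
        self.partitions = defaultdict(dict)

    def put(self, hash_key, sort_key, **attrs):
        self.partitions[partition_for(hash_key)][(hash_key, sort_key)] = attrs

    def query(self, hash_key, min_sort=None):
        """Return one hash key's items ordered by sort key, touching one partition."""
        part = self.partitions[partition_for(hash_key)]
        rows = [
            (sk, attrs)
            for (hk, sk), attrs in part.items()
            if hk == hash_key and (min_sort is None or sk >= min_sort)
        ]
        return sorted(rows, key=lambda row: row[0])


table = ToyTable()
table.put("chris", 2012, job="VStudios", title="CTO", beer="SierraNevada")
table.put("chris", 1994, job="GSFC", state="MD")
table.put("chris", 2001, job="Koansys", title="Founder")
table.put("moe", 1999, job="VStudios", title="Founder")
years = [year for year, _ in table.query("chris")]
```

All items sharing a hash key land in the same partition, which is why a query by hash key (optionally narrowed by sort key) stays cheap, while anything resembling a JOIN across keys has to be done by the application.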

Cloud-Native Application Architectures

● Servers are like cattle, not pets

● Servers are ephemeral and stateless

● Scale out processes

● Apps persist to DB or object store, not server

● Use queues and workers to process requests

● See Twelve-Factor App (https://12factor.net)

[Diagram: a new job enters the Job Queue (job1, job2, ..., job99); auto-scaling workers (worker1, worker2, ..., workerN) pull jobs and persist results to the DB and Object Storage]

Decouple work from workers with queues to prevent overload or loss of jobs
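
The queue-and-workers pattern can be sketched with the Python standard library. This is a single-process stand-in, assuming threads in place of auto-scaled instances and a dict in place of the DB or object store; `process`, `worker` and `run` are hypothetical names.

```python
import queue
import threading


def process(job):
    """Stand-in for real work such as resizing an image."""
    return f"{job}:done"


def worker(jobs, results, lock):
    """Pull jobs until a None sentinel arrives; keep no state in the worker."""
    while True:
        job = jobs.get()
        if job is None:
            jobs.task_done()
            break
        outcome = process(job)
        with lock:
            results[job] = outcome  # stand-in for a DB or object-store write
        jobs.task_done()


def run(job_names, num_workers=3):
    jobs = queue.Queue()
    results = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(num_workers)
    ]
    for thread in threads:
        thread.start()
    for name in job_names:
        jobs.put(name)
    for _ in threads:
        jobs.put(None)  # one shutdown sentinel per worker
    jobs.join()
    for thread in threads:
        thread.join()
    return results


results = run([f"job{i}" for i in range(1, 100)])
```

Producers only ever touch the queue, so the number of workers can change without the producers noticing, and a job that has not been picked up yet simply waits rather than being lost.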

images.nasa.gov: Built on Cloud-Native Services

We use AWS-provided services whenever possible instead of building our own. This allows us to leverage tested, supported, backed-up, scalable services so we can concentrate on building our own application.

● EC2, ELB: autoscaling compute for API, Image Resizer, Pipeline processes

● S3: object storage for incoming media, metadata, published assets

● ElasticTranscoder: video/audio transcoding for smaller versions including mobile

● CloudSearch: managed search service allows search by free text or fields

● DynamoDB: NoSQL database tracks incoming jobs, published assets, users

● SQS: message queues decouple incoming jobs from pipeline processes

● SNS: notification service indicates when new content uploaded, triggers pipeline

images.nasa.gov Front-End Architecture [1]

● Old school webapps send HTTP requests to servers and get back HTML

● images.nasa.gov separates front-end webapp from back-end API

● Front-end is written in AngularJS

○ a webapp running in the browser

○ not just a web “page”

● Back-end written in Python (Pyramid), returns JSON data to the front-end

● Front-end then renders that data within the app

● More interactive, the evolution of “AJAX”

images.nasa.gov Front-End Architecture [2]

1. Browser gets FE app from S3

a. AngularJS code, HTML, CSS

b. renders home page with search box

2. Queries API (ASG)

a. gets results as JSON and renders as page

b. gets images from Assets S3 and renders

3. Connects to API to get details

a. gets details as JSON and renders as page

b. gets image from Assets S3 and renders

[Diagram: browser request flow. Components: S3 images.nasa.gov (webapp code); API images-api.nasa.gov (ASG, min=2); S3 images-assets.nasa.gov (media, metadata). Steps: query, results, detail.]

Requests shown in the diagram:

GET /

GET /search?q=cloud

GET /image/…, GET /image/…, ...

GET /metadata/cloud-free-iceland/...

GET /asset/cloud-free-iceland/...

GET /image/cloud-free-iceland.jpg

Responses shown in the diagram:

<html> <body data-ng-app="availFeApp"> ...</html>

{"collection": {"items": [{"href": "https://images-assets.nasa.gov/image/…"}, ...]}}
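
Because the API returns JSON rather than HTML, any client can reuse it, not just the AngularJS front-end. A minimal sketch of consuming a search response, assuming only the `collection` / `items` / `href` shape shown on the slide; the sample `href` values and the `asset_links` helper are hypothetical, and no network call is made.

```python
import json

# Sample shaped like the slide's response; the href values are placeholders.
SAMPLE_RESPONSE = """
{"collection": {"items": [
    {"href": "https://images-assets.nasa.gov/image/example-1"},
    {"href": "https://images-assets.nasa.gov/image/example-2"}
]}}
"""


def asset_links(response_text):
    """Extract the asset URLs a front-end would go on to fetch and render."""
    payload = json.loads(response_text)
    items = payload.get("collection", {}).get("items", [])
    return [item["href"] for item in items if "href" in item]


links = asset_links(SAMPLE_RESPONSE)
```

The front-end does the equivalent in the browser: it queries the API, reads the links out of the JSON, then fetches the media directly from the assets bucket.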

Ingest Media Data Flow [1]: Browser Experience

● User selects media (image, audio, video with caption)

● Browser sends media to the private S3 bucket

● Dashboard shows progress, including when more searchable metadata required

Ingest Media Data Flow [2]: Upload

[Diagram: ingest pipeline components]

● API ASG: API

● AWS SQS Queues: Uploaded, Transcoded, Published, Error (processes write failures to the Error queue for cleanup)

● Pipeline ASG workers: Uploaded, Transcoded, Published, Error (trash index, S3, mark bad in DB)

● Image Resizer ASG: Image Resizer

● AWS ElasticTranscoder: video, audio

● AWS CloudSearch

● S3 buckets: private, images-assets

● JobStateDB (DynamoDB): all processes write state here; it drives the dashboard and answers queries about Incomplete Jobs

● AssetDB (DynamoDB): when the asset is published and indexed, an entry is recorded here

User uploads asset media and optional metadata

● POST to API with optional metadata, captions

● API stores metadata, captions to Private S3

● API returns signed upload URL to browser

● Browser PUTs media directly to Private S3

● S3 sends an SNS notification to the SQS Uploaded queue, triggering the start of the pipeline
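
The signed-URL handoff in the steps above lets the browser upload straight to storage without ever holding credentials. The idea can be illustrated without AWS at all. This is a simplified sketch using an HMAC over the object key and expiry time; it is not the S3 signing algorithm, and `SECRET`, `BUCKET_HOST`, `sign_upload_url` and `verify_upload_url` are hypothetical names. In a real deployment the API would ask the AWS SDK for a presigned URL instead.

```python
import hashlib
import hmac
from urllib.parse import parse_qs, urlencode, urlparse

SECRET = b"server-side-secret"  # hypothetical; never sent to the browser
BUCKET_HOST = "https://private-bucket.example.com"  # hypothetical host


def _signature(key, expires):
    """HMAC over the method, object key and expiry (illustrative scheme)."""
    message = f"PUT\n{key}\n{expires}".encode("utf-8")
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def sign_upload_url(key, now, ttl_seconds=900):
    """API side: grant a time-limited right to PUT one specific object."""
    expires = int(now) + ttl_seconds
    query = urlencode({"expires": expires, "signature": _signature(key, expires)})
    return f"{BUCKET_HOST}/{key}?{query}"


def verify_upload_url(url, now):
    """Storage side: accept only unexpired URLs with a matching signature."""
    parsed = urlparse(url)
    params = parse_qs(parsed.query)
    try:
        expires = int(params["expires"][0])
        signature = params["signature"][0]
    except (KeyError, ValueError):
        return False
    if now > expires:
        return False
    key = parsed.path.lstrip("/")
    return hmac.compare_digest(signature, _signature(key, expires))


url = sign_upload_url("uploads/photo.jpg", now=1000)
```

The large media file never passes through the API servers, which keeps them small and stateless; they only hand out short-lived permission slips.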

Ingest Media Data Flow [3]: Transcode/Resize

[Diagram: ingest pipeline components, as on the Upload slide]

● Uploaded Worker gets event from queue

● Transcode/Resize

○ image invokes ImageResizer ASG: JPG

○ audio/video invokes ElasticTranscoder: MP3, MP4

○ multiple smaller formats for download, mobile, preview, thumbnails

○ artifacts stored in Private S3 with original

● Waits for completion

● Creates event in Transcoded queue

Ingest Media Data Flow [4]: Publish

[Diagram: ingest pipeline components, as on the Upload slide]

● Transcoded Worker gets event from queue

● If we have valid metadata (and captions)

○ move media and artifacts to images-assets S3

○ at this point, it’s publicly accessible but not yet findable by search

○ create event in Published queue

● Else, mark job as Incomplete

Ingest Media Data Flow [5]: Index

[Diagram: ingest pipeline components, as on the Upload slide]

● Published Worker gets event from queue

● Sends metadata to CloudSearch for indexing

○ once indexed, it’s findable by search

● Marks job done in JobDB

● Creates an entry in the AssetDB

Ingest Media Data Flow [6]: Errors

[Diagram: ingest pipeline components, as on the Upload slide]

● Any jobs that cause errors create events in the Error queue

● Error Worker pulls events from queue

○ removes index from CloudSearch

○ removes media, artifacts from images-assets and private S3

○ marks job as errored in JobStateDB
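
The whole ingest flow, from Uploaded through Transcoded and Published to the Error queue, behaves like a small state machine driven by queues. A minimal in-memory sketch, assuming deques in place of SQS, dicts in place of the DynamoDB tables and a set in place of CloudSearch; all function names and the `corrupt` flag are hypothetical.

```python
from collections import deque

QUEUE_NAMES = ("uploaded", "transcoded", "published", "error")
queues = {name: deque() for name in QUEUE_NAMES}
job_state = {}        # stand-in for JobStateDB
asset_db = {}         # stand-in for AssetDB
search_index = set()  # stand-in for CloudSearch


def uploaded_worker(job):
    """Transcode or resize; a 'corrupt' flag simulates a failure."""
    if job.get("corrupt"):
        raise ValueError("transcode failed")
    job["artifacts"] = ["thumbnail", "preview"]
    return "transcoded"


def transcoded_worker(job):
    """Publish only when metadata is present; otherwise the job is Incomplete."""
    return "published" if job.get("metadata") else "incomplete"


def published_worker(job):
    """Index the asset and record it in the asset table."""
    search_index.add(job["id"])
    asset_db[job["id"]] = job["metadata"]
    return "done"


def error_worker(job):
    """Undo any partial publication."""
    search_index.discard(job["id"])
    asset_db.pop(job["id"], None)
    return "errored"


WORKERS = {
    "uploaded": uploaded_worker,
    "transcoded": transcoded_worker,
    "published": published_worker,
    "error": error_worker,
}


def submit(job):
    job_state[job["id"]] = "uploaded"
    queues["uploaded"].append(job)


def drain():
    """Run workers until every queue is empty."""
    while any(queues.values()):
        for name in QUEUE_NAMES:
            while queues[name]:
                job = queues[name].popleft()
                try:
                    outcome = WORKERS[name](job)
                except Exception:
                    outcome = "error"  # any failure is routed to the Error queue
                job_state[job["id"]] = outcome  # every step records its state
                if outcome in queues:
                    queues[outcome].append(job)


submit({"id": "a", "metadata": {"title": "Iceland"}})
submit({"id": "b"})  # no metadata: ends up Incomplete
submit({"id": "c", "metadata": {"title": "x"}, "corrupt": True})
drain()
```

Each worker only knows its own queue and the next one, so stages can be scaled, restarted or replaced independently, and the state table always says where a job is.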

Regions and Availability Zones and Subnets, oh my!

● AWS has 16 “Regions”

● There are 42 “Availability Zones” across them

● Each AZ is a physically separate datacenter

● AZs in a region have high-speed connections to the others in the same region

● Multi-AZ deployment guards against catastrophic AZ outage

● images.nasa.gov is in us-east-1 across 2 AZs

● Virtual Private Clouds provide isolation

● VPCs can be subnetted

[Figure: AWS Regions and the number of AZs in each]

Services Deployed Across AZs, VPC, Subnets

[Diagram: AWS Region us-east-1, with a Public VPC spanning two AZs]

● AZs: us-east-1b, us-east-1c

● Subnets: web1, web2; app1, app2 (no routing from public)

● Load balancers: API ELB, Pipeline ELB, ImageResizer ELB

● Auto Scaling Groups: API ASG, Pipeline ASG, ImageResizer ASG, each shown with multiple instances

● AWS Globally Managed Services:

○ S3: images (FE), images-assets, private

○ DynamoDB: JobDB, AssetDB, UserDB

○ SQS: Uploaded, Transcoded, Published, Error

○ CloudSearch

DevOps

● Infrastructure as Code

● Continuous Integration / Continuous Delivery

Infrastructure as Code

● We do not build or deploy networks, servers or services by hand

● All infrastructure is defined in code

○ resident with our application software

○ WESTPrime Stash code repository

● Troposphere: Python abstraction for AWS CloudFormation

● Generates 3500 lines of JSON CloudFormation

● EC2 machines use hardened AMIs provided by WESTPrime

● Nearly identical Dev, Stage, Prod environments

● Fast updates to existing infrastructure

● Reliable, repeatable, robust
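
Generating CloudFormation JSON from Python can be sketched with plain dicts. This is not Troposphere's API, only an illustration of the idea it wraps; the resource names, the `example-` prefix and `build_template` are hypothetical and not taken from the project's templates.

```python
import json


def build_template(env):
    """Build a small CloudFormation template for one environment."""
    resources = {}
    for logical_id, suffix in (("PrivateBucket", "private"), ("AssetsBucket", "assets")):
        resources[logical_id] = {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": f"example-{env}-{suffix}"},
        }
    resources["UploadedQueue"] = {
        "Type": "AWS::SQS::Queue",
        "Properties": {"QueueName": f"example-{env}-uploaded"},
    }
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Description": f"Illustrative stack for the {env} environment",
        "Resources": resources,
    }


# The same code yields Dev, Stage and Prod templates that differ only in names.
templates = {
    env: json.dumps(build_template(env), indent=2, sort_keys=True)
    for env in ("dev", "stage", "prod")
}
```

Because every environment comes from the same function, Dev, Stage and Prod stay nearly identical by construction, and a change to the infrastructure is a reviewed code change rather than a manual console edit.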

Automation: Continuous Integration/Continuous Delivery

● We do not deploy code by hand

● WESTPrime’s Bamboo CI/CD system:

○ watches commits to Stash code repo

○ builds code

○ runs unit and integration tests

○ creates deployment artifacts: tarballs sent to S3

○ restarts EC2 instances to run new code versions

● Immutable deployment:

○ Ansible provisions software, configs

○ instance lifetime: hours to days, not months

● Benefits:

○ faster dev cycle times

○ lower cost

○ repeatable

○ reliable

Benefits of Cloud-Native Architecture, Design, Practices

● Managed cloud services

● Robust multi-AZ infrastructure

● Autoscale minimizes cost, handles surge

● Separate web front-end from API backend

● Infrastructure as Code

● CI/CD build, test, deploy automation

Commercial Cloud Service Benefits for NASA Developers

● Build things that are impossible in a datacenter or hosting provider.

● Accommodate failure in components without breaking your application.

● Define infrastructure in code to make changes reliable and easy.

● Don’t need to over-provision for worst case.

● Use managed services to save time.

● Experiment with little investment.

fin

Questions?

[email protected]