Berlin AWS meetup: here.com on AWS

Implementation timeline, pitfalls and lessons learned
Cristian Măgherușan-Stanciu <[email protected]> @magheru_san
June 16, 2015

agenda

Introduction

Implementation timeline

Next steps

Conclusions

Pizza :-)

introduction

about here

HERE is a leading location company

∙ Over 6000 employees in 55 countries
∙ We own our map data
  ∙ state-of-the-art offline capabilities
  ∙ global map coverage
  ∙ weekly map updates
  ∙ location-based services around it
∙ Market leader in automotive
  ∙ in 4 out of 5 cars sold in the Western hemisphere
∙ Powering a myriad of partners, including
∙ Free apps on major mobile platforms

here in berlin

About us

∙ 820 internal employees and hiring :-)
∙ 56 countries and 25 languages, 36% Germans
∙ 20% female
∙ Average age 36 years
∙ Interesting mix of start-up/enterprise culture
∙ AWS-first policy for all our services

about here.com

The main consumer website of HERE

∙ Designed to integrate seamlessly with, and complement, the native mobile apps
∙ Reference implementation for many capabilities
∙ Rewritten from scratch, starting in fall 2013
∙ Modern technology stack

about here.com

About the new version of here.com

∙ Re-launched only 6 months ago
∙ Monthly page loads in the tens of millions
∙ Traffic growing fast, already 3x since the re-launch
∙ Hosted on AWS

here.com screenshots

implementation timeline

oct 2013 - initial architecture

First AWS setup

oct 2013 - first commits

Simple way to run the application

∙ Relatively easy to bootstrap
  ∙ reused as much as we could from our hello_world skeleton service
∙ Application running on EC2 instances
∙ Single AWS region
∙ Users would connect directly to the ELB
∙ All AWS infrastructure defined using CloudFormation
  ∙ stacks based on a reused hello_world template
∙ Primitive continuous delivery pipeline: Jenkins, Puppet, cron

dec 2013 - first internal release

’Backstage’ launch

∙ Shared with all HERE employees
∙ A few hundred daily users
∙ Started to get valuable feedback, mostly about the UX
∙ Production configuration snapshotted manually before the launch
∙ No major architecture changes

jan 2014 - infrastructure improvements

Deployment orchestration changes

∙ Fully controlled by Jenkins via ec2_collective/SQS
∙ Production deployments triggered automatically after every commit
  ∙ no longer relying on cron
  ∙ we can easily see deployment failures in the Job output
  ∙ automated configuration snapshotting for Production

jan 2014 - infrastructure improvements

Relatively large number of Dev environments

∙ Created and maintained manually via CloudFormation
∙ Configurations started drifting
∙ Updating them became tedious whenever a mass change was needed
∙ The clouds tool was written during a ’Research Week’
  ∙ makes it so much easier to manage diverging stacks
  ∙ released on GitHub under GPL2
  ∙ installable with gem install
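The drift problem above can be illustrated with a small sketch. This is not how the clouds tool works internally; it only shows one way to spot environments whose stack parameters have diverged from a baseline. The function name, the environment names and the parameter keys are all hypothetical.

```python
from typing import Any, Dict, Tuple


def find_drift(
    baseline: Dict[str, Any], environments: Dict[str, Dict[str, Any]]
) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Return, per environment, the keys whose value differs from the baseline.

    Each reported value is a (baseline_value, environment_value) pair;
    a key missing on one side shows up as None on that side.
    """
    drift = {}
    for env, config in environments.items():
        changed = {
            key: (baseline.get(key), config.get(key))
            for key in sorted(set(baseline) | set(config))
            if baseline.get(key) != config.get(key)
        }
        if changed:
            drift[env] = changed
    return drift


# Hypothetical stack parameters, for illustration only.
baseline = {"InstanceType": "m3.medium", "MinSize": 2}
envs = {
    "dev-alice": {"InstanceType": "m3.medium", "MinSize": 2},
    "dev-bob": {"InstanceType": "t2.small", "MinSize": 2, "Debug": True},
}
report = find_drift(baseline, envs)
```

Environments that match the baseline are left out of the report, so an empty result means nothing has drifted.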

aug 2014 - alpha release

Released to hundreds of selected preview users

∙ Capacity planning and load tests all looked great
∙ Architecture remained almost the same
  ∙ added ElastiCache (memcached) as shared temporary storage
  ∙ worked around SQS limitations: split queues by environment
∙ Reports of slow loading performance triggered some actions
  ∙ started using NewRelic for Real User Monitoring (RUM)
  ∙ implemented WebPageTest (WPT) automation in our CI

oct 2014 - beta release

Opt-in release from the legacy website

∙ Beta invites implemented using SES
∙ Thousands of users world-wide
∙ More capacity planning
∙ Added CloudFront CDN for static files

oct 2014 - beta release

Beta architecture

oct 2014 - beta release

CloudFront setup details

∙ S3 bucket as origin
∙ Dev/prod S3 bucket sync, IAM cross-account bucket policy
∙ Noticed worse performance in NewRelic, WTH?
∙ CloudFront limitation: won’t compress content
  ∙ explicit gzip compression needed, scripted at build time
  ∙ upload already compressed files to S3
  ∙ only compress the files when it helps (>1KB size reduction)
∙ Required HTTP headers, set as S3 object metadata
  ∙ MIME type
  ∙ gzip encoding
  ∙ caching duration (we use half a year by default)
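A minimal sketch of the build-time step described above, assuming a plain Python build script (the original tooling is not shown in the slides). It gzips a file only when that saves more than 1KB and returns the headers that would be stored as S3 object metadata. The function name is hypothetical, and "half a year" is approximated as 182 days.

```python
import gzip
import mimetypes
from typing import Dict, Tuple

MIN_SAVING_BYTES = 1024              # ">1KB size reduction" from the slide
HALF_YEAR_SECONDS = 182 * 24 * 3600  # assumption: half a year ~ 182 days


def prepare_upload(filename: str, content: bytes) -> Tuple[bytes, Dict[str, str]]:
    """Return (body, headers) for one static file about to be uploaded."""
    mime, _ = mimetypes.guess_type(filename)
    headers = {
        "Content-Type": mime or "application/octet-stream",
        "Cache-Control": f"max-age={HALF_YEAR_SECONDS}",
    }
    compressed = gzip.compress(content, compresslevel=9)
    # Keep the gzip version only when it actually helps.
    if len(content) - len(compressed) > MIN_SAVING_BYTES:
        headers["Content-Encoding"] = "gzip"
        return compressed, headers
    return content, headers


big_body, big_headers = prepare_upload("app.css", b"a { color: red; }\n" * 1000)
small_body, small_headers = prepare_upload("tiny.css", b"body{}")
```

The returned body and headers would then be handed to whatever S3 upload call the build uses; that call is left out here on purpose.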

oct 2014 - beta release

File path conventions

∙ File paths depend on the file content: /static_content/path/to/file.css_d34db33f
  ∙ ’d34db33f’ is the result of sha256(plain_file_content)[0..7]
  ∙ path translation table
  ∙ all files under one directory for easy filtering later
  ∙ intentionally decoupled from what’s deployed on EC2
  ∙ idempotent content updates
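The path convention above can be sketched in a few lines. This assumes the hash is the hex digest of the uncompressed file content and that [0..7] means its first eight characters, which matches the ’d34db33f’ example; the function names are hypothetical.

```python
import hashlib
from typing import Dict


def hashed_path(relative_path: str, content: bytes,
                prefix: str = "/static_content") -> str:
    """Append the first 8 hex characters of sha256(content) to the path."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{prefix}/{relative_path}_{digest}"


def build_translation_table(files: Dict[str, bytes]) -> Dict[str, str]:
    """Map each original path to its content-addressed path."""
    return {path: hashed_path(path, body) for path, body in files.items()}


table = build_translation_table({"path/to/file.css": b"body { margin: 0; }"})
```

Because the name depends only on the content, re-uploading an unchanged file produces the same path (idempotent updates), while any change produces a new path that cached copies cannot shadow.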

oct 2014 - beta release

Still single region

∙ Limitation of our custom continuous deployment automation was fixed, but it was too late
∙ Initial test results
  ∙ CloudFront static file caching would hide this well enough
  ∙ NewRelic and WebPageTest results deemed acceptable

dec 2014 - launch

Launch architecture

dec 2014 - launch

All traffic from the legacy environment (HTTP redirect)

∙ Millions of users world-wide, more capacity planning needed
∙ Extended CloudFront, now also used for dynamic content
∙ Decided to implement dynamic-CloudFront before multi-region: more benefits for little extra cost
  ∙ OCSP Stapling - no more extra blocking call to your CA: 80-400ms saving
  ∙ early TCP termination: 50-500ms saving
  ∙ long-lived connections between CloudFront and ELB
  ∙ HTTP redirects to HTTPS: 50-500ms saving for plain HTTP users
  ∙ Browsers: one less domain to resolve, fewer TCP connections to maintain, less CPU usage

jan 2015 - multi-region

Desired setup

jan 2015 - multi-region

First expansion attempt

∙ Latency-based routing with Route53, really straightforward
∙ No other architecture changes were needed
∙ Deployed to Singapore and Frankfurt in addition to existing Virginia
∙ Soon realized that Frankfurt was a bit ’special’ :-)
  ∙ different way to define ElastiCache SGs (VPC-only region)
  ∙ ElastiCache was not yet supported by CloudFormation there
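To make the latency-based routing above concrete, here is a sketch that builds one latency alias record per region. The dictionary follows the ChangeBatch shape accepted by Route 53's ChangeResourceRecordSets API (for example via boto3's `change_resource_record_sets`); no API call is made here. The domain name, ELB DNS names and hosted zone ids are placeholders, not values from the talk.

```python
from typing import Dict, Tuple

# Real AWS region codes for the three regions named on the slide.
REGIONS = {
    "us-east-1": "Virginia",
    "ap-southeast-1": "Singapore",
    "eu-central-1": "Frankfurt",
}


def latency_record(name: str, region: str, elb_dns_name: str, elb_zone_id: str) -> dict:
    """One latency-routed alias record pointing at a regional ELB."""
    return {
        "Action": "UPSERT",
        "ResourceRecordSet": {
            "Name": name,
            "Type": "A",
            "SetIdentifier": f"{name}-{region}",  # must be unique per record
            "Region": region,                      # enables latency-based routing
            "AliasTarget": {
                "HostedZoneId": elb_zone_id,
                "DNSName": elb_dns_name,
                "EvaluateTargetHealth": True,
            },
        },
    }


def build_change_batch(name: str, load_balancers: Dict[str, Tuple[str, str]]) -> dict:
    """load_balancers maps region -> (ELB DNS name, ELB hosted zone id)."""
    return {
        "Comment": "latency-based routing sketch",
        "Changes": [
            latency_record(name, region, dns, zone)
            for region, (dns, zone) in sorted(load_balancers.items())
        ],
    }


# Placeholder load balancers, for illustration only.
lbs = {r: (f"elb-{r}.example.invalid", "ZPLACEHOLDER") for r in REGIONS}
batch = build_change_batch("www.example.invalid.", lbs)
```

Adding a region then amounts to one more entry in the mapping, which fits the slide's remark that no other architecture changes were needed.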

jan 2015 - multi-region

With Singapore added, we noticed almost no performance improvement - WTH?

∙ Investigation immediately revealed NewRelic setup errors
  ∙ incorrectly included in HTML
  ∙ we were missing metrics from the slowest clients! :-(
∙ Fixed the NewRelic configuration
  ∙ noticed how slow we really were in most geographies

jan 2015 - multi-region

Investigating the lack of performance improvements

∙ Backend performance issues in Singapore
∙ Only shifting network latency, not overcoming it
∙ Root cause: some APIs we depend on when rendering HTML were deployed in remote regions

jan 2015 - multi-region

Speeding up Singapore

∙ Avoid blocking API calls from the landing page
  ∙ replaced one with a local GeoIP database, removed another
  ∙ backend performance improved 50x
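The idea of swapping a remote API call for a local GeoIP lookup can be sketched as below. The talk does not say which database or library was used, so this is only an in-memory stand-in: the class name is hypothetical, the table uses IPv4 documentation ranges with made-up country codes, and the networks are assumed not to overlap.

```python
import bisect
import ipaddress
from typing import Iterable, Optional, Tuple


class LocalGeoIP:
    """In-memory IPv4 range -> country lookup, no network round trip."""

    def __init__(self, rows: Iterable[Tuple[str, str]]):
        entries = []
        for cidr, country in rows:
            net = ipaddress.ip_network(cidr)
            entries.append(
                (int(net.network_address), int(net.broadcast_address), country)
            )
        entries.sort()
        self._entries = entries
        self._starts = [start for start, _, _ in entries]

    def country(self, ip: str, default: Optional[str] = None) -> Optional[str]:
        addr = int(ipaddress.ip_address(ip))
        # Find the last range starting at or before the address.
        i = bisect.bisect_right(self._starts, addr) - 1
        if i >= 0 and addr <= self._entries[i][1]:
            return self._entries[i][2]
        return default


# Documentation-only ranges with illustrative country codes.
db = LocalGeoIP([("192.0.2.0/24", "DE"), ("198.51.100.0/24", "SG")])
```

A lookup such as `db.country("192.0.2.10")` is a binary search in process memory, so rendering the landing page no longer waits on an API deployed in a remote region.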

apr-may 2015 - performance issues

Loading performance was lagging behind our competitors

∙ They improved significantly
∙ We got many new users from emerging markets
∙ Visible in user feedback and bounce rates
∙ We had to take action

apr-may 2015 - magellan

Our current ways of working, Magellan, set up in Jan 2015

∙ Self-organizing, temporary, cross-functional teams mandated by management to increase a metric
∙ Bottom-up innovation
  ∙ everyone chooses their team
  ∙ design, implementation and release are the team’s responsibility
  ∙ management reviews the progress and provides some advice
∙ First iteration (Jan - Apr): post-launch usability improvements
∙ Second iteration: tech debt and performance fixes

apr-may 2015 - magellan

Improving our performance

∙ Goal of one of the teams
  ∙ bring load performance back on par with the competition
∙ Actions that were taken
  ∙ finally launched Frankfurt (fixed in the meantime)
  ∙ also Sydney and California
  ∙ refactored our CloudFormation stacks (now all identical)
  ∙ instances were right-sized
  ∙ devs heavily optimized the application for faster loading
∙ DevOps at its best

apr-may 2015 - magellan

Results

∙ Visual progress now comparable to our competition :-)
∙ Global loading time average reduced by about a second
∙ Lots of improvement ideas were added to the backlog
∙ More fixes to be implemented soon

next steps

More performance improvements

∙ Fix some remaining bugs
  ∙ we’d finish loading 2-3 seconds earlier
  ∙ but minimal visual progress changes
∙ HTTP/2 on CloudFront
  ∙ AWS has to implement it
  ∙ eventual application changes
  ∙ reverse proxy some of our client APIs through CloudFront

conclusions

In no particular order

∙ Start small
∙ Iterate continuously
∙ Be data-driven in decision making (A/B, user feedback, RUM, WPT)
∙ Not all AWS regions are (born) equal
∙ Expect and embrace AWS limitations
∙ Workarounds sometimes lead to bigger improvements (cache busting, clouds)
∙ CloudFront is excellent at HTTPS website acceleration, use it!
∙ Automate anything that bothers you
∙ DevOps FTW!

Questions

references and credits

Resources

∙ Clouds on GitHub: https://github.com/cristim/clouds
∙ Any used logos and images are © of their respective authors

Thank You!

pizza :-)

pizza!
