Today has been a complete debacle. I do not quite understand how a website can crash almost...

2-642

Surviving Success: Architecting Web Sites and Services for Rapid Growth

Mark Simms (@mabsimms)

Principal Group Program Manager

Designing sites and services that can survive rapid growth and demand spikes requires careful design and architecture choices.

In this session we’ll explore preparing a web site for growth, then handling extreme success when it suddenly arrives.

Microsoft Azure Customer Advisory Team (AzureCAT)

Works with internal and external customers to build out some of the largest applications on Azure.

Get our hands dirty on all aspects of delivery: design, implementation and, all too often, firefighting.

Setting the Stage

This is meant to be an interactive discussion – if you don’t ask questions, we will!

Note: please use the mic.

This session will be an exploration of the journey to scale for a mostly fake web site and service

• This will not be a discussion of features

• Focus on the journey, design and architecture choices and their impact on scalability and availability

• Challenge: how to leave the door open for success, without “overbuilding” in advance

Warning: there will be anonymized or mashed up customer stories, code mockery and a high incidence of sarcasm.

Agenda and Expectations

Today has been a complete debacle. I do not quite understand how a website can crash almost immediately upon receiving traffic.

30k RPS

1 IaaS VM

File-based SQL CE

AzureCAT: Framing a Customer Discussion

• Scalability. The ability to add additional capacity to the service to handle increases in load and demand, together with efficient and effective use of resources allocated.

• Availability. The ability of the solution to continue to provide value in the face of transient and enduring faults in the application and underlying service dependencies.

• Manageability. The ability to understand health and performance of the live system and manage site operations.

• Feasibility. The ability to deliver and maintain the system, on time (ish) and under budget (ish).

Adding more Capacity

• Identifying and breaking contention and choke points

• How to add additional capacity to a solution?

• There are subtle constraints to consider...

Using Capacity more Efficiently

• Traditional performance tuning

• Avoiding common anti-patterns traditionally hidden by capitalized infrastructure

• Identify unbalanced workloads (read vs. write)

Scalability == Capacity * Density
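As a rough worked example of the relation above, the sketch below treats capacity as the number of instances and density as the sustainable requests per second each instance delivers. The per-instance figures are invented for illustration; only the 30k RPS target echoes the story earlier in the deck.

```python
import math


def sustainable_rps(instances: int, rps_per_instance: float) -> float:
    """Scalability as capacity (instance count) times density (RPS per instance)."""
    return instances * rps_per_instance


def instances_needed(target_rps: float, rps_per_instance: float) -> int:
    """Instances required to carry target_rps at a given density."""
    return math.ceil(target_rps / rps_per_instance)


# Hypothetical densities: doubling density halves the instances you must rent.
before_tuning = instances_needed(30_000, 500)
after_tuning = instances_needed(30_000, 1_000)
print(before_tuning, after_tuning)
```

Either lever moves the total: renting more instances raises capacity, while removing anti-patterns raises density.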

• What type of application or service are you building?

• What proportion of your budget is allocated for non-functional service fundamentals?
  • If you are designing a multi-release platform, and this number is zero…

Design Considerations

• Balance – You Ain’t Gonna Need It (YAGNI) with Oh, You Did Need That (OYDNT?)

• Balance – testing before production vs. testing in production
  • Hint: you’re always testing in production

Design Considerations

• Web site and mobile application for tracking sports games

• Burst mode during “events”. Application experiences high and unpredictable load during scheduled events.

• Viral growth potential. Have to be ready for rapid adoption (triggers – social inflection point, successful ad?).

Design Scenario

Workload Decomposition

Key to design choices is understanding the inherent workloads in the system

Pay careful attention to state and consistency

Workload Characterization

List of events

• Read workload

• (Mostly) scheduled updates

• Minimal consistency requirements (minutes)

Status of active event

• Read workload, continuous concurrent updates during events.

• ~ 3-5 second read consistency

Status of active event (mobile)

• Read workload, continuous concurrent updates during events.

• ~ 3-5 second read consistency.

• Push notification on “interesting” update.
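One way to keep these characterizations honest is to write them down as data, so that later choices (such as cache lifetimes) can be derived from them. This is a minimal sketch: the `Workload` type and field names are hypothetical, the 120-second figure is an assumed stand-in for "minutes", and 3 seconds is the lower end of the 3-5 second range above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    name: str
    max_staleness_s: float       # how stale a read is allowed to be
    push_on_update: bool = False


WORKLOADS = [
    Workload("event_list", max_staleness_s=120),           # assumed value for "minutes"
    Workload("active_event_status", max_staleness_s=3),
    Workload("active_event_status_mobile", max_staleness_s=3, push_on_update=True),
]


def cache_ttl_s(workload: Workload) -> float:
    """A cached read may live no longer than the workload's staleness budget."""
    return workload.max_staleness_s


for w in WORKLOADS:
    print(w.name, cache_ttl_s(w), w.push_on_update)
```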

• “Classic” 3-tier enterprise relational design

• Three VM configuration:
  • IIS VM
  • App tier VM
  • SQL Server VM

• Software stack:
  • ASP.NET, WebApi, Entity Framework
  • .NET 4.5

Stage 1 – It Works Great on my Dev Box


• Capacity. Challenging to add additional capacity to front-end (need to manually configure VMs, deploy config+software, integrate with load balancer).

• Density. Application tier adds latency. VMs are tuned by default to protect the VM, not the application. Unbalanced workloads (read/write) share a single store (SQL).

Stage 1 – It Works Great on my Dev Box (Scale)

• Failure points. Everything is a single point of failure

Stage 1 – It Works Great on my Dev Box (Availability)

• Operational insight and visibility. See next slide.

Stage 1 – It Works Great on my Dev Box (Manageability)

Operational Visibility

This slide left intentionally blank, as this is probably your operational monitoring experience

• Limited resources (time, people and money)

• How to prioritize?
  • Frame the door – enable deployment of additional resources and capacity
  • Turn the lights on – operational visibility
  • Use data to drive investment – system response under load

Mapping the First Stage of the Journey

• Need a higher semantic level for components – make it easy to trade $$ for capacity
  • Money is always faster to spend than engineering time.

• Migrate / rehost application components to their PaaS equivalents
  • IIS VMs (front end / mid tier) -> Web Apps (formerly Web Sites)
  • SQL Server -> Azure SQL DB (we’ll get to sizing based on data in a bit)

Frame the Door – Enable “Adding”

• Psychic debugging is not a recipe for success

• Rent your way to victory (through insight and data)

• Evaluate options against your workload, “test for ergonomics”

• You may need more logging/diag later – prove the need with data

Turn the Lights On – Data Driven Engineering

• Every system has a breaking strain – find it or your users will

• Use to-destruction load testing to determine the stress curve of the system

• Do you need to do performance optimization, or can you simply throw more resources at the problem?

• If you need to optimize, can you target specific improvements?

Evaluate Current State – System Response
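To make "find the breaking strain" concrete, here is a small sketch that reads a stress curve from a stepped load test and reports the first load level that violates a latency or error budget. The `Sample` type, the thresholds, and the sample numbers are all invented for illustration; real values would come from your own load-test tooling.

```python
from typing import NamedTuple, Optional, Sequence


class Sample(NamedTuple):
    rps: int            # offered load for this step of the test
    p95_ms: float       # 95th percentile latency observed at that load
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def breaking_point(
    samples: Sequence[Sample],
    max_p95_ms: float = 1000.0,
    max_error_rate: float = 0.01,
) -> Optional[int]:
    """Return the lowest offered load that breaks either budget, or None."""
    for s in sorted(samples, key=lambda s: s.rps):
        if s.p95_ms > max_p95_ms or s.error_rate > max_error_rate:
            return s.rps
    return None


# Hypothetical stress curve: healthy up to 400 RPS, falling over at 800 RPS.
curve = [
    Sample(100, 80.0, 0.0),
    Sample(200, 95.0, 0.0),
    Sample(400, 180.0, 0.001),
    Sample(800, 2400.0, 0.07),
]
print(breaking_point(curve))
```

A result of `None` means the test never reached the breaking strain, so the load should be pushed higher.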

Stage 2 – Baseline Established

Use insight against the live system to understand the load profile.

Pay attention to the “this looks weird” moments. Those are hindsight moments waiting to happen.

Once you have telemetry, you have to look at it.

Seriously.

• Limited resources (time, people and money)

• How to prioritize?
  • Identify availability points and mitigations
  • Identify scale bottlenecks and contention points

Mapping the Second Stage of the Journey

What are the metered resources?

Azure Web Apps

Azure SQL DB

1 Compute instances for front-end / back-end web sites

• 10 dedicated instances (call support for more)

2 Concurrent active sockets (e.g. WebSocket)

• 350 / dedicated instance (can be increased)

3 Requests/sec per VM

• Metered by efficiency of implementation

4 Connections / database

• Capped at 180 (default ASP.NET pool size is 100)

5 Database throughput

• Multiple metered resources, efficiency of implementation


http://azure.microsoft.com/en-us/documentation/articles/azure-subscription-service-limits/
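The limits listed above interact. The arithmetic below is a sketch of how the connection cap can bite, using the figures from this slide (a cap of 180 connections per database, a default pool size of 100, and up to 10 instances). It assumes each instance holds exactly one connection pool, which is a simplification.

```python
DB_CONNECTION_CAP = 180   # from the slide
DEFAULT_POOL_SIZE = 100   # from the slide
MAX_INSTANCES = 10        # from the slide


def peak_db_connections(instances: int, pool_size: int) -> int:
    """Worst case: every instance fills its pool (assumes one pool per instance)."""
    return instances * pool_size


def max_safe_pool_size(instances: int, cap: int) -> int:
    """Largest per-instance pool that cannot exceed the database cap."""
    return cap // instances


worst_case = peak_db_connections(MAX_INSTANCES, DEFAULT_POOL_SIZE)
safe_pool = max_safe_pool_size(MAX_INSTANCES, DB_CONNECTION_CAP)
print(worst_case, safe_pool)
```

Under these assumptions, scaling out the front end without shrinking the pool simply moves the bottleneck to the database.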

What are the failure / availability points?

Azure Web Apps

Azure SQL DB

1 Internal logic errors or exceptions in application code

• By and large, IIS will save you from intermittent errors (or at least limit them to a single request)

2 Single SQL DB instance

• Any transient or enduring errors here will have a drastic effect on the web experience
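One common mitigation for transient faults against a single dependency is a bounded retry with backoff. The deck does not prescribe this; the sketch below is an illustration only, and `TransientError` and `call_with_retry` are hypothetical names rather than part of any library. It does nothing for enduring faults, which still need a fallback such as cached data.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a fault that is expected to clear on its own."""


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying transient failures with exponential backoff and jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            delay = base_delay_s * (2 ** attempt) * random.uniform(0.5, 1.5)
            sleep(delay)
    raise ValueError("attempts must be at least 1")
```

The `sleep` parameter is injectable so the behavior can be exercised in tests without real waiting.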


• Can we reduce/shift critical work for differential workloads (read/write/consistency) away from single-point resources?

• Primarily read workload – very suitable for caching
  • Note: this is something you need to invest engineering time in, in advance.
  • Adding caching during a live event is not something I want anybody else to experience.

Targeting Efficiency


Caching is not a magical solution...

Unless you have a primarily read workload against a slow-changing state store... then it is pretty magical.
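For a read-heavy workload that tolerates a few seconds of staleness, a read-through cache with a time-to-live is often enough. The sketch below is a minimal in-process illustration, not the deck's implementation: `TtlCache` and `load_status` are hypothetical names, the loader stands in for a database query, and the class is not thread-safe.

```python
import time
from typing import Callable, Dict, Generic, Hashable, List, Tuple, TypeVar

V = TypeVar("V")


class TtlCache(Generic[V]):
    """Read-through cache: serve a stored value until it is older than ttl_s."""

    def __init__(
        self,
        loader: Callable[[Hashable], V],
        ttl_s: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._loader = loader
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, V]] = {}

    def get(self, key: Hashable) -> V:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl_s:
            return entry[1]               # fresh enough: no trip to the store
        value = self._loader(key)         # miss or expired: read through
        self._entries[key] = (now, value)
        return value


db_reads: List[str] = []


def load_status(event_id: str) -> dict:
    """Hypothetical stand-in for the database read behind the event status page."""
    db_reads.append(event_id)
    return {"event_id": event_id, "status": "in progress"}
```

With a 3-second TTL, any number of readers hitting one key cause at most one store read per 3 seconds per process, which is why this pattern suits the "status of active event" workload.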

• Chatty I/O

• Extraneous fetching

• Improper instantiation

• No caching

• Synchronous I/O

• Etc.

Common Density Barriers

Read more at github.com/mspnp

For more on scaling and density approaches, check out:

Lessons From Scale: Building Applications for Azure

Mark Russinovich, April 30th @ 11:30am in Hall 1B

If you have high-fidelity load testing, you can use your telemetry data to find workflows to optimize.

If your load profiles are not based around extrapolating real customer load, you are telling yourself lies.

Don’t try to optimize everything. Optimize the primary path(s), observe, measure, react.

• Blueprint for success in a single data center

• How to get ready for the world? How to go to multiple data centers?

Mapping the Third Stage of the Journey

• Only three numbers: 0, 1 and N. How do we go from 1 -> N?

• Resources, state and affinity:
  • Front end / mid tier web resources. Low state, need code replication approach.
  • Back end database. Highly stateful, need eventually consistent data replication.
  • Routing. Need performance/locality based DNS routing.

Moving beyond one data center
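The idea behind performance-based routing can be shown with a toy model: among the healthy deployments, send the user to the one with the lowest measured latency. This is only a sketch of the concept, not how Azure Traffic Manager is implemented; the `Endpoint` type, region names, and latencies are made up.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Endpoint:
    region: str
    latency_ms: float    # latency as seen from the caller's location
    healthy: bool = True


def pick_endpoint(endpoints: Iterable[Endpoint]) -> Optional[Endpoint]:
    """Choose the lowest-latency healthy endpoint, or None if all are down."""
    candidates = [e for e in endpoints if e.healthy]
    return min(candidates, key=lambda e: e.latency_ms, default=None)


# Hypothetical view from one client location.
endpoints = [
    Endpoint("region-a", 35.0),
    Endpoint("region-b", 120.0),
    Endpoint("region-c", 20.0, healthy=False),
]
chosen = pick_endpoint(endpoints)
print(chosen.region if chosen else "no healthy endpoint")
```

Note that the nearest deployment is skipped when it is unhealthy, so the same mechanism also provides failover between data centers.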

Moving to N - Baseline

Start from your baseline 1 data center deployment.

Ensure that you can build a production environment from automation.

[Diagram: DC1 running Azure Web Apps (PROD and STAG) with P Redis]

Moving to N - ALM

Enable automated git publishing (or other ALM approach) to your staging environment.

Automated global deployment to production creates “learning moments”

[Diagram: DC1 running Azure Web Apps (PROD and STAG) with P Redis, plus Visual Studio Online, git, and App Insights]

Moving to N - Telemetry

Ensure that your telemetry service(s) are enabled and flowing data.

Ensure that your operations and dev staff are comfortable with working with the data.

[Diagram: DC1 running Azure Web Apps (PROD and STAG) with P Redis, plus git and App Insights]

Moving to N – Global Routing

Enable Azure Traffic Manager with a single region endpoint.

Set the stage to go to N deployments.

[Diagram: DC1 running Azure Web Apps (PROD and STAG) with P Redis, plus git, Azure Traffic Manager, and App Insights]

Moving to N – State Replication

Use your deployment scripts to roll out additional data centers.

Connect the additional data centers to git publishing.

Do NOT connect them to Traffic Manager (yet).

[Diagram: three data center deployments, each running Azure Web Apps (PROD and STAG) with Redis (one P, two S), plus Azure Traffic Manager, Visual Studio Online, git, and App Insights]

Moving to N – State Replication

Enable Azure SQL DB geo-replication to create readable secondaries in other data centers (note: requires Premium).

Enable the other data centers via Azure Traffic Manager.

[Diagram: three data center deployments, each running Azure Web Apps (PROD and STAG) with Redis (one P, two S), plus Azure Traffic Manager, Visual Studio Online, git, and App Insights]
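The ordering across these slides matters: a new data center is deployed, observed, and given its data replica before it receives user traffic. The sketch below encodes that ordering as a simple guard. The `Region` type and `enable_in_rotation` function are hypothetical bookkeeping, not calls into any Azure API.

```python
from dataclasses import dataclass


@dataclass
class Region:
    name: str
    deployed: bool = False            # rolled out by the deployment scripts
    telemetry_flowing: bool = False   # telemetry enabled and reporting
    replica_ready: bool = False       # readable database secondary in place
    in_rotation: bool = False         # receiving traffic via global routing


def enable_in_rotation(region: Region) -> None:
    """Put a region into rotation only after every prerequisite step is done."""
    steps = {
        "deployed": region.deployed,
        "telemetry_flowing": region.telemetry_flowing,
        "replica_ready": region.replica_ready,
    }
    missing = [name for name, done in steps.items() if not done]
    if missing:
        raise RuntimeError(f"{region.name} is not ready: missing {', '.join(missing)}")
    region.in_rotation = True
```

A guard like this turns "do NOT connect them yet" from a slide note into something the rollout automation can enforce.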

• None of these changes involved writing code

• When you need to grow and go quickly, this is a good thing.

• But it won’t work unless you have a strong foundation to build on.

• No psychic debugging!

Moving to N - Recap

• Success is exhilarating and terrifying. If you can expect wild and sudden success, lay the foundations.

• Insight is life. Psychic debugging during a crisis leads to flailing. Rent your telemetry – and look at it!

• Be ready to spend your way to victory. Design your system to allow adding more resources without rewriting (too much) code.

Takeaways

Azure Clinic, powered by Microsoft AzureCAT

1) Talk to the folks who build world-class, highly scalable, highly available systems on Azure today

2) Bring your ideas for your application of the future and have them design it with you right there

3) Bring your questions and your problems and get them fixed in the clinic on the spot

4) Learn about Azure implementation best practices

• Follow our patterns & practices guidance on GitHub at github.com/mspnp (contributions welcomed!)
  • Cloud pattern guidance - https://github.com/mspnp/azure-guidance
  • Common performance related anti-patterns - https://github.com/mspnp/performance-optimization

• Visit us at the AzureCAT clinic to discuss your scenarios and architecture

Call to Action


Improve your skills by enrolling in our free cloud development courses at the Microsoft Virtual Academy.

Try Microsoft Azure for free and deploy your first cloud solution in under 5 minutes!

Easily build web and mobile apps for any platform with Azure App Service for free.

Resources

http://www.microsoft.com/click/services/Redirect2.ashx?CR_CC=200623237




