Building Big: Lessons learned from Windows Azure customers – Part Two
Building Big: Lessons learned from Windows Azure customers – Part Two
Mark Simms (@mabsimms), Principal Program Manager, Microsoft
Simon Davies (@simongdavies), Windows Azure Technical Specialist, Microsoft
Session 3-030
Session Objectives
Designing large-scale services requires careful design and architecture choices. This session explores customer deployments on Azure and illustrates the key choices, tradeoffs and learnings.
Two-part session:
• Part 1: Building for Scale
• Part 2: Building for Availability
Other Great Sessions
This session focuses on architecture and design choices for delivering highly available services. If this isn't a compelling topic, there are many other great sessions happening right now!
Room            | Level | Title                                                                                   | Presenter
Nexus/Normandy  | 300   | Designing awesome XAML apps in Visual Studio and Blend for Windows 8 and Windows Phone 8 | Jeffrey Ferman
Trident/Thunder | 300   | Developing Mobile Solutions with Windows Azure Part II                                  | Nick Harris, Chris Risner
Odyssey         | 200   | Desktop apps: WPF 4.5 and Visual Studio 2012                                            | Pete Brown (DPE)
Magellan        | 200   | WP8 HTML5/IE10 for Developers                                                           | Rick Xu, Jorge Peraza
Building Big – the availability challenge
• Everything will fail – design for failure
• Get insight – instrument everything
Agenda
Designing and Deploying Internet Scale ServicesJames Hamilton, https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
Part 1: Design for Scale
• Partition the service
• Support geo-distribution
• Optimize for density

Part 2: Design for Availability
Design for failure:
• Do not trust underlying components
• Decouple components
• Avoid single points of failure
Instrument everything:
• Implement inter-service monitoring and alerting
• Instrument for production testing
• Configurable logging
What are the 9’s?
The Hard Reality of the 9’s
Design for Failure
Given enough time and pressure, everything fails. How will your application behave?
• Gracefully: handle failure modes and continue to deliver value
• Not so gracefully…
Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention
Failure Scope
• Node: individual nodes may fail – connectivity issues (transient failures), hardware failures, configuration and code errors
• Service: entire services may fail – service dependencies (internal and external)
• Region: regions may become unavailable – connectivity issues, acts of nature
Use fault-handling frameworks that recognize transient errors:
• CloudFX
• Patterns & Practices Transient Fault Handling Application Block (P+P TFH)
Apply appropriate retry and backoff policies.
Node Failures
Don’t do this – why?
Sample Retry Policies

Platform     | Context                                | Target e2e latency max | "Fast First" | Retry count | Delay  | Backoff
SQL Database | Synchronous (e.g. render web page)     | 200 ms                 | Yes          | 3           | 50 ms  | Linear
SQL Database | Asynchronous (e.g. process queue item) | 60 seconds             | No           | 4           | 5 s    | Exponential
Azure Cache  | Synchronous (e.g. render web page)     | 100 ms                 | Yes          | 3           | 10 ms  | Linear
Azure Cache  | Asynchronous (e.g. process queue item) | 500 ms                 | Yes          | 3           | 100 ms | Exponential
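The policies above combine a retry budget, a "fast first" immediate retry, and a linear or exponential delay between attempts. The deck's frameworks (CloudFX, P+P TFH) implement this in .NET; the following is only a minimal language-agnostic sketch in Python, with the parameter names (`retries`, `delay`, `backoff`, `fast_first`) being illustrative, not any library's API:

```python
import time

def retry(op, *, retries=3, delay=0.05, backoff="linear",
          fast_first=True, is_transient=lambda e: True, sleep=time.sleep):
    """Retry `op` on transient errors with a bounded attempt budget."""
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == retries or not is_transient(exc):
                raise  # budget exhausted or enduring fault: fail fast
        if fast_first and attempt == 0:
            continue  # "fast first": the first retry goes out immediately
        step = attempt if fast_first else attempt + 1
        # Linear: delay, 2*delay, ...  Exponential: delay, 2*delay, 4*delay, ...
        wait = delay * step if backoff == "linear" else delay * (2 ** (step - 1))
        sleep(wait)
```

For the SQL Database synchronous row of the table, this would be called with `retries=3, delay=0.05, backoff="linear", fast_first=True`, with `is_transient` mapping the platform's transient error codes.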
At some point, your request is blocking the line. Fail gracefully, and get out of the queue!
Too much retry means too much trust of the downstream service.
Decoupling Components
[Chart: Web Request Response Latency – average latency vs. response latency over time]
Decoupling Components
• Leverage asynchronous I/O. Beware – not all apparently async calls are "purely" async.
• Ensure that all external service calls are bounded. Bound the overall call latency (including retries); beware of thread pool pressure.
• Beware of convoy effects on failure recovery. Trying too hard to catch up can flood newly recovered services.
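Bounding the overall call latency, retries included, is the key point: a per-attempt timeout alone lets retries stack up. A minimal async sketch of that idea (Python here purely for illustration; the deck's context is .NET, and `bounded_call` and its parameters are hypothetical names):

```python
import asyncio

async def bounded_call(op, *, overall_timeout=0.5, retries=2, delay=0.01):
    """Bound the total latency of an external call, retries included."""
    async def attempts():
        for attempt in range(retries + 1):
            try:
                return await op()
            except ConnectionError:
                if attempt == retries:
                    raise
                await asyncio.sleep(delay)
    # One deadline covers every attempt, so retries cannot stack up latency
    # past the caller's budget.
    return await asyncio.wait_for(attempts(), timeout=overall_timeout)
```

A caller rendering a web page would pick an `overall_timeout` derived from its own latency target and treat `TimeoutError` as a graceful-degradation path rather than retrying further.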
Service Level Failures
Entire services will have outages: SQL Azure and Windows Azure Storage have SLAs < 100%, and external services may be unavailable or unreachable.
The application needs to work around these:
• Return a fail code to the user ("please try again later")
• Queue and try later ("we've received your order…")
Region Level Failure
Regional failures will occur:
• Load needs to be spread over multiple regions
• Route around failures
Example: Digimarc – digital watermarks and mobile integration across 8 datacentres
Example Distribution with Traffic Manager
Global load does not necessarily give uniform distribution.
• Hosted service(s) per data centre
• Each service is autonomous – services independently receive or pull data from source
• Azure Traffic Manager can direct traffic to the "nearest" service
• Use probing to determine service health
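Traffic Manager-style probing just issues periodic HTTP requests to a well-known path and keeps an endpoint in rotation while it returns 200. A minimal sketch of the service side of that contract (Python for illustration; `probe_status` and the `/probe` path are assumptions, not Traffic Manager's fixed names – the probe path is configurable):

```python
from http.server import BaseHTTPRequestHandler

def probe_status(checks):
    # 200 keeps this instance in the load balancer's rotation;
    # 503 takes it out until the dependencies recover.
    return 200 if all(check() for check in checks) else 503

class ProbeHandler(BaseHTTPRequestHandler):
    # Hypothetical dependency checks; a real probe would ping the
    # database, cache, and storage this instance depends on.
    checks = [lambda: True]

    def do_GET(self):
        status = probe_status(self.checks) if self.path == "/probe" else 404
        self.send_response(status)
        self.end_headers()
```

The design point is that the probe should reflect whether the instance can actually serve traffic (its dependencies are reachable), not merely that the process is alive.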
[Diagram: Information publishing – Azure Traffic Manager routes requests across three regions; each region runs an autonomous stack of Web Role, Worker Role, Cache Role, DB and Azure Storage, fed from the source data]
Service Insight
Deep and detailed data is needed for management, monitoring, alerting and failure diagnosis. Capture, transport, storage and analysis of this data require careful design.
Characterizing Insight
Build and Buy (or Rent)
No "one size fits all" covers every perspective at scale: near real-time monitoring & alerting, deep diagnostics and long-term trending have different needs.
Mix platform components and services: Windows Azure Diagnostics, application logging, the Azure portal, and 3rd-party services.
• New Relic: Free/$24/$149 pricing model (per server per month). Agent installation on the server (role instance); hooks the application via the Profiling API.
• AppDynamics: Free -> $979.00 (6 agents). Agent-based, hooking the Profiling API; cross-instance correlation.
• OpsTera: Leverages Windows Azure Diagnostics (WAD) data; graphing, alerts, auto-scaling.
• PagerDuty: On-call scheduling, alerting and incident management. $9/$18 per user per month. Integration with monitoring tools (e.g. New Relic and others), HTTP API, email.
Windows Azure Diagnostics (WAD)
• Azure platform service (agent) for collection and distribution of telemetry
• Standard structured storage formats (perf counters, events)
• Code- or XML-driven configuration
• Partially dynamic (post an updated file to blob store)
[Diagram: default WAD data flow – perf counters -> WAD Performance Counters Table; Windows events -> WAD Windows Events Logs Table; diagnostic events -> WAD Logs Table; IIS log files and failed request logs -> wad-iis-log files and wad-iis-failed log files containers; crash dumps -> wad-crash-dumps container]
Limitations of the Default Configuration
• Azure table storage is the target for performance counter and application log data
• General maximum throughput is ~1,000 entities / partition / table
• Performance counters:
  • Part of the timestamp is used as the partition key (limits the number of concurrent entity writes)
  • Each partition key is 60 seconds wide, and entries are written asynchronously in bulk
  • The more entities in a partition (i.e. the number of performance counter entries × the number of role instances), the slower the queries
• Impact: to maintain acceptable read performance, large-scale sites may need to:
  • Increase the performance counter collection period (1 minute -> 5 minutes)
  • Decrease the number of log records written into the activity table (by raising the filtering level – WARN or ERROR, no INFO)
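The 60-second-wide partition key explains the bottleneck: every record from every role instance within the same minute lands in one partition, so per-partition throughput caps the whole deployment. A sketch of that bucketing, assuming the commonly described WAD scheme of a "0"-prefixed .NET tick count rounded down to the minute (verify against your actual table before relying on the exact format):

```python
from datetime import datetime, timezone

# .NET ticks (100 ns units) at the Unix epoch.
EPOCH_TICKS = 621_355_968_000_000_000
TICKS_PER_MINUTE = 60 * 10_000_000

def wad_partition_key(ts: datetime) -> str:
    """Partition key for a WAD-style table: the timestamp's tick count,
    rounded down to the minute, prefixed with '0'."""
    ticks = EPOCH_TICKS + int(ts.timestamp() * 10_000_000)
    return "0" + str(ticks - ticks % TICKS_PER_MINUTE)
```

Any two records in the same minute share a key, which is exactly why raising the collection period and filtering level reduces both write contention and query cost.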
Understanding Azure Table Store
Managing the Deluge
Per application server, data sources (IIS logs, application logs, performance counters) feed two channels:
• High value data – filter, aggregate, publish – consumed to generate alerts, display dashboards and drive operational intelligence
• High volume data – batch, partition, archive – consumed for data mining / analysis, historical trends and root cause analysis
Extending the Experience
• Add high-bandwidth (chunky) logging and telemetry channels for verbose data logging
• Capture tracing via core System.Diagnostics (or log4net, NLog, etc.) with:
  • WARN/ERROR -> table storage
  • VERBOSE/INFO -> blob storage
• Use run-time configurable logging channels to enable selective verbose logging to table (e.g. log just database information)
• Leverage the features of the core Diagnostic Monitor; use custom directory monitoring to copy files to blob storage
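The severity split above (WARN/ERROR to the queryable hot store, VERBOSE/INFO to cheap bulk storage) is just two handlers with different level filters on one logger. The deck does this with System.Diagnostics or NLog; a minimal Python sketch of the same routing, with in-memory sinks standing in for table and blob storage:

```python
import logging

def build_logger(hot_sink, bulk_sink):
    """Two-channel logger: high-value records go to `hot_sink` (stand-in
    for table storage), verbose records to `bulk_sink` (stand-in for
    blob storage)."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()

    hot = logging.StreamHandler(hot_sink)
    hot.setLevel(logging.WARNING)  # WARN/ERROR -> hot channel only

    bulk = logging.StreamHandler(bulk_sink)
    # Everything below WARNING -> bulk channel only.
    bulk.addFilter(lambda record: record.levelno < logging.WARNING)

    logger.addHandler(hot)
    logger.addHandler(bulk)
    return logger
```

The same pattern generalizes: the bulk channel can batch and compress before upload, while the hot channel stays small enough to query for alerting.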
Extending Diagnostics
[Diagram: the default WAD data flow extended with verbose channels – verbose performance counters, verbose event logs and verbose events routed to blob storage alongside the standard WAD tables]
Handling Transient Failures
Logging transient failures:
• Log all external API calls with timing
• Log the full exception (not just .ToString())
Demo: Logging and Retry with CloudFX
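"Log all external API calls with timing, and log the full exception" is easy to centralize in a wrapper so no call site can forget it. The demo uses CloudFX in .NET; this is only an illustrative Python equivalent (the decorator name `timed_call` is made up for the sketch):

```python
import functools
import logging
import time

log = logging.getLogger("external")

def timed_call(fn):
    """Log every external call with its duration; on failure, log the
    full exception (stack trace included, not just str(e))."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            log.info("%s succeeded in %.1f ms", fn.__name__,
                     (time.perf_counter() - start) * 1000)
            return result
        except Exception:
            # log.exception captures the whole traceback, the analogue of
            # logging the full exception rather than .ToString().
            log.exception("%s failed after %.1f ms", fn.__name__,
                          (time.perf_counter() - start) * 1000)
            raise
    return wrapper
```

Timing both outcomes matters: latency data from failed calls is what reveals whether retries are blowing the end-to-end budget.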
Demo: Multiple Logging Channels using NLog and WAD
Logging Configuration
Traditional .NET log configuration (System.Diagnostics) is hard-coded against System.Configuration (app.config/web.config) – an anti-pattern for Azure deployment. Leverage an external configuration store (e.g. the Service Configuration or blob storage) for run-time dynamic configuration.
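The shape of that external-store approach: periodically re-read settings from a source that can change after deployment, then apply them to the live logger. A tiny sketch under those assumptions (Python for illustration; `fetch_config` stands in for downloading a settings file from blob storage or reading the Service Configuration):

```python
import logging

def apply_log_config(logger: logging.Logger, fetch_config) -> None:
    """Pull logging settings from an external store and apply them live,
    so verbosity can be raised for diagnosis without a redeploy."""
    cfg = fetch_config()  # e.g. parsed from a blob polled on a timer
    logger.setLevel(getattr(logging, cfg.get("level", "INFO")))
```

Calling this on a timer (or on a configuration-changed event) is what makes selective verbose logging practical in production.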
Recap and Resources
Building big:
• The Availability Challenge
• Design for Failure
• Get Insight into Everything

Resources:
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services
• TODO: failsafe doc link
Resources
• Follow us on Twitter @WindowsAzure
• Get Started: www.windowsazure.com/build
Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.