Building Big: Lessons learned from Windows Azure customers – Part Two
Building Big: Lessons learned from Windows Azure customers – Part Two
Mark Simms (@mabsimms), Principal Program Manager, Microsoft
Simon Davies (@simongdavies), Windows Azure Technical Specialist, Microsoft
Session 3-030
Session Objectives
Designing large-scale services requires careful design and architecture choices. This session explores customer deployments on Azure and illustrates the key choices, tradeoffs and learnings.
Two-part session:
• Part 1: Building for Scale
• Part 2: Building for Availability
Other Great Sessions
This session focuses on architecture and design choices for delivering highly available services. If this isn't a compelling topic, there are many other great sessions happening right now!
Room            | Level | Title                                                                                   | Presenter
Nexus/Normandy  | 300   | Designing awesome XAML apps in Visual Studio and Blend for Windows 8 and Windows Phone 8 | Jeffrey Ferman
Trident/Thunder | 300   | Developing Mobile Solutions with Windows Azure Part II                                  | Nick Harris, Chris Risner
Odyssey         | 200   | Desktop apps: WPF 4.5 and Visual Studio 2012                                            | Pete Brown (DPE)
Magellan        | 200   | WP8 HTML5/IE10 for Developers                                                           | Rick Xu, Jorge Peraza
Building Big – the availability challenge
• Everything will fail – design for failure
• Get insight – instrument everything
Agenda
Designing and Deploying Internet Scale ServicesJames Hamilton, https://www.usenix.org/events/lisa07/tech/full_papers/hamilton/hamilton.pdf
Part 1: Design for Scale
• Partition the service
• Support geo-distribution
• Optimize for density

Part 2: Design for Availability
Design for failure:
• Do not trust underlying components
• Decouple components
• Avoid single points of failure
Instrument everything:
• Implement inter-service monitoring and alerting
• Instrument for production testing
• Configurable logging
What are the 9’s?
The Hard Reality of the 9’s
Design for Failure
Given enough time and pressure, everything fails. How will your application behave?
• Gracefully: handle failure modes and continue to deliver value
• Not so gracefully…
Fault types:
• Transient: temporary service interruptions, self-healing
• Enduring: require intervention
Failure Scope
• Node: individual nodes may fail – connectivity issues (transient failures), hardware failures, configuration and code errors
• Service: entire services may fail – service dependencies (internal and external)
• Region: regions may become unavailable – connectivity issues, acts of nature
Use fault-handling frameworks that recognize transient errors:
• CloudFX
• Patterns & Practices Transient Fault Handling Application Block (P+P TFH)
Apply appropriate retry and backoff policies.
Node Failures
Don’t do this – why?
Sample Retry Policies

Platform     | Context                                | Target e2e latency max | "Fast First" | Retry count | Delay  | Backoff
SQL Database | Synchronous (e.g. render web page)     | 200 ms                 | Yes          | 3           | 50 ms  | Linear
SQL Database | Asynchronous (e.g. process queue item) | 60 seconds             | No           | 4           | 5 s    | Exponential
Azure Cache  | Synchronous (e.g. render web page)     | 100 ms                 | Yes          | 3           | 10 ms  | Linear
Azure Cache  | Asynchronous (e.g. process queue item) | 500 ms                 | Yes          | 3           | 100 ms | Exponential
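The policies above combine a retry budget, a "fast first" immediate retry, and a linear or exponential delay between attempts. The deck's frameworks (CloudFX, P+P TFH) implement this in .NET; the following is only a minimal language-agnostic sketch in Python, with the parameter names (`retries`, `delay`, `backoff`, `fast_first`) being illustrative, not any library's API:

```python
import time

def retry(op, *, retries=3, delay=0.05, backoff="linear",
          fast_first=True, is_transient=lambda e: True, sleep=time.sleep):
    """Retry `op` on transient errors with a bounded attempt budget."""
    for attempt in range(retries + 1):
        try:
            return op()
        except Exception as exc:
            if attempt == retries or not is_transient(exc):
                raise  # budget exhausted or enduring fault: fail fast
        if fast_first and attempt == 0:
            continue  # "fast first": the first retry goes out immediately
        step = attempt if fast_first else attempt + 1
        # Linear: delay, 2*delay, ...  Exponential: delay, 2*delay, 4*delay, ...
        wait = delay * step if backoff == "linear" else delay * (2 ** (step - 1))
        sleep(wait)
```

For the SQL Database synchronous row of the table, this would be called with `retries=3, delay=0.05, backoff="linear", fast_first=True`, with `is_transient` mapping the platform's transient error codes.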
At some point, your request is blocking the line. Fail gracefully, and get out of the queue!
Too much retry means too much trust of the downstream service.
Decoupling Components
[Chart: Web Request Response Latency – average latency vs. response latency over time]
Decoupling Components
• Leverage asynchronous I/O. Beware – not all apparently async calls are "purely" async.
• Ensure that all external service calls are bounded. Bound the overall call latency (including retries); beware of thread pool pressure.
• Beware of convoy effects on failure recovery. Trying too hard to catch up can flood newly recovered services.
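Bounding the overall call latency, retries included, is the key point: a per-attempt timeout alone lets retries stack up. A minimal async sketch of that idea (Python here purely for illustration; the deck's context is .NET, and `bounded_call` and its parameters are hypothetical names):

```python
import asyncio

async def bounded_call(op, *, overall_timeout=0.5, retries=2, delay=0.01):
    """Bound the total latency of an external call, retries included."""
    async def attempts():
        for attempt in range(retries + 1):
            try:
                return await op()
            except ConnectionError:
                if attempt == retries:
                    raise
                await asyncio.sleep(delay)
    # One deadline covers every attempt, so retries cannot stack up latency
    # past the caller's budget.
    return await asyncio.wait_for(attempts(), timeout=overall_timeout)
```

A caller rendering a web page would pick an `overall_timeout` derived from its own latency target and treat `TimeoutError` as a graceful-degradation path rather than retrying further.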
Service Level Failures
Entire services will have outages: SQL Azure and Windows Azure Storage have SLAs < 100%, and external services may be unavailable or unreachable.
The application needs to work around these:
• Return a fail code to the user ("please try again later")
• Queue and try later ("we've received your order…")
Region Level Failure
Regional failures will occur:
• Load needs to be spread over multiple regions
• Route around failures
Example: Digimarc – digital watermarks and mobile integration across 8 datacentres
Example Distribution with Traffic Manager
Global load does not necessarily give uniform distribution.
• Hosted service(s) per data centre
• Each service is autonomous – services independently receive or pull data from source
• Azure Traffic Manager can direct traffic to the "nearest" service
• Use probing to determine service health
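Traffic Manager-style probing just issues periodic HTTP requests to a well-known path and keeps an endpoint in rotation while it returns 200. A minimal sketch of the service side of that contract (Python for illustration; `probe_status` and the `/probe` path are assumptions, not Traffic Manager's fixed names – the probe path is configurable):

```python
from http.server import BaseHTTPRequestHandler

def probe_status(checks):
    # 200 keeps this instance in the load balancer's rotation;
    # 503 takes it out until the dependencies recover.
    return 200 if all(check() for check in checks) else 503

class ProbeHandler(BaseHTTPRequestHandler):
    # Hypothetical dependency checks; a real probe would ping the
    # database, cache, and storage this instance depends on.
    checks = [lambda: True]

    def do_GET(self):
        status = probe_status(self.checks) if self.path == "/probe" else 404
        self.send_response(status)
        self.end_headers()
```

The design point is that the probe should reflect whether the instance can actually serve traffic (its dependencies are reachable), not merely that the process is alive.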
[Diagram: Information publishing – Azure Traffic Manager routes requests across three regions; each region runs an autonomous stack of Web Role, Worker Role, Cache Role, DB and Azure Storage, fed from the source data]
Service Insight
Deep and detailed data is needed for management, monitoring, alerting and failure diagnosis. Capture, transport, storage and analysis of this data require careful design.
Characterizing Insight
Build and Buy (or Rent)
No "one size fits all" covers every perspective at scale: near real-time monitoring & alerting, deep diagnostics and long-term trending have different needs.
Mix platform components and services: Windows Azure Diagnostics, application logging, the Azure portal, and 3rd-party services.
• New Relic: Free/$24/$149 pricing model (per server per month). Agent installation on the server (role instance); hooks the application via the Profiling API.
• AppDynamics: Free -> $979.00 (6 agents). Agent-based, hooking the Profiling API; cross-instance correlation.
• OpsTera: Leverages Windows Azure Diagnostics (WAD) data; graphing, alerts, auto-scaling.
• PagerDuty: On-call scheduling, alerting and incident management. $9/$18 per user per month. Integration with monitoring tools (e.g. New Relic and others), HTTP API, email.
Windows Azure Diagnostics (WAD)
• Azure platform service (agent) for collection and distribution of telemetry
• Standard structured storage formats (perf counters, events)
• Code- or XML-driven configuration
• Partially dynamic (post an updated file to blob store)
[Diagram: default WAD data flow – perf counters -> WAD Performance Counters Table; Windows events -> WAD Windows Events Logs Table; diagnostic events -> WAD Logs Table; IIS log files and failed request logs -> wad-iis-log files and wad-iis-failed log files containers; crash dumps -> wad-crash-dumps container]
Limitations of the Default Configuration
• Azure table storage is the target for performance counter and application log data
• General maximum throughput is ~1,000 entities / partition / table
• Performance counters:
  • Part of the timestamp is used as the partition key (limits the number of concurrent entity writes)
  • Each partition key is 60 seconds wide, and entries are written asynchronously in bulk
  • The more entities in a partition (i.e. the number of performance counter entries × the number of role instances), the slower the queries
• Impact: to maintain acceptable read performance, large-scale sites may need to:
  • Increase the performance counter collection period (1 minute -> 5 minutes)
  • Decrease the number of log records written into the activity table (by raising the filtering level – WARN or ERROR, no INFO)
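The 60-second-wide partition key explains the bottleneck: every record from every role instance within the same minute lands in one partition, so per-partition throughput caps the whole deployment. A sketch of that bucketing, assuming the commonly described WAD scheme of a "0"-prefixed .NET tick count rounded down to the minute (verify against your actual table before relying on the exact format):

```python
from datetime import datetime, timezone

# .NET ticks (100 ns units) at the Unix epoch.
EPOCH_TICKS = 621_355_968_000_000_000
TICKS_PER_MINUTE = 60 * 10_000_000

def wad_partition_key(ts: datetime) -> str:
    """Partition key for a WAD-style table: the timestamp's tick count,
    rounded down to the minute, prefixed with '0'."""
    ticks = EPOCH_TICKS + int(ts.timestamp() * 10_000_000)
    return "0" + str(ticks - ticks % TICKS_PER_MINUTE)
```

Any two records in the same minute share a key, which is exactly why raising the collection period and filtering level reduces both write contention and query cost.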
Understanding Azure Table Store
Managing the Deluge
Per application server, data sources (IIS logs, application logs, performance counters) feed two channels:
• High value data – filter, aggregate, publish – consumed to generate alerts, display dashboards and drive operational intelligence
• High volume data – batch, partition, archive – consumed for data mining / analysis, historical trends and root cause analysis
Extending the Experience
• Add high-bandwidth (chunky) logging and telemetry channels for verbose data logging
• Capture tracing via core System.Diagnostics (or log4net, NLog, etc.) with:
  • WARN/ERROR -> table storage
  • VERBOSE/INFO -> blob storage
• Use run-time configurable logging channels to enable selective verbose logging to table (e.g. log just database information)
• Leverage the features of the core Diagnostic Monitor; use custom directory monitoring to copy files to blob storage
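The severity split above (WARN/ERROR to the queryable hot store, VERBOSE/INFO to cheap bulk storage) is just two handlers with different level filters on one logger. The deck does this with System.Diagnostics or NLog; a minimal Python sketch of the same routing, with in-memory sinks standing in for table and blob storage:

```python
import logging

def build_logger(hot_sink, bulk_sink):
    """Two-channel logger: high-value records go to `hot_sink` (stand-in
    for table storage), verbose records to `bulk_sink` (stand-in for
    blob storage)."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.DEBUG)
    logger.handlers.clear()

    hot = logging.StreamHandler(hot_sink)
    hot.setLevel(logging.WARNING)  # WARN/ERROR -> hot channel only

    bulk = logging.StreamHandler(bulk_sink)
    # Everything below WARNING -> bulk channel only.
    bulk.addFilter(lambda record: record.levelno < logging.WARNING)

    logger.addHandler(hot)
    logger.addHandler(bulk)
    return logger
```

The same pattern generalizes: the bulk channel can batch and compress before upload, while the hot channel stays small enough to query for alerting.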
Extending Diagnostics
[Diagram: the default WAD data flow extended with verbose channels – verbose performance counters, verbose event logs and verbose events routed to blob storage alongside the standard WAD tables]
Handling Transient Failures
Logging transient failures:
• Log all external API calls with timing
• Log the full exception (not just .ToString())
Demo: Logging and Retry with CloudFX
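"Log all external API calls with timing, and log the full exception" is easy to centralize in a wrapper so no call site can forget it. The demo uses CloudFX in .NET; this is only an illustrative Python equivalent (the decorator name `timed_call` is made up for the sketch):

```python
import functools
import logging
import time

log = logging.getLogger("external")

def timed_call(fn):
    """Log every external call with its duration; on failure, log the
    full exception (stack trace included, not just str(e))."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            result = fn(*args, **kwargs)
            log.info("%s succeeded in %.1f ms", fn.__name__,
                     (time.perf_counter() - start) * 1000)
            return result
        except Exception:
            # log.exception captures the whole traceback, the analogue of
            # logging the full exception rather than .ToString().
            log.exception("%s failed after %.1f ms", fn.__name__,
                          (time.perf_counter() - start) * 1000)
            raise
    return wrapper
```

Timing both outcomes matters: latency data from failed calls is what reveals whether retries are blowing the end-to-end budget.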
Demo: Multiple Logging Channels using NLog and WAD
Logging Configuration
Traditional .NET log configuration (System.Diagnostics) is hard-coded against System.Configuration (app.config/web.config) – an anti-pattern for Azure deployment. Leverage an external configuration store (e.g. the Service Configuration or blob storage) for run-time dynamic configuration.
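The shape of that external-store approach: periodically re-read settings from a source that can change after deployment, then apply them to the live logger. A tiny sketch under those assumptions (Python for illustration; `fetch_config` stands in for downloading a settings file from blob storage or reading the Service Configuration):

```python
import logging

def apply_log_config(logger: logging.Logger, fetch_config) -> None:
    """Pull logging settings from an external store and apply them live,
    so verbosity can be raised for diagnosis without a redeploy."""
    cfg = fetch_config()  # e.g. parsed from a blob polled on a timer
    logger.setLevel(getattr(logging, cfg.get("level", "INFO")))
```

Calling this on a timer (or on a configuration-changed event) is what makes selective verbose logging practical in production.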
Recap and Resources
Building big:
• The Availability Challenge
• Design for Failure
• Get Insight into Everything

Resources:
• Best Practices for the Design of Large-Scale Services on Windows Azure Cloud Services
• TODO: failsafe doc link
Resources
• Follow us on Twitter @WindowsAzure
• Get Started: www.windowsazure.com/build
Please submit session evals on the Build Windows 8 App or at http://aka.ms/BuildSessions
© 2012 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.