Visual Studio Windows Azure Portal Rest APIs / PS Cmdlets US-North Central Region FC TOR PDU Servers...

Igal Figlin, Sr. Program Manager Lead

Windows Azure Fabric Controller Internals: Building and Updating Highly Available Applications

MICROSOFT CONFIDENTIAL – INTERNAL ONLY

Session Objectives and Takeaways

Session Objective(s): Understand how Azure works behind the scenes and, as a result, be more of an expert to customers. Answer questions such as:
• How are customer services updated without downtime? Which updateability parameters can I fine-tune?
• Which types of hardware failures exist, and how can I configure a service to avoid impact?
• How does Windows Azure update its infrastructure, and how does this impact my service?
• How can Azure customers build automation to implement custom update workflows?

Key Takeaways:
• Plan for updates and redundancy when designing a service
• Select the best update modes for PaaS services
• Use Availability Sets for IaaS service tiers
• Design IaaS updates to ensure high availability for a redundant service


Scope and Agenda

Agenda:
• Intro to WA Internals
• Running Highly Available Cloud Services
• Running Highly Available Cloud Virtual Machines
• Infrastructure Operations Impacting Customer Services

In Scope:
• Understanding how the Windows Azure Fabric Controller works
• Understanding update and fault recovery of the Windows Azure Compute Services
• Understanding how to design Windows Azure Compute Services to ensure high availability and fault recovery

Out of Scope:
• Windows Azure introduction
• Storage considerations
• Performance
• Broad guidance on service architecture / app patterns

Questions at the end of each section.


Windows Azure

Windows Azure is an OS for the data center:
• Handles resource management, provisioning, and monitoring
• Manages application lifecycle
• Allows developers to concentrate on business logic

Windows Azure provides common building blocks for distributed applications:
• Compute resources, like Virtual Machines and Cloud Services
• Reliable queuing, simple structured storage, SQL storage
• Application services like access control, caching, and connectivity

Fabric Controller (FC) manages compute infrastructure:
• Deploys and manages the health of the compute services
• Manages datacenter infrastructure (hardware & software), recovers from failures
• Drives infrastructure updates

Deploying a Service to the Cloud: The 10K-Foot View

• Develop using Visual Studio or any other IDE
• Upload the package to the Windows Azure portal
  • Optionally using the Visual Studio developer upload experience
  • … or PowerShell/REST APIs through automation
  • The service package is passed to RDFE
• Red Dog Front End (RDFE) sends the service to a Fabric Controller (FC) based on service requirements
  • Region, affinity groups
  • Available resources
• Fabric Controller (FC) deploys the service to the right racks/servers

[Diagram: Visual Studio, the Windows Azure Portal, and REST APIs / PS cmdlets submit the package to the RDFE service, which passes it to one of the FCs in the US-North Central region.]
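
The routing step above (RDFE choosing an FC by region, affinity group, and available resources) can be pictured with a small sketch. Everything named here is hypothetical: the `Cluster` class, its fields, `pick_cluster`, and the "most free cores wins" tie-break are assumptions for illustration, not the actual RDFE logic.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Cluster:
    """Hypothetical view of the cluster managed by one Fabric Controller."""
    name: str
    region: str
    free_cores: int
    affinity_groups: Set[str] = field(default_factory=set)


def pick_cluster(clusters: List[Cluster], region: str, cores_needed: int,
                 affinity_group: Optional[str] = None) -> Optional[Cluster]:
    """Return a cluster that satisfies the service requirements, or None."""
    candidates = [
        c for c in clusters
        if c.region == region
        and c.free_cores >= cores_needed
        and (affinity_group is None or affinity_group in c.affinity_groups)
    ]
    # Assumed tie-break: prefer the cluster with the most free capacity.
    return max(candidates, key=lambda c: c.free_cores, default=None)


clusters = [
    Cluster("cluster-1", "US-North Central", free_cores=8),
    Cluster("cluster-2", "US-North Central", free_cores=64,
            affinity_groups={"ag-web"}),
    Cluster("cluster-3", "US-South Central", free_cores=128),
]
chosen = pick_cluster(clusters, "US-North Central", cores_needed=16)
print(chosen.name if chosen else "no capacity")
```

A request that names an affinity group no cluster hosts returns `None` in this sketch; a real front end would have its own fallback policy.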

Datacenter Architecture

• A region can comprise multiple datacenters
• Datacenters are divided into “clusters”
• Each rack provides a unit of fault isolation

[Diagram: datacenter routers connect to aggregation routers and load balancers (Agg), which provide cluster network aggregation for Clusters 1 to 5; the racks in each cluster have top-of-rack switches and power distribution units.]

Inside a Cluster

• Each cluster is managed by a Fabric Controller (FC)
  • Manages DC hardware and services
  • Allocates resources and manages lifecycles
• FC is a distributed, stateful application running on servers spread across racks
  • One FC instance is the primary, and all others keep their view of the world in sync

[Diagram: a cluster of Racks 1 to 20, each with a TOR switch; Fabric Controller instances run in different racks.]

Inside a Physical Server

• CPU, memory, disk, and networking resources are committed when allocating the service.

[Diagram: a physical server, attached to a PDU and a TOR switch, runs a host partition with the FC Host Agent, which reports to the Fabric Controller. Across the trust boundary run the guest VMs (a PaaS VM role instance and an IaaS VM role), with Guest Agents. CPUs are either assigned to VMs or left unallocated.]

Leveraging Fault Domains

• A fault domain is a physical unit of failure
• A rack can be considered a fault domain
• Node healing = moving VMs off the faulted server while keeping the allocation constraints

[Diagram: Rack 1 (Fault Domain 1) and Rack 2 (Fault Domain 2), each with its own TOR switch and PDU.]

PaaS: Leveraging Fault Domains

• FC deploys the role instances in (at least) two different fault domains
• Different roles are allocated to fault domains independently
• An even distribution is maintained when scaling up or down
• There is no way to control the fault domain mapping, but it can be queried for each role instance:
  • Portal
  • REST service management APIs (“FaultDomain”)
• Queuing can be defined between the layers (only load balancing by default)

[Diagram: the Azure Load Balancer in front of the Web Role, with Instance 0 in Fault Domain 0, Instance 1 in Fault Domain 1, and Instance 2 in Fault Domain 0; the Worker Role has Instance 0 in Fault Domain 0 and a second instance in Fault Domain 1.]
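
A minimal sketch of the even fault domain distribution described above, assuming a simple round-robin rule (instance i goes to fault domain i mod 2). The function name and the round-robin rule are illustrative assumptions; the slides only state that the distribution is even and is handled per role.

```python
from collections import Counter
from typing import List


def assign_fault_domains(instance_count: int, fd_count: int = 2) -> List[int]:
    """Round-robin placement: instance i goes to fault domain i % fd_count."""
    return [i % fd_count for i in range(instance_count)]


web = assign_fault_domains(3)     # matches the Web Role diagram: FD 0, 1, 0
worker = assign_fault_domains(2)  # each role is placed independently
print(web, worker)

# Even distribution: fault domain sizes differ by at most one instance.
sizes = Counter(web).values()
assert max(sizes) - min(sizes) <= 1
```

Losing one rack in this layout removes at most two of the three web instances and one of the two worker instances, so each role keeps serving.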

PaaS: Leveraging Update Domains

• Update Domains (UDs) control how the service is updated. A single UD is updated for a role at a time.
• Scenarios:
  • User initiated: the PaaS service owner updates the service package or chooses a different Guest OS
  • Platform initiated: the Guest OS for PaaS services is updated when a new version is released (e.g. security fixes); the server (hypervisor) is updated
• Implementation details:
  • Role instances are assigned to UDs circularly
  • UDs are aligned between the different roles
  • Up to 20 UDs per service (5 by default)

[Diagram: the Azure Load Balancer in front of the Web Role, with Instance 0 in Update Domain 0, Instance 1 in Update Domain 1, and Instance 2 in Update Domain 2; the Worker Role instances are in Update Domain 0 and Update Domain 1.]
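
The circular assignment of role instances to update domains can be sketched as follows. The function name is hypothetical; the default of 5 and the limit of 20 come from the slide, and the modulo rule is one straightforward reading of "circularly".

```python
from typing import Dict, List


def assign_update_domains(instance_count: int,
                          ud_count: int = 5) -> Dict[int, List[int]]:
    """Circular assignment: instance i lands in update domain i % ud_count."""
    if not 1 <= ud_count <= 20:
        raise ValueError("the slides give a limit of 20 update domains per service")
    domains: Dict[int, List[int]] = {}
    for i in range(instance_count):
        domains.setdefault(i % ud_count, []).append(i)
    return domains


web_uds = assign_update_domains(7)
for ud, instances in sorted(web_uds.items()):
    # Only the instances of one update domain are offline during each step.
    print(f"UD {ud}: updating instances {instances}")
```

With 7 instances and the default 5 UDs, update domains 0 and 1 hold two instances each, so at most two of the seven instances are offline during any step.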

PaaS Service Setup & Management Demo

Mapping Instances to UDs/FDs

• PaaS update domains are always spread across fault domains

Usage:
• Setting the update domain count: you can change the number of update domains in the ServiceDefinition.csdef file (upgradeDomainCount=n, requires service redeployment)
• Role instance count can be changed dynamically through a REST API Change Deployment Configuration request
• Determining update and fault domains for your instance through the REST management APIs: the RoleInstance element is read-only and gives role instance names and counts:
  • RoleInstance.Role.Name
  • RoleInstance.ID
  • RoleInstance.FaultDomain, RoleInstance.UpdateDomain

Web Role      FD0     FD1
UD0           IN_0
UD1                   IN_1
UD2           IN_2

Worker Role   FD0     FD1
UD0           IN_0
UD1                   IN_1
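
A sketch of reading the UD/FD mapping out of a management API response. The XML below is a hand-written sample, not captured output; its element names (`RoleInstanceList`, `InstanceName`, `InstanceUpgradeDomain`, `InstanceFaultDomain`) and namespace are modeled on the Service Management "Get Deployment" response and should be treated as assumptions to verify against the API reference.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

NS = {"wa": "http://schemas.microsoft.com/windowsazure"}

# Hand-written sample payload (assumed shape, see the note above).
SAMPLE = """
<Deployment xmlns="http://schemas.microsoft.com/windowsazure">
  <RoleInstanceList>
    <RoleInstance>
      <RoleName>WebRole</RoleName>
      <InstanceName>WebRole_IN_0</InstanceName>
      <InstanceUpgradeDomain>0</InstanceUpgradeDomain>
      <InstanceFaultDomain>0</InstanceFaultDomain>
    </RoleInstance>
    <RoleInstance>
      <RoleName>WebRole</RoleName>
      <InstanceName>WebRole_IN_1</InstanceName>
      <InstanceUpgradeDomain>1</InstanceUpgradeDomain>
      <InstanceFaultDomain>1</InstanceFaultDomain>
    </RoleInstance>
  </RoleInstanceList>
</Deployment>
"""


def domain_map(xml_text: str) -> List[Tuple[str, int, int]]:
    """Return (instance name, update domain, fault domain) per role instance."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    for inst in root.findall("wa:RoleInstanceList/wa:RoleInstance", NS):
        rows.append((
            inst.findtext("wa:InstanceName", namespaces=NS),
            int(inst.findtext("wa:InstanceUpgradeDomain", namespaces=NS)),
            int(inst.findtext("wa:InstanceFaultDomain", namespaces=NS)),
        ))
    return rows


for name, ud, fd in domain_map(SAMPLE):
    print(f"{name}: UD{ud} FD{fd}")
```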

Automating PaaS Service Updates

Method 1: Defining a service update mode (“mode” in configuration):
• Auto: UD walk by the Fabric Controller when a new package is uploaded (default)
• Manual: call Walk Upgrade Domain for each domain
• Simultaneous: update all role instances simultaneously, ignoring upgrade domains

Notes on usage:
• With a manual UD walk, it is possible to control the speed of the update (risk management) and to roll back the update when it is unsuccessful
• By combining simultaneous update with update events in each role instance, it is possible to build custom flows

Method 2: Swapping staging and production environments:
• Define a staging environment and swap between the production and staging environments
• Allows final validation before production entry

Changing service size based on load, time of day, auto-scaling, etc.:
• Calling Change Deployment Configuration with the new service instance count
• Calling Delete Role Instances
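
A minimal sketch of a manual UD walk with validation and rollback, as described in the usage notes. The three callables are hypothetical placeholders: in a real workflow `upgrade` would wrap the Walk Upgrade Domain call, while `is_healthy` and `rollback` would be the service owner's own checks and rollback procedure.

```python
from typing import Callable, List


def walk_update_domains(ud_count: int,
                        upgrade: Callable[[int], None],
                        is_healthy: Callable[[int], bool],
                        rollback: Callable[[int], None]) -> bool:
    """Upgrade one UD at a time; roll back walked UDs on the first failure."""
    walked: List[int] = []
    for ud in range(ud_count):
        upgrade(ud)
        walked.append(ud)
        if not is_healthy(ud):
            # Undo in reverse order so the newest change is reverted first.
            for done in reversed(walked):
                rollback(done)
            return False
    return True


log: List[str] = []
ok = walk_update_domains(
    ud_count=3,
    upgrade=lambda ud: log.append(f"upgrade UD{ud}"),
    is_healthy=lambda ud: ud != 1,  # simulate a failed validation in UD1
    rollback=lambda ud: log.append(f"rollback UD{ud}"),
)
print(ok, log)
```

The walk stops at the first unhealthy domain, so UD2 is never touched in this simulated run.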

Migrated 3-Tier Enterprise Application Sample

• Sample application to demonstrate Windows Azure usage (application migrated from customer premises).
• Sample application specifics:
  • High redundancy for each component
  • Load balancer for the front end
  • Data layer can be implemented by SQL Server or SQL Azure (here); alternatively, it can utilize Windows Azure storage
• Set up the whole application in the same affinity group to gain physical proximity

[Diagram: the Azure Load Balancer in front of three Front End VMs in the Frontend Availability Set; queueing or load balancing between the front end and two Backend VMs in the Backend Availability Set; geo-distributed storage or SQL Azure as the data layer.]

IaaS: Leveraging Availability Sets

• Availability sets instruct the platform how to allocate VMs in the datacenters so that the impact of hardware faults and infrastructure updates is isolated.
• Availability sets are defined through the portal or REST APIs.
• An availability set has to be defined for each redundant application tier to achieve the 99.95% SLA
  • No SLA is offered unless at least two VM instances are defined and used in each availability set
• The application SLA is compositional: it is the product of the SLAs of its components (each tier, compute, networking, etc.)
  • e.g. an unavailable front end may make the entire service unavailable.
• There is no correspondence between the fault domains used in different availability sets
  • Thus, queuing or load balancing is added between the availability sets

[Diagram: the Azure Load Balancer in front of Front End VMs and Backend VMs, each labeled with its fault domain (Fault Domain 1 or Fault Domain 2).]
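
The compositional SLA point can be made concrete with a short calculation. This sketch assumes the components fail independently and that the service needs every component to be up; the two-tier example and the function name are illustrative.

```python
import math
from typing import Iterable


def composite_sla(component_slas: Iterable[float]) -> float:
    """Availability of a chain of components that must all be up."""
    return math.prod(component_slas)


# Two tiers (front end and backend), each in its own availability set at 99.95%.
two_tiers = composite_sla([0.9995, 0.9995])
print(f"{two_tiers:.4%}")
```

Two tiers at 99.95% each compose to roughly 99.90%, which is why every additional tier or dependency lowers the availability the application as a whole can claim.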

IaaS: Updating the Service and the Infrastructure

• Scenario: platform-initiated update of the servers which run the IaaS VM instances.
• Goal: high redundancy for the IaaS service
• Each role (VM) is allocated to a different update domain (up to 5)
  • When physical servers are updated, only a fraction of the capacity (one update domain or less) is touched at a time.
• There is no mapping between update domains in different availability sets.
• Updating the IaaS service is the customer's responsibility.
  • In some cases a customer VM update and an infrastructure update can happen at the same time.
  • IaaS update notifications are sent to avoid this.
• Hardware failures can occur at any time. Thus, a platform update plus a hardware failure could still cause a service outage for dual-VM availability sets.

[Diagram: the Azure Load Balancer in front of Front End VMs and Backend VMs, each placed in its own update domain.]
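
To see how much capacity a platform update can take offline at once, the sketch below computes the largest share of VMs that sit in a single update domain. It assumes VMs are spread over the update domains circularly, as described earlier for PaaS role instances; the function name is hypothetical.

```python
import math


def worst_case_offline_fraction(vm_count: int, ud_count: int = 5) -> float:
    """Largest share of VMs in a single UD, assuming circular assignment."""
    if vm_count < 1 or ud_count < 1:
        raise ValueError("vm_count and ud_count must be positive")
    used_uds = min(vm_count, ud_count)
    largest_ud = math.ceil(vm_count / used_uds)
    return largest_ud / vm_count


for n in (2, 5, 12):
    print(n, f"{worst_case_offline_fraction(n):.0%}")
```

For a dual-VM availability set the result is 50%: while one VM is being updated, a hardware failure on the other one takes the tier down, which is the outage case noted above.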

Managing IaaS Service Demo

High Availability IaaS VMs Usage Guidance

Using IaaS VMs correctly:
• Analogous to deploying different PaaS services for each tier
• The update strategy should be clear up front
• Use Availability Sets to get the platform scenario working correctly
• Do not use single-instance availability sets for production applications
• Each availability set is completely independent from the infrastructure standpoint
• Mix PaaS roles and IaaS availability sets as needed
• Use Affinity Groups to enforce physical proximity of the different services

Defining and Updating VMs in an Availability Set

Guidance for administrator-initiated IaaS updates:
• Update one update domain at a time
• When removing, restarting, or shutting down VMs, make sure to keep the remaining VMs evenly distributed across FDs and UDs
• Prepare for and detect a platform update happening in parallel; the same goes for server hardware failures
• Validate the status of the VMs before walking the next UD
• 3 UDs will minimize the risk of collision with a platform update
• Single IaaS instances will get a notification before the update
• Add service auto-scaling:
  • Capture a role from an existing stopped VM, or pre-create it
  • Add a new role from it
  • Shut down / delete the role when scaling down
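
One way to follow the "keep the remaining VMs evenly distributed" guidance when scaling down is to remove a VM from the most crowded fault domain, breaking ties by the most crowded update domain. The `VM` tuple, the sample tier, and this selection rule are all illustrative assumptions, not a platform API.

```python
from collections import Counter
from typing import List, NamedTuple


class VM(NamedTuple):
    name: str
    fd: int  # fault domain
    ud: int  # update domain


def pick_vm_to_remove(vms: List[VM]) -> VM:
    """Pick a VM from the most crowded fault domain, then update domain."""
    fd_sizes = Counter(vm.fd for vm in vms)
    ud_sizes = Counter(vm.ud for vm in vms)
    return max(vms, key=lambda vm: (fd_sizes[vm.fd], ud_sizes[vm.ud]))


tier = [
    VM("fe-0", 0, 0), VM("fe-1", 1, 1), VM("fe-2", 0, 2),
    VM("fe-3", 1, 0), VM("fe-4", 0, 1),
]
victim = pick_vm_to_remove(tier)
remaining = [vm for vm in tier if vm != victim]
print(victim.name, Counter(vm.fd for vm in remaining))
```

After the removal in this example, both fault domains hold two VMs and no update domain holds more than two.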

Infrastructure Operations

Server, VM and Role Health Maintenance

• FC maintains service availability by monitoring the software and hardware health
  • Based primarily on heartbeats
  • Automatically “heals” affected roles

Symptom: issue with customer code or a customer VM
  Healing operation: reboot the VM(s)
  Potential causes:
  • Role instance or Guest OS crash (PaaS)
  • Customer OS crash (IaaS)

Symptom: issue with a physical server or rack
  Healing operation: allocate the impacted customer VMs to different server(s)
  Potential causes:
  • Physical server software failure
  • Physical server hardware failure
  • Rack / PDU / ToR failure
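
A sketch of heartbeat-driven healing that maps the two symptoms above to their healing operations. The timeout value, the function name, and the idea of tracking one host heartbeat and one VM heartbeat are assumptions for illustration; the slides only say that monitoring is based primarily on heartbeats.

```python
from typing import Optional

HEARTBEAT_TIMEOUT_S = 30.0  # hypothetical threshold, not from the slides


def healing_action(last_vm_beat: float, last_host_beat: float,
                   now: float) -> Optional[str]:
    """Map missed heartbeats to a healing operation, or None if healthy."""
    if now - last_host_beat > HEARTBEAT_TIMEOUT_S:
        # The server (or its rack) is suspect: place the VMs elsewhere.
        return "reallocate VMs to different server(s)"
    if now - last_vm_beat > HEARTBEAT_TIMEOUT_S:
        # Only the guest is silent: customer code or guest OS problem.
        return "reboot the VM"
    return None


print(healing_action(last_vm_beat=100.0, last_host_beat=100.0, now=110.0))
print(healing_action(last_vm_beat=50.0, last_host_beat=100.0, now=110.0))
print(healing_action(last_vm_beat=50.0, last_host_beat=50.0, now=110.0))
```

The host check comes first because a silent server makes the state of its VMs unknowable, so rebooting a single guest would not help.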

Updating the Host OS

• Initiated by the Windows Azure team (~once per month)
• Goal: update all machines as quickly as possible
• VMs might be rebooted when the server is updated (new OS, BIOS, etc.)
• Constraint: must not violate the service SLA
• Algorithm: the Fabric Controller performs a UD walk (keeping the UD constraints for each service)
  • Each server is updated in such a way that it does not violate the UD and FD constraints of the services utilizing it
  • Might take many hours for services with a large UD count
• Note: your role instance keeps the same VM and VHDs, preserving cached data in the resource volume
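
The UD constraint on host updates can be sketched as a check on a proposed batch of servers: the batch is acceptable only if, for every service, it touches VMs from at most one update domain. The data layout, names, and sample placement are hypothetical, and the sketch ignores the FD constraint for brevity.

```python
from typing import Dict, List, Set, Tuple

# server name -> (service, update domain) pairs for the VMs it hosts
Placement = Dict[str, List[Tuple[str, int]]]


def batch_respects_uds(batch: List[str], placement: Placement) -> bool:
    """True if, for every service, the batch touches at most one UD."""
    touched: Dict[str, Set[int]] = {}
    for server in batch:
        for service, ud in placement.get(server, []):
            touched.setdefault(service, set()).add(ud)
    return all(len(uds) <= 1 for uds in touched.values())


placement: Placement = {
    "server-a": [("shop", 0), ("blog", 1)],
    "server-b": [("shop", 0)],
    "server-c": [("shop", 1), ("blog", 1)],
}
print(batch_respects_uds(["server-a", "server-b"], placement))
print(batch_respects_uds(["server-a", "server-c"], placement))
```

Updating server-a and server-c together would take two update domains of the "shop" service offline at once, so that batch is rejected. Servers shared by many services with many UDs are harder to batch, which fits the note that the walk might take many hours.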

Summary: Highly Available Cloud Services vs. Azure VMs

Aspect: Fault Domain count
  Cloud Services (PaaS): Two per role
  Azure VMs (IaaS): Two per Availability Set

Aspect: Update Domain count
  Cloud Services (PaaS): Five by default; up to twenty
  Azure VMs (IaaS): Five

Aspect: Platform update
  Cloud Services (PaaS): UD by UD
  Azure VMs (IaaS): UD by UD

Aspect: Administrator-initiated update
  Cloud Services (PaaS): UD by UD, or Blast, or customer-controlled UD walk, or VIP-Swap
  Azure VMs (IaaS): Administrator controlled (can be automated using PowerShell or REST management APIs)

Aspect: Frontend and backend highly available addressability
  Cloud Services (PaaS): Windows Azure provides a load balancer per role; queuing recommended for backend roles
  Azure VMs (IaaS): Administrator defines endpoints in VMs and maps them to a load-balanced set; queuing recommended for backend roles

Aspect: SLA
  Cloud Services (PaaS): 99.95% uptime for roles with two or more role instances
  Azure VMs (IaaS): 99.95% uptime for Availability Sets with two or more VMs

Aspect: Multi-service collocation
  Cloud Services (PaaS): Yes, using Affinity Groups
  Azure VMs (IaaS): Yes, using Affinity Groups

Aspect: UD/FD automated management when the service grows or shrinks
  Cloud Services (PaaS): Yes (except when deleting a specific instance)
  Azure VMs (IaaS): Yes when the service grows; no when it shrinks

In Review: Key Takeaways

• Plan for updates and redundancy when designing a service
• Select the best update mode for PaaS services; utilize update notifications as needed
• Use Availability Sets for IaaS service tiers
• Design IaaS updates to ensure high availability for a redundant service


© 2014 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries. The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
