CAT: Azure SQL DB Premium – Deep Dive and Mythbuster
Ewan Fairweather
Senior Program Manager
Azure Customer Advisory Team
Tobias Ternstrom
Principal Program Manager
Data Platform Group
Cloud & Enterprise Customer Team
CAT
• Focus split: Customer 45%, Engineering 45%, Community 10%
• Customer: architecture guidance and technology expertise, i.e. patterns, practices and codification
• Community: accelerate cloud adoption, i.e. white-papers, events
• Engineering: frameworks, code and platform – provide an “end to end” Azure customer story on how features work in customer project scenarios, based on learnings from the biggest deployments
Europe:
- Azure Applications
- Azure Data
- Azure Analytics
Agenda
• Persistent data options in Azure
• Azure SQL DB Premium Deep Dive
• Sizing and capacity planning
• Customer experience and learnings
• Summary
Persistent Data Options in Azure
The Application Journey
Azure Storage Options
Platform as a Service
• Azure SQL Database (managed databases)
• Publish and run
• Shared environment
Infrastructure as a Service
• SQL Server running in a Windows Azure VM
• Or any other database you have bits for
• Full control / insight
• More administrative effort
Azure Storage
• Tables
• Blobs
• Queues
• Non-relational
• Cheap storage
• Optimized for density and scale out
Three different ways to run SQL
[Diagram: a spectrum from high “friction”/control with dedicated resources to low friction with shared resources]
• SQL Server on raw iron: scale-up, full h/w control, roll-your-own HA/DR/scale
• SQL Server in IaaS (virtualized machine): 100% of the API, virtualized, roll-your-own HA/DR/scale
• Azure SQL Database (PaaS, virtualized database, incl. Premium): auto HA, fault-tolerance, self-provisioning, mgmt @ scale
Decision Points
• Commonly going to WA Storage (point lookups, minimal relational)
  – Telemetry logs, append workloads, primarily key-value lookups
  – Blobs alongside WA SQL DB (lower cost, keep DB size under the 150 GB limit)
• Commonly going to SQL Server in a VM (lift and shift, DW)
  – Applications needing features not currently in SQL DB (example: full-text search)
  – Light DW workloads
• Commonly going to SQL DB (OLTP)
  – Applications that do not want to manage their databases
  – Applications that need massive horizontal scale (Internet-facing SaaS ISVs)
  – New OLTP applications
  – Premium DB extends Azure SQL DB’s capabilities
Typical Performance Factors

Factor                   | Why it matters
Latency                  | Greater than on-premises; higher variance
Establishing connections | Initial login goes to the gateway; connections are unreliable and will fail
Multi-tenancy            | Unpredictable performance; soft and hard throttling; shared log, max transaction size

• Writes are the most expensive resource in this system
SQL DB Web/Business Performance Variance
• Web/Business editions provide elastic scale without a performance SLA
• There is some variance in performance due to multi-tenancy; we will reduce the variance further over time
• SQL DB contains logic to move DBs around to balance load across each cluster and maximize average resources
[Chart: DB resources available over time; databases can get different resources based on others’ activity]
Resource management in Azure SQL DB
• SQL Database monitors usage of the shared resources to keep databases within resource limits
• When resource usage exceeds limits, SQL DB can manage usage at the DB or node level by killing connections or denying requests
  – Throttling stages: soft (subset of DBs) and hard (all DBs)
Decode type and resource:

Resource                 | Limit                                                                                                          | Error code
Database size            | 150 GB or less, depending on the database quota (MAXSIZE)                                                      | 40544
Transaction duration     | State 1: 24 hours; State 2: 20 seconds if a transaction locks a resource required by an underlying system task | 40549
Lock count               | 1 million locks per transaction                                                                                | 40550
TempDB                   | State 1: 5 GB of tempdb space; State 2: 2 GB per transaction in tempdb; State 3: 20% of total log space in tempdb | 40551
Transaction log space    | State 1: 2 GB per transaction; State 2: 20% of total log space                                                 | 40552
Memory                   | 16 MB memory grant for more than 20 seconds                                                                    | 40553
Worker thread governance | Every database has a maximum worker-thread concurrency limit                                                   | 10928, 10929
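When one of these limits is hit, the engine surfaces the error code to the client. Transient cases (e.g. 40501 service-busy, 10928/10929 worker-thread throttling) are worth retrying with backoff, while hard limits such as 40544 (database size) will not clear on their own. A minimal sketch of that retry pattern; the exception class and the choice of retryable codes are illustrative assumptions, not a specific driver’s API:

```python
import random
import time

# Assumption: treat service-busy and worker-thread throttling as transient.
# Other codes in the table (e.g. 40544 database size) are not retryable:
# the condition will not clear on its own.
RETRYABLE = {40501, 10928, 10929}

class SqlThrottledError(Exception):
    """Stand-in for a driver exception carrying a SQL Server error code."""
    def __init__(self, code):
        super().__init__(f"SQL error {code}")
        self.code = code

def execute_with_retry(operation, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Run operation(), retrying retryable throttling errors with backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlThrottledError as exc:
            if exc.code not in RETRYABLE or attempt == max_attempts:
                raise  # hard error, or retries exhausted
            # Exponential backoff with jitter so throttled clients do not
            # all hit the gateway again at the same instant.
            sleep(base_delay * (2 ** (attempt - 1)) * (1 + random.random()))
```

Injecting `sleep` keeps the sketch testable; real code would key off the actual error type exposed by the client library.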
Azure SQL DB Premium: How it works
Edition Comparison
• Premium has reserved resources on all 3 nodes
• You can upgrade or downgrade a database
• You should decide sizing based on your resource needs
[Chart: DB resources available over time for P1, P2 and Web/Business]
Premium Edition
• Some applications require guaranteed resources
• Premium Edition was introduced for customers who need dedicated resources
• Common customer attributes:
  – High throughput requirements
  – Low latency requirements
  – Low performance-variance requirements
• Premium Edition details:
  – Dedicated resources (min = max) to avoid performance variance
  – Different sizes (P1–P2) allow adjustment based on resource needs
  – Currently in Public Preview
Premium Edition Reservation Sizes
• Reservations are done separately for each database
  – Capacity is limited during public preview
  – Customers can get 1–2 reservations based on availability
• Monthly price is USD $930 for P1 at GA; P2 is 2x
• P3 and P4 sizes are available at engineering discretion

Size | CPU Cores | Worker Threads | Active Sessions | Disk IO (IOPS) | Memory (GB)
P1   | 1         | 200            | 2,000           | 150            | 8
P2   | 2         | 400            | 4,000           | 300            | 16
Premium Database
Set Premium Service Objective
Checking Status of Azure SQL DB
• The DB will remain online aside from a few seconds during the final failover
Checking Current SLO
Checking Status of Move
• Estimated duration ranges from about 15 minutes for an empty database to approximately 2 days for a 150 GB database
SQL Premium DB Size                      | SQL Premium GA Monthly Cost | SQL VM Size (Enterprise Edition)             | SQL VM Monthly Cost
P1 (M): 1 CPU core, 8 GB RAM, 150 IOPS   | $930                        | S (A1): 1 CPU core, 1.75 GB RAM, 2x500 IOPS  | $1,629
P2 (L): 2 CPU cores, 16 GB RAM, 300 IOPS | $1,860                      | M (A2): 2 CPU cores, 3.5 GB RAM, 4x500 IOPS  | $1,696
                                         |                             | L (A3): 4 CPU cores, 7 GB RAM, 8x500 IOPS    | $1,830
                                         |                             | A6: 4 CPU cores, 28 GB RAM, 8x500 IOPS       | $2,321
                                         |                             | XL (A4): 8 CPU cores, 14 GB RAM, 16x500 IOPS | $3,660
                                         |                             | A7: 8 CPU cores, 56 GB RAM, 16x500 IOPS      | $4,642
Premium DB or A Larger VM?
Sizing and Capacity Planning
Sizing Databases
• For a SINGLE database…
  – Find the largest resource consumer
  – Measure peak load over a time period
  – Choose an appropriate reservation size to handle the peak load
• Workload type matters
  – Batch processing: aim to achieve average throughput over time (don’t size for peak)
  – Interactive applications: size for the peak to preserve response times
[Chart: CPUAvgCoresUsedInHr (hourly average CPU cores used) over roughly a week]
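The sizing rule on this slide (interactive: size for the peak; batch: size for the sustained average) can be sketched against the P1/P2 core counts from the reservation table. The function and the capacity map are illustrative assumptions, not a Microsoft sizing tool:

```python
# Hypothetical helper: pick a reservation size from hourly avg-CPU-core
# samples (like the CPUAvgCoresUsedInHr chart). Core counts match the
# P1/P2 table earlier in the deck.
RESERVATION_CPU_CORES = {"P1": 1, "P2": 2}

def pick_reservation(hourly_cores_used, interactive=True):
    """Return the smallest size whose core count covers the workload."""
    # Interactive workloads must absorb the peak; batch workloads only
    # need enough sustained throughput, so the average suffices.
    demand = max(hourly_cores_used) if interactive else (
        sum(hourly_cores_used) / len(hourly_cores_used))
    for size, cores in sorted(RESERVATION_CPU_CORES.items(),
                              key=lambda kv: kv[1]):
        if demand <= cores:
            return size
    return None  # demand exceeds the largest reservation: consider scale-out
```

Note how a single spike above 1 core pushes an interactive workload to P2 while the same samples, treated as batch, may still fit in P1.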
Peak Load Example
• Weekly IO chart of a large customer on WA SQL DB
• We actively work on the load each week
  – Query tuning
  – Moving maintenance jobs to off-peak hours
• We also do aggressive things
  – Split different functions out into different databases
  – Rate-meter background jobs so they don’t impact core workloads
[Chart: Avg Hourly Physical Write IOPS (1 week), total, 2013-09-16 through 2013-09-22. Annotations: daily maintenance job moved to off-peak hours; weekly maintenance moved to Sunday; query tuning to reduce the daily peak]
Azure SQL Database DMV Surface Area
• Health (master): sys.event_log, sys.bandwidth_usage, sys.database_connection_stats
• Resource usage (master): sys.resource_usage*, sys.resource_stats*
• Data access & usage: sys.dm_db_index_usage_stats, sys.dm_db_missing_index_details, sys.dm_db_missing_index_groups, sys.dm_db_missing_index_group_stats, sys.dm_exec_sessions
• Performance: sys.dm_exec_query_stats, sys.dm_exec_sql_text, sys.dm_exec_query_plan, sys.dm_exec_requests, sys.dm_db_wait_stats

Windows Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted:
http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx
Capacity planning
• Use sys.resource_stats (in preview) in the master db to determine your application’s resource needs:

SELECT *
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())
Investigating resource usage

Percentage of time using more than 1 core:

SELECT
    (SELECT SUM(DATEDIFF(minute, start_time, end_time))
     FROM sys.resource_stats
     WHERE database_name = 'MyTestDB'
       AND start_time > DATEADD(day, -7, GETDATE())
       AND avg_cpu_cores_used > 1.0) * 1.0
    / SUM(DATEDIFF(minute, start_time, end_time)) AS percentage_more_than_1_core
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())

Avg and max resource usage:

SELECT
    AVG(avg_cpu_cores_used) AS 'Average CPU Cores Used',
    MAX(avg_cpu_cores_used) AS 'Maximum CPU Cores Used',
    AVG(avg_physical_read_iops + avg_physical_write_iops) AS 'Average Physical IOPS',
    MAX(avg_physical_read_iops + avg_physical_write_iops) AS 'Maximum Physical IOPS',
    AVG(active_memory_used_kb / (1024.0 * 1024.0)) AS 'Average Memory Used in GB',
    MAX(active_memory_used_kb / (1024.0 * 1024.0)) AS 'Maximum Memory Used in GB',
    AVG(active_session_count) AS 'Average # of Sessions',
    MAX(active_session_count) AS 'Maximum # of Sessions',
    AVG(active_worker_count) AS 'Average # of Workers',
    MAX(active_worker_count) AS 'Maximum # of Workers'
FROM sys.resource_stats
WHERE database_name = 'MyTestDB'
  AND start_time > DATEADD(day, -7, GETDATE())
Managing DB Resource Growth
• Assuming your application’s resource use grows over time, you need a plan to deal with that growth; in the box world we are always sizing for a future peak
• The cloud offers two architectural approaches to manage it, both elastic
  – “Scale-up” (limited): Web/Business -> P1 -> P2
  – “Scale-out”: use more databases
• Partitioning data by function or by tenant lets you adjust to growth in resource usage at the database level
• Plan on actively monitoring and alerting on telemetry about resource use so you can adjust to growth before something breaks…
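The “monitor and adjust before something breaks” advice can be made concrete with a simple headroom check over telemetry samples (for example, avg_cpu_cores_used values pulled from sys.resource_stats). The function, threshold, and the idea of sustained-breach counting are illustrative assumptions, not a built-in Azure alert:

```python
# Hypothetical alerting check: flag when several consecutive telemetry
# samples exceed a headroom fraction of the reservation's capacity, so
# you can scale up or partition before hitting hard limits.
def needs_scale_action(samples, capacity, headroom=0.8, sustained=3):
    """True if `sustained` consecutive samples exceed headroom * capacity."""
    limit = headroom * capacity
    run = 0  # length of the current run of over-limit samples
    for value in samples:
        run = run + 1 if value > limit else 0
        if run >= sustained:
            return True
    return False
```

Requiring a sustained run rather than a single spike avoids paging on the short peaks that interactive workloads are expected to absorb.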
Cost Optimization
• Two paths to improve your cloud service
  – Spend more money (purchase more capacity)
  – Optimize/tune (fit more operations in the capacity you have)
• The cloud model lets you choose
  – If you have development resources available, you might choose to tune
  – If you are on a deadline, you might just scale up instead
• This model also works great for seasonal demand changes
  – Example: add capacity before the holiday sales season, remove it after (~$32 per day for a P1)
Customer Experience and Learnings
What’s different with data access in the Cloud?
Two key areas of attention
• Connection management issues
  – Less reliable connection state due to multiple layers and network hops
  – Retry logic is mandatory to implement reliable communication between application and database server
• Higher latency between app tier and database tier compared to an on-premises deployment
  – Firewalls, load balancers, gateways
  – This amplifies the impact of chatty application behaviors
We will talk more about this in our 11:45 session
Batching inserts
• A time (t) or size (n) window approach can result in the loss of:
  – t seconds of data
  – n rows of data
[Diagram: application logic does async inserts into a data access layer that buffers and groups items; batches are bulk-inserted into Azure SQL Database]
Takeaways
• Reliability: plain ADO.NET single inserts with full retry logic
• Density: async and buffered approach
• How can I improve density?
  – Introduce batching
  – Reduce application round-trips
  – Improve insert performance
• Leverage an asynchronous approach
  – Buffer across time and number of insertions
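The buffer-then-batch flow in the diagram can be sketched as follows. The class and parameter names are illustrative, not a specific data-access library; the flush callback stands in for a bulk insert:

```python
import time

class InsertBuffer:
    """Sketch of the buffer-then-batch pattern: rows accumulate in the
    data access layer and are flushed as one batch when either the size
    window (n rows) or the time window (t seconds) is reached, bounding
    the potential loss to n rows / t seconds as the slide notes."""

    def __init__(self, flush, max_rows=100, max_age_s=1.0, clock=time.monotonic):
        self._flush = flush          # callable taking a list of rows
        self._max_rows = max_rows    # size window (n)
        self._max_age_s = max_age_s  # time window (t)
        self._clock = clock
        self._rows = []
        self._oldest = None          # arrival time of the oldest buffered row

    def add(self, row):
        if not self._rows:
            self._oldest = self._clock()
        self._rows.append(row)
        # Windows are checked on each add in this sketch; a real
        # implementation would also flush from a background timer.
        if (len(self._rows) >= self._max_rows
                or self._clock() - self._oldest >= self._max_age_s):
            self.flush()

    def flush(self):
        if self._rows:
            batch, self._rows = self._rows, []
            self._flush(batch)  # one round-trip instead of len(batch)
```

Each flush costs one round-trip regardless of batch size, which is exactly the density win the takeaways describe.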
Workload tuning options
Scale-Up vs. Scale-Out
• P1–P2 supported during the public preview period
• Additional sizes may be introduced by GA
• With a scale-up approach you may lose some flexibility
  – E.g. it requires planning for the worst case / peaks
  – Premium lets you scale up/down between P1 and P2 at most once a day
• Scale-up may not fit all cost/business models
  – Unpredictable workloads
  – Multiple-database deployments
Customer experience: Easyjet
• Seat selection system: 70/30 R/W, very efficient workload (<200 ms max exec time)
• Majority of queries benefitted from switching
• Reduced and more stable response times for both reads and writes
• Reduced impact of 40501, 10928 and 10929 errors
• Remaining exceptions have been mostly due to application issues
[Chart annotations: switch to Premium; major ticket sale; broken build]
Another Customer Experience
• Availability has greatly improved after the switch (less than 2 minutes per month)
• Growing trend in CPU usage
  – Around 2 on average, with spikes up to 5
• No major errors related to resource issues
• Sporadic throttling due to high log IO waits
[Chart annotation: switch to Premium]
Application-Tier Caching
• App-tier caching is a very effective way to reduce data-tier load
• Azure has several caching solutions available to you
• For load spikes, this can often significantly reduce peak load
• Example: Azure SQL DB was used in the last US Presidential Election
  – Few writes, massive reads all at once
  – App-tier caching was used to remove reads from the database
[Chart: CPU graph for the core reporting DB]
• First 10 seconds: 44K page views/second (est. ~450K DB calls/sec)
• Next 20 seconds: 10K page views/sec (est. ~100K DB calls/sec; DB calls mostly removed due to caching)
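The app-tier caching idea above is the classic cache-aside pattern: reads check the cache first and only fall through to the database on a miss, so the few-writes/massive-reads pattern mostly never reaches the data tier. A minimal sketch; the class, TTL knob, and `load` callback are illustrative, not one of Azure’s caching products:

```python
import time

class CacheAside:
    """Minimal cache-aside sketch: serve repeated reads from the app
    tier, call the database (the `load` callback) only on a miss or
    after the TTL expires."""

    def __init__(self, load, ttl_s=30.0, clock=time.monotonic):
        self._load = load
        self._ttl_s = ttl_s
        self._clock = clock
        self._entries = {}   # key -> (value, expiry time)
        self.db_calls = 0    # instrumentation for the sketch

    def get(self, key):
        hit = self._entries.get(key)
        if hit is not None and hit[1] > self._clock():
            return hit[0]                       # served from the app tier
        self.db_calls += 1
        value = self._load(key)                 # cache miss: one DB call
        self._entries[key] = (value, self._clock() + self._ttl_s)
        return value
```

With a spike of identical reads, `db_calls` grows with the number of distinct keys rather than the number of page views, which is the effect the CPU graph illustrates.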
Summary
• Premium DB provides predictable performance and elasticity
• We offer you a mixture of scale-up and scale-out approaches
• The elastic nature of these options allows you to deal with peaks in a different way than on-premises
Resources
• Premium Preview for SQL Database Guidance
  (http://msdn.microsoft.com/en-us/library/jj853352.aspx)
• Azure SQL Database and SQL Server -- Performance and Scalability Compared and Contrasted
  (http://msdn.microsoft.com/en-us/library/windowsazure/jj879332.aspx)
• Cloud Service Fundamentals in Windows Azure
  – Wiki: http://social.technet.microsoft.com/wiki/contents/articles/17987.cloud-service-fundamentals.aspx
  – Best practices on:
    • Scale-out architecture
    • Design for operations
    • Telemetry solution
    • Reliable architecture
THANK YOU!
• For attending this session and PASS SQLRally Nordic 2013, Stockholm