OutSystems - A Framework Approach for Troubleshooting - NextStep 2012
Transcript of OutSystems - A Framework Approach for Troubleshooting - NextStep 2012
www.outsystems.com
1 © 2012 outsystems – all rights reserved
Performance: Troubleshooting and Monitoring. A framework approach
Paulo Cunha, Solutions Delivery
NextStep 2012
Performance Troubleshooting – Motivations for a framework
• Deal with emergency scenarios
• Quick and accurate diagnosis
• Systematic approach
• Common metrics and use cases
Performance Troubleshooting – Where is the fire?
[Diagram: Agile Platform environment with Client, Load Balancer, Frontend 1, Frontend 2, Database and External Systems]
Performance Troubleshooting – Designing the framework
1st Problem classification
  • Where does it happen?
  • When does it happen?
2nd Identify possible causes
  • Application
  • Infrastructure
3rd Identify resolution strategies
  • Dig deeper
  • Apply a known solution
Performance Troubleshooting – Designing the framework: the Where × When matrix
When \ Where   | All Applications | Specific Application | Specific Operation
All the time   | Pattern 1        | Pattern 3            | Pattern 3
Peak Hours     | Pattern 2        | Pattern 3            | Pattern 3
Periodically   | Pattern 4        | Pattern 4            | Pattern 4
Off Peak Hours | Pattern 5        | Pattern 5            | Pattern 5
Performance Troubleshooting Framework – Pattern 1 (Where: All Applications; When: All the time)
Possible cause: a horizontal bottleneck affecting every application, usually in the database
Strategy:
• Check the Service Center Slow SQL report
• Check the database server performance counters: CPU, memory, disk
Performance Troubleshooting Framework – Pattern 1 – Example: Travel Search Web Site
High-load system: 99.9% availability, 4M searches/month, 7M daily web hits, 50K daily visitors
1. Symptoms
  Where: low performance on the web site
  When: during peak hours, which for this 24/7 site means all the time
2. Diagnosis
  o Slow SQL report: queries taking too long
  o DB server CPU at ~100%
  o SQL Server’s execution plan cache too large
    • Overuse of Expand Inline parameters
    • Platform inefficiency detected in handling variable-length data types
3. Resolution
  o Containment measure: clear the execution plan cache
      DBCC FREEPROCCACHE
  o Remove Expand Inline parameters from queries
  o Agile Platform optimization at the query parameterization level
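To see why Expand Inline parameters bloat the plan cache, the sketch below counts the distinct statement texts sent to the database when a value is pasted into the SQL versus bound as a parameter. It assumes, as a simplification, a plan cache keyed on the exact statement text (SQL Server caches ad hoc plans roughly this way, although it can auto-parameterize some simple queries). SQLite only stands in for the database engine, and the `city` table is hypothetical.

```python
import sqlite3

def distinct_statement_texts(conn, ids, inline):
    """Run one lookup per id and return the set of distinct SQL texts issued."""
    texts = set()
    for city_id in ids:
        if inline:
            # Value pasted into the SQL text, as an Expand Inline parameter does:
            # every distinct value produces a distinct statement (and plan).
            sql = f"SELECT name FROM city WHERE id = {int(city_id)}"
            conn.execute(sql).fetchall()
        else:
            # Value bound as a parameter: the statement text never changes.
            sql = "SELECT name FROM city WHERE id = ?"
            conn.execute(sql, (city_id,)).fetchall()
        texts.add(sql)
    return texts

# Hypothetical table with 100 rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE city (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO city VALUES (?, ?)",
                 [(i, f"city{i}") for i in range(100)])

inline_texts = distinct_statement_texts(conn, range(100), inline=True)
bound_texts = distinct_statement_texts(conn, range(100), inline=False)
print(len(inline_texts), len(bound_texts))  # 100 1
```

Under the text-keyed assumption, 100 inlined lookups would leave 100 cached plans behind, while the parameterized version reuses one.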
Performance Troubleshooting Framework – Pattern 2 (Where: All Applications; When: Peak Hours)
Possible cause: the infrastructure cannot handle the generated load, usually at the front-end or database level
Strategy:
• Check the Service Center reports: Slow SQL, Slow Screens
• Check the front-end (FE) and DB server performance counters: CPU, memory, disk
Performance Troubleshooting Framework – Pattern 2 – Example: Insurance Business Application
Core system: 2M Software Units, 300 GB database, 400K daily web hits (200K in May 2011), 600 daily users (300 in May 2011)
1. Symptoms
  Where: low performance on all applications
  When: during the day, i.e. peak hours
2. Diagnosis
  o Slow SQL and Slow Screens reports
    • Verified correlation between the top queries and the top screens
  o DB server CPU at 100%, memory at ~99%
  o Inadequate DB server hardware sizing
  o Application data model inefficiencies
    • Big datasets, fragmented indexes
3. Resolution
  o Data model and query optimizations
    • Add/remove and defragment indexes
    • Split queries and remove Expand Inline parameters
    • Force TOPs, avoid UNIONs
  o Application logic improvements
    • Reschedule timers away from daytime operations
    • Enforce refined searches (reduced datasets) and use flat tables for searches
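Enforcing refined searches and forcing a TOP can be combined in the search logic itself. A minimal sketch with SQLite, where `LIMIT` plays the role of SQL Server's `TOP`; the flat `policy_search` table, its columns and the 50-row cap are hypothetical.

```python
import sqlite3

MAX_ROWS = 50  # hypothetical cap, the equivalent of forcing SELECT TOP 50

def search_policies(conn, customer_prefix=None, policy_type=None):
    """Run a search only when at least one filter is given, capped at MAX_ROWS."""
    filters, params = [], []
    if customer_prefix:
        filters.append("customer_name LIKE ?")
        params.append(customer_prefix + "%")
    if policy_type:
        filters.append("policy_type = ?")
        params.append(policy_type)
    if not filters:
        # Enforce refined searches: no unbounded "show everything" queries.
        raise ValueError("refine the search: at least one filter is required")
    sql = ("SELECT id, customer_name, policy_type FROM policy_search WHERE "
           + " AND ".join(filters) + " ORDER BY id LIMIT ?")
    return conn.execute(sql, params + [MAX_ROWS]).fetchall()

# Hypothetical flat (denormalized) search table with 1000 policies.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE policy_search "
             "(id INTEGER PRIMARY KEY, customer_name TEXT, policy_type TEXT)")
conn.executemany("INSERT INTO policy_search VALUES (?, ?, ?)",
                 [(i, f"Customer {i}", "AUTO" if i % 2 else "HOME")
                  for i in range(1000)])

rows = search_policies(conn, policy_type="AUTO")
print(len(rows))  # 50
```

A flat table keeps the search to a single table scan or index seek, and the cap bounds the dataset regardless of how broad the filter is.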
Performance Troubleshooting Framework – Pattern 3 (Where: Specific Application or Specific Operation; When: All the time or Peak Hours)
Possible causes:
• Bottleneck in the application/operation data model, integrations or architecture (bad design)
• IIS worker process recycling (.NET stack)
Strategy:
• Check the Service Center reports: Slow SQL / Screens / Extensions / Web References
• Check the Windows Event Viewer on the front-end servers (FEs) for IIS messages
• Review the application/operation implementation
Performance Troubleshooting Framework – Pattern 4 (Where: All Applications, Specific Application or Specific Operation; When: Periodically)
Possible causes:
• Timers (asynchronous processing)
• IIS worker process recycling (.NET stack)
• Bottleneck in the application/operation data model, integrations or architecture (bad design)
Strategy:
• Correlate the timer and screen logs for that period
• Check the Service Center reports for that period: Slow Timers / Screens / Extensions / Web References
• Check the Windows Event Viewer on the FEs for IIS messages
• Review the application/operation implementation
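Correlating the timer and screen logs boils down to checking which slow requests fall inside a timer run. A minimal sketch, assuming both logs have been exported as timestamps; the timer names, screen names and times are hypothetical.

```python
from datetime import datetime

def overlaps(timer_runs, slow_requests):
    """Pair each slow screen request with every timer run active at that moment.

    timer_runs: (timer name, start, end) tuples; slow_requests: (screen, time).
    """
    matches = []
    for screen, at in slow_requests:
        for timer, start, end in timer_runs:
            if start <= at <= end:
                matches.append((screen, timer))
    return matches

t = datetime.fromisoformat
# Hypothetical extracts from the timer log and the screen log for one day.
timer_runs = [("SyncCustomers", t("2012-05-10 10:00"), t("2012-05-10 10:20")),
              ("SendInvoices", t("2012-05-10 14:00"), t("2012-05-10 14:05"))]
slow_requests = [("Dashboard", t("2012-05-10 10:07")),
                 ("Search", t("2012-05-10 12:30")),
                 ("Dashboard", t("2012-05-10 14:03"))]
print(overlaps(timer_runs, slow_requests))
# [('Dashboard', 'SyncCustomers'), ('Dashboard', 'SendInvoices')]
```

Slow requests with no matching timer run (Search at 12:30 here) point away from timers and toward the other causes, such as IIS recycles or the application design.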
Performance Troubleshooting Framework – Pattern 5 (Where: All Applications, Specific Application or Specific Operation; When: Off Peak Hours)
Possible causes:
• Maintenance tasks (DB, antivirus)
• Timers (asynchronous processing)
• Bottleneck in the application/operation data model, integrations or architecture (bad design)
Strategy:
• Check the DB maintenance task history
• Check the servers’ scheduled tasks and antivirus configurations
• Check the Service Center reports: Slow Timers / Screens / Extensions / Web References
• Review the application/operation implementation
Performance Troubleshooting Framework – Pattern 5 – Example: Energy Billing System
Critical operation: batch processing with long timer runs, 800 GB database
1. Symptoms
  Where: low performance/timeouts on a specific operation
  When: at night, i.e. off peak hours
2. Diagnosis
  o Slow Timers and Slow SQL reports
  o DB maintenance tasks taking 15 hours
  o Timer executions colliding with DB maintenance tasks
  o Timer performance degrading as data grows
3. Resolution
  o Containment measure: increase timer timeouts
  o Optimize DB maintenance tasks
    • Choose between reorganizing and rebuilding indexes
  o Reduce the dataset to be processed
    • Split batches, reorganize the data model
    • Archive old data
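Splitting batches is often paired with a time budget, so that a timer stops cleanly before its timeout and a later run picks up the remaining work. A minimal sketch under that assumption; `handle` stands for hypothetical per-batch billing logic, and the batch size and budget are illustrative.

```python
import time
from itertools import islice

def process_in_batches(pending, handle, batch_size=500, time_budget_s=60.0):
    """Process `pending` ids in batches, stopping once the time budget is spent.

    Returns the ids left unprocessed, so a later run can resume from there.
    """
    started = time.monotonic()
    it = iter(pending)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return []  # everything processed
        handle(batch)  # hypothetical: one short transaction per batch
        if time.monotonic() - started > time_budget_s:
            return list(it)  # leave the rest for the next timer run

processed = []
remaining = process_in_batches(range(1200), processed.extend, batch_size=500)
print(len(processed), len(remaining))  # 1200 0
```

Because each batch is short, no single transaction holds locks for long, and a run that collides with maintenance tasks loses at most one batch of progress.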
Performance Troubleshooting – The Framework
When \ Where   | All Applications               | Specific Application                       | Specific Operation
All the time   | Database                       | Application Design, IIS Worker Processes   | Application Design
Peak Hours     | Database, IIS Worker Processes | Application Design, IIS Worker Processes   | Application Design
Periodically   | Timers, IIS Worker Processes   | Timers, IIS Worker Processes, Integrations | Timers, Application Design
Off Peak Hours | Timers, Maintenance Tasks      | Timers, Maintenance Tasks                  | Timers, Maintenance Tasks, Application Design
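Once where and when are known, the framework works as a simple lookup. A minimal sketch that encodes the pattern matrix and the cause table as read from the slides (the Periodically row in particular is a best reading of the original layout); the function and constant names are hypothetical.

```python
WHERE = ("all applications", "specific application", "specific operation")
WHEN = ("all the time", "peak hours", "periodically", "off peak hours")

# Candidate causes per When row, one list per Where column, as read from the table.
CAUSES = {
    "all the time": (["Database"],
                     ["Application Design", "IIS Worker Processes"],
                     ["Application Design"]),
    "peak hours": (["Database", "IIS Worker Processes"],
                   ["Application Design", "IIS Worker Processes"],
                   ["Application Design"]),
    "periodically": (["Timers", "IIS Worker Processes"],
                     ["Timers", "IIS Worker Processes", "Integrations"],
                     ["Timers", "Application Design"]),
    "off peak hours": (["Timers", "Maintenance Tasks"],
                       ["Timers", "Maintenance Tasks"],
                       ["Timers", "Maintenance Tasks", "Application Design"]),
}

def pattern(where, when):
    """Pattern number from the Where x When matrix."""
    if when == "periodically":
        return 4
    if when == "off peak hours":
        return 5
    if where == "all applications":
        return 1 if when == "all the time" else 2
    return 3  # specific application/operation, all the time or peak hours

def triage(where, when):
    """Return (pattern number, candidate causes) for a problem classification."""
    where, when = where.lower(), when.lower()
    if where not in WHERE or when not in WHEN:
        raise ValueError(f"unknown classification: {where!r}, {when!r}")
    return pattern(where, when), CAUSES[when][WHERE.index(where)]

print(triage("All Applications", "Peak Hours"))
# (2, ['Database', 'IIS Worker Processes'])
```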
Performance Data Tools – How to gather performance data
Performance Data Tools – Three layers to gather performance metrics
• Application
• Application Server
• Infrastructure
Performance Data Tools – Infrastructure layer (.NET stack)
Use Windows Performance Counters: Start menu > Control Panel > Administrative Tools > Performance Monitor
Performance Data Tools – Infrastructure layer (.NET stack)
Keep counter values below these thresholds:
Performance Counter                  | Threshold
\Processor(_Total)\% Processor Time  | Depends on the server role: FE < 40%, DB < 60%
\Memory\Pages/sec                    | < 1000 at all times
\PhysicalDisk\Avg. Disk Queue Length | < 2 for each physical disk drive
\TCPv4\Connections Established       | < (100 × number of worker processes + 50) × 2, and < 3900
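The thresholds can be checked mechanically against sampled counter values. A minimal sketch, assuming the values have already been read from Performance Monitor (for example from a logged CSV) into a dict; the function names and sample values are hypothetical, and the TCP limit follows the formula in the table, kept below 3900.

```python
# "% Processor Time" limit depends on the server role.
CPU_LIMIT = {"FE": 40.0, "DB": 60.0}

def tcp_connection_limit(worker_processes):
    """(100 * #worker processes + 50) * 2, and in any case no more than 3900."""
    return min((100 * worker_processes + 50) * 2, 3900)

def breaches(role, sample, worker_processes):
    """Return the counters in `sample` whose value is at or above the threshold.

    `sample` maps counter paths to values; the disk queue value is assumed to
    be for a single physical drive (check each drive separately).
    """
    limits = {
        r"\Processor(_Total)\% Processor Time": CPU_LIMIT[role],
        r"\Memory\Pages/sec": 1000,
        r"\PhysicalDisk\Avg. Disk Queue Length": 2,
        r"\TCPv4\Connections Established": tcp_connection_limit(worker_processes),
    }
    return [name for name, value in sample.items()
            if name in limits and value >= limits[name]]

# Hypothetical sample from a front-end server running 2 worker processes.
fe_sample = {
    r"\Processor(_Total)\% Processor Time": 55.0,
    r"\Memory\Pages/sec": 300,
    r"\PhysicalDisk\Avg. Disk Queue Length": 1.2,
    r"\TCPv4\Connections Established": 520,
}
for counter in breaches("FE", fe_sample, worker_processes=2):
    print("over threshold:", counter)
```

With 2 worker processes the TCP limit is (100 × 2 + 50) × 2 = 500, so both the CPU value (55% on an FE) and the 520 connections are flagged.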
Performance Data Tools – Infrastructure layer (.NET stack)
[Diagram: Agile Platform environment (Client, Load Balancer, Frontend 1, Frontend 2, Database, External Systems) annotated with CPU, RAM, disk and network]
Performance Data Tools – Application Server layer (.NET stack)
Use the Windows Event Viewer to check for IIS events: Start menu > Control Panel > Administrative Tools > Event Viewer
Performance Data Tools – Application Server layer (.NET stack)
Make sure the IIS application pools are properly configured: follow the “Tuning and Security Check list” in the Agile Platform .NET Install Checklist.
Event                      | Threshold
IIS Worker Process recycle | Recycling should only occur when scheduled, during off hours
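Recycle events copied from the Event Viewer can be screened against the off-hours window. A minimal sketch; the 22:00 to 06:00 window and the timestamps are hypothetical, and it only checks the time of day, not whether the recycle was actually scheduled.

```python
from datetime import datetime, time

# Hypothetical off-hours window for scheduled recycles: 22:00 to 06:00.
OFF_HOURS_START = time(22, 0)
OFF_HOURS_END = time(6, 0)

def in_off_hours(ts):
    """True if the timestamp falls in the window, which wraps past midnight."""
    t = ts.time()
    return t >= OFF_HOURS_START or t < OFF_HOURS_END

def unexpected_recycles(recycle_times):
    """Recycle events that happened outside the off-hours window."""
    return [ts for ts in recycle_times if not in_off_hours(ts)]

# Hypothetical recycle timestamps copied from the Event Viewer.
events = [datetime(2012, 5, 10, 3, 0),
          datetime(2012, 5, 10, 11, 15),
          datetime(2012, 5, 10, 23, 30)]
for ts in unexpected_recycles(events):
    print("recycle during business hours:", ts.isoformat())
```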
Performance Data Tools – Application Server layer (.NET stack)
[Diagram: Agile Platform environment annotated with CPU, RAM, disk, network and IIS worker process (WP) recycles]
Performance Data Tools – Application layer
Use the Agile Platform’s Service Center reports: Service Center > Analytics > Reports
Performance Data Tools – Application layer
Service Center Report | Threshold
Slow SQL              | < 100 occurrences with an average duration of 500 ms or more
Slow Screen           | < 100 occurrences with an average duration over 1 s
Slow Web Service      | < 100 occurrences with an average duration over 1 s
Slow Web Reference    | < 100 occurrences with an average duration over 1 s
Slow Extension        | < 100 occurrences with an average duration over 1 s
Slow Timer            | Depends on the business logic
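One reading of these thresholds, applied per report entry, is that anything reaching 100 or more occurrences at or above the average-duration limit needs attention. A minimal sketch under that interpretation; the row format and entry names are hypothetical, standing in for a report export.

```python
# Average-duration limits per report, in milliseconds (Slow Timer is omitted:
# its threshold depends on the business logic).
AVG_LIMIT_MS = {
    "Slow SQL": 500,
    "Slow Screen": 1000,
    "Slow Web Service": 1000,
    "Slow Web Reference": 1000,
    "Slow Extension": 1000,
}
MAX_OCCURRENCES = 100

def needs_attention(report, rows):
    """Entries with 100+ occurrences at or above the average-duration limit.

    `rows` holds (name, occurrences, average duration in ms) tuples.
    """
    limit = AVG_LIMIT_MS[report]
    return [name for name, occurrences, avg_ms in rows
            if avg_ms >= limit and occurrences >= MAX_OCCURRENCES]

# Hypothetical rows exported from the Slow SQL report.
slow_sql = [("SearchFlights", 340, 820), ("GetHotels", 40, 900), ("GetCars", 500, 120)]
print(needs_attention("Slow SQL", slow_sql))  # ['SearchFlights']
```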
Performance Data Tools – Application layer
[Diagram: Agile Platform environment annotated with the Slow Screen, Slow SQL, Slow Extension, Slow Web Reference, Slow Web Service and Slow Timer reports, plus CPU, RAM, disk, network and IIS WP recycles]
Now what? How to prevent performance emergencies
Performance Monitoring – Goals
• Maintain good performance levels
• Know the expected behavior of your apps and installation
• Identify new patterns and trends
• No surprises!
Performance Monitoring – Two-layer monitoring
• Infrastructure
• Agile Platform (Application)
Performance Monitoring – Infrastructure
Set up monitoring on the DB and FE servers:
• CPU, memory, disk, network
• Windows services status: IIS, OutSystems services
• Database indicators: size, average lock wait, index fragmentation
Performance Monitoring – Infrastructure
Define thresholds and alarms
• Start with the recommended thresholds
• Adapt them to your requirements
Use the tools IT already has
• e.g. Tivoli, OpManager, Nagios
• Windows Performance Monitor and Event Viewer
Performance Monitoring – Agile Platform
Collect the Service Center reports daily
• Slow SQL
• Slow Screens
• Daily History: screen hits, daily users (Service Center > Analytics > Daily History; generated automatically by the platform if enabled in Server Configuration)
Performance Monitoring – Agile Platform
Check the Error Log daily for timeouts
• They may indicate performance problems
Increase the log cycle period
• Configuration Tool > Logs tab (the default is 4 weeks)
Performance Monitoring – A framework
1st Collect
2nd Analyze
3rd Implement
Performance Monitoring – Phase 1: Collect
Gather metrics in one place, e.g. an Excel workbook
The collection period depends on criticality: 1 day vs. 1 week
Register daily events to aid the analysis
• Know what happened and when, e.g. scheduled maintenance, external downtimes
• Correlate them with the performance data
Make sure to reserve budget for these tasks
• It must be followed through!
• E.g. 1 hour daily to collect and analyze
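Excel opens CSV files directly, so one low-effort way to keep everything in one place is a CSV row per day, with that day's notable events next to the numbers. A minimal sketch; the file name, column set and values are hypothetical.

```python
import csv
import io
from datetime import date

FIELDS = ["day", "slow_sql", "slow_screens", "screen_hits", "daily_users", "events"]

def append_day(out, row):
    """Append one day's metrics, plus that day's notable events, as a CSV row.

    Writes the header first when the stream is empty, so the same function
    works for a new file and for an existing one opened in append mode.
    """
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    if out.tell() == 0:
        writer.writeheader()
    writer.writerow(row)

log = io.StringIO()  # stands in for open("perf_log.csv", "a", newline="")
append_day(log, {"day": date(2012, 5, 10), "slow_sql": 42, "slow_screens": 17,
                 "screen_hits": 400000, "daily_users": 600,
                 "events": "DB maintenance 02:00-05:00"})
append_day(log, {"day": date(2012, 5, 11), "slow_sql": 55, "slow_screens": 21,
                 "screen_hits": 410000, "daily_users": 610,
                 "events": "external payment service down 10:00-10:30"})
print(log.getvalue())
```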
Performance Monitoring – Phase 2: Analyze
Focus on the “Top 10” most relevant SQL, screens, extensions and web services
• By usage or criticality
Build visualizations (graphs)
• Better identification of trends
• Easier to analyze and spot deviations
Performance Monitoring – Phase 3: Implement
Pick the “Top X” to address in each sprint
• Fix them while they are small!
• Prioritize increasing trends
Do not postpone!
• Make it a commitment to implement some improvements every sprint
• Keeps the focus on performance
• Positive impact on users
Thank you!
Paulo Cunha