OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance: Troubleshooting and Monitoring
A framework approach
Paulo Cunha, Solutions Delivery
NextStep 2012
www.outsystems.com, © 2012 outsystems – all rights reserved

description

Great applications can only be great when they respond at the speed business demands. In this session you will learn from real-world experience in troubleshooting, monitoring and running a complex enterprise application that must cope with extremely demanding performance requirements. The takeaway is a simple framework you can apply immediately to make sure your applications are great.

Transcript of OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Page 1: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

www.outsystems.com

1 © 2012 outsystems – all rights reserved

Performance: Troubleshooting and Monitoring A framework approach

Paulo Cunha Solutions Delivery

NextStep 2012

Page 3: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Motivations for a framework

✓ Deal with emergency scenarios
✓ Quick and accurate diagnosis
✓ Systematic approach
✓ Common metrics and use cases

Page 4: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Where is the fire?

[Diagram: Agile Platform environment, showing Client, Load Balancer, Frontend 1, Frontend 2, Database and External Systems]

Page 5: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Designing the framework

1st: Problem classification
•  Where does it happen?
•  When does it happen?

2nd: Identify possible causes
•  Application
•  Infrastructure

3rd: Identify resolution strategies
•  Dig deeper
•  Apply known solution

Page 6: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Designing the framework

When \ Where   | All Applications | Specific Application | Specific Operation
All the time   | Pattern 1        | Pattern 3            | Pattern 3
Peak Hours     | Pattern 2        | Pattern 3            | Pattern 3
Periodically   | Pattern 4        | Pattern 4            | Pattern 4
Off Peak Hours | Pattern 5        | Pattern 5            | Pattern 5

Page 7: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 1

Where: All Applications
When: All the time

Possible Cause: Horizontal bottleneck, usually related to the database

Strategy:
•  Check Service Center reports: Slow SQL
•  Check database server performance counters: CPU, Memory, Disk

Page 8: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 1 - Example

Travel Search Web Site
•  High Load System
•  99.9% availability
•  4M searches/month
•  7M daily web hits
•  50K daily visitors

Page 9: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 1 - Example

Travel Search Web Site

1. Symptoms
Where: Low performance on web site
When: During Peak Hours (24/7), i.e. All the time

2. Diagnosis
o  Slow SQL reports: queries taking too long
o  DB server CPU ~ 100%
o  SQL Server’s execution plan cache too large
   •  Overuse of expand inline parameters
   •  Detected platform inefficiency in handling variable-length data types

Page 10: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 1 - Example

Travel Search Web Site

3. Resolution
o  Containment measure: Clear execution plan cache
   DBCC FREEPROCCACHE
o  Remove expand inline parameters from queries
o  Agile Platform optimization at query parameterization level
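The diagnosis and resolution for this example turn on how a plan cache is keyed. As a rough, simplified model (not SQL Server's actual implementation), assume the cache holds one plan per distinct statement text. Values expanded inline then produce one entry per distinct value, while a parameterized statement is reused. The table, column and parameter names below are hypothetical.

```python
def inline_statement(city: str) -> str:
    # Value expanded into the SQL text: every distinct value yields new text.
    return f"SELECT * FROM Flight WHERE Destination = '{city}'"


def parameterized_statement(city: str) -> tuple[str, tuple[str]]:
    # Value travels separately from the text: the text never changes.
    return "SELECT * FROM Flight WHERE Destination = @city", (city,)


def simulate_plan_cache(statements: list[str]) -> dict[str, int]:
    """Toy plan cache: one entry per distinct statement text, with a use count."""
    cache: dict[str, int] = {}
    for text in statements:
        cache[text] = cache.get(text, 0) + 1
    return cache


cities = ["Lisbon", "Porto", "Madrid", "Lisbon", "Paris"]
inline_cache = simulate_plan_cache([inline_statement(c) for c in cities])
param_cache = simulate_plan_cache([parameterized_statement(c)[0] for c in cities])

print(len(inline_cache))  # 4 entries: one per distinct city
print(len(param_cache))   # 1 entry, reused for all 5 searches
```

On a search site with many distinct search values, the first pattern can grow the cache without bound, which is consistent with clearing the cache being only a stopgap.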

Page 11: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 2

Where: All Applications
When: Peak Hours

Possible Cause: Infrastructure not handling generated load, usually at Front-End or Database level

Strategy:
•  Check Service Center reports: Slow SQL, Slow Screens
•  Check FE and DB server performance counters: CPU, Memory, Disk

Page 12: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 2 - Example

Insurance Business Application
•  Core System
•  2M Software Units
•  300 GB Database
•  400K daily web hits (200K in May 2011)
•  600 daily users (300 in May 2011)

Page 13: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 2 - Example

Insurance Business Application

1. Symptoms
Where: Low performance on all applications
When: During the day, i.e. Peak Hours

2. Diagnosis
o  Slow SQL and Slow Screens reports
   •  Verified correlation between top queries and top screens
o  DB server CPU @ 100%, Memory ~ 99%
o  DB server inadequate hardware sizing
o  Application data model inefficiencies
   •  Big Datasets, Fragmented Indexes

Page 14: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 2 - Example

Insurance Business Application

3. Resolution
o  Data Model and Query optimizations
   •  Add/remove and defragment indexes
   •  Split queries and remove expand inline parameters
   •  Force TOPs, avoid UNIONs
o  Application logic improvements
   •  Timers re-scheduling (for day operations)
   •  Enforce refined searches (reduced dataset) and use flat tables for searches

Page 15: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 3

Where: Specific Application, Specific Operation
When: All the time, Peak Hours

Possible Causes:
•  Application/Operation data model, integration or architecture bottleneck (bad design)
•  IIS Worker Process recycle (.NET stack)

Strategy:
•  Check Service Center reports: Slow SQL / Screens / Extensions / Web References
•  Check Windows Event Viewer on FEs for IIS messages
•  Review application/operation implementation

Page 16: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 4

Where: All Applications, Specific Application, Specific Operation
When: Periodically

Possible Causes:
•  Timers (asynchronous processing)
•  IIS Worker Process recycle (.NET stack)
•  Application/Operation data model, integration or architecture bottleneck (bad design)

Strategy:
•  Correlate Timer and Screen logs for that period
•  Check Service Center reports for that period: Slow Timers / Screens / Extensions / Web References
•  Check Windows Event Viewer on FEs for IIS messages
•  Review application/operation implementation
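One way to correlate Timer and Screen logs for a period is to check which slow screen requests fall inside a timer's execution window. A minimal sketch, assuming the logs have been exported to simple records; the field names, the 1000 ms cut-off and the sample data are hypothetical and are not the Service Center log schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class TimerRun:
    name: str
    start: datetime
    end: datetime


@dataclass
class ScreenHit:
    screen: str
    at: datetime
    duration_ms: int


def slow_hits_during_timers(timers, hits, slow_ms=1000):
    """Return (timer name, hit) pairs where a slow hit occurred while the timer ran."""
    pairs = []
    for hit in hits:
        if hit.duration_ms < slow_ms:
            continue  # only slow requests are interesting
        for run in timers:
            if run.start <= hit.at <= run.end:
                pairs.append((run.name, hit))
    return pairs


t0 = datetime(2012, 5, 7, 10, 0)
timers = [TimerRun("RecalculateTotals", t0, t0 + timedelta(minutes=15))]
hits = [
    ScreenHit("PolicyList", t0 + timedelta(minutes=5), 4200),   # slow, inside the run
    ScreenHit("PolicyList", t0 + timedelta(minutes=40), 3900),  # slow, outside the run
    ScreenHit("Home", t0 + timedelta(minutes=6), 120),          # fast, ignored
]
for timer_name, hit in slow_hits_during_timers(timers, hits):
    print(timer_name, hit.screen, hit.duration_ms)
```

If most slow hits cluster inside the windows of one timer, that timer is the first candidate for rescheduling or optimization.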

Page 17: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 5

Where: All Applications, Specific Application, Specific Operation
When: Off Peak Hours

Possible Causes:
•  Maintenance Tasks (DB, Antivirus)
•  Timers (asynchronous processing)
•  Application/Operation data model, integration or architecture bottleneck (bad design)

Strategy:
•  Check DB maintenance tasks history
•  Check server’s scheduled tasks and antivirus configurations
•  Check Service Center reports: Slow Timers / Screens / Extensions / Web References
•  Review application/operation implementation

Page 18: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 5 - Example

Energy Billing System
•  Batch processing with long timer runs
•  Critical Operation
•  800 GB Database

Page 19: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 5 - Example

Energy Billing System

1. Symptoms
Where: Low performance/timeout on specific operation
When: Night/Off Peak Hours

2. Diagnosis
o  Slow Timers and Slow SQL reports
o  DB Maintenance Tasks taking 15 hours
o  Timer execution colliding with DB maintenance tasks
o  Timer performance degraded with data growth

Page 20: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: Framework – Pattern 5 - Example

Energy Billing System

3. Resolution
o  Containment measure: Increase timer timeouts
o  Optimize DB maintenance tasks
   •  Reorganize vs rebuild indexes
o  Reduce data set to be processed
   •  Split batches, reorganize data model
   •  Archive old data

Page 21: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Troubleshooting: The Framework

When \ Where   | All Applications               | Specific Application                       | Specific Operation
All the time   | Database                       | Application Design, IIS Worker Processes   | Application Design
Peak Hours     | Database, IIS Worker Processes | Application Design, IIS Worker Processes   | Application Design
Periodically   | Timers, IIS Worker Processes   | Timers, IIS Worker Processes, Integrations | Timers, Application Design
Off Peak Hours | Timers, Maintenance Tasks      | Timers, Maintenance Tasks                  | Timers, Maintenance Tasks, Application Design
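The framework matrix can be read as a lookup from a symptom's classification (when, where) to the first things to check. The sketch below transcribes the cells as read from the slide; the names `WHERE`, `FRAMEWORK` and `likely_causes` are illustrative only.

```python
WHERE = ("All Applications", "Specific Application", "Specific Operation")

# One row per "when"; each row holds one list of causes per column in WHERE.
FRAMEWORK = {
    "All the time": (
        ["Database"],
        ["Application Design", "IIS Worker Processes"],
        ["Application Design"],
    ),
    "Peak Hours": (
        ["Database", "IIS Worker Processes"],
        ["Application Design", "IIS Worker Processes"],
        ["Application Design"],
    ),
    "Periodically": (
        ["Timers", "IIS Worker Processes"],
        ["Timers", "IIS Worker Processes", "Integrations"],
        ["Timers", "Application Design"],
    ),
    "Off Peak Hours": (
        ["Timers", "Maintenance Tasks"],
        ["Timers", "Maintenance Tasks"],
        ["Timers", "Maintenance Tasks", "Application Design"],
    ),
}


def likely_causes(when: str, where: str) -> list[str]:
    """Return the usual suspects for a symptom classified by when and where."""
    return FRAMEWORK[when][WHERE.index(where)]


print(likely_causes("Peak Hours", "All Applications"))
# ['Database', 'IIS Worker Processes']
```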

Page 22: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

How to gather performance data

Performance Data Tools

Page 23: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: 3 layers to gather performance metrics

•  Application
•  Application Server
•  Infrastructure

Page 24: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Infrastructure layer (.NET stack)

Use Windows Performance Counters
Start menu > Control Panel > Administrative Tools > Performance Monitor

Page 25: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Infrastructure layer (.NET stack)

Keep counter values below the thresholds

Performance Counter                  | Threshold
\Processor(_Total)\% Processor Time  | Depends on the server role: FE < 40%, DB < 60%
\Memory\Pages/sec                    | < 1000 at all times
\PhysicalDisk\Avg. Disk Queue Length | < 2 for each physical disk drive
\TCPv4\Connections Established       | < (100 * #worker processes + 50) * 2 < 3900
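The thresholds above can be turned into a small check over sampled counter values. This is a sketch under stated assumptions: the chained TCP inequality is read here as "the computed limit, capped at 3900", the disk queue length is treated as a single sample rather than one per physical drive, and the function names and sample values are hypothetical.

```python
def tcp_connection_limit(worker_processes: int) -> int:
    """(100 * #worker processes + 50) * 2, capped at 3900 (one reading of the slide)."""
    return min((100 * worker_processes + 50) * 2, 3900)


def counter_violations(role: str, samples: dict[str, float],
                       worker_processes: int = 1) -> list[str]:
    """Return the counters whose sampled value is at or above its threshold."""
    limits = {
        r"\Processor(_Total)\% Processor Time": {"FE": 40, "DB": 60}[role],
        r"\Memory\Pages/sec": 1000,
        r"\PhysicalDisk\Avg. Disk Queue Length": 2,
        r"\TCPv4\Connections Established": tcp_connection_limit(worker_processes),
    }
    return [name for name, value in samples.items() if value >= limits[name]]


print(tcp_connection_limit(2))   # (100 * 2 + 50) * 2 = 500
print(tcp_connection_limit(25))  # (100 * 25 + 50) * 2 = 5100, capped at 3900

fe_sample = {
    r"\Processor(_Total)\% Processor Time": 55.0,
    r"\Memory\Pages/sec": 120.0,
    r"\TCPv4\Connections Established": 310.0,
}
# 55% CPU breaches the FE limit (40%) but would pass on a DB server (60%).
print(counter_violations("FE", fe_sample, worker_processes=2))
```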

Page 26: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Infrastructure layer (.NET stack)

[Diagram: Agile Platform environment (Client, Load Balancer, Frontend 1, Frontend 2, Database, External Systems), annotated with the infrastructure metrics CPU, RAM, DISK, NETWORK]

Page 27: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application Server layer (.NET stack)

Use Windows Event Viewer to check for IIS events
Start menu > Control Panel > Administrative Tools > Event Viewer

Page 28: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application Server layer (.NET stack)

Make sure IIS Application Pools are properly configured
Follow the “Tuning and Security Check list” in the Agile Platform .NET Install Checklist

Event                      | Threshold
IIS Worker Process recycle | Recycling should only occur when scheduled and during off hours

Page 29: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application Server layer (.NET stack)

[Diagram: Agile Platform environment (Client, Load Balancer, Frontend 1, Frontend 2, Database, External Systems), annotated with CPU, RAM, DISK, NETWORK and IIS WP RECYCLES]

Page 30: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application layer

Use Agile Platform’s Service Center reports
Service Center > Analytics > Reports

Page 31: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application layer

Service Center Report | Threshold
Slow SQL              | <100 occurrences with 500ms of average duration
Slow Screen           | <100 occurrences with +1s of average duration
Slow Web Service      | <100 occurrences with +1s of average duration
Slow Web Reference    | <100 occurrences with +1s of average duration
Slow Extension        | <100 occurrences with +1s of average duration
Slow Timer            | Depends on the business logic
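The report thresholds can be applied mechanically to exported report rows. The slide wording is terse, so this sketch assumes one reading: an entry needs attention when it has 100 or more occurrences and its average duration is at or above the listed value. The names and sample rows are hypothetical.

```python
# Average-duration limits in milliseconds per report type, from the table above.
# Slow Timer is left out: its threshold depends on the business logic.
DURATION_LIMIT_MS = {
    "Slow SQL": 500,
    "Slow Screen": 1000,
    "Slow Web Service": 1000,
    "Slow Web Reference": 1000,
    "Slow Extension": 1000,
}
MAX_OCCURRENCES = 100


def needs_attention(report: str, occurrences: int, avg_duration_ms: float) -> bool:
    """True when an entry is both frequent and slow under the assumed reading."""
    return (occurrences >= MAX_OCCURRENCES
            and avg_duration_ms >= DURATION_LIMIT_MS[report])


rows = [
    ("Slow SQL", 350, 820.0),        # frequent and slow
    ("Slow Screen", 40, 2500.0),     # slow but rare
    ("Slow Extension", 900, 300.0),  # frequent but fast
]
flagged = [row for row in rows if needs_attention(*row)]
print(flagged)  # only the Slow SQL row is flagged
```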

Page 32: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Data Tools: Application layer

[Diagram: Agile Platform environment (Client, Load Balancer, Frontend 1, Frontend 2, Database, External Systems), annotated with SLOW SCREEN, SLOW SQL, SLOW EXTENSION, SLOW WEB REFERENCE, SLOW WEB SERVICE and SLOW TIMER, plus CPU, RAM, DISK, NETWORK and IIS WP RECYCLES]

Page 33: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

How to prevent performance emergencies

Now what?

Page 34: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Goals

✓ Maintain good performance levels
✓ Know the expected behavior of your apps and installation
✓ Identify new patterns and trends
✓ No surprises!

Page 35: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: 2 Layer Monitoring

•  Agile Platform (Application)
•  Infrastructure

Page 36: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Infrastructure

Set up monitoring on DB and FE servers
•  CPU, Memory, Disk, Network
•  Windows Services status: IIS, OutSystems Services
•  Database indicators: Size, Average Lock Wait, Index Fragmentation

Page 37: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Infrastructure

Define thresholds and alarms
•  Start with recommended thresholds
•  Adapt to your requirements

Use tools already available in your IT department
•  E.g. Tivoli, OpManager, Nagios
•  Windows Performance Monitor and Event Viewer

Page 38: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Agile Platform

Collect daily Service Center reports
•  Slow SQL
•  Slow Screens
•  Daily History - Screen Hits, Daily Users
   Service Center > Analytics > Daily History
   Automatically generated by the platform (if active on Server Configuration)

Page 39: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Agile Platform

Check Error Log daily for timeouts
•  May indicate performance problems

Increase Log Cycle Period
•  Configuration Tool > Logs tab (default is 4 weeks)

Page 40: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: A framework

1st: Collect
2nd: Analyze
3rd: Implement

Page 41: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Phase 1 - Collect

Gather metrics in one place
•  E.g. Excel Workbook

Period depends on criticality
•  1 day vs. 1 week

Register daily events to aid in analysis
•  Know what happened and when (e.g. scheduled maintenance, external downtimes)
•  Correlate with performance data

Make sure to reserve budget for these tasks
•  It must be followed through!
•  E.g. 1 hour daily to collect and analyze

Page 42: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Phase 2 - Analyze

Focus on “Top 10” most relevant
•  SQL, Screens, Extensions, Web Services
•  By usage or criticality

Build visualizations (graphs)
•  Better identification of trends
•  Easier to analyze and spot deviations
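Trend identification does not have to wait for graphs. As a sketch, assume the collected metrics hold one average duration per day for each query or screen; a least-squares slope over the day index then ranks the items whose durations are growing fastest. The item names and figures below are made up for illustration.

```python
def slope(values: list[float]) -> float:
    """Least-squares slope of values against their index (day 0, 1, 2, ...)."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


def top_increasing(history: dict[str, list[float]], top: int = 10) -> list[tuple[str, float]]:
    """Rank items by how fast their daily average duration (ms) is growing."""
    trends = [(name, slope(series)) for name, series in history.items()
              if len(series) >= 2]
    rising = [t for t in trends if t[1] > 0]
    return sorted(rising, key=lambda t: t[1], reverse=True)[:top]


history = {
    "GetInvoices (SQL)": [480, 510, 560, 640, 700],           # steadily worse
    "CustomerList (Screen)": [900, 905, 895, 910, 900],       # roughly flat
    "ExportReport (Screen)": [1500, 1450, 1400, 1380, 1300],  # improving
}
for name, ms_per_day in top_increasing(history):
    print(f"{name}: {ms_per_day:+.1f} ms/day")
```

Ranking by slope rather than by absolute duration surfaces items that are still small but growing, which is when they are cheapest to fix.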

Page 43: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Phase 2 - Analyze

Page 44: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Performance Monitoring: Phase 3 - Implement

Pick “Top X” to address on each sprint
•  Fix them when they are small!
•  Prioritize increasing trends

Do not postpone!
•  Make it a commitment to implement some improvements every sprint
•  Keeps focus on performance
•  Positive impact on users

Page 45: OutSystems - A Framework Approach for Troubleshooting - NextStep 2012

Thank you!

Paulo Cunha [email protected]