
CWIC Developers Meeting, January 29th 2014. Calin Duma, cv.duma@gmail.com. Service Level Agreements: High-Availability, Reliability and Performance.

CWIC Developers Meeting

January 29th 2014

Calin Duma, cv.duma@gmail.com

Service Level Agreements: High-Availability, Reliability and Performance


Agenda

• What are SLAs
• Why use SLAs
• Joint SLOs / SLAs dependencies
• How to establish joint SLOs / SLAs
• CWIC Data Providers SLO challenges
• Initial Sample Approach
• CWIC Start performance challenges
• CWIC Start metrics options
• Joint metrics to consider


What are SLAs

• Service Level Agreements:
– Specify service level requirements between a service provider and a service consumer
– Often take the form of a legal contract with penalties for non-compliance
– Concrete and measurable service level objectives (SLOs) are used to test that SLAs are being met
• In general there is a recognized gap between the expected service levels and the delivered ones:
– Availability: downtime per year (e.g. roughly 5 minutes translates to an SLO of 99.999% uptime)
– Reliability: advertised component failure rates, which can be mitigated by fault-tolerant software and system design
– Performance: response time (completion - submission) and throughput (concurrent requests) oriented SLOs
  • Response times increase as throughput increases
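The availability figure above can be checked with a little arithmetic. The sketch below is only an illustration, assuming a 365-day year; the function names are made up for this example and are not part of any CWIC tooling.

```python
# Convert between yearly downtime and an uptime SLO.
# Assumption: a 365-day year (leap years are ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def uptime_percent(downtime_minutes_per_year: float) -> float:
    """Uptime percentage implied by a given amount of downtime per year."""
    return 100.0 * (1.0 - downtime_minutes_per_year / MINUTES_PER_YEAR)


def allowed_downtime_minutes(uptime_slo_percent: float) -> float:
    """Yearly downtime budget, in minutes, implied by an uptime SLO."""
    return MINUTES_PER_YEAR * (1.0 - uptime_slo_percent / 100.0)


if __name__ == "__main__":
    print(f"5 min/year of downtime -> {uptime_percent(5):.4f}% uptime")
    for slo in (99.0, 99.9, 99.99, 99.999):
        budget = allowed_downtime_minutes(slo)
        print(f"{slo}% uptime -> {budget:.1f} minutes of downtime per year")
```

A 99.999% SLO leaves a budget of a little over 5 minutes per year, which is where the "5 minutes" rule of thumb comes from.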


Why use SLAs

• CWIC is gaining popularity and is providing potential for excellent exposure of data islands (India, China, Brazil etc.)

• We should provide better end-user service:

– Service consumers know what to expect when using GCMD, CWIC and CWIC Start (and other clients)
• We should establish SLOs for our applications:
– Involves hardware resources, infrastructure platforms (OS, Web Application stack) and custom code
– Teams are motivated to work toward agreed-upon targets
– Can dictate and provide empirical data for future hardware and software needs


Joint SLOs / SLAs dependencies

• CWIC Start depends on GCMD and CWIC
• CWIC depends on GCMD and 5 providers:
– NASA, INPE, GHRSST, USGSLSI and CCMEO
• In order to have availability, reliability and performance SLOs we would have to coordinate among 8 components:
1. CWIC Start
2. GCMD
3. CWIC
4. NASA / ECHO
5. INPE
6. GHRSST
7. USGSLSI
8. CCMEO
• If any of the above components is down or slow, the end user will be subject to a sub-optimal experience
• Complexity will increase when more providers are added
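The dependency chain on this slide can be written down as a small graph, which makes it easy to see which components an end-user request may touch and which services are affected when one component fails. This is a minimal sketch: the component names come from the slide, while the dictionary layout and function names are assumptions made for illustration.

```python
# Dependencies as stated on the slide: service -> components it calls.
DEPENDS_ON = {
    "CWIC Start": ["GCMD", "CWIC"],
    "CWIC": ["GCMD", "NASA / ECHO", "INPE", "GHRSST", "USGSLSI", "CCMEO"],
}


def components_involved(root, graph=DEPENDS_ON):
    """Return root plus everything it depends on, directly or transitively."""
    seen = []
    stack = [root]
    while stack:
        name = stack.pop()
        if name in seen:
            continue
        seen.append(name)
        stack.extend(graph.get(name, []))
    return seen


def affected_services(failed, graph=DEPENDS_ON):
    """Services whose dependency chain includes the failed component."""
    return sorted(s for s in graph if failed in components_involved(s, graph))


if __name__ == "__main__":
    print(len(components_involved("CWIC Start")), "components behind CWIC Start")
    print("INPE down affects:", affected_services("INPE"))
```

Adding a provider is a one-line change to the `CWIC` entry, which mirrors how coordination effort grows with each new provider.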


How to establish joint SLOs / SLAs

• While usage of our services is free it doesn’t mean that we can’t provide a reasonable user experience and set realistic user expectations

• True joint SLOs / SLAs would be at most the SLO / SLA of the weakest component and therefore not desirable

• CWIC, GCMD, CWIC Start and ECHO can work together on joint SLOs / SLAs

• CWIC can collect existing provider SLAs where applicable or help providers think about SLAs
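To illustrate why a true joint SLO can be no better than the weakest component: if a request needs every component in a chain and failures are assumed to be independent, the joint availability is the product of the individual availabilities, which can never exceed the smallest one. The figures below are purely hypothetical and are not measured values for any of these services.

```python
from math import prod


def joint_availability(availabilities):
    """Availability of a chain where every component must be up.

    Assumption: component failures are independent of each other.
    """
    return prod(availabilities)


if __name__ == "__main__":
    # Hypothetical availabilities (fractions of time up), for illustration only.
    example = {
        "CWIC Start": 0.999,
        "GCMD": 0.999,
        "CWIC": 0.999,
        "one data provider": 0.99,
    }
    joint = joint_availability(example.values())
    print(f"weakest component: {min(example.values()):.3%}")
    print(f"joint availability: {joint:.3%}")
```

Under these assumptions the joint figure lands below the weakest single component, which is the reason the slide calls a true joint SLO undesirable.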


CWIC data providers SLOs challenges

• Similar to ECHO’s challenges of dealing with its 11 data partners

• ECHO model is something we can learn from:
– Provide individual availability notices on the CWIC WGISS home page
– If providers do not communicate down times or availability, collect statistics with monitoring technologies / APIs
– Collect CWIC Start and CWIC metrics that can capture current SLOs for all external dependencies*
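When a provider does not publish downtime, availability statistics can be gathered by probing its endpoint on a schedule and summarizing the results. The following is a rough sketch using only the Python standard library; the `Probe` record, the function names and the choice of treating any 2xx/3xx status as "up" are assumptions for illustration, not a description of ECHO's or CWIC's actual monitoring.

```python
import http.client
import time
import urllib.request
from dataclasses import dataclass


@dataclass
class Probe:
    ok: bool        # True when the endpoint answered with a 2xx/3xx status
    seconds: float  # wall-clock time spent on the request


def probe(url: str, timeout: float = 10.0) -> Probe:
    """Issue one GET request and record whether it succeeded and how long it took."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # URLError, HTTPError and socket timeouts are all OSError subclasses.
        ok = False
    return Probe(ok, time.monotonic() - start)


def summarize_probes(probes):
    """Return (availability in percent, mean response time of successful probes)."""
    if not probes:
        raise ValueError("no probes collected")
    successes = [p for p in probes if p.ok]
    availability = 100.0 * len(successes) / len(probes)
    mean = sum(p.seconds for p in successes) / len(successes) if successes else None
    return availability, mean
```

Calling `probe()` from a scheduler (cron, for example) against each provider endpoint and feeding the stored results to `summarize_probes()` would give a per-provider availability figure without relying on provider notices.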


Initial Sample Approach


CWIC Start Performance Challenges

• CWIC Start had performance issues due to:
– A distributed search memory leak
– Inconsistent OpenLayers map rendering
– Potential memory leak due to high load generated by search bots
• It was and still is very challenging to pinpoint performance problems due to:
– Ruby on Rails running on top of JRuby and the difficulty of using memory profilers that point to the actual Ruby code
– Clustered / load-balanced deployment, with requests from the same user being serviced on different hosts
– Difficulties in collecting host-level performance metrics such as free physical memory, swap utilization, CPU and network I/O


CWIC Start metrics options

• We are investigating Real User Monitoring (RUM) metrics that capture the user browser experience:
– Google Analytics (~26 subjects with hundreds of dimensions / specific descriptive attributes)
– W3C Navigation Timing to complement GA
– New Relic: excellent back end code instrumentation targeting SLAs and detailed performance metrics
• We added semantic logging and detailed durations to make it easy to trace requests on a cluster:

Example: [813d9f11-3235-4907-8470-7df507a10eac] [74.86.158.106] Started GET "/datasetssearch?standard=csw" for 74.86.158.106 at 2014-01-27 02:22:25 +0000

Example: HttpRequest.submit RESPONSE, DURATION (uCPU sCPU usCPU real): 0.000000 0.000000 0.000000 ( 0.239054)
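Log lines in this shape are straightforward to parse, which is what makes tracing a request across cluster hosts practical. The sketch below parses the two example lines shown on the slide; the regular expressions are inferred from those two samples only, so other line variants may need adjustments, and the field names (including reading the four numbers as user CPU, system CPU, user plus system CPU and real time) are assumptions based on the column header.

```python
import re
from datetime import datetime

# "[request id] [client ip] Started METHOD "path" for ip at timestamp"
STARTED = re.compile(
    r'^\[(?P<request_id>[0-9a-f-]{36})\] \[(?P<ip>[^\]]+)\] '
    r'Started (?P<method>[A-Z]+) "(?P<path>[^"]+)" '
    r'for (?P<client>\S+) at (?P<timestamp>.+)$'
)

# "label, DURATION (uCPU sCPU usCPU real): u s us ( real)"
DURATION = re.compile(
    r'^(?P<label>.+?), DURATION \(uCPU sCPU usCPU real\):\s*'
    r'(?P<u_cpu>\d+\.\d+)\s+(?P<s_cpu>\d+\.\d+)\s+(?P<us_cpu>\d+\.\d+)\s+'
    r'\(\s*(?P<real>\d+\.\d+)\)\s*$'
)


def parse_started(line):
    """Parse a 'Started ...' line into a dict, or return None if it does not match."""
    match = STARTED.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["timestamp"] = datetime.strptime(fields["timestamp"], "%Y-%m-%d %H:%M:%S %z")
    return fields


def parse_duration(line):
    """Parse a '..., DURATION (...)' line into a dict with float timings, or None."""
    match = DURATION.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    for key in ("u_cpu", "s_cpu", "us_cpu", "real"):
        fields[key] = float(fields[key])
    return fields
```

Grouping parsed lines by `request_id` collects everything logged for one request in one place, and the `real` values can feed response-time SLO reports.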


Joint metrics to consider

• CWIC ExtJS application is an excellent start
• Questions to answer:
– Who is using CWIC and GCMD CSW (clientId?)
– Who is using CWIC and GCMD OpenSearch (clientId)
– Who is using CWIC without GCMD interaction
– Granule metadata and data downloads via CWIC provided links
– Percentage of direct downloads vs. provider welcome page redirects
– Average response times


Joint metrics to consider cont.

• Questions to answer (CSW and OpenSearch):
– Number of errors due to provider internal errors
– Number of errors due to CWIC internal errors
– Number of errors due to provider unavailable
– CWIC specific performance metrics per provider
– GCMD specific performance metrics
– Others?
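Several of these questions reduce to simple aggregation once each request is recorded with its provider, outcome and duration. The sketch below shows one possible shape for that aggregation; the record fields and the outcome labels (`provider_error`, `cwic_error`, `provider_unavailable`) are hypothetical and chosen only to mirror the error categories listed above.

```python
from collections import Counter, defaultdict


def summarize_requests(records):
    """Count errors per provider and category, and average successful response times.

    Each record is a dict with hypothetical keys:
    provider, outcome ("ok" or an error category), seconds.
    """
    errors = defaultdict(Counter)
    durations = defaultdict(list)
    for record in records:
        if record["outcome"] == "ok":
            durations[record["provider"]].append(record["seconds"])
        else:
            errors[record["provider"]][record["outcome"]] += 1
    averages = {p: sum(values) / len(values) for p, values in durations.items()}
    return errors, averages


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        {"provider": "INPE", "outcome": "ok", "seconds": 1.0},
        {"provider": "INPE", "outcome": "ok", "seconds": 3.0},
        {"provider": "INPE", "outcome": "provider_unavailable", "seconds": 10.0},
        {"provider": "GHRSST", "outcome": "cwic_error", "seconds": 0.1},
    ]
    errors, averages = summarize_requests(sample)
    print(dict(errors["INPE"]), averages)
```

Adding a `protocol` field (CSW or OpenSearch) and a `client_id` field to each record would let the same loop answer the "who is using" questions as well.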
