Autonomous Recovery in Componentized Internet Applications, Candea et al., Vikram Negi

Post on 09-Jan-2016


Autonomous Recovery in Componentized Internet Applications

Candea et al.

Vikram Negi

Introduction

• Autonomic Problem

• Approach

• Results

• Discussion

The Autonomic Problem

• To allow the application to recover automatically from transient and intermittent software failure.

The Approach

• Introduce three ideas:
  – Macroanalysis (fault detection)
  – Microrebooting (rapid recovery)
  – External management (recovery action)

• Integrate and test with JBoss

Design Overview

• Autonomous process
  – Monitoring
    • Java probes
  – Fault detection
    • Generates an anomaly report
  – Recovery
    • Takes action

• Total time to recovery

J2EE Review

• J2EE enterprise apps = collection of reusable Java modules

• JSPs / servlets invoke EJBs, which invoke other EJBs, ...

• EJB = Java component that complies with a certain interface and provides a service

• Deployment descriptor (per-bean XML file) conveys run-time characteristics and dependencies; used in deploying the application

JBoss Design

• Open-source J2EE app server

• Written entirely in Java

• Microkernel with components held together by JMX (management support)

JAGR = ROC-ified JBoss with Application-Generic Recovery

• 3-tier architecture

• Key components
  – Macroanalysis engine
  – Microrebooting hook
  – Recovery manager

Pinpoint : Detection and Localization

• Stored observations
  – IP address of the machine, timestamp
  – Globally unique request ID
  – Number of calls to and returns from EJBs
  – Association between sender and receiver
  – SQL queries collected (updates and reads)

Pinpoint : Analysis

• Analysis engine
  – Centralized engine
  – Plugin-based architecture

• Modeling components
  – Assume that a component's present behavior and its historical (normal) behavior follow the same probability distribution.
  – Use a chi-square (χ²) test to determine whether the two distributions differ.
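The chi-square comparison above can be illustrated with a minimal sketch. It is not Pinpoint's implementation: the function names, the component names, and the choice of a 0.05 significance level are assumptions made for illustration. It compares how often one component called each of its peers in the current window against the proportions seen historically.

```python
# Illustrative sketch only; names and the 0.05 significance level are assumptions.

# Chi-square critical values at the 0.05 level, indexed by degrees of freedom.
CRITICAL_0_05 = {1: 3.841, 2: 5.991, 3: 7.815, 4: 9.488, 5: 11.070}


def chi_square_statistic(observed, expected_probs):
    """Pearson chi-square statistic for observed call counts vs. historical proportions.

    Peers that never appeared in the historical model are ignored here,
    which is a simplification.
    """
    total = sum(observed.get(peer, 0) for peer in expected_probs)
    stat = 0.0
    for peer, prob in expected_probs.items():
        expected = total * prob
        if expected > 0:
            diff = observed.get(peer, 0) - expected
            stat += diff * diff / expected
    return stat


def is_anomalous(observed, expected_probs):
    """Flag the component if its behavior differs significantly from history."""
    degrees_of_freedom = len(expected_probs) - 1
    return chi_square_statistic(observed, expected_probs) > CRITICAL_0_05[degrees_of_freedom]


# Hypothetical historical call mix of one component toward three peers.
historical = {"EntityItem": 0.5, "EntityUser": 0.3, "EntityBid": 0.2}
normal_window = {"EntityItem": 52, "EntityUser": 28, "EntityBid": 20}
faulty_window = {"EntityItem": 90, "EntityUser": 5, "EntityBid": 5}

print(is_anomalous(normal_window, historical))  # False
print(is_anomalous(faulty_window, historical))  # True
```

A fixed lookup table of critical values keeps the sketch free of third-party dependencies; a statistics library would normally supply these.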

Recovery : micro-reboot is not expensive

• State segregation
  – Store important state outside the application, in a database.
  – Persistent state
    • CMP (container-managed persistence, J2EE) is a requirement for the prototype.
  – Session state
    • Stored in a modified SSM (external session state store)

• Containment and reintegration
  – Microreboot the transitive closure of all inter-EJB references
  – XML deployment descriptors determine the grouping for the closure
  – Complete reboot or microreboot
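The grouping step can be sketched as a graph traversal. This is an illustration under stated assumptions, not the JAGR code: the reference map below is hypothetical (in the prototype it would come from the XML deployment descriptors), and references are followed in both directions, which is an assumption about how the closure is formed.

```python
from collections import deque

# Hypothetical inter-EJB references; component names are illustrative only.
REFERENCES = {
    "SB_CommitBid": ["EntityBid", "EntityItem"],
    "SB_ViewItem": ["EntityItem"],
    "SB_BrowseRegions": ["EntityRegion"],
}


def build_neighbors(references):
    """Turn the reference map into an undirected adjacency map."""
    neighbors = {}
    for source, targets in references.items():
        neighbors.setdefault(source, set())
        for target in targets:
            neighbors[source].add(target)
            neighbors.setdefault(target, set()).add(source)
    return neighbors


def microreboot_group(start, references):
    """Return every component reachable from `start` through references."""
    neighbors = build_neighbors(references)
    group = {start}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for nxt in neighbors.get(current, ()):
            if nxt not in group:
                group.add(nxt)
                queue.append(nxt)
    return group


print(sorted(microreboot_group("SB_CommitBid", REFERENCES)))
# ['EntityBid', 'EntityItem', 'SB_CommitBid', 'SB_ViewItem']
```

Components outside the group (here `SB_BrowseRegions` and `EntityRegion`) keep serving requests while the group is microrebooted.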

Recovery

• Enabling microreboot
  – Method in the JBoss EJB container
  – Preserve the class loader

Manage Recovery

• Recovery policy
  – Read the failure report; consider components scoring > 1.0
  – Microreboot the top n components, or all components scoring > 1.0
  – Allow a delay (~30 sec)
  – If the error is still present, retry a few times or reboot completely
  – Finally, report it to the sys admin
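The policy steps above can be sketched as a small escalation loop. This is a hedged illustration, not the recovery manager from the paper: the function name, its parameters, the retry count of 3, and the component names in the example are all assumptions. The actions are passed in as callables so the sketch runs without a real application server.

```python
def recover(report, microreboot, full_reboot, still_failing, notify_admin,
            threshold=1.0, top_n=None, max_attempts=3, wait=lambda: None):
    """Escalating recovery: microreboot suspects, then full reboot, then a human.

    `report` maps component name to anomaly score. `wait` stands in for the
    ~30 s settling delay and is a no-op by default so the sketch runs instantly.
    """
    # Rank components whose score exceeds the threshold, worst first.
    ranked = sorted(report.items(), key=lambda item: item[1], reverse=True)
    suspects = [name for name, score in ranked if score > threshold]
    if top_n is not None:
        suspects = suspects[:top_n]
    if not suspects:
        return "no-action"

    for _ in range(max_attempts):
        for component in suspects:
            microreboot(component)
        wait()
        if not still_failing():
            return "microreboot"

    full_reboot()
    wait()
    if not still_failing():
        return "full-reboot"

    notify_admin()
    return "escalated"


# Example run: the fault survives the first attempt and clears after the second.
actions = []
failure_checks = [True, False]
outcome = recover(
    {"SB_ViewItem": 2.4, "EntityItem": 1.3, "SB_Auth": 0.2},
    microreboot=lambda c: actions.append(("microreboot", c)),
    full_reboot=lambda: actions.append(("full_reboot",)),
    still_failing=lambda: failure_checks.pop(0),
    notify_admin=lambda: actions.append(("notify_admin",)),
)
print(outcome)  # microreboot
```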

Evaluation Test Framework

• Applications
  – Petstore 1.1 (12 components, 233 Java files, 11K LOC)
  – Petstore 1.3.1 (47 components, 310 Java files, 10K LOC)
  – RUBiS (21 components, 500 Java files, 25K LOC)

• Workload
  – Implemented client simulators with a transition table
  – 350 clients (max utilization principle)

• Faultload
  – Based on industry experience
  – No low-level hardware or OS faults

Evaluation Detection

• Results similar to other detectors

• No discussion of absolute numbers?

• Faults: forced Java runtime/declared exceptions, call omission, and source code bugs

• (1) How well was the fault detected? (2) How well was a major outage detected?

Evaluation : Localization

Localization % per algorithm per fault type: CIA > 85%

No absolute data again?

Evaluation : Recovery

• Introduce faults in SSM-RUBiS.

• Restart SSM-RUBiS or microreboot the component.

• Observations from 10 trials with 350 concurrent clients.

Full v/s Micro reboot

• Injected a null reference fault in SB CommitBid, then a corrupt User-Item, SB BrowseCategories and SB CommitUserFeedback.

• Microreboot maintains steady response.

• 425 vs 3916 failed requests (microreboot vs. full reboot)

• 61527 vs 56028 successful requests (microreboot vs. full reboot)

• What error conditions did the other trials have?
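To put the request counts above in proportion, here is a short worked calculation. It uses only the four numbers on the slide and assumes the smaller failure count and larger success count belong to the microreboot run; the variable names are illustrative.

```python
# Request counts taken from the slide; the micro/full assignment is an assumption.
failed_micro, failed_full = 425, 3916
ok_micro, ok_full = 61527, 56028

failure_reduction = 1 - failed_micro / failed_full  # fraction of failures avoided
extra_successes = ok_micro - ok_full                # additional requests served

print(f"{failure_reduction:.0%} fewer failed requests")     # 89% fewer failed requests
print(f"{extra_successes} more successful requests")        # 5499 more successful requests
```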

Total Recovery Time

• Corrupted SB_ViewItem by setting it to NULL.

• 19.4 sec total recovery time (TRT)

• 18.5 sec of that spent in analysis

• Pinpoint is the bottleneck in microreboot-based recovery.
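The bottleneck claim follows from simple arithmetic on the two timings above. The sketch below only restates the slide's numbers; the remainder covers everything other than analysis and is not broken down further in the slides.

```python
total_recovery_time = 19.4  # seconds, from the slide
analysis_time = 18.5        # seconds spent in analysis, from the slide

remaining_time = round(total_recovery_time - analysis_time, 1)
analysis_share = analysis_time / total_recovery_time

print(f"analysis: {analysis_share:.0%} of recovery time")  # analysis: 95% of recovery time
print(f"everything else: {remaining_time} s")              # everything else: 0.9 s
```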

Is Pinpoint app-generic?

• Upgrade to Petstore v.1.3.2
  – Works for the confidence interval

How different was the updated version?

Performance Overhead

• Results for a 30-minute fault-free run with 350 clients

• In-memory vs. external (SSM) session state

• Marshalling costs

Assumptions

• Well-defined interfaces for components (.NET, J2EE)

• Deterministic call paths between components

• No critical service request

• Training data for statistical model

• Guidelines (Crash Only Software)

Discussion

• Overall one of the good papers, though maybe a bit verbose in the introduction.

• Integrating framework for earlier work by Candea.

• Limitations of the present statistical model.

• Shared EJB state
  – Modify JIT, disable microreboots (refs, static vars)

• Application: global data not scrubbed.

• Cost-benefit: microreboot vs. total reboot

Supplementary

• Application server = operating system for Internet applications (instantiates app components in containers, provides runtime system services, integrates with the web server to make the app web-accessible)

• http://people.epfl.ch/george.candea