Addressing Diagnostic Complexity The EDDY Approach End-to-end Diagnostic DiscoveryY Chas DiFatta...

24
Addressing Diagnostic Complexity The EDDY Approach End-to-end Diagnostic DiscoveryY Chas DiFatta [email protected] Mark Poepping [email protected] Joel Smith [email protected]

Transcript of Addressing Diagnostic Complexity The EDDY Approach End-to-end Diagnostic DiscoveryY Chas DiFatta...

Addressing Diagnostic Complexity

The EDDY Approach

End-to-end Diagnostic DiscoveryY

Chas DiFatta [email protected] Poepping [email protected]

Joel Smith [email protected]

Problems of IT Management

• Distributed systems are extremely difficult to diagnose, i.e. complexity, scale.

• Limited number of domain experts.

• Need to transfer the experience of experts to less experienced staff.

• Increasing amount of internal services are increasingly dependant on external resources.

Problems of Distributed System Diagnosticians

• No access to the diagnostic data• Discovering valuable information in a sea of data• Correlating different diagnostic data types• Providing evidence for non-repudiation of a

diagnosis• Finding time to create tools to transfer diagnostic

knowledge to less skilled organizations and/or individuals

Application centric:

• I can’t open or launch the application!

• I can’t authenticate with the application!

• The application says I’m not authorized to use it!

• The application is behaving inconsistently!

• The application is giving errors!

• I can’t use all the features of the application!

• The application’s performance is poor!

??

?

Workstation Centric:

• The performance of my workstation seems slow!

• Is the network down?

From the end-user’s view

From the help desk’s view

Application centric:

• Vital Signs – “Is the application’s basic functionality working?”

• Authn/Authz: “can the user log in and what are their privileges ?”

•Highly focused – “Is a specific feature of the app not working?”

? ?

?

Workstation Centric:

• Does the user have the correct version and configuration?

• Can the user connect to the network or is it a performance problem?

• Security problem? E.g. Botnet, virus, worm, other compromise?

• Is the user having a hardware problem?

?

Infrastructure centric enterprise based:

• Can the user get to key local services? E.g. DHCP, DNS, SMTP, POP/IMAP, NTP, Authn, etc and if so, are they operating properly?

From the diagnostician’s view

Application Centric:

• Does the application have enough critical resources?

• Is the application reporting any internal or external errors?

• Are the lower, upper middleware and services that the application is dependent on functioning an expectable level?

Network Centric:

• Is there an connectivity problem between the user and the application?

• Could there be a routing or firewall problem?

• Is the performance or loss of the network at an acceptable level?

Service Centric:

• Are the middleware services that the end user is dependent upon functioning within an acceptable level and can they get to them? E.g. DHCP, Authn, Authz, DNS, NTP, Email, VPN, Web, etc.

??

?

From the CIO’s view

Finance centric:

• What will I need to spend resources on?

• Is my staff’s time being used effectively?

• What is the growth in specific areas?

??

?

Business centric:

• Are my customers being serviced properly?

• How is the infrastructure operating?

Compliance centric:

• Are our diagnostic procedures consistent with our security and privacy policies?

• Are we within compliance boundaries for

• Processes and procedures

• Legal issues

The end-user’s needs

Application centric tools that:

• Verifies that the user has the correct resources

• Conforms a baseline of functions to the user. E.g. what can the user do and a test to prove to them that they can

• Provides ways for users to tag errors at any phase of the applications operation and publish them to the diagnostic repository

!!

!

A workstation centric tool that:

• Verifies basic hardware operation

• Reports on software versions and any internal errors

• Reports on network connectivity and performance

• Verifies that key network services are available

• Scans system for security external and internal security vulnerabilities

• Publishes results to diagnostic repository

The help desk’s needs

Application centric tools:

• Real-time views into the application as the user is operating

• Medium depth tools verifies the operation of the application and its supporting services

• The ability to share diagnostic information with external groups

! !

!

A workstation centric tool that verifies the users:

• Baseline software and hardware configuration

• Network connectivity and performance

• Key network services are available (DHCP, DNS, NTP, Authn/Authz etc.)

!Service centric enterprise based tools:

• Testing enterprise base services (DNS, SMTP, POP/IMAP, NTP, Authn, Authz, etc.) from the perspective of the user

• Querying the infrastructure about internal and external problems

• The ability to share diagnostic information with internal groups

The diagnostician’s needs

Application Centric Tools:

• Verification that the application has the correct resources

• Highly focused tests into specific features and modules

• The ability to share detailed diagnostic information to the developer in near real-time

Network Centric Tools:

• Forensic tools that query the infrastructure about internal and external connectivity problems

• Active testing tools that perform network tracers at specific intervals and report anomalies into a reporting infrastructure

Service Centric Tools:

• Highly focused passive and active lower, upper, and middleware diagnostic tools that report anomalies

• Forensic tools that query detailed events of key services using their logs and other means

!!

!

The CIO’s needs

Finance centric:

• Reporting on anomaly and problem costs

• Reporting infrastructure growth

!!

!

Business centric:

• Reporting on help desk problems and resolution

• Reporting on infrastructure health

Compliance centric:

• Security process event process reporting

• Reporting that specific processes are being done

State of Practice

• Network, application, system and security events separate, therefore extremely difficult to correlate

• Data represents only what has faulted

• No end-to-end accountability of transactions. I.e. email, web, VoIP, intrusion

VisionCreate an activity audit ledger/application that...

• Provides a means to study the behavior of faults and anomalies

• Explores the impact of an Internet with assured electronic communications and its influence on infrastructure, security, reliability, privacy and trust

• Assures the ‘default’ electronic interaction by creating a means of non-repudiation between two or more parties

What if?

• Events (application, network, system, security, environmental) could be collected, disseminated and correlated using a common backplane?

• The backplane provides access to diagnostic data to tool developers and researchers?

Initial Direction

Enabling mechanism for investigating: • Machine to machine interaction, i.e. services• Taxonomic risk analysis of security anomalies• Automated diagnostic practices, not just what

has faulted but how the fault occurred• Perceived anomalies verses actual faults• Embedded system events• High volume event driven systems

What is EDDY?

• Consolidates events using a simple structure (CER) to enable a high degree of correlation

• Event management environment to collect, disseminate, store and analyze events

• Diagnostic tool platform that exposes the events to enhance and leverage existing tools as well as enable the next generation

Gory Technical Details

Policy

Diagnostic Data LifecycleAccess

Time

Collection Anonymize/Filter

Aggregation Archive Scour

SponsorsThis work has been funded and supported by;• NSF Middleware Initiative (Cooperative

Agreement No. ANI-0330626)

• Internet2

• Carnegie Mellon University

To Now

• Started in Fall 2003: Internet2 mw-e2ed– NSF NMI funding

• Fall 2005: Release 1.0– Early code – experimentation; “touch it”– Models/schema will change (partly the point)

Looking Forward• Other development help (Duke, others)• Expand to other use cases external to CMU

– Email performance and diagnostics– Lionshare, Shibboleth

• Operational/Commercial Interest– Abilene/NLR– IBM, Intel, Cisco, Microsoft– Architecture, standards, reference

implementation, experimental environment

Looking Forward (2)• CMU Campus Adopters – initial use cases

– CS/Cylab – security research• Dragnet – network flows

– Architecture – environment monitoring/control• Environmental event data from many ultra small

devices and embedded systems• Intelligent Workplace you may have toured..

– Computing Services: Systems, Middleware, Network, Security

• Consolidation of application log files, fault analysis• Traffic consumption, network event correlation• Security event correlation and forensics

Looking Forward (3)

• Expanding… (seeking partners/funding)– Mature base technology– Spawn effort for diagnostic application

development– Enable multi-subsystem correlation– Experiment with extending research data flow

analysis into multi-campus; federating/automating some diagnostic data sharing

Addressing Diagnostic Complexity

The EDDY Approach

End-to-end Diagnostic DiscoveryY

Chas DiFatta [email protected] Poepping [email protected]

Joel Smith [email protected]