
The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining

Paper Authored By: Menzies & Kocaguneli – Lane Dept of CS/EE, WVU; Bird, Zimmermann, & Schulte – Microsoft Research

Presentation By: Ebeid Soliman & Mason Schoolfield

Motivation

• This paper reflects the authors’ applied data mining work and their discussions with researchers and software engineering practitioners.

• Document methods and experience from industrial practitioners

• The principal question is: what characterizes the difference between academic and industrial data mining?

• Motivation: Successful data-mining projects in industry

Inductive Software Engineering

• “A branch of software engineering that focuses on the delivery of data mining based software applications to users”

• Understand user goals to inductively generate the models that most matter to the user

• Industrial practitioners are focused on users, whereas academic data mining research is focused on algorithms

Industrial Data Mining: 7 Principles

• Users before algorithms

• Plan for scale

• Early feedback

• Be open-minded

• Do smart learning

• Live with the data you have

• Broad skill set, big toolkit

Users before algorithms

• Guiding Principle – Users Before Algorithms

• Mining algorithms are only good if users fund their use in real-world applications

Users before Algorithms

Hallmarks of good interaction meetings

• Users bring senior management to the meetings

• Users keep interrupting (you or each other) and debating your results

• Indicates the users understand your explanation of the results

• Your results are touching on issues that concern them

• Users begin to offer more data sources for analysis

• Users invite you to their workspace to show how to do part of the analysis

Plan for scale

Knowledge Discovery in Databases (KDD)

• KDD – Knowledge Discovery In Databases

• The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

• Repetition required

[Figure: steps that compose the KDD process (Fayyad 1996)]

Plan for scale

• Most data mining is data pre-processing

• Gaining access to databases in business groups is time consuming

• To ensure repeatability automate as many KDD steps as possible

• Data mining methods are repeated multiple times

• Answer user questions

• Enhance the data mining method or fix bugs

• Deploy to different user groups
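The advice to automate as many KDD steps as possible can be illustrated with a minimal sketch, assuming each step is a plain function from rows to rows. The step names (`select`, `preprocess`, `transform`) and the fields `module`, `loc`, and `defects` are hypothetical, chosen only for illustration; they do not come from the paper.

```python
# Minimal sketch: KDD steps as composable functions, so the whole chain
# can be re-run unchanged whenever the data, the question, or the user group changes.
# All field names below are hypothetical examples.

def select(rows):
    """Selection: keep rows that actually record the target value."""
    return [r for r in rows if r.get("defects") is not None]

def preprocess(rows):
    """Pre-processing: drop rows with a missing size measurement."""
    return [r for r in rows if r.get("loc") is not None]

def transform(rows):
    """Transformation: add a derived attribute (defects per 1000 lines)."""
    return [
        {**r, "defects_per_kloc": 1000 * r["defects"] / r["loc"]}
        for r in rows
        if r["loc"] > 0
    ]

def run_pipeline(rows, steps):
    """Apply each step in order; repeating the analysis is one function call."""
    for step in steps:
        rows = step(rows)
    return rows

PIPELINE = [select, preprocess, transform]

raw = [
    {"module": "a", "loc": 2000, "defects": 4},
    {"module": "b", "loc": None, "defects": 1},
    {"module": "c", "loc": 500, "defects": None},
]
result = run_pipeline(raw, PIPELINE)
print(result)
```

Because the steps live in one list, answering a new user question or fixing a bug means editing one function and re-running the same script.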

Plan for scale

• Observed Phases

• Scout - rapid prototyping, apply many methods to data, explore range of hypotheses, gain user interest (get feedback)

• Survey - experiment to find stable models - focusing on user goals

• Build - integrate models into a deployment framework – suitable for target user base

• Team size doubles after scouting, doubles after surveying – time implications!

Early feedback

• Simplicity first: before conducting very elaborate studies, try applying very simple tools to gain rapid early feedback

• Get Feedback Early and Often

• Discretize continuous attributes (determine what is ignorable)
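Discretization is one of the simple tools that can give early feedback. The sketch below assumes equal-width binning, which is only one of several possible schemes and is not prescribed by the paper; the function name `equal_width_bins` and the sample values are hypothetical.

```python
# Minimal sketch of equal-width discretization (an assumed scheme, not the paper's).

def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index in 0..n_bins-1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant attribute carries no information: every value lands in bin 0.
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # The maximum value would fall just past the last bin, so clamp it.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

sizes = [1, 2, 3, 10, 11, 12, 20, 21, 22]
print(equal_width_bins(sizes))
```

Once an attribute is reduced to a few bins, it is easier to show users a small table per bin and to spot attributes whose bins all behave the same way, which are candidates for being ignored.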

Be open-minded

• Avoid a fixed hypothesis

• Avoid a fixed approach, particularly for data that has not been mined before

• Initial results are important and can change goals

Smart Learning

• Inductive agents, human or otherwise, make errors

• Don’t torture the data to meet preconceptions, but it can be ok to go “fishing”

• Important outcomes are riding on your conclusions - check & validate!

• Check the variance before concluding; the result may be based on statistical noise

• Check conclusion stability against different sample sizes

• Check conclusion support to avoid conclusions based on a small percent of the data
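The three checks above (variance, stability across sample sizes, support) can be sketched with the standard library alone. This is one possible approach under stated assumptions: bootstrap resampling stands in for "check the variance", and the helper names `bootstrap_means` and `support` and the toy data are hypothetical.

```python
import random
import statistics

def bootstrap_means(sample, n_rounds=200, seed=1):
    """Resample with replacement and collect the mean of each resample."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_rounds):
        resample = [rng.choice(sample) for _ in sample]
        means.append(statistics.mean(resample))
    return means

def support(rows, condition):
    """Fraction of rows that a conclusion actually rests on."""
    return sum(1 for r in rows if condition(r)) / len(rows)

data = list(range(1, 101))               # toy measurements
small = random.Random(7).sample(data, 10)

# Variance and stability: compare the spread of the estimate at two sample sizes.
spread_small = statistics.stdev(bootstrap_means(small))
spread_full = statistics.stdev(bootstrap_means(data))

# Support: how much of the data does the condition "value > 90" cover?
rule_support = support(data, lambda v: v > 90)

print(spread_small, spread_full, rule_support)
```

A conclusion whose estimate swings widely between resamples, changes with sample size, or rests on a small fraction of the rows deserves more validation before it is reported.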

Smart Learning

• Prevent spurious conclusions by carefully controlling data collection and focusing on a small space of hypotheses (IF YOU CAN)

• Rule learners – RIPPER and INDUCT check against randomly generated alternatives (if probabilities are the same you can delete the rule)
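The idea of checking a rule against random alternatives can be illustrated with a simple label-shuffling check. This is not the actual RIPPER or INDUCT procedure; it is a swapped-in permutation test, and the names `rule_accuracy`, `beats_random`, and the `loc` field are hypothetical.

```python
import random

def rule_accuracy(rows, labels, rule):
    """Fraction of rows matched by the rule whose label is positive (1)."""
    matched = [lab for row, lab in zip(rows, labels) if rule(row)]
    return sum(matched) / len(matched) if matched else 0.0

def beats_random(rows, labels, rule, n_shuffles=500, seed=1):
    """Return the observed accuracy and the share of label shuffles doing as well."""
    rng = random.Random(seed)
    observed = rule_accuracy(rows, labels, rule)
    shuffled = list(labels)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)
        if rule_accuracy(rows, shuffled, rule) >= observed:
            at_least_as_good += 1
    return observed, at_least_as_good / n_shuffles

rows = [{"loc": 100 * i} for i in range(1, 41)]
labels = [1 if r["loc"] > 2000 else 0 for r in rows]

# A rule that tracks the labels versus a rule that matches everything.
acc_good, p_good = beats_random(rows, labels, lambda r: r["loc"] > 2000)
acc_useless, p_useless = beats_random(rows, labels, lambda r: True)
print(acc_good, p_good, acc_useless, p_useless)
```

If random labelings do as well as the real ones, as with the match-everything rule here, the rule adds nothing and can be deleted.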

Live with the data you have

• Collecting data comes at a cost!

• Go mining with the data you have, not the data you hope to have at a later date

• Remove spurious data - conduct instance or feature selection studies

• 80 to 90% of rows and all but the square root of columns can be deleted before compromising performance of the learned model

• Be respectful but doubtful of all user-suggested domain hypotheses
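The row and column pruning figures above can be turned into a small sketch. The slide does not say how rows or columns are chosen, so this example makes two loud assumptions: columns are ranked by absolute Pearson correlation with the target, and rows are sampled at random. The function names and the toy table are hypothetical.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation; returns 0.0 for a constant column."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy) if vx and vy else 0.0

def prune(columns, target, row_keep=0.2, seed=1):
    """Keep about sqrt(#columns) columns and a row_keep fraction of rows."""
    n_cols_kept = max(1, round(math.sqrt(len(columns))))
    # Assumption: rank columns by |correlation| with the target.
    ranked = sorted(columns, key=lambda c: abs(pearson(columns[c], target)), reverse=True)
    kept_cols = ranked[:n_cols_kept]
    n_rows = len(target)
    n_rows_kept = max(1, round(row_keep * n_rows))
    # Assumption: rows are chosen by uniform random sampling.
    kept_rows = sorted(random.Random(seed).sample(range(n_rows), n_rows_kept))
    pruned = {c: [columns[c][i] for i in kept_rows] for c in kept_cols}
    return pruned, [target[i] for i in kept_rows]

noise = random.Random(3)
target = [float(i) for i in range(50)]
table = {f"noise{k}": [noise.random() for _ in target] for k in range(8)}
table["signal"] = list(target)           # one column that tracks the target

pruned, pruned_target = prune(table, target)
print(sorted(pruned), len(pruned_target))
```

With 9 columns and 50 rows, this keeps 3 columns and 10 rows. Whether the learned model survives such pruning still has to be checked on the real data.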

Broad skill set, big toolkit

• Try multiple inductive technologies

• Inductive Engineers generate novel and insightful feedback for users

• Researchers can work to perfect a single algorithm

• Big ecology: Use tools supported by a large ecosystem of developers who are constantly building new modules (e.g. R, WEKA, MATLAB)

What does this mean for Industry?

• Implications for Project Management

• Scouting takes weeks, Surveying takes months, and Building takes years

• Implications for Training

• Communications skills

• Results briefing

• Scripting

Research to help Industry

• Research themes to benefit industrial data mining

• Analysis patterns for inductive engineers (like design patterns for developers)

• Design pattern for data miners

• Optimizations of learning algorithms

• Anomaly detectors

• Business-aware learners

Final Notes

• Conclusion – Be user-focused, keep these principles in mind

• Hopefully these generalities will be helpful

• Share your experiences and knowledge so that Industrial Inductive Engineering can mature
