The Inductive Software Engineering Manifesto: Principles for Industrial Data Mining
Paper Authored By:
Menzies & Kocaganeli – Lane Dept of CS/EE, WVU
Bird, Zimmermann, & Schulte – Microsoft Research
Presentation By: Ebeid Soliman & Mason Schoolfield
Motivation
• This paper is a reflection of the authors’ applied data mining work and their discussions with researchers and software engineering practitioners.
• Document methods and experience from industrial practitioners
• The principal question is: what characterizes the difference between academic and industrial data mining?
• Motivation: Successful data-mining projects in industry
Inductive Software Engineering
• “A branch of software engineering that focuses on the delivery of data mining based software applications to users”
• Understand user goals to inductively generate the models that most matter to the user
• Industrial practitioners are focused on users, whereas academic data mining research is focused on algorithms
Industrial Data Mining: 7 Principles
• Users before algorithms
• Plan for scale
• Early feedback
• Be open-minded
• Do smart learning
• Live with the data you have
• Broad skill set, big toolkit
Users before algorithms
• Guiding Principle – Users Before Algorithms
• Mining algorithms are only useful in industry if users fund their use in real-world applications
Users before Algorithms
Hallmarks of good interaction meetings
• Users bring senior management to the meetings
• Users keep interrupting (you or each other) and debating your results
• This indicates that the users understand your explanation of the results
• It also indicates that your results are touching on issues that concern them
• Users begin to offer more data sources for analysis
• Users invite you to their workspace to show how to do part of the analysis
Plan for scale
Knowledge Discovery in Databases (KDD)
• KDD – Knowledge Discovery In Databases
• The non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data
• Repetition required
Figure: Steps that compose the KDD process (Fayyad 1996)
Plan for scale
• Most data mining is data pre-processing
• Gaining access to databases in business groups is time consuming
• To ensure repeatability automate as many KDD steps as possible
• Data mining methods are repeated multiple times, in order to:
• Answer user questions
• Enhance the data mining method or fix bugs
• Deploy to different user groups
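One way to read the automation advice is to script each KDD step as a function and chain the functions, so that a rerun for a new user question, a bug fix, or a new user group is a single call. The sketch below is illustrative only: the step names loosely follow the selection, preprocessing, transformation, and mining stages of the KDD process, while the toy data, the column names (`loc`, `defects`), and every function name are assumptions and do not come from the paper.

```python
from statistics import mean

def select(rows, columns):
    """Selection: keep only the columns of interest."""
    return [{c: r[c] for c in columns} for r in rows]

def preprocess(rows):
    """Preprocessing: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def transform(rows, column, threshold):
    """Transformation: add a boolean flag derived from a numeric column."""
    return [{**r, "large": r[column] > threshold} for r in rows]

def mine(rows, target):
    """Mining: a deliberately trivial 'model', the mean target per group."""
    groups = {}
    for r in rows:
        groups.setdefault(r["large"], []).append(r[target])
    return {flag: mean(values) for flag, values in groups.items()}

def run_pipeline(rows, threshold=100):
    """Run every step in order; rerunning after a change is one call."""
    data = select(rows, ["loc", "defects"])
    data = preprocess(data)
    data = transform(data, "loc", threshold)
    return mine(data, "defects")

# Hypothetical toy data: module size (loc) and defect counts.
modules = [
    {"name": "a", "loc": 50, "defects": 1},
    {"name": "b", "loc": 200, "defects": 4},
    {"name": "c", "loc": 300, "defects": 6},
    {"name": "d", "loc": None, "defects": 2},
    {"name": "e", "loc": 80, "defects": 3},
]
result = run_pipeline(modules)
print(result)
```

Because every step is a plain function, changing the threshold or swapping the mining step does not require redoing the earlier steps by hand.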
Plan for scale
• Observed Phases
• Scout - rapid prototyping, apply many methods to data, explore range of hypotheses, gain user interest (get feedback)
• Survey - experiment to find stable models - focusing on user goals
• Build - integrate models into a deployment framework – suitable for target user base
• Team size doubles after scouting, doubles after surveying – time implications!
Early feedback
• Simplicity first: before conducting very elaborate studies, try applying very simple tools to gain rapid early feedback
• Get Feedback Early and Often
• Discretize continuous attributes (determine what is ignorable)
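As a minimal sketch of the "simple tools first" idea, the code below discretizes a continuous attribute into equal-width bins and then tabulates class labels per bin. Equal-width binning is just one simple choice, picked here for illustration; the function names and the toy data are assumptions. An attribute whose bins all show roughly the same class mix is a candidate for being ignored.

```python
from collections import Counter, defaultdict

def equal_width_bins(values, n_bins=3):
    """Map each numeric value to a bin index 0..n_bins-1 using equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

def class_counts_per_bin(bins, labels):
    """Count how often each class label occurs in each bin."""
    table = defaultdict(Counter)
    for b, y in zip(bins, labels):
        table[b][y] += 1
    return {b: dict(counts) for b, counts in sorted(table.items())}

# Hypothetical toy data: a numeric attribute and a class label per row.
sizes = [1, 2, 5, 9, 10]
labels = ["ok", "ok", "ok", "bug", "bug"]
bins = equal_width_bins(sizes)
table = class_counts_per_bin(bins, labels)
print(bins)
print(table)
```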
Be open-minded
• Avoid a fixed hypothesis
• Avoid a fixed approach, particularly for data that has not been mined before
• Initial results are important and can change goals
Smart Learning
• Inductive agents, human or otherwise, make errors
• Don’t torture the data to meet preconceptions, but it can be ok to go “fishing”
• Important outcomes are riding on your conclusions - check & validate!
• Check the variance before drawing a conclusion, since it may be based on statistical noise
• Check conclusion stability against different sample sizes
• Check conclusion support to avoid conclusions based on a small percent of the data
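The three checks above (variance, stability across sample sizes, and support) can be sketched with repeated random subsampling. This is an assumed, simplified setup, not the authors' procedure: the "conclusion" is the difference in mean outcome between flagged and unflagged rows, the data is synthetic, and all names are hypothetical.

```python
import random
from statistics import mean, stdev

def effect(sample):
    """Difference in mean outcome between flagged and unflagged rows."""
    flagged = [y for flag, y in sample if flag]
    rest = [y for flag, y in sample if not flag]
    if not flagged or not rest:
        return None  # effect is undefined when one group is empty
    return mean(flagged) - mean(rest)

def stability(rows, sizes, repeats=30, seed=1):
    """For each sample size, return (mean effect, spread) over random subsamples."""
    rng = random.Random(seed)
    report = {}
    for n in sizes:
        effects = [effect(rng.sample(rows, n)) for _ in range(repeats)]
        effects = [e for e in effects if e is not None]
        report[n] = (mean(effects), stdev(effects)) if len(effects) >= 2 else None
    return report

def support(rows, condition):
    """Fraction of rows that the conclusion is actually based on."""
    return sum(1 for r in rows if condition(r)) / len(rows)

# Synthetic data (an assumption): every fourth row is flagged, with a higher outcome.
data_rng = random.Random(0)
rows = [(i % 4 == 0, (5.0 if i % 4 == 0 else 3.0) + data_rng.gauss(0, 1))
        for i in range(200)]
report = stability(rows, sizes=[20, 50, 100])
flag_support = support(rows, lambda r: r[0])
for n, summary in report.items():
    print(n, summary)
print("support:", flag_support)
```

A conclusion whose mean effect changes sign or whose spread stays large as the sample grows, or whose support is only a small percent of the rows, deserves more scrutiny before it is reported.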
Smart Learning
• Prevent spurious conclusions by carefully controlling data collection and focusing on a small space of hypotheses (IF YOU CAN)
• Rule learners such as RIPPER and INDUCT check each rule against randomly generated alternatives (if the probabilities are the same, the rule can be deleted)
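The general idea of comparing a rule with random alternatives can be sketched as below. This is not the actual procedure of RIPPER or INDUCT; it is a simplified, permutation-style check under assumed names and toy data. It estimates how often a random selection of the same number of rows is at least as precise as the rule. A value near 0 suggests the rule beats chance, while a large value suggests the rule is a candidate for deletion.

```python
import random

def rule_precision(rows, rule):
    """Return (fraction of positives among covered rows, number of covered rows)."""
    covered = [label for features, label in rows if rule(features)]
    if not covered:
        return 0.0, 0
    return sum(covered) / len(covered), len(covered)

def chance_of_matching(rows, rule, trials=1000, seed=1):
    """Fraction of random same-size selections that are at least as precise as the rule."""
    rng = random.Random(seed)
    precision, k = rule_precision(rows, rule)
    if k == 0:
        return 1.0  # a rule that covers nothing is no better than chance
    labels = [label for _, label in rows]
    hits = 0
    for _ in range(trials):
        if sum(rng.sample(labels, k)) / k >= precision:
            hits += 1
    return hits / trials

# Hypothetical toy data: modules with loc >= 80 are defective (label 1).
rows = [({"loc": i}, 1 if i >= 80 else 0) for i in range(100)]
strong = chance_of_matching(rows, lambda f: f["loc"] >= 80)
weak = chance_of_matching(rows, lambda f: f["loc"] % 2 == 0)
print("strong rule:", strong)
print("weak rule:", weak)
```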
Live with the data you have
• Collecting data comes at a cost!
• Go mining with the data you have, not the data you hope to have at a later date
• Remove spurious data - conduct instance or feature selection studies
• Often, 80 to 90% of the rows and all but the square root of the number of columns can be deleted before the performance of the learned model is compromised
• Be respectful but doubtful of all user-suggested domain hypotheses
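The instance and feature selection bullets above can be illustrated with a small sketch: keep a random 20% of the rows, then keep only about the square root of the number of columns. Ranking columns by absolute correlation with the target is an assumption made for this example (the slides do not name a ranking method), as are the synthetic data and all function and column names.

```python
import math
import random

def sample_rows(rows, keep=0.2, seed=1):
    """Instance selection: keep a random fraction of the rows."""
    rng = random.Random(seed)
    k = max(1, round(len(rows) * keep))
    return rng.sample(rows, k)

def abs_correlation(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return abs(cov / (sx * sy)) if sx and sy else 0.0

def top_sqrt_columns(rows, target):
    """Feature selection: keep the ceil(sqrt(n)) columns most correlated with target."""
    columns = [c for c in rows[0] if c != target]
    ys = [r[target] for r in rows]
    ranked = sorted(columns,
                    key=lambda c: abs_correlation([r[c] for r in rows], ys),
                    reverse=True)
    return ranked[: math.ceil(math.sqrt(len(columns)))]

# Synthetic data (an assumption): 9 columns, only f0 and f1 drive the target.
data_rng = random.Random(0)
rows = []
for _ in range(100):
    row = {f"f{i}": data_rng.gauss(0, 1) for i in range(9)}
    row["defects"] = 5 * row["f0"] + row["f1"] + data_rng.gauss(0, 0.5)
    rows.append(row)

small = sample_rows(rows, keep=0.2)
kept = top_sqrt_columns(small, target="defects")
print(len(small), "rows kept; columns kept:", kept)
```

Whether the reduced data still supports a good model has to be checked by re-running the learner on it; the sketch only shows the pruning step.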
Broad skill set, big toolkit
• Try multiple inductive technologies
• Inductive Engineers generate novel and insightful feedback for users
• Researchers can work to perfect a single algorithm
• Big ecology: Use tools supported by a large ecosystem of developers who are constantly building new modules (e.g. R, WEKA, MATLAB)
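As a small illustration of trying more than one inductive technology on the same data, the sketch below compares two very simple learners with leave-one-out evaluation. Both learners, the evaluation choice, and the toy data are assumptions chosen to keep the example self-contained. In practice the candidates would come from a large toolkit such as those named above.

```python
from collections import Counter

def majority_learner(train):
    """Baseline: always predict the most common label in the training data."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def nearest_neighbor_learner(train):
    """1-nearest-neighbor using squared Euclidean distance."""
    def predict(x):
        nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
        return nearest[1]
    return predict

def leave_one_out_accuracy(learner, rows):
    """Train on all rows but one, test on the held-out row, and average the hits."""
    hits = 0
    for i, (x, y) in enumerate(rows):
        model = learner(rows[:i] + rows[i + 1:])
        hits += model(x) == y
    return hits / len(rows)

# Hypothetical toy data: (features, label) pairs.
rows = [
    ((1.0, 1.0), "low"), ((1.2, 0.9), "low"), ((0.8, 1.1), "low"), ((0.9, 0.8), "low"),
    ((5.0, 5.0), "high"), ((5.2, 4.8), "high"),
]
learners = {"majority": majority_learner, "1-nn": nearest_neighbor_learner}
results = {name: leave_one_out_accuracy(fn, rows) for name, fn in learners.items()}
print(results)
```

Keeping a trivial baseline in the comparison makes it easier to tell users how much a more elaborate learner actually adds.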
What does this mean for Industry?
• Implications for Project Management
• Scouting takes weeks, Surveying takes months, and Building takes years
• Implications for Training
• Communications skills
• Results briefing
• Scripting
Research to help Industry
• Research themes to benefit industrial data mining
• Analysis patterns for inductive engineers (like design patterns for developers)
• Design patterns for data miners
• Optimizations of learning algorithms
• Anomaly detectors
• Business-aware learners
Final Notes
• Conclusion – Be user-focused, keep these principles in mind
• Hopefully these generalities will be helpful
• Share your experiences and knowledge so that Industrial Inductive Engineering can mature