Post on 07-Aug-2018
8/19/2019 Data Science Methodolgy
© 2015 IBM Corporation
Foundational Data Science Methodology
John B. Rollins, Ph.D.
IBM Analytics | IBM Corporation
Introduction
■ Why we are interested in data science
- Solve problems and answer questions
- Gain useful insights through modeling to predict outcomes or discover underlying patterns
■ Rapidly evolving technologies
- Platform growth
- In-database analytics
- Text analysis
- Automation
Data science methodology
■ Why?
- To provide a guiding strategy
■ What?
- General strategy that guides the processes and activities within a given domain
- Does not depend on particular technologies or tools
- Not a set of techniques or recipes
- Provides the data scientist with a framework for how to proceed to obtain answers
Methodology diagram
[Figure: methodology diagram with the stages Business Understanding, Analytic Approach, Data Requirements, Data Collection, Data Understanding, Data Preparation, Modeling, Evaluation, Deployment, Feedback]
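The stages named in the diagram can be sketched as an ordered cycle. This is only an illustration: the order follows the sequence in which the slides present the stages, and the wrap-around from Feedback back to Business understanding is an assumption made here to show the iterative nature of the process. The names `STAGES` and `next_stage` are hypothetical.

```python
# Stage names, in the order in which the slides present them.
STAGES = [
    "Business understanding",
    "Analytic approach",
    "Data requirements",
    "Data collection",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
    "Feedback",
]


def next_stage(current: str) -> str:
    """Return the stage that follows `current`.

    Wrapping from the last stage to the first is an assumption used
    here to illustrate the iterative cycle.
    """
    index = STAGES.index(current)
    return STAGES[(index + 1) % len(STAGES)]


if __name__ == "__main__":
    # Walk once around the cycle, ending back at the first stage.
    stage = STAGES[0]
    for _ in range(len(STAGES) + 1):
        print(stage)
        stage = next_stage(stage)
```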
Business understanding
■ Every project begins with business understanding.
- Clearly define project objectives and requirements from the business perspective; this is key to a successful solution
- Business sponsors are most critical in this stage
• Define problem and solution requirements
- Business sponsors stay involved throughout the project
• Provide domain expertise
• Review intermediate findings
• Ensure that the work generates the intended solution
Analytic approach
■ With a clear definition of the business problem, we define the analytic approach to solving the problem.
- Express the problem in the context of statistical and machine learning techniques
- Identify suitable technique(s)
- Examples
• Classification to predict response to a promotion ("yes" or "no")
• Clustering for customer segmentation; associations for market basket analysis
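To make the market basket example concrete, here is a minimal sketch of the simplest association measure: the support of each item pair, meaning the fraction of baskets that contain both items. The transactions are toy data invented for illustration, and `pair_support` is a hypothetical helper, not part of any particular tool.

```python
from collections import Counter
from itertools import combinations

# Toy transactions, invented for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
]


def pair_support(baskets):
    """Return the fraction of baskets that contain each pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted() gives each pair a canonical order, e.g. ("bread", "milk").
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)
for pair, value in sorted(support.items(), key=lambda item: -item[1]):
    print(pair, value)
```

Pairs with high support are candidates for association rules; a fuller analysis would also look at measures such as confidence and lift.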
Data compilation
■ The chosen analytic approach determines the data requirements.
- Content, formats, representations
■ Initial data collection is performed.
- Available data resources (structured, unstructured, semi-structured) relevant to the problem domain
- Decide whether to obtain less-accessible data elements
- Revise data requirements or collect more data, if needed
■ Then data understanding is gained.
- Descriptive statistics and visualization
- Content, quality, initial insights about data
- Additional data collection to fill gaps, if needed
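As a rough illustration of the descriptive-statistics step, the sketch below summarizes one field at a time and counts missing values, which gives a first view of content and quality. The records are toy data invented for this example, `None` is assumed to mark a missing value, and `summarize` is a hypothetical helper.

```python
import statistics

# Toy records, invented for illustration; None marks a missing value.
records = [
    {"age": 34, "income": 52000},
    {"age": 45, "income": None},
    {"age": 29, "income": 48000},
    {"age": None, "income": 61000},
    {"age": 52, "income": 75000},
]


def summarize(rows, field):
    """Return count, missing count, mean, median, min, and max for one field."""
    values = [row[field] for row in rows if row[field] is not None]
    return {
        "count": len(values),
        "missing": len(rows) - len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "min": min(values),
        "max": max(values),
    }


for field in ("age", "income"):
    print(field, summarize(records, field))
```

A high missing count for a field is one signal that additional data collection may be needed to fill gaps.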
Data preparation
■ Data preparation encompasses all activities to construct the data set.
- Data cleaning
• Missing or invalid values
• Eliminating duplicate rows
• Formatting properly
- Combining multiple data sources
- Transforming data
- Feature engineering
- Text analysis
■ Accelerate data preparation by automating common steps
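The cleaning steps listed above can be automated in a small, reusable function. The following is a minimal sketch under stated assumptions: the rows and field names are invented for illustration, invalid or missing spend values are mapped to `None` rather than imputed, and `parse_spend` and `prepare` are hypothetical helpers.

```python
# Toy raw rows, invented for illustration only.
raw_rows = [
    {"customer_id": " 001", "city": "new york", "spend": "120.50"},
    {"customer_id": "002", "city": "Boston ", "spend": ""},
    {"customer_id": "001", "city": "New York", "spend": "120.50"},
    {"customer_id": "003", "city": "CHICAGO", "spend": "n/a"},
]

FIELDS = ("customer_id", "city", "spend")


def parse_spend(text):
    """Convert a spend string to float; return None for missing or invalid values."""
    try:
        return float(text)
    except ValueError:
        return None


def prepare(rows):
    """Format fields consistently, mark invalid values, and drop duplicate rows."""
    seen = set()
    cleaned = []
    for row in rows:
        record = (
            row["customer_id"].strip(),          # remove stray whitespace
            row["city"].strip().title(),         # consistent capitalization
            parse_spend(row["spend"].strip()),   # numeric or None
        )
        if record in seen:  # duplicate once formatting is normalized
            continue
        seen.add(record)
        cleaned.append(dict(zip(FIELDS, record)))
    return cleaned


prepared = prepare(raw_rows)
for row in prepared:
    print(row)
```

Formatting is applied before the duplicate check on purpose: the first and third rows only become recognizable as duplicates after whitespace and capitalization are normalized.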
Modeling
■ Modeling focuses on developing models.
- Predictive or descriptive models
- According to the previously defined analytic approach
- Training set for predictive modeling
■ Highly iterative process
- Intermediate insights lead to refinements in data preparation and model specification
- Multiple algorithms and parameters to find the best model for a given technique
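Trying multiple parameters for a given technique might look like the sketch below. It uses k-nearest neighbors purely as an example technique (the slides do not name one), fits on a training set, and compares candidate values of k on a small held-out validation set. The data points are invented, one training point is deliberately mislabeled to show why the parameter matters, and the held-out split is an assumption of this sketch.

```python
from collections import Counter

# Toy labeled points (feature_1, feature_2, label), invented for illustration.
# The last training point is a deliberately noisy "yes" inside the "no" cluster.
train = [
    (1.0, 1.2, "no"), (1.5, 0.8, "no"), (2.0, 1.5, "no"), (1.2, 2.0, "no"),
    (4.0, 4.2, "yes"), (4.5, 3.8, "yes"), (5.0, 4.5, "yes"), (4.2, 5.0, "yes"),
    (2.2, 2.0, "yes"),
]
validation = [
    (1.8, 1.0, "no"), (4.8, 4.0, "yes"), (3.2, 3.1, "yes"), (2.4, 2.2, "no"),
]


def predict(training_rows, point, k):
    """Classify `point` by majority vote among its k nearest training rows."""
    by_distance = sorted(
        training_rows,
        key=lambda row: (row[0] - point[0]) ** 2 + (row[1] - point[1]) ** 2,
    )
    votes = Counter(row[2] for row in by_distance[:k])
    return votes.most_common(1)[0][0]


def accuracy(training_rows, labeled_rows, k):
    """Return the fraction of `labeled_rows` whose label is predicted correctly."""
    hits = sum(predict(training_rows, row[:2], k) == row[2] for row in labeled_rows)
    return hits / len(labeled_rows)


# Try several parameter values and keep the first one with the best score.
scores = {k: accuracy(train, validation, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With k = 1 the noisy training point misleads the model on one validation row, while larger values of k vote it down. In practice the same loop would also range over different algorithms, not just one parameter.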
Model evaluation
■ Model evaluation is performed during model development and before model deployment.
- Understand the model's quality
- Ensure that it properly addresses the business problem
■ Diagnostic measures
- Suitable to the modeling technique used
- Testing set
- Refine model as needed
■ Statistical significance tests
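For a binary classifier such as the promotion-response example, common diagnostic measures on a testing set are accuracy, precision, and recall. The sketch below computes them from a confusion count; the labels and predictions are toy values invented for illustration, and the function names are hypothetical. Other modeling techniques call for other measures.

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for truth, guess in zip(actual, predicted):
        if guess == positive:
            counts["tp" if truth == positive else "fp"] += 1
        else:
            counts["fn" if truth == positive else "tn"] += 1
    return counts


def diagnostics(actual, predicted, positive="yes"):
    """Return accuracy, precision, and recall; undefined ratios are reported as 0.0."""
    c = confusion_counts(actual, predicted, positive)
    total = sum(c.values())
    predicted_pos = c["tp"] + c["fp"]
    actual_pos = c["tp"] + c["fn"]
    return {
        "accuracy": (c["tp"] + c["tn"]) / total if total else 0.0,
        "precision": c["tp"] / predicted_pos if predicted_pos else 0.0,
        "recall": c["tp"] / actual_pos if actual_pos else 0.0,
    }


# Toy testing-set labels and model outputs, invented for illustration.
test_actual = ["yes", "no", "yes", "yes", "no", "no", "yes", "no"]
test_predicted = ["yes", "no", "no", "yes", "no", "yes", "yes", "no"]

report = diagnostics(test_actual, test_predicted)
print(report)
```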
Deployment and feedback
■ Once finalized, the model is deployed into a production environment.
- May be in a limited / test environment until model is proven
- Involves additional groups, skills, and technologies
• Solution owner
• Marketing
• Application developers
• IT administration
■ Feedback to assess model performance
- Gathering and analysis of feedback for assessment of the model's performance and impact
- Iterative process for model refinement and redeployment
- Accelerate through automated processes
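One way to automate part of the feedback loop is a simple check that flags the model for refinement when its recent performance slips. This is a sketch only: the accuracy figures are invented, and the threshold of 0.80 and the three-period window are arbitrary assumptions that a real project would set with its business sponsors. `needs_refinement` is a hypothetical helper.

```python
def needs_refinement(accuracy_by_period, threshold=0.80, window=3):
    """Return True when mean accuracy over the latest `window` periods is below `threshold`."""
    recent = accuracy_by_period[-window:]
    if not recent:  # no feedback gathered yet, so nothing to act on
        return False
    return sum(recent) / len(recent) < threshold


# Toy monthly accuracy figures gathered after deployment, invented for illustration.
monthly_accuracy = [0.91, 0.90, 0.88, 0.84, 0.79, 0.75]

if needs_refinement(monthly_accuracy):
    print("Recent accuracy is below the threshold: schedule refinement and redeployment.")
else:
    print("Model performance is within the accepted range.")
```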
Ongoing value through good methodology
■ The methodology diagram illustrates the iterative nature of problem-solving in a data science project.
■ Through feedback, refinement, and redeployment, models are continually improved and adapted to evolving conditions.
■ The model continues to provide value to the organization for as long as the solution is needed.