Predictive Analytics (PA) in Practice
Steven Finlay
13th September 2016
Objectives & Agenda
• Today I want to discuss some of the practical aspects of using Predictive Analytics (PA).
• In particular, to:
– Provide a business perspective on how Predictive Analytics is applied.
– Highlight some of the risks and issues that one may encounter when designing, building and implementing business solutions that incorporate predictive models developed using predictive analytics.
Agenda topics
1. Recap: What is predictive analytics?
2. Problem formulation (requirements planning)
3. Legal and reputational risks and issues that can arise when using predictive analytics.
4. Q & A
Appendix A. Recommended sources of further information about PA.
1. What is predictive analytics?
[Diagram: data set containing dependent and independent variables → predictive analytics → predictive model]
• Predictive analytics is the process of generating a predictive model via the analysis of a suitable data set.
• A predictive model is the result of the predictive analytics process. The model captures the relationships (correlations) that have been found between the dependent and independent variables.
Predictive model (example scorecard)
Constant: +598
• Term of loan
– ≤ 12 months: +51
– 13 – 17 months: +28
– 18 – 23 months: +9
– 24 – 35 months: 0
– 36 – 47 months: -5
– 48 – 71 months: -19
– 72+ months: -36
• Accommodation status
– Owner: +32
– Renting: -17
– Living at home: 0
• Time at current address
– < 1 year: -68
– 1 – 2 years: -29
– 3 – 5 years: -11
– 6 – 10 years: 0
– 11+ years: +33
• Gross annual income ($)
– 125,000+: +17
– 90,000 – 124,999: +11
– 50,000 – 89,999: 0
– 30,000 – 39,999: -26
– 0 – 29,999: -49
• Number of children
– 0: 0
– 1 – 2: +12
– 3+: 0
• Occupation status
– Full-time employed: +7
– Part-time employed: -22
– Self employed: -9
– Homemaker: -17
– Student: -47
– Unemployed: -84
– Retired: +2
• Time in current employment
– < 1 year: -59
– 1 – 2 years: -23
– 3 – 4 years: -14
– 5 – 7 years: 0
– 6 – 12 years: 0
– 13 – 19 years: +6
– 20+ years: +12
– Not in employment: 0
• Number of previous good paid loans
– 0: -12
– 1: 0
– 2+: +17
Definition of Predictive Analytics: “The application of quantitative techniques to predict future, or otherwise unknown behaviour, of individuals or other entities” (Finlay, 2014)
1. What is predictive analytics?
• The term “Predictive Analytics” (PA) became prominent in the mid-to-late 2000s.
– PA can be considered a sub-set of Data Mining. To some extent, it can be viewed as a rebranding of a range of Data Mining tools, applied to certain types of prediction/forecasting problem.
– Machine learning / AI use many of the same techniques, albeit applied to a slightly different, but overlapping, problem domain.
• Common techniques used in PA include:
– Multivariate regression
– Decision Trees / Classification and Regression Trees (CART)
– (Artificial) Neural Networks (ANNs)
– Support Vector Machines (SVMs)
– Survival Analysis
– Ensemble Methods
1. What is predictive analytics? Common applications.
Application [1] (type of prediction generated by the model): example / description
• Response (choice) modelling (Classification): The probability that someone will respond to a marketing communication.
• Date matching (Classification): Producing lists of people who are likely to enjoy going on dates with each other.
• Credit scoring (Classification): The likelihood that an individual will repay a credit obligation (loan, credit card, mortgage etc.)
• Tax evasion (Classification): The probability that someone is deliberately paying too little tax.
• Transaction fraud (Classification): The probability of a credit card transaction being fraudulent.
• Customer spend / profitability (Regression): The estimated amount that a customer will spend or the profit that a customer will generate.
• Life expectancy (Regression): The expected lifespan of an individual.
• In residence (Regression): The time when someone is expected to be in their home (so best time to call/visit them).
[1] This is only a small number of examples of the application of PA. Eric Siegel lists 127 different applications, and this is not an exhaustive list! (Siegel, 2013)
2. Problem formulation (requirements planning)
2. Problem formulation (requirements):
• Things usually start with a business problem.
• An organisation wants to do something new, or do something more efficiently.
– PA is just one tool which may (or may not) be useful for addressing that problem.
– A predictive model is often just one component of a wider decision making system. E.g. a credit card application processing system (see diagram).
– The model needs to fit within the requirements and constraints of that system.
– Insufficient understanding of the wider business requirement is one of the most common reasons why PA projects fail.
[Diagram: credit card application processing system. Components shown: applicant; customer interface (customer contact centre/website/branch); application processing system; data management unit; external data sources (e.g. credit reference agencies); scoring and strategy unit (decision engine) containing the predictive model. A credit application goes in and a decision comes out. Caption: Predictive modelling drives most credit granting decisions (e.g. cards, loans).]
2. Problem formulation (requirements)
• The need for good planning at the outset of a PA project is no different from other similarly complex projects in other domains.
– You would never dream of asking a programmer to just start coding and expect them to deliver useful operational software.
– You need an architect to design a building before letting the builders loose.
– Just handing over a load of data to a statistician/management scientist, and expecting them to deliver something useful, is a very risky practice.
• In the remainder of this section we are going to look at some of the factors that one should consider before beginning any significant predictive analytics project.
Key requirements to consider:
• Definition of business and modelling objectives.
• Success criteria.
• Outliers and other model exclusions.
• Dealing with missing data.
• Forecast horizon.
• Sampling / sample size.
• Choice of modelling approach.
• Validation methodology.
• Model usage (decision rules).
• How is the model going to be implemented, and by whom?
• Post-implementation monitoring.
Good practice is to write a “requirements document” that contains all of these things, and to obtain sign-off from the key business stakeholders before proceeding further.
2. Problem formulation (requirements): Definition of business and modelling objectives
• Business objectives are often:
– Fuzzy / ill defined
– Qualitative / subjective
– Different depending on who you talk to
• Modelling objective (dependent variable):
– Simple
– Quantitative (a single number)
• Modelling objective (for classification):
– Binary 0/1 value. E.g. Event = 1, Non-event = 0
2. Problem formulation (requirements): Definition of business and modelling objectives
• Imagine that we have been asked to build a model to identify tax evaders.
• Business objective:
– “We want an automated way of screening tax returns to identify cases where the return has been completed incorrectly, resulting in a tax liability that is significantly lower than it should be.”
– “The cases identified will be passed to tax inspectors to investigate.”
– “We would, therefore, like a model that predicts which tax returns are likely to generate significant additional tax yield when subject to detailed investigation.”
• To support the building of the model, the tax authority is willing to make all of its data available.
• However, it has initially provided details of a large number of historic tax investigations and their outcomes (if the tax return was correct or incorrect, and by how much).
• Its other databases are available on request.
2. Problem formulation (requirements): Definition of business and modelling objectives
• Modelling Objective (dependent variable)
– A naïve model developer would now go away and build a model using the data provided, with the dependent variable defined as follows:
• Event (1) = Tax investigation resulting in additional tax liability greater than £0 being identified.
• Non-event (0) = Tax investigation resulting in £0 additional tax liability being identified.
• OK – what might be wrong with this formulation?
• Many things! For example:
– The business requirement referred to “significant” revenue. What does significant mean?
– The data provided only referred to the amount of the discrepancy identified. It did not report the actual amount of tax recovered.
– What about errors resulting in overpayment rather than underpayment?
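The difference between the naïve formulation and one that addresses the points above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record layout, the `recovered` field (which the data provided did not contain and would have to be requested) and the 1,000 threshold for "significant" are all hypothetical, and the cases are made up.

```python
from dataclasses import dataclass


@dataclass
class Investigation:
    """One historic tax investigation (hypothetical record layout)."""
    discrepancy: float  # additional liability identified; negative means overpayment
    recovered: float    # tax actually collected afterwards (assumed to be obtainable)


def naive_target(inv: Investigation) -> int:
    """Event = any additional liability at all, however small."""
    return 1 if inv.discrepancy > 0 else 0


def revised_target(inv: Investigation, threshold: float) -> int:
    """Event = tax actually recovered is at least the agreed 'significant' amount."""
    return 1 if inv.recovered >= threshold else 0


# Made-up cases, chosen only to show where the two definitions disagree.
cases = [
    Investigation(discrepancy=0.0, recovered=0.0),        # return was correct
    Investigation(discrepancy=40.0, recovered=40.0),      # trivial underpayment
    Investigation(discrepancy=12000.0, recovered=0.0),    # large, but never collected
    Investigation(discrepancy=9000.0, recovered=8500.0),  # large and collected
    Investigation(discrepancy=-300.0, recovered=0.0),     # overpayment
]

SIGNIFICANT = 1000.0  # assumed threshold; the business would have to define this

naive = [naive_target(c) for c in cases]
revised = [revised_target(c, SIGNIFICANT) for c in cases]
print(naive)
print(revised)
```

The two definitions disagree on the trivial underpayment and on the large discrepancy that was never collected, which is exactly the kind of case to raise with the business before modelling starts.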
2. Problem formulation (requirements): Success criteria
• In the classroom, model performance is often measured solely (or mainly) in terms of predictive accuracy (the model discriminates well and is unbiased).
– e.g. R-squared, GINI/Somers' D/AUC, MAPE etc.
• These measures are important, but there are often other (business/qualitative) criteria, which must also be considered in practice. In particular:
– How much money will I make / save?!
– Interpretability and common sense; i.e. can I understand how the model makes its predictions, and does the model structure conform to business expectation?
– Model complexity.
– Local performance in one region of the model’s prediction range (as opposed to global performance across the full range of predictions).
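For readers less familiar with the accuracy measures mentioned above, here is a minimal sketch of how AUC and the GINI coefficient relate for a binary classifier (GINI = 2 × AUC − 1). The scores and outcomes are made up for illustration, and the pairwise calculation is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Probability that a randomly chosen event scores higher than a
    randomly chosen non-event (ties count as half)."""
    events = [s for s, y in zip(scores, labels) if y == 1]
    non_events = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for e in events:
        for n in non_events:
            if e > n:
                wins += 1.0
            elif e == n:
                wins += 0.5
    return wins / (len(events) * len(non_events))


def gini(scores, labels):
    """GINI coefficient as commonly used in scoring: 2 * AUC - 1."""
    return 2.0 * auc(scores, labels) - 1.0


# Made-up model scores (higher = more likely to be an event) and outcomes.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

print(auc(scores, labels))   # 0.75
print(gini(scores, labels))  # 0.5
```

Both numbers summarise ranking quality across the whole score range, which is why they say nothing about the business criteria listed above.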
2. Problem formulation (requirements): Success criteria. Interpretability / common sense
• Do models always need to be interpretable?
– No, not always.
• However, business users and industry regulators are often reluctant to use models if they can’t understand how an individual prediction was arrived at.
• It is common to represent linear models (e.g. those derived using linear or logistic regression) in the form of scorecards, so that they are easy for the layperson to understand.
[Example scorecard: the same loan scorecard shown under "1. What is predictive analytics?", starting from a constant of +598.]
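As a rough illustration of why scorecards are easy to explain, the sketch below scores one applicant by adding the constant to the points for each attribute. The points are copied from four characteristics of the example scorecard; the remaining characteristics are left out for brevity, and the dictionary keys, function name and applicant are hypothetical.

```python
CONSTANT = 598

# Points copied from the example scorecard (four characteristics only).
SCORECARD = {
    "accommodation_status": {"Owner": 32, "Renting": -17, "Living at home": 0},
    "occupation_status": {
        "Full-time employed": 7,
        "Part-time employed": -22,
        "Self employed": -9,
        "Homemaker": -17,
        "Student": -47,
        "Unemployed": -84,
        "Retired": 2,
    },
    "number_of_children": {"0": 0, "1-2": 12, "3+": 0},
    "previous_good_paid_loans": {"0": -12, "1": 0, "2+": 17},
}


def score(applicant):
    """Return the total score and the points contributed by each characteristic."""
    total = CONSTANT
    contributions = {}
    for characteristic, points_by_attribute in SCORECARD.items():
        points = points_by_attribute[applicant[characteristic]]
        contributions[characteristic] = points
        total += points
    return total, contributions


# Hypothetical applicant.
applicant = {
    "accommodation_status": "Renting",
    "occupation_status": "Full-time employed",
    "number_of_children": "1-2",
    "previous_good_paid_loans": "2+",
}

total, contributions = score(applicant)
print(total)          # 598 - 17 + 7 + 12 + 17 = 617
print(contributions)
```

Because the score is a plain sum, the `contributions` dictionary is itself the explanation of how the individual prediction was arrived at.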
2. Problem formulation (requirements): Success criteria. Complexity
• A traditional response or credit scoring model, derived using stepwise regression techniques, may have fewer than 20 variables/parameters.
• Some types of predictive models have hundreds of thousands of parameters, based on complex ensemble methods.
– These typically yield a 5-10% improvement in performance (GINI, R-squared etc.) compared to simpler, single models.
• The cost of implementation and maintenance of complex models can be a barrier to their use.
• The classic case is the Netflix prize (Amatriain and Basilico, 2012).
– Hailed as a great success, yielding a 10% improvement in discriminatory ability, yet was not deemed implementable.
• There is also some evidence that the performance of complex models decays more rapidly than that of simple models (Hand, 2006), requiring more frequent cycles of monitoring and redevelopment.
2. Problem formulation (requirements): Success criteria. Local performance
• When we build models, measures of performance are typically global measures covering the entire problem domain.
• However, it's often within specific sub-sets of the problem domain that key decisions need to be made.
• An interesting case is one where I was asked to construct a model of voluntary repossession.
• The model predicted very well overall against industry standards.
– GINI (AUC) 0.84
– 1,250 lift (the model could identify one group of accounts which were 1,250 times more likely to result in voluntary repossession than the least likely group).
• However, all we were interested in was the tail end of the distribution: the very worst 1-2% of cases. The model was not able to provide a sufficient level of discrimination in this region.
• In the end the model was not used.
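One simple way to look at local performance is to examine the highest-risk tail on its own rather than the full score range. The sketch below is an assumption-laden illustration: the data are made up, higher scores are assumed to mean higher risk, and "lift" here means the tail event rate divided by the overall event rate, which is a different (though common) definition from the best-group versus worst-group ratio quoted above.

```python
def event_rate(labels):
    """Share of cases where the event (label 1) occurred."""
    return sum(labels) / len(labels)


def tail_report(scores, labels, tail_percent):
    """Summarise performance in the highest-scoring tail_percent of cases."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    tail_size = max(1, len(ranked) * tail_percent // 100)
    tail_labels = [label for _, label in ranked[:tail_size]]
    overall_rate = event_rate(labels)
    tail_rate = event_rate(tail_labels)
    return {
        "tail_size": tail_size,
        "tail_rate": tail_rate,
        "overall_rate": overall_rate,
        "lift": tail_rate / overall_rate,              # tail rate relative to average
        "captured": sum(tail_labels) / sum(labels),    # share of all events in the tail
    }


# Made-up portfolio: 100 accounts ranked by score, 10 of which are events.
scores = list(range(100, 0, -1))
event_positions = {0, 3, 5, 10, 15, 20, 30, 40, 60, 80}
labels = [1 if i in event_positions else 0 for i in range(100)]

report = tail_report(scores, labels, tail_percent=2)
print(report)
```

In this toy example the model ranks events towards the top overall, yet the worst 2% of accounts contains only one of the ten events, so decisions made purely within that tail would miss most of them.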
3. Legal and reputational risks/issues that can arise when using predictive analytics.
3. Legal and reputational risks/issues
• The most prominent use of predictive analytics is to predict the behaviour of individuals, using information that is known about them.
• Personal data is the key ingredient in the predictive analytics process when deriving models that predict individual behaviour.
• The general use (and misuse) of personal data is the subject of increasing regulation in many regions.
• If I follow all laws and regulations, then that’s all I need to worry about, right?
– This may prevent you being taken to court, but won’t necessarily prevent bad publicity.
Source: BBC
3. Legal and reputational risks/issues
• One of the most significant problems is bias:
– +VE: A predictive model that has been designed correctly won’t display unjustified bias.
– -VE: The very nature of predictive models means that bias will exist. That’s why they work!
• But it’s “evidential bias” based on statistical evidence, so that’s OK?
• Maybe, but there are still situations where certain types of bias are illegal, and if you make decisions using the output of models that display these biases, then you may be breaking the law.
• Even when legal, the fact that a model displays certain biases can be controversial and may result in reputational damage that more than outweighs any benefits brought by the model in terms of improved decision making.
3. Legal and reputational risks/issues
• The easiest type of (illegal) bias to avoid is direct bias; i.e. to ensure certain variables do not feature in certain types of model. For example, gender, race and religion in credit scoring models. However, indirect bias can still exist.
• If income is included in a predictive model, then men and women will be treated differently, despite gender not explicitly being included.
• Likewise for occupation (e.g. primary school teachers and engineers).
• Things are also not universal.
– Gender is OK for marketing models, but not for insurance models.
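Indirect bias of this kind can be checked for by comparing model outputs across a protected group, even though the group is not a model input. The sketch below is a toy illustration only: the applicants are made up, the income points are taken from the example scorecard, and gender is held purely for auditing and never used for scoring.

```python
from statistics import mean

# Income points by band, copied from the example scorecard.
INCOME_POINTS = {
    "0-29,999": -49,
    "30,000-39,999": -26,
    "50,000-89,999": 0,
    "90,000-124,999": 11,
    "125,000+": 17,
}

# Made-up applicants; gender is recorded for the audit only.
applicants = [
    {"gender": "F", "income_band": "0-29,999"},
    {"gender": "F", "income_band": "30,000-39,999"},
    {"gender": "F", "income_band": "50,000-89,999"},
    {"gender": "M", "income_band": "30,000-39,999"},
    {"gender": "M", "income_band": "50,000-89,999"},
    {"gender": "M", "income_band": "90,000-124,999"},
]


def mean_points_by_group(records, group_key):
    """Average income points awarded to each group defined by group_key."""
    groups = {}
    for record in records:
        points = INCOME_POINTS[record["income_band"]]
        groups.setdefault(record[group_key], []).append(points)
    return {group: mean(points) for group, points in groups.items()}


by_gender = mean_points_by_group(applicants, "gender")
gap = by_gender["M"] - by_gender["F"]
print(by_gender)
print(gap)
```

A gap like this does not by itself show that the model is unlawful or unfair; it simply makes visible that a variable such as income can carry information about gender into the score.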
Source: BBC
How to assess the potential risk associated with using different data within a predictive model?
What data to use when?
• Age • Alcohol consumption • Credit history • Criminal records • Dependents • DNA • Driving speed • Education • Gas consumption • Gender • Grocery purchases at supermarket • Income
• Last book purchased • Live with smoker (Y/N) • Marital status • Medical history • Music currently listening to • Race • Religion • Sexual orientation • Smoker (Y/N) • Type of car you drive
Immutability of data
[Chart: data items placed on a scale from "Immutable (individual can’t change at all)" to "Mutable (individual can change easily)". The positions on the scale are not recoverable here. Items shown: Age, Alcohol consumption, Income, Criminal record, Gas consumption, Education, Gender, Grocery purchases, Last book purchased, Live with smoker, Marital status, Medical history, Dependents, Race, Religion, Music currently listening to, Sexual orientation, Smoker, Type of car, Driving speed, DNA.]
Beneficiary
For whose benefit is a decision made? (This is not the same thing as whether the individual benefits from the decision.)
[Chart: decisions placed on a scale from "Individual / society" to "Decision maker". The positions on the scale are not recoverable here. Decisions shown: Treatment for illness, Selection for tax inspection, Product marketing, Benefit payment, Foreclosure, Match on dating site, Credit granting, Child protection, Insurance pricing, Suspect selection in criminal cases, Making job offers, Redundancy selection, Home improvement grants, Parole, Survey selection.]
Impact
What is the potential impact of decisions on an individual’s well-being?
[Chart: decisions placed on a scale from "Low impact" to "High impact". The positions on the scale are not recoverable here. Decisions shown: Treatment for illness, Selection for tax inspection, Product marketing, Benefit payment, Foreclosure, Match on dating site, Credit granting, Child protection, Insurance pricing, Suspect selection in criminal cases, Making job offers, Redundancy selection, Home improvement grants, Parole, Survey selection.]
Bringing it all together
Ethical challenge / risk, running from greatest (first row) to least (last row):
Impact of decision on individual; Beneficiary of the decision; Immutability of data used
– High; Decision maker; High (greatest)
– High; Decision maker; Low
– High; Individual; High
– High; Individual; Low
– Low; Decision maker; High
– Low; Decision maker; Low
– Low; Individual; High
– Low; Individual; Low (least)
Towards the greatest-risk end (e.g. foreclosure, redundancy, parole):
• More legislation
• Audit & regulatory oversight
• Public interest
• Greater manual involvement
• Simple and explicable models
• Judgemental overriding
• Expert “buy-in”
• Understand model weaknesses
• Constant monitoring
Towards the least-risk end (e.g. marketing type applications):
• Less legislation
• Predictive ability trumps all else
• Complex “black box” models
• Automated model generation
• Rapid redevelopment of models
• Little oversight
Appendix A: Sources of further information
The following are some of the primary internet resources for PA and related technologies (Big Data, data mining, machine learning, etc.)
• Operational Database Management Systems. http://www.odbms.org/ This site is supported by a range of industry experts. It covers a wide range of topics relating to the implementation and application of new technologies associated with predictive analytics, cloud computing and Big Data, amongst other things.
• KDnuggets. http://www.kdnuggets.com/ This is one of the most popular sites providing resources for data scientists.
• AnalyticBridge. http://www.analyticbridge.com/ AnalyticBridge hosts a range of articles, blogs and discussion forums about predictive analytics that are open to all. There is a broad range of topics covered, from the strategic to the very technical / operational.
• LinkedIn. http://www.linkedin.com/ There are several forums on LinkedIn that discuss predictive analytics and related topics.
• StatSoft. http://www.statsoft.com/Textbook. This is a website managed by Dell, providers of the STATISTICA statistical software package. If you want to know more about a wide range of statistical methods, including those used in predictive analytics, then this is a useful site to refer to.
Business / non-technical books
• Davenport, T., Kim, J. (2013). Keeping Up with the Quants: Your Guide to Understanding and Using Analytics. Harvard Business Review Press. Davenport was one of the first people to write an accessible analytics text in his 2006 book, Competing on Analytics. This new book is written specifically for non-technical managers to help them understand and work with technically minded people who do predictive analytics.
• Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data. Myths, Misconceptions and Methods. Palgrave Macmillan. This is one of my books. Primarily it’s a book about predictive analytics, but it also provides a brief introduction to Big Data. The main focus is on practical issues around the development and implementation of predictive models.
• Siegel, E. (2016). Predictive Analytics: the Power to Predict Who Will Click, Buy, Lie, or Die. 2nd Edition. Wiley. Very much a Marmite book. You’ll either love it or hate it, but it’s the book that brought predictive analytics to the attention of a much wider audience than ever before.
• Silver, N. (2012). The Signal and the Noise: Why So Many Predictions Fail. Penguin. This is not really a predictive analytics book. However, what is relevant is the focus on understanding why so many forecasting systems fail. It discusses why more attention needs to focus on the weaknesses and pitfalls of forecasting and prediction, so as to improve the quality of forecasting models in the future.
More academic / theory focused books
• Baesens, B. (2014). Analytics in a Big Data World: The Essential Guide to Data Science and its Applications. Wiley. This book describes the key stages involved in developing a predictive model. A good read for those with a little bit of mathematical and/or statistical knowledge, but you don’t need a higher degree in mathematics or statistics to understand the concepts that Baesens puts forward.
• Bishop, C. M. (2007). Pattern Recognition and Machine Learning (Information Science and Statistics). Springer. This book covers a lot of the theoretical material underpinning many of the tools commonly used for data mining and predictive analytics.
• Bishop, C. M. (1995). Neural Networks for Pattern Recognition. Clarendon Press. It has been more than twenty years since its original publication, yet this remains one of the few definitive guides to the theory and application of neural networks.
• Hastie, T., Tibshirani, R. and Friedman, J. (2011). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition. Springer. A detailed and technical guide to many of the data mining tools used in predictive analytics, written by three of the world’s leading academics in the field.
• Hosmer, D. and Lemeshow, S. (2013). Applied Logistic Regression (Wiley Series in Probability and Statistics). 3rd Edition. Wiley. Logistic regression remains one of the most popular and widely used methods for generating predictive models. This is the main book I recommend to people who want to know more about this method.
More academic / theory focused books (Cont.)
• Kuhn, M. and Johnson, K. (2013). Applied Predictive Modeling. Springer. Another well-constructed book in a similar vein to Baesens (above). It combines practical advice with the more mathematical aspects of the subject.
• Linoff, G. S. and Berry, M. J. (2011). Data Mining Techniques: For Marketing, Sales, and Customer Relationship Management. 3rd Edition. Wiley. This is a broad, well-rounded, and not overtly technical book that describes the most popular data mining techniques applied to direct marketing.
• Witten, I. H., Frank, E. and Hall, M. A. (2011). Data Mining: Practical Machine Learning Tools and Techniques, 3rd Edition (The Morgan Kaufmann Series in Data Management Systems). Morgan Kaufmann. This is a detailed reference manual for those interested in practical data mining. I found it provided a nice blend of theory and practice, with many good examples.
Bibliography
• Amatriain X and Basilico J. (2012). The NetFlix TechBlog: http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html
• Baraniuk, C. (8/9/2016). “LinkedIn denies gender bias claim over site search” BBC http://www.bbc.co.uk/news/technology-37306828
• BBC. (23/8/2016) “Gender pay gap: Why do mums increasingly earn less?” http://www.bbc.co.uk/news/business-37167610
• Finlay, S. (2014). Predictive Analytics, Data Mining and Big Data. Myths, Misconceptions and Methods. Palgrave Macmillan.
• Hand, D. J. (2006). Classifier technology and the illusion of progress. Statistical Science 21(1): 1-15.
• Siegel, E. (2013). Predictive Analytics: the Power to Predict Who Will Click, Buy, Lie, or Die. Wiley.