Profile detection: applications in healthcare and econometrics (Geissler)
7 July 2014
Christophe Geissler, Quinten, IAF.
PROFILE DETECTION: APPLICATIONS IN HEALTHCARE AND ECONOMETRICS
OUTLINE
1) QUINTEN IN BRIEF
2) LIFE SCIENCES AND PREDICTION
3) CASE STUDIES
4) COMPARISON OF METHODS
5) RESEARCH TOPICS
Tribute to AK, ~780-850
QUINTEN IN SHORT
A company providing data-oriented strategic advisory.
Since 2008, over 100 missions for more than 25 clients
Historical focus on Life Sciences and Healthcare
Now extending to CRM, Insurance and Investment
18 employees, self-financed, annual average growth of 40%
80% of the revenue reinvested in R&D each year, including a proprietary learning technology
Active member of several technology clusters: Medicen
REFERENCES
THE HEALTHCARE SECTOR AS AN ADVANCED ALGORITHMIC PRESCRIBER?
Prediction and classification needs in life sciences have evolved:
• Huge increase in the number of available variables.
• Limited sample sizes (often < 1000) for economic reasons.
These needs are not fully met by predictive approaches:
• Need for evidence-based methods.
• Trade-off between predictive power and auditability of recommendations.
The exponential increase in computation capacity opens the way for exploration-based methods, with an increasing risk of overfitting the data.
This correlates with similar trends in CRM. Customer profiling: data gathering is key.
ALGORITHMIC NEEDS IN EPIDEMIOLOGICAL STUDIES
Databases have large sets of variables (#V >> #Obs)
Practitioners often wish to avoid any a priori selection (or ranking) of variables.
Most kinds of regression models handle this setting poorly. Using ‘sparsity’, i.e. penalizing complexity in order to simplify the model, does not fully solve the problem.
Leaving the Cartesian paradigm: a single (possibly very complex) function globally driving the entirety of the visible phenomena.
Moving to a heuristic approach: accepting the possibility of multiple, local, partially correlated causes to be discovered: the ‘profiles’.
Interpretability of the profiles and descriptive parsimony are mandatory: no black-box or randomized results.
PREDICTION VS DESCRIPTION IN SUPERVISED METHODS
Supervised problems, i.e. where the training data are ‘labeled’ by a variable Y to be explained. Y is the ‘phenomenon of interest’.
Y can be a boolean (treatment outcome) or a continuous variable (loss amount, etc.).
Explanatory variables $X = (X_i)_{i=1..V} \in \mathbb{R}^V$, continuous or discrete, with possibly missing values.
Predictor: a function $\hat{Y} = F(X)$, with $F: \mathbb{R}^V \to \mathrm{Dom}(Y)$, verifying $\mathrm{Var}(\hat{Y} - Y \mid X) < \mathrm{Var}(Y)$.
Explanatory power: capacity to ‘simply’ describe the sets $F^{-1}([s, 1])$, i.e. answering the question ‘Who are the strong responders?’
Simplicity can be formalized, and any formalization involves the number of variables used by the predictor.
Simplicity is key when targeting large sets of ‘new’ individuals (not in the training sample).
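As a minimal sketch of the two notions above, the snippet below checks an empirical, unconditional version of the variance criterion for a candidate predictor and extracts the strong responders, i.e. the individuals whose predicted value falls in [s, 1]. The data, the predictor `f` and the threshold `s` are illustrative assumptions and do not come from the slides.

```python
from statistics import pvariance

def is_predictor(y, y_hat):
    """Empirical check of the criterion Var(Y_hat - Y) < Var(Y)."""
    residuals = [p - t for p, t in zip(y_hat, y)]
    return pvariance(residuals) < pvariance(y)

def strong_responders(xs, f, s):
    """Indices of individuals whose predicted value falls in [s, 1]."""
    return [i for i, x in enumerate(xs) if s <= f(x) <= 1]

# Hypothetical one-variable example: the outcome is more frequent when x is large.
xs = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
y = [0, 0, 0, 1, 0, 1, 1, 1]
f = lambda x: x  # hypothetical predictor: the score is x itself
y_hat = [f(x) for x in xs]

print(is_predictor(y, y_hat))           # residual variance 0.125 vs Var(Y) = 0.25
print(strong_responders(xs, f, s=0.7))  # individuals with a score of at least 0.7
```

Here the strong responders are described by a single condition on a single variable (x >= 0.7), which is the kind of short description the slide calls explanatory power.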
THE PREDICTIVE VS EXPLANATORY TRADE-OFF
Problem: ‘nicely’ separating red from blue points in $\mathbb{R}^2$. Dark colors mark the training sample, light colors the test sample.
Running four prediction techniques on the previous set.
Colored areas depend on the predicted value.
How many words are needed to describe the dark shaded areas?
The poor response of linear separators (SVM) indicates that more dimensions could be needed in order to improve the description.
PROFILE SEARCH VS DECISION TREES
Decision trees look for optimal cut-offs on explanatory variables: they partition space into non-overlapping regions. Profile search allows for some controlled degree of intersection.
Toy database with a phenomenon taking place on two overlapping rectangles on variables a and b, hidden among 250 random variables.
CART response: up to 14 levels to partition space.
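A toy database of this kind can be sketched as follows. The rectangle bounds, the sample size and the uniform distributions are assumptions made for illustration; the slide only states that the phenomenon lives on two overlapping rectangles on a and b, hidden among 250 random variables. The point of the sketch is that two overlapping profiles describe the phenomenon exactly, whereas a partition into non-overlapping cells needs more pieces.

```python
import random

# Hypothetical rectangles on variables a and b; they overlap on
# a in [0.4, 0.5] and b in [0.5, 0.6].
PROFILE_1 = {"a": (0.1, 0.5), "b": (0.3, 0.6)}
PROFILE_2 = {"a": (0.4, 0.8), "b": (0.5, 0.9)}

def matches(row, profile):
    """True if the row satisfies every interval condition of the profile."""
    return all(lo <= row[var] <= hi for var, (lo, hi) in profile.items())

def make_toy_database(n_obs=2000, n_noise=250, seed=0):
    """Rows with variables a, b, n_noise random variables, and outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_obs):
        row = {"a": rng.random(), "b": rng.random()}
        for j in range(n_noise):
            row[f"noise_{j}"] = rng.random()
        # The phenomenon occurs exactly on the union of the two rectangles.
        row["y"] = int(matches(row, PROFILE_1) or matches(row, PROFILE_2))
        rows.append(row)
    return rows

rows = make_toy_database()
for name, profile in (("profile 1", PROFILE_1), ("profile 2", PROFILE_2)):
    covered = [r for r in rows if matches(r, profile)]
    rate = sum(r["y"] for r in covered) / len(covered)
    print(name, len(covered), rate)
```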
USE CASE IN HEALTHCARE CLUSTERING: AN UNSUPERVISED APPROACH
Database: 2000 patients / 1000 variables
10% got the Adverse Event X (200 patients)
[Figure: patients with and without Adverse Event X (AEX)]
Data processing: Singular Value Decomposition (SVD), clustering (PCA, K-Means ...)
Are there various typologies of patients in this database? Do these typologies show any deviations with regard to Adverse Event X? Are these differences important enough to avoid treating some typologies?
• Typology 1: 507 patients, 6.4% AEX
• Typology 2: 507 patients, 10% AEX
• Typology 3: 808 patients, 13% AEX
ASSOCIATIVE RULES DISCOVERY: QFINDER ALGORITHM
Identification and characterization of singular profiles
Database: 2000 patients / 1000 variables
10% got the Adverse Event X (200 patients)
[Figure: patients with and without Adverse Event X]
Data processing (QFinder)
What are the various profiles of patients with the highest risk of Adverse Event X? What are the key characteristics of each of these profiles? How to prevent Adverse Event X?
• Age > 56, Average Daily Dose = High, Treatment duration > 50 days: 126 patients, 47% Adverse Event X
• Gender = female, Diabetes = Yes, Menopause = Yes: 108 patients, 60% Adverse Event X
• Blood Pressure = High, Dyslipidemia = Yes: 59 patients, 75% Adverse Event X
Interpretable and actionable results
Optimality of recommendations
EXAMPLE OF PROFILE
Detection of mutually influential factors not seen by regressions
Database size: 2000 patients (100%). Average rate of adverse events: 10% (90% without).
MANY CRITERIA HAVE LITTLE OR NO INFLUENCE
Criterion | Size | Adverse events | No adverse events
AGE > 56 | 739 patients (37%) | 11% | 89%
TREATMENT DURATION > 50 days | 936 patients (47%) | 8% | 92%
AVERAGE DAILY DOSE: HIGH | 647 patients (32%) | 13% | 87%
HOWEVER Q-FINDER WAS ABLE TO DETECT THEIR COMBINED INFLUENCE WHEN RELEVANT
Patients matching all of the following characteristics are 4.7 times more likely to trigger adverse events:
• AGE > 56
• TREATMENT DURATION > 50 days
• AVERAGE DAILY DOSE: HIGH
Profile size: 126 patients (6.3%); 47% with adverse events, 53% without.
ACTION: AVOID THE HIGH DOSE ON PATIENTS > 56 TREATED > 50 DAYS
AVOID TREATING PATIENTS > 56 WITH THE HIGH DOSE FOR MORE THAN 50 DAYS
Unlabeled values in the figure: 65%, 69%, 84%.
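The factor of 4.7 is the ratio between the adverse event rate inside the profile (47%) and the average rate in the database (10%). The short sketch below recomputes this ratio, often called lift, for each single criterion and for the combined profile, using the sizes and rates read from the slide; the function name `lift` is an illustrative choice.

```python
def lift(profile_rate, baseline_rate):
    """Ratio of the event rate inside a profile to the overall event rate."""
    return profile_rate / baseline_rate

BASELINE = 200 / 2000  # 10% of the 2000 patients had the adverse event

# (size, adverse event rate) as read from the slide
criteria = {
    "AGE > 56": (739, 0.11),
    "TREATMENT DURATION > 50 days": (936, 0.08),
    "AVERAGE DAILY DOSE: HIGH": (647, 0.13),
    "all three combined": (126, 0.47),
}

for name, (size, rate) in criteria.items():
    print(f"{name}: {size} patients, lift = {lift(rate, BASELINE):.1f}")
```

Each criterion taken alone has a lift close to 1 (1.1, 0.8 and 1.3), while the conjunction of the three reaches 4.7.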
USING PROFILE DETECTION IN INVESTMENT
Using machine learning to detect recurrent biases in the returns of the main asset classes (interest rates, equity indices, currencies).
Empirical facts:
• Financial markets are interaction hubs for investors with a huge diversity of horizons and risk aversion.
• Fluctuations can therefore be caused by a large number of potential factors.
• The influence of these factors is not uniform through time.
• GLM-type approaches are too difficult to calibrate and yield unstable results.
Retained approach:
• Search for significant profiles, characterized by conditions on a limited number of variables.
• Profiles can partially intersect.
• No predefined hierarchy on the variables.
• Creating derived variables from primary variables: stationarity and variety.
Commercial presentation 2014
Example:
• $Y(t) = \Delta\mathrm{Bund}(\text{1 month}) / \mathrm{stdev}(\Delta\mathrm{Bund}(\text{1 month}))$
• 250 explanatory variables:
  • Eurozone and US economic indicators
  • Interest rate levels and dynamics
  • Central money data
  • Inflation expectations (inflation swaps)
  • Risk premia on equity markets
  • Energy prices
  • Volatilities, correlations
• Training period: 1999-2013. Average of Y(t) over the training period = +0.15 σ
[Figure: ΔBund = f(t), from 1999-11-01 to 2012-09-19; vertical axis from -15 to 10.]
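The target Y(t) defined above is a one-month change divided by the standard deviation of such changes, so it is expressed in units of standard deviation. A minimal sketch of that normalization follows. It assumes a daily series and a month of 20 trading days (the 260/20 ratio used later in the deck suggests this convention), and it estimates the standard deviation on the whole sample; the series itself is made up.

```python
from statistics import pstdev

def normalized_changes(series, horizon=20):
    """Change over `horizon` steps, divided by the standard deviation of those changes."""
    changes = [series[t] - series[t - horizon] for t in range(horizon, len(series))]
    scale = pstdev(changes)
    if scale == 0:
        raise ValueError("changes are constant; cannot normalize")
    return [c / scale for c in changes]

# Hypothetical daily series (not market data).
series = [100 + 0.01 * t * t for t in range(300)]
y = normalized_changes(series)
print(len(y), round(pstdev(y), 6))  # the normalized changes have unit standard deviation
```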
Stylized fact 1: sharp drop in German equities → increase in risk aversion → rise in German government bonds.
Validating hypothesis: X1 = Decile(Δ(E/P ratio(DAX) - Bobl yield)). Interpretation: 3-month variation in the German equity risk premium.
r = Correlation(X1, Y) = 9%, R² = 0.8%.
Decile analysis of E(Y | X1):
• Non-linearity
• General trend consistent with intuition
[Figure: E(ΔBund) = f(Δ DAX risk premium), by overlapping decile windows of X1 (0-2, 1-3, ..., 8-10); vertical axis from -0.3 to 0.6.]
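A decile analysis of this kind estimates the conditional expectation of Y over groups of consecutive deciles of an explanatory variable, which can reveal non-linear effects that a correlation coefficient hides. The sketch below is one possible reading: deciles are rank-based and numbered 0 to 9, and the window of three consecutive deciles is an assumption suggested by the axis labels of the figure, not a documented convention.

```python
def decile_ranks(values):
    """Rank-based decile index (0 to 9) of each value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, i in enumerate(order):
        ranks[i] = position * 10 // len(values)
    return ranks

def conditional_means(x, y, window=3):
    """Mean of y over overlapping windows of `window` consecutive deciles of x."""
    deciles = decile_ranks(x)
    means = {}
    for start in range(0, 10 - window + 1):
        selected = [yi for d, yi in zip(deciles, y) if start <= d < start + window]
        if selected:  # skip windows that contain no observation
            means[(start, start + window - 1)] = sum(selected) / len(selected)
    return means

# Hypothetical data with a non-linear link: y only reacts in the upper half of x.
x = list(range(100))
y = [0.0 if v < 50 else 1.0 for v in x]
for window, mean in conditional_means(x, y).items():
    print(window, round(mean, 2))
```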
Stylized fact 2: growth acceleration in monetary aggregates → future rise in inflation → loss in government bonds.
Hypothesis validation: X2 = Decile(Δ M3 (3 months)).
r = Correlation(X2, Y) = -1.5%, R² = 0.4%.
Decile analysis:
• Non-linearity
• General trend consistent with intuition
[Figure: E(ΔBund) = f(ΔM3), by overlapping decile windows of X2 (0-3, 1-4, ..., 8-11); vertical axis from -0.2 to 0.5.]
When:
• X1 >= 5: Δ(DAX risk premium) > 5th decile, AND
• X2 in [2, 6]: Δ(M3) between the 2nd and 6th deciles
Then: E(Y | X1, X2) = +0.83 σ, true on 21.5% of observations between 1999 and 2012.
These conditions form a market profile.
Information ratio: $0.83 \times (21.5\% \times 260/20)^{0.5} = 1.05$
Strong synergy between variables: +90% increase in the conditional expectation of Bund performance.
[Figure: combined influence of the two variables. Conditional expectation of Y, from -1.5 σ to 1.5 σ in bands of 0.5 σ, as a function of the deciles (0 to 9) of X1 and X2.]
[Figure: historical occurrences of the profile (0/1 indicator), from 1999-11-01 to 2013-01-25.]
113 independent occurrences in 14 years
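A market profile of this kind is a conjunction of conditions on decile indices, and the information ratio expression on the slide scales the conditional expectation by the square root of the number of active periods per year. The sketch below encodes both; the function names are hypothetical. Evaluated literally with the printed inputs, the expression gives about 1.39 rather than the 1.05 shown on the slide, so the slide may rely on an adjustment that the transcript does not preserve.

```python
import math

def in_profile(x1_decile, x2_decile):
    """Market profile from the slide: X1 >= 5 and X2 in [2, 6]."""
    return x1_decile >= 5 and 2 <= x2_decile <= 6

def information_ratio(cond_mean_sigma, frequency, days_per_year=260, horizon_days=20):
    """Expression as printed: mean * sqrt(frequency * periods per year)."""
    periods_per_year = days_per_year / horizon_days
    return cond_mean_sigma * math.sqrt(frequency * periods_per_year)

print(in_profile(7, 3))                           # inside the profile
print(in_profile(3, 3))                           # X1 too low
print(round(information_ratio(0.83, 0.215), 2))   # literal evaluation of the printed expression
```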
MANAGING THE RISK OF OVERFITTING
Parameter | Role | Influence on overfitting risk
P | Size of training sample | P↑: risk↓
ρ | Average coding compression rate of variables (= #modalities / P) | ρ↓: risk↓
y | Proportion of 1’s in dependent variable | y↑: risk↓
k | Maximum profile complexity | k↓: risk↓
V | Total number of variables | V↓: risk↓
ε | Maximum admissible probability of finding any configuration by random search | ε↓: risk↓
[Figure: maximum number of profiles (vertical axis, 0 to 60) as a function of the coding compression of variables, for #V = 1, 2, 3, 4.]
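One way to see why V, k and the number of modalities drive the overfitting risk is to count how many candidate profiles a combinatorial search can visit. The sketch below uses a simplified counting model that is an assumption of this example, not the method of the slides: a profile picks between 1 and k distinct variables and one modality per variable, and every variable has the same number of modalities. The Bonferroni-style threshold is likewise only an illustration of how a global tolerance ε could be shared among candidates.

```python
from math import comb

def count_candidate_profiles(n_variables, max_complexity, n_modalities):
    """Number of conjunctions using 1..k distinct variables, one modality each."""
    return sum(
        comb(n_variables, j) * n_modalities ** j
        for j in range(1, max_complexity + 1)
    )

def per_profile_threshold(epsilon, n_candidates):
    """Bonferroni-style significance level per candidate (illustrative assumption)."""
    return epsilon / n_candidates

# Hypothetical setting: 250 variables coded on 10 modalities each.
for k in (1, 2, 3):
    n = count_candidate_profiles(250, k, 10)
    print(k, n, per_profile_threshold(0.05, n))
```

With these hypothetical settings the search space grows from 2,500 candidates at k = 1 to more than 2.5 billion at k = 3, which is why limiting k, V and the coding granularity reduces the chance of finding a configuration by pure luck.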
RISK AND REWARDS OF COMBINATORIAL EXPLORATION
No preselection of variables, no hierarchy, localized search: more freedom is granted.
No free lunch: computation time increases (linear in #Obs, polynomial in #V).
But parallel computation and cloud computing are well suited to the task.
The risk of overfitting must be carefully controlled.
The richness of the descriptive language must be kept at a parsimonious level in order to prevent ‘nugget-fishing’: interesting maths behind the scenes.
CURRENT RESEARCH AREAS
Improving the dynamic aggregation of predictors: Using prediction as a topology on data: COBRA algorithm (G. Biau, B. Guedj).
Weighting schemes based on regret (Lugosi, Stoltz) or regularity (Wintenberger).
Embedding time stationarity requirements in profile search.
Incremental production of backtests.
Visualization of an audit trail between variables and final recommendations.
GPU calculations
…
CONTACT
11, rue Galvani 75017 Paris, France +33 (0)1 45 74 33 05
http://www.quinten-france.com @QuintenFrance
Christophe GEISSLER
33 (0)6 08 60 46 14 [email protected]