Profile detection: applications in healthcare and econometrics (Geissler)

7 July 2014

Christophe Geissler, Quinten, IAF.

PROFILE DETECTION: APPLICATIONS IN HEALTHCARE AND ECONOMETRICS

OUTLINE

1) QUINTEN IN SHORT
2) LIFE SCIENCES AND PREDICTION
3) CASE STUDIES
4) COMPARISON OF METHODS
5) RESEARCH TOPICS

Tribute to AK, ~780-850

QUINTEN IN SHORT

A company providing data-oriented strategic advisory.

Since 2008, over 100 missions for more than 25 clients

Historical focus on Life Sciences and Healthcare

Now extending to CRM, Insurance and Investment

18 employees, self-financed, annual average growth of 40%

80% of the revenue reinvested in R&D each year, including a proprietary learning technology

Active member of several technology clusters: Medicen


THE HEALTHCARE SECTOR AS AN ADVANCED ALGORITHMIC PRESCRIPTOR?

The prediction/classification needs in life sciences have evolved:
• Huge increase in available variables
• Limited sample sizes (often < 1000) for economic reasons

These needs are not fully met by predictive approaches:
• Need for evidence-based methods
• Trade-off between predictive power and auditability of recommendations

The exponential increase in computation capacity opens the way for exploration-based methods, with an increasing risk of overfitting the data.

Correlation with similar trends in CRM. Customer profiling: data gathering is key.


ALGORITHMIC NEEDS IN EPIDEMIOLOGICAL STUDIES

Databases have large sets of variables (#V >> #Obs)

Practitioners often wish to get rid of a priori selection (or hierarchization) of variables

Poor tractability by most kinds of regression models: using ‘sparsity’, i.e. penalizing complexity in order to simplify the model, does not fully solve the problem.

Leaving the Cartesian paradigm (a single, possibly very complex, function globally driving all visible phenomena)

for a heuristic approach: accepting that multiple, local, partially correlated causes remain to be discovered: the ‘profiles’.

Interpretability of the profiles and descriptive parsimony are mandatory: no black-box or randomized results.


PREDICTION VS DESCRIPTION IN SUPERVISED METHODS

Supervised problems, i.e. where the training data are ‘labeled’ by a variable Y to be explained. Y is the ‘phenomenon of interest’.

Y can be a boolean (treatment outcome) or a continuous variable (loss amount, etc.).

Explanatory variables $X = (X_i)_{i=1..V} \in \mathbb{R}^V$, continuous or discrete, with possibly missing values.

Predictor: a function $\hat{Y} = F(X)$, $F : \mathbb{R}^V \to \mathrm{Dom}(Y)$, verifying $\mathrm{Var}(\hat{Y} - Y \mid X) < \mathrm{Var}(Y)$.

Explanatory power: capacity to ‘simply’ describe the sets $F^{-1}([s, 1])$, i.e. answering the question ‘Who are the strong responders?’

Simplicity can be formalized; any formalization involves the number of variables used by the predictor.

Simplicity is key when targeting large sets of ‘new’ individuals (not in the training sample).
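A minimal sketch of the definitions above, assuming synthetic data and an ordinary least-squares predictor (both are illustrative choices, not taken from the slides). It checks the empirical, unconditional counterpart of the variance condition and extracts the 'strong responders' as the observations whose prediction reaches a hypothetical threshold `s`.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic training sample (assumption): P observations, V explanatory variables.
P, V = 500, 5
X = rng.normal(size=(P, V))
# Y depends on the first two variables only, plus noise.
Y = 0.8 * X[:, 0] - 0.5 * X[:, 1] + rng.normal(scale=0.5, size=P)

# Predictor F: ordinary least squares with an intercept (illustrative choice).
A = np.column_stack([np.ones(P), X])
coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
Y_hat = A @ coef

# Empirical counterpart of Var(Y_hat - Y) < Var(Y).
residual_var = np.var(Y_hat - Y)
total_var = np.var(Y)

# "Strong responders": observations whose prediction reaches the threshold s.
s = 1.0  # hypothetical threshold
strong = np.flatnonzero(Y_hat >= s)

print(f"residual variance {residual_var:.3f} < total variance {total_var:.3f}")
print(f"{strong.size} strong responders out of {P}")
```

Here the set of strong responders is easy to describe because the predictor is linear in two variables; with a more flexible predictor the same set may need many more conditions to describe.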


THE PREDICTIVE VS EXPLANATORY TRADE-OFF


Problem: separating ‘nicely’ red from blue points in $\mathbb{R}^2$. Dark colors in the training sample, light colors in the test sample.


Running four prediction techniques on the previous set.

Colored areas depend on the predicted value.

How many words are needed to describe the dark shaded areas?

The poor response of linear separators (SVM) indicates that more dimensions could be needed in order to improve the description.

PROFILE SEARCH VS DECISION TREES

Decision trees look for optimal cut-offs on explanatory variables: partition of space into non-overlapping regions.

Profile search allows for some controlled degree of intersection.

Toy database with a phenomenon taking place on two overlapping rectangles on variables a and b, hidden among 250 random variables.

CART response: up to 14 levels to partition space.
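The toy database can be sketched as follows. This is only an illustration under assumed sizes and rectangle bounds (the slides give neither): two overlapping rectangles on `a` and `b` define the phenomenon, and 250 irrelevant uniform variables are generated alongside.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy database (assumption): 2000 rows, variables a and b plus 250 noise variables.
n, n_noise = 2000, 250
a = rng.uniform(0.0, 1.0, size=n)
b = rng.uniform(0.0, 1.0, size=n)
noise = rng.uniform(0.0, 1.0, size=(n, n_noise))  # irrelevant variables

def in_rectangle(a, b, a_lo, a_hi, b_lo, b_hi):
    """Boolean mask of the points falling inside an axis-aligned rectangle."""
    return (a >= a_lo) & (a <= a_hi) & (b >= b_lo) & (b <= b_hi)

# Two overlapping rectangles (hypothetical bounds) define the phenomenon.
profile_1 = in_rectangle(a, b, 0.1, 0.6, 0.2, 0.5)
profile_2 = in_rectangle(a, b, 0.4, 0.9, 0.4, 0.8)
y = profile_1 | profile_2

# Two profiles of four conditions each describe y exactly, even though they intersect.
overlap = int(np.sum(profile_1 & profile_2))
covered = int(np.sum(y))
print(f"{covered} positive rows, {overlap} of them in both profiles")
```

With these bounds, a description restricted to non-overlapping axis-aligned boxes needs at least three boxes to cover the same region, and a tree must also avoid splitting on the 250 noise variables along the way.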

USE CASE IN HEALTHCARE CLUSTERING: AN UNSUPERVISED APPROACH

Database: 2000 patients / 1000 variables

10% got the Adverse Event X (200 patients)

Are there various typologies of patients in this database? Do these typologies show any deviations with regard to Adverse Event X? Are these differences important enough to avoid treating some typologies?

Singular Value Decomposition (SVD): Clustering (PCA, K-Means ...)

[Figure legend: patient without Adverse Event X / patient with Adverse Event X]

• Typology 1: 507 patients, 6.4% AEX
• Typology 2: 507 patients, 10% AEX
• Typology 3: 808 patients, 13% AEX
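The unsupervised pipeline can be sketched as below, assuming a small synthetic stand-in for the patient database and a plain Lloyd's k-means written in NumPy (the slides do not specify the implementation). The adverse event flag is only looked at after the clustering step.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic stand-in (assumption) for the patient database: 300 patients,
# 20 variables, drawn around three well-separated centres.
centres = rng.normal(scale=5.0, size=(3, 20))
true_group = rng.integers(0, 3, size=300)
X = centres[true_group] + rng.normal(size=(300, 20))
# Adverse event flag, deliberately not used by the clustering step.
adverse = rng.random(300) < 0.10

# Step 1: SVD of the centred data, keeping the two leading components (PCA scores).
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U[:, :2] * S[:2]

# Step 2: plain Lloyd's k-means on the scores.
def kmeans(points, k, n_iter=50, seed=0):
    local_rng = np.random.default_rng(seed)
    centroids = points[local_rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):  # keep the old centroid if a cluster empties
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

labels = kmeans(scores, k=3)

# Step 3: only afterwards, compare the adverse event rate across typologies.
for j in range(3):
    members = labels == j
    if members.any():
        print(f"typology {j + 1}: {members.sum()} patients, "
              f"{adverse[members].mean():.1%} adverse events")
```

Because the event is ignored when building the typologies, nothing guarantees that the clusters differ much in event rate, which is the limitation the typology figures above illustrate.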

ASSOCIATIVE RULES DISCOVERY: QFINDER ALGORITHM

Identification and characterization of singular profiles

Database: 2000 patients / 1000 variables

10% got the Adverse Event X (200 patients)

What are the various profiles of patients with the highest risk of Adverse Event X? What are the key characteristics of each of these profiles? How to prevent Adverse Event X?

Data processing (QFinder)

[Figure legend: patient without Adverse Event X / patient with Adverse Event X]

• Age > 56, Average Daily Dose = High, Treatment duration > 50 days: 126 patients, 47% Adverse Event X
• Gender: female, Diabetes = Yes, Menopause = Yes: 108 patients, 60% Adverse Event X
• Blood Pressure = High, Dyslipidemia = Yes: 59 patients, 75% Adverse Event X

Interpretable and actionable results

Optimality of recommendations
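QFinder itself is proprietary and not described in the slides. As a purely illustrative stand-in, the sketch below runs a brute-force search over conjunctions of binary conditions on synthetic data; the function name `search_profiles` and its parameters `max_size` and `min_support` are hypothetical.

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(3)

# Synthetic data (assumption): 2000 patients, 8 binary conditions.
n, n_cond = 2000, 8
C = rng.random((n, n_cond)) < 0.5
# The event is frequent only when conditions 0, 1 and 2 hold together.
target = C[:, 0] & C[:, 1] & C[:, 2]
y = rng.random(n) < np.where(target, 0.60, 0.05)

def search_profiles(C, y, max_size=3, min_support=50):
    """Score every conjunction of at most max_size conditions by its event rate."""
    results = []
    for size in range(1, max_size + 1):
        for combo in combinations(range(C.shape[1]), size):
            mask = C[:, list(combo)].all(axis=1)
            support = int(mask.sum())
            if support >= min_support:
                results.append((float(y[mask].mean()), support, combo))
    return sorted(results, reverse=True)

ranked = search_profiles(C, y)
best_rate, best_support, best_combo = ranked[0]
print(f"best profile: conditions {best_combo}, "
      f"{best_support} patients, {best_rate:.0%} event rate")
```

The minimum support and the cap on the number of conditions are the two knobs that keep such a search from returning tiny, accidental groups.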

EXAMPLE OF PROFILE

Detection of mutually influencing factors not seen by regressions

Database size: 2000 patients (100%)

Average rate of adverse events: 10% (90% without)

MANY CRITERIA HAVE LITTLE OR NO INFLUENCE

• AGE > 56: 739 patients (37%), 11% with adverse events, 89% without
• TREATMENT DURATION > 50 days: 936 patients (47%), 8% with adverse events, 92% without
• AVERAGE DAILY DOSE: HIGH: 647 patients (32%), 13% with adverse events, 87% without

HOWEVER Q-FINDER WAS ABLE TO DETECT THEIR COMBINED INFLUENCE WHEN RELEVANT

Patients matching the following characteristics:

AGE > 56, TREATMENT DURATION > 50 days, AVERAGE DAILY DOSE: HIGH

are 4.7 times more likely to trigger adverse events.

Profile size: 126 patients (6.3%), 47% with adverse events, 53% without

ACTION: AVOID THE HIGH DOSE ON PATIENTS > 56 TREATED > 50 DAYS

AVOID TREATING MORE THAN 50 DAYS PATIENTS > 56 WITH THE HIGH DOSE
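The "4.7 times more likely" figure is the ratio between the event rate inside the profile (47%) and the average rate in the database (10%). A small worked check, using only rates quoted on the slide (the helper name `lift` is an illustrative choice):

```python
def lift(rate_in_group: float, base_rate: float) -> float:
    """Ratio between the event rate inside a group and the overall event rate."""
    return rate_in_group / base_rate

base_rate = 0.10  # average rate of adverse events in the database

# Event rates read from the slide for each criterion taken alone.
single_criteria = {
    "AGE > 56": 0.11,
    "TREATMENT DURATION > 50 days": 0.08,
    "AVERAGE DAILY DOSE: HIGH": 0.13,
}
combined_rate = 0.47  # profile combining the three criteria (126 patients)

for name, rate in single_criteria.items():
    print(f"{name}: lift {lift(rate, base_rate):.1f}")
print(f"combined profile: lift {lift(combined_rate, base_rate):.1f}")
```

Each criterion alone stays close to a lift of 1, while the conjunction of the three reaches 4.7.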

USING PROFILE DETECTION IN INVESTMENT

Using machine learning for the detection of recurrent biases in the returns of the main asset classes (interest rates, equity indices, currencies).

Empirical facts:
• Financial markets are interaction hubs for investors with a huge diversity in horizon and risk aversion.
• Fluctuations can therefore be caused by a large number of potential factors.
• The influence of these factors is not uniform through time.
• GLM-type approaches are too difficult to calibrate and yield unstable results.

Retained approach:
• Search for significant profiles, characterized by conditions on a limited number of variables.
• Profiles can partially intersect.
• No predefined hierarchy on the variables.
• Creating derived variables from primary variables: stationarity and variety.



Example:
• Y(t) = Δ Bund (1 month) / stdev (Δ Bund (1 month))
• 250 explanatory variables:
    • Eurozone, US economic indicators
    • Interest rates levels and dynamics
    • Central money data
    • Inflationary anticipations (inflation swaps)
    • Risk premia on equity markets
    • Energy prices
    • Volatilities, correlations
• Training period: 1999-2013. Average (Y(t), <Training period>) = +0.15 σ

[Figure: ΔBund = f(t), from 1 November 1999 to 19 September 2012]

Stylized fact 1: Sharp drop in German equities → increase in risk aversion → rise in German Govt Bonds.

Validating hypothesis: X1 = Decile (Δ (E/P_ratio (Dax) - Bobl yield)).

Interpretation: 3-month variation in the German equity risk premium.

r = Correlation (X1, Y) = 9%, R² = 0.8%.

Decile analysis: E(Y | X1)
• Non-linearity
• General trend consistent with intuition

[Figure: E(ΔBund) = f(Δ DAX risk premium), by sliding decile band (0-2, 1-3, ..., 8-10)]
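A decile analysis of this kind can be sketched as follows, assuming synthetic series in place of the market data and a rank-based decile coding (the slides do not say how deciles are computed). It contrasts a small linear correlation with a visibly non-linear conditional mean.

```python
import numpy as np

rng = np.random.default_rng(4)

def decile_code(x):
    """Map each value to its decile index 0..9 using ranks (ties broken by order)."""
    ranks = np.argsort(np.argsort(x))
    return (ranks * 10) // len(x)

# Synthetic stand-ins (assumption): x is a raw explanatory series, y the target.
n = 3000
x = rng.normal(size=n)
# Weak, non-linear dependence: only the top of the distribution of x matters.
y = 0.5 * (x > 1.0) + rng.normal(size=n)

x_dec = decile_code(x)
corr = np.corrcoef(x_dec, y)[0, 1]

# Decile analysis: conditional mean of y within each decile of x.
cond_mean = np.array([y[x_dec == d].mean() for d in range(10)])

print(f"correlation = {corr:.1%}, R2 = {corr**2:.2%}")
for d, m in enumerate(cond_mean):
    print(f"decile {d}: E(y | decile) = {m:+.2f}")
```

A low R² on the whole sample is therefore compatible with a sizeable effect concentrated in a few deciles, which is what a profile condition such as "X1 >= 5" exploits.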

Stylized fact 2: Growth acceleration in monetary aggregates → future rise in inflation → loss in Govt Bonds.

Hypothesis validation: X2 = Decile (Δ M3 (3 month)).

r = Correlation (X2, Y) = -1.5%, R² = 0.4%.

Decile analysis:
• Non-linearity
• General trend consistent with intuition

[Figure: E(ΔBund) = f(ΔM3), by sliding decile band (0-3, 1-4, ..., 8-11)]

When:
X1 >= 5, i.e. Δ (DAX Risk Premium) > 5th decile, AND
X2 in [2, 6], i.e. Δ (M3) between 2nd and 6th decile

Then: E(Y | X1, X2) = +0.83 σ, true on 21.5% of observations between 1999 and 2012.

These conditions form a market profile.

Information ratio: 0.83 x (21.5% x 260/20)^0.5 = 1.05

Strong synergy between variables: +90% increase in conditional expectation of Bund performance.

[Figure: Combined influence of the two variables. Conditional expectation of Y (bands from -1.5 σ to +1.5 σ) by decile (0 to 9) of each variable]

[Figure: Historical occurrences of the profile (0/1 indicator), from 1 November 1999 to 25 January 2013]

113 independent occurrences in 14 years
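Evaluating such a market profile amounts to computing how often its conditions hold and the mean of Y when they do. The sketch below does this on synthetic decile codes; the data, the effect size and the helper name `evaluate_profile` are assumptions made for illustration, and only the two conditions come from the slide.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic stand-ins (assumption): decile codes 0..9 for X1 and X2, and a target Y
# that is shifted upwards only when both conditions hold.
n = 3000
x1 = rng.integers(0, 10, size=n)
x2 = rng.integers(0, 10, size=n)
in_profile_truth = (x1 >= 5) & (x2 >= 2) & (x2 <= 6)
y = rng.normal(size=n) + 0.8 * in_profile_truth

def evaluate_profile(mask, y):
    """Return (share of observations in the profile, conditional mean of y)."""
    return float(mask.mean()), float(y[mask].mean())

# Profile from the slide: X1 >= 5 AND X2 in [2, 6].
profile = (x1 >= 5) & (x2 >= 2) & (x2 <= 6)
coverage, cond_mean = evaluate_profile(profile, y)

# Each condition taken alone dilutes the effect.
_, mean_x1_only = evaluate_profile(x1 >= 5, y)
_, mean_x2_only = evaluate_profile((x2 >= 2) & (x2 <= 6), y)

print(f"profile: {coverage:.1%} of observations, E(Y | profile) = {cond_mean:+.2f}")
print(f"X1 alone: {mean_x1_only:+.2f}, X2 alone: {mean_x2_only:+.2f}")
```

The gap between the conditional mean on the full profile and on each condition alone is the kind of synergy the slide reports.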

MANAGING THE RISK OF OVERFITTING

Parameter | Role | Influence on overfitting risk
P | Size of training sample | P↑: risk↓
ρ | Average coding compression rate of variables (= #modalities / P) | ρ↓: risk↓
y | Proportion of 1’s in dependent variable | y↑: risk↓
k | Maximum profile complexity | k↓: risk↓
V | Total number of variables | V↓: risk↓
ε | Maximum admissible probability of finding any configuration by random search | ε↓: risk↓

[Figure: Maximum number of profiles (axis "Nb max", 0 to 60) as a function of the coding compression of variables, for #V = 1, 2, 3, 4]
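One generic way to ask whether a profile could have been found by chance is a label-permutation test. The sketch below is such a test on synthetic data; it is not the control described in the slides, and the function name `permutation_p_value` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(6)

# Synthetic data (assumption): 2000 observations, a candidate profile of 200 rows.
n = 2000
profile = np.zeros(n, dtype=bool)
profile[:200] = True
y = rng.random(n) < np.where(profile, 0.50, 0.10)

def permutation_p_value(profile, y, n_perm=999, seed=0):
    """Share of label shufflings whose in-profile rate reaches the observed one."""
    local_rng = np.random.default_rng(seed)
    observed = y[profile].mean()
    hits = 0
    for _ in range(n_perm):
        shuffled = local_rng.permutation(y)
        if shuffled[profile].mean() >= observed:
            hits += 1
    # The +1 terms count the observed labelling itself among the permutations.
    return (hits + 1) / (n_perm + 1)

p_value = permutation_p_value(profile, y)
print(f"observed rate {y[profile].mean():.0%}, permutation p-value {p_value:.3f}")
```

This tests a single, fixed profile. When the profile is the best of many candidates, the threshold has to be tightened to account for the size of the search.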

RISK AND REWARDS OF COMBINATORIAL EXPLORATION

No preselection of variables, no hierarchy, localized search: more freedom is granted

No free lunch: computation time increases (linear in #Obs, polynomial in #V)

But parallel computation and cloud-computing are perfectly adapted

Risk of overfitting must be carefully controlled

The richness of the descriptive language must be kept at a parsimonious level in order to prevent ‘nugget-fishing’: interesting maths behind the scenes.
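To give an order of magnitude for the search space, the sketch below counts conjunctions of at most k distinct variables, each contributing one of a fixed number of possible conditions. This counting model is a simplifying assumption, not the one used in the slides; for fixed k the count grows polynomially in the number of variables.

```python
from math import comb

def count_candidate_profiles(n_variables: int, max_size: int, n_modalities: int = 1) -> int:
    """Number of conjunctions using 1..max_size distinct variables,
    each variable contributing one of n_modalities possible conditions."""
    return sum(
        comb(n_variables, size) * n_modalities ** size
        for size in range(1, max_size + 1)
    )

for k in (1, 2, 3, 4):
    print(f"V=250, k={k}: {count_candidate_profiles(250, k):,} candidate profiles")
```

The more candidates are examined, the more likely it is that one of them looks good by chance alone, which is why limiting k, V and the number of modalities lowers the overfitting risk.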


CURRENT RESEARCH AREAS

Improving the dynamic aggregation of predictors:
• Using prediction as a topology on data: COBRA algorithm (G. Biau, B. Guedj).
• Weighting schemes based on regret (Lugosi, Stoltz) or regularity (Wintenberger).
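As an illustration of regret-based weighting, the sketch below implements the generic exponentially weighted average forecaster from the prediction-with-expert-advice literature, on synthetic predictors. Whether it matches the specific schemes referenced above is an assumption; the learning rate `eta` and the squared loss are illustrative choices.

```python
import numpy as np

def exponential_weights(cumulative_losses, eta):
    """Weights proportional to exp(-eta * cumulative loss), normalized to sum to 1."""
    losses = np.asarray(cumulative_losses, dtype=float)
    # Subtracting the minimum leaves the weights unchanged and avoids underflow.
    w = np.exp(-eta * (losses - losses.min()))
    return w / w.sum()

rng = np.random.default_rng(7)

# Synthetic setting (assumption): three predictors of a target, with different noise levels.
T = 200
target = rng.normal(size=T)
noise_levels = np.array([0.2, 0.5, 1.0])
predictions = target[:, None] + rng.normal(size=(T, 3)) * noise_levels

cumulative = np.zeros(3)
aggregated = np.empty(T)
for t in range(T):
    weights = exponential_weights(cumulative, eta=0.5)  # uses past losses only
    aggregated[t] = weights @ predictions[t]
    cumulative += (predictions[t] - target[t]) ** 2     # squared loss revealed afterwards

final_weights = exponential_weights(cumulative, eta=0.5)
print("final weights:", np.round(final_weights, 3))
```

The weights shift towards the predictors with the smallest cumulative loss, so the aggregate tracks the best predictor without knowing in advance which one it is.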

Embedding time stationarity requirements in profile search.

Incremental production of backtests.

Visualization of an audit trail between variables and final recommendations.

GPU calculations

…


CONTACT


11, rue Galvani 75017 Paris, France +33 (0)1 45 74 33 05

http://www.quinten-france.com @QuintenFrance

Christophe GEISSLER

33 (0)6 08 60 46 14 [email protected]
