Statistical Inference from Dependent Observations
Constantinos Daskalakis, EECS and CSAIL, MIT
Supervised Learning Setting
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• x_i: feature (aka covariate) vector of example i
• y_i: label (aka response) of example i
• Assumption: {(x_i, y_i)}_{i=1}^n are i.i.d. samples from D, where D is unknown
• Hypothesis class ℋ = {h_θ : θ ∈ Θ} of (randomized) responses
  • e.g. ℋ = {x ↦ ⟨θ, x⟩ : ‖θ‖_2 ≤ 1}
  • e.g. 2: ℋ = {Deep Neural Networks}
  • generally, h_θ might be randomized
• Loss function ℓ : A × A → ℝ
Goal: select θ to minimize the expected loss: min_θ E_{(x,y)∼D}[ℓ(h_θ(x), y)]
Goal 2: in the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*
[Figure: example images with labels "dog", …, "cat"]
E.g. Linear Regression
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled as y_i = θᵀx_i + ε_i, with ε_i i.i.d. zero-mean noise (e.g. Gaussian)
Goal: Infer θ
E.g. 2: Logistic Regression
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}, i = 1, …, n
Assumption: for all i, y_i is sampled with Pr[y_i = ±1 | x_i] ∝ exp(±θᵀx_i)
Goal: Infer θ
Maximum Likelihood Estimator (MLE)
In the standard linear and logistic regression settings, under mild assumptions about the design matrix X, whose rows are the covariate vectors, the MLE is strongly consistent.
The MLE θ̂ satisfies: ‖θ̂ − θ‖_2 = O(√(d/n)).
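To make this rate concrete, here is a minimal sketch (not from the talk) that fits the Gaussian-noise linear model by ordinary least squares, where OLS coincides with the MLE, and compares the estimation error to √(d/n); the dimension, sample sizes, and noise level are arbitrary choices for illustration.

```python
# Minimal sketch: the O(sqrt(d/n)) rate of the MLE in standard linear regression,
# where the MLE under Gaussian noise is ordinary least squares (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
d = 10
theta = rng.normal(size=d) / np.sqrt(d)          # ground-truth coefficient vector

for n in [1_000, 10_000, 100_000]:
    X = rng.normal(size=(n, d))                  # design matrix; rows are covariates x_i
    y = X @ theta + rng.normal(size=n)           # y_i = theta^T x_i + eps_i, eps_i ~ N(0, 1)
    theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    err = np.linalg.norm(theta_hat - theta)
    print(f"n={n:>7}  ||theta_hat - theta||_2 = {err:.4f}   sqrt(d/n) = {np.sqrt(d/n):.4f}")
```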
⢠Recall: ⢠Assumption: ð1, ð1 , ⊠, ðð, ðð âŒððð ð·, where ð· unknown⢠Goal: choose â â â to minimize expected loss under some loss function â, i.e.
minâââ
ðŒ ð¥,ðŠ âŒð· â(â(ð¥), ðŠ)
⢠Let â â â be the Empirical Risk Minimizer, i.e.
â â argminâââ
1
ð
ð
â(â(ð¥ð), ðŠð)
⢠Then:
Lossð· â †infâââ
Lossð· â + ð ðð¶ â /ð , for Boolean â, 0-1 loss â
Lossð· â †infâââ
Lossð· â + ð ð âð(â) , for general â, ð-Lipschitz â
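As a toy illustration of ERM with the 0-1 loss, the sketch below (not from the talk) uses one-dimensional threshold classifiers as the Boolean hypothesis class ℋ; the data-generating distribution, true threshold, and noise level are arbitrary choices.

```python
# Toy sketch of Empirical Risk Minimization with the 0-1 loss over a Boolean
# hypothesis class of 1-D threshold classifiers H = {x -> sign(x - t)}.
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-1, 1, size=n)
y = np.where(x > 0.3, 1, -1)          # labels from a threshold at 0.3 ...
y[rng.random(n) < 0.1] *= -1          # ... plus 10% label noise

thresholds = np.linspace(-1, 1, 201)  # a finite grid standing in for H

def emp_risk(t):
    """Empirical 0-1 loss of the threshold classifier h_t."""
    return np.mean(np.sign(x - t) != y)

t_hat = min(thresholds, key=emp_risk)  # the Empirical Risk Minimizer over the grid
print(f"ERM threshold: {t_hat:.2f}, empirical 0-1 loss: {emp_risk(t_hat):.3f}")
```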
Beyond Realizable Setting: Learnability
Supervised Learning Setting on Steroids
But what do these results really mean?
Broader Perspective
• ML methods commonly operate under stringent assumptions:
  o train set: independent and identically distributed (i.i.d.) samples
  o test set: i.i.d. samples from the same distribution as the training set
• Goal: relax the standard assumptions to accommodate two important challenges:
  o (i) censored/truncated samples and (ii) dependent samples
  o Censoring/truncation ⇒ systematically missing data
    ⇒ train set ≠ test set
  o Data dependencies ⇒ peer effects, spatial and temporal dependencies
    ⇒ no apparent source of independence
Today's Topic: Statistical Inference from Dependent Observations
[Figure: independent vs. dependent observations]
Observations (x_i, y_i)_i are commonly collected on some spatial domain, some temporal domain, or on a social network.
As such, they are not independent, but intricately dependent.
Why dependent?
e.g. spin glasses: neighboring particles influence each other
[Figure: grid of + and − spins]
e.g. social networks: the decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)
Statistical Physics and Machine Learning
⢠Spin Systems [Isingâ25]
⢠MRFs, Bayesian Networks, Boltzmann Machines ⢠Probability Theory
⢠MCMC
⢠Machine Learning
⢠Computer Vision
⢠Game Theory
⢠Computational Biology
⢠Causality
⢠âŠ
Peer Effects on Social Networks
Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al. '96]
• welfare participation [Bertrand et al. '00]
• school achievement [Sacerdote '01]
• participation in retirement plans [Duflo-Saez '03]
• obesity [Trogdon et al. '08, Christakis-Fowler '13]
AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• National study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, …
Microeconomics:
• Behavior/Opinion Dynamics, e.g. [Schelling '78], [Ellison '93], [Young '93, '01], [Montanari-Saberi '10], …
Econometrics:
• Disentangling individual from network effects [Manski '93], [Bramoulle-Djebbari-Fortin '09], [Li-Levina-Zhu '16], …
Menu
• Motivation
• Part I: Regression w/ dependent observations
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Standard Linear Regression
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled as y_i = θᵀx_i + ε_i, with ε_i i.i.d. zero-mean noise (e.g. Gaussian)
Goal: Infer θ
Standard Logistic Regression
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}, i = 1, …, n
Assumption: for all i, y_i is sampled with Pr[y_i = ±1 | x_i] ∝ exp(±θᵀx_i)
Goal: Infer θ
Maximum Likelihood Estimator
In the standard linear and logistic regression settings, under mild assumptions about the design matrix X, whose rows are the covariate vectors, the MLE is strongly consistent.
The MLE θ̂ satisfies: ‖θ̂ − θ‖_2 = O(√(d/n)).
Linear Regression w/ Dependent Samples
Input: n feature vectors x_i ∈ ℝ^d and responses y_i ∈ ℝ, i = 1, …, n
Assumption: for all i, y_i is sampled conditionally on the other responses y_{−i}
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer θ, β
Logistic Regression w/ Dependent Samples (today's focus)
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}, i = 1, …, n
Assumption: for all i, conditioning on y_{−i}, y_i is sampled with Pr[y_i | y_{−i}] = 1 / (1 + exp(−2 y_i (θᵀx_i + β Σ_j A_ij y_j)))
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer θ, β
Input: n feature vectors x_i ∈ ℝ^d and binary responses y_i ∈ {±1}
Assumption: y_1, …, y_n are sampled jointly according to the measure
  Pr[y⃗ = σ] = exp(Σ_i σ_i θᵀx_i + β σᵀAσ) / Z
Parameters:
• coefficient vector θ, inverse temperature β (unknown)
• interaction matrix A (known)
Goal: Infer θ, β
This is an Ising model w/ inverse temperature β and external fields θᵀx_i.
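One standard way to draw a sample from such a joint measure is single-site Gibbs sampling. Below is a minimal sketch (illustrative, not the talk's code): it generates synthetic covariates, picks θ, β, and a random interaction matrix A, and derives each conditional directly from the joint written above. Note the constants in the conditional (the factors 2 and 4) follow from taking A symmetric with zero diagonal; a different normalization of A changes them.

```python
# Minimal Gibbs-sampler sketch for drawing dependent responses y from
#   Pr[y = s]  proportional to  exp( sum_i s_i * theta^T x_i  +  beta * s^T A s ),
# with A symmetric and zero on the diagonal (illustrative choices throughout).
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 3
X = rng.normal(size=(n, d))                      # covariates x_i (rows)
theta = rng.normal(size=d) / np.sqrt(d)          # coefficient vector (chosen here; unknown in the inference problem)
beta = 0.2                                       # inverse temperature
A = (rng.random((n, n)) < 3 / n).astype(float)   # sparse interactions
A = np.triu(A, 1); A = A + A.T                   # symmetric, zero diagonal

def gibbs_sample(n_sweeps=200):
    """Single-site Gibbs sampler; each conditional is derived from the joint above."""
    fields = X @ theta                           # external field theta^T x_i of each node
    y = rng.choice([-1.0, 1.0], size=n)
    for _ in range(n_sweeps):
        for i in range(n):
            # log-odds of y_i = +1 vs -1 given y_{-i}; the factor 4 appears because a
            # symmetric A counts each interacting pair twice in s^T A s
            log_odds = 2 * fields[i] + 4 * beta * (A[i] @ y)
            y[i] = 1.0 if rng.random() < 1 / (1 + np.exp(-log_odds)) else -1.0
    return y

y = gibbs_sample()
print("mean response:", y.mean())
```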
Logistic Regression w/ Dependent Samples (today's focus)
Challenge: we effectively observe one sample, and the likelihood contains Z; i.e. there is no obvious LLN, and the MLE is hard to compute.
For β = 0, this is equivalent to standard logistic regression.
Main Result for Logistic Regression from Dependent Samples
N.B. We show similar results for linear regression with peer effects.
Menu
• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
MPLE instead of MLE
⢠Likelihood involves ð = ðð,ðœ (the partition function), so is non-trivial to compute.
⢠[Besagâ75,âŠ,Chatterjeeâ07] studies the maximum pseudolikelihood estimator (MPLE)
⢠LogPL does not contain ð and is concave. Is MPLE consistent?
⢠[Chatterjeeâ07]: yes, when ðœ > 0, ð = 0
⢠[BMâ18,GMâ18]: yes, when ðœ > 0, ð â 0 â â, ð¥ð = 1, for all ð
⢠General case?
Problem: Given
• x⃗ = (x_1, …, x_n) and
• y⃗ = (y_1, …, y_n) ∈ {±1}^n sampled as Pr[y⃗ = σ] = exp(Σ_i σ_i θᵀx_i + β σᵀAσ) / Z,
infer θ, β.

PL(β, θ) ≜ Π_i Pr_{β,θ,A}[y_i | y_{−i}]^{1/n} = Π_i [1 / (1 + exp(−2 y_i (θᵀx_i + β Σ_j A_ij y_j)))]^{1/n}
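Since log PL is concave in (θ, β), the MPLE can be computed by any off-the-shelf convex optimizer. A minimal sketch (an illustrative demo, not the paper's implementation) reusing X, y, A, and the conventions from the Gibbs-sampler sketch above, with scipy's L-BFGS-B and finite-difference gradients:

```python
# Minimal MPLE sketch: minimize the (convex) negative log-pseudolikelihood in (theta, beta).
import numpy as np
from scipy.optimize import minimize

def neg_log_pseudolikelihood(params, X, y, A):
    theta, beta = params[:-1], params[-1]
    # log-odds of y_i = +1 given y_{-i} (A symmetric, zero diagonal, as in the sampler above)
    log_odds = 2 * X @ theta + 4 * beta * (A @ y)
    # -log Pr[y_i | y_{-i}] = log(1 + exp(-y_i * log_odds_i)); average gives -(1/n) log PL
    return np.mean(np.logaddexp(0.0, -y * log_odds))

x0 = np.zeros(X.shape[1] + 1)                    # start at theta = 0, beta = 0
res = minimize(neg_log_pseudolikelihood, x0, args=(X, y, A), method="L-BFGS-B")
theta_hat, beta_hat = res.x[:-1], res.x[-1]
print("theta_hat:", np.round(theta_hat, 3), " beta_hat:", round(beta_hat, 3))
```

With a larger network (and under conditions on A such as those in the main result), one would expect (θ̂, β̂) to approach the parameters used to generate the data; with n = 50 the estimates are noisy.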
Analysis
⢠Concentration: gradient of log-pseudolikelihood at truth should be small; ⢠show ð»logðð¿ ð0, ðœ0 2
2 is ð ð/ð with probability at least 99%
⢠Anti-concentration: Constant lower bound on minð,ðœ
ðððð(âð»2 log ðð¿ (ð, ðœ)) w.pr. ⥠99%
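These two ingredients combine through a standard strong-concavity argument; the derivation below fills in that step and is a sketch, not a verbatim statement from the talk.

```latex
% Let f(\theta,\beta) := \log PL(\theta,\beta), let (\hat\theta,\hat\beta) be its maximizer (the MPLE),
% and write \Delta := (\hat\theta,\hat\beta) - (\theta_0,\beta_0).
% Suppose -\nabla^2 f \succeq \lambda I with \lambda > 0 (anti-concentration). Then, by concavity,
\begin{align*}
0 \;\le\; f(\hat\theta,\hat\beta) - f(\theta_0,\beta_0)
  \;&\le\; \nabla f(\theta_0,\beta_0)^\top \Delta \;-\; \tfrac{\lambda}{2}\,\|\Delta\|_2^2 ,\\
\text{so}\qquad
\|\Delta\|_2 \;&\le\; \frac{2\,\|\nabla f(\theta_0,\beta_0)\|_2}{\lambda}
  \;=\; O\!\left(\sqrt{d/n}\right),
\end{align*}
% using the concentration bound \|\nabla f(\theta_0,\beta_0)\|_2^2 = O(d/n).
```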
Menu
• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Supervised Learning
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• Hypothesis class ℋ = {h_θ : θ ∈ Θ} of (randomized) responses
• Loss function ℓ : A × A → ℝ
Assumption: {(x_i, y_i)}_{i=1}^n are i.i.d. samples from D, where D is unknown
Goal: select θ to minimize the expected loss: min_θ E_{(x,y)∼D}[ℓ(h_θ(x), y)]
Goal 2: in the realizable setting (i.e. when, under D, y ∼ h_{θ*}(x)), estimate θ*
Supervised Learning w/ Dependent Observations
Given:
• Training set {(x_i, y_i)}_{i=1}^n of examples
• Hypothesis class ℋ = {h_θ : θ ∈ Θ} of (randomized) responses
• Loss function ℓ : A × A → ℝ
Assumption′: a joint distribution D samples the training examples and the unknown test sample jointly, i.e.
• {(x_i, y_i)}_{i=1}^n ∪ {(x, y)} ∼ D
Goal: select θ to minimize min_θ E[ℓ(h_θ(x), y)]
Main result
Learnability is possible when the joint distribution D over the training set and the test sample satisfies Dobrushin's condition.
Dobrushin's condition
• Given a random vector Z⃗ = (Z_1, …, Z_n) with some joint distribution, define the influence of variable j on variable i:
  • "worst effect of Z_j on the conditional distribution of Z_i given Z_{−i−j}"
  • Inf_{j→i} = sup_{z_j, z_j′, z_{−i−j}} d_TV( Pr[Z_i | z_j, z_{−i−j}], Pr[Z_i | z_j′, z_{−i−j}] ) (illustrated by the brute-force sketch below)
• Dobrushin's condition: "all nodes have limited total influence exerted on them"
  • For all i: Σ_{j≠i} Inf_{j→i} < 1.
Implies: concentration of measure; fast mixing of the Gibbs sampler.
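For intuition, the influences Inf_{j→i} of a small Ising model can be computed by brute force directly from the definition, enumerating all conditionings. The sketch below (illustrative only; exponential in n, with arbitrary synthetic parameters, and the same conditional convention as the earlier sketches) does this and checks Dobrushin's condition.

```python
# Brute-force influences Inf_{j -> i} and a Dobrushin check for a tiny Ising model
# with joint measure proportional to exp( sum_i s_i h_i + beta * s^T A s ).
import itertools
import numpy as np

rng = np.random.default_rng(3)
n, beta = 6, 0.15
A = np.triu((rng.random((n, n)) < 0.5).astype(float), 1)
A = A + A.T                                    # symmetric interactions, zero diagonal
h = 0.3 * rng.normal(size=n)                   # external fields (playing the role of theta^T x_i)

def p_plus(i, sigma):
    """Pr[sigma_i = +1 | sigma_{-i}] under the joint measure above."""
    return 1.0 / (1.0 + np.exp(-(2 * h[i] + 4 * beta * (A[i] @ sigma))))

influence = np.zeros((n, n))                   # influence[j, i] = Inf_{j -> i}
for i in range(n):
    others = [k for k in range(n) if k != i]
    for j in others:
        rest = [k for k in others if k != j]
        worst = 0.0
        for z in itertools.product([-1.0, 1.0], repeat=len(rest)):   # all conditionings z_{-i-j}
            sigma = np.zeros(n)
            sigma[rest] = z
            sigma[j] = 1.0;  p1 = p_plus(i, sigma)
            sigma[j] = -1.0; p2 = p_plus(i, sigma)
            worst = max(worst, abs(p1 - p2))   # TV distance between the two Bernoulli conditionals
        influence[j, i] = worst

total = influence.sum(axis=0)                  # total influence exerted on each node i
print("max_i sum_{j != i} Inf_{j->i} =", round(total.max(), 3),
      "(Dobrushin holds)" if total.max() < 1 else "(Dobrushin fails)")
```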
Learnability under Dobrushin's condition
• Suppose:
  • (X_1, Y_1), …, (X_n, Y_n), (X, Y) ∼ D, where D is their joint distribution
  • D satisfies Dobrushin's condition
  • the pairs have the same marginal distribution under D
• Define Loss_D(h) = E[ℓ(h(X), Y)]
• There exists a learning algorithm which, given (X_1, Y_1), …, (X_n, Y_n), outputs ĥ ∈ ℋ such that
  Loss_D(ĥ) ≤ inf_{h∈ℋ} Loss_D(h) + Õ(√(VC(ℋ)/n)), for Boolean ℋ and the 0-1 loss ℓ
  Loss_D(ĥ) ≤ inf_{h∈ℋ} Loss_D(h) + Õ(L · 𝒢_n(ℋ)), for general ℋ and L-Lipschitz ℓ (𝒢_n: Gaussian complexity)
  (a condition stronger than Dobrushin's is needed for general ℓ)
Essentially the same bounds as in the i.i.d. case.
Menu
• Motivation
• Part I: Regression w/ dependent observations
  • Proof Ideas
• Part II: Statistical Learning Theory w/ dependent observations
• Conclusions
Goal: relax the standard assumptions to accommodate two important challenges
o (i) censored/truncated samples and (ii) dependent samples
o Censoring/truncation ⇒ systematically missing data
  ⇒ train set ≠ test set
o Data dependencies ⇒ peer effects, spatial and temporal dependencies
  ⇒ no apparent source of independence
Goals of Broader Line of Work
o Regression on a network (linear and logistic, √(d/n) rates):
  Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC '19).
o Statistical Learning Theory framework (learnability and generalization bounds):
  Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin's Condition. In the 32nd Annual Conference on Learning Theory (COLT '19).
Statistical Estimation from Dependent Observations
Thank you!