
Statistical Inference from Dependent Observations
Constantinos Daskalakis, EECS and CSAIL, MIT

Transcript of "Statistical Inference from Dependent Observations"

Page 1

Statistical Inference from Dependent Observations

Constantinos Daskalakis, EECS and CSAIL, MIT

Page 2

Supervised Learning Setting

Given:

• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples

• $x_i$: feature (aka covariate) vector of example $i$

• $y_i$: label (aka response) of example $i$

• Assumption: $\{(x_i, y_i)\}_{i=1}^n \sim_{\text{i.i.d.}} D$, where $D$ is unknown

• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
  • e.g. $\mathcal{H} = \{\langle \theta, \cdot \rangle \mid \|\theta\|_2 \le 1\}$
  • e.g. 2: $\mathcal{H} = \{\text{some class of deep neural networks}\}$
  • generally, $h_\theta$ might be randomized

• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Goal: Select $\theta$ to minimize expected loss: $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\,\ell(h_\theta(x), y)$

Goal 2: In the realizable setting (i.e. when, under $D$, $y \sim h_{\theta^*}(x)$), estimate $\theta^*$

[Slide images: a photo labeled "dog" and a photo labeled "cat", as example (x, y) pairs]

Page 3

E.g. Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \ldots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^T x_i + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent across $i$

Goal: Infer $\theta$
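As a concrete, purely illustrative sketch of this setting (mine, not the talk's): generate data from the model above and recover $\theta$ by ordinary least squares, which is the MLE under Gaussian noise. All sizes and the noise level below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, noise_std = 5000, 10, 1.0                 # illustrative sizes

theta_true = rng.normal(size=d)                 # unknown coefficient vector
X = rng.normal(size=(n, d))                     # rows are the feature vectors x_i
y = X @ theta_true + noise_std * rng.normal(size=n)   # y_i = theta^T x_i + eps_i

# MLE under Gaussian noise = ordinary least squares
theta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

err = np.linalg.norm(theta_hat - theta_true)
print(f"||theta_hat - theta||_2 = {err:.3f}   (compare to sqrt(d/n) = {np.sqrt(d / n):.3f})")
```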

Page 4

E.g. 2: Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = \sigma \mid x_i] = \frac{1}{1 + \exp(-2\sigma\,\theta^T x_i)}$ for $\sigma \in \{\pm 1\}$

Goal: Infer $\theta$
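A minimal numpy sketch of this setting (again mine, not the talk's): sample $y_i$ from the conditional law above and fit $\theta$ by gradient ascent on the log-likelihood; the sizes and step size are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5000, 10
theta_true = rng.normal(size=d) / np.sqrt(d)
X = rng.normal(size=(n, d))

# Sample y_i in {-1,+1} with Pr[y_i = +1 | x_i] = 1 / (1 + exp(-2 theta^T x_i))
p_plus = 1.0 / (1.0 + np.exp(-2.0 * (X @ theta_true)))
y = np.where(rng.random(n) < p_plus, 1.0, -1.0)

# MLE by gradient ascent on (1/n) * sum_i log Pr[y_i | x_i]
theta = np.zeros(d)
for _ in range(2000):
    m = X @ theta
    grad = X.T @ (y - np.tanh(m)) / n      # since E[y_i | x_i] = tanh(theta^T x_i)
    theta += 0.5 * grad

print("estimation error:", np.linalg.norm(theta - theta_true))
```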

Page 5

Maximum Likelihood Estimator (MLE)

In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.

The MLE $\hat{\theta}$ satisfies: $\|\hat{\theta} - \theta\|_2 = O\big(\sqrt{d/n}\big)$

Page 6

Beyond the Realizable Setting: Learnability

• Recall:
  • Assumption: $(X_1, Y_1), \ldots, (X_n, Y_n) \sim_{\text{i.i.d.}} D$, where $D$ is unknown
  • Goal: choose $h \in \mathcal{H}$ to minimize the expected loss $\mathrm{Loss}_D(h) := \mathbb{E}_{(x,y)\sim D}\,\ell(h(x), y)$ under some loss function $\ell$, i.e. $\min_{h \in \mathcal{H}} \mathrm{Loss}_D(h)$

• Let $\hat{h} \in \mathcal{H}$ be the Empirical Risk Minimizer, i.e. $\hat{h} \in \arg\min_{h \in \mathcal{H}} \frac{1}{n} \sum_i \ell(h(x_i), y_i)$

• Then (a toy illustration follows this list):
  • $\mathrm{Loss}_D(\hat{h}) \le \inf_{h \in \mathcal{H}} \mathrm{Loss}_D(h) + O\big(\sqrt{VC(\mathcal{H})/n}\big)$, for Boolean $\mathcal{H}$ and 0-1 loss $\ell$
  • $\mathrm{Loss}_D(\hat{h}) \le \inf_{h \in \mathcal{H}} \mathrm{Loss}_D(h) + O\big(\lambda\,\mathcal{R}_n(\mathcal{H})\big)$, for general $\mathcal{H}$ and $\lambda$-Lipschitz $\ell$
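To make the ERM guarantee concrete, here is a toy, purely illustrative example (not from the talk): thresholds on $[0,1]$ form a Boolean class of VC dimension 1, and under 0-1 loss the empirical risk minimizer's excess risk is typically of order $\sqrt{VC(\mathcal{H})/n}$ or smaller. The data distribution, noise rate, and grid are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2000

# i.i.d. data: x ~ Uniform[0,1], clean label sign(x - 0.3), flipped with probability 0.1
x = rng.random(n)
y = np.where(x > 0.3, 1, -1) * np.where(rng.random(n) < 0.1, -1, 1)

# Hypothesis class: thresholds h_t(x) = sign(x - t); VC dimension 1
thresholds = np.linspace(0.0, 1.0, 201)

def empirical_risk(t):
    pred = np.where(x > t, 1, -1)
    return np.mean(pred != y)              # 0-1 loss on the training set

t_hat = min(thresholds, key=empirical_risk)   # the ERM over the grid

# True risk of threshold t under this D: 0.1 + 0.8 * |t - 0.3|, minimized at t = 0.3
true_risk = lambda t: 0.1 + 0.8 * abs(t - 0.3)
print("ERM threshold:", t_hat)
print("excess risk:", true_risk(t_hat) - 0.1, " vs sqrt(VC/n) =", np.sqrt(1.0 / n))
```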

Page 7

Supervised Learning Setting on Steroids

But what do these results really mean?

Page 8

Broader Perspective

❑ ML methods commonly operate under stringent assumptions

o train set: independent and identically distributed (i.i.d.) samples

o test set: i.i.d. samples from same distribution as training

❑ Goal: relax standard assumptions to accommodate two important challenges

o (i) censored/truncated samples and (ii) dependent samples

o Censoring/truncation ⇐ systematically missing data

⇒ train set ≠ test set

o Data Dependencies ⇐ peer effects, spatial, temporal dependencies

⇒ no apparent source for independence

Page 9

Today’s Topic: Statistical Inference from Dependent Observations

[Slide figure: side-by-side illustrations labeled "independent" and "dependent"]

Page 10

Observations $(x_i, y_i)_i$ are commonly collected on some spatial domain, some temporal domain, or on a social network.

As such, they are not independent, but intricately dependent

Why dependent?

e.g. spin glasses: neighboring particles influence each other

[Slide figure: lattice of + and − spins]

e.g. social networks: decisions/opinions of nodes are influenced by the decisions of their neighbors (peer effects)

Page 11

Statistical Physics and Machine Learning

• Spin Systems [Ising’25]

• MRFs, Bayesian Networks, Boltzmann Machines

• Probability Theory

• MCMC

• Machine Learning

• Computer Vision

• Game Theory

• Computational Biology

• Causality

• 


[Slide figure: lattice of + and − spins]

Page 12

Peer Effects on Social Networks

Several studies of peer effects, in applications such as:
• criminal networks [Glaeser et al.'96]
• welfare participation [Bertrand et al.'00]
• school achievement [Sacerdote'01]
• participation in retirement plans [Duflo-Saez'03]
• obesity [Trogdon et al.'08, Christakis-Fowler'13]

AddHealth Dataset:
• National Longitudinal Study of Adolescent Health
• national study of students in grades 7-12
• friendship networks, personal and school life, age, gender, race, socio-economic background, health, 


Microeconomics:
• behavior/opinion dynamics, e.g. [Schelling'78], [Ellison'93], [Young'93, '01], [Montanari-Saberi'10], 


Econometrics:
• disentangling individual from network effects, e.g. [Manski'93], [Bramoulle-Djebbari-Fortin'09], [Li-Levina-Zhu'16], 


Page 13

Menu

• Motivation

• Part I: Regression w/ dependent observations

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 14

Menu

• Motivation

• Part I: Regression w/ dependent observations

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 15

Standard Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \ldots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^T x_i + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent across $i$

Goal: Infer 𝜃

Page 16

Standard Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = \sigma \mid x_i] = \frac{1}{1 + \exp(-2\sigma\,\theta^T x_i)}$ for $\sigma \in \{\pm 1\}$

Goal: Infer 𝜃

Page 17

Maximum Likelihood Estimator

In the standard linear and logistic regression settings, under mild assumptions about the design matrix 𝑋, whose rows are the covariate vectors, MLE is strongly consistent.

The MLE $\hat{\theta}$ satisfies: $\|\hat{\theta} - \theta\|_2 = O\big(\sqrt{d/n}\big)$

Page 18

Standard Linear Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \ldots, n$
Assumption: for all $i$, $y_i$ is sampled as $y_i = \theta^T x_i + \varepsilon_i$, with $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$ independent across $i$

Goal: Infer 𝜃

Page 19

Linear Regression w/ Dependent Samples

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and responses $y_i \in \mathbb{R}$, $i = 1, \ldots, n$
Assumption: for all $i$, $y_i$ is sampled as follows, conditioning on $y_{-i}$:

Parameters:
• coefficient vector $\theta$ and inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer 𝜃, 𝛜

Page 20

Standard Logistic Regression

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as $\Pr[y_i = \sigma \mid x_i] = \frac{1}{1 + \exp(-2\sigma\,\theta^T x_i)}$ for $\sigma \in \{\pm 1\}$

Goal: Infer 𝜃

Page 21

Logistic Regression w/ Dependent Samples (today's focus)

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: for all $i$, $y_i$ is sampled as follows, conditioning on $y_{-i}$:
$$\Pr[y_i = \sigma_i \mid y_{-i}] = \frac{1}{1 + \exp\!\big(-2\sigma_i(\theta^T x_i + \beta A_i^T y)\big)}$$

Parameters:
• coefficient vector $\theta$ and inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer 𝜃, 𝛜

Page 22

Logistic Regression w/ Dependent Samples (today's focus)

Input: $n$ feature vectors $x_i \in \mathbb{R}^d$ and binary responses $y_i \in \{\pm 1\}$
Assumption: $y_1, \ldots, y_n$ are sampled jointly according to the measure
$$\Pr[\bar{y} = \sigma] = \frac{\exp\!\big(\sum_i \theta^T x_i\,\sigma_i + \beta\,\sigma^T A \sigma\big)}{Z}$$
i.e., an Ising model w/ inverse temperature $\beta$ and external fields $\theta^T x_i$.

Parameters:
• coefficient vector $\theta$ and inverse temperature $\beta$ (unknown)
• interaction matrix $A$ (known)

Goal: Infer $\theta, \beta$

For $\beta = 0$ this is equivalent to standard logistic regression.

Challenge: we observe a single sample from the joint measure (so there is no LLN to lean on), and the likelihood contains the partition function $Z$, so the MLE is hard to compute.
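Since $Z$ is intractable, one natural way to simulate data of this kind is Gibbs sampling. The sketch below is mine, not the talk's: it assumes the conditional law $\Pr[y_i = +1 \mid y_{-i}] = 1/(1 + \exp(-2(\theta^T x_i + \beta (Ay)_i)))$, matching the pseudolikelihood factor that appears on a later slide; the exact field convention for a symmetric $A$ (no extra factor of 2), the cycle-graph example, and all sizes are assumptions made for illustration.

```python
import numpy as np

def gibbs_sample_ising(X, A, theta, beta, sweeps=500, rng=None):
    """One sample y in {-1,+1}^n from the Ising measure with external fields theta^T x_i.

    Uses the conditional Pr[y_i = +1 | y_{-i}] = sigmoid(2 * (theta^T x_i + beta * (A y)_i)),
    with A the known interaction matrix (zero diagonal assumed).
    """
    if rng is None:
        rng = np.random.default_rng()
    n = X.shape[0]
    ext = X @ theta                                # external field per node
    y = rng.choice([-1.0, 1.0], size=n)            # arbitrary initial spin configuration
    for _ in range(sweeps):
        for i in range(n):
            field = ext[i] + beta * (A[i] @ y)     # local field at node i
            p_plus = 1.0 / (1.0 + np.exp(-2.0 * field))
            y[i] = 1.0 if rng.random() < p_plus else -1.0
    return y

# Illustrative use: a cycle graph with normalized interactions
rng = np.random.default_rng(3)
n, d = 200, 5
X = rng.normal(size=(n, d))
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[i, (i - 1) % n] = 0.5    # two neighbors per node, row sum 1
theta_true, beta_true = rng.normal(size=d) / np.sqrt(d), 0.3
y = gibbs_sample_ising(X, A, theta_true, beta_true, rng=rng)
print("sampled spins:", y[:10])
```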

Page 23

Main Result for Logistic Regression from Dependent Samples

N.B. We show similar results for linear regression with peer effects.

Page 24

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 25

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 26

MPLE instead of MLE

• The likelihood involves $Z = Z_{\theta,\beta}$ (the partition function), so it is non-trivial to compute.

• [Besag'75, 
, Chatterjee'07] study the maximum pseudolikelihood estimator (MPLE)

• The log-pseudolikelihood does not contain $Z$ and is concave. Is the MPLE consistent?
  • [Chatterjee'07]: yes, when $\beta > 0$ and $\theta = 0$
  • [BM'18, GM'18]: yes, when $\beta > 0$ and $\theta \ne 0$ is a scalar ($\theta \in \mathbb{R}$), with $x_i = 1$ for all $i$
  • General case?

Problem: Given
• $\bar{x} = (x_1, \ldots, x_n)$ and
• $\bar{y} = (y_1, \ldots, y_n) \in \{\pm 1\}^n$ sampled as
$$\Pr[\bar{y} = \sigma] = \frac{\exp\!\big(\sum_i \theta^T x_i\,\sigma_i + \beta\,\sigma^T A \sigma\big)}{Z},$$
infer $\theta, \beta$.

$$PL(\beta, \theta) := \prod_i \Pr_{\beta,\theta,A}\big[y_i \mid y_{-i}\big]^{1/n} = \prod_i \left(\frac{1}{1 + \exp\!\big(-2\,y_i(\theta^T x_i + \beta A_i^T y)\big)}\right)^{1/n}$$
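A minimal sketch (mine, not the talk's) of computing the MPLE by gradient ascent: since $\log PL$ is concave in $(\theta, \beta)$ and contains no $Z$, plain first-order ascent suffices. It assumes data $(X, A, y)$ like those produced by the Gibbs-sampling sketch a few slides back; the step size and iteration count are illustrative.

```python
import numpy as np

def mple(X, A, y, steps=3000, lr=0.5):
    """Maximum pseudolikelihood estimate of (theta, beta).

    log PL(beta, theta) = (1/n) * sum_i [ y_i * m_i - log(e^{m_i} + e^{-m_i}) ],
    where m_i = theta^T x_i + beta * (A y)_i.  Concave; no partition function.
    """
    n, d = X.shape
    a = A @ y                                   # (A y)_i is fixed given the observed y
    theta, beta = np.zeros(d), 0.0
    for _ in range(steps):
        m = X @ theta + beta * a
        resid = y - np.tanh(m)                  # d/dm_i of log Pr[y_i | y_{-i}]
        theta += lr * (X.T @ resid) / n
        beta += lr * (a @ resid) / n
    return theta, beta

# e.g., with (X, A, y, theta_true, beta_true) from the Gibbs-sampling sketch:
# theta_hat, beta_hat = mple(X, A, y)
# print(np.linalg.norm(theta_hat - theta_true), abs(beta_hat - beta_true))
```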

Page 27

Analysis

Problem: Given
• $\bar{x} = (x_1, \ldots, x_n)$ and
• $\bar{y} = (y_1, \ldots, y_n) \in \{\pm 1\}^n$ sampled as
$$\Pr[\bar{y} = \sigma] = \frac{\exp\!\big(\sum_i \theta^T x_i\,\sigma_i + \beta\,\sigma^T A \sigma\big)}{Z},$$
infer $\theta, \beta$.

$$PL(\beta, \theta) := \prod_i \Pr_{\beta,\theta,A}\big[y_i \mid y_{-i}\big]^{1/n}$$

Page 28

Analysis

Problem: Given
• $\bar{x} = (x_1, \ldots, x_n)$ and
• $\bar{y} = (y_1, \ldots, y_n) \in \{\pm 1\}^n$ sampled as
$$\Pr[\bar{y} = \sigma] = \frac{\exp\!\big(\sum_i \theta^T x_i\,\sigma_i + \beta\,\sigma^T A \sigma\big)}{Z},$$
infer $\theta, \beta$.

• Concentration: the gradient of the log-pseudolikelihood at the truth should be small; show that $\|\nabla \log PL(\theta_0, \beta_0)\|_2^2$ is $O(d/n)$ with probability at least 99%
• Anti-concentration: a constant lower bound on $\min_{\theta,\beta} \lambda_{\min}\!\big(-\nabla^2 \log PL(\theta, \beta)\big)$ with probability ≥ 99%

$$PL(\beta, \theta) := \prod_i \Pr_{\beta,\theta,A}\big[y_i \mid y_{-i}\big]^{1/n}$$
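For reference, here is the standard computation behind the concentration step (my own working-out of the first step, not the talk's full argument): the gradient of the log-pseudolikelihood has coordinates whose summands are conditionally mean-zero at the true parameters. The $O(d/n)$ bound itself requires the dependent-data concentration argument from the talk and is not derived here.

```latex
\[
\log PL(\beta,\theta) \;=\; \frac{1}{n}\sum_i \log \Pr_{\beta,\theta,A}[y_i \mid y_{-i}]
\;=\; \frac{1}{n}\sum_i \Big( y_i m_i - \log\!\big(e^{m_i}+e^{-m_i}\big) \Big),
\qquad m_i := \theta^T x_i + \beta A_i^T y .
\]
\[
\nabla_\theta \log PL(\beta,\theta) \;=\; \frac{1}{n}\sum_i \big(y_i - \tanh(m_i)\big)\, x_i ,
\qquad
\partial_\beta \log PL(\beta,\theta) \;=\; \frac{1}{n}\sum_i \big(y_i - \tanh(m_i)\big)\, A_i^T y .
\]
% At the true (theta_0, beta_0):  E[y_i | y_{-i}] = tanh(m_i), so each summand
% y_i - tanh(m_i) has conditional mean zero given y_{-i}; this is the structure the
% concentration argument exploits to show ||grad log PL(theta_0, beta_0)||_2^2 = O(d/n).
```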

Page 29

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 30

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 31

Supervised Learning

Given:
• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Assumption: $\{(x_i, y_i)\}_{i=1}^n \sim_{\text{i.i.d.}} D$, where $D$ is unknown

Goal: Select $\theta$ to minimize expected loss: $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\,\ell(h_\theta(x), y)$

Goal 2: In the realizable setting (i.e. when, under $D$, $y \sim h_{\theta^*}(x)$), estimate $\theta^*$

[Slide images: a photo labeled "dog" and a photo labeled "cat", as example (x, y) pairs]

Page 32

Supervised Learning w/ Dependent Observations

Given:
• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Assumption: $\{(x_i, y_i)\}_{i=1}^n \sim_{\text{i.i.d.}} D$, where $D$ is unknown

Goal: Select $\theta$ to minimize expected loss: $\min_\theta \; \mathbb{E}_{(x,y)\sim D}\,\ell(h_\theta(x), y)$

Goal 2: In the realizable setting (i.e. when, under $D$, $y \sim h_{\theta^*}(x)$), estimate $\theta^*$

[Slide images: a photo labeled "dog" and a photo labeled "cat", as example (x, y) pairs]

Page 33

Supervised Learning w/ Dependent Observations

Given:
• Training set $\{(x_i, y_i)\}_{i=1}^n$ of examples
• Hypothesis class $\mathcal{H} = \{h_\theta \mid \theta \in \Theta\}$ of (randomized) responses
• Loss function: $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$

Assumption': a joint distribution $D$ samples the training examples and the unknown test sample jointly, i.e.
• $\{(x_i, y_i)\}_{i=1}^n \cup \{(x, y)\} \sim D$

Goal: select $\theta$ to minimize $\mathbb{E}\,\ell(h_\theta(x), y)$

[Slide images: a photo labeled "dog" and a photo labeled "cat", as example (x, y) pairs]

Page 34

Main result

Learnability is possible when joint distribution 𝐷 over training set and test set satisfies Dobrushin’s condition.

Page 35

Dobrushin’s condition

• Given a random vector $\bar{Z} = (Z_1, \ldots, Z_m)$ with some joint distribution, define the influence of variable $i$ on variable $j$:
  • "the worst effect of $Z_i$ on the conditional distribution of $Z_j$ given $Z_{-i-j}$"
  • $\mathrm{Inf}_{i \to j} = \sup_{z_i,\, z_i',\, z_{-i-j}} d_{TV}\!\big(\Pr[Z_j \mid z_i, z_{-i-j}],\ \Pr[Z_j \mid z_i', z_{-i-j}]\big)$

• Dobrushin's condition: "all nodes have limited total influence exerted on them", i.e. for all $i$, $\sum_{j \ne i} \mathrm{Inf}_{j \to i} < 1$.

Implies: concentration of measure; fast mixing of the Gibbs sampler. (A check for the Ising-type models above is sketched below.)
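As a concrete, hypothetical check for the Ising-type models from Part I: flipping $y_j$ shifts node $i$'s local field by $2\beta|A_{ij}|$, which gives the bound $\mathrm{Inf}_{j \to i} \le \tanh(\beta|A_{ij}|)$. The sketch below (mine, not the talk's) verifies the resulting sufficient condition $\max_i \sum_j \tanh(\beta|A_{ij}|) < 1$; since this is only an upper bound on the influences, passing the check implies Dobrushin's condition, while failing it is inconclusive.

```python
import numpy as np

def dobrushin_check(A, beta):
    """Sufficient check of Dobrushin's condition for the Ising-type model above.

    Flipping y_j shifts node i's local field by 2*beta*|A_ij|, giving the bound
    Inf_{j->i} <= tanh(beta * |A_ij|).  Dobrushin's condition asks that every node's
    total incoming influence be < 1, so we check the worst row sum of this bound.
    """
    influence_bound = np.tanh(beta * np.abs(A))
    np.fill_diagonal(influence_bound, 0.0)          # no self-influence
    total = influence_bound.sum(axis=1).max()       # worst total influence on any node
    return total, total < 1.0

# e.g. for the cycle-graph interaction matrix from the earlier sketch (row sums 1):
# total, ok = dobrushin_check(A, beta=0.3)   # total = 2*tanh(0.15) ~ 0.30, ok = True
```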

Page 36

Learnability under Dobrushin’s condition

• Suppose:
  • $(X_1, Y_1), \ldots, (X_n, Y_n), (X, Y) \sim D$, where $D$ is a joint distribution
  • $D$ satisfies Dobrushin's condition
  • $D$ has the same marginals

• Define $\mathrm{Loss}_D(h) = \mathbb{E}\,\ell(h(x), y)$

• There exists a learning algorithm which, given $(X_1, Y_1), \ldots, (X_n, Y_n)$, outputs $\hat{h} \in \mathcal{H}$ such that
  • $\mathrm{Loss}_D(\hat{h}) \le \inf_{h \in \mathcal{H}} \mathrm{Loss}_D(h) + \tilde{O}\big(\sqrt{VC(\mathcal{H})/n}\big)$, for Boolean $\mathcal{H}$ and 0-1 loss $\ell$
  • $\mathrm{Loss}_D(\hat{h}) \le \inf_{h \in \mathcal{H}} \mathrm{Loss}_D(h) + \tilde{O}\big(\lambda\,\mathcal{G}_n(\mathcal{H})\big)$, for general $\mathcal{H}$ and $\lambda$-Lipschitz $\ell$
  (a condition stronger than Dobrushin's is needed for general $\mathcal{H}$)

Essentially the same bounds as in the i.i.d. case.

Page 37

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 38

Menu

• Motivation

• Part I: Regression w/ dependent observations• Proof Ideas

• Part II: Statistical Learning Theory w/ dependent observations

• Conclusions

Page 39

Goals of Broader Line of Work

Goal: relax standard assumptions to accommodate two important challenges

o (i) censored/truncated samples and (ii) dependent samples

o Censoring/truncation ⇐ systematically missing data
  ⇒ train set ≠ test set

o Data dependencies ⇐ peer effects, spatial, temporal dependencies
  ⇒ no apparent source of independence

Page 40

o Regression on a network (linear and logistic, $\sqrt{d/n}$ rates):

Constantinos Daskalakis, Nishanth Dikkala, Ioannis Panageas: Regression from Dependent Observations. In the 51st Annual ACM Symposium on the Theory of Computing (STOC'19).

o Statistical Learning Theory Framework (learnability and generalization bounds):

Yuval Dagan, Constantinos Daskalakis, Nishanth Dikkala, Siddhartha Jayanti: Generalization and Learning under Dobrushin's Condition. In the 32nd Annual Conference on Learning Theory (COLT'19).

Statistical Estimation from Dependent Observations

Thank you!