
High Dimensional Separable Representations for Statistical Estimation and Controlled Sensing

Theodoros Tsiligkaridis†

†Dept. of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor

Ph.D. Thesis Presentation, December 11, 2013


Motivation

- Separable approximations are effective dimensionality reduction techniques for high dimensional problems.

- Covariance estimation: reduced computational complexity & improved estimation accuracy. Statistical estimation performance for separable models in high dimensions? Model mismatch?

- Centralized controlled sensing leads to great performance gains at the expense of query design. Separable approximations to the optimal joint policy? Performance degradation?

- Controlled sensing over a network of greedy agents. Separable representation of the information state? Separable representation of the policy? Convergence?


Application: Spatiotemporal Signal Processing

Figure: U-component of wind speed as a function of time and latitude/longitude for year 2008. (Source: National Centers for Environmental Prediction, NOAA)


Application: Centralized Active Multisensor Target Localization

[Diagram: sensors S1, S2, S3 receive queries A_n(1), A_n(2), A_n(3) about target X* and return responses Y_{n+1}(1), Y_{n+1}(2), Y_{n+1}(3) to a fusion center.]

Figure: Illustration of a basic centralized collaborative tracking system.


Application: Decentralized Active Multisensor Target Localization

[Diagram: sensors S1–S5 issue queries A_n(1)–A_n(5) about target X*; there is no fusion center.]

Figure: Illustration of a basic decentralized collaborative tracking system.


Impact

- Engineering: collaborative on-road vehicle recognition & tracking, optimization & design of active sensing systems (e.g., frequency agile radar, multicamera object tracking with PTZ cameras), conditions on network structure for successful aggregation of information in decentralized settings, human-in-the-loop decision making

- Signal Processing & Control: covariance decompositions for multidimensional data with theoretical guarantees, centralized & decentralized collaborative estimation with active queries, non-Bayesian social learning with active queries over finite networks leading to global consistency, decentralized stochastic search

- Social Sciences: social learning & opinion dynamics, adaptive testing, recommendation systems, multitask learning, interview design


Contributions of Thesis

1. Performance bounds for high-dimensional Kronecker-product structured covariance matrix estimation

2. Optimal query design for a centralized collaborative controlled sensing system for target localization

3. Global convergence theory for decentralized collaborative controlled sensing for target localization


Kronecker Graphical Lasso


Mathematical setting

Observed d × n random matrix:

$$\mathbf{Z} = \begin{bmatrix} z_{1,1} & \cdots & z_{1,n} \\ \vdots & \ddots & \vdots \\ z_{d,1} & \cdots & z_{d,n} \end{bmatrix} = [\mathbf{z}_1, \ldots, \mathbf{z}_n]$$

Each column of Z is an independent realization of a Gaussian random vector

$$\mathbf{z} = [z_1, \ldots, z_d]^T$$

Of interest: estimate the d × d inverse covariance (precision) matrix of z (and the covariance matrix):

$$\Theta = \Sigma^{-1}, \qquad \Sigma = \mathrm{cov}(\mathbf{z}) = \mathbb{E}[\mathbf{z}\mathbf{z}^T]$$

Gaussian graphical models: activity recognition, gene expression networks, social networks, multiple financial time series.


Gaussian Graphical Models

Consider a random vector measurement Z ∈ R^d. The joint probability distribution of the d measurements can be represented as an undirected graph G = (V, E). Edge (i, j) ∉ E iff Z_i and Z_j are conditionally independent given all the other variables.

- If Z is a Gaussian random vector, conditional independence relationships between variables are encoded in the precision matrix (Lauritzen [1996]). Thus, estimating the Gaussian graphical model is equivalent to estimating the precision matrix.

- A sparse GGM is equivalent to a sparse precision matrix.

Define the sparsity parameter:

$$s_{\Theta_0} = \mathrm{card}\left(\{(i,j) : [\Theta_0]_{i,j} \neq 0,\ i \neq j\}\right)$$


Sparse inverse covariance matrices and associated graphical models

Figure: Left: inverse correlation matrix. Right: associated graphical model (Wiesel et al. [2010]).


Prior Work

- Many more unknown parameters (d(d + 1)/2) than measurements (n).

- Sample covariance matrix $S_n = \frac{1}{n}\sum_{t=1}^{n} z_t z_t^T$ is a poor estimator of Σ:
  - Large eigenvalue spread in the high dimensional regime (Karoui [2008]).
  - Estimation of eigenvectors of the SCM becomes impossible if the ratio n/d is below a critical threshold (Paul [2007], Rao et al. [2008]).

- Regularize:
  - Parametric models: Toeplitz, AR, ARMA (Bickel and Levina [2008], Huang et al. [2006], Cai et al. [2012]).
  - Sparse structured (inverse) covariance: Graphical lasso (Yuan and Lin [2007]).
  - Kronecker structured covariance: Flip-Flop Kronecker covariance estimator (Werner et al. [2008]).


Kronecker product model for covariance matrix

Figure: A saturated model with an 18 × 18 covariance matrix has 18·(18+1)/2 = 171 unknown covariance parameters. A Kronecker product covariance model reduces the number of parameters to 6 + 21 = 27 unknown covariance parameters.


Sparse Kronecker product model for covariance matrix

Figure: A sparse Kronecker product covariance model reduces the number of parameters from 65 to 16 unknown covariance parameters.


Applications of KP Covariance

- geostatistics (Cressie [1993], Genton [2007])
- genomics (Yin and Li [2012])
- multi-task learning (Bonilla et al. [2008])
- face recognition (Zhang and Schneider [2010])
- recommendation systems (Allen and Tibshirani [2010])
- collaborative filtering (Yu et al. [2009])
- MIMO wireless communications (Werner and Jansson [2007])


Problem Formulation

- Available are n i.i.d. multivariate Gaussian observations $\{z_t\}_{t=1}^{n}$, where $z_t \in \mathbb{R}^{pq}$, having zero mean and covariance equal to

$$\Sigma = \underbrace{A_0}_{p \times p} \otimes \underbrace{B_0}_{q \times q} = \begin{bmatrix} [A_0]_{1,1} B_0 & \cdots & [A_0]_{1,p} B_0 \\ \vdots & \ddots & \vdots \\ [A_0]_{p,1} B_0 & \cdots & [A_0]_{p,p} B_0 \end{bmatrix}$$

where $A_0 \in \mathbb{S}^{p}_{++}$ and $B_0 \in \mathbb{S}^{q}_{++}$.

- Goal is to estimate the covariance matrix and its inverse Θ = Σ^{-1} (precision matrix).


Graphical Lasso (Yuan and Lin [2007])

Penalized negative log-likelihood function for Θ = Σ^{-1}:

$$J(\Theta) := \mathrm{tr}(\Theta S_n) - \log\det(\Theta) + \lambda |\Theta|_1 \tag{1}$$

where $S_n = \frac{1}{n}\sum_{t=1}^{n} z_t z_t^T$ is the sample covariance matrix (SCM). Minimizer $\hat{\Theta}_n \in \arg\min J(\Theta)$.

- Fast algorithms exist for minimizing (1) (Friedman et al. [2008], Hsieh et al. [2011]) with worst-case computational complexity of O(d⁴).

- High-dimensional MSE convergence rate (Rothman et al. [2008]):

$$\|\hat{\Theta}_n - \Theta_0\|_F^2 = O_P\left(\frac{(d + s_{\Theta_0}) \log d}{n}\right) \tag{2}$$

where $\lambda \asymp \sqrt{\log(d)/n}$.
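To make (1) concrete, here is a minimal sketch (not from the slides) using scikit-learn's GraphicalLasso, which solves a common variant of (1) that penalizes only the off-diagonal entries of Θ. The dimensions, the tridiagonal Θ0, and the choice alpha ≈ √(log(d)/n) are illustrative placeholders motivated by the rate in (2):

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

# Hypothetical setup: d = 20, sparse tridiagonal precision Theta0, n = 500 samples.
rng = np.random.default_rng(0)
d, n = 20, 500
Theta0 = np.eye(d) + 0.4 * (np.eye(d, k=1) + np.eye(d, k=-1))
Z = rng.multivariate_normal(np.zeros(d), np.linalg.inv(Theta0), size=n)

# alpha plays the role of lambda in (1); theory suggests lambda ~ sqrt(log(d)/n).
model = GraphicalLasso(alpha=np.sqrt(np.log(d) / n)).fit(Z)
err = np.linalg.norm(model.precision_ - Theta0, 'fro') ** 2   # squared error as in (2)
print(err)
```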


ML estimator of Kronecker structured covariance

Negative log-likelihood function when Θ has Kronecker structure Θ = X ⊗ Y:

$$J(X, Y) = \mathrm{tr}((X \otimes Y) S_n) - q \log\det(X) - p \log\det(Y) \tag{3}$$

Alternating minimization yields the Flip-Flop algorithm (Werner et al. [2008]), which generates updates of A = X^{-1}, B = Y^{-1}:

$$\underbrace{\hat{A}(B)}_{p \times p} = \frac{1}{q} \sum_{k,l=1}^{q} [B^{-1}]_{k,l}\, \tilde{S}_n(l, k) \tag{4}$$

$$\underbrace{\hat{B}(A)}_{q \times q} = \frac{1}{p} \sum_{i,j=1}^{p} [A^{-1}]_{i,j}\, S_n(j, i) \tag{5}$$

where $\tilde{S}_n = K_{p,q}^T S_n K_{p,q}$ and $K_{p,q}\,\mathrm{vec}(N) = \mathrm{vec}(N^T)$ for any p × q matrix N.
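A minimal NumPy sketch of updates (4)–(5), assuming the block-indexing convention above (the (j, i)-th q × q block of S_n drives the B update, and the permuted blocks drive the A update); the factors A0, B0 and the sample size in the usage example are hypothetical:

```python
import numpy as np

def flip_flop(S, p, q, sweeps=3):
    """Alternate updates (5) then (4), starting from A_init = I_p."""
    S4 = S.reshape(p, q, p, q)          # S4[i, k, j, l] = S[i*q + k, j*q + l]
    A, B = np.eye(p), np.eye(q)
    for _ in range(sweeps):
        B = np.einsum('ij,jkil->kl', np.linalg.inv(A), S4) / p   # update (5): B(A)
        A = np.einsum('kl,iljk->ij', np.linalg.inv(B), S4) / q   # update (4): A(B)
    return A, B

# usage sketch: n = 2000 samples from N(0, A0 (x) B0), with p = 3, q = 4
rng = np.random.default_rng(0)
p, q = 3, 4
A0 = np.diag([1.0, 2.0, 3.0])
B0 = 0.5 * np.eye(q) + 0.5 * np.ones((q, q))
Z = rng.multivariate_normal(np.zeros(p * q), np.kron(A0, B0), size=2000)
S = Z.T @ Z / Z.shape[0]
A_hat, B_hat = flip_flop(S, p, q)
print(np.linalg.norm(np.kron(A_hat, B_hat) - np.kron(A0, B0)))  # small
```

Note that A ⊗ B only identifies the factors up to a scalar exchanged between them, so only the Kronecker product itself is compared against the truth.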


Submatrix partitioning of SCM

Figure: SCM of size pq × pq with p = 4, q = 5. Blue: Ŝn(1, 2). Red: Ŝn(1, 1).


MSE Convergence Rate of FF (Tsiligkaridis et al. [2012])

Let $\hat{R}_{FF}(3) := \hat{A}(\hat{B}(A_{init})) \otimes \hat{B}(\hat{A}(\hat{B}(A_{init})))$ denote the 3-step (noniterative) version of the flip-flop algorithm (Werner et al. [2008]). More generally, let $\hat{R}_{FF}(k)$ denote the k-step version of the flip-flop algorithm.

Theorem
Let $A_0, B_0$, and $A_{init}$ have uniformly bounded spectra and define M = max(p, q, n). Assume p ≥ q ≥ 2 and p log M ≤ C''n for some finite constant C'' > 0. Finally, assume n ≥ p/q + 1. Then, for k ≥ 2 finite,

$$\|\hat{\Theta}_{FF}(k) - \Theta_0\|_F^2 = O_P\left(\frac{(p^2 + q^2) \log M}{n}\right) \tag{6}$$

as n → ∞.


KGlasso Algorithm

$$\min\ J_\lambda(X, Y) = J(X, Y) + \lambda_X |X|_1 + \lambda_Y |Y|_1 \tag{7}$$

where J(·, ·) is given in (3) and λ_X, λ_Y ≥ 0.

Algorithm 1 KGlasso (Tsiligkaridis et al. [2012, 2013a])

1: Input: S_n, p, q, n, λ_X > 0, λ_Y > 0
2: Output: Θ̂_KGlasso
3: Initialize A_init to be positive definite.
4: A ← A_init
5: repeat
6:   B ← (1/p) Σ_{i,j=1}^{p} [A^{-1}]_{i,j} S_n(j, i)
7:   Y ← arg min_{Y ∈ S^q_{++}} tr(YB) − log det(Y) + λ_Y |Y|_1
8:   A ← (1/q) Σ_{k,l=1}^{q} [B^{-1}]_{k,l} S̃_n(l, k)
9:   X ← arg min_{X ∈ S^p_{++}} tr(XA) − log det(X) + λ_X |X|_1
10: until convergence
11: Θ̂_KGlasso ← X ⊗ Y

Computational complexity: O(p⁴ + q⁴) (KGlasso) vs. O(p⁴q⁴) (Glasso).
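A sketch of Algorithm 1 built on the same block reshaping as the flip-flop sketch above, with lines 7 and 9 solved by scikit-learn's graphical_lasso (which penalizes only off-diagonal entries, a mild variant of |·|₁ in (7)); reading the B^{-1} in line 8 as the regularized estimate Y is my interpretation of the listing:

```python
import numpy as np
from sklearn.covariance import graphical_lasso

def kglasso(S, p, q, lam_x, lam_y, n_iter=5):
    """KGlasso sketch: flip-flop factor updates, each followed by a graphical
    lasso step (lines 6-9 of Algorithm 1)."""
    S4 = S.reshape(p, q, p, q)                  # same block convention as flip-flop
    A = np.eye(p)                               # A_init, positive definite
    for _ in range(n_iter):
        B = np.einsum('ij,jkil->kl', np.linalg.inv(A), S4) / p  # line 6
        _, Y = graphical_lasso(0.5 * (B + B.T), alpha=lam_y)    # line 7: Y ~ B^{-1}
        A = np.einsum('kl,iljk->ij', Y, S4) / q                 # line 8, B^{-1} -> Y
        _, X = graphical_lasso(0.5 * (A + A.T), alpha=lam_x)    # line 9: X ~ A^{-1}
        A = np.linalg.inv(X)                    # carry the regularized factor forward
    return np.kron(X, Y)                        # Theta_KGlasso (line 11)
```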


KGlasso Convergence Rate (Tsiligkaridis et al. [2012])

Define ΘKGlasso(k) as the output of the kth KGlasso iteration.

Theorem
Let $A_0, B_0, A_{init}$ have uniformly bounded spectra. Let M = max(p, q, n). Assume sparse X_0 and Y_0, i.e. $s_{X_0} = O(p)$, $s_{Y_0} = O(q)$. Assume max(p/q, q/p) log M = o(n). If in the KGlasso algorithm

$$\lambda_X^{(k)} \asymp \left(\tfrac{1}{\sqrt{p}} + \tfrac{1}{\sqrt{q}}\right) q \sqrt{\tfrac{\log M}{n}}, \qquad \lambda_Y^{(k')} \asymp \left(\tfrac{1}{\sqrt{p}} + \tfrac{1}{\sqrt{q}}\right) p \sqrt{\tfrac{\log M}{n}}$$

for all k, k' ≥ 1, then

$$\|\hat{\Theta}_{KGlasso}(k) - \Theta_0\|_F^2 = O_P\left(\frac{(p + q) \log M}{n}\right) \tag{8}$$

as n → ∞.

Assume p ∼ q. Comparing the KGlasso convergence rate (p + q)/n in (8) with others:

- SCM rate: p²q²/n. Worse by 3 orders of magnitude.
- FF rate: (p² + q²)/n. Worse by 1 order of magnitude.
- Glasso rate: (pq + s_{Θ_0})/n. Worse by 1 order of magnitude.


Large Sample MSE Convergence

We considered X_0 and Y_0 to be large sparse matrices of dimension p = q = 100, yielding a precision matrix Θ_0 of dimension 10,000 × 10,000. This dimension was too large for implementation of Glasso, even using the state-of-the-art algorithm (Hsieh et al. [2011]). However, we can run KGlasso and FF and compare performances since they have considerably less computational burden.

Figure: Sparse Kronecker matrix representation. Left panel: left Kronecker factor. Right panel: right Kronecker factor. The sparsity factor for both precision matrices is approximately 200.


Large Sample MSE Convergence (Cont.)

[Plot: normalized Frobenius error in the inverse covariance matrix vs. sample size n (log-log axes); max number of iterations = 100, 40 trials, (p, f) = (100, 100); curves: FF, KGLasso, FF/Thres.]

Figure: Normalized RMSE performance for the precision matrix as a function of sample size n. For n = 10, there is a 72% RMSE reduction from the FF to the KGLasso solution and a 70% RMSE reduction from FF/Thres to KGLasso.


Kronecker PCA

[Diagram: projections proj_V(S) of the SCM S onto Kronecker subspaces V1 = A1 ⊗ B1 and V2 = A2 ⊗ B2.]


Introduction

- Represent covariance as a Sum of Kronecker Products (SKP) of two lower dimensional factor matrices:

$$\Sigma_0 = \sum_{\gamma=1}^{r} A_{0,\gamma} \otimes B_{0,\gamma} \tag{9}$$

where $\{A_{0,\gamma}\}$ are p × p linearly independent matrices and $\{B_{0,\gamma}\}$ are q × q linearly independent matrices.

- Note 1 ≤ r ≤ r_0 = min(p², q²); we refer to r as the separation rank.


Introduction

Applications of the Sum of Kronecker Products (SKP) model (9):

- Spatiotemporal MEG/EEG covariance modeling (de Munck et al. [2002, 2004], Bijma et al. [2005], Jun et al. [2006])
- Synthetic Aperture Radar (SAR) data analysis (Tebaldini [2009], Rucci et al. [2010])

Van Loan and Pitsianis [1993]:

- Any pq × pq matrix Σ_0 can be written as an orthogonal expansion of Kronecker products of the form (9).
- Low separation rank is equivalent to low rank in a permuted space defined by the reshaping operator R(·).


Low separation rank ⇔ Low rank in permuted space

[Figure panels: Original Covariance Σ0 (top); Permuted Covariance R0 (bottom).]

Figure: Original (top) and permuted (bottom) covariance matrix. The original covariance is Σ0 = A0 ⊗ B0, where A0 is a 10 × 10 Toeplitz matrix and B0 is a 20 × 20 unstructured p.d. matrix. Note that the permutation operator R maps a symmetric p.s.d. matrix Σ0 to a non-symmetric rank-1 matrix R0 = R(Σ0).
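A quick numerical check of this rank-one fact (the factor matrices below are stand-ins with the same shapes as in the figure, not the figure's actual data):

```python
import numpy as np

# Stand-in factors: A0 a 10 x 10 Toeplitz matrix, B0 a 20 x 20 unstructured p.d. matrix.
rng = np.random.default_rng(0)
p, q = 10, 20
A0 = 0.5 ** np.abs(np.arange(p)[:, None] - np.arange(p)[None, :])
G = rng.standard_normal((q, q))
B0 = G @ G.T + q * np.eye(q)

# R: row (i, j) of the rearranged matrix holds the (i, j)-th q x q block of Sigma0.
Sigma0 = np.kron(A0, B0)
R0 = Sigma0.reshape(p, q, p, q).transpose(0, 2, 1, 3).reshape(p * p, q * q)
print(np.linalg.matrix_rank(R0))   # prints 1: R(A0 (x) B0) = vec(A0) vec(B0)^T
```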


Permuted rank-penalized least-squares (PRLS) (Tsiligkaridis and Hero [2013a,b])

1. Map the SCM to a different linear space:

$$\mathbf{R}_n = \mathcal{R}(S_n) \in \mathbb{R}^{p^2 \times q^2}$$

2. Solve a least-squares problem with nuclear norm penalization:

$$\hat{\mathbf{R}}_n^{\lambda} \in \arg\min_{\mathbf{R} \in \mathbb{R}^{p^2 \times q^2}} \|\mathbf{R}_n - \mathbf{R}\|_F^2 + \lambda \|\mathbf{R}\|_* \tag{10}$$

3. Map back to the original space:

$$\hat{\Sigma}_n^{\lambda} = \mathcal{R}^{-1}(\hat{\mathbf{R}}_n^{\lambda}) \in \mathbb{R}^{pq \times pq}$$

where λ ≥ 0 is a regularization parameter.
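A minimal sketch of the three PRLS steps: the nuclear-norm problem (10) has a closed-form solution given by soft-thresholding the singular values of R(S_n) at λ/2 (singular value thresholding), with R implemented as the Van Loan–Pitsianis block rearrangement:

```python
import numpy as np

def rearrange(S, p, q):
    """Van Loan-Pitsianis rearrangement R: pq x pq -> p^2 x q^2, so that
    R(A (x) B) = vec(A) vec(B)^T is rank one."""
    return S.reshape(p, q, p, q).transpose(0, 2, 1, 3).reshape(p * p, q * q)

def rearrange_inv(R, p, q):
    """Inverse map R^{-1}: p^2 x q^2 -> pq x pq."""
    return R.reshape(p, p, q, q).transpose(0, 2, 1, 3).reshape(p * q, p * q)

def prls(S, p, q, lam):
    """PRLS: solve (10) by soft-thresholding singular values of R(S_n) at lam/2."""
    U, s, Vt = np.linalg.svd(rearrange(S, p, q), full_matrices=False)
    s = np.maximum(s - lam / 2.0, 0.0)
    return rearrange_inv((U * s) @ Vt, p, q)
```

For λ = 0 this returns the SCM unchanged; increasing λ shrinks toward a low separation rank estimate.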


Properties of PRLS Estimator (Tsiligkaridis and Hero [2013b])

Theorem

- The solution $\hat{\Sigma}_n^{\lambda}$ is symmetric.
- If n ≥ pq, then the solution $\hat{\Sigma}_n^{\lambda}$ is positive definite with probability 1.

Theorem
Define M = max(p, q, n). Set

$$\lambda = \lambda_n = 2 C_0 t^{1 - 2\varepsilon'} \max\left\{ \frac{p^2 + q^2 + \log M}{n},\ \sqrt{\frac{p^2 + q^2 + \log M}{n}} \right\}$$

for t > 0 large enough. Then, with probability at least $1 - 2M^{-t/4C}$:

$$\|\hat{\Sigma}_n^{\lambda} - \Sigma_0\|_F^2 \leq \inf_{\mathbf{R}:\, \mathrm{rank}(\mathbf{R}) \leq r} \|\mathbf{R} - \mathbf{R}_0\|_F^2 + C' r \max\left\{ \left(\frac{p^2 + q^2 + \log M}{n}\right)^2,\ \frac{p^2 + q^2 + \log M}{n} \right\} \tag{11}$$

for some absolute constant C' > 0.


Setup

- NCEP Dataset: Daily average wind speeds collected at q = 144 × 73 weather stations spread throughout the world (Kalnay et al. [1996], Tsiligkaridis and Hero [2013b]).
- Considered a 10 × 10 grid of stations, corresponding to latitude range 90°N–67.5°N and longitude range 0°E–22.5°E.
- Prediction time lag p − 1 = 7, full dimension d = pq = 800, number of training samples n = 228.
- Training period: 2003–2007. Testing period: 2008–2012.


Kronecker product decomposition: PRLS

[Figure panels: SCM; PRLS, r_eff = 2; KP Left Factor: Temporal; KP Right Factor: Spatial.]

Figure: Sample covariance matrix (SCM) (top left), PRLS covariance estimate (top right), temporal Kronecker factor for the first KP component (bottom left), and spatial Kronecker factor for the first KP component (bottom right).


Kronecker Spectrum

Figure: Kronecker spectrum of the SCM (left) and eigenspectrum of the SCM (right). The KP spectrum is more compact than the eigenspectrum.


RMSE performance gains

[Plot: root mean square error vs. station index (1–100); curves: SCM, PRLS, Reg. Tyler.]

Figure: RMSE prediction performance across q stations for linear estimators using the SCM (blue), PRLS (green) and regularized Tyler (magenta).

- Average gain of PRLS over SCM = 4.64 dB
- Average gain of Reg. Tyler over SCM = 3.41 dB


Centralized Collaborative 20 Questions



Motivation

[Diagram: a fusion center queries two players with A_n(1), A_n(2) about target X* and receives responses Y_{n+1}(1), Y_{n+1}(2).]

- What is the intrinsic value of adding a human-in-the-loop to an autonomous learning machine?

- Insight into human-aided autonomous sensing for estimating an unknown target location or identifying a target.


Motivation

Figure: PTZ IP camera. Source: en.wikipedia.org/wiki/Pan-tilt-zoom camera

- Sensor systems are becoming more flexible, e.g. pan-tilt-zoom cameras: where to look? Different sensor waveforms & observation modes? How to control these aspects for a common localization objective?


Prior Work & Applications

Ask a sequence of questions and refine the posterior distribution of the target's location given the responses.

- Probabilistic Bisection Algorithm (PBA) first introduced in Horstein [1963].
- Discretized PBA (Burnashev and Zigangirov [1974]).
- Noisy Binary Search (Karp and Kleinberg [2007]).
- Convergence rate for the BZ algorithm (Castro and Nowak [2007]).
- Noisy 20 questions game: PBA shown to be optimal under the minimum expected entropy criterion (Jedynak et al. [2012]).
- Convergence rate for PBA (Waeber et al. [2013]).

Applications of PBA: stochastic root finding, combinatorial optimization, road tracking, electron microscopy.


Single player setting

- Jedynak et al. [2012] considers 20 questions with noise, where a noisy oracle is queried whether a target X* lies in a set $A_n \subset \mathbb{R}^d$.
- Starting with a prior distribution p_0(·) on the target's location, minimize the expected entropy of the posterior distribution:

$$\inf_\pi \mathbb{E}^\pi[H(p_N)] \tag{12}$$

where π = (π_0, π_1, ...) denotes the policy. The posterior mean/median of p_N(·) is the target location estimate.

- Jedynak et al. [2012] shows the bisection policy is optimal under the minimum entropy criterion. Assuming the noisy channel is a BSC, optimal policies are characterized by:

$$P_n(A_n) := \int_{A_n} p_n(x)\, dx = 1/2 \tag{13}$$
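A one-dimensional sketch (not from the slides) of the bisection policy (13) with a discretized posterior and a BSC(ε) oracle; the grid size, ε, and target below are illustrative:

```python
import numpy as np

def pba(x_star, eps, n_iter, m=1000, seed=0):
    """Probabilistic bisection on [0,1]: query the posterior median per (13),
    pass the true answer through a BSC(eps), and Bayes-update the posterior."""
    rng = np.random.default_rng(seed)
    grid = (np.arange(m) + 0.5) / m              # cell midpoints
    post = np.full(m, 1.0 / m)                   # uniform prior p_0
    for _ in range(n_iter):
        med = grid[np.searchsorted(np.cumsum(post), 0.5)]   # posterior median
        z = x_star <= med                        # true answer to "X* in [0, med]?"
        y = z if rng.random() > eps else not z   # noisy response through BSC(eps)
        like = np.where((grid <= med) == y, 1 - eps, eps)   # likelihood of y
        post *= like
        post /= post.sum()
    return float(grid @ post)                    # posterior-mean estimate

print(pba(x_star=0.75, eps=0.3, n_iter=200))     # close to 0.75
```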


Noisy 20 Questions with Collaborative Players: Model (Tsiligkaridis et al. [2013c])

- M collaborating players can be asked questions at each time instant.
- mth player's query at time n: "does X* lie in the region $A_n^{(m)} \subset \mathbb{R}^d$?"
- The query is the binary variable $Z_n^{(m)} = I(X^* \in A_n^{(m)}) \in \{0, 1\}$, to which the player provides a noisy response $Y_{n+1}^{(m)} \in \{0, 1\}$.
- Define the M-tuples $Y_{n+1} = (Y_{n+1}^{(1)}, \ldots, Y_{n+1}^{(M)})$ and $A_n = \{A_n^{(1)}, \ldots, A_n^{(M)}\}$.

Assumption
Players' responses are conditionally independent:

$$P(Y_{n+1} = y \,|\, A_n, X^* = x, \mathcal{F}_n) = \prod_{m=1}^{M} P(Y_{n+1}^{(m)} = y^{(m)} \,|\, A_n^{(m)}, X^* = x, \mathcal{F}_n) \tag{14}$$

$$P(Y_{n+1}^{(m)} = y^{(m)} \,|\, A_n^{(m)}, X^* = x, \mathcal{F}_n) = \begin{cases} f_1^{(m)}(y^{(m)} | \epsilon_m), & x \in A_n^{(m)} \\ f_0^{(m)}(y^{(m)} | \epsilon_m), & x \notin A_n^{(m)} \end{cases} \tag{15}$$

$$f_j^{(m)}(y^{(m)} | \epsilon_m) = \begin{cases} 1 - \epsilon_m, & y^{(m)} = j \\ \epsilon_m, & y^{(m)} = 1 - j \end{cases} \tag{16}$$


Optimal Joint Query Design: Setup

- The joint controller chooses M queries $A_n^{(m)}$ at time n. Define the set of subsets of $\mathbb{R}^d$:

$$\gamma(A^{(1)}, \ldots, A^{(M)}) = \left\{ \bigcap_{m=1}^{M} (A^{(m)})^{i_m} : i_m \in \{0, 1\} \right\}$$

where $(A)^0 := A^c$ and $(A)^1 := A$. The cardinality of this set of subsets is $2^M$, and these subsets partition $\mathbb{R}^d$.

- Define the density parameterized by $A_n, p_n, i_1, \ldots, i_M$:

$$g_{i_1:i_M}(y^{(1)}, \ldots, y^{(M)} \,|\, A_n, \mathcal{F}_n) := \prod_{m=1}^{M} f_{i_m}^{(m)}(y^{(m)} \,|\, A_n^{(m)}, \mathcal{F}_n)$$

where $i_j \in \{0, 1\}$.


Sequential Query Design

[Diagram: for m = 1, ..., M, Controller m issues a query to Player m and the posterior is updated before moving to Controller m + 1.]

- Query region $A_{n_t}$ is chosen at time $n_t = (n, t)$, where n = 0, 1, ... indexes over cycles and t = 0, ..., M − 1 indexes within cycles.

- Nested sequence of sigma-algebras $\mathcal{G}_{n,t}$, with $\mathcal{G}_{n,t} \subset \mathcal{G}_{n+i,t+j}$ for all i ≥ 0 and j ∈ {0, ..., M − 1 − t}, generated by the sequence of queries and the players' responses.


Optimal Joint Query Design

[Diagram: a joint controller at the fusion center issues a batch of queries to Players 1, ..., M.]

- The joint controller chooses a batch of M queries $\{A_n^{(m)}\}$ at time n.
- As in sequential query design, joint queries are chosen based on the information accumulated at the controller. Since the full batch of joint queries is determined at the start of the nth cycle, the joint controller only has access to a coarser filtration $\mathcal{F}_n$, with $\mathcal{F}_{n-1} \subset \mathcal{F}_n$, as compared to $\mathcal{G}_{n,t}$.


Equivalence Theorem (Tsiligkaridis et al. [2013c])

Theorem (Equivalence, Known Error Probabilities)

1. The expected entropy loss under an optimal joint query design is the same as under the greedy sequential query design. This loss is given by:

$$C = \sum_{m=1}^{M} C(\epsilon_m) = \sum_{m=1}^{M} (1 - h_b(\epsilon_m)) \tag{17}$$

where $h_b(\epsilon_m) = -\epsilon_m \log(\epsilon_m) - (1 - \epsilon_m)\log(1 - \epsilon_m)$ is the binary entropy function.

2. All jointly optimal control laws equalize the posterior probability over the dyadic partitions induced by $A_n = \{A_n^{(1)}, \ldots, A_n^{(M)}\}$:

$$P_n(R) = \int_R p_n(x)\, dx = 2^{-M}, \quad \forall R \in \gamma(A_n). \tag{18}$$


Consequences of Equivalence Theorem

- The optimal policy can be implemented using the simpler sequential query design.
- Despite the fact that all players are conditionally independent, the joint policy does not decouple into separate single player optimal policies (analogous to the non-separability of the optimal vector quantizer in source coding, even for independent sources; Gersho and Gray [1992]).
- Optimal queries must be overlapping, i.e., $\bigcap_{m=1}^{M} A_n^{(m)} \neq \emptyset$, but not identical.
- The optimal query $A_n$ is not unique.


Example of optimal queries for M = 2

Figure: Jointly optimal queries under a uniform prior.
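To make (18) concrete, a small numerical check (not from the slides) that one particular pair of overlapping queries is jointly optimal for M = 2 under a uniform prior on [0, 1]; since the optimal query is not unique, the intervals below are just one valid choice:

```python
import numpy as np

# Check of (18) for M = 2 under a uniform prior on [0, 1]: the overlapping
# (but not identical) queries A1 = [0, 1/2] and A2 = [1/4, 3/4] give each of
# the 2^M = 4 cells of the induced partition gamma(A_n) mass 2^{-M} = 1/4.
x = np.linspace(0.0, 1.0, 1_000_001)
in_a1 = x <= 0.5
in_a2 = (x >= 0.25) & (x <= 0.75)
for i1 in (True, False):
    for i2 in (True, False):
        cell = (in_a1 == i1) & (in_a2 == i2)
        print(int(i1), int(i2), round(cell.mean(), 3))   # each ~ 0.25
```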


Lower Bounds on MSE via Entropy Loss

Theorem (Lower Bound on MSE)
Assume the entropy H(p_0) is finite. Then, the MSE of the joint or sequential query policies satisfies:

$$\frac{K}{2\pi e d} \exp\left(-\frac{2nC}{d}\right) \leq \mathbb{E}[\| X^* - \hat{X}_n \|_2^2] \tag{19}$$

where $K = e^{2H(p_0)}$ and $\hat{X}_n$ is the posterior mean. The expected entropy loss per iteration is $C = \sum_m C(\epsilon_m)$.


Upper Bounds on MSE: Setup

- Performance analysis of the PBA is difficult, primarily due to the continuous nature of the posterior (Castro and Nowak [2007]):

  "The probabilistic bisection algorithm seems to work extremely well in practice, but it is hard to analyze and there are few theoretical guarantees for it, especially pertaining error rates of convergence."

- A discretized version of the PBA was proposed in Burnashev and Zigangirov [1974] (the BZ algorithm), which imposes a piecewise constant structure on the posterior (see Castro and Nowak [2007]; App. A in Castro [2007]).
- Recently, an answer for the continuous PBA was given in Waeber et al. [2013] for one-dimensional target search.


Upper Bounds on MSE: Setup

- For simplicity, assume the target location is constrained to the unit interval X = [0, 1].
- A step size ∆ > 0 is defined such that $\Delta^{-1} \in \mathbb{N}$, and the posterior after j iterations is $p_j: X \to \mathbb{R}$, given by

$$p_j(x) = \frac{1}{\Delta} \sum_{i=1}^{\Delta^{-1}} a_i(j)\, I(x \in I_i)$$

where $I_1 = [0, \Delta]$ and $I_i = ((i-1)\Delta, i\Delta]$ for $i = 2, \ldots, \Delta^{-1}$. The initial posterior is $a_i(0) = \Delta$. The posterior is characterized completely by the pseudo-posterior $a(j) = [a_1(j), \ldots, a_{\Delta^{-1}}(j)]$, which is updated at each iteration via Bayes' rule.


Upper Bounds on MSE

Theorem (Upper Bound on MSE)
Consider the sequential bisection algorithm for M players in one dimension, where each bisection is implemented using the BZ algorithm. Then, we have:

$$P(|X^* - \hat{X}_n| > \Delta) \leq \left(\frac{1}{\Delta} - 1\right) \exp(-nC)$$

$$\mathbb{E}[(X^* - \hat{X}_n)^2] \leq (2^{-2/3} + 2^{1/3}) \exp\left(-\frac{2}{3} nC\right) \tag{20}$$

where $C = \sum_{m=1}^{M} C(\epsilon_m)$, $C(\epsilon) = 1/2 - \sqrt{\epsilon(1-\epsilon)}$.


Upper Bounds on MSE: Human-in-the-loop

- Player 1 (machine) has constant error probability $\epsilon_1 \in (0, 1/2)$.
- Player 2 (human) has error probability depending on the target localization error:

$$P(Y_{n+1}^{(2)} = y^{(2)} \,|\, Z_n^{(2)} = 1 - y^{(2)}) = \frac{1}{2} - \min(\delta_0,\ \mu |X^* - \hat{X}_n|^{\kappa - 1}) \tag{21}$$

- κ = human "resolution" (κ > 1)
- δ_0 = reliability parameter (0 < δ_0 < µ < 1/2)
- MSE upper bound for the "player 1 + human" system:

$$\mathbb{E}[(X^* - \hat{X}_n)^2] \leq e^{-\frac{2}{3} n C(\epsilon_1)} \times \left[ 2^{-2/3} + 2^{1/3} \exp\left( -\frac{\mu^2}{50} \left(\frac{3 \cdot 2^{-1/3}}{4}\right)^{2\kappa - 2} n\, e^{-n C(\epsilon_1) \frac{2\kappa - 2}{3}} \right) \right] \tag{22}$$

which is no greater than the "player 1" MSE bound.

- Both bounds converge to zero at the same rate as n → ∞.
- Human gain ratio (HGR) = ratio of the MSE upper bounds associated with "player 1" and "player 1 + human":

$$R_n(\kappa) = \frac{2^{-2/3} + 2^{1/3}}{2^{-2/3} + 2^{1/3} \exp\left( -\frac{\mu^2}{50} \left(\frac{3 \cdot 2^{-1/3}}{4}\right)^{2\kappa - 2} n\, e^{-n C(\epsilon_1) \frac{2\kappa - 2}{3}} \right)} \tag{23}$$


Upper Bounds on MSE: Human-Gain Ratio

- The larger ε_1 is, the larger is the HGR.
- As κ decreases to 1, the ratio increases, meaning that the human becomes more like the machine and helps more.

[Plot: MSE human gain ratio vs. iteration n for ε_1 = 0.4; curves: optimized and predicted bounds for κ = 1.5, 2, 2.5.]

Figure: Human gain ratio as a function of κ. The human provides the largest gain in the first few iterations, and its value of information decreases as n → ∞. The predictions match the optimized bounds well.


Simulation (Known error probabilities): Initial Distribution


Figure: The initial distribution is a mixture of three Gaussians with means 0.25, 0.5 and 0.75, and variances 0.02, 0.05 and 0.08, respectively. The target was set to be the center of the mode at X* = 0.75, the one with the largest variance.


Simulation (Known error probabilities): MSE Decay

[Plot: MSE vs. iteration n for "p1" and "p1 + human" with ε_1 ∈ {0.1, 0.2, 0.3, 0.4}; κ = 1.5, δ_0 = 0.4, µ = 0.42.]

Figure: Monte Carlo simulation of the MSE performance of the sequential estimator as a function of iteration and ε_1 ∈ (0, 1/2). 2000 Monte Carlo trials were used. The human parameters were set to κ = 1.5, µ = 0.42, δ_0 = 0.4; the length of the pseudo-posterior was ∆^{-1} = 1618. The target was set to X* = 0.75.


Decentralized Collaborative 20 Questions



Motivation

Consider a collection of agents in a network with the objective of localizing a target collectively.

- What is the value of collaboration when there is no central authority?
- Does local in-network querying and processing lead to a global equilibrium? Deterministic or random limit? Unbiasedness?


Intractability of fully Bayesian methodology

- Limited observability (observations of an agent are not observable by others) & lack of global knowledge of observation statistics.
- If agents have only partial information on the network structure and the probability distribution of the signals observed by other agents, the Bayesian approach becomes more complicated, because agents would need to form and update beliefs on the states of the world in addition to the network structure and the rest of the agents' signal structures.
- Even if the network structure is known, agents would still need to update beliefs on the information of every other agent in the network, given only the neighbors' beliefs at each iteration.


Prior Work on Distributed Averaging

Consensus, gossip algorithms, distributed averaging: messages are distributed around the network through local processing.

- averaging under randomized gossip (Boyd et al. [2006])
- geographic gossip (Dimakis et al. [2006])
- randomized path averaging (Benezit et al. [2010])
- gossip algorithms for sensor networks (Dimakis et al. [2010])
- randomized gossip broadcast algorithms for consensus (Aysal et al. [2009])
- gossip distributed estimation for linear parameter estimation (Kar and Moura [2011])
- consensus for the wireless medium (Nokleby et al. [2013])

Applications: distributed optimization (Tsitsiklis [1984], Tsitsiklis et al. [1986]), load-balancing (Cybenko [1989]), distributed detection (Saligrama et al. [2006]).

Our work differs because we consider new information injected into the dynamical system described by averaging, and because we consider controlled observations.


Prior Work on Social Learning

Dynamic model of opinion formation.

- opinion formation model (DeGroot [1974])
- convergence of dynamics generated by a non-Bayesian decentralized estimation scheme (Jadbabaie et al. [2012])
- rate of convergence analysis (Molavi et al. [2013])

Our work differs because we consider a continuous-valued target space and controlled observations.


Prior Work on Computerized Adaptive Testing

Given a current estimate of proficiency, how should the next test item be chosen?

- dynamic selection of test items via item-response theory & a maximum information or maximum expected precision criterion (Wainer [2000], Owen [1975])

Our work differs because we consider continuous-valued query regions with no practical constraints necessary, and a different objective function.


Prior Work on Active Stochastic Search/20 Questions

Active querying for sequential estimation.

- single-player 20 questions for target localization (Jedynak et al. [2012])
- convergence rate for a discretized version of single-player 20 questions (Castro and Nowak [2007])
- convergence rate for continuous-space single-player PBA (Waeber et al. [2013])
- (centralized) multi-player 20 questions for target localization (Tsiligkaridis et al. [2013b])

Our work differs because we consider intermediate local belief sharing between agents after each local bisection and Bayesian update (entropy is no longer monotonically decreasing for each agent!). Also, each agent incorporates the beliefs of its neighbors in a way that is agnostic of the neighbors' error probabilities.


Notation

- X* ∈ [0, 1] = true target location
- N = {1, ..., M} = agent set of the network
- G = (N, E) = directed graph capturing agent interactions
- N_i = {j ∈ N : (j, i) ∈ E} = local neighborhood of the ith agent
- p_{i,t}(·) = belief of the ith agent at time t


Decentralized Estimation

Algorithm 2 Decentralized Estimation Algorithm

1: Input: G = (N, E), A = {a_{i,j} : (i, j) ∈ N × N}, {ε_i : i ∈ N}
2: Output: {X̂_{i,t}, X̃_{i,t} : i ∈ N}
3: Initialize p_{i,0}(·) to be positive everywhere.
4: repeat
5:   For each agent i ∈ N:
6:     Bisect the posterior density at the median: X̃_{i,t} = F_{i,t}^{-1}(1/2).
7:     Obtain a (noisy) binary response y_{i,t+1} ∈ {0, 1}.
8:     Belief update:

$$p_{i,t+1}(x) = a_{i,i}\, \frac{p_{i,t}(x)\, l_i(y_{i,t+1} \,|\, x, \tilde{X}_{i,t})}{Z_{i,t}(y_{i,t+1})} + \sum_{j \in N_i} a_{i,j}\, p_{j,t}(x), \quad x \in \mathcal{X} \tag{24}$$

where the observation p.m.f. is:

$$l_i(y \,|\, x, \tilde{X}_{i,t}) = f_1^{(i)}(y)\, I(x \leq \tilde{X}_{i,t}) + f_0^{(i)}(y)\, I(x > \tilde{X}_{i,t}), \quad y \in \mathcal{Y} \tag{25}$$

and $f_1^{(i)}(y) = (1 - \epsilon_i)^{I(y=1)} \epsilon_i^{I(y=0)}$, $f_0^{(i)}(y) = 1 - f_1^{(i)}(y)$.

9:     Calculate the target estimate: $\hat{X}_{i,t} = \int_{\mathcal{X}} x\, p_{i,t}(x)\, dx$.
10: until convergence
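A sketch of Algorithm 2 on a discretized belief grid; the network weights, error probabilities, and grid size below are illustrative, and the update implements (24)–(25) directly:

```python
import numpy as np

def decentralized_20q(x_star, A, eps, n_iter, m=2000, seed=0):
    """Each agent i: query at its belief median (line 6), get a BSC(eps_i)
    response (line 7), then apply the belief update (24)."""
    rng = np.random.default_rng(seed)
    M = len(eps)
    grid = (np.arange(m) + 0.5) / m
    P = np.full((M, m), 1.0 / m)                 # beliefs p_{i,0}, positive everywhere
    for _ in range(n_iter):
        P_new = np.empty_like(P)
        for i in range(M):
            med = grid[np.searchsorted(np.cumsum(P[i]), 0.5)]     # median bisection
            z = x_star <= med
            y = z if rng.random() > eps[i] else not z             # noisy response
            like = np.where((grid <= med) == y, 1 - eps[i], eps[i])   # (25)
            bayes = P[i] * like
            bayes /= bayes.sum()                                  # divide by Z_{i,t}
            P_new[i] = A[i, i] * bayes + (A[i] @ P) - A[i, i] * P[i]  # (24)
        P = P_new
    return P @ grid                              # estimates X-hat_{i,t}, one per agent

# usage: 3 agents on a directed cycle; rows of A sum to one, a_{i,i} > 0
A = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5]])
print(decentralized_20q(0.75, A, eps=[0.1, 0.3, 0.3], n_iter=300))
```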


Assumptions

- (Conditional Independence) Assume conditional independence:

$$P(Y_{t+1} = y \,|\, \mathcal{F}_t) = \prod_{i=1}^{M} P(Y_{i,t+1} = y_i \,|\, \mathcal{F}_t) \tag{26}$$

and each player's response is governed by:

$$l_i(y_i \,|\, x, A_{i,t}) := P(Y_{i,t+1} = y_i \,|\, A_{i,t}, X^* = x) = \begin{cases} f_1^{(i)}(y_i), & x \in A_{i,t} \\ f_0^{(i)}(y_i), & x \notin A_{i,t} \end{cases} \tag{27}$$

- (Memoryless Binary Symmetric Channels) Model players' responses as independent BSCs with crossover probabilities $\epsilon_i \in (0, 1/2)$:

$$f_z^{(i)}(y_i) = \begin{cases} 1 - \epsilon_i, & y_i = z \\ \epsilon_i, & y_i \neq z \end{cases} \quad \text{for } i = 1, \ldots, M,\ z = 0, 1.$$

- (Strong Connectivity & Positive Self-reliances) Assume that the network is strongly connected and all self-reliances $a_{i,i}$ are strictly positive.


Global Convergence Theory

Theorem (Asymptotic Agreement/Consensus)
Consider Algorithm 2. Let B = [0, b] ∈ B([0, 1]). Then, consensus of the agents' beliefs is asymptotically achieved across the network:

$$V_t(B) = \max_i P_{i,t}(B) - \min_i P_{i,t}(B) \xrightarrow{p} 0$$

as t → ∞.

Theorem (Convergence of Beliefs to a Deterministic Limit & Consistency)
Consider Algorithm 2. Let B = [0, b] ∈ B([0, 1]). Then, we have:

1. For each i ∈ N:

$$F_{i,t}(b) = P_{i,t}(B) \xrightarrow{p} F_\infty(b) = \begin{cases} 0, & b < X^* \\ 1, & b > X^* \end{cases}$$

2. For all i ∈ N:

$$\hat{X}_{i,t} := \int_{0}^{1} x\, p_{i,t}(x)\, dx \xrightarrow{p} X^*$$


Simulation: Three network topologies

a) Fully connected graph, b) Cyclic graph, c) Star graph.


MSE Performance, ε_i = 0.4, ∀i

[Plots, one row per topology: MSE (log scale), δ-probability mass, and entropy vs. iteration; curves labeled I (avg/min/max) and A (avg/min/max).]


MSE Performance, ε_1 = 0.05, ε_i = 0.45, ∀i ≠ 1

[Plots, one row per topology: MSE (log scale), δ-probability mass, and entropy vs. iteration; curves labeled I (avg/min/max), A (avg/min/max), and Centralized.]


Main Contributions

1. Kronecker Graphical Lasso
   - Sparse covariance estimation algorithm (KGlasso) introduced for the high-dimensional setting with Kronecker product structure.
   - High-dimensional MSE convergence rate analysis.
   - Analysis prescribes selection of the regularization parameters.

2. Covariance Estimation via Kronecker Product Expansions
   - Scalable covariance estimation algorithm (PRLS) introduced for the high-dimensional setting.
   - Tradeoff between approximation error and estimation error.
   - High-dimensional MSE convergence rate analysis.
   - Analysis prescribes selection of the regularization parameter.


Main Contributions

3. Centralized Collaborative 20 Questions
   - Introduced a model for centralized collaborative 20 questions.
   - Characterized optimal policies & proved an equivalence theorem that simplifies policy implementation.
   - Incorporated a human-in-the-loop by treating the human as a collaborative player.
   - Linked information theoretic gains to MSE convergence rates.

4. Decentralized Collaborative 20 Questions
   - Introduced a model for decentralized collaborative 20 questions.
   - Proved consensus of agents' beliefs & global consistency of the decentralized estimation algorithm.


Thank you!

References

G. I. Allen and R. Tibshirani. Transposable regularized covariance models with an application to missing data imputation. The Annals of Applied Statistics, 4(2):764–790, 2010.

T. C. Aysal, M. E. Yildiz, A. D. Sarwate, and A. Scaglione. Broadcast gossip algorithms for consensus. IEEE Transactions on Signal Processing, 57(7), July 2009.

F. Benezit, A. Dimakis, P. Thiran, and M. Vetterli. Order-optimal consensus through randomized path averaging. IEEE Transactions on Information Theory, 56(10):5150–5167, October 2010.

P. Bickel and E. Levina. Covariance regularization by thresholding. Annals of Statistics, 36(6):2577–2604, 2008.

F. Bijma, J. de Munck, and R. Heethaar. The spatiotemporal MEG covariance matrix modeled as a sum of Kronecker products. NeuroImage, 27:402–415, 2005.

E. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems, pages 153–160, 2008.

S. Boyd, A. Ghosh, B. Prabhakar, and D. Shah. Randomized gossip algorithms. IEEE Transactions on Information Theory, 52(6):2508–2530, June 2006.

Page 73: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

M. V. Burnashev and K. Sh. Zigangirov. An interval estimation problemfor controlled observations. Problems in Information Transmission, 10:223–231, 1974.

T. Tony Cai, Z. Ren, and H. Zhou. Optimal rates of convergence forestimating toeplitz covariance matrices. Probability Theory andRelated Fields, March 2012.

R. Castro. Active Learning and Adaptive Sampling for Non-parametricInference. PhD thesis, Rice University, August 2007.

R. Castro and R. Nowak. Active learning and sampling. In Foundationsand Applications of Sensor Management. Springer, 2007.

N. Cressie. Statistics for Spatial Data. Wiley, New York, 1993.

G. Cybenko. Dynamic load balancing for distributed memorymultiprocessors. Journal of Parallel and Distributed Computing, 7(2):279–301, 1989.

J. C. de Munck, H. M. Huizenga, L. J. Waldorp, and R. M. Heethaar.Estimating stationary dipoles from meg/eeg data contaminated withspatially and temporally correlated background noise. IEEETransactions on Signal Processing, 50(7), July 2002.

J. C. de Munck, F. Bijma, P. Gaura, C. A. Sieluzycki, M. I. Branco, andR. M. Heethaar. A maximum-likelihood estimator for trial-to-trial

Page 74: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

variations in noisy meg/eeg data sets. IEEE Transactions onBiomedical Engineering, 51(12), 2004.

M. H. DeGrout. Reaching a consensus. Journal of American StatisticalAssociation, 69:118–121, 1974.

A. Dimakis, A. Sarwate, and M. Wainwright. Geographic gossip:Efficient averaging for sensor networks. IEEE Transactions on SignalProcessing, 56(3):1205–1216, March 2006.

A. Dimakis, S. Kar, J. M. F. Moura, M. G. Rabbat, and A. Scaglione.Gossip algorithms for distributed signal processing. Proceedings of theIEEE, 98(11), November 2010.

J. Friedman, T. Hastie, and R. Tibshirani. Sparse inverse covarianceestimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

M. G. Genton. Separable approximations of space-time covariancematrices. Environmetrics, 18:681–695, 2007.

A. Gersho and R. M. Gray. Vector quantization and Signal Compression.Kluwer Academic Press/Springer, 1992.

M. Horstein. Sequential transmission using noiseless feedback. IEEETransactions on Information Theory, pages 136–143, July 1963.

C.-J. Hsieh, M. A. Sustik, I. S. Dhillon, and P. Ravikumar. Sparse inversecovariance matrix estimation using quadratic approximation. Advancesin Neural Information Processing Systems, 24, 2011.

Page 75: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

J. Huang, N. Liu, M. Pourahmadi, and L. Liu. Covariance matrixselection and estimation via penalised normal likelihood. Biometrika,93(1), 2006.

A. Jadbabaie, P. Molavi, A. Sandroni, and A. Tahbaz-Salehi.Non-bayesian social learning. Games and Economic Behavior, 76:210–225, 2012.

B. Jedynak, P. I. Frazier, and R. Sznitman. Twenty questions with noise:Bayes optimal policies for entropy loss. Journal of Applied Probability,49:114–136, 2012.

S. C. Jun, S. M. Plis, D. M. Ranken, and D. M. Schmidt. Spatiotemporalnoise covariance estimation from limited empiricalmagnetoencephalographic data. Physics in Medicine and Biology, 51:5549–5564, 2006.

E. Kalnay, M. Kanamitsu, R. Kistler, W. Collins, D. Deaven, L. Gandin,M. Iredell, S. Saha, G. White, J. Woollen, Y. Zhu, M. Chelliah,W. Ebisuzaki, W.Higgins, J. Janowiak, K. C. Mo, C. Ropelewski,J. Wang, A. Leetmaa, R. Reynolds, Roy Jenne, and Dennis Joseph.The ncep/ncar 40-year reanalysis project. Bulletin of the AmericanMeteorological Society, 77(3):437471, 1996.

S. Kar and J. M. F. Moura. Covergence rate analysis of distributed gossip

Page 76: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

(linear parameter) estimation: Fundamental limits and tradeoffs. IEEEJournal of Selected Topics in Signal Processing, 5(4), August 2011.

N. El Karoui. Spectrum estimation for large dimensional covariancematrices using random matrix theory. Annals of Statistics, 36(6):2757–2790, 2008.

R. M. Karp and R. Kleinberg. Noisy binary search and its applications. InProceedings of the eighteenth annual ACM-SIAM symposium onDiscrete algorithms (SODA), pages 881–890, 2007.

Steffen L. Lauritzen. Graphical Models. Oxford University Press US, firstedition, 1996.

Charles Van Loan and Nikos Pitsianis. Approximation with kroneckerproducts. In Linear Algebra for Large Scale and Real TimeApplications, pages 293–314. Kluwer Publications, 1993.

P. Molavi, A. Jadbabaie, K. R. Rad, and A. Tahbaz-Salehi. Reachingconsensus with increasing information. IEEE Journal of Selected Topicsin Signal Processing, 7(2):358–369, April 2013.

M. Nokleby, W. Bajwa, R. Calderbank, and B. Aazhang. Towardresource-optimal consensus over the wireless medium. IEEE Journal ofSelected Topics in Signal Processing, 7(2), April 2013.

Page 77: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

R. J. Owen. A bayesian sequential procedure for quantal response in thecontext of adaptive mental testing. Journal of the American StatisticalAssociation, 70:351–356, 1975.

D. Paul. Asymptotics of sample eigenstructure for a large dimensionalspiked covariance model. Statistica Sinica, 17:1617–1642, 2007.

N. Raj Rao, J. Mingo, R. Speicher, and A. Edelman. Statisticaleigen-inference from large wishart matrices. Annals of Statistics, 36(6):2850–2885, 2008.

A. Rothman, P. Bickel, E. Levina, and J. Zhu. Sparse permutationinvariant covariance estimation. Electronic Journal of Statistics, 2:494–515, 2008.

A. Rucci, S. Tebaldini, and F. Rocca. Skp-shrinkage estimator for sarmulti-baselines applications. In Proceedings of IEEE Radar Conference,2010.

V. Saligrama, M. Alanyali, and O. Savas. Distributed detection in sensornetworks with packet loss and finite capacity links. IEEE Transactionson Signal Processing, 54(11):4118–4132, November 2006.

S. Tebaldini. Algebraic synthesis of forest scenarios from multibaselinepolinsar data. IEEE Transactions on Geoscience and Remote Sensing,47(12), December 2009.

Page 78: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

T. Tsiligkaridis and A. O. Hero. Low separation rank covarianceestimation using kronecker product expansions. In Proceedings of IEEEInternational Symposium on Information Theory (ISIT), July 2013a.

T. Tsiligkaridis and A. O. Hero. Covariance Estimation via KroneckerProduct Expansions. arXiv: 1302.2686, February 2013b.

T. Tsiligkaridis, A. O. Hero, and S. Zhou. Convergence Properties ofKronecker Graphical Lasso algorithms. arXiv:1204.0585, July 2012.

T. Tsiligkaridis, A. O. Hero, and S. Zhou. On Convergence of KroneckerGraphical Lasso Algorithms. IEEE Transactions on Signal Processing,61(7):1743–1755, April 2013a.

T. Tsiligkaridis, B. M. Sadler, and A. O. Hero. Collaborative 20 Questionsfor Target Localization. Preprint, arXiv: 1306.1922, August 2013b.

T. Tsiligkaridis, B. M. Sadler, and A. O. Hero. A Collaborative 20Questions model for target search with human-machine interaction. InProceedings of IEEE International Conference on Acoustics, Speech,and Signal Processing (ICASSP), 2013c.

J. Tsitsiklis. Problems in decentralized decision making and computation.PhD thesis, Massachussets Institute of Technology, Cambridge, MA,November 1984.

Page 79: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

J. Tsitsiklis, D. Bertsekas, and M. Athans. Distributed asynchronousdeterministic and stochastic gradient optimization algorithms. IEEETransactions on Automatic Control, 31(9):803–812, September 1986.

R. Waeber, P. I. Frazier, and S. G. Henderson. Bisection search withnoisy responses. SIAM Journal of Control and Optimization, 53(3):2261–2279, 2013.

H. Wainer. Computerized Adaptive Testing: A Primer. Routledge, 2edition, 2000.

K. Werner, M. Jansson, and P. Stoica. On estimation of covariancematrices with Kronecker product structure. IEEE Transactions onSignal Processing, 56(2), February 2008.

Karl Werner and Magnus Jansson. Estimation of kronecker structuredchannel covariances using training data. In Proceedings of EUSIPCO,2007.

A. Wiesel, Y. Eldar, and A. O. Hero. Covariance estimation indecomposable gaussian graphical models. IEEE Transactions on SignalProcessing, 58(3):1482–1492, March 2010.

J. Yin and H. Li. Model selection and estimation in the matrix normalgraphical model. Journal of Multivariate Analysis, 107:119–140, 2012.

Page 80: High Dimensional Separable Representations for Statistical ...web.eecs.umich.edu/~hero/Preprints/PhDthesis_pres_2013_v4.pdf · High Dimensional Separable Representations for Statistical

Kronecker GLasso Kronecker PCA Centralized Collaborative 20 Q. Decentralized Collaborative 20 Q. Conclusion References

K. Yu, J. Lafferty, S. Zhu, and Y. Gong. Large-scale collaborativeprediction using a nonparametric random effects model. ICML, pages1185–1192, 2009.

M. Yuan and Y. Lin. Model selection and estimation in the gaussiangraphical model. Biometrika, 94:19–35, 2007.

Y. Zhang and J. Schneider. Learning multiple tasks with a sparsematrix-normal penalty. Advances in Neural Information ProcessingSystems, 23:2550–2558, 2010.