Modern MDL Meets Data Mining: Insight, Theory, and Practice
Part III: Stochastic Complexity

Transcript of slides (73 pages): eda.mmci.uni-saarland.de/events/mdldm19/slides/mdldm19-part3.pdf

Page 1

Modern MDL Meets Data Mining: Insight, Theory, and Practice

Part III: Stochastic Complexity

Kenji Yamanishi
Graduate School of Information Science and Technology, The University of Tokyo
August 4th, 2019

KDD Tutorial

Page 2

Part III. Stochastic Complexity

3.1. Stochastic Complexity and NML Codelength
3.2. G-function and Fourier Transformation
3.3. Latent Variable Model Selection

3.3.1. Latent Stochastic Complexity
3.3.2. Decomposed NML Codelength
3.3.3. Applications to a Variety of Models

3.4. High-dimensional Sparse Model Selection
3.4.1. High-dimensional Penalty Selection
3.4.2. High-dimensional Minimax Prediction

Page 3

3.1. Stochastic Complexity and NML Codelength

Page 4

What is Information?

■ Case 1: Data distribution is known

Data sequence:

Probability mass (density) function:

Shannon information

Page 5

Theorem 3.1.1 (Shannon’s Source Coding Theorem)

Shannon information gives the optimal codelength when the true distribution is known in advance.

Characterization of Shannon Information

Kraft’s Inequality

Entropy
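As a concrete numerical sketch of these three notions (the 4-symbol distribution below is an illustrative assumption, not from the slides):

```python
import math

# A known source distribution over a 4-symbol alphabet (illustrative values).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def shannon_codelength(sym):
    """Shannon information -log2 p(x): the optimal codelength in bits."""
    return -math.log2(p[sym])

# Kraft's inequality: ideal codelengths satisfy sum_x 2^{-L(x)} <= 1.
kraft_sum = sum(2 ** -shannon_codelength(s) for s in p)

# The expected codelength under p equals the entropy H(p) = -sum_x p(x) log2 p(x).
entropy = sum(p[s] * shannon_codelength(s) for s in p)
```

Here `kraft_sum` is exactly 1 (the code is complete) and `entropy` is 1.75 bits.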

Page 6

■ Case 2: Data distribution is unknown.

Data sequence:

Probabilistic model class.:

Normalized Maximum Likelihood (NML) Distribution

Stochastic Complexity of x relative to the model class = NML codelength

Parametric Complexity of the model class
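The NML construction can be made concrete for the Bernoulli model class; the brute-force normalizer below is a sketch that is feasible only for small n:

```python
import math
from itertools import product

def ml_prob(x):
    """Maximized Bernoulli likelihood of a binary sequence x."""
    n, k = len(x), sum(x)
    th = k / n
    return th ** k * (1 - th) ** (n - k)  # 0**0 == 1 handles the boundary

def parametric_complexity(n):
    """C_n: the sum of maximized likelihoods over all length-n sequences."""
    return sum(ml_prob(y) for y in product([0, 1], repeat=n))

def nml_codelength(x):
    """Stochastic complexity: -log2 p_NML(x) = -log2 ML(x) + log2 C_n."""
    return -math.log2(ml_prob(x)) + math.log2(parametric_complexity(len(x)))
```

Dividing each maximized likelihood by the normalizer C_n (the parametric complexity) turns the non-normalized "best fit" scores into a proper distribution.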

Page 7

Theorem 3.1.2 (Minimax optimality of NML codelength) [Shtarkov Probl. Inf. Trans. 1987]

Characterization of NML Codelength

NML codelength achieves the minimum of Shtarkov's minimax risk (regret measured relative to the maximum-likelihood baseline).

Page 8

How to calculate Parametric Complexity

Theorem 3.1.3 (Asymptotic formula for parametric complexity) [Rissanen IEEE IT 1996]
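In its standard form, the asymptotic expansion for a k-dimensional model class with Fisher information I(θ) reads:

```latex
\log \mathcal{C}_n(\mathcal{M})
  = \frac{k}{2}\log\frac{n}{2\pi}
  + \log\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta
  + o(1)
```

The first term grows with the parameter dimension k, and the integral term measures the volume of distinguishable models in the class.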

Page 9

Example 3.1.1 (Multinomial Distribution)

Stochastic Complexity

Page 10

Example 3.1.2 (1-dimensional Gaussian distribution)

Restricted Parameter Space

Fisher Information

Parameter Space

m.l.e.

Stochastic complexity

Page 11

Characterization of Stochastic Complexity

Theorem 3.1.4 (Rissanen's lower bound) [Rissanen 1989]

Stochastic complexity gives the optimal codelength when the true distribution is unknown.

Page 12

Sequential NML Codelength

Sequential NML (SNML) = cumulative codelength for sequential NML coding

: data sequence

Theorem 3.1.5 (Property of SNML): For model classes satisfying certain regularity conditions, the cumulative SNML codelength asymptotically coincides with the NML codelength.

E.g. Regression model [Rissanen IEEE IT2000] [Roos, Myllymaki, Rissanen MVA 2009]
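For the Bernoulli model, the SNML predictive distribution has a simple closed form; the sketch below is an illustration and not taken from the slides:

```python
import math

def ml_prob(k, n):
    """Maximized Bernoulli likelihood of a sequence with k ones in n trials."""
    if n == 0:
        return 1.0
    th = k / n
    return th ** k * (1 - th) ** (n - k)  # 0**0 == 1 handles the boundary

def snml_predict(k, t):
    """SNML predictive P(next bit = 1 | k ones in t bits): each candidate
    continuation is weighted by its own maximized likelihood, then normalized."""
    w1 = ml_prob(k + 1, t + 1)  # maximized likelihood if the next bit is 1
    w0 = ml_prob(k, t + 1)      # maximized likelihood if the next bit is 0
    return w1 / (w0 + w1)

def snml_cumulative_codelength(x):
    """Cumulative SNML codelength of a binary sequence, in bits."""
    total, k = 0.0, 0
    for t, bit in enumerate(x):
        p1 = snml_predict(k, t)
        total += -math.log2(p1 if bit == 1 else 1.0 - p1)
        k += bit
    return total
```

Unlike the batch NML distribution, this predictive form needs no sum over all length-n sequences, which is what makes sequential coding practical.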

Page 13

Example 3.1.3 (Autoregression model)

where

When σ is fixed, the SNML distribution becomes

Page 14

MDL Criterion (1/2)

MDL (Minimum Description Length) criterion

Page 15

MDL Criterion (2/2) [Rissanen IEEE IT 1996]

Page 16

Example 3.1.4. (Histogram Density)

[Figure: histogram density with K bins over the interval [0, 1]]
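A minimal sketch of MDL bin selection for a histogram density on [0, 1]; the (K−1)/2 · log n penalty used here is the standard asymptotic stand-in for the exact parametric complexity, an assumption made for brevity:

```python
import math

def mdl_histogram_bins(data, max_bins=20):
    """Choose the number of equal-width bins on [0, 1] by a two-part MDL score:
    negative log-likelihood of the histogram density (cell density = count*K/n)
    plus an asymptotic (K-1)/2 * log n penalty for the K-cell multinomial."""
    n = len(data)
    best_k, best_score = 1, float("inf")
    for k in range(1, max_bins + 1):
        counts = [0] * k
        for x in data:                       # data assumed to lie in [0, 1]
            counts[min(int(x * k), k - 1)] += 1
        nll = -sum(c * math.log(c * k / n) for c in counts if c > 0)
        score = nll + (k - 1) / 2 * math.log(n)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

For data concentrated in two clumps the negative log-likelihood term rewards many bins; for near-uniform data the penalty dominates and a single bin wins.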

Page 17

Optimality of MDL Estimation

Theorem 3.1.6 (Optimality of MDL estimator) [Rissanen 2012]

Page 18

Why MDL?

・Optimal solution to Shtarkov's minimax risk
・Attains Rissanen's lower bound
・Consistency [Rissanen Auto. Control 1978][Barron 1989]
・Index of resolvability [Barron and Cover IEEE IT1991]
・Rapid convergence rate with PAC learning [Yamanishi Mach. Learning 1992][Rissanen, Yu Learn. and Geo. 1996][Chatterjee and Barron ISIT2014][Kawakita, Takeuchi ICML2016]
・Estimation optimality [Rissanen 2012]
・A huge number of case studies

Page 19

3.2. G-Function and Fourier Transformation

Page 20

Computational Problem in Parametric Complexity

It is hard to calculate the parametric complexity for a general model class M. Beyond Rissanen's asymptotic formula, for small data we want to:
1) Calculate it non-asymptotically.
2) Calculate it exactly and efficiently.

Approaches: A) g-function, B) Fourier transformation, C) combinatorial methods.

Page 21

g-Function

where

Density decomposition [Rissanen 2007, 2012]

Page 22

Calculation of Parametric Complexity via g-function

Variable transformation

Page 23

Example 3.2.1 (g-function for exponential distributions)

:m.l.e.

Page 24

⇒ Extended to the general exponential family [Hirai and Yamanishi IEEE IT 2013]

Then

Page 25

Fourier Transformation Method (1/2)

Theorem 3.2.1 (Parametric complexity based on Fourier transformation) [Suzuki and Yamanishi ISIT 2018]

Page 26

Fourier Transformation Method (2/2)

Page 27

Exponential Family

Theorem 3.2.2 (Parametric complexity of Exponential Family with Fourier method) [Suzuki and Yamanishi ISIT 2018]

Page 28

Parametric Complexity of Exponential Family Computed with Fourier Method

[Suzuki and Yamanishi ISIT 2018]

Page 29

3.3. Latent Variable Model Selection

Page 30

Motivation 1: Topic Model

How many topics lie in a document collection?

[Figure: LDA (Latent Dirichlet Allocation) over documents 1, ..., D with K topic word distributions, e.g.,
Topic 1: machine 0.05, computer 0.03, algorithm 0.02;
Topic 2: matrix 0.04, math 0.03, equation 0.01;
...;
Topic K: bio 0.05, gene 0.04, array 0.01]

Latent variable model selection: estimate the number of topics (here #topics = 3).

Page 31

Motivation 2: Relational Model

How many blocks lie in relational data?

[Figure: relational (adjacency) matrix whose rows and columns are rearranged by a permutation, revealing a block structure with #blocks = 6]

Latent variable model selection: estimate the number of blocks.

Page 32

Latent Variable Models

Example 3.4.1 (Gaussian mixture model): each observation is generated from one of k Gaussian components, indexed by the latent variable Z = 1, 2, ..., k.

Page 33

Non-identifiability of Latent Variable Models

・Marginalized model (finite mixture model): non-identifiable; the parameter space contains singular regions.
・Complete variable model (estimate Z from X): identifiable.

Page 34

3.3.1. Latent Stochastic Complexity

[Figure: K latent components Z = 1, 2, ..., K generating N outputs X_1, X_2, ..., X_N]

Normalized maximum likelihood (NML) codelength for the complete variable model: L_NML(X, Z; K) encodes X and Z simultaneously.

Latent Stochastic Complexity (LSC) = this NML codelength; its normalizer is the latent parametric complexity.

Page 35

Finite Mixture Model

Latent Parametric Complexity

Page 36

Parametric Complexity of FMM

Theorem 3.3.1 (Recurrence relation for the parametric complexity of FMM) [Hirai and Yamanishi IEEE IT 2013]

Page 37

Gaussian Mixture Models

Example 3.3.1 (Parametric complexity of multivariate GMM)

where

[Hirai Yamanishi IEEE IT 2013, 2019]

Page 38

Latent Variable Model Selection with Latent Stochastic Complexity (1/2)

・Naïve Bayes Model [Kontkanen and Myllymaki 2008]
・Gaussian Mixture Model [Kyrgyzov, Kyrgyzov, Maître, Campedel MLDM2007][Hirai and Yamanishi IEEE IT 2013, 2019]
・General Relational Model [Sakai, Yamanishi IEEE BigData 2013]
・Principal Component Analysis / Canonical Correlation Analysis [Archambeau, Bach NIPS2009][Virtanen, Klami, Kaski ICML2011][Nakamura, Iwata, Yamanishi DSAA2017]

Page 39

Latent Variable Model Selection with Latent Stochastic Complexity (2/2)

・Non-negative Matrix Factorization (NMF) Rank Estimation [Miettinen, Vreeken KDD2011][Yamauchi, Kawakita, Takeuchi ICONIP2012][Ito, Oeda, Yamanishi SDM2016][Squires, Prügel-Bennett, Niranjan Neural Computation 2017]
  c.f. [Cemgil CIN2009] Variational Bayes; [Hoffman, Blei, Cook 2010] Non-parametric Bayes
・Convolutive NMF [Suzuki, Miyaguchi, Yamanishi ICDM2016]
・Non-negative Tensor Factorization (NTF) Rank Estimation [Fu, Matsushima, Yamanishi Entropy2019]

Page 40

3.3.2. Decomposed Normalized Maximum Likelihood (DNML) Codelength
[Wu, Sugawara, Yamanishi KDD2017][Yamanishi, Wu, Sugawara, Okada DAMI2019]

[Figure: K latent components Z = 1, 2, ..., K generating N outputs]

Encode X and Z separately:
L_DNML(X, Z; K) = L_NML(Z; K) + L_NML(X | Z; K)

Computable efficiently and non-asymptotically.

Page 41

How to Compute DNML

DNML criterion
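A minimal sketch of the DNML idea for a Bernoulli mixture with observed (complete) assignments; the brute-force normalizers are an illustrative assumption, feasible only for tiny n:

```python
import math
from itertools import product

def bern_complexity(n):
    """Exact Bernoulli parametric complexity C_n (brute force over counts)."""
    if n == 0:
        return 1.0
    return sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))

def mult_complexity(K, n):
    """Exact K-category multinomial parametric complexity (brute force)."""
    total = 0.0
    for z in product(range(K), repeat=n):
        counts = [z.count(j) for j in range(K)]
        total += math.prod((c / n) ** c for c in counts if c > 0)
    return total

def dnml_codelength(x, z, K):
    """DNML: encode Z with multinomial NML, then X given Z clusterwise."""
    n = len(x)
    counts = [z.count(j) for j in range(K)]
    # L_NML(Z; K): maximized multinomial likelihood plus its log-normalizer
    lz = -sum(c * math.log2(c / n) for c in counts if c > 0) \
         + math.log2(mult_complexity(K, n))
    # L_NML(X | Z; K): per-cluster Bernoulli NML codelengths
    lx = 0.0
    for j in range(K):
        xs = [xi for xi, zi in zip(x, z) if zi == j]
        nj, kj = len(xs), sum(xs)
        if nj == 0:
            continue
        th = kj / nj
        ml = th ** kj * (1 - th) ** (nj - kj)
        lx += -math.log2(ml) + math.log2(bern_complexity(nj))
    return lz + lx
```

The key point is that each normalizer involves only a single exponential-family model, so the efficient formulas of this section apply to each term separately.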

Page 42

Finite Mixture Models

Multinomial distribution

Page 43

Efficient Computation of Parametric Complexity for Multinomial Distribution

Theorem 3.3.2 (Recurrence relation of the parametric complexity of the multinomial distribution) [Kontkanen, Myllymaki IPL 2007]
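Assuming the recurrence C_m(n) = C_{m-1}(n) + (n / (m-2)) · C_{m-2}(n) with C_1 = 1 and C_2 given by a binomial sum, a linear-time sketch (function name chosen here):

```python
import math

def multinomial_complexity(K, n):
    """Parametric complexity C_K(n) of the K-category multinomial via the
    Kontkanen-Myllymaki recurrence:
      C_1(n) = 1,
      C_2(n) = sum_k binom(n, k) (k/n)^k ((n-k)/n)^(n-k),
      C_m(n) = C_{m-1}(n) + (n / (m-2)) * C_{m-2}(n)  for m >= 3.
    Requires n >= 1; runs in O(K + n) time."""
    c_prev = 1.0  # C_1
    c_curr = sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C_2
    if K == 1:
        return c_prev
    for m in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n / (m - 2) * c_prev
    return c_curr
```

A quick sanity check: C_m(1) = m for any m, since every length-1 sequence has maximized likelihood 1.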


Page 44

3.3.3. Applications to a Variety of Classes

Example 3.3.2 (Latent Dirichlet Allocation: LDA)

where

Page 45

Empirical Evaluation (Synthetic Data) [Wu, Sugawara, Yamanishi KDD2017]

[Figure: estimated number of topics vs. sample size, and error rate vs. noise ratio, with the true number of topics marked]

・DNML converges to the true model as rapidly as LSC, though more slowly than AIC.
・DNML is more robust against noise than AIC.
・DNML identifies the true number of topics of LDA most robustly.

Page 46

Empirical Evaluation (Benchmark Data: Newsgroup Dataset) [Wu, Sugawara, Yamanishi KDD2017]

Estimated number of topics (true number in the column header):

Method    2 topics  3 topics  4 topics  5 topics  6 topics
DNML          2         3         4         5         7
AIC           7         4         9         7         7
Laplace       8         4         8         9         5
ELBO          3         4         4         4         3
CV            9         1         3         7         1
HDP          80        98        94        92        98

DNML identifies the true number of topics most accurately.

Page 47


Example 3.3.3 (Stochastic Block Models: SBM)

where

Page 48

Empirical Evaluation (Synthetic Data) [Wu, Sugawara, Yamanishi KDD2017]

[Figure: estimated number of blocks vs. sample size, with the true number marked]

AIC increases most rapidly but overfits the data for large sample sizes; DNML increases more slowly but converges to the true number.

Page 49

Example 3.3.4 (Gaussian Mixture Models: GMM)

Page 50

Empirical Evaluation (Synthetic Data) [Yamanishi et al. DAMI 2019]

[Figure: results for m = 5, K* = 10, and for m = 5, K* = 5 with small components]

DNML performs best and can identify small components exactly.

Page 51

Minimax Optimality of DNML

Theorem 3.3.3 (Estimation optimality of DNML) [Yamanishi, Wu, Sugawara, Okada DAMI2019]

KL-Divergence

Page 52

Wide Applicability of DNML

DNML is universally applicable to model selection for a wide class of hierarchical latent variable models. DNML is a conclusive solution to latent variable model selection.

Page 53

Classes of Hierarchical Latent Variable Models

Page 54

3.4. High-dimensional Sparse Model Selection

Page 55

3.4.1. High-dimensional Penalty Selection

Goal) Identify the "essential" parameter subset in high-dimensional models
• to achieve better generalization,
• to realize knowledge discovery in a lower-dimensional space.

Method) Minimize the regularized log loss, where the penalty assigns a weight λ_j to each parameter:
• small λ_j for essential parameters,
• large λ_j for unnecessary parameters.
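The inner fit under a fixed penalty vector can be sketched with a two-feature weighted ridge (the selection of the λ_j themselves, the actual subject of this section, is done by minimizing the uLNML bound; the data and penalty values below are illustrative):

```python
def weighted_ridge_2d(X, y, lam):
    """Closed-form minimizer of sum_i (y_i - w.x_i)^2 / 2 + sum_j lam_j w_j^2 / 2
    for two features: w = (X^T X + diag(lam))^{-1} X^T y, solved by hand."""
    a = sum(x[0] * x[0] for x in X) + lam[0]
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + lam[1]
    g0 = sum(x[0] * yi for x, yi in zip(X, y))
    g1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

# Toy data: feature 0 explains y, feature 1 gets a large penalty weight.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1.0, 0.0, 1.0]
w = weighted_ridge_2d(X, y, lam=(0.001, 10.0))
```

Large λ_1 drives the second coordinate toward zero (it is treated as unnecessary), while the small λ_0 leaves the first coordinate nearly free.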

Page 56

Luckiness NML Codelength (LNML)

• Luckiness minimax regret [Grünwald 2007]

where

Page 57

Upper Bounding of LNML

Theorem 3.3.5 (uLNML: an upper bound on LNML) [Miyaguchi, Yamanishi Machine Learning 2018]

The LNML codelength is uniformly upper-bounded by

Page 58

Optimization Algorithm [Miyaguchi, Yamanishi Machine Learning 2018]

• Upper bound on LNML = concave part + convex part
• Concave-convex procedure (CCCP) [Yuille, Rangarajan NC 2002]
  ✅ Monotonically non-increasing objective
  ✅ Convergence to a saddle point
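A toy instance of CCCP (the objective here is an illustrative quadratic, not the uLNML bound itself):

```python
def cccp(u_argmin_linearized, v_grad, x0, iters=50):
    """Concave-convex procedure: to minimize u(x) + v(x) with u convex and
    v concave, repeatedly linearize v at the current point and minimize
    u(x) + v'(x_t) * x.  Each step cannot increase the objective."""
    x = x0
    for _ in range(iters):
        x = u_argmin_linearized(v_grad(x))
    return x

# Toy instance: minimize (x - 2)^2 + (-x^2 / 2); the true minimizer is x = 4.
u_step = lambda g: 2 - g / 2   # argmin_x (x - 2)^2 + g * x  ->  x = 2 - g/2
v_grad = lambda x: -x          # d/dx (-x^2 / 2) = -x
x_star = cccp(u_step, v_grad, x0=0.0)
```

Each iteration reduces to x ← 2 + x/2, which contracts toward the fixed point x = 4, illustrating the monotone convergence property claimed above.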

Page 59

Example 3.3.4 (Linear regression model)

The upper bound on LNML is calculated as

Page 60

Experiments: Linear Regression (UCI Repository) [Miyaguchi, Yamanishi Machine Learning 2018]

[Figure: test likelihood vs. sample size on Boston (d = 13), Diabetes (d = 10), MillionSongs (d = 90), and ResidentialBuilding (d = 103), comparing the proposed method with RVM]

• Better than grid-search-based methods (especially when d ≳ n)
• More robust than RVM (relevance vector machine)

Page 61

Example 3.3.5 (Graphical Model Estimation)

• m-variate Gaussian model
• The precision matrix Θ represents conditional dependencies
• The upper bound on LNML: −ln p_uLNML(x^n; γ)

Page 62

Experiments: Graphical Model (Synthetic Data) [Miyaguchi, Yamanishi Machine Learning 2018]

[Figure: KL divergence and codelength vs. number of training samples, for d = 190 and d = 4950]

• Data generated with the double-ring model
• The method can handle more than 10^3 dimensions

Page 63

Luckiness NML for L1-Penalty

• Luckiness Minimax Regret [Miyaguchi, Matsushima and Yamanishi SDM2016]

where

Page 64

3.4.2. High-dimensional Minimax Prediction

Page 65

On-line Bayesian Prediction Strategy

In the conventional low-dimensional setting, the minimax regret is attained by, e.g., the Bayesian prediction strategy [Clarke and Barron JSPI 1994][Takeuchi and Barron ISIT 1998].

Then the cumulative log loss amounts to:

Problem: no counterpart in the high-dimensional setting (d ≳ n)
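For a Bernoulli source, the Bayesian prediction strategy under the Jeffreys prior is the Krichevsky-Trofimov predictor; a sketch of its cumulative log loss (an illustration of the low-dimensional strategy, not from the slides):

```python
import math

def jeffreys_predict(k, t):
    """Krichevsky-Trofimov predictive (Bayes mixture under the Jeffreys
    prior for a Bernoulli source): P(next bit = 1 | k ones in t bits)."""
    return (k + 0.5) / (t + 1.0)

def cumulative_log_loss(x):
    """Cumulative log loss (in bits) of the sequential Bayesian strategy;
    it equals -log2 of the Bayes mixture probability of the whole sequence."""
    total, k = 0.0, 0
    for t, bit in enumerate(x):
        p1 = jeffreys_predict(k, t)
        total += -math.log2(p1 if bit else 1.0 - p1)
        k += bit
    return total
```

For a k-parameter regular model the excess of this cumulative loss over the maximized likelihood grows like (k/2) log n, matching the asymptotic formula for the parametric complexity; the point of this subsection is that no such strategy is available once d ≳ n.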

Page 66

High-dimensional Asymptotics

Assumption:
• The second derivative of the loss is uniformly upper-bounded by L.
• The ℓ1-radius of P is at most R < +∞.

Theorem 3.3.6 (Asymptotics of the worst-case regret) [Miyaguchi, Yamanishi AISTATS 2019]

Under the high-dimensional limit ω(n) = d = e^{o(n)}:

Optimal within a factor of 2.

Page 67

Minimax Predictor in the High-dimensional Setting

Bayesian prediction with the Spike-and-Tails (ST) prior [Miyaguchi and Yamanishi AISTATS2019]

[Figure: one-dimensional illustration of the ST prior: a spike at zero with exponential tails ∝ e^{−λ|θ|}; the gap between spike and tails is wide when d ≫ n]
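A one-dimensional toy stand-in for such a prior (the exact ST construction is in the paper; `w`, `eps`, and `lam` below are illustrative assumptions): a narrow uniform spike at zero mixed with exponential (Laplace) tails:

```python
import math

def st_density(theta, w=0.5, eps=1e-3, lam=1.0):
    """Toy spike-and-tails density (hypothetical parameterization):
    a (1-w)-weighted uniform spike on [-eps, eps] plus w-weighted
    Laplace tails proportional to exp(-lam * |theta|).
    Both components integrate to 1, so the mixture does too."""
    spike = (1.0 / (2 * eps)) if abs(theta) <= eps else 0.0
    tail = (lam / 2) * math.exp(-lam * abs(theta))
    return (1 - w) * spike + w * tail
```

The spike concentrates mass on exactly-sparse parameters, while the slowly decaying tails keep the predictor from being overconfident about the few large coordinates, which is the intuition behind the minimax result above.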

Page 68

Summary

• Stochastic complexity (SC), namely the NML codelength, is the well-defined key information quantity of data relative to a model class. The MDL principle is to minimize SC.
• Techniques for efficient computation of SC are established: 1) asymptotic formula, 2) g-functions, 3) Fourier transform, 4) combinatorial methods.
• When applying MDL to latent variable models, use complete variable models, and apply Latent Stochastic Complexity (LSC) or Decomposed NML (DNML) to realize efficient and optimal estimation.
• When applying MDL to high-dimensional sparse models, apply LNML to realize optimal penalty selection.

Page 69

References

■ 3.1. Stochastic Complexity and NML Codelength
・J. Rissanen: "Modeling by shortest data description," Automatica, 14:465-471, 1978.
・Y. M. Shtar'kov: "Universal sequential coding of single messages," Problemy Peredachi Informatsii, 23(3):3-17, 1987.
・J. Rissanen: "Stochastic Complexity in Statistical Inquiry," World Scientific, 1989.
・A. Barron and T. Cover: "Minimum complexity density estimation," IEEE Transactions on Information Theory, 37(4):1034-1054, 1991.
・K. Yamanishi: "A learning criterion for stochastic rules," Machine Learning, 9:165-203, 1992.
・J. Rissanen: "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, 42(1):40-47, 1996.
・P. D. Grünwald: "The Minimum Description Length Principle," MIT Press, 2007.
・J. Rissanen: "Optimal Estimation of Parameters," Cambridge University Press, 2012.
・M. Kawakita and J. Takeuchi: "Barron and Cover's theory in supervised learning and its application to lasso," Proceedings of the International Conference on Machine Learning, 2016.

Page 70

References

■ 3.2. G-function and Fourier Transformation Method
・J. Rissanen: "MDL denoising," IEEE Transactions on Information Theory, 46(7):2537-2543, 2000.
・J. Rissanen: "Information and Complexity in Statistical Modeling," Springer, 2007.
・A. Suzuki and K. Yamanishi: "Exact calculation of normalized maximum likelihood code length using Fourier analysis," Proceedings of the IEEE International Symposium on Information Theory (ISIT2018), pp:1211-1215, 2018.

■ 3.3.1. Latent Stochastic Complexity
・P. Kontkanen, P. Myllymaki, W. Buntine, J. Rissanen, and H. Tirri: "An MDL framework for data clustering," in Advances in Minimum Description Length: Theory and Applications, MIT Press, pp:323-35, 2005.
・P. Kontkanen and P. Myllymäki: "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, 103(6), pp:227-233, 2007.
・I. O. Kyrgyzov, O. O. Kyrgyzov, H. Maître, M. Campedel: "Kernel MDL to determine the number of clusters," Proceedings of the Workshop on Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, vol. 4571, pp:2013-217, 2007.

Page 71

References

■ 3.3.1. Latent Stochastic Complexity (cont.)
・S. Hirai and K. Yamanishi: "Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering," IEEE Transactions on Information Theory, 59(11):7718-7727, 2013.
・Y. Sakai and K. Yamanishi: "An NML-based model selection criterion for general relational data modeling," Proceedings of the IEEE International Conference on Big Data (BigData 2013), pp:421-429, 2013.
・P. Miettinen and J. Vreeken: "MDL4BMF: Minimum description length for Boolean matrix factorization," ACM Transactions on Knowledge Discovery from Data, 8(4), 2014.
・A. Suzuki, K. Miyaguchi, and K. Yamanishi: "Structure selection for convolutive non-negative matrix factorization using normalized maximum likelihood coding," Proceedings of the IEEE International Conference on Data Mining (ICDM2016), pp:1221-1226, 2016.
・Y. Ito, S. Oeda, and K. Yamanishi: "Rank selection for non-negative matrix factorization with normalized maximum likelihood coding," Proceedings of the SIAM International Conference on Data Mining (SDM2016), pp:720-728, 2016.
・S. Squires, A. Prügel-Bennett, M. Niranjan: "Rank selection in nonnegative matrix factorization using minimum description length," Neural Computation, 29:2164-2176, 2017.

Page 72

References

■ 3.3.1. Latent Stochastic Complexity (cont.)
・T. Nakamura, T. Iwata, and K. Yamanishi: "Latent dimensionality estimation for probabilistic canonical correlation analysis using normalized maximum likelihood code-length," Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA2017), pp:716-725, 2017.
・Y. Fu, S. Matsushima, K. Yamanishi: "Model selection for non-negative tensor factorization with minimum description length," Entropy, 2019.
・S. Hirai and K. Yamanishi: "Correction to: Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering," IEEE Transactions on Information Theory, 2019.

■ 3.3.2. Decomposed Normalized Maximum Likelihood Codelength
・T. Wu, S. Sugawara, K. Yamanishi: "Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2017), pp:1165-1174, 2017.
・K. Yamanishi, T. Wu, S. Sugawara, M. Okada: "The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models," Data Mining and Knowledge Discovery, 33(4):1017-1058, 2019.

Page 73

References

■ 3.4. High-dimensional Sparse Model Selection
・S. Chatterjee and A. Barron: "Information-theoretic validity of penalized likelihood," Proceedings of the IEEE International Symposium on Information Theory, pp:3027-3031, 2014.
・K. Miyaguchi, S. Matsushima, and K. Yamanishi: "Sparse graphical modeling via stochastic complexity," Proceedings of the 2017 SIAM International Conference on Data Mining (SDM2017), pp:723-731, 2017.
・K. Miyaguchi and K. Yamanishi: "High-dimensional penalty selection via minimum description length principle," Machine Learning, 107(8-10), pp:1283-1302, 2018.
・K. Miyaguchi and K. Yamanishi: "Adaptive minimax regret against smooth logarithmic losses over high-dimensional l1-balls via envelope complexity," Proceedings of Artificial Intelligence and Statistics (AISTATS 2019), pp:3440-3448, 2019.

C.f. Bayesian prediction related to MDL
・B. Clarke and A. Barron: "Jeffreys' prior is asymptotically least favorable under entropy risk," Journal of Statistical Planning and Inference, 41(1):37-60, 1994.
・J. Takeuchi and A. Barron: "Asymptotically minimax regret by Bayes mixtures," Proceedings of the IEEE International Symposium on Information Theory, 1998.