Modern MDL Meets Data Mining: Insight, Theory, and Practice
Part III: Stochastic Complexity

Transcript of slides (73 pages): eda.mmci.uni-saarland.de/events/mdldm19/slides/mdldm19-part3.pdf

Page 1

Modern MDL Meets Data Mining: Insight, Theory, and Practice

Part III: Stochastic Complexity

Kenji Yamanishi
Graduate School of Information Science and Technology, The University of Tokyo
August 4th, 2019

KDD Tutorial

Page 2

Part III. Stochastic Complexity

3.1. Stochastic Complexity and NML Codelength
3.2. G-function and Fourier Transformation
3.3. Latent Variable Model Selection

3.3.1. Latent Stochastic Complexity
3.3.2. Decomposed NML Codelength
3.3.3. Applications to a Variety of Models

3.4. High-dimensional Sparse Model Selection
3.4.1. High-dimensional Penalty Selection
3.4.2. High-dimensional Minimax Prediction

Page 3

3.1. Stochastic Complexity and NML Codelength

Page 4

What is Information?

■ Case 1: Data distribution is known

Data sequence:

Probability mass (density) function:

Shannon information

Page 5

Theorem 3.1.1 (Shannon’s Source Coding Theorem)

Shannon information gives the optimal codelength when the true distribution is known in advance.

Characterization of Shannon Information

Kraft’s Inequality

Entropy
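As a concrete numerical sketch of these three notions (the 4-symbol distribution below is an illustrative assumption, not from the slides):

```python
import math

# A known source distribution over a 4-symbol alphabet (illustrative values).
p = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}

def shannon_codelength(sym):
    """Shannon information -log2 p(x): the optimal codelength in bits."""
    return -math.log2(p[sym])

# Kraft's inequality: ideal codelengths satisfy sum_x 2^{-L(x)} <= 1.
kraft_sum = sum(2 ** -shannon_codelength(s) for s in p)

# The expected codelength under p equals the entropy H(p) = -sum_x p(x) log2 p(x).
entropy = sum(p[s] * shannon_codelength(s) for s in p)
```

Here `kraft_sum` is exactly 1 (the code is complete) and `entropy` is 1.75 bits.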

Page 6

■ Case 2: Data distribution is unknown.

Data sequence:

Probabilistic model class.:

Normalized Maximum Likelihood (NML) Distribution

Stochastic Complexity of x relative to the model class = NML codelength

Parametric Complexity of the model class
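The NML construction can be made concrete for the Bernoulli model class; the brute-force normalizer below is a sketch that is feasible only for small n:

```python
import math
from itertools import product

def ml_prob(x):
    """Maximized Bernoulli likelihood of a binary sequence x."""
    n, k = len(x), sum(x)
    th = k / n
    return th ** k * (1 - th) ** (n - k)  # 0**0 == 1 handles the boundary

def parametric_complexity(n):
    """C_n: the sum of maximized likelihoods over all length-n sequences."""
    return sum(ml_prob(y) for y in product([0, 1], repeat=n))

def nml_codelength(x):
    """Stochastic complexity: -log2 p_NML(x) = -log2 ML(x) + log2 C_n."""
    return -math.log2(ml_prob(x)) + math.log2(parametric_complexity(len(x)))
```

Dividing each maximized likelihood by the normalizer C_n (the parametric complexity) turns the non-normalized "best fit" scores into a proper distribution.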

Page 7

Theorem 3.1.2 (Minimax optimality of NML codelength) [Shtarkov Probl. Inf. Trans. 1987]

Characterization of NML Codelength

NML codelength achieves the minimum of Shtarkov's minimax risk (regret measured relative to the maximum-likelihood baseline).

Page 8

How to calculate Parametric Complexity

Theorem 3.1.3 (Asymptotic formula for parametric complexity) [Rissanen IEEE IT 1996]
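In its standard form, the asymptotic expansion for a k-dimensional model class with Fisher information I(θ) reads:

```latex
\log \mathcal{C}_n(\mathcal{M})
  = \frac{k}{2}\log\frac{n}{2\pi}
  + \log\int_{\Theta}\sqrt{\det I(\theta)}\,d\theta
  + o(1)
```

The first term grows with the parameter dimension k, and the integral term measures the volume of distinguishable models in the class.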

Page 9

Example 3.1.1 (Multinomial Distribution)

Stochastic Complexity

Page 10

Example 3.1.2 (1-dimensional Gaussian distribution)

Restricted Parameter Space

Fisher Information

Parameter Space

m.l.e.

Stochastic complexity

Page 11

Characterization of Stochastic Complexity

Theorem 3.1.4 (Rissanen's lower bound) [Rissanen 1989]

Stochastic complexity gives the optimal codelength when the true distribution is unknown.

Page 12

Sequential NML Codelength

Sequential NML (SNML) = cumulative codelength for sequential NML coding

: data sequence

Theorem 3.1.5 (Property of SNML): For model classes satisfying certain regularity conditions, the cumulative SNML codelength asymptotically coincides with the NML codelength.

E.g. Regression model [Rissanen IEEE IT2000] [Roos, Myllymaki, Rissanen MVA 2009]
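For the Bernoulli model, the SNML predictive distribution has a simple closed form; the sketch below is an illustration and not taken from the slides:

```python
import math

def ml_prob(k, n):
    """Maximized Bernoulli likelihood of a sequence with k ones in n trials."""
    if n == 0:
        return 1.0
    th = k / n
    return th ** k * (1 - th) ** (n - k)  # 0**0 == 1 handles the boundary

def snml_predict(k, t):
    """SNML predictive P(next bit = 1 | k ones in t bits): each candidate
    continuation is weighted by its own maximized likelihood, then normalized."""
    w1 = ml_prob(k + 1, t + 1)  # maximized likelihood if the next bit is 1
    w0 = ml_prob(k, t + 1)      # maximized likelihood if the next bit is 0
    return w1 / (w0 + w1)

def snml_cumulative_codelength(x):
    """Cumulative SNML codelength of a binary sequence, in bits."""
    total, k = 0.0, 0
    for t, bit in enumerate(x):
        p1 = snml_predict(k, t)
        total += -math.log2(p1 if bit == 1 else 1.0 - p1)
        k += bit
    return total
```

Unlike the batch NML distribution, this predictive form needs no sum over all length-n sequences, which is what makes sequential coding practical.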

Page 13

Example 3.1.3 (Autoregression model)

where

When σ is fixed, the SNML distribution becomes

Page 14

MDL Criterion (1/2)

MDL (Minimum Description Length) criterion

Page 15

MDL Criterion (2/2) [Rissanen IEEE IT 1996]

Page 16

Example 3.1.4. (Histogram Density)

[Figure: histogram density with K bins over the interval [0, 1]]
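A minimal sketch of MDL bin selection for a histogram density on [0, 1]; the (K−1)/2 · log n penalty used here is the standard asymptotic stand-in for the exact parametric complexity, an assumption made for brevity:

```python
import math

def mdl_histogram_bins(data, max_bins=20):
    """Choose the number of equal-width bins on [0, 1] by a two-part MDL score:
    negative log-likelihood of the histogram density (cell density = count*K/n)
    plus an asymptotic (K-1)/2 * log n penalty for the K-cell multinomial."""
    n = len(data)
    best_k, best_score = 1, float("inf")
    for k in range(1, max_bins + 1):
        counts = [0] * k
        for x in data:                       # data assumed to lie in [0, 1]
            counts[min(int(x * k), k - 1)] += 1
        nll = -sum(c * math.log(c * k / n) for c in counts if c > 0)
        score = nll + (k - 1) / 2 * math.log(n)
        if score < best_score:
            best_k, best_score = k, score
    return best_k
```

For data concentrated in two clumps the negative log-likelihood term rewards many bins; for near-uniform data the penalty dominates and a single bin wins.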

Page 17

Optimality of MDL Estimation

Theorem 3.1.6 (Optimality of MDL estimator) [Rissanen 2012]

Page 18

Why MDL?

・Optimal solution to Shtarkov's minimax risk
・Attains Rissanen's lower bound
・Consistency [Rissanen Auto. Control 1978][Barron 1989]
・Index of resolvability [Barron and Cover IEEE IT1991]
・Rapid convergence rate with PAC learning [Yamanishi Mach. Learning 1992][Rissanen, Yu Learn. and Geo. 1996][Chatterjee and Barron ISIT2014][Kawakita, Takeuchi ICML2016]
・Estimation optimality [Rissanen 2012]
・A huge number of case studies

Page 19

3.2. G-Function and Fourier Transformation

Page 20

Computational Problem in Parametric Complexity

It is hard to calculate the parametric complexity for a general model class M. Beyond Rissanen's asymptotic formula, for small data we want to:
1) Calculate it non-asymptotically.
2) Calculate it exactly and efficiently.

Approaches: A) g-function, B) Fourier transformation, C) combinatorial methods.

Page 21

g-Function

where

Density decomposition [Rissanen 2007, 2012]

Page 22

Calculation of Parametric Complexity via g-function

Variable transformation

Page 23

Example 3.2.1 (g-function for exponential distributions)

:m.l.e.

Page 24

⇒ Extended to the general exponential family [Hirai and Yamanishi IEEE IT 2013]

Then

Page 25

Fourier Transformation Method (1/2)

Theorem 3.2.1 (Parametric complexity based on Fourier transformation) [Suzuki and Yamanishi ISIT 2018]

Page 26

Fourier Transformation Method (2/2)

Page 27

Exponential Family

Theorem 3.2.2 (Parametric complexity of Exponential Family with Fourier method) [Suzuki and Yamanishi ISIT 2018]

Page 28

Parametric Complexity of Exponential Family Computed with Fourier Method

[Suzuki and Yamanishi ISIT 2018]

Page 29

3.3. Latent Variable Model Selection

Page 30

Motivation 1: Topic Model

How many topics lie in a document collection?

[Figure: LDA (Latent Dirichlet Allocation) over documents 1, ..., D with K topic word distributions, e.g.,
Topic 1: machine 0.05, computer 0.03, algorithm 0.02;
Topic 2: matrix 0.04, math 0.03, equation 0.01;
...;
Topic K: bio 0.05, gene 0.04, array 0.01]

Latent variable model selection: estimate the number of topics (here #topics = 3).

Page 31

Motivation 2: Relational Model

How many blocks lie in relational data?

[Figure: relational (adjacency) matrix whose rows and columns are rearranged by a permutation, revealing a block structure with #blocks = 6]

Latent variable model selection: estimate the number of blocks.

Page 32

Latent Variable Models

Example 3.4.1 (Gaussian mixture model): each observation is generated from one of k Gaussian components, indexed by the latent variable Z = 1, 2, ..., k.

Page 33

Non-identifiability of Latent Variable Models

・Marginalized model (finite mixture model): non-identifiable; the parameter space contains singular regions.
・Complete variable model (estimate Z from X): identifiable.

Page 34

3.3.1. Latent Stochastic Complexity

[Figure: K latent components Z = 1, 2, ..., K generating N outputs X_1, X_2, ..., X_N]

Normalized maximum likelihood (NML) codelength for the complete variable model: L_NML(X, Z; K) encodes X and Z simultaneously.

Latent Stochastic Complexity (LSC) = this NML codelength; its normalizer is the latent parametric complexity.

Page 35

Finite Mixture Model

Latent Parametric Complexity

Page 36

Parametric Complexity of FMM

Theorem 3.3.1 (Recurrence relation for the parametric complexity of FMM) [Hirai and Yamanishi IEEE IT 2013]

Page 37

Gaussian Mixture Models

Example 3.3.1 (Parametric complexity of multivariate GMM)

where

[Hirai Yamanishi IEEE IT 2013, 2019]

Page 38

Latent Variable Model Selection with Latent Stochastic Complexity (1/2)

・Naïve Bayes Model [Kontkanen and Myllymaki 2008]
・Gaussian Mixture Model [Kyrgyzov, Kyrgyzov, Maître, Campedel MLDM2007][Hirai and Yamanishi IEEE IT 2013, 2019]
・General Relational Model [Sakai, Yamanishi IEEE BigData 2013]
・Principal Component Analysis / Canonical Correlation Analysis [Archambeau, Bach NIPS2009][Virtanen, Klami, Kaski ICML2011][Nakamura, Iwata, Yamanishi DSAA2017]

Page 39

Latent Variable Model Selection with Latent Stochastic Complexity (2/2)

・Non-negative Matrix Factorization (NMF) Rank Estimation [Miettinen, Vreeken KDD2011][Yamauchi, Kawakita, Takeuchi ICONIP2012][Ito, Oeda, Yamanishi SDM2016][Squires, Prügel-Bennett, Niranjan Neural Computation 2017]
  c.f. [Cemgil CIN2009] Variational Bayes; [Hoffman, Blei, Cook 2010] Non-parametric Bayes
・Convolutive NMF [Suzuki, Miyaguchi, Yamanishi ICDM2016]
・Non-negative Tensor Factorization (NTF) Rank Estimation [Fu, Matsushima, Yamanishi Entropy2019]

Page 40

3.3.2. Decomposed Normalized Maximum Likelihood (DNML) Codelength
[Wu, Sugawara, Yamanishi KDD2017][Yamanishi, Wu, Sugawara, Okada DAMI2019]

[Figure: K latent components Z = 1, 2, ..., K generating N outputs]

Encode X and Z separately:
L_DNML(X, Z; K) = L_NML(Z; K) + L_NML(X | Z; K)

Computable efficiently and non-asymptotically.

Page 41

How to Compute DNML

DNML criterion
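A minimal sketch of the DNML idea for a Bernoulli mixture with observed (complete) assignments; the brute-force normalizers are an illustrative assumption, feasible only for tiny n:

```python
import math
from itertools import product

def bern_complexity(n):
    """Exact Bernoulli parametric complexity C_n (brute force over counts)."""
    if n == 0:
        return 1.0
    return sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
               for k in range(n + 1))

def mult_complexity(K, n):
    """Exact K-category multinomial parametric complexity (brute force)."""
    total = 0.0
    for z in product(range(K), repeat=n):
        counts = [z.count(j) for j in range(K)]
        total += math.prod((c / n) ** c for c in counts if c > 0)
    return total

def dnml_codelength(x, z, K):
    """DNML: encode Z with multinomial NML, then X given Z clusterwise."""
    n = len(x)
    counts = [z.count(j) for j in range(K)]
    # L_NML(Z; K): maximized multinomial likelihood plus its log-normalizer
    lz = -sum(c * math.log2(c / n) for c in counts if c > 0) \
         + math.log2(mult_complexity(K, n))
    # L_NML(X | Z; K): per-cluster Bernoulli NML codelengths
    lx = 0.0
    for j in range(K):
        xs = [xi for xi, zi in zip(x, z) if zi == j]
        nj, kj = len(xs), sum(xs)
        if nj == 0:
            continue
        th = kj / nj
        ml = th ** kj * (1 - th) ** (nj - kj)
        lx += -math.log2(ml) + math.log2(bern_complexity(nj))
    return lz + lx
```

The key point is that each normalizer involves only a single exponential-family model, so the efficient formulas of this section apply to each term separately.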

Page 42

Finite Mixture Models

Multinomial distribution

Page 43

Efficient Computation of Parametric Complexity for Multinomial Distribution

Theorem 3.3.2 (Recurrence relation of the parametric complexity of the multinomial distribution) [Kontkanen, Myllymaki IPL 2007]
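Assuming the recurrence C_m(n) = C_{m-1}(n) + (n / (m-2)) · C_{m-2}(n) with C_1 = 1 and C_2 given by a binomial sum, a linear-time sketch (function name chosen here):

```python
import math

def multinomial_complexity(K, n):
    """Parametric complexity C_K(n) of the K-category multinomial via the
    Kontkanen-Myllymaki recurrence:
      C_1(n) = 1,
      C_2(n) = sum_k binom(n, k) (k/n)^k ((n-k)/n)^(n-k),
      C_m(n) = C_{m-1}(n) + (n / (m-2)) * C_{m-2}(n)  for m >= 3.
    Requires n >= 1; runs in O(K + n) time."""
    c_prev = 1.0  # C_1
    c_curr = sum(math.comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C_2
    if K == 1:
        return c_prev
    for m in range(3, K + 1):
        c_prev, c_curr = c_curr, c_curr + n / (m - 2) * c_prev
    return c_curr
```

A quick sanity check: C_m(1) = m for any m, since every length-1 sequence has maximized likelihood 1.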


Page 44

3.3.3. Applications to a Variety of Classes

Example 3.3.2 (Latent Dirichlet Allocation: LDA)

where

Page 45

Empirical Evaluation (Synthetic Data) [Wu, Sugawara, Yamanishi KDD2017]

[Figure: estimated number of topics vs. sample size, and error rate vs. noise ratio, with the true number of topics marked]

・DNML converges to the true model as rapidly as LSC, though more slowly than AIC.
・DNML is more robust against noise than AIC.
・DNML identifies the true number of topics of LDA most robustly.

Page 46

Empirical Evaluation (Benchmark Data: Newsgroup Dataset) [Wu, Sugawara, Yamanishi KDD2017]

Estimated number of topics (true number in the column header):

Method    2 topics  3 topics  4 topics  5 topics  6 topics
DNML          2         3         4         5         7
AIC           7         4         9         7         7
Laplace       8         4         8         9         5
ELBO          3         4         4         4         3
CV            9         1         3         7         1
HDP          80        98        94        92        98

DNML identifies the true number of topics most accurately.

Page 47


Example 3.3.3 (Stochastic Block Models: SBM)

where

Page 48

Empirical Evaluation (Synthetic Data) [Wu, Sugawara, Yamanishi KDD2017]

[Figure: estimated number of blocks vs. sample size, with the true number marked]

AIC increases most rapidly but overfits the data for large sample sizes; DNML increases more slowly but converges to the true number.

Page 49

Example 3.3.4 (Gaussian Mixture Models: GMM)

Page 50

Empirical Evaluation (Synthetic Data) [Yamanishi et al. DAMI 2019]

[Figure: results for m = 5, K* = 10, and for m = 5, K* = 5 with small components]

DNML performs best and can identify small components exactly.

Page 51

Minimax Optimality of DNML

Theorem 3.3.3 (Estimation optimality of DNML) [Yamanishi, Wu, Sugawara, Okada DAMI2019]

KL-Divergence

Page 52

Wide Applicability of DNML

DNML is universally applicable to model selection for a wide class of hierarchical latent variable models. DNML is a conclusive solution to latent variable model selection.

Page 53

Classes of Hierarchical Latent Variable Models

Page 54

3.4. High-dimensional Sparse Model Selection

Page 55

3.4.1. High-dimensional Penalty Selection

Goal) Identify the "essential" parameter subset in high-dimensional models
• to achieve better generalization,
• to realize knowledge discovery in a lower-dimensional space.

Method) Minimize the regularized log loss, where the penalty assigns a weight λ_j to each parameter:
• small λ_j for essential parameters,
• large λ_j for unnecessary parameters.
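The inner fit under a fixed penalty vector can be sketched with a two-feature weighted ridge (the selection of the λ_j themselves, the actual subject of this section, is done by minimizing the uLNML bound; the data and penalty values below are illustrative):

```python
def weighted_ridge_2d(X, y, lam):
    """Closed-form minimizer of sum_i (y_i - w.x_i)^2 / 2 + sum_j lam_j w_j^2 / 2
    for two features: w = (X^T X + diag(lam))^{-1} X^T y, solved by hand."""
    a = sum(x[0] * x[0] for x in X) + lam[0]
    b = sum(x[0] * x[1] for x in X)
    d = sum(x[1] * x[1] for x in X) + lam[1]
    g0 = sum(x[0] * yi for x, yi in zip(X, y))
    g1 = sum(x[1] * yi for x, yi in zip(X, y))
    det = a * d - b * b
    return ((d * g0 - b * g1) / det, (a * g1 - b * g0) / det)

# Toy data: feature 0 explains y, feature 1 gets a large penalty weight.
X = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
y = [1.0, 0.0, 1.0]
w = weighted_ridge_2d(X, y, lam=(0.001, 10.0))
```

Large λ_1 drives the second coordinate toward zero (it is treated as unnecessary), while the small λ_0 leaves the first coordinate nearly free.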

Page 56

Luckiness NML Codelength (LNML)

• Luckiness minimax regret [Grünwald 2007]

where

Page 57

Upper Bounding of LNML

Theorem 3.3.5 (uLNML: an upper bound on LNML) [Miyaguchi, Yamanishi Machine Learning 2018]

The LNML codelength is uniformly upper-bounded by

Page 58

Optimization Algorithm [Miyaguchi, Yamanishi Machine Learning 2018]

• Upper bound on LNML = concave part + convex part
• Concave-convex procedure (CCCP) [Yuille, Rangarajan NC 2002]
  ✅ Monotonically non-increasing objective
  ✅ Convergence to a saddle point
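A toy instance of CCCP (the objective here is an illustrative quadratic, not the uLNML bound itself):

```python
def cccp(u_argmin_linearized, v_grad, x0, iters=50):
    """Concave-convex procedure: to minimize u(x) + v(x) with u convex and
    v concave, repeatedly linearize v at the current point and minimize
    u(x) + v'(x_t) * x.  Each step cannot increase the objective."""
    x = x0
    for _ in range(iters):
        x = u_argmin_linearized(v_grad(x))
    return x

# Toy instance: minimize (x - 2)^2 + (-x^2 / 2); the true minimizer is x = 4.
u_step = lambda g: 2 - g / 2   # argmin_x (x - 2)^2 + g * x  ->  x = 2 - g/2
v_grad = lambda x: -x          # d/dx (-x^2 / 2) = -x
x_star = cccp(u_step, v_grad, x0=0.0)
```

Each iteration reduces to x ← 2 + x/2, which contracts toward the fixed point x = 4, illustrating the monotone convergence property claimed above.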

Page 59

Example 3.3.4 (Linear regression model)

The upper bound on LNML is calculated as

Page 60

Experiments: Linear Regression (UCI Repository) [Miyaguchi, Yamanishi Machine Learning 2018]

[Figure: test likelihood vs. sample size on Boston (d = 13), Diabetes (d = 10), MillionSongs (d = 90), and ResidentialBuilding (d = 103), comparing the proposed method with RVM]

• Better than grid-search-based methods (especially when d ≳ n)
• More robust than RVM (relevance vector machine)

Page 61

Example 3.3.5 (Graphical Model Estimation)

• m-variate Gaussian model
• The precision matrix Θ represents conditional dependencies
• The upper bound on LNML: −ln p_uLNML(x^n; γ)

Page 62

Experiments: Graphical Model (Synthetic Data) [Miyaguchi, Yamanishi Machine Learning 2018]

[Figure: KL divergence and codelength vs. number of training samples, for d = 190 and d = 4950]

• Data generated with the double-ring model
• The method can handle more than 10^3 dimensions

Page 63

Luckiness NML for L1-Penalty

• Luckiness Minimax Regret [Miyaguchi, Matsushima and Yamanishi SDM2016]

where

Page 64

3.4.2. High-dimensional Minimax Prediction

Page 65

On-line Bayesian Prediction Strategy

In the conventional low-dimensional setting, the minimax regret is attained by, e.g., the Bayesian prediction strategy [Clarke and Barron JSPI 1994][Takeuchi and Barron ISIT 1998].

Then the cumulative log loss amounts to:

Problem: no counterpart in the high-dimensional setting (d ≳ n)
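For a Bernoulli source, the Bayesian prediction strategy under the Jeffreys prior is the Krichevsky-Trofimov predictor; a sketch of its cumulative log loss (an illustration of the low-dimensional strategy, not from the slides):

```python
import math

def jeffreys_predict(k, t):
    """Krichevsky-Trofimov predictive (Bayes mixture under the Jeffreys
    prior for a Bernoulli source): P(next bit = 1 | k ones in t bits)."""
    return (k + 0.5) / (t + 1.0)

def cumulative_log_loss(x):
    """Cumulative log loss (in bits) of the sequential Bayesian strategy;
    it equals -log2 of the Bayes mixture probability of the whole sequence."""
    total, k = 0.0, 0
    for t, bit in enumerate(x):
        p1 = jeffreys_predict(k, t)
        total += -math.log2(p1 if bit else 1.0 - p1)
        k += bit
    return total
```

For a k-parameter regular model the excess of this cumulative loss over the maximized likelihood grows like (k/2) log n, matching the asymptotic formula for the parametric complexity; the point of this subsection is that no such strategy is available once d ≳ n.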

Page 66

High-dimensional Asymptotics

Assumption:
• The second derivative of the loss is uniformly upper-bounded by L.
• The ℓ1-radius of P is at most R < +∞.

Theorem 3.3.6 (Asymptotics of the worst-case regret) [Miyaguchi, Yamanishi AISTATS 2019]

Under the high-dimensional limit ω(n) = d = e^{o(n)}:

Optimal within a factor of 2.

Page 67

Minimax Predictor in the High-dimensional Setting

Bayesian prediction with the Spike-and-Tails (ST) prior [Miyaguchi and Yamanishi AISTATS2019]

[Figure: one-dimensional illustration of the ST prior: a spike at zero with exponential tails ∝ e^{−λ|θ|}; the gap between spike and tails is wide when d ≫ n]
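A one-dimensional toy stand-in for such a prior (the exact ST construction is in the paper; `w`, `eps`, and `lam` below are illustrative assumptions): a narrow uniform spike at zero mixed with exponential (Laplace) tails:

```python
import math

def st_density(theta, w=0.5, eps=1e-3, lam=1.0):
    """Toy spike-and-tails density (hypothetical parameterization):
    a (1-w)-weighted uniform spike on [-eps, eps] plus w-weighted
    Laplace tails proportional to exp(-lam * |theta|).
    Both components integrate to 1, so the mixture does too."""
    spike = (1.0 / (2 * eps)) if abs(theta) <= eps else 0.0
    tail = (lam / 2) * math.exp(-lam * abs(theta))
    return (1 - w) * spike + w * tail
```

The spike concentrates mass on exactly-sparse parameters, while the slowly decaying tails keep the predictor from being overconfident about the few large coordinates, which is the intuition behind the minimax result above.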

Page 68

Summary

• Stochastic complexity (SC), namely the NML codelength, is the well-defined key information quantity of data relative to a model class. The MDL principle is to minimize SC.
• Techniques for efficient computation of SC are established: 1) asymptotic formula, 2) g-functions, 3) Fourier transform, 4) combinatorial methods.
• When applying MDL to latent variable models, use complete variable models, and apply Latent Stochastic Complexity (LSC) or Decomposed NML (DNML) to realize efficient and optimal estimation.
• When applying MDL to high-dimensional sparse models, apply LNML to realize optimal penalty selection.

Page 69

References

■ 3.1. Stochastic Complexity and NML Codelength
・J. Rissanen: "Modeling by shortest data description," Automatica, 14:465-471, 1978.
・Y. M. Shtar'kov: "Universal sequential coding of single messages," Problemy Peredachi Informatsii, 23(3):3-17, 1987.
・J. Rissanen: "Stochastic Complexity in Statistical Inquiry," World Scientific, 1989.
・A. Barron and T. Cover: "Minimum complexity density estimation," IEEE Transactions on Information Theory, 37(4):1034-1054, 1991.
・K. Yamanishi: "A learning criterion for stochastic rules," Machine Learning, 9:165-203, 1992.
・J. Rissanen: "Fisher information and stochastic complexity," IEEE Transactions on Information Theory, 42(1):40-47, 1996.
・P. D. Grünwald: "The Minimum Description Length Principle," MIT Press, 2007.
・J. Rissanen: "Optimal Estimation of Parameters," Cambridge University Press, 2012.
・M. Kawakita and J. Takeuchi: "Barron and Cover's theory in supervised learning and its application to lasso," Proceedings of the International Conference on Machine Learning, 2016.

Page 70

References

■ 3.2. G-function and Fourier Transformation Method
・J. Rissanen: "MDL denoising," IEEE Transactions on Information Theory, 46(7):2537-2543, 2000.
・J. Rissanen: "Information and Complexity in Statistical Modeling," Springer, 2007.
・A. Suzuki and K. Yamanishi: "Exact calculation of normalized maximum likelihood code length using Fourier analysis," Proceedings of the IEEE International Symposium on Information Theory (ISIT2018), pp:1211-1215, 2018.

■ 3.3.1. Latent Stochastic Complexity
・P. Kontkanen, P. Myllymaki, W. Buntine, J. Rissanen, and H. Tirri: "An MDL framework for data clustering," in Advances in Minimum Description Length: Theory and Applications, MIT Press, pp:323-35, 2005.
・P. Kontkanen and P. Myllymäki: "A linear-time algorithm for computing the multinomial stochastic complexity," Information Processing Letters, 103(6), pp:227-233, 2007.
・I. O. Kyrgyzov, O. O. Kyrgyzov, H. Maître, M. Campedel: "Kernel MDL to determine the number of clusters," Proceedings of the Workshop on Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, vol. 4571, pp:2013-217, 2007.

Page 71

References

■ 3.3.1. Latent Stochastic Complexity (cont.)
・S. Hirai and K. Yamanishi: "Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering," IEEE Transactions on Information Theory, 59(11):7718-7727, 2013.
・Y. Sakai and K. Yamanishi: "An NML-based model selection criterion for general relational data modeling," Proceedings of the IEEE International Conference on Big Data (BigData 2013), pp:421-429, 2013.
・P. Miettinen and J. Vreeken: "MDL4BMF: Minimum description length for Boolean matrix factorization," ACM Transactions on Knowledge Discovery from Data, 8(4), 2014.
・A. Suzuki, K. Miyaguchi, and K. Yamanishi: "Structure selection for convolutive non-negative matrix factorization using normalized maximum likelihood coding," Proceedings of the IEEE International Conference on Data Mining (ICDM2016), pp:1221-1226, 2016.
・Y. Ito, S. Oeda, and K. Yamanishi: "Rank selection for non-negative matrix factorization with normalized maximum likelihood coding," Proceedings of the SIAM International Conference on Data Mining (SDM2016), pp:720-728, 2016.
・S. Squires, A. Prügel-Bennett, M. Niranjan: "Rank selection in nonnegative matrix factorization using minimum description length," Neural Computation, 29:2164-2176, 2017.

Page 72

References

■ 3.3.1. Latent Stochastic Complexity (cont.)
・T. Nakamura, T. Iwata, and K. Yamanishi: "Latent dimensionality estimation for probabilistic canonical correlation analysis using normalized maximum likelihood code-length," Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA2017), pp:716-725, 2017.
・Y. Fu, S. Matsushima, K. Yamanishi: "Model selection for non-negative tensor factorization with minimum description length," Entropy, 2019.
・S. Hirai and K. Yamanishi: "Correction to: Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering," IEEE Transactions on Information Theory, 2019.

■ 3.3.2. Decomposed Normalized Maximum Likelihood Codelength
・T. Wu, S. Sugawara, K. Yamanishi: "Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2017), pp:1165-1174, 2017.
・K. Yamanishi, T. Wu, S. Sugawara, M. Okada: "The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models," Data Mining and Knowledge Discovery, 33(4):1017-1058, 2019.

Page 73

References

■ 3.4. High-dimensional Sparse Model Selection
・S. Chatterjee and A. Barron: "Information-theoretic validity of penalized likelihood," Proceedings of the IEEE International Symposium on Information Theory, pp:3027-3031, 2014.
・K. Miyaguchi, S. Matsushima, and K. Yamanishi: "Sparse graphical modeling via stochastic complexity," Proceedings of the 2017 SIAM International Conference on Data Mining (SDM2017), pp:723-731, 2017.
・K. Miyaguchi and K. Yamanishi: "High-dimensional penalty selection via minimum description length principle," Machine Learning, 107(8-10), pp:1283-1302, 2018.
・K. Miyaguchi and K. Yamanishi: "Adaptive minimax regret against smooth logarithmic losses over high-dimensional l1-balls via envelope complexity," Proceedings of Artificial Intelligence and Statistics (AISTATS 2019), pp:3440-3448, 2019.

C.f. Bayesian prediction related to MDL
・B. Clarke and A. Barron: "Jeffreys' prior is asymptotically least favorable under entropy risk," Journal of Statistical Planning and Inference, 41(1):37-60, 1994.
・J. Takeuchi and A. Barron: "Asymptotically minimax regret by Bayes mixtures," Proceedings of the IEEE International Symposium on Information Theory, 1998.