Modern MDL Meets Data Mining: Insight, Theory, and Practice
Transcript of the tutorial slides
Source: eda.mmci.uni-saarland.de/events/mdldm19/slides/mdldm19-part3.pdf
Modern MDL Meets Data Mining: Insight, Theory, and Practice
Part III: Stochastic Complexity
Kenji Yamanishi
Graduate School of Information Science and Technology, the University of Tokyo
August 4th, 2019
KDD Tutorial
Part III. Stochastic Complexity
3.1. Stochastic Complexity and NML Codelength
3.2. G-function and Fourier Transformation
3.3. Latent Variable Model Selection
3.3.1. Latent Stochastic Complexity
3.3.2. Decomposed NML Codelength
3.3.3. Applications to a Variety of Models
3.4. High-dimensional Sparse Model Selection
3.4.1. High-dimensional Penalty Selection
3.4.2. High-dimensional Minimax Prediction
3.1. Stochastic Complexity and NML Codelength
What is Information?
■ Case 1: The data distribution is known
Data sequence:
Probability mass (density) function:
Shannon information
Theorem 3.1.1 (Shannon’s Source Coding Theorem)
Shannon information gives the optimal codelength when the true distribution is known in advance.
Characterization of Shannon Information
Kraft’s Inequality
Entropy
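In standard notation (a reconstruction of the formulas the slide omits; x^n = x_1 ⋯ x_n denotes the data sequence and p the true distribution):

```latex
% Shannon information of the sequence x^n under the known distribution p
I(x^n) \;=\; -\log p(x^n)
% Kraft's inequality: any prefix codelength function L satisfies
\sum_{x^n} 2^{-L(x^n)} \;\le\; 1
% Source coding theorem: the expected codelength of any prefix code is
% bounded below by the entropy,
\mathbb{E}[L(X^n)] \;\ge\; H(p) \;=\; \mathbb{E}[-\log p(X^n)],
% with equality (up to one bit) for L(x^n) = \lceil -\log p(x^n) \rceil.
```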
■ Case 2: The data distribution is unknown
Data sequence:
Probabilistic model class:
Normalized Maximum Likelihood (NML) Distribution
Stochastic complexity of x relative to the model class = NML codelength
Parametric complexity of the model class
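The standard definitions behind these three terms (a reconstruction; θ̂(x^n) denotes the maximum likelihood estimate):

```latex
% NML distribution for model class \mathcal{M} = \{ p(\cdot\,; \theta) \}
p_{\mathrm{NML}}(x^n) \;=\; \frac{p(x^n; \hat\theta(x^n))}{C_n(\mathcal{M})},
\qquad
C_n(\mathcal{M}) \;=\; \sum_{y^n} p(y^n; \hat\theta(y^n))
\quad \text{(an integral for continuous data)}.
% Stochastic complexity = NML codelength:
SC(x^n; \mathcal{M}) \;=\; -\log p_{\mathrm{NML}}(x^n)
\;=\; -\log p(x^n; \hat\theta(x^n)) \;+\; \log C_n(\mathcal{M}),
% where \log C_n(\mathcal{M}) is the parametric complexity of \mathcal{M}.
```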
Theorem 3.1.2 (Minimax optimality of NML codelength) [Shtarkov, Probl. Inf. Trans. 1987]
The NML codelength achieves the minimum of Shtarkov's minimax risk, i.e., the worst-case regret relative to the maximum-likelihood baseline.
Characterization of NML Codelength
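Shtarkov's minimax risk and its solution can be written as follows (a reconstruction of the slide's formula):

```latex
\min_{q}\, \max_{x^n}\, \log \frac{p(x^n; \hat\theta(x^n))}{q(x^n)}
\;=\; \log C_n(\mathcal{M}),
% attained uniquely by q = p_{\mathrm{NML}}; the baseline is the maximized
% likelihood p(x^n; \hat\theta(x^n)).
```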
How to calculate Parametric Complexity
Theorem 3.1.3 (Asymptotic formula for parametric complexity) [Rissanen IEEE IT 1996]
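For a k-dimensional model class satisfying the usual regularity conditions, Rissanen's formula reads (I(θ) is the Fisher information matrix):

```latex
\log C_n(\mathcal{M})
\;=\; \frac{k}{2}\,\log\frac{n}{2\pi}
\;+\; \log \int_{\Theta} \sqrt{\det I(\theta)}\; d\theta
\;+\; o(1) \qquad (n \to \infty).
```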
Example 3.1.1 (Multinomial Distribution)
Stochastic Complexity
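As a concrete instance, the parametric complexity of a K-category multinomial over n observations can be evaluated directly from its definition by enumerating all count vectors (a minimal sketch; the function name is ours):

```python
from math import factorial

def multinomial_complexity_bruteforce(n, K):
    """Exact parametric complexity C_n of the K-category multinomial:
    C_n = sum over count vectors (n_1, ..., n_K) summing to n of
          n! / (n_1! ... n_K!) * prod_k (n_k / n)^{n_k}   (with 0^0 = 1)."""
    def compositions(n, K):
        # all ways to split n into K ordered non-negative parts
        if K == 1:
            yield (n,)
            return
        for first in range(n + 1):
            for rest in compositions(n - first, K - 1):
                yield (first,) + rest

    total = 0.0
    for counts in compositions(n, K):
        coef = factorial(n)
        for c in counts:
            coef //= factorial(c)  # multinomial coefficient
        lik = 1.0
        for c in counts:
            if c:
                lik *= (c / n) ** c  # maximized likelihood of these counts
        total += coef * lik
    return total
```

This enumeration costs O(n^(K−1)) and is only feasible for tiny n and K; the linear-time recurrence of Kontkanen and Myllymäki (Theorem 3.3.2 below) removes this bottleneck.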
Example 3.1.2 (1-dimensional Gaussian distribution)
Restricted Parameter Space
Fisher Information
Parameter Space
m.l.e.
Stochastic complexity
Characterization of Stochastic Complexity
Theorem 3.1.4 (Rissanen's lower bound) [Rissanen 1989]
Stochastic complexity gives the optimal codelength when the true distribution is unknown.
Sequential NML Codelength
Sequential NML (SNML) = cumulative codelength for sequential NML coding
Theorem 3.1.5 (Property of SNML): for model classes under some regularity conditions,
e.g. the regression model [Rissanen IEEE IT 2000] [Roos, Myllymäki, Rissanen MVA 2009]
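The SNML predictive distribution and its cumulative codelength are defined as follows (a reconstruction in the notation used above):

```latex
p_{\mathrm{SNML}}(x_t \mid x^{t-1})
\;=\; \frac{p(x^t; \hat\theta(x^t))}
           {\sum_{y} p(x^{t-1} y;\, \hat\theta(x^{t-1} y))},
\qquad
L_{\mathrm{SNML}}(x^n) \;=\; -\sum_{t=1}^{n} \log p_{\mathrm{SNML}}(x_t \mid x^{t-1}),
% where the sum over y normalizes over the possible next outcomes
% (an integral for continuous data).
```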
Example 3.1.3 (Autoregression model)
where
When σ is fixed, the SNML distribution becomes
MDL Criterion (1/2)
MDL (Minimum Description Length) criterion
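In terms of the stochastic complexity defined in 3.1, the MDL criterion selects the model class minimizing the total codelength:

```latex
\hat{\mathcal{M}}
\;=\; \operatorname*{arg\,min}_{\mathcal{M}}
      \Big\{ -\log p(x^n; \hat\theta_{\mathcal{M}}(x^n))
             \;+\; \log C_n(\mathcal{M}) \Big\}.
```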
MDL Criterion (2/2) [Rissanen IEEE IT 1996]
Example 3.1.4. (Histogram Density)
(Figure: a histogram density with K bins on [0, 1])
Optimality of MDL Estimation
Theorem 3.1.6 (Optimality of MDL estimator) [Rissanen 2012]
Why MDL?
・Optimal solution to Shtarkov's minimax risk
・Attains Rissanen's lower bound
・Consistency [Rissanen Auto. Control 1978] [Barron 1989]
・Index of resolvability [Barron and Cover IEEE IT 1991]
・Rapid convergence rate in PAC learning [Yamanishi Mach. Learning 1992] [Rissanen, Yu, Learn. and Geo. 1996] [Chatterjee and Barron ISIT 2014] [Kawakita, Takeuchi ICML 2016]
・Estimation optimality [Rissanen 2012]
・A huge number of case studies
Andrew Barron
3.2. G-Function and Fourier Transformation
Computational Problem in Parametric Complexity
The parametric complexity is hard to calculate for a general model class M. To go beyond Rissanen's asymptotic formula for small data, we want to:
1) calculate it non-asymptotically;
2) calculate it exactly and efficiently.
Approaches: A) g-function, B) Fourier transformation, C) combinatorial methods.
g-Function
where
Density decomposition [Rissanen 2007, 2012]
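In outline, the density decomposition behind the g-function is the following (a sketch; g(η; θ) denotes the sampling density of the MLE η = θ̂(X^n) under θ, and the conditional density given the MLE is assumed free of θ):

```latex
f(x^n; \theta) \;=\; g(\hat\theta(x^n); \theta)\; f(x^n \mid \hat\theta(x^n)),
\qquad
C_n(\mathcal{M}) \;=\; \int_{\Theta} g(\eta; \eta)\, d\eta,
% so the parametric complexity reduces to a single integral
% over the parameter space.
```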
Calculation of Parametric Complexity via g-function
Variable transformation
Example 3.2.1 (g-function for exponential distributions)
:m.l.e.
⇒
⇒ Extended to the general exponential family [Hirai and Yamanishi IEEE IT 2013]
Then
Fourier Transformation Method (1/2)
Theorem 3.2.1 (Parametric complexity based on Fourier transformation) [Suzuki and Yamanishi ISIT 2018]
Fourier Transformation Method (2/2)
Exponential Family
Theorem 3.2.2 (Parametric complexity of Exponential Family with Fourier method) [Suzuki and Yamanishi ISIT 2018]
Parametric Complexity of Exponential Family Computed with Fourier Method
[Suzuki and Yamanishi ISIT 2018]
3.3. Latent Variable Model Selection
Motivation 1. Topic Model
How many topics lie in a document?
(Figure: K topic–word distributions over D documents)
Topic 1: machine 0.05, computer 0.03, algorithm 0.02, …
Topic 2: matrix 0.04, math 0.03, equation 0.01, …
Topic K: bio 0.05, gene 0.04, array 0.01, …
Documents 1, 2, …, D
Latent Variable Model Selection
#topics = 3
LDA: Latent Dirichlet Allocation
Motivation 2: Relational Model
How many blocks lie in relational data?
(Figure: a relational matrix whose rows and columns are permuted to reveal block structure)
#blocks = 6
Latent Variable Models
Example 3.4.1 (Gaussian mixture model)
(Figure: mixture components Z = 1, 2, …, k)
Nonidentifiability of Latent Variable Model
・Marginalized model (finite mixture model): non-identifiable; has singular regions
・Complete variable model (estimate Z from X): identifiable
(Figure: K latent components Z = 1, …, K generating N outputs X_1, …, X_N)
Normalized maximum likelihood (NML) codelength for the complete variable model: L_NML(X, Z; K)
3.3.1. Latent Stochastic Complexity
Codelength for X and Z simultaneously
Latent Stochastic Complexity (LSC)
Latent Parametric Complexity
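The LSC is the NML codelength of the complete data (X, Z) (a reconstruction; the normalizer sums over all complete data sequences):

```latex
L_{\mathrm{NML}}(x^n, z^n; K)
\;=\; -\log p(x^n, z^n; \hat\theta(x^n, z^n))
\;+\; \log \sum_{y^n, w^n} p(y^n, w^n; \hat\theta(y^n, w^n)),
% the second term being the latent parametric complexity.
```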
Finite Mixture Model
Latent Parametric Complexity
Parametric Complexity of FMM
Theorem 3.3.1 (Recurrence relation for the parametric complexity of FMM) [Hirai and Yamanishi IEEE IT 2013]
Gaussian Mixture Models
Example 3.3.1 (Parametric complexity of multivariate GMM)
where
[Hirai Yamanishi IEEE IT 2013, 2019]
Latent Variable Model Selection with Latent Stochastic Complexity (1/2)
・Naïve Bayes Model [Kontkanen and Myllymaki 2008]
・Gaussian Mixture Model [Kyrgyzov, Kyrgyzov, Maître, Campedel MLDM 2007] [Hirai and Yamanishi IEEE IT 2013, 2019]
・General Relation Model [Sakai, Yamanishi IEEE BigData 2013]
・Principal Component Analysis / Canonical Component Analysis [Archambeau, Bach NIPS 2009] [Virtanen, Klami, Kaski ICML 2011] [Nakamura, Iwata, Yamanishi DSAA 2017]
Latent Variable Model Selection with Latent Stochastic Complexity (2/2)
・Non-negative Matrix Factorization (NMF) rank estimation [Miettinen, Vreeken KDD 2011] [Yamauchi, Kawakita, Takeuchi ICONIP 2012] [Ito, Oeda, Yamanishi SDM 2016] [Squire, Benett, Niranjyan NeuroComput 2017]
c.f. [Cemgil CIN 2009] variational Bayes; [Hoffman, Blei, Cook 2010] non-parametric Bayes
・Convolutional NMF [Suzuki, Miyaguchi, Yamanishi ICDM 2016]
・Non-negative Tensor Factorization (NTF) rank estimation [Fu, Matsushima, Yamanishi Entropy 2019]
(Figure: K latent components Z = 1, …, K generating N outputs; the codelengths for Z and for X given Z are computed separately)
Computable efficiently and non-asymptotically
[Wu, Sugawara, Yamanishi KDD2017][Yamanishi, Wu, Sugawara, Okada DAMI2019]
3.3.2. Decomposed Normalized Maximum Likelihood (DNML) Codelength
How to Compute DNML
DNML criterion
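The decomposition, as proposed in [Wu, Sugawara, Yamanishi KDD2017], encodes Z and then X given Z, each part with its own NML normalizer (a sketch in the notation above):

```latex
L_{\mathrm{DNML}}(x^n, z^n; K)
\;=\; L_{\mathrm{NML}}(x^n \mid z^n) \;+\; L_{\mathrm{NML}}(z^n; K),
% where each term is an NML codelength with its own parametric
% complexity; for finite mixture models z^n follows a multinomial
% distribution, so its complexity is computable by the recurrence
% of Theorem 3.3.2.
```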
Finite Mixture Models
Multinomial distribution
Efficient Computation of Parametric Complexity for the Multinomial Distribution
Theorem 3.3.2 (Recurrence relation of the parametric complexity of the multinomial distribution) [Kontkanen, Myllymaki IPL 2006]
Petri Myllymäki
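The recurrence can be implemented in a few lines; a minimal sketch (0^0 is taken as 1, and the function name is ours):

```python
from math import comb

def multinomial_complexity(n, K):
    """Parametric complexity C(n, K) of the K-category multinomial via the
    Kontkanen-Myllymaki recurrence:
        C(n, 1) = 1
        C(n, 2) = sum_{k=0}^{n} comb(n, k) (k/n)^k ((n-k)/n)^(n-k)
        C(n, K) = C(n, K-1) + (n / (K-2)) * C(n, K-2)   for K >= 3,
    which runs in O(n + K) time once the C(n, 2) sum is evaluated."""
    if K == 1:
        return 1.0
    c_km2 = 1.0  # C(n, 1)
    c_km1 = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                for k in range(n + 1))  # C(n, 2); Python gives 0.0 ** 0 == 1
    for j in range(3, K + 1):
        c_km2, c_km1 = c_km1, c_km1 + (n / (j - 2)) * c_km2
    return c_km1
```

The parametric-complexity term of the multinomial NML codelength is then log multinomial_complexity(n, K), and the same routine supplies the latent part of the DNML codelength for finite mixture models.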
3.3.3 Applications to a Variety of Classes
Example 3.3.2 (Latent Dirichlet Allocation: LDA)
where
Empirical Evaluation (Synthetic Data)
(Figure: estimated #topics vs. sample size, and error rate vs. noise ratio, with the true #topics marked)
DNML converges to the true model as rapidly as LSC, but more slowly than AIC.
DNML is more robust against noise than AIC.
DNML is able to identify the true number of topics of LDA most robustly.
[Wu, Sugawara, Yamanishi KDD2017]
Empirical Evaluation (Benchmark Data)
| Method | 2 topics | 3 topics | 4 topics | 5 topics | 6 topics |
|---|---|---|---|---|---|
| DNML | 2 | 3 | 4 | 5 | 7 |
| AIC | 7 | 4 | 9 | 7 | 7 |
| Laplace | 8 | 4 | 8 | 9 | 5 |
| ELBO | 3 | 4 | 4 | 4 | 3 |
| CV | 9 | 1 | 3 | 7 | 1 |
| HDP | 80 | 98 | 94 | 92 | 98 |
DNML is able to identify the true number of topics most accurately
Newsgroup Dataset
[Wu, Sugawara, Yamanishi KDD2017]
Example 3.3.3 (Stochastic Block Models: SBM)
where
Empirical Evaluation (Synthetic Data)
(Figure: estimated K vs. sample size, with the true value marked)
AIC increases most rapidly but overfits the data at large sample sizes; DNML increases more slowly but converges to the true value.
[Wu, Sugawara, Yamanishi KDD2017]
Example 3.3.4 (Gaussian Mixture Models: GMM)
Empirical Evaluation (Synthetic Data)
DNML performs best and can identify small components exactly.
(Settings: m = 5, K* = 10; and m = 5, K* = 5 with small components)
[Yamanishi et al. DAMI 2019]
Minimax Optimality of DNML
Theorem 3.3.3 (Estimation Optimality of DNML) [Yamanishi, Wu, Sugawara, Okada DAMI 2019]
KL-Divergence
Wide Applicability of DNML
DNML is universally applicable to model selection for a wide class of hierarchical latent variable models.
DNML is a conclusive solution to latent variable model selection.
Classes of Hierarchical Latent Variable Models
3.4. High-dimensional Sparse Model Selection
3.4.1. High-dimensional Penalty Selection
Goal) Identify the "essential" parameter subset in high-dimensional models
• To achieve better generalization
• To realize knowledge discovery in a lower-dimensional space
Method) Minimize the regularized log loss, where λ = (λ_1, …, λ_d) is the penalty:
• Small λ_j for essential parameters
• Large λ_j for unnecessary parameters
Luckiness NML Codelength (LNML)
• Luckiness minimax regret [Grünwald 2007]
Peter Grünwald
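The luckiness NML distribution attaining this regret can be written as follows (a sketch; pen_λ denotes the penalty, so e^{−pen_λ} plays the role of Grünwald's luckiness function):

```latex
p_{\mathrm{LNML}}(x^n)
\;=\; \frac{\max_{\theta}\, p(x^n; \theta)\, e^{-\mathrm{pen}_{\lambda}(\theta)}}
           {\int \max_{\theta}\, p(y^n; \theta)\, e^{-\mathrm{pen}_{\lambda}(\theta)}\, dy^n},
% the corresponding codelength -\log p_{\mathrm{LNML}}(x^n) minimizes
% the luckiness minimax regret over all distributions q.
```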
Upper Bound on LNML
Theorem 3.3.5 (uLNML: an upper bound on LNML) [Miyaguchi, Yamanishi Machine Learning 2018]
The LNML codelength is uniformly upper-bounded by
Optimization Algorithm
• Upper bound on LNML = concave + convex function
• Concave-convex procedure (CCCP) [Yuille, Rangarajan NC 2002]
  ✅ Monotone non-increasing optimization
  ✅ Convergence to a saddle point
[Miyaguchi, Yamanishi Machine Learning 2018]
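CCCP minimizes a concave-plus-convex objective by repeatedly linearizing the concave part and solving the remaining convex problem exactly. A minimal one-dimensional sketch on the toy objective f(x) = x⁴ − x² (our illustrative example, not the uLNML objective itself):

```python
def cccp_step(x_t):
    """One CCCP update for f(x) = x**4 - x**2, split as
    convex part x**4 plus concave part -x**2.
    Linearizing the concave part at x_t gives the convex surrogate
    x**4 - 2*x_t*x, minimized exactly by solving 4*x**3 = 2*x_t."""
    s = 1.0 if x_t >= 0 else -1.0
    return s * (abs(x_t) / 2.0) ** (1.0 / 3.0)

x = 1.0
for _ in range(60):
    x = cccp_step(x)  # each step cannot increase f(x), by the CCCP guarantee
# x approaches 1/sqrt(2), a stationary point (here the global minimizer for x > 0)
```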
Example 3.3.4 (Linear regression model)
The upper bound on LNML is calculated as
Experiments: Linear Regression (UCI Repository)
(Figure: test likelihood vs. sample size on Boston (d = 13), Diabetes (10), MillionSongs (90), and ResidentialBuilding (103); curves for the proposed method and RVM)
• Better than grid-search-based methods (especially when d ≳ n)
• More robust than RVM (relevance vector machine)
[Miyaguchi, Yamanishi Machine Learning 2018]
Example 3.3.5 (Graphical Model Estimation)
• m-variate Gaussian model; the precision matrix Θ represents conditional dependencies
• The upper bound on LNML: −ln p_uLNML(x^n | γ)
Experiments: Graphical Model (Synthetic Data)
Generated with a double-ring model; can handle > 10³ dimensions
(Figure: KL divergence and codelength vs. number of training samples, for d = 190 and d = 4950)
[Miyaguchi, Yamanishi Machine Learning 2018]
Luckiness NML for L1-Penalty
• Luckiness minimax regret [Miyaguchi, Matsushima and Yamanishi SDM 2016]
3.4.2. High-dimensional Minimax Prediction
On-line Bayesian Prediction Strategy
In the conventional low-dimensional setting, the minimax regret is attained by, e.g., the Bayesian prediction strategy [Clarke and Barron JSPI 1994] [Takeuchi and Barron ISIT 1998].
Then the cumulative log loss amounts to
Problem: no counterpart in the high-dimensional setting (d ≳ n)
High-dimensional Asymptotics
Assumptions:
• The second derivative of the loss is uniformly upper-bounded by L
• The ℓ1-radius of the parameter space is at most R < +∞
Theorem 3.3.6 (Asymptotics of the worst-case regret) [Miyaguchi, Yamanishi AISTATS 2019]
Under the high-dimensional limit ω(n) = d = e^{o(n)},
Optimal within a factor of 2
Minimax Predictor in the High-dimensional Setting
Bayesian prediction with the Spike-and-Tails (ST) prior [Miyaguchi and Yamanishi AISTATS 2019]
• One-dimensional illustration of the ST prior: a spike at zero with exponential tails ∝ e^{−λ|θ|}; the gap between spike and tails is wide when d ≫ n
Summary
• Stochastic complexity (SC), namely the NML codelength, is the well-defined key information quantity of data relative to a model. The MDL principle is to minimize SC.
• Techniques for efficient computation of SC are established: 1) asymptotic formula, 2) g-functions, 3) Fourier transform, 4) combinatorial methods.
• When applying MDL to latent variable models, use complete variable models and apply the Latent Stochastic Complexity (LSC) or the Decomposed NML (DNML) codelength to realize efficient and optimal estimation.
• When applying MDL to high-dimensional sparse models, apply the LNML codelength to realize optimal penalty selection.
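The combinatorial method above can be illustrated for the multinomial model, where the Shtarkov normalizer admits the linear-time recurrence of Kontkanen and Myllymäki (2007), cited in the references. A minimal sketch:

```python
import math
from math import comb

def log_multinomial_complexity(n, m):
    # Parametric complexity log C(m, n) (in nats) of the multinomial
    # model with m categories and sample size n, computed exactly.
    # C(2, n) is the Shtarkov sum for the Bernoulli model; larger m uses
    # the linear-time recurrence of Kontkanen and Myllymaki (2007):
    #   C(m, n) = C(m-1, n) + (n / (m-2)) * C(m-2, n)
    c_prev = 1.0  # C(1, n): a single category has no free parameter
    c_curr = sum(comb(n, k) * (k / n) ** k * ((n - k) / n) ** (n - k)
                 for k in range(n + 1))  # C(2, n); note 0 ** 0 == 1
    if m == 1:
        return math.log(c_prev)
    for j in range(3, m + 1):
        c_prev, c_curr = c_curr, c_curr + (n / (j - 2)) * c_prev
    return math.log(c_curr)

# Codelength penalty for a 2-category multinomial with n = 100:
print(log_multinomial_complexity(100, 2))  # roughly (1/2) ln n + const
```

This exact penalty is what NML-based criteria add to the maximized log-likelihood, in place of the asymptotic (k/2) log n term of BIC.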
References
■ 3.1. Stochastic Complexity and NML Codelength
・J. Rissanen: “Modeling by shortest data description,” Automatica, 14:465–471, 1978.
・Y. M. Shtar'kov: “Universal sequential coding of single messages,” Problemy Peredachi Informatsii, 23(3):3–17, 1987.
・J. Rissanen: “Stochastic Complexity in Statistical Inquiry,” World Scientific, 1989.
・A. Barron and T. Cover: “Minimum complexity density estimation,” IEEE Transactions on Information Theory, 37(4):1034–1054, 1991.
・K. Yamanishi: “A learning criterion for stochastic rules,” Machine Learning, 9:165–203, 1992.
・J. Rissanen: “Fisher information and stochastic complexity,” IEEE Transactions on Information Theory, 42(1):40–47, 1996.
・P. D. Grünwald: “The Minimum Description Length Principle,” MIT Press, 2007.
・J. Rissanen: “Optimal Estimation of Parameters,” Cambridge University Press, 2012.
・M. Kawakita and J. Takeuchi: “Barron and Cover's theory in supervised learning and its applications to lasso,” Proceedings of the International Conference on Machine Learning (ICML 2016), 2016.
■ 3.2. G-function and Fourier Transformation Method
・J. Rissanen: “MDL denoising,” IEEE Transactions on Information Theory, 46(7):2537–2543, 2000.
・J. Rissanen: “Information and Complexity in Statistical Modeling,” Springer, 2007.
・A. Suzuki and K. Yamanishi: “Exact calculation of normalized maximum likelihood code length using Fourier analysis,” Proceedings of the IEEE International Symposium on Information Theory (ISIT 2018), pp:1211–1215, 2018.
■ 3.3.1. Latent Stochastic Complexity
・P. Kontkanen, P. Myllymäki, W. Buntine, J. Rissanen, and H. Tirri: “An MDL framework for data clustering,” in Advances in Minimum Description Length: Theory and Applications, MIT Press, pp:323–35, 2005.
・P. Kontkanen and P. Myllymäki: “A linear-time algorithm for computing the multinomial stochastic complexity,” Information Processing Letters, 103(6):227–233, 2007.
・I. O. Kyrgyzov, O. O. Kyrgyzov, H. Maître, and M. Campedel: “Kernel MDL to determine the number of clusters,” Proceedings of the Workshop on Machine Learning and Data Mining in Pattern Recognition (MLDM 2007), Lecture Notes in Computer Science, vol. 4571, pp:203–217, 2007.
■ 3.3.1. Latent Stochastic Complexity (Cont.)
・S. Hirai and K. Yamanishi: “Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering,” IEEE Transactions on Information Theory, 59(11):7718–7727, 2013.
・Y. Sakai and K. Yamanishi: “An NML-based model selection criterion for general relational data modeling,” Proceedings of the IEEE International Conference on Big Data (BigData 2013), pp:421–429, 2013.
・P. Miettinen and J. Vreeken: “MDL4BMF: Minimum Description Length for Boolean Matrix Factorization,” ACM Transactions on Knowledge Discovery from Data, 8(4), 2014.
・A. Suzuki, K. Miyaguchi, and K. Yamanishi: “Structure selection for convolutive non-negative matrix factorization using normalized maximum likelihood coding,” Proceedings of the IEEE International Conference on Data Mining (ICDM 2016), pp:1221–1226, 2016.
・Y. Ito, S. Oeda, and K. Yamanishi: “Rank selection for non-negative matrix factorization with normalized maximum likelihood coding,” Proceedings of the SIAM International Conference on Data Mining (SDM 2016), pp:720–728, 2016.
・S. Squires, A. Prügel-Bennett, and M. Niranjan: “Rank selection in nonnegative matrix factorization using minimum description length,” Neural Computation, 29:2164–2176, 2017.
■ 3.3.1. Latent Stochastic Complexity (Cont.)
・T. Nakamura, T. Iwata, and K. Yamanishi: “Latent dimensionality estimation for probabilistic canonical correlation analysis using normalized maximum likelihood code-length,” Proceedings of the 4th IEEE International Conference on Data Science and Advanced Analytics (DSAA 2017), pp:716–725, 2017.
・Y. Fu, S. Matsushima, and K. Yamanishi: “Model selection for non-negative tensor factorization with minimum description length,” Entropy, 2019.
・S. Hirai and K. Yamanishi: “Correction to Efficient computation of normalized maximum likelihood codes for Gaussian mixture models with its applications to clustering,” IEEE Transactions on Information Theory, 2019.
■ 3.3.2. Decomposed Normalized Maximum Likelihood Codelength
・T. Wu, S. Sugawara, and K. Yamanishi: “Decomposed normalized maximum likelihood codelength criterion for selecting hierarchical latent variable models,” Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2017), pp:1165–1174, 2017.
・K. Yamanishi, T. Wu, S. Sugawara, and M. Okada: “The decomposed normalized maximum likelihood code-length criterion for selecting hierarchical latent variable models,” Data Mining and Knowledge Discovery, 33(4):1017–1058, 2019.
■ 3.4. High-dimensional Sparse Model Selection
・S. Chatterjee and A. Barron: “Information-theoretic validity of penalized likelihood,” Proceedings of the IEEE International Symposium on Information Theory, pp:3027–3031, 2014.
・K. Miyaguchi, S. Matsushima, and K. Yamanishi: “Sparse graphical modeling via stochastic complexity,” Proceedings of the SIAM International Conference on Data Mining (SDM 2017), pp:723–731, 2017.
・K. Miyaguchi and K. Yamanishi: “High-dimensional penalty selection via minimum description length principle,” Machine Learning, 107(8-10):1283–1302, 2018.
・K. Miyaguchi and K. Yamanishi: “Adaptive minimax regret against smooth logarithmic losses over high-dimensional ℓ1-balls via envelope complexity,” Proceedings of the International Conference on Artificial Intelligence and Statistics (AISTATS 2019), pp:3440–3448, 2019.

Cf. Bayesian prediction related to MDL:
・B. Clarke and A. Barron: “Jeffreys' prior is asymptotically least favorable under entropy risk,” Journal of Statistical Planning and Inference, 41(1):37–60, 1994.
・J. Takeuchi and A. Barron: “Asymptotically minimax regret by Bayes mixtures,” Proceedings of the IEEE International Symposium on Information Theory, 1998.