Brief Review of Probability and Statistics: Probability Distributions (Continuous Distributions)
Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs
Transcript of Pattern Recognition and Machine Learning-Chapter 2: Probability Distributions (Part 2) + Graphs
Affiliation: Kyoto University
Name: Kevin Chien, Dr. Oba Shigeyuki, Dr. Ishii Shin
Date: Nov 04, 2011
Terminologies
For understanding distributions
Terminologies
• Schur complement: expresses the relationship between the blocks of a partitioned matrix and the blocks of its inverse.
• Completing the square: converting a quadratic of the form ax^2 + bx + c to a(x + b/(2a))^2 + const, either to equate quadratic components with a standard Gaussian and identify unknowns, or to solve the quadratic.
• Robbins-Monro algorithm: iterative root finding for an unobserved regression function M(x) that is only available as the mean of noisy observations, i.e. E[N(x)] = M(x) (a small sketch follows below).
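A minimal sketch of the Robbins-Monro iteration, assuming a hypothetical noisy observation function `noisy_obs` whose expectation is the regression function M(θ); the step sizes a_n must satisfy the conditions listed on the next slide.

```python
import random

def robbins_monro(noisy_obs, theta0=0.0, n_steps=10000):
    """Find theta with M(theta) = 0, where M(theta) = E[noisy_obs(theta)] and M is increasing."""
    theta = theta0
    for n in range(1, n_steps + 1):
        a_n = 1.0 / n                       # step sizes: sum a_n = inf, sum a_n^2 < inf
        theta = theta - a_n * noisy_obs(theta)
    return theta

# Illustrative example: M(theta) = theta - 2, observed with additive noise; the root is theta = 2.
noisy = lambda t: (t - 2.0) + random.gauss(0.0, 1.0)
print(robbins_monro(noisy))                 # approximately 2.0
```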
Terminologies (cont.)
• Conditions on the Robbins-Monro step sizes a_n [Stochastic approximation, Wikipedia, 2011]:
  – a_n ≥ 0, Σ_n a_n = ∞, Σ_n a_n^2 < ∞ (the steps shrink, but not too fast).
• Trace Tr(W) is the sum of the diagonal elements.
• Degrees of freedom: the dimension of a subspace; here it refers to a hyperparameter.
Distributions
Gaussian distributions and their motivation
Conditional Gaussian Distribution
• Derivation of the conditional mean and variance:
  – noting the Schur complement.
• Linear Gaussian model: observations are a weighted sum of underlying latent variables. The conditional mean is linear w.r.t. the conditioning variable x_b, and the conditional variance is independent of x_b.
• Assume y = x_a and x = x_b (see the standard result below).
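For reference, the standard result (Bishop §2.3.1) for a Gaussian over the partitioned vector x = (x_a, x_b), with mean (μ_a, μ_b) and covariance blocks Σ_aa, Σ_ab, Σ_bb; the conditional covariance is exactly the Schur complement of Σ_bb:

```latex
p(x_a \mid x_b) = \mathcal{N}\!\left(x_a \,\middle|\, \mu_a + \Sigma_{ab}\Sigma_{bb}^{-1}(x_b - \mu_b),\;
\Sigma_{aa} - \Sigma_{ab}\Sigma_{bb}^{-1}\Sigma_{ba}\right)
```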
Marginal Gaussian Distribution
• The goal is again to identify the mean and variance by 'completing the square'.
• Solve the marginalization integral, noting the Schur complement, and compare components (result below).
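Completing the square inside the integral gives the well-known result that the marginal simply picks out the corresponding mean and covariance blocks:

```latex
p(x_a) = \int p(x_a, x_b)\,\mathrm{d}x_b = \mathcal{N}\!\left(x_a \mid \mu_a,\ \Sigma_{aa}\right)
```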
Bayesian relationship with Gaussian distr. (quick view)
• Consider a multivariate Gaussian over the joint variable; then, according to Bayes' equation, the conditional is the joint divided by the marginal.
• The conditional Gaussian must therefore have a form whose exponent is the difference of the exponents (quadratic forms) of p(x,y) and p(x).
Bayesian relationship with Gaussian distr.
• Starting from a Gaussian prior p(x) and a linear-Gaussian likelihood p(y|x).
• Mean and variance for the joint Gaussian distribution p(x,y).
• Mean and variance for p(x|y).
• p(x) can be seen as the prior, p(y|x) as the likelihood, and p(x|y) as the posterior (standard results below).
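The standard results for this linear-Gaussian model (Bishop §2.3.3), presumably the equations shown on the slide:

```latex
p(x) = \mathcal{N}(x \mid \mu, \Lambda^{-1}), \qquad
p(y \mid x) = \mathcal{N}(y \mid A x + b, L^{-1})

p(y) = \mathcal{N}\!\left(y \mid A\mu + b,\ L^{-1} + A\Lambda^{-1}A^{\mathsf T}\right)

p(x \mid y) = \mathcal{N}\!\left(x \mid \Sigma\{A^{\mathsf T}L(y - b) + \Lambda\mu\},\ \Sigma\right),
\qquad \Sigma = (\Lambda + A^{\mathsf T} L A)^{-1}
```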
Bayesian relationship with Gaussian distr., sequential estimation
• Estimate the mean from (N-1)+1 observations: update the estimate based on the first N-1 points with the contribution of the N-th point.
• The Robbins-Monro algorithm has exactly this form, and the maximum-likelihood mean can be obtained as the root it converges to.
  – solve for μ_ML by Robbins-Monro (sketch below).
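A minimal sketch of the sequential maximum-likelihood mean update, μ_ML^(N) = μ_ML^(N-1) + (1/N)(x_N − μ_ML^(N-1)), which has the Robbins-Monro form with step size 1/N:

```python
def sequential_mean(observations):
    """Update the ML estimate of the mean one observation at a time."""
    mu = 0.0
    for n, x_n in enumerate(observations, start=1):
        mu += (x_n - mu) / n        # mu_N = mu_{N-1} + (1/N) * (x_N - mu_{N-1})
    return mu

print(sequential_mean([1.0, 2.0, 3.0, 4.0]))   # 2.5, same as the batch mean
```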
Bayesian relationship with Univariate Gaussian distr.
• The conjugate prior for the precision (inverse variance) of a univariate Gaussian with known mean is the gamma distribution (prior and posterior update below).
• When both mean and precision are unknown, the conjugate prior of the univariate Gaussian is the Gaussian-gamma distribution.
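Concretely (Bishop §2.3.6): with the mean μ assumed known, the gamma prior over the precision λ and its posterior update after observing x_1, …, x_N are

```latex
\mathrm{Gam}(\lambda \mid a, b) = \frac{1}{\Gamma(a)}\, b^{a}\, \lambda^{a-1} e^{-b\lambda}

a_N = a_0 + \frac{N}{2}, \qquad
b_N = b_0 + \frac{1}{2}\sum_{n=1}^{N}(x_n - \mu)^2
```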
Bayesian relationship with Multivariate Gaussian distr.
• The conjugate prior for the precision (inverse covariance) matrix of a multivariate Gaussian with known mean is the Wishart distribution.
• When both mean and precision are unknown, the conjugate prior of the multivariate Gaussian is the Gaussian-Wishart distribution.
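The Wishart density over a D×D precision matrix Λ, with scale matrix W and ν degrees of freedom (the 'degrees of freedom' hyperparameter mentioned in the terminology slide); B(W, ν) is the normalization constant:

```latex
\mathcal{W}(\Lambda \mid W, \nu) = B(W, \nu)\, |\Lambda|^{(\nu - D - 1)/2}
\exp\!\left(-\tfrac{1}{2}\operatorname{Tr}(W^{-1}\Lambda)\right)
```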
Distributions
Variations of the Gaussian distribution
Student's t-distribution
• Used in analysis of variance to test whether an effect is real and statistically significant, using the t-distribution with n-1 degrees of freedom.
• If the X_i are normal random variables, then the t-statistic below follows a t-distribution.
  – The t-distribution has a lower peak and longer tails (allowing more outliers, hence more robust) than the Gaussian distribution.
• Obtained by summing up an infinite number of univariate Gaussians with the same mean but different precisions (written out below).
$t_{n-1} = \dfrac{\bar{X}_n - \mu}{S_n/\sqrt{n}}$, where $\bar{X}_n$ is the sample mean and $S_n^2$ the sample variance of $X_1,\dots,X_n$.
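The 'infinite sum of Gaussians' statement written out (Bishop §2.3.7): integrating a Gaussian over a gamma-distributed precision gives the Student's t density:

```latex
\mathrm{St}(x \mid \mu, \lambda, \nu)
 = \int_0^{\infty} \mathcal{N}\!\left(x \mid \mu, (\eta\lambda)^{-1}\right)
   \mathrm{Gam}\!\left(\eta \mid \tfrac{\nu}{2}, \tfrac{\nu}{2}\right)\mathrm{d}\eta
 = \frac{\Gamma(\frac{\nu+1}{2})}{\Gamma(\frac{\nu}{2})}
   \left(\frac{\lambda}{\pi\nu}\right)^{1/2}
   \left[1 + \frac{\lambda(x-\mu)^2}{\nu}\right]^{-\frac{\nu+1}{2}}
```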
Student's t-distribution (cont.)
• For a multivariate Gaussian there is a corresponding multivariate t-distribution.
  – Δ² is the squared Mahalanobis distance.
• Mean and variance are given below.
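The corresponding multivariate form, with Δ² the squared Mahalanobis distance, together with its mean and covariance:

```latex
\mathrm{St}(\mathbf{x} \mid \boldsymbol{\mu}, \Lambda, \nu)
 = \frac{\Gamma(\frac{D}{2} + \frac{\nu}{2})}{\Gamma(\frac{\nu}{2})}
   \frac{|\Lambda|^{1/2}}{(\pi\nu)^{D/2}}
   \left[1 + \frac{\Delta^2}{\nu}\right]^{-\frac{D}{2} - \frac{\nu}{2}},
\qquad \Delta^2 = (\mathbf{x} - \boldsymbol{\mu})^{\mathsf T}\Lambda(\mathbf{x} - \boldsymbol{\mu})

\mathbb{E}[\mathbf{x}] = \boldsymbol{\mu} \ \ (\nu > 1), \qquad
\operatorname{cov}[\mathbf{x}] = \frac{\nu}{\nu - 2}\,\Lambda^{-1} \ \ (\nu > 2)
```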
Gaussian with periodic variables
• To avoid the mean being dependent on the choice of origin, use polar coordinates.
  – Solve for θ.
• The von Mises distribution is a special case of the von Mises-Fisher distribution on the N-dimensional sphere; it is the stationary distribution of a drift process on the circle.
Gaussian with periodic variables (cont.)
• Go from a Gaussian in Cartesian coordinates to polar coordinates.
  – Conditioning on the unit circle, this becomes the von Mises distribution, with mean θ₀ and precision (concentration) m (density below).
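The resulting von Mises density, with mean direction θ₀, concentration m, and I₀ the zeroth-order modified Bessel function:

```latex
p(\theta \mid \theta_0, m) = \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \}
```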
Gaussian with periodic variables: mean and concentration
• Maximize the log likelihood to obtain:
  – the mean θ₀
  – the precision (concentration) m
• by noting the trigonometric expansion of cos(θ_n − θ₀) and the Bessel-function ratio A(m) = I₁(m)/I₀(m) (maximum-likelihood solutions below).
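The maximum-likelihood solutions referred to above (Bishop §2.3.8): the mean direction comes from the sample sine and cosine sums, and the concentration solves an implicit equation in A(m):

```latex
\theta_0^{\mathrm{ML}} = \tan^{-1}\!\left\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \right\},
\qquad
A\!\left(m_{\mathrm{ML}}\right) = \frac{1}{N}\sum_{n=1}^{N} \cos\!\left(\theta_n - \theta_0^{\mathrm{ML}}\right)
```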
Mixture of Gaussians
• In Part 1 we already saw that one limitation of the Gaussian is its unimodality.
  – Solution: a linear combination (superposition) of Gaussians.
• The mixing coefficients sum to 1.
• The posterior probability of each component is known as its 'responsibility' (sketch below).
  – Log likelihood: ln p(X) = Σ_n ln { Σ_k π_k N(x_n | μ_k, Σ_k) }.
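A minimal sketch of a one-dimensional Gaussian mixture and its responsibilities, using hypothetical parameter values chosen only for illustration:

```python
import math

def gauss(x, mu, var):
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)

# Mixing coefficients sum to 1 (illustrative values).
pis, mus, variances = [0.5, 0.3, 0.2], [-2.0, 0.0, 3.0], [1.0, 0.5, 2.0]

def mixture_density(x):
    return sum(pi * gauss(x, mu, var) for pi, mu, var in zip(pis, mus, variances))

def responsibilities(x):
    """Posterior probability of each component having generated x."""
    weighted = [pi * gauss(x, mu, var) for pi, mu, var in zip(pis, mus, variances)]
    total = sum(weighted)
    return [w / total for w in weighted]

print(mixture_density(0.5), responsibilities(0.5))
```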
Exponential family
• Natural form: p(x|η) = h(x) g(η) exp{ηᵀ u(x)}
  – g(η) normalizes the distribution.
• 1) Bernoulli: can be written in this form with η the log-odds (worked out below).
• 2) Multinomial: can likewise be written in natural form, with a softmax-type relation between η and μ.
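Writing out the natural form and the Bernoulli case (Bishop §2.4), presumably what the "becomes" on the slide refers to: the natural parameter is the log-odds, whose inverse is the logistic sigmoid:

```latex
p(x \mid \eta) = h(x)\, g(\eta)\, \exp\{\eta^{\mathsf T} u(x)\}

\mathrm{Bern}(x \mid \mu) = \mu^{x}(1-\mu)^{1-x}
 = (1-\mu)\exp\!\left\{ x \ln\frac{\mu}{1-\mu} \right\},
\qquad \eta = \ln\frac{\mu}{1-\mu}, \quad \mu = \sigma(\eta) = \frac{1}{1 + e^{-\eta}}
```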
Exponential family (cont.)
• 3) Univariate Gaussian
  – can also be written in natural form.
• Solve for the natural parameters in terms of (μ, σ²).
  – The result is η = (μ/σ², −1/(2σ²)) with sufficient statistics u(x) = (x, x²).
  – From maximum likelihood, the data enter only through the sufficient statistics (condition below).
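The maximum-likelihood condition referred to above: differentiating the log likelihood of the natural form shows that the gradient of −ln g(η) at the ML solution equals the sample average of the sufficient statistics:

```latex
-\nabla \ln g\!\left(\eta_{\mathrm{ML}}\right) = \frac{1}{N}\sum_{n=1}^{N} u(x_n)
```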
Parameters of Distributions
And interesting methodologies
Uninformative priors
• To avoid the "subjective Bayesian" criticism of building in incorrect assumptions, use an uninformative (e.g. uniform) prior.
  – Improper prior: the prior need not normalize to 1 for the posterior to normalize to 1, as per Bayes' equation.
• 1) Location parameter: prior chosen for translation invariance.
• 2) Scale parameter: prior chosen for scale invariance (both written out below).
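The two standard noninformative choices (both improper): a constant density for a location parameter, and a density proportional to 1/σ (i.e. uniform over ln σ) for a scale parameter:

```latex
p(\mu) = \text{const} \qquad \text{(translation invariance)}

p(\sigma) \propto \frac{1}{\sigma} \qquad \text{(scale invariance)}
```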
Nonparametric methods
• Instead of assuming a form for the distribution, use nonparametric methods.
• 1) Histogram with constant bin width
  – Good for sequential data.
  – Problems: discontinuities at bin edges, and the number of bins grows exponentially with dimensionality.
• 2) Kernel estimators: a sum of Parzen windows
  – Out of 'N' observations, the number falling in a region R (of volume V) is 'K'.
  – The density estimate becomes (see below):
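The local density estimate underlying both the kernel and nearest-neighbor approaches: for a small region R of volume V containing K of the N observations,

```latex
p(\mathbf{x}) \simeq \frac{K}{N V}
```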
Nonparametric method: Kernel estimators
• 2) Kernel estimators: fix V, determine K.
  – The kernel function k(u) indicates the points falling in R.
  – h > 0 is a fixed bandwidth parameter that controls smoothing.
  – Parzen estimator: k(u) can be chosen freely (e.g. a Gaussian); a sketch follows below.
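A minimal sketch of a one-dimensional Parzen estimator with a Gaussian kernel, assuming a hypothetical data list and a bandwidth h chosen only for illustration:

```python
import math

def parzen_density(x, data, h):
    """Kernel density estimate p(x) = (1/N) * sum_n (1/h) * k((x - x_n)/h)."""
    def k(u):                                   # Gaussian kernel
        return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)
    n = len(data)
    return sum(k((x - x_n) / h) for x_n in data) / (n * h)

data = [0.1, 0.4, 0.35, 1.2, 1.5]               # illustrative observations
print(parzen_density(0.5, data, h=0.3))
```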
Nonparametric method: Nearest-neighbor
• 3) Nearest neighbor: this time fix K and use the data to grow V.
  – As with the kernel estimator, the training set is stored as the knowledge base.
  – 'K' is the number of neighbors; a larger 'K' gives a smoother, less complex boundary with fewer regions.
  – For classification, with N points of which N_k belong to class C_k, the prior is p(C_k) = N_k/N; apply Bayes' theorem and maximize the posterior (see below).
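Putting the pieces together for classification (Bishop §2.5.2): with K_k of the K neighbors in the sphere of volume V belonging to class C_k, and N_k of the N training points in class C_k,

```latex
p(\mathbf{x} \mid \mathcal{C}_k) = \frac{K_k}{N_k V}, \qquad
p(\mathcal{C}_k) = \frac{N_k}{N}, \qquad
p(\mathbf{x}) = \frac{K}{N V}
\;\;\Longrightarrow\;\;
p(\mathcal{C}_k \mid \mathbf{x}) = \frac{p(\mathbf{x}\mid\mathcal{C}_k)\,p(\mathcal{C}_k)}{p(\mathbf{x})} = \frac{K_k}{K}
```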
Nonparametric method: Nearest-neighbor (cont.)
• 3) Nearest neighbor: assign a new point to the class C_k with the majority vote among its K nearest neighbors (sketch below).
  – For K=1 and N→∞, the error rate is bounded by twice the Bayes error rate.
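A minimal sketch of K-nearest-neighbor classification by majority vote, using toy 2-D points and labels invented only for illustration:

```python
from collections import Counter

def knn_classify(query, points, labels, k=3):
    """Assign `query` the majority label among its k nearest training points."""
    dists = sorted(
        (sum((q - p) ** 2 for q, p in zip(query, pt)), lab)
        for pt, lab in zip(points, labels)
    )
    votes = Counter(lab for _, lab in dists[:k])
    return votes.most_common(1)[0][0]

points = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.1), (0.9, 1.0)]   # toy training set
labels = ["A", "A", "B", "B"]
print(knn_classify((0.2, 0.1), points, labels, k=3))          # "A"
```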
[k-nearest neighbor algorithm, Wikipedia, 2011]
Ch.2 Basic Graph Concepts
From David Barber’s book
Directed and undirected graphs
Representations of Graphs
• Singly connected (tree): there is only one path from A to B.
• Spanning tree of an undirected graph: a singly connected subgraph covering all vertices.
• Graph representations (numerical):
  • Edge list: e.g. a list of vertex pairs, one pair per edge.
  • Adjacency matrix A: for N vertices, an N x N matrix with Aij = 1 if there is an edge from i to j. For an undirected graph this is symmetric (a small example follows below).
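A minimal sketch building an adjacency matrix from an edge list; the vertices and edges here are invented for illustration, not the ones on the slide:

```python
import numpy as np

n_vertices = 4
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]        # illustrative edge list

A = np.zeros((n_vertices, n_vertices), dtype=int)
for i, j in edges:
    A[i, j] = 1
    A[j, i] = 1                                  # undirected: symmetric

print(A)
```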
Representations of Graphs (cont.)
• Directed graph: if the vertices are labeled in ancestral order (parents before children), the adjacency matrix is strictly upper triangular, provided there is no edge from a vertex to itself.
• An undirected graph with K maximal cliques has an N x K clique matrix, where each column C_k expresses which nodes form a clique.
  • Example: 2 cliques, on vertices {1,2,3} and {2,3,4}.
Incidence Matrix
• Adjacency matrix A and incidence matrix Zinc.
• Maximal clique (incidence) matrix Z.
• Property relating these matrices back to the adjacency matrix (checked numerically below).
• Note: the columns of Zinc denote edges, and its rows denote vertices.
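A small check of one commonly stated property of the (unoriented) incidence matrix, presumably the kind of relation the slide refers to: Zinc Zinc^T has the vertex degrees on its diagonal and the adjacency matrix off the diagonal. The graph below is invented for illustration:

```python
import numpy as np

n_vertices = 4
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]             # illustrative undirected graph

# Incidence matrix: rows = vertices, columns = edges.
Zinc = np.zeros((n_vertices, len(edges)), dtype=int)
for e, (i, j) in enumerate(edges):
    Zinc[i, e] = 1
    Zinc[j, e] = 1

M = Zinc @ Zinc.T
A = M - np.diag(np.diag(M))                           # off-diagonal part = adjacency matrix
degrees = np.diag(M)                                  # diagonal part = vertex degrees
print(A)
print(degrees)
```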
Additional Information
• Excerpts of figures and equations from [Pattern Recognition and Machine Learning, Bishop C.M.], pages 84-127.
• Excerpts of figures and equations from [Bayesian Reasoning and Machine Learning, David Barber], pages 19-23.
• Slides uploaded to the Google group. Use with reference.