
Page 1: General statistical inference for discrete and mixed spaces by an approximate application of the maximum entropy principle

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 11, NO. 3, MAY 2000

Author: Lian Yan and David J. Miller
Advisor: Dr. Hsu
Graduate: Keng-Wei Chang

National Yunlin University of Science and Technology, Intelligent Database Systems Lab

Page 2: Outline

- Motivation
- Objective
- Introduction
- Maximum Entropy Joint PMF
- Extensions for More General Inference Problems
- Experimental Results
- Conclusions and Possible Extensions

Page 3: Motivation

- The maximum entropy (ME) joint probability mass function (pmf) is powerful and does not require expressing conditional independence relations.
- However, the huge learning complexity has severely limited the use of this approach.

Page 4: Objective

- Propose an ME approach with quite tractable learning.
- Extend the approach to mixed discrete and continuous data.

Page 5: 1. Introduction

With a joint probability mass function (pmf), one can compute a posteriori probabilities:
- for a single, fixed feature given knowledge of the remaining feature values (statistical classification)
- with some feature values missing (classification with missing features)
- for any (e.g., user-specified) discrete feature dimensions given values for the other features (generalized classification)

Page 6: 1. Introduction

- Multiple Networks Approach
- Bayesian Networks
- Maximum Entropy Models
- Advantages of the Proposed ME Method over BNs

Page 7: 1.1 Multiple Networks Approach

- Multilayer perceptrons (MLPs), radial basis functions, support vector machines: one trains one network for each feature to be inferred.
- Example: classifying documents into multiple topics, where one network makes an individual yes/no decision for the presence of each possible topic, i.e., the multiple networks approach.

Page 8: 1.1 Multiple Networks Approach

Several potential difficulties:
- increased learning and storage complexities
- accuracy of inferences: the approach ignores dependencies between features
- example: the networks predict F1 = 1 and F2 = 1 respectively, but the joint event (F1 = 1, F2 = 1) has zero probability

Page 9: 1.2 Bayesian Networks

- Handle missing features and capture dependencies between the multiple features.
- The joint pmf is written explicitly as a product of conditional probabilities.
- Versatile tools for inference with a convenient, informative representation.

Page 10: 1.2 Bayesian Networks

Several difficulties with BNs:
- conditional independence relations between features must be made explicit
- optimizing over the set of possible BN structures: sequential, greedy methods may be suboptimal
- with sequential learning, it is unclear where to stop to avoid overfitting

Page 11: 1.3 Maximum Entropy Models

- Cheeseman proposed the maximum entropy (ME) joint pmf consistent with arbitrary lower-order probability constraints.
- Powerful, allowing the joint pmf to express general dependencies between features.

Page 12: 1.3 Maximum Entropy Models

Several difficulties with ME:
- learning (estimating the ME pmf) is difficult
- Ku and Kullback proposed an iterative algorithm that satisfies one constraint at a time, but satisfying one constraint can cause violation of others
- they only presented results for dimension N = 4 and J = 2 discrete values per feature
- Pearl cites complexity as the main barrier to using ME

Page 13: 1.4 Advantages of the Proposed ME Method over BNs

- Our approach does not require explicit conditional independence relations.
- It provides an effective joint optimization learning technique.

Page 14: 2. Maximum Entropy Joint PMF

- Random feature vector $\mathbf{F} = (F_1, F_2, \ldots, F_N)$, with $F_i \in A_i = \{1, 2, 3, \ldots, |A_i|\}$.
- Full discrete feature space: $A_1 \times A_2 \times \cdots \times A_N$.

Page 15: 2. Maximum Entropy Joint PMF

- Pairwise pmfs $\{P[F_m, F_n],\ m \neq n\}$ are the constraints: the joint pmf $P[\mathbf{F}]$ is required to agree with them.
- The ME joint pmf consistent with these pairwise pmfs has the Gibbs form, with one Lagrange multiplier per constraint (sketched below).
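For reference, a generic way to write a Gibbs-form pmf under pairwise marginal constraints; the indexing of the multipliers $\gamma_{mn}$ and the partition function $Z$ is assumed here, not copied from the paper:

```latex
% ME joint pmf under pairwise marginal constraints (generic sketch)
P[\mathbf{F} = \mathbf{f}]
  = \frac{1}{Z(\Gamma)} \exp\!\Big( \sum_{m < n} \gamma_{mn}(f_m, f_n) \Big),
\qquad
Z(\Gamma) = \sum_{\mathbf{f}' \in A_1 \times \cdots \times A_N}
            \exp\!\Big( \sum_{m < n} \gamma_{mn}(f'_m, f'_n) \Big)
```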

Page 16: 2. Maximum Entropy Joint PMF

- There is one Lagrange multiplier per equality constraint on an individual pairwise probability $P[F_m = f_m, F_n = f_n]$.
- The joint pmf is therefore specified by the set of Lagrange multipliers $\Gamma = \{\gamma(f_m, f_n) : f_m \in A_m,\ f_n \in A_n,\ m \neq n\}$.
- Although these probabilities also depend on $\Gamma$, they can often be tractably computed.

Page 17: 2. Maximum Entropy Joint PMF

Two major difficulties:
- the optimization requires calculating the joint pmf $P[\mathbf{f}]$, which is intractable
- the constraint cost $D$ requires marginalizations over the joint pmf, which are also intractable
- these difficulties inspired the approximate ME approach

Page 18: 2.1 Review of the ME Formulation for Classification

- Extend the random feature vector to $\tilde{\mathbf{F}} = (\mathbf{F}, C)$, with class label $C \in \{1, 2, \ldots, K\}$.
- The joint pmf $P[\tilde{\mathbf{F}}]$ still has the intractable form (1).
- Classification does not require computing $P[\tilde{\mathbf{F}}]$ itself, but rather just the a posteriori class probabilities; computing even these directly is still not feasible.

Page 19: 2.1 Review of the ME Formulation for Classification

Here we review a tractable, approximate method:
- Joint PMF Form
- Support Approximation
- Lagrangian Formulation

Page 20: 2.1.1 Joint PMF Form

- Via Bayes' rule, the joint pmf is written as the product of the class posterior and the feature pmf $\{P[\mathbf{f}],\ \mathbf{f} \in A_1 \times A_2 \times \cdots \times A_N\}$.

Page 21: 2.1.2 Support Approximation

- The approximation may have some effect on the accuracy of the learned model, but it will not sacrifice our inference capability.
- Restrict the support of the pmf from the full feature space $A_1 \times A_2 \times \cdots \times A_N$ to a subset derived from the training set $\{(\mathbf{F} = \mathbf{f}_i, C = c_i)\}$, which is computationally feasible (see the sketch below).
- Example: for $N = 19$ the full feature space holds on the order of 40 billion points, so reducing the support to roughly 100 points is a huge reduction.
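A minimal sketch of the support idea, with hypothetical names (`gibbs_pmf_on_support`, `gamma`) that are not the authors' code: the Gibbs scores are evaluated and normalized only over the distinct training vectors rather than over the full product space.

```python
import numpy as np
from itertools import combinations

def gibbs_pmf_on_support(support, gamma):
    """Gibbs-form pmf normalized over a reduced support only.

    support : ndarray of shape (M, N), the M distinct training feature vectors
    gamma   : dict mapping (m, n, f_m, f_n) -> Lagrange multiplier value
    (Both names are illustrative; they are not the paper's notation.)
    """
    M, N = support.shape
    log_scores = np.zeros(M)
    for idx, f in enumerate(support):
        for m, n in combinations(range(N), 2):      # pairwise terms only
            log_scores[idx] += gamma.get((m, n, f[m], f[n]), 0.0)
    scores = np.exp(log_scores - log_scores.max())  # numerical stabilization
    return scores / scores.sum()                    # pmf over the support

# With N = 19 features the full space can hold billions of points, but this
# pmf is evaluated only on the ~100 distinct training vectors in `support`.
```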

Page 22: 2.1.3 Lagrangian Formulation

- i.e., the reduced support is the set $\mathcal{F} = \{\mathbf{f}^{(m)} = (f_1^{(m)}, f_2^{(m)}, \ldots, f_N^{(m)}),\ m = 1, \ldots, |\mathcal{F}|\}$.
- The joint entropy is then computed for the pmf restricted to $\mathcal{F}$ (see below).
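Written out, the entropy of a joint pmf restricted to the support $\mathcal{F}$ and the $K$ class values takes the standard form (a generic expression, assuming this is the quantity the slide's equation showed):

```latex
% Joint entropy over the reduced support and the K class values
H = - \sum_{m=1}^{|\mathcal{F}|} \sum_{c=1}^{K}
      P\big[\mathbf{f}^{(m)}, c\big]\,
      \log P\big[\mathbf{f}^{(m)}, c\big]
```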

Page 23: 2.1.3 Lagrangian Formulation

- To measure how well each pairwise constraint is satisfied, the cross entropy (Kullback distance) between the constrained pairwise pmf and the model's pairwise pmf is suggested (definition below).
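The cross entropy / Kullback distance between a target pairwise pmf $P$ and the model-induced pairwise pmf $P_M$ has the standard definition:

```latex
D\big(P[F_k, F_l] \,\|\, P_M[F_k, F_l]\big)
  = \sum_{f_k \in A_k} \sum_{f_l \in A_l}
      P[F_k = f_k, F_l = f_l]\,
      \log \frac{P[F_k = f_k, F_l = f_l]}{P_M[F_k = f_k, F_l = f_l]}
```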

Page 24: 2.1.3 Lagrangian Formulation

- Analogous cost terms are formed for the pairwise constraints involving the class label, $P[F_k, C]$.

Page 25: 2.1.3 Lagrangian Formulation

- The overall constraint cost $D$ is formed as a sum of all the individual pairwise costs.
- Given $D$ and the joint entropy $H$, we can form the Lagrangian cost function (a plausible form is sketched below).
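One plausible way to write this trade-off; the weighting $\lambda$ and the exact grouping of the constraint terms are assumptions for illustration, not the paper's verbatim formulation:

```latex
% Lagrangian: maximize entropy H while driving the total constraint cost D down
\mathcal{L}(\Gamma) = -\,H(P_\Gamma) + \lambda\, D(\Gamma),
\qquad
D(\Gamma) = \sum_{k < l} D\big(P[F_k, F_l] \,\|\, P_M[F_k, F_l]\big)
          + \sum_{k} D\big(P[F_k, C] \,\|\, P_M[F_k, C]\big)
```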

Page 26: 3. Extensions for More General Inference Problems

- General Statistical Inference
  - Joint PMF Representation
  - Support Approximation
  - Lagrangian Formulation
- Discussion
- Mixed Discrete and Continuous Feature Space

Page 27: 3.1.1 Joint PMF Representation

- For generalized inference, the a posteriori probabilities of any designated feature given the remaining feature values are needed (a generic form is given below).
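Generically, such a posterior is obtained from the joint pmf by marginalizing over the feature being inferred; the notation here is assumed:

```latex
P\big[F_i = f_i \,\big|\, F_j = f_j,\ j \neq i\big]
  = \frac{P[F_1 = f_1, \ldots, F_N = f_N]}
         {\sum_{f'_i \in A_i} P[F_1 = f_1, \ldots, F_i = f'_i, \ldots, F_N = f_N]}
```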

Page 28: 3.1.1 Joint PMF Representation

- With respect to each feature $F_i$, the joint pmf is written via a Bayes-rule decomposition, giving one representation per feature.

Page 29: 3.1.2 Support Approximation

- The reduced joint pmf is defined with respect to each feature $F_i$ to be inferred.
- This holds if there is a set $S_i(\mathbf{f})$ of support points $\mathbf{f}^{(m)}$ that agree with $\mathbf{f}$ on the required coordinates.

Page 30: 3.1.3 Lagrangian Formulation

- The joint entropy $H$ can be written over the reduced support, as before.

Page 31: 3.1.3 Lagrangian Formulation

- The model pairwise pmf $P_M[F_k, F_l]$ can be calculated in two different ways.

Page 32: 3.1.3 Lagrangian Formulation

- The overall constraint cost $D$ is again formed by summing the individual pairwise costs.

Page 33: 3.1.3 Lagrangian Formulation


Page 36: 3.2 Discussion

- Choice of Constraints: encode all probabilities of second order.
- Tractability of Learning.
- Qualitative Comparison of Methods.

Page 37: 3.3 Mixed Discrete and Continuous Feature Space

- The feature vector will be written $(\mathbf{F}, \mathbf{A})$, with discrete part $\mathbf{F} = (F_1, F_2, \ldots, F_{N_d})$ and continuous part $\mathbf{A} = (A_1, A_2, \ldots, A_{N_c})$.
- Our objective is to learn the posteriors $\{P[c \mid \mathbf{f}, \mathbf{a}]\}$.

Page 38: 3.3 Mixed Discrete and Continuous Feature Space

- Given our choice of constraints, these probabilities follow from decomposing the joint density.

Page 39: 3.3 Mixed Discrete and Continuous Feature Space

- A conditional mean constraint is placed on each continuous feature $A_i$ given $C = c$ (an example form is given below).
- Analogous constraints are placed on each pair of continuous features $A_i$, $A_j$.
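A minimal example of what a conditional mean constraint can look like; the empirical-average form on the right-hand side is an assumption for illustration:

```latex
% Model conditional mean of continuous feature A_i given class c,
% constrained to match the empirical mean of the training samples with C = c
E_M\big[A_i \mid C = c\big]
  = \frac{1}{|\{t : c_t = c\}|} \sum_{t : c_t = c} a_{i,t},
\qquad c = 1, \ldots, K
```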

Page 40: 4. Experimental Results

- Evaluation of generalized classification performance on data sets used solely for classification: Mushroom, Congress, Nursery, Zoo, Hepatitis.
- Generalized classification performance on data sets with multiple possible class features: Solar Flare, Flag, Horse Colic.
- Classification performance on data sets with mixed continuous and discrete features: Credit Approval, Hepatitis, Horse Colic.

Page 41: 4. Experimental Results

The ME method was compared with:
- Bayesian networks (BN)
- decision trees (DT)
- a powerful extension of DT: mixtures of DTs
- multilayer perceptrons (MLP)

Page 42: 4. Experimental Results

- For an arbitrary feature to be inferred, $F_i$, the method computes the a posteriori probabilities of its values given the remaining features.

Page 43: 4. Experimental Results

We use the following criteria to evaluate all the methods (a small sketch of handling a missing feature follows):
(1) misclassification rate on the test set for the data set's class label
(2) as in (1), but with a single feature missing at random
(3) average misclassification rate on the test set
(4) misclassification rate on the test set, based on predicting a pair of randomly chosen features
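For criterion (2), a joint pmf handles a missing feature by marginalizing over its possible values. A minimal illustrative sketch; the function and argument names are assumptions, not the authors' code:

```python
import numpy as np

def class_posterior_with_missing(joint_pmf, x, miss, n_vals, n_classes):
    """Class posterior when feature `miss` is unobserved.

    joint_pmf(f, c) returns P[F = f, C = c]; x is the observed feature vector
    (its entry at position `miss` is ignored); n_vals = |A_miss|.
    """
    scores = np.zeros(n_classes)
    for c in range(n_classes):
        for v in range(n_vals):        # marginalize over the missing value
            f = list(x)
            f[miss] = v
            scores[c] += joint_pmf(tuple(f), c)
    return scores / scores.sum()

# Prediction for criterion (2): take the argmax over this posterior and compare
# it with the true class label to accumulate the misclassification rate.
```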


Page 46: 4. Experimental Results

Page 47: 5. Conclusions and Possible Extensions

- Regression
- Large-Scale Problems
- Model Selection: Searching for ME Constraints
- Applications

Page 48: Personal opinion …