Hindawi Publishing Corporation
Mathematical Problems in Engineering
Volume 2013, Article ID 975953, 9 pages
http://dx.doi.org/10.1155/2013/975953
Research Article

A Bayesian Classifier Learning Algorithm Based on Optimization Model
Sanyang Liu Mingmin Zhu and Youlong Yang
Department of Mathematics, Xidian University, Xi'an 710071, China
Correspondence should be addressed to Sanyang Liu; liusanyang@126.com
Received 6 September 2012; Accepted 10 December 2012
Academic Editor Cesar Cruz-Hernandez
Copyright © 2013 Sanyang Liu et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The naive Bayes classifier is a simple and effective classification method, but its attribute independence assumption makes it unable to express the dependence among attributes and affects its classification performance. In this paper, we summarize the existing improved algorithms and propose a Bayesian classifier learning algorithm based on an optimization model (BC-OM). BC-OM uses the chi-squared statistic to estimate the dependence coefficients among attributes, with which it constructs the objective function as an overall measure of the dependence for a classifier structure. Therefore, the problem of searching for an optimal classifier can be turned into finding the maximum value of the objective function over the feasible region. In addition, we have proved the existence and uniqueness of the numerical solution. BC-OM offers a new perspective for the research on extended Bayesian classifiers. Theoretical and experimental results show that the new algorithm is correct and effective.
1. Introduction
With the development of information technology, in particular the progress of network technology, multimedia technology, and communication technology, massive data analysis and processing have become more and more important. Since a Bayesian network used as a classifier has a solid mathematical basis and takes the prior information of samples into consideration, it is now one of the hottest areas in the machine learning and data mining fields. Moreover, it has been applied to a wide range of tasks, such as natural spoken dialog systems, vision recognition, medical diagnosis, genetic regulatory network inference, and so forth [1–8]. Naive Bayes (NB) [9–11] is a simple and effective classification model. Although its performance is comparable with other classification methods, such as decision trees and neural networks, its attribute independence assumption limits its real application. Extending its structure is a direct way to overcome the limitation of naive Bayes [12–14], since attribute dependencies can be explicitly represented by arcs. Tree-augmented naive Bayes (TAN) [9] is an extended tree-like naive Bayes, in which the class node directly points to all attribute nodes and an attribute node can have only one parent from another attribute node. On this basis, Cheng et al.
presented Bayesian-network-augmented naive Bayes (BAN) [15, 16], which further expanded the tree-like structure of the TAN classifier and allowed dependency relations between any two attribute nodes. In constructing BAN, they use a scoring function based on the minimum description length principle. Unfortunately, the search for the best network is performed in the space of all possible networks, and the number of elements in this space increases exponentially with the number of nodes; finding the best structure is NP-hard [17, 18].
Based on the above analysis, this paper presents a Bayesian classifier learning algorithm based on an optimization model (BC-OM) for the first time, inspired by constraint-based Bayesian network structure learning methods [19–22]. We discuss the classification principles of Bayesian classifiers from a new view. Because chi-squared tests are a standard tool for measuring the dependency between pairs of variables [23], BC-OM first introduces the chi-squared statistic to define the dependence coefficients of variables. Then it uses the dependence coefficients to construct an overall measure of the dependence in a classifier structure, from which the objective function for our optimization model can be derived. Therefore, the problem of searching for an optimal classifier can be turned into finding the maximum value of the objective
Figure 1: Illustration of NB, TAN, and BAN structures. Each panel shows the class node V1 and attribute nodes V2–V5: (a) NB, (b) TAN, (c) BAN.
function over the feasible region. The function extremum corresponds to the best classifier. Finally, BC-OM improves the efficiency of classification and deletes irrelevant or redundant attributes by using the d-separation rule of Bayesian networks. Theoretical and experimental results show that the proposed algorithm is not only effective in improving the accuracy, but also has a high learning speed and a simple solving procedure.
The remainder of this paper is organized as follows. Section 2 reviews the existing Bayesian network classifiers. We describe our algorithm and its theoretical proofs in Section 3. Section 4 details the experimental procedures and results of the proposed algorithm. Finally, in Section 5, we conclude and outline our future work.
2. Background
In this section, we discuss previous work that is relevant to this paper and first describe some of the notation used. We use boldface capital letters, such as V, E, X, for sets of variables. General variables are denoted by italic capital letters or indexed italic capital letters X, R, V; specific values taken by these variables are denoted x, r, v. Specially, we use the same italic letters X, Y, V for graph nodes, which correspond with the random variables.
Classification is a basic task in data analysis and pattern recognition that requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. The induction of classifiers from data sets of preclassified instances is a central problem in machine learning. Let V = {V1, V2, ..., Vn} represent the variable set which corresponds with the training data set D. We assume that V1 is the class variable and {V2, V3, ..., Vn} is the set of attribute variables. Bayesian networks are often used for classification problems, in which the main task is to construct the classifier structure G from a given set of training data with class labels and then compute the posterior probability P_G(v1 | V2, V3, ..., Vn), where v1 is the value that V1 takes. Thus, it only needs to predict the class with the highest value of the probability P_G(v1 | V2, V3, ..., Vn); that is,

    v1* = arg max_{v1 ∈ dom(V1)} P_G(v1 | V2, V3, ..., Vn).    (1)
According to Bayes' theorem, maximizing P_G(v1 | V2, V3, ..., Vn) is equivalent to maximizing P(v1) · P_G(V2, V3, ..., Vn | v1). The difference between the existing Bayesian classifiers is the computing mode of P_G(·).
Figure 1 schematically illustrates the structures of the Bayesian classifiers considered in this paper. In naive Bayes, each attribute node has the class node as its parent but does not have any parent among the attribute nodes. Computing P_G(V2, V3, ..., Vn | v1) is equal to computing ∏_{i=2}^{n} P_G(Vi | v1). Because the values of P_G(v1) and P_G(Vi | v1) can be easily estimated from training examples, naive Bayes is easy to construct. However, its prerequisites of the conditional independence assumption and data completeness limit its real application. TAN takes the naive Bayes and adds edges to it, in which the class node directly points to all attribute nodes and an attribute node can have only one parent from another attribute node. Computing P_G(V2, V3, ..., Vn | v1) is equivalent to computing ∏_{i=2}^{n} P_G(Vi | Vj, v1), where Vj ∈ {V1, V2, ..., V_{i−1}}. It is an efficient extension of naive Bayes. BAN is a specific case of the general Bayesian network classifier, in which the class node also directly points to all attribute nodes, but there is no limitation on the arcs among attribute nodes (except that they do not form any directed cycle). It is clear that TAN and BAN are useful to model correlations among attribute nodes that cannot be captured by naive Bayes. They embody a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity of the learning stage. In addition, existing algorithms use the same idea to construct the structure of a Bayesian classifier: they first learn the dependence relationships among attribute variables using a Bayesian network structure learning algorithm and then add the class variable as the root node of the network. This is equivalent to learning the best Bayesian network among those in which V1 is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile.
Based on the above analysis, this paper presents an optimization model to learn the structure of a Bayesian classifier, inspired by constraint-based Bayesian network structure learning methods. It is the first time that the problem of structural learning for a Bayesian classifier is transformed into a related mathematical programming problem by defining an objective function and a feasible region. We also propose a new method to measure the dependence relationships between attributes. The theoretical basis of this method is established by Theorem 1 [24].
Theorem 1. Given a data set D and a variable set V = {V1, V2, ..., Vn}, if the hypothesis that Vi and Vj are conditionally independent given Vk is true, then the statistic

    U²_{ij|k} = 2 Σ_{a,b,c} N^{abc}_{ijk} log[ N^{abc}_{ijk} N^{c}_{k} / (N^{ac}_{ik} N^{bc}_{jk}) ]

approximates a χ²(l) distribution with l = (r_i − 1)(r_j − 1) r_k degrees of freedom, where r_i, r_j, and r_k represent the number of configurations for the variables Vi, Vj, and Vk, respectively; N^{abc}_{ijk} is the number of cases in D where Vi = a, Vj = b, and Vk = c; N^{ac}_{ik} is the number of cases in D where Vi = a and Vk = c; and N^{c}_{k} is the number of cases in D where Vk = c.
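The statistic U²_{ij|k} of Theorem 1 can be computed directly from three-way contingency counts. A self-contained sketch (function and variable names are ours; column indices i, j, k select the tested and conditioning variables in each data row):

```python
import math
from collections import Counter

def u2_statistic(rows, i, j, k):
    """U2_{ij|k} = 2 * sum_{a,b,c} N_abc * log(N_abc * N_c / (N_ac * N_bc)),
    the conditional statistic of Theorem 1, from raw data rows."""
    n_abc = Counter((r[i], r[j], r[k]) for r in rows)
    n_ac = Counter((r[i], r[k]) for r in rows)
    n_bc = Counter((r[j], r[k]) for r in rows)
    n_c = Counter(r[k] for r in rows)
    u2 = 0.0
    for (a, b, c), nabc in n_abc.items():   # cells with N_abc = 0 contribute 0
        u2 += 2.0 * nabc * math.log(nabc * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
    return u2
```

When the empirical distribution factorizes exactly within each slice of Vk, the statistic is zero; dependence within a slice makes it positive, and under the null it is compared against a χ² critical value with (r_i − 1)(r_j − 1) r_k degrees of freedom.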
3. A Bayesian Classifier Learning Algorithm Based on Optimization Model
3.1. Optimization Model Design. In this subsection, we give some basic concepts and theorems which are the foundation of the method proposed in this paper.
A Bayesian classifier is a graphical representation of a joint probability distribution that includes two components. One is a directed acyclic graph G = (V, E), where the node set V = {V1, V2, ..., Vn} represents the class and attribute variables, and the edge set E represents direct dependency relationships between variables. The other is a joint probability distribution Θ = {θ_i | θ_i = P(Vi | pa(Vi)), i = 1, 2, ..., n} that quantifies the effects pa(Vi) has on the variable Vi in G, where pa(Vi) = {Vj | Vj → Vi ∈ E}. We assume that V1 is the class node and {V2, V3, ..., Vn} is the set of attribute nodes. The structure of G reflects the underlying probabilistic dependence relations among the nodes and a set of assertions about conditional independencies. The problem of data classification can be stated as follows: the learning goal is first to find the classifier structure that best matches D and estimate the parameters using the training data set D, and then to assign class labels to test instances. Since G is a directed acyclic graph, it can be represented by a binary node-node adjacency matrix X = (x_ij). Entry (i, j) is 1 if there is a directed arc from node i to node j, and 0 otherwise. That is,

    x_ij = 1 if Vi → Vj ∈ E;  x_ij = 0 if Vi → Vj ∉ E.    (2)
Let R = X + X² + ··· + X^n be the sum of powers of the adjacency matrix. Entry r_ij of R is equal to the number of directed paths from node Vi to node Vj in the graph [25]. We wish to be able to use a mathematical programming formulation, and this formulation requires that we are able to measure the impact of adding or removing a single arc from the network. In order to approximate the impact of adding such an arc, we define the dependence coefficient.
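The path-count matrix R = X + X² + ··· + X^n can be sketched with plain integer matrix arithmetic; for a DAG every diagonal entry r_ii is 0, which is how the acyclicity constraint below is expressed. A minimal sketch (names are ours):

```python
def mat_mult(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def mat_add(A, B):
    n = len(A)
    return [[A[i][j] + B[i][j] for j in range(n)] for i in range(n)]

def path_count_matrix(X):
    """R = X + X^2 + ... + X^n; entry r_ij counts directed walks of length <= n
    from node i to node j (in a DAG, walks coincide with paths)."""
    n = len(X)
    R = [[0] * n for _ in range(n)]
    P = [row[:] for row in X]          # current power X^m, starting at X^1
    for _ in range(n):
        R = mat_add(R, P)
        P = mat_mult(P, X)
    return R

def is_acyclic(X):
    """Constraint (2) below: the graph is acyclic iff all r_ii = 0."""
    R = path_count_matrix(X)
    return all(R[i][i] == 0 for i in range(len(X)))
```

For example, for the chain V1 → V2 → V3 the matrix R records one path from node 1 to node 3 (via X²) and zeros on the diagonal.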
Definition 2. Given a data set D and a variable set V = {V1, V2, ..., Vn}, we define the dependence coefficient c_{ij,α} between variables Vi and Vj as c_{ij,α} = min_{k ≠ i,j} {U²_{ij|k} − χ²_{ij|k,α}}, where U²_{ij|k} is the U² statistic of Vi and Vj given Vk, and χ²_{ij|k,α} is the critical value at the significance level α of a χ² distribution with (r_i − 1)(r_j − 1) r_k degrees of freedom.
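Definition 2 can be sketched as follows. The paper uses the exact χ² critical value; to keep this sketch dependency-free we substitute the Wilson-Hilferty approximation for that critical value (our choice, not the paper's), which is accurate to a few percent except at very small degrees of freedom. The input format (a map from each conditioning node k to its precomputed U² statistic and degrees of freedom) is our own:

```python
import math
from statistics import NormalDist

def chi2_critical(df, alpha):
    """Upper-alpha critical value of the chi-square distribution, via the
    Wilson-Hilferty cube approximation (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(1.0 - alpha)
    t = 1.0 - 2.0 / (9.0 * df) + z * math.sqrt(2.0 / (9.0 * df))
    return df * t ** 3

def dependence_coefficient(u2_by_k, alpha):
    """c_{ij,alpha} = min over conditioning nodes k of (U2_{ij|k} - chi2_{df,alpha}).
    u2_by_k maps each candidate k to a pair (U2 statistic, degrees of freedom)."""
    return min(u2 - chi2_critical(df, alpha) for u2, df in u2_by_k.values())
```

A positive coefficient means the dependence survives every single-node conditioning; a negative one means at least one conditioning renders it insignificant, matching the discussion that follows.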
Obviously, c_{ij,α} is a conservative estimate of the degree of dependence between two nodes. If c_{ij,α} > 0, then, regardless of the other variable involved, there is statistically significant dependence between Vi and Vj, so there should be an arc between them. If c_{ij,α} < 0, then there is at least one way of conditioning the relationship so that significant dependence is not present. We define C = (c_ij), the dependence coefficient matrix corresponding to the variable set V; that is,

    c_ij = c_{ij,α} if i ≠ j;  c_ij = 0 otherwise.    (3)
Lemma 3. Given a data set D and a variable set V, Vi and Vj are locally conditionally independent at the significance level α if and only if there is a node Vk ∈ V such that U²_{ij|k} < χ²_{ij|k,α}.

The proof of Lemma 3 can be obtained directly from Definition 2 and the chi-squared hypothesis test. According to Lemma 3, Vi and Vj are locally conditionally independent at the significance level α if and only if there is a node Vk ∈ V such that c_ij < 0. Further, Vi and Vj are globally conditionally independent at the significance level α if and only if U²_{ij|k} < χ²_{ij|k,α} for any k ≠ i, j. Based on this, we use the dependence coefficients to construct an overall measure of the dependence, which will be treated as the objective function for our mathematical program.
Definition 4. For a Bayesian classifier with adjacency matrix X = (x_ij), the global dependence measure of the network is given by

    F(X, α) = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} c_ij x_ij.    (4)

According to the measure of Definition 4, if Vi and Vj are conditionally independent, then by Lemma 3 c_ij < 0, and hence adding an arc between Vi and Vj decreases the value of F. Thus, we wish to find the feasible solution which maximizes F. The optimal solution corresponds to the best classifier structure. We next explain what constitutes a feasible network.
Given the variable set V = {V1, V2, ..., Vn}, V1 is the class node and Vi, i = 2, ..., n, are the attribute nodes. A directed network is a feasible classifier structure if and only if the following conditions are satisfied:

(1) for any attribute node Vi ∈ V, i = 2, ..., n, there is no directed edge from Vi to V1;
(2) for any node Vi ∈ V, i = 1, 2, ..., n, there is no directed path from Vi to Vi; namely, the graph is acyclic;
(3) there exists at least one attribute node Vi ∈ V, i = 2, ..., n, which is dependent on the class node V1; namely, there is an attribute node Vi such that Vi can be reached from V1 by a directed path.
In order to incorporate the requirements of the above three conditions into a mathematical programming formulation, we express them by the following constraints:

(1) Σ_{i=2}^{n} x_{i1} = 0;
(2) Σ_{i=1}^{n} r_{ii} = 0;
(3) Σ_{j=2}^{n} r_{1j} ≥ 1.

Algorithm 1 (BC-OM)
(1) Input: data set D; variable set V = (V1, V2, ..., Vn) (V1 is the class node and the others are attribute nodes).
(2) For any pair of variables Vi and Vj contained in V, calculate the dependence coefficient c_ij by Definition 2.
(3) Solve the mathematical programming problem (OM) and obtain the optimal solution X* = (x_ij).
(4) Build the classifier structure G* = (V, E*) from X* = (x_ij).
(5) For any variable Vi ∈ V, i ≠ 1: if there is a subset Z ⊂ V \ {V1, Vi} such that Z d-separates Vi and V1, then delete Vi from G*, updating G* as V = V \ {Vi}, E* = E* \ {Vi → Vj or Vj → Vi : Vj ∈ adj(Vi)}.
(6) Output: the built classifier G* = (V, E*).
The feasible classifiers are those that satisfy constraints (1)–(3). Thus, learning the best Bayesian classifier can be transformed into the following related mathematical programming problem, where the objective function is a global dependence measure of the network and the feasible region is the set of classifiers satisfying the reachability constraints (1)–(3); that is,

    max  F(X, α) = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} c_ij x_ij
    s.t. Σ_{i=2}^{n} x_{i1} = 0,
         Σ_{i=1}^{n} r_{ii} = 0,
         Σ_{j=2}^{n} r_{1j} ≥ 1,
         x_ij ∈ {0, 1}.    (OM)
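For very small n, the feasibility conditions and the program (OM) can be checked by plain enumeration, which makes the model concrete. The sketch below (our own illustration; the paper does not specify a solver) indexes the class node as 0 and tests reachability with a transitive closure rather than the matrix-power form of R, which is equivalent for constraints (2) and (3):

```python
from itertools import product

def reach(X):
    """r[i][j] = 1 iff there is a directed path i -> ... -> j (Warshall closure)."""
    n = len(X)
    r = [row[:] for row in X]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if r[i][k] and r[k][j]:
                    r[i][j] = 1
    return r

def feasible(X):
    """Constraints (1)-(3) of (OM); node 0 plays the role of the class node V1."""
    n = len(X)
    r = reach(X)
    no_arc_into_class = all(X[i][0] == 0 for i in range(1, n))   # (1)
    acyclic = all(r[i][i] == 0 for i in range(n))                # (2)
    class_relevant = any(r[0][j] for j in range(1, n))           # (3)
    return no_arc_into_class and acyclic and class_relevant

def solve_om_bruteforce(C):
    """Enumerate all 0/1 adjacency matrices (tiny n only!) and return the
    feasible maximizer of F(X, alpha) together with its objective value."""
    n = len(C)
    cells = [(i, j) for i in range(n) for j in range(n) if i != j]
    best, best_val = None, float('-inf')
    for bits in product([0, 1], repeat=len(cells)):
        X = [[0] * n for _ in range(n)]
        for (i, j), b in zip(cells, bits):
            X[i][j] = b
        if not feasible(X):
            continue
        val = sum(C[i][j] * X[i][j] for i, j in cells)
        if val > best_val:
            best, best_val = X, val
    return best, best_val
```

With three variables and coefficients c that make both class-attribute pairs dependent and the attribute-attribute pair independent (negative coefficient), the maximizer is exactly the naive Bayes structure, as the discussion of Definition 4 predicts.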
3.2. BC-OM Algorithm and Its Correctness. In this subsection, we present the main algorithm of this paper. Our method starts with finding the best Bayesian classifier by solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes in the network, which have a low degree of dependence with the class variable; the parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class labels of test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
Given a directed acyclic graph G = (V, E), where V is the node set and E the set of directed edges, a path ρ between two distinct nodes V1 and Vl is a sequence of distinct nodes in which the first node is V1, the last one is Vl, and two consecutive nodes are connected by an edge; that is, ρ = V1 e1 V2 e2 ··· e_{l−1} Vl, where e_i denotes Vi → V_{i+1} or Vi ← V_{i+1} for i = 1, 2, ..., (l − 1).
Definition 5. A path ρ is said to be d-separated by a set Z in a directed acyclic graph G if and only if (1) ρ contains a "head-to-tail meeting" Vi → Vj → Vk or a "tail-to-tail meeting" Vi ← Vj → Vk such that the middle node Vj is in Z, or (2) ρ contains a "head-to-head meeting" Vi → Vj ← Vk such that the middle node Vj is not in Z and no descendant of Vj is in Z. Specially, two distinct sets of nodes X and Y are said to be d-separated by a set Z in G if Z d-separates every path from any node in X to any node in Y [26].
In this paper, we assume that all the distributions are compatible with G [27]. We also assume that all independencies of a probability distribution of variables in V can be checked by d-separations of G, called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by G. We now formally describe our method as Algorithm 1.
From the detailed steps of BC-OM, we can see that the BC-OM classifier relaxes the restrictions on condition variables and further meets the needs of practical application. Since its network structure is similar to BAN's, BC-OM does not need to build all possible networks in which the class node is a root, and it removes irrelevant or redundant nodes from the network before the process of estimating the network parameters, which greatly reduces the calculation of the posterior probability of the class variable. In fact, the training process of BC-OM is different from that of other BN classifiers: its main task is to solve the mathematical programming problem (OM). To create the dependence coefficient matrix corresponding to (OM), BC-OM needs to compute the conditional statistics U²_{ij|k}. Moreover, just as for other constraint-based algorithms, the main cost of BC-OM is the number of conditional independence tests for computing the dependence coefficients of any two variables in step 2. The number of conditional independence tests is C²_n · C¹_{n−2}, and the computing complexity is O(n³). The total complexity of BC-OM is bounded by O(n³ · N), where n is the number of variables in the network and N is the number of cases in the data set D. In principle, BC-OM is a structure-extension-based algorithm. In BC-OM, we essentially extend the structure of TAN by relaxing the parent set of each attribute node. Thus, the resulting structure is more complex than TAN but simpler than BAN. Therefore, BC-OM is a good tradeoff between model complexity and accuracy compared with TAN and BAN. Next, we prove the correctness of the BC-OM algorithm under the faithfulness assumption.
The next two results establish the existence and uniqueness properties of the solution to (OM).
Theorem 6. Let 𝒳 = {X = (x_ij)_{n×n} | x_ij ∈ {0, 1}}. There always exists an X ∈ 𝒳 such that X is a feasible point of (OM).

Proof. Given the set of variables V = {V1, V2, ..., Vn}, where V1 is the class variable and V2, ..., Vn are the attribute variables, consider the matrix X defined by

    x_ij = 1 if i = 1 and j = 2, ..., n;  x_ij = 0 otherwise.    (5)

Obviously, this adjacency matrix X satisfies constraints (1)–(3). In fact, the graph represented by X is the naive Bayes classifier. Thus, X is a feasible solution of (OM).
According to Theorem 6, there exists a feasible classifier which satisfies constraints (1)–(3). Theorem 7 further shows that such a classifier is unique under a certain condition.
Theorem 7. Let X* be the optimal solution of (OM), and let C1 = {c_ij | x_ij = 1} and C2 = {c_ij | x_ij = 0} be the coefficient sets, where x_ij is an element of X*. Then X* is the unique solution of (OM) if and only if no element of C1 can be expressed as the sum of any number of elements of C2.
Proof. Without loss of generality, suppose, by reduction to absurdity, that X^(1) and X^(2) are two optimal solutions of (OM). The value of the objective function is the same in both solutions; that is,

    Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} c_ij x^(1)_ij = Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} c_ij x^(2)_ij    (6)

    ⟹ Σ_{i=1}^{n} Σ_{j=1, j≠i}^{n} c_ij (x^(1)_ij − x^(2)_ij) = 0.    (7)

Let C^(1)_1 = {c_ij | x^(1)_ij = 1} and C^(1)_2 = {c_ij | x^(1)_ij = 0}. According to the assumption that X^(1) ≠ X^(2) and x^(1)_ij, x^(2)_ij ∈ {0, 1}, there must exist 1 ≤ i0, j0 ≤ n such that x^(1)_{i0 j0} − x^(2)_{i0 j0} = 1, namely, x^(1)_{i0 j0} = 1 and x^(2)_{i0 j0} = 0. Then, by (7),

    c_{i0 j0} + Σ_{i ≠ i0} Σ_{j ≠ j0} c_ij (x^(1)_ij − x^(2)_ij) = 0.    (8)

Since

    x^(1)_ij − x^(2)_ij = 1 if x^(1)_ij = 1, x^(2)_ij = 0;  0 if x^(1)_ij = x^(2)_ij;  −1 if x^(1)_ij = 0, x^(2)_ij = 1,    (9)

then, by (8), there must exist c_{i1 j1}, ..., c_{ik jk} ∈ C^(1)_2 such that x^(1)_{ir jr} − x^(2)_{ir jr} = −1 for r = 1, ..., k and c_{i0 j0} + Σ_{r=1}^{k} c_{ir jr} (x^(1)_{ir jr} − x^(2)_{ir jr}) = 0; namely,

    c_{i0 j0} = Σ_{r=1}^{k} c_{ir jr},    (10)

where x^(1)_{i0 j0} = 1 and x^(1)_{ir jr} = 0 for r = 1, ..., k. This contradicts the condition that no element of C^(1)_1 can be expressed as the sum of any number of elements of C^(1)_2.
Theorem 8. Let G = (V, E), V = (V1, V2, ..., Vn), be the classifier structure obtained by step 4 of BC-OM, where V1 is the class variable and V2, ..., Vn are attribute variables, and let G* = (V*, E*) denote the final output of BC-OM. Then the classification results obtained by G and G* are consistent.
Proof. Without loss of generality, suppose (d_{i2}, ..., d_{in}) is an example to be classified. The classifier represented by G is given as follows:

    v1* = arg max_{v1 ∈ dom(V1)} p(v1 | v2 = d_{i2}, ..., vn = d_{in}).    (11)

We write the right side of (11) as p(v1 | v2, ..., vn) for short. We can suppose that k redundant variables were deleted in step 5 of BC-OM, say the last k variables V_{n−k+1}, V_{n−k+2}, ..., Vn. Then V* = {V1, V2, ..., V_{n−k}}. According to step 5, {V2, ..., V_{n−k}} d-separates V1 and {V_{n−k+1}, ..., Vn}. Thus, V1 is conditionally independent of {V_{n−k+1}, ..., Vn} given {V2, ..., V_{n−k}}. Equation (11) can be reduced as follows:

    v1* = arg max_{v1 ∈ dom(V1)} p(v1, v2, ..., vn) / p(v2, ..., vn)
    ⟺ v1* = arg max_{v1 ∈ dom(V1)} p(v1, v2, ..., vn)
    ⟺ v1* = arg max_{v1 ∈ dom(V1)} p(v1, v_{n−k+1}, ..., vn | v2, ..., v_{n−k}) · p(v2, ..., v_{n−k})
    ⟺ v1* = arg max_{v1 ∈ dom(V1)} p(v1 | v2, ..., v_{n−k}) · p(v_{n−k+1}, ..., vn | v2, ..., v_{n−k})
    ⟺ v1* = arg max_{v1 ∈ dom(V1)} p(v1 | v2, ..., v_{n−k}),    (12)

where the second-to-last step uses the conditional independence above, and the last step drops the factor that does not involve v1. This gives the result.
Theorem 8 reveals that it is effective and correct to remove redundant or irrelevant attributes using the d-separation rule, and that the performance of the Bayesian classifier can thereby be improved.
4. Experimental Results
We ran our experiments on 20 data sets from the UCI repository of machine learning data sets [28], which represent a wide range of domains and data characteristics. Table 1 shows the description of the 20 data sets, which are ordered by ascending number of samples. In our experiments, missing values are replaced with the modes and means of the corresponding attribute values from the available data. For example, if the sex of someone is missing, it can be replaced by the mode (the value with the highest frequency) of the sexes of all the others. Besides, we manually delete three
Table 1: Descriptions of the UCI data sets used in the experiments.

No.  Dataset            Instances  Attributes  Classes  Missing values (N/Y)
1    Shuttle                   15           6        2  N
2    Balloon                   16           4        2  N
3    Lenses                    24           4        3  N
4    Soybean (small)           47          35        4  N
5    Post-patient              90           8        3  Y
6    Zoo                      101          17        7  N
7    Hayes-roth               132           5        3  N
8    Iris                     150           4        3  N
9    Glass                    214          10        7  N
10   Heart                    267          22        2  N
11   Ionosphere               351          34        2  N
12   Monk's                   432           7        2  N
13   Vote                     435          16        2  Y
14   Balance-scale            625           5        3  N
15   Tic-tac-toe              958           9        2  N
16   Car-evaluation          1728           6        4  N
17   Chess (kr-vs-kp)        3196          36        2  N
18   Mushroom                8124          22        2  Y
19   Nursery                12960           8        5  N
20   Letter                 20000          16       26  N
useless attributes: the attribute "ID number" in the data set "Glass", the attribute "name" in the data set "Hayes-roth", and the attribute "animal name" in the data set "Zoo".
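The mode/mean imputation described above can be sketched per column as follows (a hypothetical helper of ours, not the paper's Matlab code; missing entries are represented as None):

```python
from collections import Counter
from statistics import mean

def impute_column(values, numeric):
    """Replace None entries with the column mean (numeric attributes)
    or the column mode (categorical attributes)."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if numeric else Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]
```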
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB memory, and Windows XP. Our implementation is based on the Bayes Net Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is based on the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets. In particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compared the related algorithms via a two-tailed t-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set, and the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another's performance is not significantly different from the best, it is also highlighted; but if the differences between all four classifiers are not statistically significant, then none of them is highlighted. From our experiments, we can see that BC-OM is best in 6 cases; NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although it can be seen that the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. From a general point of view, we can see that, from the first data set to the last one, the highlighted numbers change from few to more in the sixth column of Table 2. This means the advantage of BC-OM is more evident with the increase of data size.
Table 3 shows the compared results of the two-tailed t-test, in which each entry w/t/l means that the model in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets compared to the model in the corresponding column. From Table 3, we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set, where the x coordinate of the point is the percentage of misclassifications according to NB, TAN, or BAN, and the y coordinate is the percentage of misclassifications according to BC-OM. Thus, points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3, we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers, both in terms of accuracy as well as
Table 2: The detailed experimental results on accuracy and standard deviation.

No.  Dataset            NB             TAN            BAN            BC-OM
1    Shuttle            83.52 ± 5.85   85.33 ± 7.03   84.43 ± 6.83   82.33 ± 5.94
2    Balloon            89.15 ± 3.06   85.16 ± 3.15   87.16 ± 2.92   83.75 ± 3.57
3    Lenses             92.56 ± 2.44   93.25 ± 2.85   89.67 ± 3.16   91.12 ± 3.02
4    Soybean (small)    90.91 ± 2.35   84.98 ± 3.16   87.36 ± 2.44   89.74 ± 3.23
5    Post-patient       80.29 ± 3.39   79.94 ± 3.47   81.26 ± 3.56   80.12 ± 3.48
6    Zoo                93.46 ± 5.12   95.21 ± 4.78   94.34 ± 5.16   94.88 ± 5.09
7    Hayes-roth         81.82 ± 5.27   87.34 ± 4.89   84.58 ± 3.74   84.55 ± 3.56
8    Iris               94.13 ± 6.12   93.47 ± 6.65   93.81 ± 6.80   89.07 ± 7.25
9    Glass              66.02 ± 9.88   59.67 ± 9.52   58.32 ± 9.98   56.96 ± 9.81
10   Heart              82.41 ± 6.62   82.53 ± 7.01   82.65 ± 7.03   81.53 ± 6.91
11   Ionosphere         90.54 ± 4.03   90.61 ± 3.98   91.45 ± 4.14   90.99 ± 4.82
12   Monk's             84.28 ± 1.14   83.29 ± 1.29   83.79 ± 1.23   86.64 ± 1.02
13   Vote               94.88 ± 3.57   90.37 ± 3.34   92.63 ± 3.98   90.97 ± 2.95
14   Balance-scale      90.91 ± 1.51   84.97 ± 2.44   87.94 ± 2.31   90.77 ± 1.85
15   Tic-tac-toe        70.11 ± 4.45   73.16 ± 4.47   75.74 ± 4.83   76.21 ± 3.82
16   Car-evaluation     84.95 ± 2.74   93.44 ± 1.89   89.2 ± 1.02    95.42 ± 1.03
17   Chess (kr-vs-kp)   87.88 ± 1.91   91.68 ± 1.66   90.27 ± 1.12   92.65 ± 1.4
18   Mushroom           92.85 ± 0.56   99.89 ± 0.1    96.37 ± 0.76   99.91 ± 0.1
19   Nursery            90.54 ± 1.14   94.07 ± 1.29   92.31 ± 1.23   89.95 ± 1.09
20   Letter             69.32 ± 3.97   77.58 ± 1.83   73.45 ± 1.67   77.66 ± 2.01
     Average            86.85 ± 4.81   86.63 ± 3.7    85.69 ± 3.69   86.76 ± 3.57
Table 3: The compared results of the two-tailed t-test on accuracy with the 95 percent confidence level.

w/t/l    NB       TAN      BAN
TAN      13/3/4   —        —
BAN      12/4/4   10/5/5   —
BC-OM    9/7/4    12/3/5   11/4/5
the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in the results of Figures 2 and 3. In other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.
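The win/tie/loss entries of Table 3 come from a paired two-tailed t-test over per-dataset accuracies. The sketch below shows how such a comparison could be computed; the function names, the sample accuracies, and the hardcoded critical value (from a t-table, 4 degrees of freedom at the 95% level) are illustrative assumptions, not the paper's actual data.

```python
from math import sqrt

def paired_t_statistic(xs, ys):
    """Two-tailed paired t statistic over per-dataset accuracies of two classifiers."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    return mean / sqrt(var / n)  # t statistic with n - 1 degrees of freedom

def win_tie_loss(t, t_crit):
    """Classify one row-vs-column comparison as in Table 3's w/t/l entries."""
    if t > t_crit:
        return "win"
    if t < -t_crit:
        return "loss"
    return "tie"

# Hypothetical accuracies of two classifiers on five data sets.
acc_a = [0.92, 0.88, 0.95, 0.90, 0.93]
acc_b = [0.85, 0.86, 0.89, 0.84, 0.88]
t = paired_t_statistic(acc_a, acc_b)
verdict = win_tie_loss(t, 2.776)  # 2.776: two-tailed 95% critical value, df = 4
```

Repeating this for every pair of classifiers and counting the verdicts per data set yields the w/t/l triples of Table 3.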
5 Conclusions
In many real-world applications, classification is often required to make optimal decisions. In this paper we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model, BC-OM. We conducted a systematic experimental study on a number of UCI datasets. The experimental results show that BC-OM has a better performance compared to the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it would be useful to model correlations among
Figure 2: Relative errors of BC-OM and NB (NB error on the x-axis, BC-OM error on the y-axis).
attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.
In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the dataset. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and make
Figure 3: Relative errors of BC-OM and TAN (TAN error on the x-axis, BC-OM error on the y-axis).
Figure 4: Relative errors of BC-OM and BAN (BAN error on the x-axis, BC-OM error on the y-axis).
its advantage stronger. This is the main research direction for our future work.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).
References
[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, Bayes Net Toolbox for Matlab, 2007, https://code.google.com/p/bnt.
[30] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
Figure 1: Illustration of NB, TAN, and BAN structures: (a) NB, (b) TAN, (c) BAN, each over the class node V1 and attribute nodes V2–V5.
function in feasible fields; the function extremum corresponds to the best classifier. Finally, BC-OM improves the efficiency of classification and deletes irrelevant or redundant attributes by using the d-separation rule of Bayesian networks. Theoretical and experimental results show that the proposed algorithm is not only effective in improving the accuracy but also has a high learning speed and a simple solving procedure.
The remainder of this paper is organized as follows. Section 2 reviews the existing Bayesian network classifiers. We describe our algorithm and its theoretical proofs in Section 3. Section 4 details the experimental procedures and results of the proposed algorithm. Finally, in Section 5, we conclude and outline our future work.
2 Background
In this section we discuss previous work that is relevant to this paper and first describe some of the notation used. We use boldface capital letters such as V, E, X for sets of variables. General variables are denoted by italic capital letters or indexed italic capital letters X, R, V_i; specific values taken by these variables are denoted x, r, v. Specially, we use the same italic letters X, Y, V_i for graph nodes, which correspond with the random variables.
Classification is a basic task in data analysis and pattern recognition that requires the construction of a classifier, that is, a function that assigns a class label to instances described by a set of attributes. The induction of classifiers from data sets of preclassified instances is a central problem in machine learning. Let V = {V_1, V_2, ..., V_n} represent the variable set which corresponds with the training data set D. We assume that V_1 is the class variable and {V_2, V_3, ..., V_n} is the set of attribute variables. Bayesian networks are often used for classification problems, in which the main task is to construct the classifier structure G from a given set of training data with class labels and then compute the posterior probability P_G(v_1 | V_2, V_3, ..., V_n), where v_1 is the value that V_1 takes. Thus it only needs to predict the class with the highest value of the probability P_G(v_1 | V_2, V_3, ..., V_n), that is,

v*_1 = arg max_{v_1 ∈ dom(V_1)} P_G(v_1 | V_2, V_3, ..., V_n).  (1)

According to Bayes' theorem, maximizing P_G(v_1 | V_2, V_3, ..., V_n) is equivalent to maximizing P(v_1) · P_G(V_2, V_3, ..., V_n | v_1). The difference between the existing Bayesian classifiers is the computing mode of P_G(·).
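As a concrete illustration of prediction rule (1), the following sketch picks the class value maximizing P(v_1) · P_G(attributes | v_1) under a naive-Bayes-style factorization; the function name and the probability tables are hypothetical, not taken from the paper.

```python
def predict_class(attr_values, priors, cond_tables):
    """Rule (1): return the v1 maximizing P(v1) * prod_i P(V_i = a_i | v1).

    priors:      {class_value: P(v1)}
    cond_tables: {class_value: [ {attr_value: P(V_i = a | v1)} per attribute ]}
    (a naive-Bayes factorization of P_G; any other factorization would only
    change how the per-class score is accumulated)
    """
    best, best_score = None, -1.0
    for v1, prior in priors.items():
        score = prior
        for table, a in zip(cond_tables[v1], attr_values):
            score *= table[a]  # multiply in P(V_i = a | v1)
        if score > best_score:
            best, best_score = v1, score
    return best

# Illustrative CPTs for two binary attributes and classes "+" / "-".
priors = {"+": 0.6, "-": 0.4}
tables = {"+": [{"a": 0.9, "b": 0.1}, {"x": 0.7, "y": 0.3}],
          "-": [{"a": 0.2, "b": 0.8}, {"x": 0.4, "y": 0.6}]}
```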
Figure 1 schematically illustrates the structures of the Bayesian classifiers considered in this paper. In naive Bayes, each attribute node has the class node as its parent but does not have any parent from the attribute nodes. Computing P_G(V_2, V_3, ..., V_n | v_1) is equal to ∏_{i=2}^{n} P_G(V_i | v_1). Because the values of P_G(v_1) and P_G(V_i | v_1) can be easily estimated from training examples, naive Bayes is easy to construct. However, its prerequisites of the conditional independence assumption and data completeness limit its real application. TAN takes the naive Bayes structure and adds edges to it, in which the class node directly points to all attribute nodes and an attribute node can have only one parent from another attribute node. Computing P_G(V_2, V_3, ..., V_n | v_1) is equivalent to ∏_{i=2}^{n} P_G(V_i | V_j, v_1), where V_j ∈ {V_1, V_2, ..., V_{i−1}}. It is an efficient extension of naive Bayes. BAN is a specific case of the general Bayesian network classifier, in which the class node also directly points to all attribute nodes, but there is no limitation on the arcs among attribute nodes (except that they do not form any directed cycle). It is clear that TAN and BAN are useful to model correlations among attribute nodes that cannot be captured by naive Bayes. They embody a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. In addition, existing algorithms use the same idea to construct the structure of a Bayesian classifier: they first learn the dependence relationships among attribute variables using a Bayesian network structure learning algorithm and then add the class variable as the root node of the network. This is equivalent to learning the best Bayesian network among those in which V_1 is a root. Thus, even if we could improve the performance of a naive Bayes classifier in this way, the computational effort required may not be worthwhile.

Based on the above analysis, this paper presents an optimization model to learn the structure of a Bayesian classifier, inspired by the constraint-based Bayesian network structure learning method. It is the first time that a problem of structural learning for a Bayesian classifier is transformed into its related mathematical programming problem by defining an objective function and a feasible region. We also propose a new method to measure the dependence relationships between attributes. The theoretical basis of this method is established by Theorem 1 [24].
Theorem 1. Given a data set D and a variable set V = {V_1, V_2, ..., V_n}, if the hypothesis that V_i and V_j are conditionally independent given V_k is true, then the statistic

U²_{ij|k} = 2 ∑_{a,b,c} N^{abc}_{ijk} log[ N^{abc}_{ijk} N^{c}_{k} / (N^{ac}_{ik} N^{bc}_{jk}) ]

approximates a χ²(l) distribution with l = (r_i − 1)(r_j − 1)r_k degrees of freedom, where r_i, r_j, and r_k represent the numbers of configurations for the variables V_i, V_j, and V_k, respectively; N^{abc}_{ijk} is the number of cases in D where V_i = a, V_j = b, and V_k = c; N^{ac}_{ik} is the number of cases in D where V_i = a and V_k = c; and N^{c}_{k} is the number of cases in D where V_k = c.
3. A Bayesian Classifier Learning Algorithm Based on Optimization Model

3.1. Optimization Model Design. In this subsection we give some basic concepts and theorems which are the foundation of the method proposed in this paper.
A Bayesian classifier is a graphical representation of a joint probability distribution that includes two components. One is a directed acyclic graph G = (V, E), where the node set V = {V_1, V_2, ..., V_n} represents the class and attribute variables and the edge set E represents direct dependency relationships between variables. The other is a joint probability distribution Θ = {θ_i | θ_i = P(V_i | pa(V_i)), i = 1, 2, ..., n} that quantifies the effects that pa(V_i) has on the variable V_i in G, where pa(V_i) = {V_j | V_j → V_i ∈ E}. We assume that V_1 is the class node and {V_2, V_3, ..., V_n} is the set of attribute nodes. The structure of G reflects the underlying probabilistic dependence relations among the nodes and a set of assertions about conditional independencies. The problem of data classification can be stated as follows: the learning goal is first to find the classifier structure that best matches D and estimate the parameters using the training data set D, and then to assign a class label to test instances. Since G is a directed acyclic graph, it can be represented by a binary node-node adjacency matrix X = (x_ij). Entry (i, j) is 1 if there is a directed arc from node i to node j, and 0 otherwise. That is,

x_ij = 1 if V_i → V_j ∈ E; x_ij = 0 if V_i → V_j ∉ E.  (2)

Let R = X + X² + ⋯ + Xⁿ be the sum of powers of the adjacency matrix. Entry r_ij ∈ R is equal to the number of directed paths from node V_i to node V_j in the graph [25].

We wish to be able to use a mathematical programming formulation, and this formulation requires that we are able to measure the impact of adding or removing a single arc from the network. In order to approximate the impact of adding such an arc, we define the dependence coefficient.
Definition 2. Given a data set D and a variable set V = {V_1, V_2, ..., V_n}, we define the dependence coefficient c_{ij,α} between variables V_i and V_j as c_{ij,α} = min_{k ≠ i,j} {U²_{ij|k} − χ²_{ij|k,α}}, where U²_{ij|k} is the U² statistic of V_i and V_j given V_k, and χ²_{ij|k,α} is the critical value at the significance level α of a χ² distribution with (r_i − 1)(r_j − 1)r_k degrees of freedom.

Obviously, c_{ij,α} is a conservative estimate of the degree of dependence between two nodes. If c_{ij,α} > 0, then regardless of the other variable involved, there is statistically significant dependence between V_i and V_j, so there should be an arc between them. If c_{ij,α} < 0, then there is at least one way of conditioning the relationship so that significant dependence is not present. We define C = (c_ij), the dependence coefficient matrix corresponding to the variable set V, that is,

c_ij = c_{ij,α} if i ≠ j; c_ij = 0 otherwise.  (3)
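Definition 2 reduces to one min over conditioning nodes once the U² statistics and the χ² critical values are in hand. A minimal sketch, where both are supplied as inputs (in practice the critical values would come from a χ² table or a statistics library; the values below are illustrative):

```python
def dependence_coefficient(u2_given_k, crit_given_k):
    """c_{ij,alpha} = min over conditioning nodes k of (U^2_{ij|k} - chi2_{ij|k,alpha}).

    u2_given_k:   {k: U^2 statistic of V_i and V_j given V_k}
    crit_given_k: {k: chi-squared critical value at level alpha with
                      (r_i - 1)(r_j - 1) r_k degrees of freedom}
    """
    return min(u2_given_k[k] - crit_given_k[k] for k in u2_given_k)

# Illustrative values: conditioning on k = 3 already kills the dependence,
# so the conservative coefficient is negative.
c = dependence_coefficient({2: 10.0, 3: 4.0}, {2: 3.84, 3: 7.81})
```

A negative c signals that some conditioning node renders V_i and V_j independent at level α, so by Lemma 3 no arc between them is warranted.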
Lemma 3. Given a data set D and a variable set V, V_i and V_j are locally conditionally independent at the significance level α if and only if there is a node V_k ∈ V such that U²_{ij|k} < χ²_{ij|k,α}.

The proof of Lemma 3 can be obtained directly from Definition 2 and the chi-square hypothesis test. According to Lemma 3, V_i and V_j are locally conditionally independent at the significance level α if and only if there is a node V_k ∈ V such that c_ij < 0. Further, V_i and V_j are globally conditionally independent at the significance level α if and only if U²_{ij|k} < χ²_{ij|k,α} for any k ≠ i, j. Based on this, we use the dependence coefficients to construct an overall measure of the dependence, which will be treated as the objective function for our mathematical program.
Definition 4. For a Bayesian classifier with adjacency matrix X = (x_ij), the global dependence measure of the network is given by

F(X, α) = ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} c_ij x_ij.  (4)

According to the measure of Definition 4, if V_i and V_j are conditionally independent, then by Lemma 3 c_ij < 0, and hence adding an arc between V_i and V_j decreases the value of F. Thus we wish to find the feasible solution which increases F; the optimal solution corresponds to the best classifier structure. We next explain what constitutes a feasible network.
Given the variable set V = {V_1, V_2, ..., V_n}, V_1 is the class node and V_i, i = 2, ..., n, are the attribute nodes. A directed network is a feasible classifier structure if and only if the following conditions are satisfied:

(1) for any attribute node V_i ∈ V, i = 2, ..., n, there is no directed edge from V_i to V_1;
(2) for any node V_i ∈ V, i = 1, 2, ..., n, there is no directed path from V_i to V_i, namely, the graph is acyclic;
(3) there exists at least one attribute node V_i ∈ V, i = 2, ..., n, which is dependent with the class node V_1, namely, there is an attribute node V_i such that V_i can be reached from V_1 by a directed path.

In order to incorporate the requirements of the above three conditions into a mathematical programming formulation, we express them by the following constraints:

(1) ∑_{i=2}^{n} x_{i1} = 0;
(1) Input: data set D; variable set V = (V_1, V_2, ..., V_n) (V_1 is the class node and the others are attribute nodes).
(2) For any pair of variables V_i and V_j contained in V, calculate the dependence coefficient c_ij by Definition 2.
(3) Solve the mathematical programming (OM) and obtain the optimal solution X* = (x_ij).
(4) Build the classifier structure G* = (V, E*) from X* = (x_ij).
(5) For any variable V_i ∈ V, i ≠ 1: if there is a subset Z ⊂ V \ {V_1, V_i} such that Z d-separates V_i and V_1, then delete V_i from G*, updating G* as V = V \ {V_i}, E* = E* \ {V_i → V_j or V_j → V_i : V_j ∈ adj(V_i)}.
(6) Output: the built classifier G* = (V, E*).

Algorithm 1: BC-OM.
(2) ∑_{i=1}^{n} r_ii = 0;
(3) ∑_{j=2}^{n} r_{1j} ≥ 1.

The feasible classifiers are those that satisfy constraints (1)–(3). Thus learning the best Bayesian classifier can be transformed into the following related mathematical programming problem, where the objective function is a global dependence measure of the network and the feasible region is the set of classifiers with reachability constraints (1)–(3), that is:

max F(X, α) = ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} c_ij x_ij

s.t. ∑_{i=2}^{n} x_{i1} = 0,
     ∑_{i=1}^{n} r_ii = 0,
     ∑_{j=2}^{n} r_{1j} ≥ 1,
     x_ij ∈ {0, 1}.  (OM)
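The constraints of (OM) can be checked directly for any candidate adjacency matrix, using R = X + X² + ⋯ + Xⁿ for the reachability terms. A minimal sketch in pure Python (the function name is ours; a real solver would of course search over X rather than merely test it):

```python
def is_feasible_classifier(X):
    """Check constraints (1)-(3) of (OM) for a 0/1 adjacency matrix X whose
    index 0 plays the role of the class node V1."""
    n = len(X)
    # R = X + X^2 + ... + X^n via repeated multiplication; R[i][j] counts
    # directed paths from node i to node j.
    R = [[0] * n for _ in range(n)]
    P = [row[:] for row in X]  # current power X^m
    for _ in range(n):
        R = [[R[i][j] + P[i][j] for j in range(n)] for i in range(n)]
        P = [[sum(P[i][k] * X[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    no_arc_into_class = all(X[i][0] == 0 for i in range(1, n))   # constraint (1)
    acyclic = all(R[i][i] == 0 for i in range(n))                # constraint (2)
    class_reaches_attr = any(R[0][j] >= 1 for j in range(1, n))  # constraint (3)
    return no_arc_into_class and acyclic and class_reaches_attr
```

The naive Bayes matrix (all arcs from the class node, nothing else) passes all three constraints, while a matrix with an arc into the class node, a directed cycle, or no arc out of the class node fails.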
3.2. BC-OM Algorithm and Its Correctness. In this subsection we present the main algorithm of this paper. Our method starts with finding the best Bayesian classifier by solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes in the network, which have a low degree of dependence with the class variable; the parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class label of test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
Given a directed acyclic graph G = (V, E), where V is the node set and E the set of directed edges, a path ρ between two distinct nodes V_1 and V_l is a sequence of distinct nodes in which the first node is V_1, the last one is V_l, and two consecutive nodes are connected by an edge, that is, ρ = V_1 e_1 V_2 ⋯ e_{l−1} V_l, where e_i denotes V_i → V_{i+1} or V_i ← V_{i+1} for i = 1, 2, ..., (l − 1).
Definition 5. A path ρ is said to be d-separated by a set Z in a directed acyclic graph G if and only if (1) ρ contains a "head-to-tail meeting" V_i → V_j → V_k or a "tail-to-tail meeting" V_i ← V_j → V_k such that the middle node V_j is in Z, or (2) ρ contains a "head-to-head meeting" V_i → V_j ← V_k such that the middle node V_j is not in Z and no descendant of V_j is in Z. Specially, two distinct sets of nodes X and Y are said to be d-separated by a set Z in G if Z d-separates every path from any node in X to any node in Y [26].
In this paper we assume that all the distributions are compatible with G [27]. We also assume that all independencies of a probability distribution of variables in V can be checked by d-separations of G, called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by G. Now we formally describe our method in Algorithm 1.
From the detailed steps of BC-OM, we can see that the BC-OM classifier relaxes the restrictions on condition variables and further meets the needs of practical application. Since its network structure is similar to that of BAN, BC-OM does not need to build all possible networks in which the class node is a root, and it removes irrelevant or redundant nodes from the network before the process of estimating the network parameters, which greatly reduces the calculation of the posterior probability of the class variable. In fact, the training process of BC-OM is different from that of other BN classifiers: its main task is to solve the mathematical programming (OM). To create the dependence coefficient matrix corresponding to (OM), BC-OM needs to compute the conditional statistics U²_{ij|k}. Moreover, just as for other constraint-based algorithms, the main cost of BC-OM is the number of conditional independence tests for computing the dependence coefficients of any two variables in step 2. The number of conditional independence tests is C²_n · C¹_{n−2} and the computing complexity is O(n³). The total complexity of BC-OM is bounded by O(n³ · N), where n is the number of variables in the network and N is the number of cases in the data set D. In principle, BC-OM is a structure-extension-based algorithm: we essentially extend the structure of TAN by relaxing the parent set of each attribute node. Thus the resulting structure is more complex than TAN but simpler than BAN. Therefore BC-OM is a good tradeoff between model complexity and accuracy compared with TAN and BAN. Next we prove the correctness of the BC-OM algorithm under the faithfulness assumption.
The next two results establish the existence and uniqueness properties of solutions to (OM).
Theorem 6. Let X = {(x_ij)_{n×n} | x_ij ∈ {0, 1}}. There always exists an X ∈ X such that X is a feasible point of (OM).

Proof. Given the set of variables V = {V_1, V_2, ..., V_n}, where V_1 is the class variable and V_2, ..., V_n are the attribute variables, we give a matrix X as follows:

x_ij = 1 if i = 1, j = 2, ..., n; x_ij = 0 otherwise.  (5)

Obviously the adjacency matrix X always satisfies the constraints (1)–(3). In fact, the graph represented by X is the naive Bayes classifier. Thus X is a feasible solution of (OM).
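The feasible point of equation (5) can be written down and checked against the three constraints directly; a minimal sketch (the function name is ours, with node index 0 standing in for V_1):

```python
def naive_bayes_matrix(n):
    """Adjacency matrix of eq. (5): arcs V1 -> Vj for j = 2, ..., n and
    nothing else, i.e. the naive Bayes structure."""
    return [[1 if i == 0 and j > 0 else 0 for j in range(n)] for i in range(n)]

X = naive_bayes_matrix(5)
# Constraint (1): no arc enters the class node.
assert all(X[i][0] == 0 for i in range(1, 5))
# Constraint (2): X is strictly upper triangular, hence the graph is acyclic.
assert all(X[i][j] == 0 for i in range(5) for j in range(i + 1))
# Constraint (3): the class node reaches every attribute by a direct arc.
assert all(X[0][j] == 1 for j in range(1, 5))
```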
According to Theorem 6, there exists a feasible classifier which satisfies constraints (1)–(3). Theorem 7 further shows that such a classifier is unique under a certain condition.
Theorem 7. Let X* be the optimal solution of (OM), and let C_1 = {c_ij | x_ij = 1} and C_2 = {c_ij | x_ij = 0} be the coefficient sets, where x_ij is an element of X*. X* is the unique solution of (OM) if and only if no element of C_1 can be expressed as the sum of any number of elements of C_2.
Proof. Without loss of generality, we suppose by reduction to absurdity that X^(1) and X^(2) are two optimal solutions of (OM). The value of the objective function is the same at both solutions, that is,

∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} c_ij x^(1)_ij = ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} c_ij x^(2)_ij  (6)

⇒ ∑_{i=1}^{n} ∑_{j=1, j≠i}^{n} c_ij (x^(1)_ij − x^(2)_ij) = 0.  (7)

Let C^(1)_1 = {c_ij | x^(1)_ij = 1} and C^(1)_2 = {c_ij | x^(1)_ij = 0}. According to the assumption that X^(1) ≠ X^(2) and x^(1)_ij, x^(2)_ij ∈ {0, 1}, there must exist 1 ≤ i_0, j_0 ≤ n such that x^(1)_{i_0 j_0} − x^(2)_{i_0 j_0} = 1, namely, x^(1)_{i_0 j_0} = 1 and x^(2)_{i_0 j_0} = 0. Then by (7),

c_{i_0 j_0} + ∑_{i ≠ i_0} ∑_{j ≠ j_0} c_ij (x^(1)_ij − x^(2)_ij) = 0.  (8)

Since

x^(1)_ij − x^(2)_ij = 1 if x^(1)_ij = 1, x^(2)_ij = 0; 0 if x^(1)_ij = x^(2)_ij; −1 if x^(1)_ij = 0, x^(2)_ij = 1,  (9)

then by (8) there must exist c_{i_1 j_1}, ..., c_{i_k j_k} ∈ C^(1)_2 such that x^(1)_{i_r j_r} − x^(2)_{i_r j_r} = −1, r = 1, ..., k, and c_{i_0 j_0} + ∑_{r=1}^{k} c_{i_r j_r} (x^(1)_{i_r j_r} − x^(2)_{i_r j_r}) = 0, namely,

c_{i_0 j_0} = ∑_{r=1}^{k} c_{i_r j_r},  (10)

where x^(1)_{i_0 j_0} = 1 and x^(1)_{i_r j_r} = 0, r = 1, ..., k. This contradicts the known condition that no element of C^(1)_1 can be expressed as the sum of any number of elements of C^(1)_2.
Theorem 8. Let $G = (\mathbf{V}, \mathbf{E})$, $\mathbf{V} = (V_1, V_2, \dots, V_n)$, be the classifier structure obtained by step 4 of BC-OM, where $V_1$ is the class variable and $V_2, \dots, V_n$ are attribute variables, and let $G^* = (\mathbf{V}^*, \mathbf{E}^*)$ denote the final output of BC-OM. Then the classification results obtained by $G$ and $G^*$ are consistent.
Proof. Without loss of generality, suppose $(d_{i2}, \dots, d_{in})$ is an example to be classified. The classifier represented by $G$ is given as follows:
$$v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1 \mid v_2 = d_{i2}, \dots, v_n = d_{in}). \quad (11)$$
We write the right side of (11) as $p(v_1 \mid v_2, \dots, v_n)$ for short. Suppose that $k$ redundant variables were deleted in step 5 of BC-OM, say the last $k$ variables $V_{n-k+1}, V_{n-k+2}, \dots, V_n$. Then $\mathbf{V}^* = \{V_1, V_2, \dots, V_{n-k}\}$. According to step 5, $\{V_2, \dots, V_{n-k}\}$ d-separates $V_1$ and $\{V_{n-k+1}, \dots, V_n\}$. Thus $V_1$ is conditionally independent of $\{V_{n-k+1}, \dots, V_n\}$ given $\{V_2, \dots, V_{n-k}\}$. Equation (11) can be reduced as follows:
$$v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} \frac{p(v_1, v_2, \dots, v_n)}{p(v_2, \dots, v_n)}$$
$$\Longleftrightarrow v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1, v_2, \dots, v_n)$$
$$\Longleftrightarrow v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1, v_{n-k+1}, \dots, v_n \mid v_2, \dots, v_{n-k}) \cdot p(v_2, \dots, v_{n-k})$$
$$\Longleftrightarrow v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1 \mid v_2, \dots, v_{n-k}) \cdot p(v_{n-k+1}, \dots, v_n \mid v_2, \dots, v_{n-k})$$
$$\Longleftrightarrow v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1 \mid v_2, \dots, v_{n-k}). \quad (12)$$
We thus obtain the result.
Theorem 8 reveals that it is effective and correct to remove redundant or irrelevant attributes using the d-separation rule, and that the performance of the Bayesian classifier can thereby be improved.
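The reduction in Theorem 8 can be checked numerically on a toy distribution. The sketch below (hypothetical probabilities, not taken from the paper) builds a three-variable model in which the attribute $V_3$ is conditionally independent of the class $V_1$ given $V_2$, and compares classification with and without the redundant attribute:

```python
# Toy joint p(v1, v2, v3) = p(v1) p(v2 | v1) p(v3 | v2): V_3 is
# conditionally independent of the class V_1 given V_2, i.e. redundant.
p1 = {0: 0.6, 1: 0.4}
p2_given_1 = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}
p3_given_2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(v1, v2, v3):
    return p1[v1] * p2_given_1[v1][v2] * p3_given_2[v2][v3]

def argmax_full(v2, v3):
    """Classify with all attributes: argmax over v1 of p(v1 | v2, v3)."""
    return max((0, 1), key=lambda v1: joint(v1, v2, v3))

def argmax_reduced(v2):
    """Classify after dropping the d-separated attribute V_3."""
    return max((0, 1), key=lambda v1: p1[v1] * p2_given_1[v1][v2])
```

Because $p(v_3 \mid v_2)$ is constant in $v_1$, the two argmax computations agree on every instance, which is exactly the consistency claimed by Theorem 8 for this toy case.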
4. Experimental Results
We run our experiments on 20 data sets from the UCI repository of machine learning datasets [28], which represent a wide range of domains and data characteristics. Table 1 shows the description of the 20 data sets, which are ordered by ascending number of samples. In our experiments, missing values are replaced with the modes and means of the corresponding attribute values from the available data. For example, if the sex of someone is missing, it can be replaced by the mode (the value with the highest frequency) of the sexes of all the others. Besides, we manually delete three useless attributes: the attribute "ID number" in the dataset "Glass", the attribute "name" in the dataset "Hayes-roth", and the attribute "animal name" in the dataset "Zoo".

Table 1: Descriptions of the UCI datasets used in the experiments.

| NO. | Dataset | Instances | Attributes | Classes | Missing values (N/Y) |
|-----|---------|-----------|------------|---------|----------------------|
| 1 | Shuttle | 15 | 6 | 2 | N |
| 2 | Balloon | 16 | 4 | 2 | N |
| 3 | Lenses | 24 | 4 | 3 | N |
| 4 | Soybean (small) | 47 | 35 | 4 | N |
| 5 | Post-patient | 90 | 8 | 3 | Y |
| 6 | Zoo | 101 | 17 | 7 | N |
| 7 | Hayes-roth | 132 | 5 | 3 | N |
| 8 | Iris | 150 | 4 | 3 | N |
| 9 | Glass | 214 | 10 | 7 | N |
| 10 | Heart | 267 | 22 | 2 | N |
| 11 | Ionosphere | 351 | 34 | 2 | N |
| 12 | Monk's | 432 | 7 | 2 | N |
| 13 | Vote | 435 | 16 | 2 | Y |
| 14 | Balance-scale | 625 | 5 | 3 | N |
| 15 | Tic-tac-toe | 958 | 9 | 2 | N |
| 16 | Car-evaluation | 1728 | 6 | 4 | N |
| 17 | Chess (kr-vs-kp) | 3196 | 36 | 2 | N |
| 18 | Mushroom | 8124 | 22 | 2 | Y |
| 19 | Nursery | 12960 | 8 | 5 | N |
| 20 | Letter | 20000 | 16 | 26 | N |
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB memory, and Windows XP. Our implementation is based on the BayesNet Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with Naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is based on the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets; in particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compared the related algorithms via a two-tailed t-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
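As a minimal sketch of the significance check described above (with hypothetical per-run accuracies for two classifiers; 2.262 is the standard two-tailed t critical value for 9 degrees of freedom, matching ten paired estimates):

```python
import math

def paired_t_statistic(xs, ys):
    """Paired t statistic for matched accuracy estimates of two classifiers."""
    d = [x - y for x, y in zip(xs, ys)]
    n = len(d)
    mean = sum(d) / n
    var = sum((v - mean) ** 2 for v in d) / (n - 1)  # sample variance of differences
    return mean / math.sqrt(var / n)

T_CRIT_95_DF9 = 2.262  # standard two-tailed critical value, df = 9, alpha = 0.05

# Hypothetical per-run accuracies on one data set (not the paper's numbers).
model_a = [0.90, 0.92, 0.88, 0.91, 0.89, 0.93, 0.90, 0.92, 0.89, 0.91]
model_b = [0.84, 0.87, 0.83, 0.85, 0.85, 0.86, 0.84, 0.88, 0.83, 0.86]
```

If the absolute value of the statistic exceeds the critical value, the two classifiers are declared "significantly different" on that data set.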
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set; the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another's performance is not significantly different from the best, it is also highlighted, but if the differences between all four classifiers are not statistically significant, then none of them is highlighted. From our experiments we can see that BC-OM is best in 6 cases; NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. From a general point of view, we can see that, from the first data set to the last one, the highlighted numbers change from few to more in the sixth column of Table 2. This means the advantage of BC-OM is more evident with the increase of data size.
Table 2: The detailed experimental results on accuracy and standard deviation.

| NO. | Dataset | NB | TAN | BAN | BC-OM |
|-----|---------|----|-----|-----|-------|
| 1 | Shuttle | 83.52 ± 5.85 | 85.33 ± 7.03 | 84.43 ± 6.83 | 82.33 ± 5.94 |
| 2 | Balloon | 89.15 ± 3.06 | 85.16 ± 3.15 | 87.16 ± 2.92 | 83.75 ± 3.57 |
| 3 | Lenses | 92.56 ± 2.44 | 93.25 ± 2.85 | 89.67 ± 3.16 | 91.12 ± 3.02 |
| 4 | Soybean (small) | 90.91 ± 2.35 | 84.98 ± 3.16 | 87.36 ± 2.44 | 89.74 ± 3.23 |
| 5 | Post-patient | 80.29 ± 3.39 | 79.94 ± 3.47 | 81.26 ± 3.56 | 80.12 ± 3.48 |
| 6 | Zoo | 93.46 ± 5.12 | 95.21 ± 4.78 | 94.34 ± 5.16 | 94.88 ± 5.09 |
| 7 | Hayes-roth | 81.82 ± 5.27 | 87.34 ± 4.89 | 84.58 ± 3.74 | 84.55 ± 3.56 |
| 8 | Iris | 94.13 ± 6.12 | 93.47 ± 6.65 | 93.81 ± 6.80 | 89.07 ± 7.25 |
| 9 | Glass | 66.02 ± 9.88 | 59.67 ± 9.52 | 58.32 ± 9.98 | 56.96 ± 9.81 |
| 10 | Heart | 82.41 ± 6.62 | 82.53 ± 7.01 | 82.65 ± 7.03 | 81.53 ± 6.91 |
| 11 | Ionosphere | 90.54 ± 4.03 | 90.61 ± 3.98 | 91.45 ± 4.14 | 90.99 ± 4.82 |
| 12 | Monk's | 84.28 ± 1.14 | 83.29 ± 1.29 | 83.79 ± 1.23 | 86.64 ± 1.02 |
| 13 | Vote | 94.88 ± 3.57 | 90.37 ± 3.34 | 92.63 ± 3.98 | 90.97 ± 2.95 |
| 14 | Balance-scale | 90.91 ± 1.51 | 84.97 ± 2.44 | 87.94 ± 2.31 | 90.77 ± 1.85 |
| 15 | Tic-tac-toe | 70.11 ± 4.45 | 73.16 ± 4.47 | 75.74 ± 4.83 | 76.21 ± 3.82 |
| 16 | Car-evaluation | 84.95 ± 2.74 | 93.44 ± 1.89 | 89.2 ± 1.02 | 95.42 ± 1.03 |
| 17 | Chess (kr-vs-kp) | 87.88 ± 1.91 | 91.68 ± 1.66 | 90.27 ± 1.12 | 92.65 ± 1.4 |
| 18 | Mushroom | 92.85 ± 0.56 | 99.89 ± 0.1 | 96.37 ± 0.76 | 99.91 ± 0.1 |
| 19 | Nursery | 90.54 ± 1.14 | 94.07 ± 1.29 | 92.31 ± 1.23 | 89.95 ± 1.09 |
| 20 | Letter | 69.32 ± 3.97 | 77.58 ± 1.83 | 73.45 ± 1.67 | 77.66 ± 2.01 |
| Average | | 86.85 ± 4.81 | 86.63 ± 3.7 | 85.69 ± 3.69 | 86.76 ± 3.57 |

Table 3: The compared results of the two-tailed t-test on accuracy with the 95 percent confidence level.

| w/t/l | NB | TAN | BAN |
|-------|----|-----|-----|
| TAN | 13/3/4 | — | — |
| BAN | 12/4/4 | 10/5/5 | — |
| BC-OM | 9/7/4 | 12/3/5 | 11/4/5 |

Table 3 shows the compared results of the two-tailed t-test, in which each entry w/t/l means that the model in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets compared to the model in the corresponding column. From Table 3 we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set: the x coordinate of a point is the percentage of misclassifications according to NB, TAN, or BAN, and the y coordinate is the percentage of misclassifications according to BC-OM. Thus, points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3 we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers, both in terms of accuracy and the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in the results of Figures 2 and 3. In other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.
Figure 2: Relative errors of BC-OM and NB.

5. Conclusions

In many real-world applications, classification is often required to make optimal decisions. In this paper, we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model, BC-OM. We conducted a systematic experimental study on a number of UCI datasets. The experimental results show that BC-OM has a better performance compared to the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it would be useful to model correlations among attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.
Figure 3: Relative errors of BC-OM and TAN.

Figure 4: Relative errors of BC-OM and BAN.

In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the data set. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and make its advantage stronger. This is the main research direction for our future work.
Acknowledgments

This work was supported by the National Natural Science Foundation of China (Nos. 60974082 and 61075055), the National Funds of China for Young Scientists (No. 11001214), and the Fundamental Research Funds for the Central Universities (No. K5051270013).
References

[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved Naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, Bayes Net Toolbox for Matlab, 2007, https://code.google.com/p/bnt/.
[30] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
independent given $V_k$ is true, then the statistic $U^2_{ij|k} = 2\sum_{a,b,c} N^{abc}_{ijk} \log[N^{abc}_{ijk} N^{c}_{k} / (N^{ac}_{ik} N^{bc}_{jk})]$ approximates a $\chi^2(l)$ distribution with $l = (r_i - 1)(r_j - 1)r_k$ degrees of freedom, where $r_i$, $r_j$, and $r_k$ represent the number of configurations for the variables $V_i$, $V_j$, and $V_k$, respectively; $N^{abc}_{ijk}$ is the number of cases in $D$ where $V_i = a$, $V_j = b$, and $V_k = c$; $N^{ac}_{ik}$ is the number of cases in $D$ where $V_i = a$ and $V_k = c$; and $N^{c}_{k}$ is the number of cases in $D$ where $V_k = c$.
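The $U^2$ statistic above can be computed directly from the count tables. The following sketch (our own helper, not the authors' code) accumulates the marginal counts $N^{ac}_{ik}$, $N^{bc}_{jk}$, and $N^c_k$ from the joint counts and evaluates the sum:

```python
import math

def u2_statistic(counts):
    """Conditional U^2 statistic of V_i and V_j given V_k.

    `counts[(a, b, c)]` plays the role of N_ijk^abc in the text: the number
    of cases in D with V_i = a, V_j = b, V_k = c.  The marginal counts
    N_ik^ac, N_jk^bc, and N_k^c are accumulated from it.
    """
    n_ac, n_bc, n_c = {}, {}, {}
    for (a, b, c), n in counts.items():
        n_ac[(a, c)] = n_ac.get((a, c), 0) + n
        n_bc[(b, c)] = n_bc.get((b, c), 0) + n
        n_c[c] = n_c.get(c, 0) + n
    u2 = 0.0
    for (a, b, c), n in counts.items():
        if n > 0:  # empty cells contribute nothing to the sum
            u2 += 2.0 * n * math.log(n * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
    return u2

# Counts that factorize within each stratum of V_k give U^2 = 0 exactly,
# while counts concentrated on a diagonal give a strictly positive value.
independent = {(a, b, 0): (a + 1) * (b + 1) for a in (0, 1) for b in (0, 1)}
dependent = {(0, 0, 0): 5, (1, 1, 0): 5}
```

For the factorized table every log term vanishes, matching the intuition that the statistic measures departure from conditional independence.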
3. A Bayesian Classifier Learning Algorithm Based on Optimization Model
3.1. Optimization Model Design. In this subsection, we give some basic concepts and theorems which are the foundation of the method proposed in this paper.
A Bayesian classifier is a graphical representation of a joint probability distribution that includes two components. One is a directed acyclic graph $G = (\mathbf{V}, \mathbf{E})$, where the node set $\mathbf{V} = \{V_1, V_2, \dots, V_n\}$ represents the class and attribute variables, and the edge set $\mathbf{E}$ represents direct dependency relationships between variables. The other is a joint probability distribution $\Theta = \{\theta_i \mid \theta_i = P(V_i \mid \mathrm{pa}(V_i)),\ i = 1, 2, \dots, n\}$ that quantifies the effects that $\mathrm{pa}(V_i)$ has on the variable $V_i$ in $G$, where $\mathrm{pa}(V_i) = \{V_j \mid V_j \to V_i \in \mathbf{E}\}$. We assume that $V_1$ is the class node and $\{V_2, V_3, \dots, V_n\}$ is the set of attribute nodes. The structure of $G$ reflects the underlying probabilistic dependence relations among the nodes and a set of assertions about conditional independencies. The problem of data classification can be stated as follows: the learning goal is first to find the classifier structure that best matches $D$ and to estimate the parameters using the training data set $D$, and then to assign a class label to test instances. Since $G$ is a directed acyclic graph, it can be represented by a binary node-node adjacency matrix $X = (x_{ij})$. Entry $(i, j)$ is 1 if there is a directed arc from node $i$ to node $j$, and 0 otherwise. That is,
$$x_{ij} = \begin{cases} 1, & \text{if } V_i \to V_j \in \mathbf{E}, \\ 0, & \text{if } V_i \to V_j \notin \mathbf{E}. \end{cases} \quad (2)$$
Let $R = X + X^2 + \cdots + X^n$ be the sum of powers of the adjacency matrix. Entry $r_{ij} \in R$ is equal to the number of directed paths from node $V_i$ to node $V_j$ in the graph [25]. We wish to be able to use a mathematical programming formulation, and this formulation requires that we are able to measure the impact of adding or removing a single arc from the network. In order to approximate the impact of adding such an arc, we define the dependence coefficient.
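The path-counting identity can be illustrated with a small sketch; `path_count_matrix` below (a hypothetical helper name) accumulates $X + X^2 + \cdots + X^n$ with plain nested-list matrix products, so a zero diagonal of $R$ certifies acyclicity:

```python
def mat_mult(a, b):
    """Product of two square integer matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def path_count_matrix(x):
    """R = X + X^2 + ... + X^n; entry r_ij counts directed paths V_i -> V_j."""
    n = len(x)
    r = [[0] * n for _ in range(n)]
    p = [row[:] for row in x]  # current power X^m, starting at m = 1
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                r[i][j] += p[i][j]
        p = mat_mult(p, x)
    return r

# Naive Bayes star V_1 -> V_2, V_1 -> V_3: acyclic, class reaches attributes.
star = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
r_star = path_count_matrix(star)
# Two-node cycle: the diagonal of R becomes nonzero, flagging a directed cycle.
cycle = [[0, 1], [1, 0]]
r_cycle = path_count_matrix(cycle)
```

This is exactly the quantity used by the reachability and acyclicity constraints introduced below.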
Definition 2. Given a data set $D$ and a variable set $\mathbf{V} = \{V_1, V_2, \dots, V_n\}$, we define the dependence coefficient $c_{ij\alpha}$ between variables $V_i$ and $V_j$ as $c_{ij\alpha} = \min_{k \neq i,j}\{U^2_{ij|k} - \chi^2_{ij|k,\alpha}\}$, where $U^2_{ij|k}$ is the $U^2$ statistic of $V_i$ and $V_j$ given $V_k$, and $\chi^2_{ij|k,\alpha}$ is the critical value at the significance level $\alpha$ of a $\chi^2$ distribution with $(r_i - 1)(r_j - 1)r_k$ degrees of freedom.
Obviously, $c_{ij\alpha}$ is a conservative estimate of the degree of dependence between two nodes. If $c_{ij\alpha} > 0$, then regardless of the other variable involved, there is statistically significant dependence between $V_i$ and $V_j$, so there should be an arc between them. If $c_{ij\alpha} < 0$, then there is at least one way of conditioning the relationship so that significant dependence is not present. We define $C = (c_{ij})$, the dependence coefficient matrix corresponding to the variable set $\mathbf{V}$, that is,
$$c_{ij} = \begin{cases} c_{ij\alpha}, & \text{if } i \neq j, \\ 0, & \text{otherwise}. \end{cases} \quad (3)$$
Lemma 3. Given a data set $D$ and a variable set $\mathbf{V}$, $V_i$ and $V_j$ are locally conditionally independent at the significance level $\alpha$ if and only if there is a node $V_k \in \mathbf{V}$ such that $U^2_{ij|k} < \chi^2_{ij|k,\alpha}$.
The proof of Lemma 3 follows directly from Definition 2 and the chi-square hypothesis test. According to Lemma 3, $V_i$ and $V_j$ are locally conditionally independent at the significance level $\alpha$ if and only if $c_{ij} < 0$. Further, $V_i$ and $V_j$ are globally conditionally independent at the significance level $\alpha$ if and only if $U^2_{ij|k} < \chi^2_{ij|k,\alpha}$ for any $k \neq i, j$. Based on this, we use the dependence coefficients to construct an overall measure of the dependence, which will be treated as the objective function for our mathematical program.
Definition 4. For a Bayesian classifier with adjacency matrix $X = (x_{ij})$, the global dependence measure of the network is given by
$$\mathcal{F}(X, \alpha) = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} c_{ij} x_{ij}. \quad (4)$$
According to the measure of Definition 4, if $V_i$ and $V_j$ are conditionally independent, then by Lemma 3 $c_{ij} < 0$, and hence adding an arc between $V_i$ and $V_j$ decreases the value of $\mathcal{F}$. Thus we wish to find the feasible solution which maximizes $\mathcal{F}$; the optimal solution corresponds to the best classifier structure. We next explain what constitutes a feasible network.
Given the variable set $\mathbf{V} = \{V_1, V_2, \dots, V_n\}$, where $V_1$ is the class node and $V_i$, $i = 2, \dots, n$, are the attribute nodes, a directed network is a feasible classifier structure if and only if the following conditions are satisfied:

(1) for any attribute node $V_i \in \mathbf{V}$, $i = 2, \dots, n$, there is no directed edge from $V_i$ to $V_1$;
(2) for any node $V_i \in \mathbf{V}$, $i = 1, 2, \dots, n$, there is no directed path from $V_i$ to $V_i$, namely, the graph is acyclic;
(3) there exists at least one attribute node $V_i \in \mathbf{V}$, $i = 2, \dots, n$, which is dependent on the class node $V_1$, namely, there is an attribute node $V_i$ such that $V_i$ can be reached from $V_1$ by a directed path.
In order to incorporate the requirements of the above three conditions into a mathematical programming formulation, we express them by the following constraints:

(1) $\sum_{i=2}^{n} x_{i1} = 0$;
(2) $\sum_{i=1}^{n} r_{ii} = 0$;
(3) $\sum_{j=2}^{n} r_{1j} \ge 1$.

Algorithm 1: BC-OM.

(1) Input: data set $D$, variable set $\mathbf{V} = (V_1, V_2, \dots, V_n)$ ($V_1$ is the class node and the others are attribute nodes).
(2) For any pair of variables $V_i$ and $V_j$ contained in $\mathbf{V}$, calculate the dependence coefficient $c_{ij}$ by Definition 2.
(3) Solve the mathematical programming (OM) and obtain the optimal solution $X^* = (x_{ij})$.
(4) Build the classifier structure $G^* = (\mathbf{V}, \mathbf{E}^*)$ from $X^* = (x_{ij})$.
(5) For any variable $V_i \in \mathbf{V}$, $i \neq 1$: if there is a subset $\mathbf{Z} \subset \mathbf{V} \setminus \{V_1, V_i\}$ such that $\mathbf{Z}$ d-separates $V_i$ and $V_1$, then delete $V_i$ from $G^*$ and update $G^*$ as $\mathbf{V} = \mathbf{V} \setminus \{V_i\}$, $\mathbf{E}^* = \mathbf{E}^* \setminus \{V_i \to V_j \text{ or } V_j \to V_i \mid V_j \in \mathrm{adj}(V_i)\}$.
(6) Output: the built classifier $G^* = (\mathbf{V}, \mathbf{E}^*)$.
The feasible classifiers are those that satisfy constraints (1)–(3). Thus, learning the best Bayesian classifier can be transformed into the following mathematical programming problem, where the objective function is a global dependence measure of the network and the feasible region is the set of classifiers with reachability constraints (1)–(3), that is,
$$\max\ \mathcal{F}(X, \alpha) = \sum_{i=1}^{n} \sum_{j=1, j \neq i}^{n} c_{ij} x_{ij}$$
$$\text{s.t.} \quad \sum_{i=2}^{n} x_{i1} = 0,$$
$$\sum_{i=1}^{n} r_{ii} = 0,$$
$$\sum_{j=2}^{n} r_{1j} \ge 1,$$
$$x_{ij} \in \{0, 1\}. \quad \text{(OM)}$$
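For very small $n$, (OM) can be solved by exhaustive enumeration, which makes the constraints concrete. The sketch below (a brute-force illustration, not the authors' solver) enumerates all 0-1 adjacency matrices, filters by constraints (1)–(3), and maximizes the objective:

```python
from itertools import product

def feasible(x):
    """Check constraints (1)-(3) for a candidate adjacency matrix x."""
    n = len(x)
    if any(x[i][0] for i in range(1, n)):        # (1) no arc into class node V_1
        return False
    r = [[0] * n for _ in range(n)]              # R = X + X^2 + ... + X^n
    p = [row[:] for row in x]
    for _ in range(n):
        for i in range(n):
            for j in range(n):
                r[i][j] += p[i][j]
        p = [[sum(p[i][k] * x[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
    if any(r[i][i] for i in range(n)):           # (2) acyclicity
        return False
    return sum(r[0][1:]) >= 1                    # (3) V_1 reaches some attribute

def solve_om(c):
    """Maximize sum of c_ij x_ij over feasible 0-1 matrices by brute force."""
    n = len(c)
    cells = [(i, j) for i in range(n) for j in range(n) if i != j]
    best_score, best_x = None, None
    for bits in product((0, 1), repeat=len(cells)):
        x = [[0] * n for _ in range(n)]
        for (i, j), b in zip(cells, bits):
            x[i][j] = b
        if not feasible(x):
            continue
        score = sum(c[i][j] * x[i][j] for (i, j) in cells)
        if best_score is None or score > best_score:
            best_score, best_x = score, x
    return best_score, best_x
```

With hypothetical coefficients $c_{12} = 2$, $c_{13} = 1$, and $c_{23} = c_{32} = -1$, the optimum keeps only the two positive arcs out of the class node, recovering the naive Bayes star.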
3.2. BC-OM Algorithm and Its Correctness. In this subsection, we present the main algorithm of this paper. Our method starts by finding the best Bayesian classifier through solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes, which have a low dependence degree with the class variable; the parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class label of test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
Given a directed acyclic graph $G = (\mathbf{V}, \mathbf{E})$, where $\mathbf{V}$ is the node set and $\mathbf{E}$ the set of directed edges, a path $\rho$ between two distinct nodes $V_1$ and $V_l$ is a sequence of distinct nodes in which the first node is $V_1$, the last one is $V_l$, and two consecutive nodes are connected by an edge, that is, $\rho = V_1 e_1 V_2 \cdots e_{l-1} V_l$, where $e_i$ denotes $V_i \to V_{i+1}$ or $V_i \leftarrow V_{i+1}$ for $i = 1, 2, \dots, (l-1)$.
Definition 5. A path $\rho$ is said to be d-separated by a set $\mathbf{Z}$ in a directed acyclic graph $G$ if and only if (1) $\rho$ contains a "head-to-tail meeting" $V_i \to V_j \to V_k$ or a "tail-to-tail meeting" $V_i \leftarrow V_j \to V_k$ such that the middle node $V_j$ is in $\mathbf{Z}$, or (2) $\rho$ contains a "head-to-head meeting" $V_i \to V_j \leftarrow V_k$ such that the middle node $V_j$ is not in $\mathbf{Z}$ and no descendant of $V_j$ is in $\mathbf{Z}$. Specially, two distinct sets of nodes $\mathbf{X}$ and $\mathbf{Y}$ are said to be d-separated by a set $\mathbf{Z}$ in $G$ if $\mathbf{Z}$ d-separates every path from any node in $\mathbf{X}$ to any node in $\mathbf{Y}$ [26].
In this paper, we assume that all the distributions are compatible with $G$ [27]. We also assume that all independencies of a probability distribution of variables in $\mathbf{V}$ can be checked by d-separations of $G$, called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by $G$. We now formally describe our method in Algorithm 1.
From the detailed steps of BC-OM, we can see that the BC-OM classifier relaxes the restrictions on condition variables and further meets the needs of practical application. Since its network structure is similar to that of BAN, BC-OM does not need to build all possible networks in which the class node is a root, and it removes irrelevant or redundant nodes from the network before the process of estimating the network parameters, which greatly reduces the calculation of the posterior probability of the class variable. In fact, the training process of BC-OM is different from that of other BN classifiers: its main task is to solve the mathematical programming (OM). To create the dependence coefficient matrix corresponding to (OM), BC-OM needs to compute the conditional statistics $U^2_{ij|k}$. Moreover, just as for other constraint-based algorithms, the main cost of BC-OM is the number of conditional independence tests for computing the dependence coefficients of any two variables in step 2. The number of conditional independence tests is $C^2_n \cdot C^1_{n-2}$ and the computing complexity is $O(n^3)$; the total complexity of BC-OM is bounded by $O(n^3 \cdot N)$, where $n$ is the number of variables in the network and $N$ is the number of cases in the data set $D$. In principle, BC-OM is a structure-extension-based algorithm: we essentially extend the structure of TAN by relaxing the parent set of each attribute node. Thus the resulting structure is more complex than TAN but simpler than BAN. Therefore, BC-OM is a good tradeoff between model complexity and accuracy compared with TAN and BAN. Next, we prove the correctness of the BC-OM algorithm under the faithfulness assumption.
The next two results establish the existence and uniqueness properties of the solution to (OM).
Theorem 6. Let $\mathcal{X} = \{(x_{ij})_{n \times n} \mid x_{ij} \in \{0, 1\}\}$. There always exists an $X \in \mathcal{X}$ such that $X$ is a feasible point of (OM).
Proof. Given the set of variables $\mathbf{V} = \{V_1, V_2, \dots, V_n\}$, where $V_1$ is the class variable and $V_2, \dots, V_n$ are the attribute variables, we define a matrix $X$ as follows:
$$x_{ij} = \begin{cases} 1, & i = 1,\ j = 2, \dots, n, \\ 0, & \text{otherwise}. \end{cases} \quad (5)$$
Obviously, the adjacency matrix $X$ always satisfies the constraints (1)–(3). In fact, the graph represented by $X$ is the Naive Bayes classifier. Thus $X$ is a feasible solution of (OM).
According to Theorem 6 we can prove that there exists afeasible classifier which satisfy constrains (1)ndash(3) Theorem 7further shows that such classifier is unique under certaincondition
Theorem 7. Let $X^*$ be the optimal solution of (OM), and let $C_1 = \{c_{ij} \mid x_{ij} = 1\}$ and $C_2 = \{c_{ij} \mid x_{ij} = 0\}$ be the coefficient sets, where $x_{ij}$ is an element of $X^*$. Then $X^*$ is the unique solution of (OM) if and only if no element of $C_1$ can be expressed as the sum of any number of elements of $C_2$.
Proof. Without loss of generality, suppose by reduction to absurdity that $X^{(1)}$ and $X^{(2)}$ are two optimal solutions of (OM). The value of the objective function is the same for both solutions; that is,

$$\sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} c_{ij}\, x^{(1)}_{ij} = \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} c_{ij}\, x^{(2)}_{ij} \qquad (6)$$

$$\Longrightarrow \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} c_{ij} \left(x^{(1)}_{ij} - x^{(2)}_{ij}\right) = 0. \qquad (7)$$

Let $C^{(1)}_1 = \{c_{ij} \mid x^{(1)}_{ij} = 1\}$ and $C^{(1)}_2 = \{c_{ij} \mid x^{(1)}_{ij} = 0\}$. Since $X^{(1)} \neq X^{(2)}$ and $x^{(1)}_{ij}, x^{(2)}_{ij} \in \{0,1\}$, there must exist $1 \le i_0, j_0 \le n$ such that $x^{(1)}_{i_0 j_0} - x^{(2)}_{i_0 j_0} = 1$, namely $x^{(1)}_{i_0 j_0} = 1$ and $x^{(2)}_{i_0 j_0} = 0$. Then, by (7),

$$c_{i_0 j_0} + \sum_{i \neq i_0} \sum_{j \neq j_0} c_{ij} \left(x^{(1)}_{ij} - x^{(2)}_{ij}\right) = 0. \qquad (8)$$

Since

$$x^{(1)}_{ij} - x^{(2)}_{ij} = \begin{cases} 1, & x^{(1)}_{ij} = 1,\ x^{(2)}_{ij} = 0, \\ 0, & x^{(1)}_{ij} = x^{(2)}_{ij}, \\ -1, & x^{(1)}_{ij} = 0,\ x^{(2)}_{ij} = 1, \end{cases} \qquad (9)$$

then, by (8), there must exist $c_{i_1 j_1}, \ldots, c_{i_k j_k} \in C^{(1)}_2$ such that $x^{(1)}_{i_r j_r} - x^{(2)}_{i_r j_r} = -1$ for $r = 1, \ldots, k$ and $c_{i_0 j_0} + \sum_{r=1}^{k} c_{i_r j_r}\,(x^{(1)}_{i_r j_r} - x^{(2)}_{i_r j_r}) = 0$, namely

$$c_{i_0 j_0} = \sum_{r=1}^{k} c_{i_r j_r}, \qquad (10)$$

where $x^{(1)}_{i_0 j_0} = 1$ and $x^{(1)}_{i_r j_r} = 0$, $r = 1, \ldots, k$. This contradicts the condition that no element of $C^{(1)}_1$ can be expressed as the sum of any number of elements of $C^{(1)}_2$.
Theorem 8. Let $G = (\mathbf{V}, \mathbf{E})$, $\mathbf{V} = (V_1, V_2, \ldots, V_n)$, be the classifier structure obtained by step 4 of BC-OM, where $V_1$ is the class variable and $V_2, \ldots, V_n$ are attribute variables, and let $G^* = (\mathbf{V}^*, \mathbf{E}^*)$ denote the final output of BC-OM. Then the classification results obtained by $G$ and $G^*$ are consistent.
Proof. Without loss of generality, suppose $(d_{i2}, \ldots, d_{in})$ is an example to be classified. The classifier represented by $G$ is given as follows:

$$v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p\left(v_1 \mid v_2 = d_{i2}, \ldots, v_n = d_{in}\right). \qquad (11)$$
We write the right side of (11) as $p(v_1 \mid v_2, \ldots, v_n)$ for short. Suppose that $k$ redundant variables were deleted in step 5 of BC-OM, say the last $k$ variables $V_{n-k+1}, V_{n-k+2}, \ldots, V_n$. Then $\mathbf{V}^* = \{V_1, V_2, \ldots, V_{n-k}\}$. According to step 5, $\{V_2, \ldots, V_{n-k}\}$ d-separates $V_1$ and $\{V_{n-k+1}, \ldots, V_n\}$. Thus $V_1$ is conditionally independent of $\{V_{n-k+1}, \ldots, V_n\}$ given $\{V_2, \ldots, V_{n-k}\}$. Equation (11) can be reduced as follows:

$$v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} \frac{p(v_1, v_2, \ldots, v_n)}{p(v_2, \ldots, v_n)}$$
$$\iff v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1, v_2, \ldots, v_n)$$
$$\iff v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1, v_{n-k+1}, \ldots, v_n \mid v_2, \ldots, v_{n-k}) \cdot p(v_2, \ldots, v_{n-k})$$
$$\iff v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1 \mid v_2, \ldots, v_{n-k}) \cdot p(v_{n-k+1}, \ldots, v_n \mid v_2, \ldots, v_{n-k})$$
$$\iff v^*_1 = \arg\max_{v_1 \in \mathrm{dom}(V_1)} p(v_1 \mid v_2, \ldots, v_{n-k}). \qquad (12)$$
This completes the proof.

Theorem 8 reveals that it is effective and correct to remove redundant or irrelevant attributes using the d-separation rule, and that the performance of the Bayesian classifier can thereby be improved.
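The argmax equivalence behind Theorem 8 can be illustrated numerically with a toy joint distribution in which $V_3$ is d-separated from the class $V_1$ by $\{V_2\}$. A sketch, assuming a three-variable binary joint factored as $p(v_1)\,p(v_2 \mid v_1)\,p(v_3 \mid v_2)$ (the CPD numbers below are made up for illustration):

```python
from itertools import product

# Toy CPDs: p(v1), p(v2|v1), p(v3|v2); V3 is d-separated from V1 by {V2}.
p1 = {0: 0.6, 1: 0.4}
p2 = {(0, 0): 0.8, (1, 0): 0.2, (0, 1): 0.3, (1, 1): 0.7}  # key (v2, v1)
p3 = {(0, 0): 0.9, (1, 0): 0.1, (0, 1): 0.5, (1, 1): 0.5}  # key (v3, v2)

def joint(v1, v2, v3):
    return p1[v1] * p2[(v2, v1)] * p3[(v3, v2)]

def argmax_full(v2, v3):
    """arg max over v1 of p(v1 | v2, v3), using every attribute, as in (11)."""
    return max((0, 1), key=lambda v1: joint(v1, v2, v3))

def argmax_reduced(v2):
    """arg max over v1 of p(v1 | v2): V3 removed, as in step 5 of BC-OM."""
    return max((0, 1), key=lambda v1: sum(joint(v1, v2, v3) for v3 in (0, 1)))

# The two classifiers agree on every instance, as Theorem 8 predicts.
assert all(argmax_full(v2, v3) == argmax_reduced(v2)
           for v2, v3 in product((0, 1), repeat=2))
```

Because the $p(v_3 \mid v_2)$ factor does not depend on $v_1$, it scales all candidate classes equally and drops out of the argmax, which is exactly the reduction carried out in (12).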
4 Experimental Results
We run our experiments on 20 data sets from the UCI repository of machine learning datasets [28], which represent a wide range of domains and data characteristics. Table 1 shows the description of the 20 data sets, which are ordered by ascending number of samples. In our experiments, missing values are replaced with the modes and means of the corresponding attribute values from the available data. For example, if the sex of someone is missing, it can be replaced by the mode (the value with the highest frequency) of the sexes of all the others. Besides, we manually delete three useless attributes: the attribute "ID number" in the dataset "Glass", the attribute "name" in the dataset "Hayes-roth", and the attribute "animal name" in the dataset "Zoo".

Table 1: Descriptions of UCI datasets used in the experiments.

No. | Dataset | Instances | Attributes | Classes | Missing value (N/Y)
1 | Shuttle | 15 | 6 | 2 | N
2 | Balloon | 16 | 4 | 2 | N
3 | Lenses | 24 | 4 | 3 | N
4 | Soybean (small) | 47 | 35 | 4 | N
5 | Post-patient | 90 | 8 | 3 | Y
6 | Zoo | 101 | 17 | 7 | N
7 | Hayes-roth | 132 | 5 | 3 | N
8 | Iris | 150 | 4 | 3 | N
9 | Glass | 214 | 10 | 7 | N
10 | Heart | 267 | 22 | 2 | N
11 | Ionosphere | 351 | 34 | 2 | N
12 | Monk's | 432 | 7 | 2 | N
13 | Vote | 435 | 16 | 2 | Y
14 | Balance-scale | 625 | 5 | 3 | N
15 | Tic-tac-toe | 958 | 9 | 2 | N
16 | Car-evaluation | 1728 | 6 | 4 | N
17 | Chess (kr-vs-kp) | 3196 | 36 | 2 | N
18 | Mushroom | 8124 | 22 | 2 | Y
19 | Nursery | 12960 | 8 | 5 | N
20 | Letter | 20000 | 16 | 26 | N
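The mode/mean imputation described above can be sketched in a few lines (mode for discrete attributes, mean for numeric ones; the example column values are hypothetical):

```python
from collections import Counter

def impute_column(values, missing=None):
    """Replace missing entries by the column mode (discrete) or mean (numeric)."""
    observed = [v for v in values if v is not missing]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = sum(observed) / len(observed)           # mean for numeric data
    else:
        fill = Counter(observed).most_common(1)[0][0]  # mode for discrete data
    return [fill if v is missing else v for v in values]

assert impute_column(["m", "f", None, "m"]) == ["m", "f", "m", "m"]
assert impute_column([1.0, 3.0, None]) == [1.0, 3.0, 2.0]
```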
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB memory, and Windows XP. Our implementation is based on the BayesNet Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is based on the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets; in particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compared the related algorithms via a two-tailed t-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
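The evaluation protocol, 10 runs of 5-fold cross validation with identical folds for every algorithm, amounts to seeded index bookkeeping; a sketch in which the classifier itself is a stand-in callable:

```python
import random

def kfold_indices(n, k, seed):
    """Shuffle 0..n-1 once and slice into k interleaved folds; the same seed
    yields the same folds, so every algorithm sees identical splits."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(evaluate, n, k=5, runs=10):
    """Mean accuracy over `runs` repetitions of k-fold cross validation.
    `evaluate(train_idx, test_idx)` is a stand-in for training and scoring
    a classifier (BC-OM, NB, TAN, or BAN in the paper's experiments)."""
    scores = []
    for run in range(runs):
        folds = kfold_indices(n, k, seed=run)
        for i, test in enumerate(folds):
            train = [j for f in folds[:i] + folds[i + 1:] for j in f]
            scores.append(evaluate(train, test))
    return sum(scores) / len(scores)

# Stand-in evaluator: "accuracy" = fraction of even indices in the test fold.
acc = cross_validate(lambda tr, te: sum(j % 2 == 0 for j in te) / len(te), n=100)
```

Fixing the seed per run is what guarantees the paper's requirement that all four classifiers be trained and tested on exactly the same splits.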
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set; the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another's performance is not significantly different from the best, it is also highlighted; but if the differences between all four classifiers are not statistically significant, then none of them is highlighted. From our experiments we can see that BC-OM is best in 6 cases; NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. From a general point of view, we can see that, from the first data set to the last one, the highlighted numbers change from few to more in the sixth column of Table 2. This means the advantage of BC-OM is more evident with the increase of data size.
Table 2: The detailed experimental results on accuracy and standard deviation.

No. | Dataset | NB | TAN | BAN | BC-OM
1 | Shuttle | 83.52 ± 5.85 | 85.33 ± 7.03 | 84.43 ± 6.83 | 82.33 ± 5.94
2 | Balloon | 89.15 ± 3.06 | 85.16 ± 3.15 | 87.16 ± 2.92 | 83.75 ± 3.57
3 | Lenses | 92.56 ± 2.44 | 93.25 ± 2.85 | 89.67 ± 3.16 | 91.12 ± 3.02
4 | Soybean (small) | 90.91 ± 2.35 | 84.98 ± 3.16 | 87.36 ± 2.44 | 89.74 ± 3.23
5 | Post-patient | 80.29 ± 3.39 | 79.94 ± 3.47 | 81.26 ± 3.56 | 80.12 ± 3.48
6 | Zoo | 93.46 ± 5.12 | 95.21 ± 4.78 | 94.34 ± 5.16 | 94.88 ± 5.09
7 | Hayes-roth | 81.82 ± 5.27 | 87.34 ± 4.89 | 84.58 ± 3.74 | 84.55 ± 3.56
8 | Iris | 94.13 ± 6.12 | 93.47 ± 6.65 | 93.81 ± 6.80 | 89.07 ± 7.25
9 | Glass | 66.02 ± 9.88 | 59.67 ± 9.52 | 58.32 ± 9.98 | 56.96 ± 9.81
10 | Heart | 82.41 ± 6.62 | 82.53 ± 7.01 | 82.65 ± 7.03 | 81.53 ± 6.91
11 | Ionosphere | 90.54 ± 4.03 | 90.61 ± 3.98 | 91.45 ± 4.14 | 90.99 ± 4.82
12 | Monk's | 84.28 ± 1.14 | 83.29 ± 1.29 | 83.79 ± 1.23 | 86.64 ± 1.02
13 | Vote | 94.88 ± 3.57 | 90.37 ± 3.34 | 92.63 ± 3.98 | 90.97 ± 2.95
14 | Balance-scale | 90.91 ± 1.51 | 84.97 ± 2.44 | 87.94 ± 2.31 | 90.77 ± 1.85
15 | Tic-tac-toe | 70.11 ± 4.45 | 73.16 ± 4.47 | 75.74 ± 4.83 | 76.21 ± 3.82
16 | Car-evaluation | 84.95 ± 2.74 | 93.44 ± 1.89 | 89.2 ± 1.02 | 95.42 ± 1.03
17 | Chess (kr-vs-kp) | 87.88 ± 1.91 | 91.68 ± 1.66 | 90.27 ± 1.12 | 92.65 ± 1.4
18 | Mushroom | 92.85 ± 0.56 | 99.89 ± 0.1 | 96.37 ± 0.76 | 99.91 ± 0.1
19 | Nursery | 90.54 ± 1.14 | 94.07 ± 1.29 | 92.31 ± 1.23 | 89.95 ± 1.09
20 | Letter | 69.32 ± 3.97 | 77.58 ± 1.83 | 73.45 ± 1.67 | 77.66 ± 2.01
Average | | 86.85 ± 4.81 | 86.63 ± 3.7 | 85.69 ± 3.69 | 86.76 ± 3.57

Table 3: The compared results of the two-tailed t-test on accuracy with the 95 percent confidence level.

w/t/l | NB | TAN | BAN
TAN | 13/3/4 | — | —
BAN | 12/4/4 | 10/5/5 | —
BC-OM | 9/7/4 | 12/3/5 | 11/4/5

Table 3 shows the compared results of the two-tailed t-test, in which each entry w/t/l means that the model in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets compared to the model in the corresponding column. From Table 3 we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set: the x coordinate of a point is the percentage of misclassifications according to NB, TAN, or BAN, and the y coordinate is the percentage of misclassifications according to BC-OM. Thus points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3 we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers, both in terms of accuracy and the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in the results of Figures 2 and 3. In other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.
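The win/tie/loss tallies of Table 3 come from per-dataset two-tailed t-tests over the cross-validation accuracies. A sketch of the tallying, with the significance decision reduced to a Welch two-sample t statistic against a stand-in critical value (an assumption; the paper does not spell out the exact test variant or degrees of freedom it used):

```python
def t_statistic(a, b):
    """Welch two-sample t statistic for two lists of accuracies."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    return (ma - mb) / ((va / na + vb / nb) ** 0.5)

def win_tie_loss(results_a, results_b, critical=2.0):
    """Count datasets where A significantly beats B, ties, or loses.
    `critical` is a hypothetical stand-in for the 95% two-tailed threshold,
    which in practice depends on the degrees of freedom."""
    w = t = l = 0
    for a, b in zip(results_a, results_b):
        stat = t_statistic(a, b)
        if stat > critical:
            w += 1
        elif stat < -critical:
            l += 1
        else:
            t += 1
    return w, t, l
```

With per-dataset accuracy lists for two models, `win_tie_loss` produces exactly the kind of w/t/l triple reported in Table 3.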
5 Conclusions
In many real-world applications, classification is often required to make optimal decisions. In this paper, we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model, BC-OM. We conducted a systematic experimental study on a number of UCI datasets. The experimental results show that BC-OM has a better performance compared to the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it would be useful to model correlations among attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity in the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.

[Figure 2: Relative errors of BC-OM and NB (scatter plot of misclassification percentages).]
In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the dataset. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and make its advantage stronger. This is the main research direction for our future work.

[Figure 3: Relative errors of BC-OM and TAN (scatter plot of misclassification percentages).]

[Figure 4: Relative errors of BC-OM and BAN (scatter plot of misclassification percentages).]
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).
References
[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, Bayes Net Toolbox for Matlab, 2007, https://code.google.com/p/bnt.
[30] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
Algorithm 1: BC-OM.
(1) Input: data set $D$; variable set $\mathbf{V} = (V_1, V_2, \ldots, V_n)$ ($V_1$ is the class node and the others are attribute nodes).
(2) For any pair of variables $V_i$ and $V_j$ contained in $\mathbf{V}$, calculate the dependence coefficient $c_{ij}$ by Definition 2.
(3) Solve the mathematical programming (OM) and obtain the optimal solution $X^* = (x_{ij})$.
(4) Build the classifier structure $G^* = (\mathbf{V}, \mathbf{E}^*)$ from $X^* = (x_{ij})$.
(5) For any variable $V_i \in \mathbf{V}$, $i \neq 1$: if there is a subset $\mathbf{Z} \subset \mathbf{V} \setminus \{V_1, V_i\}$ such that $\mathbf{Z}$ d-separates $V_i$ and $V_1$, then delete $V_i$ from $G^*$, updating $G^*$ as $\mathbf{V} = \mathbf{V} \setminus \{V_i\}$, $\mathbf{E}^* = \mathbf{E}^* \setminus \{V_i \rightarrow V_j \text{ or } V_j \rightarrow V_i : V_j \in \mathrm{adj}(V_i)\}$.
(6) Output: the built classifier $G^* = (\mathbf{V}, \mathbf{E}^*)$.
(2) $\sum_{i=1}^{n} r_{ii} = 0$;
(3) $\sum_{j=2}^{n} r_{1j} \ge 1$.
The feasible classifiers are those that satisfy constraints (1)–(3). Thus, learning the best Bayesian classifier can be transformed into the following mathematical programming problem, where the objective function is a global dependence measure of the network and the feasible region is the set of classifiers with reachability constraints (1)–(3); that is,

$$\max\ F(X, \alpha) = \sum_{i=1}^{n} \sum_{j=1, j\neq i}^{n} c_{ij}\, x_{ij}$$
$$\text{s.t.}\quad \sum_{i=2}^{n} x_{i1} = 0, \qquad \sum_{i=1}^{n} r_{ii} = 0, \qquad \sum_{j=2}^{n} r_{1j} \ge 1, \qquad x_{ij} \in \{0, 1\}. \qquad \text{(OM)}$$
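For tiny $n$ the 0-1 program (OM) can be solved by brute-force enumeration of adjacency matrices, which makes the constraint structure concrete. A sketch, again under the assumption that $r$ is the reachability (transitive closure) relation of $x$:

```python
from itertools import product

def solve_om_bruteforce(c):
    """Exhaustively maximise sum c_ij x_ij over feasible classifier structures.
    Exponential in n*(n-1); a didactic sketch for tiny n, not the paper's solver.
    Node 0 plays the role of the class variable V_1."""
    n = len(c)
    arcs = [(i, j) for i in range(n) for j in range(n) if i != j]
    best, best_x = None, None
    for bits in product((0, 1), repeat=len(arcs)):
        x = [[0] * n for _ in range(n)]
        for (i, j), b in zip(arcs, bits):
            x[i][j] = b
        r = [row[:] for row in x]                     # transitive closure of x
        for k in range(n):
            for i in range(n):
                for j in range(n):
                    r[i][j] = r[i][j] or (r[i][k] and r[k][j])
        if any(x[i][0] for i in range(1, n)):         # constraint (1): no arc into class
            continue
        if any(r[i][i] for i in range(n)):            # constraint (2): acyclic
            continue
        if sum(r[0][j] for j in range(1, n)) < 1:     # constraint (3): class reaches attrs
            continue
        score = sum(c[i][j] * x[i][j] for i, j in arcs)
        if best is None or score > best:
            best, best_x = score, x
    return best, best_x
```

For a symmetric coefficient matrix on three nodes, the optimum keeps the class arcs and the single best orientation between the two attributes, since orienting both directions would close a cycle and violate constraint (2).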
3.2. BC-OM Algorithm and Its Correctness. In this subsection we present the main algorithm of this paper. Our method starts by finding the best Bayesian classifier through solving the above optimization model. Second, we use the d-separation rule of Bayesian networks to delete irrelevant or redundant attributes in the network, which have a low dependence degree with the class variable; the parameters of the modified network can then be estimated. Third, classification is done by applying the obtained classifier to predict the class label of test data. We prove the correctness of the proposed method under the faithfulness assumption for the data distribution.
Given a directed acyclic graph $G = (\mathbf{V}, \mathbf{E})$, where $\mathbf{V}$ is the node set and $\mathbf{E}$ the set of directed edges, a path $\rho$ between two distinct nodes $V_1$ and $V_l$ is a sequence of distinct nodes in which the first node is $V_1$, the last one is $V_l$, and two consecutive nodes are connected by an edge; that is, $\rho = V_1 e_1 V_2 \cdots e_{l-1} V_l$, where $e_i$ denotes $V_i \rightarrow V_{i+1}$ or $V_i \leftarrow V_{i+1}$ for $i = 1, 2, \ldots, l-1$.
Definition 5. A path $\rho$ is said to be d-separated by a set $\mathbf{Z}$ in a directed acyclic graph $G$ if and only if (1) $\rho$ contains a "head-to-tail meeting" $V_i \rightarrow V_j \rightarrow V_k$ or a "tail-to-tail meeting" $V_i \leftarrow V_j \rightarrow V_k$ such that the middle node $V_j$ is in $\mathbf{Z}$, or (2) $\rho$ contains a "head-to-head meeting" $V_i \rightarrow V_j \leftarrow V_k$ such that the middle node $V_j$ is not in $\mathbf{Z}$ and no descendant of $V_j$ is in $\mathbf{Z}$. Specially, two distinct sets of nodes $\mathbf{X}$ and $\mathbf{Y}$ are said to be d-separated by a set $\mathbf{Z}$ in $G$ if $\mathbf{Z}$ d-separates every path from any node in $\mathbf{X}$ to any node in $\mathbf{Y}$ [26].
In this paper we assume that all the distributions are compatible with $G$ [27]. We also assume that all independencies of the probability distribution of variables in $\mathbf{V}$ can be checked by d-separations of $G$, called the faithfulness assumption [26]. The faithfulness assumption means that all independencies and conditional independencies among variables can be represented by $G$. We now formally describe our method in Algorithm 1.
From the detailed steps of BC-OM we can see that BC-OM classifier relaxes the restrictions on condition variableand further meets the need of practical application Since itsnetwork structure is similar to that of BANrsquos BC-OM doesnot need to build all possible networks in which class nodeis a root and removes irrelevant or redundant nodes fromthe network before the process of estimating the networkparameters which greatly reduces the calculation for poste-rior probability of class variable In fact the training processof BC-OM is different from other BN classifiers Its main taskis to solve the mathematical programming (OM) To createthe dependence coefficient matrix corresponding to (OM)BC-OM needs to compute the conditional statistics 1198802
119894119895|119896
Moreover just as other constraint based algorithms themaincost of BC-OM is the number of conditional independencetests for computing the dependence coefficients of any twovariables in step 2 The number of conditional independencetests is 1198622
119899
sdot 1198621
119899minus2
and the computing complexity is 119874(1198993)The total complexity of BC-OM is bound by 119874(1198993 sdot 119873)where 119899 is the number of variables in the network and 119873is the number of cases in data set 119863 In principle BC-OM is a structure-extension-based algorithm In BC-OMwe essentially extend the structure of TAN by relaxing theparent set of each attribute nodeThus the resulting structureis more complex than TAN but more simple than BANTherefore BC-OM is a good tradeoff between the modelcomplexity and accuracy compared with TAN and BANNext we prove the correctness of BC-OM algorithm underthe faithfulness assumption
The next two results establish the existence and unique-ness properties of solution to (OM)
Mathematical Problems in Engineering 5
Theorem 6 Let X = (119909119894119895
)119899times119899
| 119909119894119895
isin 0 1 There alwaysexists an119883 isin X such that 119883 is a feasible point of (OM)
Proof Given the set of variables119881 = 1198811
1198812
119881119899
where1198811
is the class variable and 1198812
119881119899
are the attribute variablesWe give a matrix119883 as follows
119909119894119895
= 1 119894 = 1 119895 = 2 119899
0 otherwise(5)
Obviously the adjacency matrix 119883 always satisfies the con-strains (1)ndash(3) In fact the graph represented by119883 is theNaiveBayes classifier Thus119883 is a feasible solution of (OM)
According to Theorem 6 we can prove that there exists afeasible classifier which satisfy constrains (1)ndash(3) Theorem 7further shows that such classifier is unique under certaincondition
Theorem 7 Let 119883lowast be the optimal solution of (OM) 1198621
=
119888119894119895
| 119909119894119895
= 1 and 1198622
= 119888119894119895
| 119909119894119895
= 0 be the coefficientsets where 119909
119894119895
is the element of119883lowast119883lowast is the unique solution of(OM) if and only if any element in 119862
1
cannot be expressed asthe sum of any number of elements in 119862
2
Proof Without loss of generality we suppose by reductionto absurdity that 119883(1) and 119883(2) are two optimal solutions of(OM)The values of the objective function is the same in bothsolutions that is
119899
sum
119894=1
119899
sum
119895=1119895 = 119894
119888119894119895
119909(1)
119894119895
=
119899
sum
119894=1
119899
sum
119895=1119895 = 119894
119888119894119895
119909(2)
119894119895
(6)
997904rArr
119899
sum
119894=1
119899
sum
119895=1119895 = 119894
119888119894119895
(119909(1)
119894119895
minus 119909(2)
119894119895
) = 0 (7)
Let 119862(1)1
= 119888119894119895
| 119909(1)
119894119895
= 1 119862(1)2
= 119888119894119895
| 119909(1)
119894119895
= 0 Accordingto the assumption of 119883(1) = 119883(2) and 119909(1)
119894119895
119909(2)119894119895
isin 0 1 theremust exist 1 le 119894
0
1198950
le 119899 such that 119909(1)119894
0119895
0
minus 119909(2)
119894
0119895
0
= 1 namely119909(1)
119894
0119895
0
= 1 119909(2)
119894
0119895
0
= 0 Then by (7)
119888119894
0119895
0
+ sum
119894 = 119894
0
sum
119895 = 119895
0
119888119894119895
(119909(1)
119894119895
minus 119909(2)
119894119895
) = 0 (8)
Since
119909(1)
119894119895
minus 119909(2)
119894119895
=
1 119909(1)
119894119895
= 1 119909(2)
119894119895
= 0
0 119909(1)
119894119895
= 119909(2)
119894119895
minus1 119909(1)
119894119895
= 0 119909(2)
119894119895
= 1
(9)
then by (8) there must exist 119888119894
1119895
1
119888119894
119896119895
119896
isin 119862(1)
2
such that119909(1)
119894
119903119895
119903
minus119909(2)
119894
119903119895
119903
= minus1 119903 = 1 119896 and 119888119894
0119895
0
+sum119896
119903=1
119888119894
119903119895
119903
(119909(1)
119894
119903119895
119903
minus119909(2)
119894
119903119895
119903
) =
0 namely
119888119894
0119895
0
=
119896
sum
119903=1
119888119894
119903119895
119903
(10)
where 119909(1)119894
0119895
0
= 1 119909(1)
119894
119903119895
119903
= 0 119903 = 1 119896 This contradictswith the known condition that any element in 119862(1)
1
cannot beexpressed as the sum of any number of elements in 119862(1)
2
Theorem 8 Let 119866 = (VE) V = (1198811
1198812
119881119899
) be theclassifier structure obtained by step 4 of BC-OM where 119881
1
is the class variable and 1198812
119881119899
are attribute variables119866lowast
= (VlowastElowast) denotes the final output of BC-OM then theclassification results obtained by 119866 and 119866lowast are consistent
Proof Without loss of generality suppose (1198891198942
119889119894119899
) is anexample to be classified The classifier represented by 119866 isgiven as follows
119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
| 1199072
= 1198891198942
119907119899
= 119889119894119899
) (11)
We write the right side of (11) as 119901(1199071
| 1199072
119907119899
) in shortWe can suppose that 119896 redundant variables were deleted instep 5 of BC-OM say the last 119896 variables 119881
119899minus119896+1
119881119899minus119896+2
119881119899
Then Vlowast = 1198811
1198812
119881119899minus119896
According to step 51198812
119881119899minus119896
d-separates 1198811
and 119881119899minus119896+1
119881119899
Thus1198811
is conditional independent with 119881119899minus119896+1
119881119899
given1198812
119881119899minus119896
Equation (11) can be reduced as follows
119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
1199072
119907119899
)
119901 (1199072
119907119899
)
lArrrArr 119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
1199072
119907119899
)
lArrrArr 119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
119907119899minus119896+1
119907119899
| 1199072
119907119899minus119896
)
sdot 119901 (1199072
119907119899minus119896
)
lArrrArr 119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
| 1199072
119907119899minus119896
)
sdot 119901 (119907119899minus119896+1
119907119899
| 1199072
119907119899minus119896
)
lArrrArr 119907lowast
1
= arg max119907
1isindom(119881
1)
119901 (1199071
| 1199072
119907119899minus119896
)
(12)
We get the results
Theorem 8 reveals that it is effective and correct to removeredundant or irrelevant attributes using d-separation ruleand the performance of Bayesian classifier can be improved
4 Experimental Results
We run our experiments on 20 data sets from the UCI repos-itory of Machine Learning datasets [28] which represent awide range of domains and data characteristics Table 1 showsthe description of the 20 data sets which are ordered byascending number of samples In our experiments missingvalues are replaced with the modes and means of thecorresponding attribute values from the available data Forexample if the sex of someone is missing it can be replacedby the mode (the value with the highest frequency) of thesexes of all the others Besides we manually delete three
Table 1: Descriptions of the UCI data sets used in the experiments.

No. | Dataset          | Instances | Attributes | Classes | Missing values (N/Y)
 1  | Shuttle          |    15     |     6      |    2    | N
 2  | Balloon          |    16     |     4      |    2    | N
 3  | Lenses           |    24     |     4      |    3    | N
 4  | Soybean (small)  |    47     |    35      |    4    | N
 5  | Post-patient     |    90     |     8      |    3    | Y
 6  | Zoo              |   101     |    17      |    7    | N
 7  | Hayes-roth       |   132     |     5      |    3    | N
 8  | Iris             |   150     |     4      |    3    | N
 9  | Glass            |   214     |    10      |    7    | N
10  | Heart            |   267     |    22      |    2    | N
11  | Ionosphere       |   351     |    34      |    2    | N
12  | Monk's           |   432     |     7      |    2    | N
13  | Vote             |   435     |    16      |    2    | Y
14  | Balance-scale    |   625     |     5      |    3    | N
15  | Tic-tac-toe      |   958     |     9      |    2    | N
16  | Car-evaluation   |  1728     |     6      |    4    | N
17  | Chess (kr-vs-kp) |  3196     |    36      |    2    | N
18  | Mushroom         |  8124     |    22      |    2    | Y
19  | Nursery          | 12960     |     8      |    5    | N
20  | Letter           | 20000     |    16      |   26    | N
useless attributes: the attribute "ID number" in the data set "Glass", the attribute "name" in the data set "Hayes-roth", and the attribute "animal name" in the data set "Zoo".
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB of memory, and Windows XP. Our implementation is based on the Bayes Net Toolbox for Matlab [29], which provides source code for several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with naive Bayes, TAN, and BAN in terms of classification accuracy. The accuracy of each model is the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross-validation. Runs of the various algorithms are carried out on the same training sets and evaluated on the same test sets; in particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compared the related algorithms via a two-tailed t-test with a 95 percent confidence level. Following standard statistical practice, we speak of two results on a data set as being "significantly different" only if the probability of a significant difference is at least 95 percent [30].
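Keeping the cross-validation folds identical across all four algorithms can be sketched as follows: shuffle the sample indices once per run with a fixed seed and deal them into five folds, then hand the same folds to every model. This is an illustrative sketch, not the Matlab code actually used.

```python
import random

def five_fold_indices(n_samples, seed):
    """Shuffle sample indices once and deal them into 5 folds, so that
    every algorithm in a run is trained and tested on identical splits."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)          # seeded: reproducible per run
    return [idx[f::5] for f in range(5)]

# One of the 10 runs: fold 0 serves as the test set, the rest as training.
folds = five_fold_indices(100, seed=0)
test_set = folds[0]
train_set = [i for fold in folds[1:] for i in fold]

assert sorted(test_set + train_set) == list(range(100))   # exact partition
assert five_fold_indices(100, seed=0) == folds            # same folds for every model
```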
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set; the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row the best of the four classifier results is displayed in bold. If another model's performance is not significantly different from the best, it is also highlighted; if the differences among all four classifiers are not statistically significant, then none of them is highlighted. From our experiments we can see that BC-OM is best in 6 cases, while NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. More generally, from the first data set to the last, the number of highlighted entries in the sixth (BC-OM) column of Table 2 grows, which means the advantage of BC-OM becomes more evident as data size increases.
Table 3 shows the results of the two-tailed t-test, in which each entry w/t/l means that the model in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets compared with the model in the corresponding column. From Table 3 we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set; its x coordinate is the percentage of misclassifications according to NB, TAN, or BAN, and its y coordinate is the percentage of misclassifications according to BC-OM. Thus, points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3 we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers, both in terms of accuracy as well as
Table 2: Detailed experimental results: accuracy (%) ± standard deviation.

No. | Dataset          | NB           | TAN          | BAN          | BC-OM
 1  | Shuttle          | 83.52 ± 5.85 | 85.33 ± 7.03 | 84.43 ± 6.83 | 82.33 ± 5.94
 2  | Balloon          | 89.15 ± 3.06 | 85.16 ± 3.15 | 87.16 ± 2.92 | 83.75 ± 3.57
 3  | Lenses           | 92.56 ± 2.44 | 93.25 ± 2.85 | 89.67 ± 3.16 | 91.12 ± 3.02
 4  | Soybean (small)  | 90.91 ± 2.35 | 84.98 ± 3.16 | 87.36 ± 2.44 | 89.74 ± 3.23
 5  | Post-patient     | 80.29 ± 3.39 | 79.94 ± 3.47 | 81.26 ± 3.56 | 80.12 ± 3.48
 6  | Zoo              | 93.46 ± 5.12 | 95.21 ± 4.78 | 94.34 ± 5.16 | 94.88 ± 5.09
 7  | Hayes-roth       | 81.82 ± 5.27 | 87.34 ± 4.89 | 84.58 ± 3.74 | 84.55 ± 3.56
 8  | Iris             | 94.13 ± 6.12 | 93.47 ± 6.65 | 93.81 ± 6.80 | 89.07 ± 7.25
 9  | Glass            | 66.02 ± 9.88 | 59.67 ± 9.52 | 58.32 ± 9.98 | 56.96 ± 9.81
10  | Heart            | 82.41 ± 6.62 | 82.53 ± 7.01 | 82.65 ± 7.03 | 81.53 ± 6.91
11  | Ionosphere       | 90.54 ± 4.03 | 90.61 ± 3.98 | 91.45 ± 4.14 | 90.99 ± 4.82
12  | Monk's           | 84.28 ± 1.14 | 83.29 ± 1.29 | 83.79 ± 1.23 | 86.64 ± 1.02
13  | Vote             | 94.88 ± 3.57 | 90.37 ± 3.34 | 92.63 ± 3.98 | 90.97 ± 2.95
14  | Balance-scale    | 90.91 ± 1.51 | 84.97 ± 2.44 | 87.94 ± 2.31 | 90.77 ± 1.85
15  | Tic-tac-toe      | 70.11 ± 4.45 | 73.16 ± 4.47 | 75.74 ± 4.83 | 76.21 ± 3.82
16  | Car-evaluation   | 84.95 ± 2.74 | 93.44 ± 1.89 | 89.2 ± 1.02  | 95.42 ± 1.03
17  | Chess (kr-vs-kp) | 87.88 ± 1.91 | 91.68 ± 1.66 | 90.27 ± 1.12 | 92.65 ± 1.4
18  | Mushroom         | 92.85 ± 0.56 | 99.89 ± 0.1  | 96.37 ± 0.76 | 99.91 ± 0.1
19  | Nursery          | 90.54 ± 1.14 | 94.07 ± 1.29 | 92.31 ± 1.23 | 89.95 ± 1.09
20  | Letter           | 69.32 ± 3.97 | 77.58 ± 1.83 | 73.45 ± 1.67 | 77.66 ± 2.01
    | Average          | 86.85 ± 4.81 | 86.63 ± 3.7  | 85.69 ± 3.69 | 86.76 ± 3.57
Table 3: Results of the two-tailed t-test on accuracy with the 95 percent confidence level.

w/t/l  | NB     | TAN    | BAN
TAN    | 13/3/4 | —      | —
BAN    | 12/4/4 | 10/5/5 | —
BC-OM  | 9/7/4  | 12/3/5 | 11/4/5
the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in the results of Figures 2 and 3; in other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.
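The win/tie/loss tallies of Table 3 come from per-data-set two-tailed paired t-tests on fold accuracies. The sketch below is a simplified illustration: it uses a generic critical value of 2.0 rather than the exact two-tailed 95% quantile for the actual degrees of freedom, and the accuracy lists are made up.

```python
from statistics import mean, stdev

def paired_t(a, b):
    """Paired t statistic for two per-fold accuracy lists of equal length.
    Note: undefined (division by zero) if all differences are identical."""
    d = [x - y for x, y in zip(a, b)]
    return mean(d) / (stdev(d) / len(d) ** 0.5)

def win_tie_loss(acc_row, acc_col, t_crit=2.0):
    """Count data sets where the row model significantly beats, ties with,
    or loses to the column model; t_crit = 2.0 is illustrative only."""
    w = t = l = 0
    for a, b in zip(acc_row, acc_col):    # one accuracy list per data set
        s = paired_t(a, b)
        if s > t_crit:
            w += 1
        elif s < -t_crit:
            l += 1
        else:
            t += 1
    return w, t, l

# One clearly significant win on a single (made-up) data set:
assert win_tie_loss([[0.90, 0.91, 0.92, 0.90, 0.91]],
                    [[0.80, 0.81, 0.80, 0.82, 0.81]]) == (1, 0, 0)
```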
5 Conclusions
In many real-world applications, classification is required to make optimal decisions. In this paper we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model, BC-OM. We conducted a systematic experimental study on a number of UCI data sets. The experimental results show that BC-OM performs better than the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it would be useful to model correlations among
Figure 2: Relative errors of BC-OM and NB (scatter plot; x-axis: NB error, y-axis: BC-OM error, both 0–50).
attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity of the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.

In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the data set. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and make
Figure 3: Relative errors of BC-OM and TAN (scatter plot; x-axis: TAN error, y-axis: BC-OM error, both 0–50).
Figure 4: Relative errors of BC-OM and BAN (scatter plot; x-axis: BAN error, y-axis: BC-OM error, both 0–50).
its advantage stronger. This is the main direction of our future work.
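The dependence coefficients mentioned above are estimated with the chi-squared statistic. The sketch below computes the standard Pearson chi-squared statistic from a contingency table of observed counts for two discrete attributes; the paper's exact coefficient definition, given earlier, may differ in normalization.

```python
def chi_squared(table):
    """Pearson chi-squared statistic for a 2-D contingency table of
    observed counts between two discrete attributes."""
    row_tot = [sum(row) for row in table]
    col_tot = [sum(col) for col in zip(*table)]
    total = sum(row_tot)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_tot[i] * col_tot[j] / total   # expected count under independence
            stat += (obs - exp) ** 2 / exp
    return stat

# Strongly dependent attributes yield a much larger statistic than
# nearly independent ones (counts below are illustrative):
assert chi_squared([[30, 5], [4, 31]]) > chi_squared([[17, 18], [18, 17]])
```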
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).
References
[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, Bayes Net Toolbox for Matlab, 2007, https://code.google.com/p/bnt.
[30] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
Theorem 6. Let \(\mathcal{X} = \{X = (x_{ij})_{n\times n} \mid x_{ij} \in \{0,1\}\}\). There always exists an \(X \in \mathcal{X}\) such that \(X\) is a feasible point of (OM).

Proof. Given the set of variables \(V = \{V_1, V_2, \ldots, V_n\}\), where \(V_1\) is the class variable and \(V_2, \ldots, V_n\) are the attribute variables, define a matrix \(X\) as follows:

\[
x_{ij} =
\begin{cases}
1, & i = 1,\ j = 2, \ldots, n, \\
0, & \text{otherwise.}
\end{cases}
\tag{5}
\]

Obviously, the adjacency matrix \(X\) always satisfies constraints (1)–(3); in fact, the graph represented by \(X\) is the naive Bayes classifier. Thus \(X\) is a feasible solution of (OM).
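The matrix defined in (5) can be materialized directly as a sanity check. The assertions below verify only the properties visible in this excerpt (the class node \(V_1\) is the sole parent of every attribute node, there are no other arcs, and no self-loops); constraints (1)–(3) themselves are defined earlier in the paper, so this is only an illustrative sketch.

```python
n = 5  # one class variable V1 plus four attribute variables (illustrative size)

# Matrix (5): x[0][j] = 1 for j >= 1, i.e. only arcs V1 -> Vj exist.
X = [[1 if i == 0 and j > 0 else 0 for j in range(n)] for i in range(n)]

# The graph encoded by X is the naive Bayes structure:
assert all(X[0][j] == 1 for j in range(1, n))                   # V1 points to every attribute
assert all(X[i][j] == 0 for i in range(1, n) for j in range(n)) # attributes have no children
assert all(X[i][i] == 0 for i in range(n))                      # no self-loops; the graph is acyclic
```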
According to Theorem 6, there exists a feasible classifier satisfying constraints (1)–(3). Theorem 7 further shows that such a classifier is unique under a certain condition.
Theorem 7. Let \(X^\ast\) be the optimal solution of (OM), and let \(C_1 = \{c_{ij} \mid x_{ij} = 1\}\) and \(C_2 = \{c_{ij} \mid x_{ij} = 0\}\) be the coefficient sets, where \(x_{ij}\) is an element of \(X^\ast\). Then \(X^\ast\) is the unique solution of (OM) if and only if no element of \(C_1\) can be expressed as the sum of any number of elements of \(C_2\).
Proof. Without loss of generality, suppose by way of contradiction that \(X^{(1)}\) and \(X^{(2)}\) are two optimal solutions of (OM). The value of the objective function is the same at both solutions, that is,

\[
\sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} c_{ij}\, x^{(1)}_{ij} = \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} c_{ij}\, x^{(2)}_{ij}
\tag{6}
\]

\[
\Longrightarrow\quad \sum_{i=1}^{n} \sum_{\substack{j=1 \\ j \neq i}}^{n} c_{ij} \left( x^{(1)}_{ij} - x^{(2)}_{ij} \right) = 0.
\tag{7}
\]

Let \(C^{(1)}_1 = \{c_{ij} \mid x^{(1)}_{ij} = 1\}\) and \(C^{(1)}_2 = \{c_{ij} \mid x^{(1)}_{ij} = 0\}\). Since \(X^{(1)} \neq X^{(2)}\) and \(x^{(1)}_{ij}, x^{(2)}_{ij} \in \{0,1\}\), there must exist \(1 \le i_0, j_0 \le n\) such that \(x^{(1)}_{i_0 j_0} - x^{(2)}_{i_0 j_0} = 1\), namely \(x^{(1)}_{i_0 j_0} = 1\) and \(x^{(2)}_{i_0 j_0} = 0\). Then, by (7),

\[
c_{i_0 j_0} + \sum_{i \neq i_0} \sum_{j \neq j_0} c_{ij} \left( x^{(1)}_{ij} - x^{(2)}_{ij} \right) = 0.
\tag{8}
\]

Since

\[
x^{(1)}_{ij} - x^{(2)}_{ij} =
\begin{cases}
1, & x^{(1)}_{ij} = 1,\ x^{(2)}_{ij} = 0, \\
0, & x^{(1)}_{ij} = x^{(2)}_{ij}, \\
-1, & x^{(1)}_{ij} = 0,\ x^{(2)}_{ij} = 1,
\end{cases}
\tag{9}
\]

by (8) there must exist \(c_{i_1 j_1}, \ldots, c_{i_k j_k} \in C^{(1)}_2\) such that \(x^{(1)}_{i_r j_r} - x^{(2)}_{i_r j_r} = -1\), \(r = 1, \ldots, k\), and \(c_{i_0 j_0} + \sum_{r=1}^{k} c_{i_r j_r} \left( x^{(1)}_{i_r j_r} - x^{(2)}_{i_r j_r} \right) = 0\), namely

\[
c_{i_0 j_0} = \sum_{r=1}^{k} c_{i_r j_r},
\tag{10}
\]

where \(x^{(1)}_{i_0 j_0} = 1\) and \(x^{(1)}_{i_r j_r} = 0\), \(r = 1, \ldots, k\). This contradicts the condition that no element of \(C^{(1)}_1\) can be expressed as the sum of any number of elements of \(C^{(1)}_2\).
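The uniqueness condition of Theorem 7 can be checked directly for small coefficient sets by enumerating subset sums. The brute-force sketch below is exponential in \(|C_2|\) and the coefficient values are purely illustrative.

```python
from itertools import combinations

def unique_solution_condition(c1, c2):
    """Check Theorem 7's condition: no coefficient in C1 equals the sum of
    any nonempty subset of coefficients in C2. Brute force, small sets only."""
    subset_sums = {sum(s) for k in range(1, len(c2) + 1)
                   for s in combinations(c2, k)}
    return not any(c in subset_sums for c in c1)

# 7.5 and 10.0 match no subset sum of {1.0, 2.0, 3.5}: condition holds.
assert unique_solution_condition([10.0, 7.5], [1.0, 2.0, 3.5]) is True
# 6.5 = 1.0 + 2.0 + 3.5, so uniqueness is not guaranteed here.
assert unique_solution_condition([6.5, 10.0], [1.0, 2.0, 3.5]) is False
```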
Theorem 8. Let \(G = (\mathbf{V}, \mathbf{E})\), \(\mathbf{V} = \{V_1, V_2, \ldots, V_n\}\), be the classifier structure obtained by step 4 of BC-OM, where \(V_1\) is the class variable and \(V_2, \ldots, V_n\) are attribute variables. Let \(G^\ast = (\mathbf{V}^\ast, \mathbf{E}^\ast)\) denote the final output of BC-OM. Then the classification results obtained by \(G\) and \(G^\ast\) are consistent.
Proof. Without loss of generality, suppose \((d_{i2}, \ldots, d_{in})\) is an example to be classified. The classifier represented by \(G\) is given as follows:

\[
v_1^\ast = \arg\max_{v_1 \in \operatorname{dom}(V_1)} p\left( v_1 \mid v_2 = d_{i2}, \ldots, v_n = d_{in} \right).
\tag{11}
\]

We write the right side of (11) as \(p(v_1 \mid v_2, \ldots, v_n\)
[30] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006
Table 1: Descriptions of UCI datasets used in the experiments.

NO  Dataset            Instances  Attributes  Classes  Missing value (N/Y)
1   Shuttle            15         6           2        N
2   Balloon            16         4           2        N
3   Lenses             24         4           3        N
4   Soybean (small)    47         35          4        N
5   Post-patient       90         8           3        Y
6   Zoo                101        17          7        N
7   Hayes-roth         132        5           3        N
8   Iris               150        4           3        N
9   Glass              214        10          7        N
10  Heart              267        22          2        N
11  Ionosphere         351        34          2        N
12  Monk's             432        7           2        N
13  Vote               435        16          2        Y
14  Balance-scale      625        5           3        N
15  Tic-tac-toe        958        9           2        N
16  Car-evaluation     1728       6           4        N
17  Chess (kr-vs-kp)   3196       36          2        N
18  Mushroom           8124       22          2        Y
19  Nursery            12960      8           5        N
20  Letter             20000      16          26       N
useless attributes: the attribute "ID number" in the dataset "Glass", the attribute "name" in the dataset "Hayes-roth", and the attribute "animal name" in the dataset "Zoo".
The experimental platform is a personal computer with a Pentium 4 3.06 GHz CPU, 0.99 GB memory, and Windows XP. Our implementation is based on the BayesNet Toolbox for Matlab [29], which provides source code to perform several operations on Bayesian networks. The purpose of these experiments is to compare the performance of the proposed BC-OM with naive Bayes, TAN, and BAN in terms of classifier accuracy. The accuracy of each model is based on the percentage of successful predictions on the test sets of each data set. In all experiments, the accuracy of each model on each data set is obtained via 10 runs of 5-fold cross-validation. Runs with the various algorithms are carried out on the same training sets and evaluated on the same test sets; in particular, the cross-validation folds are the same for all the experiments on each data set. Finally, we compared the related algorithms via a two-tailed t-test with a 95 percent confidence level. According to statistical theory, we speak of two results for a data set as being "significantly different" only if the probability of significant difference is at least 95 percent [30].
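The evaluation protocol above (10 runs of 5-fold cross-validation, with every classifier scored on identical folds) can be sketched as follows. This is an illustrative Python re-implementation, not the paper's Matlab code; `train_fn` and `predict_fn` are placeholder hooks standing in for any of the compared classifiers.

```python
import random
from statistics import mean, stdev

def cross_validation_accuracy(data, labels, train_fn, predict_fn,
                              runs=10, folds=5, seed=0):
    """Estimate accuracy via `runs` repetitions of `folds`-fold CV.

    `train_fn(train_data, train_labels)` returns a fitted model;
    `predict_fn(model, sample)` returns a predicted label.
    Using a fixed seed keeps the folds identical for every classifier,
    mirroring the paper's setup.
    """
    rng = random.Random(seed)
    n = len(data)
    fold_accuracies = []
    for _ in range(runs):
        idx = list(range(n))
        rng.shuffle(idx)
        for k in range(folds):
            test_idx = idx[k::folds]          # every folds-th shuffled index
            test_set = set(test_idx)
            train_idx = [i for i in idx if i not in test_set]
            model = train_fn([data[i] for i in train_idx],
                             [labels[i] for i in train_idx])
            correct = sum(predict_fn(model, data[i]) == labels[i]
                          for i in test_idx)
            fold_accuracies.append(correct / len(test_idx))
    return mean(fold_accuracies), stdev(fold_accuracies)

# Usage with a trivial majority-class "classifier" (placeholder for NB/TAN/BAN/BC-OM):
data = list(range(100))
labels = [0] * 80 + [1] * 20
acc, sd = cross_validation_accuracy(
    data, labels,
    train_fn=lambda X, y: max(set(y), key=y.count),  # model = majority label
    predict_fn=lambda model, x: model)
```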
Table 2 shows the accuracy (and standard deviation of accuracy) of each model on each data set; the average values and standard deviations over all data sets are summarized at the bottom of the table. In each row, the best of the four classifier results is displayed in bold. If another's performance is not significantly different from the best, it is also highlighted; but if the differences between all four classifiers are not statistically significant, then none of them
is highlighted. From our experiments, we can see that BC-OM is best in 6 cases; NB, TAN, and BAN are best in 6, 5, and 3 cases, respectively. When the number of samples is larger than 400, the performance of TAN and BAN is better than that of NB, and BC-OM is best. Although the performance of BC-OM and TAN becomes similar as the sample size increases, BC-OM has a higher accuracy on average. From a general point of view, moving from the first data set to the last, the highlighted numbers in the sixth column of Table 2 change from few to more, which means the advantage of BC-OM becomes more evident as the data size increases.
Table 3 shows the compared results of the two-tailed t-test, in which each entry w/t/l means that the model in the corresponding row wins in w data sets, ties in t data sets, and loses in l data sets compared to the model in the corresponding column. From Table 3, we can see that BC-OM significantly outperforms NB (9 wins and 4 losses), TAN (12 wins and 5 losses), and BAN (11 wins and 5 losses) in accuracy. Figures 2, 3, and 4 show scatter plots comparing BC-OM with NB, TAN, and BAN, respectively. In each scatter plot, a point represents a data set, where the x coordinate of the point is the percentage of misclassifications according to NB, TAN, or BAN, and the y coordinate is the percentage of misclassifications according to BC-OM. Thus, points below the diagonal line correspond to data sets on which BC-OM performs better. From Figures 2 and 3, we can see that BC-OM generally outperforms NB and TAN, as is also demonstrated in Table 3. This provides strong evidence that BC-OM performs well against the other two classifiers, both in terms of accuracy as well as
Table 2: The detailed experimental results on accuracy and standard deviation.

NO  Dataset            NB             TAN            BAN            BC-OM
1   Shuttle            83.52 ± 5.85   85.33 ± 7.03   84.43 ± 6.83   82.33 ± 5.94
2   Balloon            89.15 ± 3.06   85.16 ± 3.15   87.16 ± 2.92   83.75 ± 3.57
3   Lenses             92.56 ± 2.44   93.25 ± 2.85   89.67 ± 3.16   91.12 ± 3.02
4   Soybean (small)    90.91 ± 2.35   84.98 ± 3.16   87.36 ± 2.44   89.74 ± 3.23
5   Post-patient       80.29 ± 3.39   79.94 ± 3.47   81.26 ± 3.56   80.12 ± 3.48
6   Zoo                93.46 ± 5.12   95.21 ± 4.78   94.34 ± 5.16   94.88 ± 5.09
7   Hayes-roth         81.82 ± 5.27   87.34 ± 4.89   84.58 ± 3.74   84.55 ± 3.56
8   Iris               94.13 ± 6.12   93.47 ± 6.65   93.81 ± 6.80   89.07 ± 7.25
9   Glass              66.02 ± 9.88   59.67 ± 9.52   58.32 ± 9.98   56.96 ± 9.81
10  Heart              82.41 ± 6.62   82.53 ± 7.01   82.65 ± 7.03   81.53 ± 6.91
11  Ionosphere         90.54 ± 4.03   90.61 ± 3.98   91.45 ± 4.14   90.99 ± 4.82
12  Monk's             84.28 ± 1.14   83.29 ± 1.29   83.79 ± 1.23   86.64 ± 1.02
13  Vote               94.88 ± 3.57   90.37 ± 3.34   92.63 ± 3.98   90.97 ± 2.95
14  Balance-scale      90.91 ± 1.51   84.97 ± 2.44   87.94 ± 2.31   90.77 ± 1.85
15  Tic-tac-toe        70.11 ± 4.45   73.16 ± 4.47   75.74 ± 4.83   76.21 ± 3.82
16  Car-evaluation     84.95 ± 2.74   93.44 ± 1.89   89.2 ± 1.02    95.42 ± 1.03
17  Chess (kr-vs-kp)   87.88 ± 1.91   91.68 ± 1.66   90.27 ± 1.12   92.65 ± 1.4
18  Mushroom           92.85 ± 0.56   99.89 ± 0.1    96.37 ± 0.76   99.91 ± 0.1
19  Nursery            90.54 ± 1.14   94.07 ± 1.29   92.31 ± 1.23   89.95 ± 1.09
20  Letter             69.32 ± 3.97   77.58 ± 1.83   73.45 ± 1.67   77.66 ± 2.01
Average                86.85 ± 4.81   86.63 ± 3.7    85.69 ± 3.69   86.76 ± 3.57
Table 3: The compared results of the two-tailed t-test on accuracy with the 95 percent confidence level.

w/t/l   NB       TAN      BAN
TAN     13/3/4   —        —
BAN     12/4/4   10/5/5   —
BC-OM   9/7/4    12/3/5   11/4/5
the percentage of misclassifications. Figure 4 also shows BC-OM outperforming BAN, though the difference in performance is not as marked as in the results of Figures 2 and 3. In other words, the performance of BC-OM and BAN is similar in terms of the percentage of misclassifications. However, BC-OM has a higher accuracy and a simpler graph structure, which suggests that BC-OM is able to handle very large data sets and is a more promising classifier.
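The w/t/l bookkeeping behind Table 3 reduces to a simple counting rule: a win or loss is recorded only when the t-test flags the accuracy difference as significant, and the data set counts as a tie otherwise. A minimal sketch of that rule (the significance flags would come from the per-fold two-tailed t-test; the names and data here are illustrative):

```python
def win_tie_loss(acc_a, acc_b, significant):
    """Tabulate w/t/l counts of model A versus model B over data sets,
    following Table 3's convention: non-significant differences count
    as ties, regardless of which mean accuracy is higher."""
    w = t = l = 0
    for a, b, sig in zip(acc_a, acc_b, significant):
        if not sig:
            t += 1
        elif a > b:
            w += 1
        else:
            l += 1
    return w, t, l

# Hypothetical per-data-set mean accuracies and significance flags:
a = [0.90, 0.80, 0.70, 0.60]
b = [0.85, 0.80, 0.75, 0.60]
sig = [True, False, True, False]
counts = win_tie_loss(a, b, sig)   # -> (1, 2, 1)
```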
5 Conclusions
In many real-world applications, classification is often required to make optimal decisions. In this paper, we summarize the existing improved algorithms for naive Bayes and propose a novel Bayesian classifier model, BC-OM. We conducted a systematic experimental study on a number of UCI datasets. The experimental results show that BC-OM has a better performance compared to the other state-of-the-art models for augmenting naive Bayes. It is clear that in some situations it would be useful to model correlations among
Figure 2: Relative errors of BC-OM and NB (scatter plot; x-axis: NB error, y-axis: BC-OM error, both from 0 to 50).
attributes. BC-OM is a good tradeoff between the quality of the approximation of correlations among attributes and the computational complexity of the learning stage. Considering its simplicity, BC-OM is a promising model that could be used in many fields.
In addition, we use the chi-squared statistic to estimate the dependence coefficients among attributes from the dataset. We believe that the use of more sophisticated methods could improve the performance of the current BC-OM and make
Figure 3: Relative errors of BC-OM and TAN (scatter plot; x-axis: TAN error, y-axis: BC-OM error, both from 0 to 50).
Figure 4: Relative errors of BC-OM and BAN (scatter plot; x-axis: BAN error, y-axis: BC-OM error, both from 0 to 50).
its advantage stronger. This is the main research direction for our future work.
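The chi-squared statistic mentioned above, computed on the contingency table of two attributes, is the standard Pearson statistic. A minimal sketch (illustrative only, not the paper's exact dependence-coefficient estimator):

```python
def chi_squared_statistic(table):
    """Pearson chi-squared statistic for a two-attribute contingency table:
    chi2 = sum_ij (O_ij - E_ij)^2 / E_ij, with E_ij = (row_i * col_j) / N.
    Larger values indicate stronger dependence between the two attributes."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2

independent = chi_squared_statistic([[10, 20], [20, 40]])  # proportional rows -> 0.0
dependent = chi_squared_statistic([[20, 10], [10, 20]])    # 20/3, clear dependence
```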
Acknowledgments
This work was supported by the National Natural Science Foundation of China (nos. 60974082 and 61075055), the National Funds of China for Young Scientists (no. 11001214), and the Fundamental Research Funds for the Central Universities (no. K5051270013).
References
[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, BayesNet Toolbox for Matlab, 2007, https://code.google.com/p/bnt/.
[30] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
Mathematical Problems in Engineering 7
Table 2 The detailed experimental results on accuracy and standard deviation
NO Dataset NB TAN BAN BC-OM1 Shuttle 8352 plusmn 585 8533 plusmn 703 8443 plusmn 683 8233 plusmn 594
2 Balloon 8915 plusmn 306 8516 plusmn 315 8716 plusmn 292 8375 plusmn 357
3 Lenses 9256 plusmn 244 9325 plusmn 285 8967 plusmn 316 9112 plusmn 302
4 Soybean (small) 9091 plusmn 235 8498 plusmn 316 8736 plusmn 244 8974 plusmn 323
5 Post-patient 8029 plusmn 339 7994 plusmn 347 8126 plusmn 356 8012 plusmn 348
6 Zoo 9346 plusmn 512 9521 plusmn 478 9434 plusmn 516 9488 plusmn 509
7 Hayes-roth 8182 plusmn 527 8734 plusmn 489 8458 plusmn 374 8455 plusmn 356
8 Iris 9413 plusmn 612 9347 plusmn 665 9381 plusmn 680 8907 plusmn 725
9 Glass 6602 plusmn 988 5967 plusmn 952 5832 plusmn 998 5696 plusmn 981
10 Heart 8241 plusmn 662 8253 plusmn 701 8265 plusmn 703 8153 plusmn 691
11 Ionosphere 9054 plusmn 403 9061 plusmn 398 9145 plusmn 414 9099 plusmn 48212 Monkrsquos 8428 plusmn 114 8329 plusmn 129 8379 plusmn 123 8664 plusmn 10213 Vote 9488 plusmn 357 9037 plusmn 334 9263 plusmn 398 9097 plusmn 295
14 Balance-scale 9091 plusmn 151 8497 plusmn 244 8794 plusmn 231 9077 plusmn 18515 Tic-tac-toe 7011 plusmn 445 7316 plusmn 447 7574 plusmn 483 7621 plusmn 38216 Car-evaluation 8495 plusmn 274 9344 plusmn 189 892 plusmn 102 9542 plusmn 10317 Chess (kr-vs-kp) 8788 plusmn 191 9168 plusmn 166 9027 plusmn 112 9265 plusmn 1418 Mushroom 9285 plusmn 056 9989 plusmn 01 9637 plusmn 076 9991 plusmn 0119 Nursery 9054 plusmn 114 9407 plusmn 129 9231 plusmn 123 8995 plusmn 109
20 Letter 6932 plusmn 397 7758 plusmn 183 7345 plusmn 167 7766 plusmn 201Average 8685 plusmn 481 8663 plusmn 37 8569 plusmn 369 8676 plusmn 357
Table 3 The compared results of two-tailed 119905-test on accuracy withthe 95 percent confidence level
119908119905119897 NB TAN BANTAN 1334 mdash mdashBAN 1244 1055 mdashBC-OM 974 1235 1145
the percentage of misclassifications Figure 4 also shows BC-OM outperforming BAN though the difference in perfor-mance is not as marked as in the results of Figures 2 and 3 Inother words the performance of BC-OM and BAN is similarin terms of the percentage ofmisclassificationsHowever BC-OMhas a higher accuracy and amore simple graph structurewhich suggests that BC-OM is able to handle very large datasets and is a more promising classifier
5 Conclusions
In many real-world applications classification is oftenrequired to make optimal decisions In this paper we sum-marize the existing improved algorithms for naive Bayesand propose a novel Bayesian classifier model BC-OM Weconducted a systematic experimental study on a number ofUCI datasetsThe experimental results show that BC-OMhasa better performance compared to the other state-of-the-artmodels for augmenting naive Bayes It is clear that in somesituations it would be useful to model correlations among
0 5 10 15 20 25 30 35 40 45 500
5
10
15
20
25
30
35
40
45
50
BC
-OM
err
or
NB error
Figure 2 Relative errors of BC-OM and NB
attributes BC-OM is a good tradeoff between the quality ofthe approximation of correlations among attributes and thecomputational complexity in the learning stage Consideringits simplicity BC-OM is a promisingmodel that could be usedin many field
In addition we use the chi-squared statistic to estimatethe dependence coefficients among attributes from datasetWe believe that the use of more sophisticated methods couldimprove the performance of the current BC-OM and make
8 Mathematical Problems in Engineering
0 5 10 15 20 25 30 35 40 45 500
5
10
15
20
25
30
35
40
45
50
TAN error
BC
-OM
err
or
Figure 3 Relative errors of BC-OM and TAN
BC
-OM
err
or
0
5
10
15
20
25
30
35
40
45
50
0 5 10 15 20 25 30 35 40 45 50
BAN error
Figure 4 Relative errors of BC-OM and BAN
its advantage stronger This is the main research direction forour future work
Acknowledgments
This work was supported by the National Natural ScienceFoundation of China (nos 60974082 and 61075055) theNational Funds of China for Young Scientists (no 11001214)and the Fundamental Research Funds for the Central Univer-sities (no K5051270013)
References
[1] Z Cai S Sun S Si and B Yannou ldquoIdentifying product failurerate based on a conditional Bayesian network classifierrdquo ExpertSystems with Applications vol 38 no 5 pp 5036ndash5043 2011
[2] D C Kim X Wang C R Yang and J Gao ldquoLearningbiological network using mutual information and conditionalindependencerdquo BMC Bioinformatics vol 11 no 3 article S92010
[3] N C Hsieh and L P Hung ldquoA data driven ensemble classifierfor credit scoring analysisrdquo Expert Systems with Applicationsvol 37 no 1 pp 534ndash545 2010
[4] S Mukherjee and N Sharma ldquoIntrusion detection using naiveBayes classifier with feature reductionrdquo Procedia Technologyvol 4 pp 119ndash128 2012
[5] Y Sun Y Y Tang S XDing S P Lv andY F Cui ldquoDiagnose themild cognitive impairment by constructing Bayesian networkwithmissing datardquo Expert Systems with Applications vol 38 no1 pp 442ndash449 2011
[6] V Aquaro M Bardoscia R Bellotti A Consiglio F de Carloand G Ferri ldquoA Bayesian networks approach to operationalriskrdquo Physica A vol 389 no 8 pp 1721ndash1728 2010
[7] S Nikolajewa R Pudimat M Hiller M Platzer and RBackofen ldquoBioBayesNet a web server for feature extractionand Bayesian network modeling of biological sequence datardquoNucleic acids research vol 35 pp W688ndash693 2007
[8] I Ben-Gal A Shani A Gohr et al ldquoIdentification of transcrip-tion factor binding siteswith variable-order BayesiannetworksrdquoBioinformatics vol 21 no 11 pp 2657ndash2666 2005
[9] N Friedman D Geiger andMGoldszmidt ldquoBayesian networkclassifiersrdquoMachine Learning vol 29 no 2-3 pp 131ndash163 1997
[10] P Langley W Iba and K Thompson ldquoAn analysis of Bayesianclassifiersrdquo in Proceedings of the 10th National Conference onArtificial Intelligence P Rosenbloom and P Szolovits Eds pp223ndash228 AAAI Press Menlo Park Calif USA 1992
[11] M Minsky ldquoSteps toward artificial intelligencerdquo Proceedings ofthe Institute of Radio Engineers vol 49 pp 8ndash30 1961
[12] L Jiang H Zhang and Z Cai ldquoA novel bayes model hiddennaive bayesrdquo IEEE Transactions on Knowledge and Data Engi-neering vol 21 no 10 pp 1361ndash1371 2009
[13] A A Balamurugan R Rajaram S Pramala S Rajalakshmi CJeyendran and J Dinesh Surya Prakash ldquoNB+ an improvedNaive Bayesian algorithmrdquo Knowledge-Based Systems vol 24no 5 pp 563ndash569 2011
[14] L Jiang Z Cai D Wang and H Zhang ldquoImproving treeaugmented naive Bayes for class probability estimationrdquo Know-ledge-Based Systems vol 26 pp 239ndash245 2012
[15] J Cheng and R Greiner ldquoComparing Bayesian network clas-sifiersrdquo in Proceedings of the 15th Conference on Uncertainty inArtificial Intelligence K B Laskey and H Prade Eds pp 101ndash108 Morgan Kaufmann San Francisco Calif USA 1999
[16] M G Madden ldquoOn the classification performance of TAN andgeneral Bayesian networksrdquo Knowledge-Based Systems vol 22no 7 pp 489ndash495 2009
[17] J Pearl Probabilistic Reasoning in Intelligent Systems MorganKaufmann San Francisco Calif USA 1988
[18] D M Chickering ldquoLearning Bayesian networks is NP-Com-pleterdquo inLearning FromData Artificial Intelligence and StatisticsV D Fisher and H Lenz Eds pp 121ndash130 Springer New YorkNY USA 1996
[19] C Borgelt ldquoA conditional independence algorithm for learningundirected graphical modelsrdquo Journal of Computer and SystemSciences vol 76 no 1 pp 21ndash33 2010
[20] A M MartacuteJnez-RodracuteJguez J H May and L G VargasldquoAn optimization-based approach for the design of Bayesiannetworksrdquo Mathematical and Computer Modelling vol 48 no7-8 pp 1265ndash1278 2008
[21] L M de Campos ldquoA scoring function for learning Bayesiannetworks based on mutual information and conditional inde-pendence testsrdquo Journal ofMachine Learning Research vol 7 pp2149ndash2187 2006
Mathematical Problems in Engineering 9
[22] X W Chen G Anantha and X T Lin ldquoImproving bayesiannetwork structure learning with mutual information-basednode ordering in the K2 algorithmrdquo IEEE Transactions onKnowledge and Data Engineering vol 20 no 5 pp 628ndash6402008
[23] W J Conover Practical Nonparametric Statistics John Wiley ampSons New York NY USA 1999
[24] S Kullback Information Theory and Statistics Dover 1968[25] G Chartrand and P Zhang Introduction to Graph Theory
McGraw-Hill New York NY USA 2005[26] R E Neapolitan Learning Bayesian Networks Prentice Hall
Englewood Cliffs NJ USA 2004[27] DM Chickering andCMeek ldquoOn the incompatibility of faith-
fulness and monotone DAG faithfulnessrdquo Artificial Intelligencevol 170 no 8-9 pp 653ndash666 2006
[28] A Frank and A Asuncion UCI Machine Learning RepositoryUniversity of California School of Information and ComputerScience Irvine Calif USA 2010 httparchiveicsucieduml
[29] K P Murphy BaysNetToolbox For Matlab 2007 httpscodegooglecompbnt
[30] J Demsar ldquoStatistical comparisons of classifiers over multipledata setsrdquo Journal of Machine Learning Research vol 7 pp 1ndash302006
Submit your manuscripts athttpwwwhindawicom
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical Problems in Engineering
Hindawi Publishing Corporationhttpwwwhindawicom
Differential EquationsInternational Journal of
Volume 2014
Applied MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Probability and StatisticsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Mathematical PhysicsAdvances in
Complex AnalysisJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
OptimizationJournal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
CombinatoricsHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Operations ResearchAdvances in
Journal of
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Function Spaces
Abstract and Applied AnalysisHindawi Publishing Corporationhttpwwwhindawicom Volume 2014
International Journal of Mathematics and Mathematical Sciences
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
The Scientific World JournalHindawi Publishing Corporation httpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Algebra
Discrete Dynamics in Nature and Society
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Decision SciencesAdvances in
Discrete MathematicsJournal of
Hindawi Publishing Corporationhttpwwwhindawicom
Volume 2014 Hindawi Publishing Corporationhttpwwwhindawicom Volume 2014
Stochastic AnalysisInternational Journal of
8 Mathematical Problems in Engineering
0 5 10 15 20 25 30 35 40 45 500
5
10
15
20
25
30
35
40
45
50
TAN error
BC
-OM
err
or
Figure 3 Relative errors of BC-OM and TAN
BC
-OM
err
or
0
5
10
15
20
25
30
35
40
45
50
0 5 10 15 20 25 30 35 40 45 50
BAN error
Figure 4 Relative errors of BC-OM and BAN
its advantage stronger This is the main research direction forour future work
Acknowledgments
This work was supported by the National Natural ScienceFoundation of China (nos 60974082 and 61075055) theNational Funds of China for Young Scientists (no 11001214)and the Fundamental Research Funds for the Central Univer-sities (no K5051270013)
References
[1] Z. Cai, S. Sun, S. Si, and B. Yannou, "Identifying product failure rate based on a conditional Bayesian network classifier," Expert Systems with Applications, vol. 38, no. 5, pp. 5036–5043, 2011.
[2] D. C. Kim, X. Wang, C. R. Yang, and J. Gao, "Learning biological network using mutual information and conditional independence," BMC Bioinformatics, vol. 11, no. 3, article S9, 2010.
[3] N. C. Hsieh and L. P. Hung, "A data driven ensemble classifier for credit scoring analysis," Expert Systems with Applications, vol. 37, no. 1, pp. 534–545, 2010.
[4] S. Mukherjee and N. Sharma, "Intrusion detection using naive Bayes classifier with feature reduction," Procedia Technology, vol. 4, pp. 119–128, 2012.
[5] Y. Sun, Y. Y. Tang, S. X. Ding, S. P. Lv, and Y. F. Cui, "Diagnose the mild cognitive impairment by constructing Bayesian network with missing data," Expert Systems with Applications, vol. 38, no. 1, pp. 442–449, 2011.
[6] V. Aquaro, M. Bardoscia, R. Bellotti, A. Consiglio, F. de Carlo, and G. Ferri, "A Bayesian networks approach to operational risk," Physica A, vol. 389, no. 8, pp. 1721–1728, 2010.
[7] S. Nikolajewa, R. Pudimat, M. Hiller, M. Platzer, and R. Backofen, "BioBayesNet: a web server for feature extraction and Bayesian network modeling of biological sequence data," Nucleic Acids Research, vol. 35, pp. W688–W693, 2007.
[8] I. Ben-Gal, A. Shani, A. Gohr et al., "Identification of transcription factor binding sites with variable-order Bayesian networks," Bioinformatics, vol. 21, no. 11, pp. 2657–2666, 2005.
[9] N. Friedman, D. Geiger, and M. Goldszmidt, "Bayesian network classifiers," Machine Learning, vol. 29, no. 2-3, pp. 131–163, 1997.
[10] P. Langley, W. Iba, and K. Thompson, "An analysis of Bayesian classifiers," in Proceedings of the 10th National Conference on Artificial Intelligence, P. Rosenbloom and P. Szolovits, Eds., pp. 223–228, AAAI Press, Menlo Park, Calif, USA, 1992.
[11] M. Minsky, "Steps toward artificial intelligence," Proceedings of the Institute of Radio Engineers, vol. 49, pp. 8–30, 1961.
[12] L. Jiang, H. Zhang, and Z. Cai, "A novel Bayes model: hidden naive Bayes," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, pp. 1361–1371, 2009.
[13] A. A. Balamurugan, R. Rajaram, S. Pramala, S. Rajalakshmi, C. Jeyendran, and J. Dinesh Surya Prakash, "NB+: an improved Naive Bayesian algorithm," Knowledge-Based Systems, vol. 24, no. 5, pp. 563–569, 2011.
[14] L. Jiang, Z. Cai, D. Wang, and H. Zhang, "Improving tree augmented naive Bayes for class probability estimation," Knowledge-Based Systems, vol. 26, pp. 239–245, 2012.
[15] J. Cheng and R. Greiner, "Comparing Bayesian network classifiers," in Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence, K. B. Laskey and H. Prade, Eds., pp. 101–108, Morgan Kaufmann, San Francisco, Calif, USA, 1999.
[16] M. G. Madden, "On the classification performance of TAN and general Bayesian networks," Knowledge-Based Systems, vol. 22, no. 7, pp. 489–495, 2009.
[17] J. Pearl, Probabilistic Reasoning in Intelligent Systems, Morgan Kaufmann, San Francisco, Calif, USA, 1988.
[18] D. M. Chickering, "Learning Bayesian networks is NP-complete," in Learning from Data: Artificial Intelligence and Statistics V, D. Fisher and H. Lenz, Eds., pp. 121–130, Springer, New York, NY, USA, 1996.
[19] C. Borgelt, "A conditional independence algorithm for learning undirected graphical models," Journal of Computer and System Sciences, vol. 76, no. 1, pp. 21–33, 2010.
[20] A. M. Martínez-Rodríguez, J. H. May, and L. G. Vargas, "An optimization-based approach for the design of Bayesian networks," Mathematical and Computer Modelling, vol. 48, no. 7-8, pp. 1265–1278, 2008.
[21] L. M. de Campos, "A scoring function for learning Bayesian networks based on mutual information and conditional independence tests," Journal of Machine Learning Research, vol. 7, pp. 2149–2187, 2006.
[22] X. W. Chen, G. Anantha, and X. T. Lin, "Improving Bayesian network structure learning with mutual information-based node ordering in the K2 algorithm," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 5, pp. 628–640, 2008.
[23] W. J. Conover, Practical Nonparametric Statistics, John Wiley & Sons, New York, NY, USA, 1999.
[24] S. Kullback, Information Theory and Statistics, Dover, 1968.
[25] G. Chartrand and P. Zhang, Introduction to Graph Theory, McGraw-Hill, New York, NY, USA, 2005.
[26] R. E. Neapolitan, Learning Bayesian Networks, Prentice Hall, Englewood Cliffs, NJ, USA, 2004.
[27] D. M. Chickering and C. Meek, "On the incompatibility of faithfulness and monotone DAG faithfulness," Artificial Intelligence, vol. 170, no. 8-9, pp. 653–666, 2006.
[28] A. Frank and A. Asuncion, UCI Machine Learning Repository, University of California, School of Information and Computer Science, Irvine, Calif, USA, 2010, http://archive.ics.uci.edu/ml.
[29] K. P. Murphy, Bayes Net Toolbox for Matlab, 2007, https://code.google.com/p/bnt.
[30] J. Demsar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.