Decision Tree C4.5

  • Machine Learning in the Real World: C4.5

  • Outline
    Handling Numeric Attributes: Finding Best Split(s)
    Dealing with Missing Values
    Pruning: Pre-pruning, Post-pruning, Error Estimates
    From Trees to Rules

  • Industrial-strength algorithms
    For an algorithm to be useful in a wide range of real-world applications it must:
      Permit numeric attributes
      Allow missing values
      Be robust in the presence of noise
      Be able to approximate arbitrary concept descriptions (at least in principle)
    Basic schemes need to be extended to fulfill these requirements

  • C4.5 History
    ID3, CHAID - 1960s
    C4.5 innovations (Quinlan):
      permit numeric attributes
      deal sensibly with missing values
      pruning to deal with noisy data
    C4.5 - one of the best-known and most widely-used learning algorithms
    Last research version: C4.8, implemented in Weka as J4.8 (Java)
    Commercial successor: C5.0 (available from Rulequest)

  • Numeric attributes
    Standard method: binary splits (e.g. temp < 45)
    Unlike nominal attributes, every numeric attribute has many possible split points
    Solution is a straightforward extension:
      Evaluate info gain (or other measure) for every possible split point of the attribute
      Choose the best split point
      Info gain for the best split point is the info gain for the attribute
    Computationally more demanding (see the sketch below)
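    A minimal sketch of this split-point search in Python (illustrative names such as best_numeric_split; not C4.5's actual code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Try a threshold halfway between each pair of adjacent distinct values
    and return (best_threshold, best_info_gain) for the test value < threshold."""
    pairs = sorted(zip(values, labels))           # sort once per attribute
    n = len(pairs)
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2   # split point halfway between values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

    The info gain of the best threshold is then reported as the info gain of the attribute, so numeric and nominal attributes compete on the same scale.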

  • Weather data - nominal values

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     Hot          High      False  No
    Sunny     Hot          High      True   No
    Overcast  Hot          High      False  Yes
    Rainy     Mild         Normal    False  Yes

    If outlook = sunny and humidity = high then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity = normal then play = yes
    If none of the above then play = yes

  • Weather data - numeric

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     85           85        False  No
    Sunny     80           90        True   No
    Overcast  83           86        False  Yes
    Rainy     75           80        False  Yes

    If outlook = sunny and humidity > 83 then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity < 85 then play = yes
    If none of the above then play = yes

  • Example: split on the temperature attribute

    E.g. temperature < 71.5: yes/4, no/2
         temperature >= 71.5: yes/5, no/3
    Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
    Place split points halfway between values
    Can evaluate all split points in one pass! (see the worked check below)

    Value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
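    A worked check (a sketch, not C4.5 itself) of the 0.939-bit figure for the 71.5 split, using the class counts above:

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

# temperature < 71.5: 4 yes / 2 no; temperature >= 71.5: 5 yes / 3 no
split_info = (6 / 14) * info([4, 2]) + (8 / 14) * info([5, 3])
print(round(split_info, 3))   # -> 0.939 bits, as on the slide
```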

  • Avoid repeated sorting!
    Sort instances by the values of the numeric attribute
    Time complexity for sorting: O(n log n)
    Q: Does this have to be repeated at each node of the tree?
    A: No! Sort order for children can be derived from sort order for parent
    Time complexity of derivation: O(n) (see the sketch below)
    Drawback: need to create and store an array of sorted indices for each numeric attribute
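    A sketch of the O(n) derivation (the function and data layout here are illustrative, not Weka's internals): one stable pass over the parent's sorted index array keeps each child's array sorted.

```python
def split_sorted_indices(sorted_idx, goes_left):
    """sorted_idx: instance indices sorted by a numeric attribute at the parent node.
    goes_left: maps instance index -> True if the chosen test sends it to the left child.
    A single stable pass preserves the sort order within each child: O(n)."""
    left, right = [], []
    for i in sorted_idx:
        (left if goes_left[i] else right).append(i)
    return left, right

# Example: parent's sort order by temperature; instances 0, 2 and 5 go left
parent_order = [3, 0, 5, 2, 4, 1]
mask = {0: True, 1: False, 2: True, 3: False, 4: False, 5: True}
print(split_sorted_indices(parent_order, mask))   # ([0, 5, 2], [3, 4, 1])
```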

  • More speeding up
    Entropy only needs to be evaluated between points of different classes (Fayyad & Irani, 1992)
    Potential optimal breakpoints: breakpoints between values of the same class cannot be optimal
    (see the sketch after the sorted values below)

    Value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
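    A sketch approximating the Fayyad & Irani idea (illustrative names): group the sorted values, then keep a candidate breakpoint only where the neighbouring value groups do not share one single common class.

```python
from itertools import groupby

def candidate_breakpoints(values, labels):
    """values assumed sorted; returns midpoints that could still be optimal splits."""
    groups = [(v, {lab for _, lab in grp})
              for v, grp in groupby(zip(values, labels), key=lambda p: p[0])]
    points = []
    for (v1, cls1), (v2, cls2) in zip(groups, groups[1:]):
        # a breakpoint between two pure runs of the same class cannot be optimal
        if len(cls1) > 1 or len(cls2) > 1 or cls1 != cls2:
            points.append((v1 + v2) / 2)
    return points

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(candidate_breakpoints(temps, play))   # far fewer candidates than n - 1
```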

  • Binary vs. multi-way splits
    Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
      Nominal attribute is tested (at most) once on any path in the tree
    Not so for binary splits on numeric attributes!
      Numeric attribute may be tested several times along a path in the tree
    Disadvantage: tree is hard to read
    Remedy:
      pre-discretize numeric attributes, or
      use multi-way splits instead of binary ones

  • Missing as a separate value
    Missing value denoted ? in C4.x
    Simple idea: treat missing as a separate value
    Q: When is this not appropriate?
    A: When values are missing for different reasons
      Example 1: gene expression could be missing when it is very high or very low
      Example 2: field IsPregnant = missing for a male patient should be treated differently (no)
      than for a female patient of age 25 (unknown)

  • Missing values - advanced
    Split instances with missing values into pieces
      A piece going down a branch receives a weight proportional to the popularity of the branch
      Weights sum to 1
    Info gain works with fractional instances
      Use sums of weights instead of counts
    During classification, split the instance into pieces in the same way
      Merge probability distributions using weights
    (see the sketch below)
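    A sketch of the weighting step (illustrative function names; the branch popularities are those of outlook in the 14-instance weather data):

```python
def distribute(instance_weight, branch_counts):
    """Split an instance with a missing test value across all branches,
    with weights proportional to how many known-value instances took each branch."""
    total = sum(branch_counts.values())
    return {branch: instance_weight * count / total
            for branch, count in branch_counts.items()}

# outlook is missing; in the training data 5 instances are sunny, 4 overcast, 5 rainy
weights = distribute(1.0, {"sunny": 5, "overcast": 4, "rainy": 5})
print(weights)   # {'sunny': 0.357..., 'overcast': 0.285..., 'rainy': 0.357...}
```

    At classification time the same weights are used to merge the class distributions returned by the branches.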

  • Pruning
    Goal: prevent overfitting to noise in the data
    Two strategies for pruning the decision tree:
      Postpruning - take a fully-grown decision tree and discard unreliable parts
      Prepruning - stop growing a branch when information becomes unreliable
    Postpruning preferred in practice - prepruning can stop too early

  • Prepruning
    Based on statistical significance test
    Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
    Most popular test: chi-squared test (see the sketch below)
    ID3 used chi-squared test in addition to information gain
      Only statistically significant attributes were allowed to be selected by the information gain procedure
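    A sketch of such a significance check using SciPy's chi-squared test of independence (the contingency table is outlook vs. play from the weather data; the alpha threshold is an illustrative choice, not C4.5's):

```python
from scipy.stats import chi2_contingency

def significant_association(table, alpha=0.05):
    """table: counts of (attribute value x class). Returns True if the
    association between attribute and class is statistically significant."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# Rows: outlook = sunny / overcast / rainy; columns: play = yes / no
outlook_vs_play = [[2, 3],
                   [4, 0],
                   [3, 2]]
print(significant_association(outlook_vs_play))
```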

  • Early stopping
    Pre-pruning may stop the growth process prematurely: early stopping
    Classic example: XOR/parity problem (table and sketch below)
      No individual attribute exhibits any significant association with the class
      Structure is only visible in the fully expanded tree
      Pre-pruning won't expand the root node
    But: XOR-type problems are rare in practice
    And: pre-pruning is faster than post-pruning

        a  b  class
    1   0  0  0
    2   0  1  1
    3   1  0  1
    4   1  1  0
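    A sketch showing why pre-pruning stalls here: computed on the table above, neither a nor b alone yields any information gain, while the combined attribute a XOR b determines the class.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr):
        subset = [c for a, c in zip(attr, labels) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

a   = [0, 0, 1, 1]
b   = [0, 1, 0, 1]
cls = [0, 1, 1, 0]                                    # class = a XOR b
print(info_gain(a, cls), info_gain(b, cls))           # 0.0 and 0.0
print(info_gain([x ^ y for x, y in zip(a, b)], cls))  # 1.0 once the pair is combined
```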

  • Post-pruning
    First, build the full tree; then, prune it
      Fully-grown tree shows all attribute interactions
    Problem: some subtrees might be due to chance effects
    Two pruning operations:
      Subtree replacement
      Subtree raising
    Possible strategies:
      error estimation
      significance testing
      MDL principle

  • Subtree replacement, 1
    Bottom-up
    Consider replacing a tree only after considering all its subtrees
    Ex: labor negotiations

  • Subtree replacement, 2
    What subtree can we replace?

  • Subtree replacement, 3
    Bottom-up
    Consider replacing a tree only after considering all its subtrees

  • *Subtree raising
    Delete node
    Redistribute instances
    Slower than subtree replacement (worthwhile?)

  • Estimating error rates
    Prune only if it reduces the estimated error
    Error on the training data is NOT a useful estimator
      Q: Why would it result in very little pruning?
    Use a hold-out set for pruning (reduced-error pruning)
    C4.5's method:
      Derive a confidence interval from the training data
      Use a heuristic limit, derived from this, for pruning
      Standard Bernoulli-process-based method
      Shaky statistical assumptions (based on training data)

  • *Mean and variance
    Mean and variance for a Bernoulli trial: p, p(1 - p)
    Expected success rate f = S/N
    Mean and variance for f: p, p(1 - p)/N
    For large enough N, f follows a Normal distribution
    c% confidence interval [-z <= X <= z] for a random variable with 0 mean is given by:

      Pr[-z <= X <= z] = c

    With a symmetric distribution:

      Pr[-z <= X <= z] = 1 - 2 * Pr[X >= z]

  • *Confidence limits
    Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X >= z]    z
      0.1%          3.09
      0.5%          2.58
      1%            2.33
      5%            1.65
      10%           1.28
      20%           0.84
      25%           0.69
      40%           0.25

    Thus: Pr[-1.65 <= X <= 1.65] = 90%
    To use this we have to reduce our random variable f to have 0 mean and unit variance

  • *Transforming f
    Transformed value for f: (f - p) / sqrt(p(1 - p)/N)
      (i.e. subtract the mean and divide by the standard deviation)
    Resulting equation:

      Pr[-z <= (f - p) / sqrt(p(1 - p)/N) <= z] = c

    Solving for p:

      p = (f + z^2/(2N) +/- z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

  • C4.5's method
    Error estimate for a subtree is the weighted sum of the error estimates for all its leaves
    Error estimate for a node (upper bound):

      e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

    If c = 25% then z = 0.69 (from the normal distribution)
    f is the error on the training data
    N is the number of instances covered by the leaf

  • Example
    Three leaves, covering 6, 2 and 6 instances:
      f = 0.33, e = 0.47
      f = 0.5,  e = 0.72
      f = 0.33, e = 0.47
    Combined using the ratios 6:2:6 gives 0.51
    Parent node: f = 5/14, e = 0.46
    e = 0.46 < 0.51, so prune!
    (see the worked check below)
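    A sketch of the error-estimate formula above, checked against this example (z = 0.69 for the default 25% confidence; small rounding differences from the slide's figures are possible):

```python
from math import sqrt

def error_estimate(f, N, z=0.69):
    """Pessimistic upper-bound error estimate for a node covering N instances
    with training error rate f (the formula from the previous slide)."""
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]        # (training error f, instances N)
leaf_estimates = [error_estimate(f, N) for f, N in leaves]
combined = sum(e * N for e, (_, N) in zip(leaf_estimates, leaves)) / 14
parent = error_estimate(5 / 14, 14)
print([round(e, 2) for e in leaf_estimates])   # [0.47, 0.72, 0.47]
print(round(combined, 2), round(parent, 2))    # ~0.51 vs ~0.45: parent is lower, so prune
```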

  • *Complexity of tree induction
    Assume: m attributes, n training instances, tree depth O(log n)
    Building a tree: O(m n log n)
    Subtree replacement: O(n)
    Subtree raising: O(n (log n)^2)
      Every instance may have to be redistributed at every node between its leaf and the root
      Cost for redistribution (on average): O(log n)
    Total cost: O(m n log n) + O(n (log n)^2)

  • From trees to rules - how?

    How can we produce a set of rules from a decision tree?

  • From trees to rules - simple
    Simple way: one rule for each leaf (see the sketch below)
    C4.5rules: greedily prune conditions from each rule if this reduces its estimated error
      Can produce duplicate rules; check for this at the end
    Then
      look at each class in turn
      consider the rules for that class
      find a good subset (guided by MDL)
    Then rank the subsets to avoid conflicts
    Finally, remove rules (greedily) if this decreases error on the training data
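    A minimal sketch of the one-rule-per-leaf step (the nested-tuple tree representation is an assumption for illustration, not C4.5's data structure):

```python
def tree_to_rules(node, conditions=()):
    """node is ('leaf', class_label) or ('split', attribute, {outcome: subtree}).
    Returns a list of (conditions, class_label) pairs, one per leaf."""
    if node[0] == "leaf":
        return [(list(conditions), node[1])]
    _, attribute, branches = node
    rules = []
    for outcome, child in branches.items():
        rules += tree_to_rules(child, conditions + (f"{attribute} = {outcome}",))
    return rules

# A fragment of the weather tree: outlook, then humidity under sunny
tree = ("split", "outlook", {
    "sunny": ("split", "humidity", {"high": ("leaf", "no"), "normal": ("leaf", "yes")}),
    "overcast": ("leaf", "yes"),
})
for conds, label in tree_to_rules(tree):
    print("IF", " AND ".join(conds), "THEN play =", label)
```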

  • C4.5rules: choices and options
    C4.5rules is slow for large and noisy datasets
    Commercial version C5.0rules uses a different technique
      Much faster and a bit more accurate
    C4.5 has two parameters:
      Confidence value (default 25%): lower values incur heavier pruning
      Minimum number of instances in the two most popular branches (default 2)

  • *Classification rules
    Common procedure: separate-and-conquer
    Differences:
      Search method (e.g. greedy, beam search, ...)
      Test selection criteria (e.g. accuracy, ...)
      Pruning method (e.g. MDL, hold-out set, ...)
      Stopping criterion (e.g. minimum accuracy)
      Post-processing step
    Also: decision list vs. one rule set for each class

  • *Test selection criteria
    Basic covering algorithm:
      keep adding conditions to a rule to improve its accuracy
      add the condition that improves accuracy the most
    Measure 1: p/t
      t - total instances covered by the rule; p - number of these that are positive
      Produces rules that don't cover negative instances, as quickly as possible
      May produce rules with very small coverage - special cases or noise?
    Measure 2: information gain p * (log(p/t) - log(P/T))
      P and T are the positive and total counts before the new condition was added
      Information gain emphasizes positive rather than negative instances
    These interact with the pruning mechanism used
    (see the sketch below)
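    A sketch of the two measures on a made-up comparison (counts are illustrative): with P = 6 positives among T = 12 covered instances before the new condition, an aggressive condition narrowing coverage to t = 2, p = 2 wins on accuracy, while a broader one with t = 7, p = 5 wins on information gain.

```python
from math import log2

def accuracy_measure(p, t):
    """Measure 1: fraction of covered instances that are positive."""
    return p / t

def info_gain_measure(p, t, P, T):
    """Measure 2: p * (log(p/t) - log(P/T)); P, T are counts before the new condition."""
    return p * (log2(p / t) - log2(P / T))

P, T = 6, 12
print(accuracy_measure(2, 2), accuracy_measure(5, 7))                # 1.0 vs 0.714...
print(info_gain_measure(2, 2, P, T), info_gain_measure(5, 7, P, T))  # 2.0 vs 2.57...
```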

  • *Missing values, numeric attributes
    Common treatment of missing values: for any test, they fail
    The algorithm must either
      use other tests to separate out positive instances, or
      leave them uncovered until later in the process
    In some cases it's better to treat missing as a separate value
    Numeric attributes are treated just as they are in decision trees

  • *Pruning rules
    Two main strategies:
      Incremental pruning
      Global pruning
    Other difference: pruning criterion
      Error on a hold-out set (reduced-error pruning)
      Statistical significance
      MDL principle
    Also: post-pruning vs. pre-pruning

  • Summary
    Decision trees:
      splits - binary, multi-way
      split criteria - entropy, gini, ...
      missing value treatment
      pruning
      rule extraction from trees
    Both C4.5 and CART are robust tools
    No method is always superior - experiment!
    (witten & eibe)