Decision Tree C4.5

  • Machine Learning in the Real World: C4.5

  • Outline
    Handling Numeric Attributes: Finding Best Split(s)
    Dealing with Missing Values
    Pruning: Pre-pruning, Post-pruning, Error Estimates
    From Trees to Rules

  • Industrial-strength algorithms
    For an algorithm to be useful in a wide range of real-world applications it must:
      Permit numeric attributes
      Allow missing values
      Be robust in the presence of noise
      Be able to approximate arbitrary concept descriptions (at least in principle)
    Basic schemes need to be extended to fulfill these requirements

  • C4.5 History
    ID3, CHAID - 1960s
    C4.5 innovations (Quinlan):
      permit numeric attributes
      deal sensibly with missing values
      pruning to deal with noisy data
    C4.5 - one of the best-known and most widely-used learning algorithms
    Last research version: C4.8, implemented in Weka as J4.8 (Java)
    Commercial successor: C5.0 (available from Rulequest)

  • Numeric attributes
    Standard method: binary splits (e.g. temp < 45)
    Unlike nominal attributes, every numeric attribute has many possible split points
    Solution is a straightforward extension:
      Evaluate info gain (or other measure) for every possible split point of the attribute
      Choose the best split point
      Info gain for the best split point is the info gain for the attribute
    Computationally more demanding (see the sketch below)
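    A minimal sketch of this split-point search in Python (illustrative names such as best_numeric_split; not C4.5's actual code):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_numeric_split(values, labels):
    """Try a threshold halfway between each pair of adjacent distinct values
    and return (best_threshold, best_info_gain) for the test value < threshold."""
    pairs = sorted(zip(values, labels))           # sort once per attribute
    n = len(pairs)
    base = entropy(labels)
    best_t, best_gain = None, -1.0
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                              # no threshold between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2   # split point halfway between values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        gain = base - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if gain > best_gain:
            best_t, best_gain = t, gain
    return best_t, best_gain
```

    The info gain of the best threshold is then reported as the info gain of the attribute, so numeric and nominal attributes compete on the same scale.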

  • Weather data - nominal values

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     Hot          High      False  No
    Sunny     Hot          High      True   No
    Overcast  Hot          High      False  Yes
    Rainy     Mild         Normal    False  Yes

    If outlook = sunny and humidity = high then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity = normal then play = yes
    If none of the above then play = yes

  • Weather data - numeric

    Outlook   Temperature  Humidity  Windy  Play
    Sunny     85           85        False  No
    Sunny     80           90        True   No
    Overcast  83           86        False  Yes
    Rainy     75           80        False  Yes

    If outlook = sunny and humidity > 83 then play = no
    If outlook = rainy and windy = true then play = no
    If outlook = overcast then play = yes
    If humidity < 85 then play = yes
    If none of the above then play = yes

  • Example: split on the temperature attribute

    E.g. temperature < 71.5: yes/4, no/2
         temperature >= 71.5: yes/5, no/3
    Info([4,2],[5,3]) = 6/14 info([4,2]) + 8/14 info([5,3]) = 0.939 bits
    Place split points halfway between values
    Can evaluate all split points in one pass! (see the worked check below)

    Value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
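    A worked check (a sketch, not C4.5 itself) of the 0.939-bit figure for the 71.5 split, using the class counts above:

```python
from math import log2

def info(counts):
    """Entropy, in bits, of a list of class counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c)

# temperature < 71.5: 4 yes / 2 no; temperature >= 71.5: 5 yes / 3 no
split_info = (6 / 14) * info([4, 2]) + (8 / 14) * info([5, 3])
print(round(split_info, 3))   # -> 0.939 bits, as on the slide
```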

  • Avoid repeated sorting!
    Sort instances by the values of the numeric attribute
    Time complexity for sorting: O(n log n)
    Q: Does this have to be repeated at each node of the tree?
    A: No! Sort order for children can be derived from sort order for parent
    Time complexity of derivation: O(n) (see the sketch below)
    Drawback: need to create and store an array of sorted indices for each numeric attribute
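    A sketch of the O(n) derivation (the function and data layout here are illustrative, not Weka's internals): one stable pass over the parent's sorted index array keeps each child's array sorted.

```python
def split_sorted_indices(sorted_idx, goes_left):
    """sorted_idx: instance indices sorted by a numeric attribute at the parent node.
    goes_left: maps instance index -> True if the chosen test sends it to the left child.
    A single stable pass preserves the sort order within each child: O(n)."""
    left, right = [], []
    for i in sorted_idx:
        (left if goes_left[i] else right).append(i)
    return left, right

# Example: parent's sort order by temperature; instances 0, 2 and 5 go left
parent_order = [3, 0, 5, 2, 4, 1]
mask = {0: True, 1: False, 2: True, 3: False, 4: False, 5: True}
print(split_sorted_indices(parent_order, mask))   # ([0, 5, 2], [3, 4, 1])
```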

  • More speeding up
    Entropy only needs to be evaluated between points of different classes (Fayyad & Irani, 1992)
    Potential optimal breakpoints: breakpoints between values of the same class cannot be optimal
    (see the sketch after the sorted values below)

    Value: 64  65  68  69  70  71  72  72  75  75  80  81  83  85
    Class: Yes No  Yes Yes Yes No  No  Yes Yes Yes No  Yes Yes No
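    A sketch approximating the Fayyad & Irani idea (illustrative names): group the sorted values, then keep a candidate breakpoint only where the neighbouring value groups do not share one single common class.

```python
from itertools import groupby

def candidate_breakpoints(values, labels):
    """values assumed sorted; returns midpoints that could still be optimal splits."""
    groups = [(v, {lab for _, lab in grp})
              for v, grp in groupby(zip(values, labels), key=lambda p: p[0])]
    points = []
    for (v1, cls1), (v2, cls2) in zip(groups, groups[1:]):
        # a breakpoint between two pure runs of the same class cannot be optimal
        if len(cls1) > 1 or len(cls2) > 1 or cls1 != cls2:
            points.append((v1 + v2) / 2)
    return points

temps = [64, 65, 68, 69, 70, 71, 72, 72, 75, 75, 80, 81, 83, 85]
play  = ["Yes", "No", "Yes", "Yes", "Yes", "No", "No", "Yes",
         "Yes", "Yes", "No", "Yes", "Yes", "No"]
print(candidate_breakpoints(temps, play))   # far fewer candidates than n - 1
```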

  • Binary vs. multi-way splits
    Splitting (multi-way) on a nominal attribute exhausts all information in that attribute
      Nominal attribute is tested (at most) once on any path in the tree
    Not so for binary splits on numeric attributes!
      Numeric attribute may be tested several times along a path in the tree
    Disadvantage: tree is hard to read
    Remedy:
      pre-discretize numeric attributes, or
      use multi-way splits instead of binary ones

  • Missing as a separate value
    Missing value denoted ? in C4.x
    Simple idea: treat missing as a separate value
    Q: When is this not appropriate?
    A: When values are missing for different reasons
      Example 1: gene expression could be missing when it is very high or very low
      Example 2: field IsPregnant = missing for a male patient should be treated differently (no)
      than for a female patient of age 25 (unknown)

  • Missing values - advanced
    Split instances with missing values into pieces
      A piece going down a branch receives a weight proportional to the popularity of the branch
      Weights sum to 1
    Info gain works with fractional instances
      Use sums of weights instead of counts
    During classification, split the instance into pieces in the same way
      Merge probability distributions using weights
    (see the sketch below)
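    A sketch of the weighting step (illustrative function names; the branch popularities are those of outlook in the 14-instance weather data):

```python
def distribute(instance_weight, branch_counts):
    """Split an instance with a missing test value across all branches,
    with weights proportional to how many known-value instances took each branch."""
    total = sum(branch_counts.values())
    return {branch: instance_weight * count / total
            for branch, count in branch_counts.items()}

# outlook is missing; in the training data 5 instances are sunny, 4 overcast, 5 rainy
weights = distribute(1.0, {"sunny": 5, "overcast": 4, "rainy": 5})
print(weights)   # {'sunny': 0.357..., 'overcast': 0.285..., 'rainy': 0.357...}
```

    At classification time the same weights are used to merge the class distributions returned by the branches.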

  • Pruning
    Goal: prevent overfitting to noise in the data
    Two strategies for pruning the decision tree:
      Postpruning - take a fully-grown decision tree and discard unreliable parts
      Prepruning - stop growing a branch when information becomes unreliable
    Postpruning preferred in practice - prepruning can stop too early

  • Prepruning
    Based on statistical significance test
    Stop growing the tree when there is no statistically significant association between any attribute and the class at a particular node
    Most popular test: chi-squared test (see the sketch below)
    ID3 used chi-squared test in addition to information gain
      Only statistically significant attributes were allowed to be selected by the information gain procedure
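    A sketch of such a significance check using SciPy's chi-squared test of independence (the contingency table is outlook vs. play from the weather data; the alpha threshold is an illustrative choice, not C4.5's):

```python
from scipy.stats import chi2_contingency

def significant_association(table, alpha=0.05):
    """table: counts of (attribute value x class). Returns True if the
    association between attribute and class is statistically significant."""
    chi2, p_value, dof, expected = chi2_contingency(table)
    return p_value < alpha

# Rows: outlook = sunny / overcast / rainy; columns: play = yes / no
outlook_vs_play = [[2, 3],
                   [4, 0],
                   [3, 2]]
print(significant_association(outlook_vs_play))
```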

  • Early stopping
    Pre-pruning may stop the growth process prematurely: early stopping
    Classic example: XOR/parity problem (table and sketch below)
      No individual attribute exhibits any significant association with the class
      Structure is only visible in the fully expanded tree
      Pre-pruning won't expand the root node
    But: XOR-type problems are rare in practice
    And: pre-pruning is faster than post-pruning

        a  b  class
    1   0  0  0
    2   0  1  1
    3   1  0  1
    4   1  1  0
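    A sketch showing why pre-pruning stalls here: computed on the table above, neither a nor b alone yields any information gain, while the combined attribute a XOR b determines the class.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    n = len(labels)
    gain = entropy(labels)
    for v in set(attr):
        subset = [c for a, c in zip(attr, labels) if a == v]
        gain -= (len(subset) / n) * entropy(subset)
    return gain

a   = [0, 0, 1, 1]
b   = [0, 1, 0, 1]
cls = [0, 1, 1, 0]                                    # class = a XOR b
print(info_gain(a, cls), info_gain(b, cls))           # 0.0 and 0.0
print(info_gain([x ^ y for x, y in zip(a, b)], cls))  # 1.0 once the pair is combined
```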

  • Post-pruning
    First, build the full tree; then, prune it
      Fully-grown tree shows all attribute interactions
    Problem: some subtrees might be due to chance effects
    Two pruning operations:
      Subtree replacement
      Subtree raising
    Possible strategies:
      error estimation
      significance testing
      MDL principle

  • Subtree replacement, 1
    Bottom-up
    Consider replacing a tree only after considering all its subtrees
    Ex: labor negotiations

  • Subtree replacement, 2
    What subtree can we replace?

  • Subtree replacement, 3
    Bottom-up
    Consider replacing a tree only after considering all its subtrees

  • *Subtree raising
    Delete node
    Redistribute instances
    Slower than subtree replacement (worthwhile?)

  • Estimating error rates
    Prune only if it reduces the estimated error
    Error on the training data is NOT a useful estimator
      Q: Why would it result in very little pruning?
    Use a hold-out set for pruning (reduced-error pruning)
    C4.5's method:
      Derive a confidence interval from the training data
      Use a heuristic limit, derived from this, for pruning
      Standard Bernoulli-process-based method
      Shaky statistical assumptions (based on training data)

  • *Mean and variance
    Mean and variance for a Bernoulli trial: p, p(1 - p)
    Expected success rate f = S/N
    Mean and variance for f: p, p(1 - p)/N
    For large enough N, f follows a Normal distribution
    c% confidence interval [-z <= X <= z] for a random variable with 0 mean is given by:

      Pr[-z <= X <= z] = c

    With a symmetric distribution:

      Pr[-z <= X <= z] = 1 - 2 * Pr[X >= z]

  • *Confidence limits
    Confidence limits for the normal distribution with 0 mean and a variance of 1:

      Pr[X >= z]    z
      0.1%          3.09
      0.5%          2.58
      1%            2.33
      5%            1.65
      10%           1.28
      20%           0.84
      25%           0.69
      40%           0.25

    Thus: Pr[-1.65 <= X <= 1.65] = 90%
    To use this we have to reduce our random variable f to have 0 mean and unit variance

  • *Transforming f
    Transformed value for f: (f - p) / sqrt(p(1 - p)/N)
      (i.e. subtract the mean and divide by the standard deviation)
    Resulting equation:

      Pr[-z <= (f - p) / sqrt(p(1 - p)/N) <= z] = c

    Solving for p:

      p = (f + z^2/(2N) +/- z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

  • C4.5's method
    Error estimate for a subtree is the weighted sum of the error estimates for all its leaves
    Error estimate for a node (upper bound):

      e = (f + z^2/(2N) + z * sqrt(f/N - f^2/N + z^2/(4N^2))) / (1 + z^2/N)

    If c = 25% then z = 0.69 (from the normal distribution)
    f is the error on the training data
    N is the number of instances covered by the leaf

  • Example
    Three leaves, covering 6, 2 and 6 instances:
      f = 0.33, e = 0.47
      f = 0.5,  e = 0.72
      f = 0.33, e = 0.47
    Combined using the ratios 6:2:6 gives 0.51
    Parent node: f = 5/14, e = 0.46
    e = 0.46 < 0.51, so prune!
    (see the worked check below)
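    A sketch of the error-estimate formula above, checked against this example (z = 0.69 for the default 25% confidence; small rounding differences from the slide's figures are possible):

```python
from math import sqrt

def error_estimate(f, N, z=0.69):
    """Pessimistic upper-bound error estimate for a node covering N instances
    with training error rate f (the formula from the previous slide)."""
    return (f + z * z / (2 * N)
            + z * sqrt(f / N - f * f / N + z * z / (4 * N * N))) / (1 + z * z / N)

leaves = [(2 / 6, 6), (1 / 2, 2), (2 / 6, 6)]        # (training error f, instances N)
leaf_estimates = [error_estimate(f, N) for f, N in leaves]
combined = sum(e * N for e, (_, N) in zip(leaf_estimates, leaves)) / 14
parent = error_estimate(5 / 14, 14)
print([round(e, 2) for e in leaf_estimates])   # [0.47, 0.72, 0.47]
print(round(combined, 2), round(parent, 2))    # ~0.51 vs ~0.45: parent is lower, so prune
```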

  • *Complexity of tree induction
    Assume: m attributes, n training instances, tree depth O(log n)
    Building a tree: O(m n log n)
    Subtree replacement: O(n)
    Subtree raising: O(n (log n)^2)
      Every instance may have to be redistributed at every node between its leaf and the root
      Cost for redistribution (on average): O(log n)
    Total cost: O(m n log n) + O(n (log n)^2)

  • From trees to rules - how?

    How can we produce a set of rules from a decision tree?

  • From trees to rules - simple
    Simple way: one rule for each leaf (see the sketch below)
    C4.5rules: greedily prune conditions from each rule if this reduces its estimated error
      Can produce duplicate rules; check for this at the end
    Then
      look at each class in turn
      consider the rules for that class
      find a good subset (guided by MDL)
    Then rank the subsets to avoid conflicts
    Finally, remove rules (greedily) if this decreases error on the training data
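    A minimal sketch of the one-rule-per-leaf step (the nested-tuple tree representation is an assumption for illustration, not C4.5's data structure):

```python
def tree_to_rules(node, conditions=()):
    """node is ('leaf', class_label) or ('split', attribute, {outcome: subtree}).
    Returns a list of (conditions, class_label) pairs, one per leaf."""
    if node[0] == "leaf":
        return [(list(conditions), node[1])]
    _, attribute, branches = node
    rules = []
    for outcome, child in branches.items():
        rules += tree_to_rules(child, conditions + (f"{attribute} = {outcome}",))
    return rules

# A fragment of the weather tree: outlook, then humidity under sunny
tree = ("split", "outlook", {
    "sunny": ("split", "humidity", {"high": ("leaf", "no"), "normal": ("leaf", "yes")}),
    "overcast": ("leaf", "yes"),
})
for conds, label in tree_to_rules(tree):
    print("IF", " AND ".join(conds), "THEN play =", label)
```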

  • C4.5rules: choices and options
    C4.5rules is slow for large and noisy datasets
    Commercial version C5.0rules uses a different technique
      Much faster and a bit more accurate
    C4.5 has two parameters:
      Confidence value (default 25%): lower values incur heavier pruning
      Minimum number of instances in the two most popular branches (default 2)

  • *Classification rules
    Common procedure: separate-and-conquer
    Differences:
      Search method (e.g. greedy, beam search, ...)
      Test selection criteria (e.g. accuracy, ...)
      Pruning method (e.g. MDL, hold-out set, ...)
      Stopping criterion (e.g. minimum accuracy)
      Post-processing step
    Also: decision list vs. one rule set for each class

  • *Test selection criteria
    Basic covering algorithm:
      keep adding conditions to a rule to improve its accuracy
      add the condition that improves accuracy the most
    Measure 1: p/t
      t - total instances covered by the rule; p - number of these that are positive
      Produces rules that don't cover negative instances, as quickly as possible
      May produce rules with very small coverage - special cases or noise?
    Measure 2: information gain p * (log(p/t) - log(P/T))
      P and T are the positive and total counts before the new condition was added
      Information gain emphasizes positive rather than negative instances
    These interact with the pruning mechanism used
    (see the sketch below)
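    A sketch of the two measures on a made-up comparison (counts are illustrative): with P = 6 positives among T = 12 covered instances before the new condition, an aggressive condition narrowing coverage to t = 2, p = 2 wins on accuracy, while a broader one with t = 7, p = 5 wins on information gain.

```python
from math import log2

def accuracy_measure(p, t):
    """Measure 1: fraction of covered instances that are positive."""
    return p / t

def info_gain_measure(p, t, P, T):
    """Measure 2: p * (log(p/t) - log(P/T)); P, T are counts before the new condition."""
    return p * (log2(p / t) - log2(P / T))

P, T = 6, 12
print(accuracy_measure(2, 2), accuracy_measure(5, 7))                # 1.0 vs 0.714...
print(info_gain_measure(2, 2, P, T), info_gain_measure(5, 7, P, T))  # 2.0 vs 2.57...
```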

  • *Missing values, numeric attributes
    Common treatment of missing values: for any test, they fail
    The algorithm must either
      use other tests to separate out positive instances, or
      leave them uncovered until later in the process
    In some cases it's better to treat missing as a separate value
    Numeric attributes are treated just as they are in decision trees

  • *Pruning rules
    Two main strategies:
      Incremental pruning
      Global pruning
    Other difference: pruning criterion
      Error on a hold-out set (reduced-error pruning)
      Statistical significance
      MDL principle
    Also: post-pruning vs. pre-pruning

  • Summary
    Decision trees:
      splits - binary, multi-way
      split criteria - entropy, gini, ...
      missing value treatment
      pruning
      rule extraction from trees
    Both C4.5 and CART are robust tools
    No method is always superior - experiment!
    (witten & eibe)