DMSC4 Part I: From the Top 10 Algorithms to the New Challenges in Data Mining


Transcript of DMSC4 Part I: From the Top 10 Algorithms to the New Challenges in Data Mining

  • Slide 1/195: Data Mining and Soft Computing

    Francisco Herrera
    Research Group on Soft Computing and Intelligent Information Systems (SCI2S)
    Dept. of Computer Science and A.I., University of Granada, Spain
    Email: [email protected]
    http://sci2s.ugr.es
    http://decsai.ugr.es/~herrera

  • Slide 2/195: Data Mining and Soft Computing - Summary

    1. Introduction to Data Mining and Knowledge Discovery
    2. Data Preparation
    4. Data Mining - From the Top 10 Algorithms to the New Challenges
    5. Introduction to Soft Computing. Focusing our attention on Fuzzy Logic
    6. Soft Computing Techniques in Data Mining: Fuzzy Data Mining and Knowledge Extraction based on Evolutionary Learning
    7. Genetic Fuzzy Systems: State of the Art and New Trends
    8. Some Advanced Topics I: Classification with Imbalanced Data Sets
    9. Some Advanced Topics II: Subgroup Discovery
    10. Some Advanced Topics III: Data Complexity
    11. Final talk: How must I Do my Experimental Study? Design of Experiments in Data Mining / Computational Intelligence. Using Non-parametric Tests. Some Cases of Study.

  • Slide 3/195: Slides used for preparing this talk

    Top 10 Algorithms in Data Mining Research
    10 Challenging Problems in Data Mining Research, prepared for ICDM 2005
    CS490D: Introduction to Data Mining, Prof. Chris Clifton
    Association Analysis: Basic Concepts and Algorithms, Lecture Notes for Chapter 6, Introduction to Data Mining, by Tan, Steinbach, Kumar
    DATA MINING: Introductory and Advanced Topics, by Margaret H. Dunham

  • Slide 4/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 5/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 6/195: Challenges in Data Mining

    Discussion Panels at ICDM 2005 and 2006

  • Slide 7/195: Challenges in Data Mining

    Discussion Panels at ICDM 2005 and 2006
    Top 10 Algorithms in Data Mining Research, prepared for ICDM 2006
    10 Challenging Problems in Data Mining Research, prepared for ICDM 2005

  • Slide 8/195: Top 10 Algorithms in Data Mining Research

    Prepared for ICDM 2006: http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
    Xindong Wu, University of Vermont, http://www.cs.uvm.edu/~xwu/home.html
    Vipin Kumar, University of Minnesota, http://www-users.cs.umn.edu/~kumar/

  • Slide 9/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 10/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 11/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 12/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 13/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 14/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Classification

  • Slide 15/195: Classification Using Decision Trees

    Partitioning based: divide the search space into regions. A tuple is placed into a class based on the region within which it falls.
    DT Induction: internal nodes are associated with attributes, and arcs with values of those attributes.
    Algorithms: ID3, C4.5, CART

  • Slide 16/195: Decision Tree

    Given:
    - D = {t1, ..., tn} where ti = <ti1, ..., tih>
    - Database schema contains {A1, A2, ..., Ah}
    - Classes C = {C1, ..., Cm}
    A Decision or Classification Tree is a tree associated with D such that:
    - Each internal node is labeled with an attribute, Ai
    - Each arc is labeled with a predicate which can be applied to the attribute at its parent
    - Each leaf node is labeled with a class, Cj

  • Slide 17/195: Training Database

    (Table of training tuples with attributes age, income, student, credit_rating and class label buys_computer.)

  • Slide 18/195: Output: A Decision Tree for buys_computer

    The root splits on age: the "<=30" branch splits on student (no -> no, yes -> yes); the "31..40" branch predicts yes; the ">40" branch splits on credit_rating (excellent -> no, fair -> yes).

  • Slide 19/195: Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm):
    - Tree is constructed in a top-down recursive divide-and-conquer manner
    - At start, all the training examples are at the root
    - Attributes are categorical (if continuous-valued, they are discretized in advance)
    - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
    Conditions for stopping partitioning:
    - All samples for a given node belong to the same class
    - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
    - There are no samples left

  • Slide 20/195: DT Induction

  • Slide 21/195: DT Splits Area

    Splits on Gender (M / F) and on Height (figure).

  • Slide 22/195: Comparing DTs

    (Figure comparing DTs: Balanced ...)

  • Slide 23/195: DT Issues

    - Choosing splitting attributes
    - Ordering of splitting attributes
    - Tree structure
    - Stopping criteria
    - Training data
    - Pruning

  • Slide 24/195: Decision Tree Induction is often based on Information Theory. So ...

  • Slide 25/195: Information

  • Slide 26/195: Information / Entropy

    Given probabilities p1, p2, ..., ps whose sum is 1, Entropy is defined as H(p1, p2, ..., ps) = sum_i pi log(1/pi).
    Entropy measures the amount of randomness, surprise or uncertainty.
    Goal in classification: no surprise, entropy = 0.

  • Slide 27/195: Attribute Selection Measure: Information Gain (ID3/C4.5)

    Select the attribute with the highest information gain.
    Assume S contains si tuples of class Ci for i = 1, ..., m.
    Information measure (info required to classify any arbitrary tuple):
    I(s1, ..., sm) = - sum_{i=1..m} (si/s) log2(si/s)
    Entropy of attribute A with values {a1, a2, ..., av}:
    E(A) = sum_{j=1..v} ((s1j + ... + smj)/s) * I(s1j, ..., smj)
    Information gained by branching on attribute A:
    Gain(A) = I(s1, ..., sm) - E(A)

  • Slide 28/195: Attribute Selection by Information Gain Computation

    Class P: buys_computer = "yes"; Class N: buys_computer = "no".
    I(p, n) = I(9, 5) = 0.940
    Compute the entropy for age: the 14 samples split into "<=30" with 2 yes / 3 no (I(2,3) = 0.971), "31..40" with 4 yes / 0 no, and ">40" with 3 yes / 2 no, so
    E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
    Gain(age) = I(p, n) - E(age) = 0.246
    Similarly, Gain(student) = 0.151.
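A minimal Python sketch (not part of the original slides) of the information-gain computation above. The class counts (9 yes / 5 no overall, and the per-age-group counts) are the ones shown on this slide; the function names are illustrative.

```python
from math import log2

def info(*counts):
    """I(s1, ..., sm) = -sum (si/s) log2(si/s), ignoring zero counts."""
    s = sum(counts)
    return -sum((c / s) * log2(c / s) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A), with E(A) the weighted info of the partition induced by A."""
    s = sum(class_counts)
    e_a = sum(sum(part) / s * info(*part) for part in partitions)
    return info(*class_counts) - e_a

# age partitions the 14 tuples into <=30: (2 yes, 3 no), 31..40: (4, 0), >40: (3, 2)
print(round(info(9, 5), 3))                               # 0.94
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # 0.247; the slide's 0.246 rounds E(age) to 0.694 first
```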

  • Slide 29/195: Other Attribute Selection Measures

    Gini index (CART, IBM IntelligentMiner):
    - All attributes are assumed continuous-valued
    - Assume there exist several possible split values for each attribute
    - May need other tools, such as clustering, to get the possible split values
    - Can be modified for categorical attributes

  • Slide 30/195: Gini Index (IBM IntelligentMiner)

    If a data set T contains examples from n classes, the gini index gini(T) is defined as
    gini(T) = 1 - sum_{j=1..n} pj^2
    where pj is the relative frequency of class j in T.
    If a data set T is split into two subsets T1 and T2 with sizes N1 and N2 respectively, the gini index of the split data is
    gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)
    The attribute providing the smallest gini_split(T) is chosen to split the node (need to enumerate all possible splitting points for each attribute).
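A small sketch (not from the slides) of the two Gini formulas above; the class counts in the example calls are illustrative.

```python
def gini(counts):
    """gini(T) = 1 - sum_j pj^2, with pj the relative frequency of class j in T."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(part1, part2):
    """gini_split(T) = (N1/N) gini(T1) + (N2/N) gini(T2)."""
    n1, n2 = sum(part1), sum(part2)
    return n1 / (n1 + n2) * gini(part1) + n2 / (n1 + n2) * gini(part2)

# Illustrative class counts: 9 "yes" / 5 "no" overall, split into (6, 1) and (3, 4).
print(round(gini([9, 5]), 3))                 # 0.459
print(round(gini_split([6, 1], [3, 4]), 3))   # lower than the parent node's gini
```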

  • Slide 31/195: Extracting Classification Rules from Trees

    Represent the knowledge in the form of IF-THEN rules.
    Each attribute-value pair along a path forms a conjunction; the leaf node holds the class prediction.
    Example: IF age = "<=30" AND student = "no" THEN buys_computer = "no".

  • Slide 32/195: Avoid Overfitting in Classification

    Overfitting: an induced tree may overfit the training data
    - Too many branches, some may reflect anomalies due to noise or outliers
    - Poor accuracy for unseen samples
    Two approaches to avoid overfitting:
    - Prepruning: halt tree construction early; do not split a node if this would fall below a threshold (difficult to choose an appropriate threshold)
    - Postpruning: remove branches from a "fully grown" tree to get a sequence of progressively pruned trees; use a set of data different from the training data to decide which is the best pruned tree

  • Slide 33/195: Approaches to Determine the Final Tree Size

    Separate training (2/3) and testing (1/3) sets.
    Use cross validation, e.g., 10-fold cross validation.
    Apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node may improve the entire distribution.
    Use the minimum description length (MDL) principle: halt growth of the tree when the encoding is minimized.

  • Slide 34/195: Enhancements to Basic Decision Tree Induction

    Allow for continuous-valued attributes:
    - Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
    Handle missing attribute values:
    - Assign the most common value of the attribute
    - Assign a probability to each of the possible values
    Attribute construction:
    - Create new attributes based on existing ones
    - This reduces fragmentation, repetition, and replication

  • Slide 35/195: Decision Tree vs. Rules

    The tree has an implied order in which splitting is performed; rules have no ordering of predicates.
    The tree is created by looking at all classes; to generate rules, one only needs to look at one class at a time.

  • Slide 36/195: Scalable Decision Tree Induction Methods in Data Mining Studies

    Classification is a classical problem, extensively studied.
    Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed.
    Why decision tree induction in data mining?
    - relatively faster learning speed (than other classification methods)
    - convertible to simple and easy to understand classification rules
    - can use SQL queries for accessing databases
    - comparable classification accuracy with other methods

  • Slide 37/195: Scalable Decision Tree Induction Methods in Data Mining Studies

    SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory.
    SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute list data structure.
    PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier.
    RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree.

  • Slide 38/195: Instance-Based Methods

    Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified.
    Typical approaches:
    - k-nearest neighbor approach: instances represented as points in a Euclidean space
    - Locally weighted regression: constructs a local approximation
    - Case-based reasoning: uses symbolic representations and knowledge-based inference

  • Slide 39/195: Classification Using Distance

    Place items in the class to which they are "closest".
    Must determine the distance between an item and a class.
    Classes represented by
    - Centroid: central value
    - Medoid: representative point
    Algorithm: KNN

  • Slide 40/195: The k-Nearest Neighbor Algorithm

    All instances correspond to points in the n-D space.
    The nearest neighbors are defined in terms of Euclidean distance.
    The target function could be discrete- or real-valued.
    For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to xq.
    (Figure: positive and negative training examples around a query point xq for a typical set of training examples.)

  • Slide 41/195: k Nearest Neighbor (KNN)

    Training set includes classes.
    Examine the K items nearest to the item to be classified.
    The new item is placed in the class with the most number of close items.
    O(q) for each tuple to be classified. (Here q is the size of the training set.)
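A minimal k-nearest-neighbor sketch of the rule just described (majority vote among the K closest training items under Euclidean distance); the tiny training set is illustrative, not from the slides.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k=3):
    """train: list of (point, label) pairs; returns the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda item: dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

train = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((0.9, 1.1), "A"),
         ((3.0, 3.0), "B"), ((3.2, 2.9), "B")]
print(knn_classify(train, (1.1, 1.0), k=3))   # "A"
```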

  • Slide 42/195

  • Slide 43/195

  • Slide 44/195: Bayesian Classification: Why?

    Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems.
    Incremental: each training example can be incorporated incrementally. Prior knowledge can be combined with observed data.
    Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities.
    Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured.

  • Slide 45/195: Bayesian Theorem: Basics

    Let X be a data sample whose class label is unknown.
    Let H be a hypothesis that X belongs to class C.
    For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X.
    P(H): prior probability of hypothesis H (i.e. the initial probability before we observe any data; reflects the background knowledge).
    P(X): probability that the sample data is observed.
    P(X|H): probability of observing the sample X, given that the hypothesis holds.

  • Slide 46/195: Bayesian Theorem

    Given training data X, the posterior probability of a hypothesis H, P(H|X), follows the Bayes theorem:
    P(H|X) = P(X|H) P(H) / P(X)
    Informally, this can be written as posterior = likelihood x prior / evidence.
    MAP (maximum a posteriori) hypothesis: h_MAP = argmax_{h in H} P(h|D) = argmax_{h in H} P(D|h) P(h).
    Practical difficulty: requires initial knowledge of many probabilities, significant computational cost.

  • Slide 47/195: Naive Bayes Classifier

    A simplified assumption: attributes are conditionally independent:
    P(X|Ci) = prod_{k=1..n} P(xk|Ci)
    The probability of, say, two elements x1 and x2 occurring together, given the current class C, is the product of the probabilities of each element taken separately, given the same class: P([y1, y2], C) = P(y1, C) * P(y2, C).
    No dependence relation between attributes.
    Greatly reduces the computation cost: only count the class distribution.
    Once the probability P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci).
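A minimal sketch (not from the slides) of the naive Bayes rule above: pick the class Ci maximizing P(Ci) * prod_k P(xk | Ci), with the conditional probabilities estimated by simple counting (no smoothing). The two-attribute training set is illustrative.

```python
from collections import Counter, defaultdict

def train_nb(rows, labels):
    classes = Counter(labels)                 # class -> count, for P(Ci)
    counts = defaultdict(Counter)             # (class, attribute index) -> value counts, for P(xk | Ci)
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            counts[(c, k)][v] += 1
    return classes, counts

def classify_nb(x, classes, counts):
    n = sum(classes.values())
    best, best_score = None, -1.0
    for c, nc in classes.items():
        score = nc / n                            # P(Ci)
        for k, v in enumerate(x):
            score *= counts[(c, k)][v] / nc       # P(xk | Ci), plain relative frequency
        if score > best_score:
            best, best_score = c, score
    return best

# columns: age group, student?
rows = [("<=30", "no"), ("<=30", "yes"), ("31..40", "yes"), (">40", "no")]
labels = ["no", "yes", "yes", "no"]
classes, counts = train_nb(rows, labels)
print(classify_nb(("<=30", "yes"), classes, counts))   # "yes"
```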

  • Slide 48/195: Training Database

    (Table of training tuples with attributes age, income, student, credit_rating and class label buys_computer.)

  • Slide 49/195: Naive Bayesian Classifier: Example

    Compute P(X|Ci) for each class, e.g. P(age = "<=30" | buys_computer = "yes"), ...

  • Slide 50/195: Naive Bayesian Classifier: Comments

    Advantages:
    - Easy to implement
    - Good results obtained in most of the cases
    Disadvantages:
    - Assumption of class conditional independence, therefore loss of accuracy
    - Practically, dependencies exist among variables (e.g., hospital patients: profile attributes such as age and family history, symptoms, diseases); such dependencies cannot be modeled by the Naive Bayesian Classifier
    How to deal with these dependencies? Bayesian Belief Networks.

  • Slide 51/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Statistical Learning

  • Slide 52/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 53/195: Support Vector Machine

    Classification is essentially finding the best boundary between classes.
    The support vector machine finds the best boundary points, called support vectors, and builds a classifier on top of them.
    Linear and non-linear support vector machines.

  • Slide 54/195: Example of General SVM

    The dots with a shadow around them are support vectors; they are the best points to represent the boundary. The curve is the separating boundary.

  • Slide 55/195: SVM - Support Vector Machines

    Support Vectors (figure).

  • Slide 56/195: Optimal Hyperplane, Separable Case

    In this case, class 1 and class 2 are separable.
    The separating hyperplane w^T x + b0 = 0 is chosen so that the margin between the two classes is maximized.
    The crossed points are the support vectors.

  • Slide 57/195: Analysis of the Separable Case (cont.)

    Linear Support Vector Machine.
    Given a set of points xi in R^n with labels yi in {-1, 1}, the SVM finds a hyperplane defined by the pair (w, b), where w is the normal to the plane and b is the distance from the origin,
    s.t. yi (w . xi + b) >= 1, i = 1, ..., N
    (x: feature vector, b: bias, y: class label, 2/||w||: margin)

  • Slide 58/195: Analysis (cont.)

  • Slide 59/195: Analysis (cont.)

    3. So the problem of finding the optimal hyperplane turns into maximizing the dual objective, subject to the constraints alpha_i >= 0, i = 1, ..., N, and sum_i alpha_i yi = 0.
    4. It is the same as minimizing ||w|| subject to yi (w . xi + b0) >= 1, i = 1, ..., N.

  • Slide 60/195: General SVM

    Some data sets clearly do not have a good optimal linear classifier.
    A non-linear boundary, as shown, will do fine.

  • Slide 61/195: Non-Separable Case

    When the data set is non-separable, as shown in the figure on the right, we assign a weight to each support vector, which will be shown in the constraint.

  • Slide 62/195: Non-Linear SVM

    Classification using the separating hyperplane: x . w + b > 0 ?
    In the non-linear case we can see this as a kernel evaluation K(x, xi).
    Kernel: can be thought of as doing a dot product in some high-dimensional space.

  • Slide 63/195: General SVM (cont.)

    Similar to the linear case, the solution can be written as:
    f(x) = h(x)^T beta + beta_0 = sum_{i=1..N} alpha_i yi <h(x), h(xi)> + beta_0
    But the function h is of very high dimension, sometimes infinite; does this mean SVM is impractical?

  • Slide 64/195: Resulting Surfaces

  • Slide 65/195: Reproducing Kernel (cont.)

    Look at the dual problem: the solution only depends on <h(xi), h(xi')>.
    We therefore need to look only at the kernel representation:
    K(x, x') = <h(x), h(x')>
    which lies in a much smaller dimensional space than h.

  • Slide 66/195: Restrictions and Typical Kernels

    A kernel representation does not exist all the time; Mercer's condition (Courant and Hilbert, 1953) tells us the condition for this kind of existence.
    There is a set of kernels proven to be effective, such as polynomial kernels and radial basis kernels.

  • Slide 67/195: Example of Polynomial Kernel

    r-degree polynomial: K(x, x') = (1 + <x, x'>)^r.
    For a feature space with two inputs x1, x2 and a polynomial kernel of degree 2:
    K(x, x') = (1 + <x, x'>)^2
    Let h1(x) = 1, h2(x) = sqrt(2) x1, h3(x) = sqrt(2) x2, h4(x) = x1^2, h5(x) = x2^2 and h6(x) = sqrt(2) x1 x2; then K(x, x') = <h(x), h(x')>.
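A tiny numeric check (not from the slides) that the degree-2 polynomial kernel above equals the dot product of the explicit feature map h(x); the test points are arbitrary.

```python
from math import sqrt, isclose

def poly_kernel(x, z):
    """K(x, z) = (1 + <x, z>)^2 for two 2-dimensional inputs."""
    return (1 + x[0] * z[0] + x[1] * z[1]) ** 2

def h(x):
    """Explicit degree-2 feature map from the slide."""
    x1, x2 = x
    return (1, sqrt(2) * x1, sqrt(2) * x2, x1 ** 2, x2 ** 2, sqrt(2) * x1 * x2)

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

x, z = (0.5, -1.0), (2.0, 3.0)
print(isclose(poly_kernel(x, z), dot(h(x), h(z))))   # True
```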

  • Slide 68/195: Open Problems of SVM

  • Slide 69/195: Open Problems of SVM

    How do we choose the kernel function for a specific set of problems? Different kernels will have different results, although generally the results are better than using hyperplanes.
    Comparisons with the Bayesian risk for a classification problem (minimum Bayesian risk): how closely can SVM achieve that risk?

  • Slide 70/195: Open Problems of SVM

    For very large training sets, the set of support vectors might be of large size; speed thus becomes a bottleneck.
    An optimal design for multi-class SVM classifiers.

  • Slide 71/195: Related Links

    http://svm.dcs.rhbnc.ac.uk/
    C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 1998.
    SVMlight software (in C): http://ais.gmd.de/~thorsten/svm_light
    BOOK: An Introduction to Support Vector Machines, N. Cristianini and J. Shawe-Taylor, Cambridge University Press.

  • Slide 72/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 73/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Bagging and Boosting

  • Slide 74/195: Combining Classifiers

    Examples: classification trees and neural networks, etc.
    Why? Better classification performance than individual classifiers; more resilience to noise.
    Why not? Time consuming; overfitting.

  • Slide 75/195: Bagging and Boosting

  • Slide 76/195: Bagging and Boosting

    Bagging = manipulation of the data set.
    Boosting = manipulation of the model.

  • Slide 77/195: Bagging and Boosting - General Idea

    Training data -> Classification method (CM) -> Classifier C1
    Altered training data -> CM -> Classifier C2
    ...
    Aggregation -> Classifier C*

  • Slide 78/195: Bagging

    Breiman, 1996.
    Derived from bootstrap (Efron).
    Create classifiers using training sets that are bootstrapped (drawn with replacement).
    Average results for each case.

  • Slide 79/195: Bagging

    Given a set S of s samples:
    - Generate a bootstrap sample T from S. Cases in S may not appear in T or may appear more than once.
    - Repeat this sampling procedure, getting a sequence of k independent training sets.
    - A corresponding sequence of classifiers C1, C2, ..., Ck is constructed for each of these training sets, by using the same classification algorithm.
    - To classify an unknown sample X, let each classifier predict or vote.
    - The bagged classifier C* counts the votes and assigns X to the class with the most votes.
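A minimal bagging sketch of the procedure above: k bootstrap samples of S, one classifier per sample, majority vote. The base learner here is a 1-nearest-neighbor rule on 1-D points, chosen purely for illustration (it is not prescribed by the slides).

```python
import random
from collections import Counter

def nn_classifier(sample):
    """Base learner: predict the label of the nearest training point (1-D)."""
    def predict(x):
        _, label = min(sample, key=lambda item: abs(item[0] - x))
        return label
    return predict

def bagged_classifier(S, k=10, seed=0):
    rng = random.Random(seed)
    classifiers = [nn_classifier([rng.choice(S) for _ in range(len(S))]) for _ in range(k)]
    def predict(x):                      # C*: majority vote over the k bootstrap classifiers
        votes = Counter(c(x) for c in classifiers)
        return votes.most_common(1)[0][0]
    return predict

S = [(1, "A"), (2, "A"), (3, "A"), (8, "B"), (9, "B"), (10, "B")]
C_star = bagged_classifier(S)
print(C_star(2.5), C_star(8.5))   # A B
```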

  • Slide 80/195: Bagging Example (Opitz, 1999)

    Training set 3: 3 6 2 7 5 6 2 2
    Training set 4: 4 5 1 4 6 4 3 8

  • Slide 81/195: Boosting

    A family of methods.
    Sequential production of classifiers.
    Each classifier is dependent on the previous one, and focuses on the previous one's errors.
    Examples that are incorrectly predicted by previous classifiers are chosen more often or weighted more heavily.

  • Slide 82/195: Boosting

  • Slide 83/195: Boosting - The Idea

  • Slide 84/195: Ada-Boosting

    Freund and Schapire, 1996.
    Two approaches:
    - Select examples according to the error of the previous classifier (more representatives of misclassified cases are selected); more common
    - Weigh errors of the misclassified cases higher (all cases are incorporated, but weights are different)

  • Slide 85/195: Ada-Boosting

    Define epsilon_k as the sum of the probabilities for the misclassified instances of the current classifier Ck.
    Multiply the probability of selecting misclassified cases by beta_k = (1 - epsilon_k) / epsilon_k.
    Renormalize the probabilities (i.e., rescale so that they sum to 1).
    Combine classifiers C1 ... Ck using weighted voting.
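A minimal sketch of the weight update described on this slide; the 8-case uniform starting distribution and the set of misclassified cases are illustrative.

```python
def adaboost_reweight(probs, misclassified):
    """probs: current selection probabilities; misclassified: indices misclassified by Ck."""
    eps = sum(probs[i] for i in misclassified)            # epsilon_k
    beta = (1 - eps) / eps                                # beta_k
    new = [p * beta if i in misclassified else p for i, p in enumerate(probs)]
    total = sum(new)
    return [p / total for p in new], beta                 # renormalized distribution

probs = [1 / 8] * 8                                       # uniform start over 8 training cases
probs, beta = adaboost_reweight(probs, misclassified={1, 4})
print(beta)                          # 3.0  (epsilon_k = 0.25)
print([round(p, 3) for p in probs])  # misclassified cases now 0.25 each, the rest ~0.083
```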

  • Slide 86/195: Boosting Example (Opitz, 1999)

    Training set 3: 7 1 5 8 1 8 1 4
    Training set 4: 1 1 6 1 1 3 1 5

  • Slide 87/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 88/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Clustering

  • Slide 89/195: Clustering Problem

    Given a database D = {t1, t2, ..., tn} of tuples and an integer value k, the clustering problem is to define a mapping f: D -> {1, ..., k} where each ti is assigned to one cluster Kj, 1 <= j <= k.

  • Slide 90/195: Clustering Example

  • Slide 91/195: Clustering Example

  • Slide 92/195: Clustering Levels

    Size based.

  • Slide 93/195: Clustering Issues

  • Slide 94/195: Clustering Issues

    - Outlier handling
    - Dynamic data
    - Evaluating results
    - Number of clusters
    - Data to be used
    - Scalability

  • Slide 95/195: Impact of Outliers on Clustering

  • Slide 96/195: Types of Clustering

    Hierarchical: nested set of clusters created.
    Partitional: one set of clusters created.
    Incremental: each element handled one at a time.
    Simultaneous: all elements handled together.
    Overlapping / non-overlapping.

  • Slide 97/195: Clustering Approaches

    Clustering: Hierarchical (Agglomerative, Divisive), Partitional, Categorical, Large DB (Sampling, Compression).

  • Slide 98/195: Cluster Parameters

  • Slide 99/195: Distance Between Clusters

    Single link: smallest distance between points.
    Complete link: largest distance between points.
    Average link: average distance between points.
    Centroid: distance between centroids.

  • Slide 100/195: Hierarchical Clustering

    Clusters are created in levels, actually creating sets of clusters at each level.
    Agglomerative: initially each item is its own cluster; iteratively, clusters are merged together; bottom up.
    Divisive: initially all items are in one cluster; large clusters are successively divided; top down.

  • Slide 101/195: Hierarchical Algorithms

    - Single link
    - MST single link
    - Average link

  • Slide 102/195: Dendrogram

    Dendrogram: a tree data structure which illustrates hierarchical clustering techniques.
    Each level shows clusters for that level (leaf: individual clusters).
    A cluster at level i is the union of its children clusters at level i+1.

  • Slide 103/195: Levels of Clustering

  • Slide 104/195: Partitional Clustering

    Nonhierarchical.
    Creates clusters in one step as opposed to several steps.
    Since only one set of clusters is output, the user normally has to input the desired number of clusters, k.
    Usually deals with static sets.

  • Slide 105/195: Partitional Algorithms

    - MST
    - Squared error
    - K-Means
    - Nearest neighbor
    - PAM
    - BEA

  • Slide 106/195: K-Means

    Initial set of clusters randomly chosen.
    Iteratively, items are moved among sets of clusters until the desired set is reached.
    A high degree of similarity among elements in a cluster is obtained.
    Given a cluster Ki = {ti1, ti2, ..., tim}, the cluster mean is mi = (1/m)(ti1 + ... + tim).

  • Slide 107/195: K-Means Example

    Given {2, 4, 10, 12, 3, 20, 30, 11, 25} and k = 2.
    Randomly assign means: m1 = 3, m2 = 4.
    K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16.
    K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18.
    K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6.
    K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25.
    Stop, as the clusters with these means are the same.
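A minimal 1-D k-means sketch that reproduces the example above: assign each item to the nearest mean, recompute the means as cluster averages, and stop when they no longer change.

```python
def kmeans_1d(items, means, max_iter=100):
    clusters = []
    for _ in range(max_iter):
        clusters = [[] for _ in means]
        for x in items:                                   # assign each item to the closest mean
            j = min(range(len(means)), key=lambda j: abs(x - means[j]))
            clusters[j].append(x)
        new_means = [sum(c) / len(c) for c in clusters]   # recompute cluster means
        if new_means == means:
            break
        means = new_means
    return clusters, means

items = [2, 4, 10, 12, 3, 20, 30, 11, 25]
clusters, means = kmeans_1d(items, means=[3, 4])
print(clusters)   # [[2, 4, 10, 12, 3, 11], [20, 30, 25]]
print(means)      # [7.0, 25.0]
```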

  • Slide 108/195

  • Slide 109/195: Clustering Large Databases

    Most clustering algorithms assume a large data structure which is memory resident.
    Clustering may be performed first on a sample of the database and then applied to the entire database.
    Algorithms: BIRCH, CURE.

  • Slide 110/195: Desired Features for Large Databases

    - One scan (or less) of the DB
    - Online
    - Suspendable, stoppable, resumable
    - Incremental
    - Work with limited main memory
    - Process each tuple once

  • Slide 111/195: BIRCH - Balanced Iterative Reducing and Clustering using Hierarchies

    Incremental, hierarchical, one scan.
    Saves clustering information in a tree.
    Each entry in the tree contains information about one cluster.
    New nodes are inserted in the closest entry in the tree.

  • Slide 112/195: Clustering Feature (CF)

    CF triple: (N, LS, SS)
    - N: number of points in the cluster
    - LS: sum of the points in the cluster
    - SS: sum of the squares of the points in the cluster
    CF tree:
    - A node has a CF triple for each child.
    - A leaf node represents a cluster and has a CF value for each subcluster in it.
    - A subcluster has a maximum diameter.
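A minimal sketch (not from the slides) of how CF triples behave for 1-D points: they are additive, so merging two subclusters is component-wise addition, and centroid/radius statistics can be derived from the triple alone.

```python
from math import sqrt

def cf(points):
    """CF triple (N, LS, SS) for a set of 1-D points."""
    return (len(points), sum(points), sum(x * x for x in points))

def merge(cf1, cf2):
    """Merging two subclusters just adds their CF triples component-wise."""
    return tuple(a + b for a, b in zip(cf1, cf2))

def centroid_and_radius(cf_triple):
    n, ls, ss = cf_triple
    centroid = ls / n
    radius = sqrt(max(ss / n - centroid ** 2, 0.0))   # root mean squared distance to the centroid
    return centroid, radius

a, b = cf([1.0, 2.0, 3.0]), cf([10.0, 12.0])
merged = merge(a, b)
print(merged)                        # (5, 28.0, 258.0)
print(centroid_and_radius(merged))   # centroid 5.6, radius ~4.5
```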

  • Slide 113/195: BIRCH Algorithm

  • Slide 114/195: Comparison of Clustering Techniques

  • Slide 115/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 116/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Association Analysis

  • Slide 117/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

    Sequential patterns
    Graph mining

  • Slide 118/195: Association Rule Problem

    Given a set of items I = {I1, I2, ..., Im} and a database of transactions D = {t1, t2, ..., tn} where ti = {Ii1, Ii2, ..., Iik} and Iij in I, the Association Rule Problem is to identify all association rules X => Y with a minimum support and confidence.
    Link analysis.
    NOTE: the support of X => Y is the same as the support of X U Y.

  • Slide 119/195: Association Rule Definitions

    Set of items: I = {I1, I2, ..., Im}
    Transactions: D = {t1, t2, ..., tn}, tj a subset of I
    Itemset: {Ii1, Ii2, ..., Iik}, a subset of I
    Support of an itemset: percentage of transactions which contain that itemset.
    Large (frequent) itemset: itemset whose number of occurrences is above a threshold.

  • Slide 120/195: Association Rule Definitions

    Association Rule (AR): an implication X => Y where X and Y are subsets of I and X ∩ Y = ∅.
    Support of AR (s) X => Y: percentage of transactions that contain X U Y.
    Confidence of AR (α) X => Y: ratio of the number of transactions that contain X U Y to the number that contain X.
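A minimal sketch of the two definitions above, computed over five illustrative transactions (chosen to be consistent with the example on the next slides, where the support of {Bread, PeanutButter} is 60%).

```python
def support(itemset, transactions):
    """Fraction of transactions that contain the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    """Transactions containing X ∪ Y divided by transactions containing X."""
    return sum((X | Y) <= t for t in transactions) / sum(X <= t for t in transactions)

D = [{"Bread", "Jelly", "PeanutButter"},
     {"Bread", "PeanutButter"},
     {"Bread", "Milk", "PeanutButter"},
     {"Beer", "Bread"},
     {"Beer", "Milk"}]
print(support({"Bread", "PeanutButter"}, D))          # 0.6
print(confidence({"Bread"}, {"PeanutButter"}, D))     # 0.75
```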

  • Slide 121/195: Association Rules Example

  • Slide 122/195: Association Rules Example

    I = {Beer, Bread, Jelly, Milk, PeanutButter}
    Support of {Bread, PeanutButter} is 60%.

  • Slide 123/195: Association Rules Example (cont'd)

  • Slide 124/195: Association Rule Mining Task

    Given a set of transactions T, the goal of association rule mining is to find all rules having
    - support >= minsup threshold
    - confidence >= minconf threshold
    Brute-force approach:
    - Compute the support and confidence for each rule
    - Prune rules that fail the minsup and minconf thresholds
    - Computationally prohibitive!

  • Slide 125/195: Association Rule Techniques

  • Slide 126/195: Association Rule Techniques

    1. Find large itemsets.
    2. Generate rules from frequent itemsets.

  • Slide 127/195: Mining Association Rules

    Two-step approach:
    1. Frequent itemset generation: generate all itemsets whose support >= minsup.
    2. Rule generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset.
    Frequent itemset generation is still computationally expensive.

  • Slide 128/195: Frequent Itemset Generation

    Itemset lattice over items A-E: null; A, B, C, D, E; AB, AC, AD, AE, BC, BD, BE, CD, CE, DE; ABC, ABD, ABE, ACD, ACE, ADE, BCD, BCE, BDE, CDE; ABCD, ABCE, ABDE, ACDE, BCDE; ABCDE.
    Given d items, there are 2^d possible candidate itemsets.

  • Slide 129/195: Frequent Itemset Generation

    Brute-force approach: each itemset in the lattice is a candidate frequent itemset.
    Count the support of each candidate by scanning the database of transactions:
    TID 1: Bread, Milk
    TID 2: Bread, Diaper, Beer, Eggs
    TID 3: Milk, Diaper, Beer, Coke
    TID 4: Bread, Milk, Diaper, Beer
    Match each transaction against every candidate.
    Complexity ~ O(NMw) => expensive since M = 2^d !!!

  • Slide 130/195: Computational Complexity

    Total number of itemsets = 2^d.
    Total number of possible association rules:
    R = sum_{k=1..d-1} [ C(d, k) * sum_{j=1..d-k} C(d-k, j) ] = 3^d - 2^(d+1) + 1
    If d = 6, R = 602 rules.
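A small brute-force check (not from the slides) that the closed form above counts every rule X => Y with X and Y non-empty and disjoint.

```python
from itertools import combinations

def count_rules(d):
    items = range(d)
    count = 0
    for k in range(1, d + 1):                       # choose the antecedent X with |X| = k
        for X in combinations(items, k):
            rest = d - k                            # items left over for the consequent
            count += 2 ** rest - 1                  # any non-empty subset of them can be Y
    return count

d = 6
print(count_rules(d), 3 ** d - 2 ** (d + 1) + 1)    # 602 602
```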

  • Slide 131/195: Frequent Itemset Generation Strategies

    Reduce the number of candidates (M):
    - Use pruning techniques to reduce M
    Reduce the number of transactions (N):
    - Reduce the size of N as the size of the itemset increases
    - Used by DHP and vertical-based mining algorithms
    Reduce the number of comparisons (NM):
    - Use efficient data structures to store the candidates or transactions
    - No need to match every candidate against every transaction

  • Slide 132/195: Reducing Number of Candidates

    Apriori principle: if an itemset is frequent, then all of its subsets must also be frequent.
    This holds due to the following property of the support measure:
    for all X, Y: (X a subset of Y) => s(X) >= s(Y)
    The support of an itemset never exceeds the support of its subsets.
    This is known as the anti-monotone property of support.

  • Slide 133/195: Apriori

    Large itemset property.

  • Slide 134/195: Apriori

    Large itemset property: any subset of a large itemset is large.
    Contrapositive: if an itemset is not large, none of its supersets are large.

  • Slide 135/195: Apriori Algorithm

    Notation:
    - Lk = set of k-itemsets which are frequent
    - Ck = set of k-itemsets which could be frequent (candidates)

  • Slide 136/195: Apriori Algorithm

    Method:
    - Initially, let k = 1.
    - Repeat until no new frequent itemsets are identified:
      - Generate candidate (k+1)-itemsets from the frequent k-itemsets
      - Count the support of each candidate by scanning the DB
      - Eliminate infrequent candidates, leaving only those that are frequent

  • Slide 137/195: Illustrating the Apriori Principle (minimum support count = 3)

    Item counts (C1): Bread 4, Coke 2, Milk 4, Beer 3, Diaper 4, Eggs 1.
    a) No need to generate candidate pairs involving the infrequent items Coke and Eggs.
    b) Counting the candidate pairs (C2): {Bread, Milk} 3, {Bread, Beer} 2, {Bread, Diaper} 3, {Milk, Beer} 2, {Milk, Diaper} 3, {Beer, Diaper} 3.
    c) Filter the non-frequent pairs to obtain L2: {Bread, Milk}, {Bread, Diaper}, {Milk, Diaper}, {Beer, Diaper}.
    The surviving candidate triplet (C3) is {Bread, Milk, Diaper}, with count 3 (L3).
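A minimal level-wise Apriori sketch of the procedure illustrated above, run on the four transactions listed a few slides back with minimum support count 3; the candidate-pruning step on infrequent subsets is omitted for brevity.

```python
def apriori(transactions, minsup):
    items = sorted({i for t in transactions for i in t})
    frequent = {}
    level = [frozenset([i]) for i in items]           # C1: candidate 1-itemsets
    k = 1
    while level:
        counts = {c: sum(c <= t for t in transactions) for c in level}
        Lk = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(Lk)
        # join step: unions of frequent k-itemsets that form (k+1)-itemsets
        level = list({a | b for a in Lk for b in Lk if len(a | b) == k + 1})
        k += 1
    return frequent

D = [frozenset(t) for t in ({"Bread", "Milk"},
                            {"Bread", "Diaper", "Beer", "Eggs"},
                            {"Milk", "Diaper", "Beer", "Coke"},
                            {"Bread", "Milk", "Diaper", "Beer"})]
for itemset, n in apriori(D, minsup=3).items():
    print(sorted(itemset), n)   # the four frequent single items plus {Beer, Diaper}
```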

  • Slide 138/195: AprioriGen Example

  • Slide 139/195: Apriori Advantages / Disadvantages

    Advantages:
    - Uses the large itemset property.

  • Slide 140/195: Apriori Advantages / Disadvantages

    Advantages:
    - Uses the large itemset property.
    - Easy to implement.
    Disadvantages:
    - Assumes the transaction database is memory resident.
    - Requires up to m database scans.

  • Slide 141/195: Sampling

    Large databases: sample the database and apply Apriori to the sample.
    Potentially large itemsets (PL): large itemsets from the sample.
    Negative border (BD-): a generalization of AprioriGen applied to itemsets of varying sizes; the minimal set of itemsets which are not in PL, but whose subsets are all in PL.

  • Slide 142/195: Negative Border Example

    (Figure: PL and BD-(PL).)

  • Slide 143/195: Sampling Algorithm

    1. Ds = sample of database D;
    2. PL = large itemsets in Ds using smalls;
    3. C = PL ∪ BD-(PL);
    4. Count C in the database using s;
    5. ML = large itemsets in BD-(PL);
    6. If ML = ∅ then done;
    7. else C = repeated application of BD-;
    8. Count C in the database;

  • Slide 144/195: Sampling Example

    Find ARs assuming s = 20%.
    Ds = {t1, t2}; smalls = 10%.
    PL = { ..., {Bread, PeanutButter}, {Jelly, PeanutButter}, ... }
    BD-(PL) = {{Beer}, {Milk}}
    ML = {{Beer}, {Milk}}
    The remaining itemsets are counted against the full database.

  • Slide 145/195: Sampling Advantages / Disadvantages

    Advantages:
    - Reduces the number of database scans to one in the best case and two in the worst.
    - Scales better.
    Disadvantages:
    - Potentially many candidates in the second pass.

  • Slide 146/195: Partitioning

    Divide the database into partitions D1, D2, ..., Dp.
    Apply Apriori to each partition.
    Any large itemset must be large in at least one partition.

  • Slide 147/195

  • Slide 148/195

  • Slide 149/195: Partitioning Advantages / Disadvantages

    Advantages:
    - Adapts to available main memory.
    - Maximum number of database scans is two.
    Disadvantages: ...

  • Slide 150/195: Parallelizing AR Algorithms

    Based on Apriori.
    Techniques differ in:
    - what is counted at each site
    - how data (transactions) are distributed
    Data parallelism: data partitioned - Count Distribution Algorithm.
    Task parallelism - Data Distribution Algorithm.

  • Slide 151/195

  • Slide 152/195

    1. Place a data partition at each site.
    2. In parallel, at each site do:
    3. Determine local candidates of size 1 to count;
    4. Broadcast local transactions to the other sites;
    5. Count local candidates of size 1 on all data;

  • Slide 153/195

    5. Count local candidates of size 1 on all data;
    6. Determine large itemsets of size 1 for local candidates;
    7. Broadcast large itemsets to all sites;
    8. L1 = union of the local large itemsets;
    9. i = 1;
    10. Repeat
    11.   i = i + 1;
    12.   Ci = AprioriGen(L_{i-1});
    13.   Determine local candidates of size i to count;
    14.   Count, broadcast, and find Li;

  • Slide 154/195: Example

  • Slide 155/195: Comparing Techniques

    Dimensions along which AR techniques can be compared:
    - Target
    - Type
    - Data type
    - Technique
    - Itemset strategy and data structure
    - Transaction strategy and data structure
    - Optimization
    - Parallelism strategy

  • Slide 156/195

  • Slide 157/195: Incremental Association Rules

    Generate ARs in a dynamic database.
    Existing algorithms assume a static database.
    Objective: find large itemsets for D ∪ ΔD (the original database plus the newly added transactions).
    An itemset must be large in either D or ΔD.
    Save Li and counts.

  • Slide 158/195: Association Rules: Evaluation

  • Slide 159/195: Association Rules: Evaluation

    Association rule algorithms tend to produce too many rules:
    - many of them are uninteresting or redundant
    - redundant if {A, B, C} => {D} and {A, B} => {D} have the same support & confidence
    Support & confidence are the only measures used in the original formulation.
    Interestingness measures can be used to prune/rank the derived patterns.

  • Slide 160/195: Measuring the Quality of Rules

    - Support
    - Confidence
    - Conviction
    - Chi-squared test

  • Slide 161/195: Advanced Techniques

    - Generalized association rules (multiple minimum supports)
    - Multiple-level association rules (using multiple minimum supports)
    - Correlation rules
    - Sequential pattern mining
    - Graph mining
    - Mining association rules in stream data
    - Fuzzy association rules

  • Slide 162/195: Extensions: Handling Continuous Attributes

    Different kinds of rules, e.g. Age in [21, 35) and Salary in [70k, 120k) => Buy.
    Different methods:
    - Discretization-based
    - Statistics-based
    - Non-discretization-based

  • Slide 163/195: Extensions: Sequential Pattern Mining

    Sequence databases, their elements (transactions) and events (items):
    - Customer: sequence = purchase history of a given customer; element = a set of items bought by a customer; event = books, diary products, ...
    - Web data: sequence = browsing activity of a particular Web visitor; element = a collection of files viewed by a Web visitor after a single mouse click; event = home page, index page, contact info, etc.
    - Event data: sequence = history of events generated by a given sensor; element = events triggered by a sensor at time t; event = types of alarms generated by sensors.
    - Genome sequences: sequence = DNA sequence of a particular species; element = an element of the DNA sequence; event = bases A, T, G, C.
    Example sequence of elements: < {E1, E2} {E1, E3} {E2, E3, E4} {E2} >.

  • Slide 164/195: Extensions: Sequential Pattern Mining

    Object  Timestamp  Events
    A       10         2, 3, 5
    A       20         6, 1
    A       23         1
    B       11         4, 5, 6
    B       17         2
    B       21         7, 8, 1, 2
    B       28         1, 6
    C       14         1, 7, 8

  • Slide 165/195: Extensions: Sequential Pattern Mining

    A sequence <a1 a2 ... an> is contained in another sequence <b1 b2 ... bm> (m >= n) if there exist integers i1 < i2 < ... < in such that each aj is a subset of b_{ij}.
    Data sequence            Subsequence      Contained?
    < {2,4} {3,5,6} {8} >    < {2} {3,5} >    Yes
    < {1,2} {3,4} >          < {1} {2} >      No
    < {2,4} {2,4} {2,5} >    < {2} {4} >      Yes
    The support of a subsequence w is defined as the fraction of data sequences that contain w.
    A sequential pattern is a frequent subsequence (i.e., a subsequence whose support is >= minsup).
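A minimal sketch (not from the slides) of the containment test just defined: greedily match each element ai of the candidate subsequence to the earliest later element of the data sequence that contains it.

```python
def contains(data_seq, sub_seq):
    """True if sub_seq = <a1 ... an> is contained in data_seq = <b1 ... bm>."""
    j = 0
    for element in data_seq:
        if j < len(sub_seq) and sub_seq[j] <= element:   # ai must be a subset of a later element
            j += 1
    return j == len(sub_seq)

print(contains([{2, 4}, {3, 5, 6}, {8}], [{2}, {3, 5}]))   # True
print(contains([{1, 2}, {3, 4}], [{1}, {2}]))              # False
print(contains([{2, 4}, {2, 4}, {2, 5}], [{2}, {4}]))      # True
```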

  • Slide 166/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 167/195: Extensions: Graph Mining

    Extend association rule mining to finding frequent subgraphs.
    Useful for Web mining, computational chemistry, bioinformatics, spatial data sets, etc.
    (Figure: a small Web graph with pages Homepage, Research, Artificial Intelligence, Data Mining.)

  • Slide 168/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 169/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Link Mining

  • Slide 170/195: Linked Data

  • Slide 171/195: Linked Data

    Heterogeneous, multi-relational data.
    Nodes are objects:
    - May have different kinds of objects
    - Objects have attributes
    - Objects may have labels or classes
    Edges are links:
    - May have different kinds of links
    - Links may be directed, and are not required to be binary

  • Slide 172/195: Sample Domains

    - Web data (web)
    - Bibliographic data (cite)
    - Epidemiological data (epi)

  • Slide 173/195

    (Figure: papers P1-P4 and authors A1 linked by Citation, Co-Citation, Author-of and Author-affiliation relations, with attributes/categories attached to the objects.)

  • Slide 174/195: Link Mining Tasks

    - Link-based object classification
    - Link type prediction
    - Predicting link existence
    - Object identification
    - Subgraph discovery

  • Slide 175/195

    Webpages

    Intrapagestructures

  • 8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining

    176/195

    Usa edata

    SupplementaldataProfiles

    Cookies

    176

  • 8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining

    177/195

    e ruc ure n ng

    Minestructure(links,graph)oftheWeb

    Techniques

  • 8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining

    178/195

    CLEVER

    CreateamodeloftheWeborganization. ay ecom ne w t contentm n ngto

    more

    effectivel

    retrieve

    im ortant

    a es.

    178

  • Slide 179/195: PageRank

    PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn)
    PR(i): PageRank for a page i which points to the target page p.

  • Slide 180/195: PageRank (cont'd)

    Ni: number of links coming out of page i.

  • Slide 181/195: PageRank (cont'd) - General Principle

    Every page has some number of Outbound links (forward links) and Inbound links (backlinks).
    A page X has a high rank if:
    - It has many Inbound links
    - It has highly ranked Inbound links
    - Pages linking to X have few Outbound links
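A minimal sketch of the PageRank iteration defined two slides back, PR(p) = c (PR(1)/N1 + ... + PR(n)/Nn). As an assumption not stated on the slides, the usual (1 - c)/N teleportation term is added so the iteration converges; the four-page link graph is illustrative.

```python
def pagerank(links, c=0.85, iters=50):
    """links: dict page -> list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iters):
        new = {}
        for p in pages:
            # sum PR(i)/Ni over every page i that links to p
            inbound = sum(pr[q] / len(links[q]) for q in pages if p in links[q])
            new[p] = (1 - c) / len(pages) + c * inbound
        pr = new
    return pr

links = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))   # "C" (it has the most, and the best-ranked, inbound links)
```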

  • Slide 182/195: HITS

    Based on a set of keywords, find a set of relevant pages, R.
    Expand R to a base set, B, of pages linked to or from R.
    Calculate weights for authorities and hubs.
    Pages with the highest ranks in R are returned.
    Authoritative pages: highly important pages; the best source for the requested information.
    Hub pages: pages that point to many authoritative pages.

  • Slide 183/195

  • Slide 184/195: Web Usage Mining

    Extends the work of basic search engines.
    Search engines:
    - IR application
    - Keyword based
    - Crawlers
    - Indexing
    - Link analysis

  • Slide 185/195: Web Usage Mining Applications

    - Personalization
    - Improve the structure of a site's Web pages
    - Aid in caching and prediction of future page references
    - Improve the design of individual pages
    - Improve the effectiveness of e-commerce (sales and advertising)

  • Slide 186/195: Web Usage Mining Activities

    Preprocessing the Web log:
    - Cleanse
    - Sessionize
    Pattern discovery:
    - Count patterns that occur in sessions
    - A pattern is a sequence of page references in a session
    - Similar to association rules (transaction: session; itemset: pattern or subset; order is important)

  • Slide 187/195: ICDM06 Panel on Top 10 Algorithms in Data Mining - Integrated Mining

  • Slide 188/195: Integrated Mining

    On the use of association rule algorithms for classification.
    Subgroup Discovery: characterization of classes.
    "Given a population of individuals and a property of those individuals we are interested in, find population subgroups that are statistically 'most interesting', e.g., are as large as possible and have the most unusual statistical characteristics with respect to the property of interest."

  • Slide 189/195: Rough Set Approach

    Rough sets are used to approximately or "roughly" define equivalent classes.
    A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C).
    Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix is used to reduce the computation intensity.

  • Slide 190/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 191/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 192/195: ICDM06 Panel on Top 10 Algorithms in Data Mining

  • Slide 193/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 194/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks

  • Slide 195/195: Data Mining - From the Top 10 Algorithms to the New Challenges (Outline)

    Top 10 Algorithms in Data Mining Research: Introduction, Classification, Statistical Learning, Bagging and Boosting, Clustering, Association Analysis, Link Mining, Text Mining, Top 10 Algorithms
    10 Challenging Problems in Data Mining Research
    Concluding Remarks