Spatial Data Mining: Database Primitives, Algorithms and ...
DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
Transcript of DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
1/195
Data Mining and Soft Computing
Francisco HerreraResearch Group on Soft Computing and
Information Intelli ent S stems SCI2S
Dept. of Computer Science and A.I.University of Granada, Spain
Email: [email protected]
htt ://sci2s.u r.es
http://decsai.ugr.es/~herrera
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
2/195
Data Mining and Soft Computing
Summary
1. Introduction to Data Mining and Knowledge Discovery2. Data Preparation
4. Data Mining - From the Top 10 Algorithms to the New Challenges
5. Introduction to Soft Computing. Focusing our attention in Fuzzy Logic
6.Soft Computing Techniques in Data Mining: Fuzzy Data Mining and
Knowledge Extraction based on Evolutionary Learning
. ene c uzzy ys ems: a e o e r an ew ren s
8. Some Advanced Topics I: Classif ication with Imbalanced Data Sets9. Some Advanced Topics II: Subgroup Discovery
10.Some advanced Topics III: Data Complexity
11.Final talk: How must I Do my Experimental Study? Design of
Ex eriments in Data Minin /Com utational Intelli ence. Usin Non-parametric Tests. Some Cases of Study.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
3/195
Slidesusedforpreparingthis
talk:Top 10 Algorithms in Data Mining ResearchTop 10 Algorithms in Data Mining Research
10 Challen in Problems in Data Minin Research10 Challen in Problems in Data Minin Researchprepared for ICDM 2005prepared for ICDM 2005
CS490D:
IntroductiontoDataMining
Association Analysis: Basic Conceptsand Algorithms Prof.ChrisClifton
Lecture Notes for Chapter 6
Introduction to Data Minin
DATA MININGIntroductory and Advanced Topics
3by Tan, Steinbach, Kumar
argare . un am
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
4/195
Data Mining - From the Top 10 Algorithms to the
ew a enges
u ne
Top 10 Algorithms in Data Mining Research Introduction
Classification
Statistical Learning Bagging and Boosting
Clustering
Association Analysis Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research4 Concluding Remarks
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
5/195
Data Mining - From the Top 10 Algorithms to the
ew a enges
u ne
Top 10 Algorithms in Data Mining Research Introduction
Classification
Statistical Learning Bagging and Boosting
Clustering
Association Analysis Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research5 Concluding Remarks
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
6/195
Challen esinDataMininChallen esinDataMinin
DiscussionPanelsatICDM2005and2006DiscussionPanelsatICDM2005and2006
6
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
7/195
Challen esinDataMininChallen esinDataMinin
DiscussionPanelsatICDM2005and2006DiscussionPanelsatICDM2005and2006Top 10 Algorithms in Data Mining ResearchTop 10 Algorithms in Data Mining Research
prepared for ICDM 2006prepared for ICDM 2006
10 Challenging Problems in Data Mining Research10 Challenging Problems in Data Mining Researchprepare orprepare or
7
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
8/195
To 10Al oritms in DataMinin Research
preparedforICDM2006preparedforICDM2006http://www.cs.uvm.edu/~icdm/algorithms/index.shtml
XindongWu
Universit of Vermonthttp://www.cs.uvm.edu/~xwu/home.html
VipinKumar
UniversityofMinessotahttp://wwwusers.cs.umn.edu/~kumar/
8
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
9/195
ICDM06PanelonTop
10
Algorithms
in
Data
Mining
9
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
10/195
ICDM06PanelonTop
10
Algorithms
in
Data
Mining
10
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
11/195
ICDM06PanelonTop
10
Algorithms
in
Data
Mining
11
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
12/195
ICDM06PanelonTop
10
Algorithms
in
Data
Mining
12
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
13/195
Data Mining - From the Top 10 Algorithms to the
ew a enges
u ne
Top 10 Algorithms in Data Mining Research Introduction
Classification
Statistical Learning Bagging and Boosting
Clustering
Association Analysis Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research13 Concluding Remarks
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
14/195
ICDM06PanelonTop
10
Algorithms
in
Data
Mining
Classification
14
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
15/195
g
Partitioningbased:Dividesearchspaceinto
. Tupleplacedintoclassbasedontheregion
withinwhichitfalls.
DTInduction
Internalnodesassociatedwithattributeand
Algorithms:ID3, C4.5,CART
15
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
16/195
ec s on ree
D={t1,,tn}whereti=
Databaseschemacontains {A1,A2,,Ah}ClassesC= C . C
DecisionorClassificationTreeisatreeassociatedwith
Eachinternalnodeislabeledwithattribute,Ai
Each
arc
is
labeled
with
predicate
which
can
be
appliedtoattributeatparent
Eachleafnodeislabeledwithaclass,Cj
16
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
17/195
ra n ng a ase
age income student credit_rating buys_computer
40 low yes excellent no
an
x m l ow yes exce en yes
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
18/195
Out ut:ADecisionTreefor
buys_computera e?
overcas >40..
student? credit rating?yes
no yes fairexcellent
no noyes yes
18
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
19/195
gor m or ec s on ree n u c on
Basicalgorithm(agreedyalgorithm)
Treeisconstructedinato downrecursivedivideandcon uer
manner Atstart,allthetrainingexamplesareattheroot
Attributesarecategorical(ifcontinuousvalued,theyarediscretized inadvance)
Testattributesareselectedonthebasisofaheuristicorstatisticalmeasure e. . information ain
Conditionsforstoppingpartitioning Allsam lesfora ivennodebelon tothesameclass
Therearenoremainingattributesforfurtherpartitioningmajorityvotingisemployedforclassifyingtheleaf
19
Therearenosamplesleft
DT I d i
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
20/195
DTInduction
20
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
21/195
p s rea
GenderM
F
Height
21
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
22/195
ompar ng s
Balanced
22
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
23/195
ssues
ChoosingSplittingAttributes
OrderingofSplittingAttributes
TreeStructure StoppingCriteria
TrainingData run ng
23
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
24/195
DecisionTreeInductionisoftenbasedon
InformationTheory
So
24
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
25/195
n orma on
25
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
26/195
n orma on n ropy
Givenpro a i ititesp1,p2,..,psw osesumis1,Entropyisdefinedas:
Entropymeasurestheamountofrandomnessorsur riseoruncertaint .
Goalinclassification
entropy=0
26
Att ib t S l ti M
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
27/195
AttributeSelectionMeasure:
InformationGain(ID3/C4.5) Select the attribute with the highest information
S contains si tuples of class Ci for i = {1, , m}
n ormat on measures n o requ re to c ass y
any arbitrary tuple ss im
i=
entropy of attribute A with values {a ,a ,,a }
ssi 1=
)s,...,s(Is
s...sE(A) mjjv mjj
11
=
++=
information gained by branching on attribute A
27
E(A))s,...,s,I(sGain(A) m = 21
Attribute Selection by
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
28/195
AttributeSelectionby
Information
Gain
Computationg ClassP:buys_computer=yesg ClassN:buys_computer=no )0,4(14
4)3,2(
14
5)( += IIageE
, , .
g Computetheentropyforage:
694.0)2,3(145 =+ I
5 =
14samples,with2yeses and340 3 2 0.971
246.0)(),()( == ageEnpIageGainage income student credit_rat ing buys_computer 40 low yes fair yes
151.0)( =studentGainow yes exce en no
3140 low yes excellent yes
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
29/195
er r u e e ec on easures
Giniindex(CART,IBMIntelligentMiner)
Allattributesareassumedcontinuousvalued Assumethereexistseveralpossiblesplitvaluesforeach
attribute
, ,
possiblesplitvalues
an emo e orca egor ca a r u es
29
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
30/195
GiniIndex
(IBM
IntelligentMiner) IfadatasetTcontainsexamplesfromnclasses,giniindex,
gini(T)isdefinedas = n 2
wherepjistherelativefrequencyofclassjinT.=j 1
a a ase ssp n o wosu se s 1an 2w s zes 1andN2respectively,theginiindexofthesplitdatacontains
,
21 NN21
NNsplit
T eattr uteprov est esma estg n splitT sc osentosp tthenode(needtoenumerateallpossiblesplittingpointsfor
30
.
Extractin Classification Rules from
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
31/195
Extractin ClassificationRulesfrom
Trees RepresenttheknowledgeintheformofIFTHENrules
Eachattributevaluepairalongapathformsaconjunction
Theleafnodeholdstheclassprediction
Example
IFage=
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
32/195
vo ver ng n ass ca on
Overfitting: Aninducedtreemayoverfit thetrainingdata Toomanybranches,somemayreflectanomaliesduetonoiseor
outliers Pooraccuracyforunseensamples
Twoapproac estoavoi over itting Prepruning:Halttreeconstructionearlydonotsplitanodeifthis
Difficulttochooseanappropriatethreshold
Post runin :Removebranchesfromafull rowntree etasequenceofprogressivelyprunedtrees
Useasetofdatadifferentfromthetrainingdatatodecidewhich
32
A roaches to Determine the Final
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
33/195
A roachestoDeterminetheFinal
TreeSize Separatetraining(2/3)andtesting(1/3)sets
Usecrossvalidation,e.g.,10foldcrossvalidation
butapplyastatisticaltest(e.g.,chisquare)toestimatewhetherexpandingorpruninganodemayimprovethe
entiredistribution
Useminimumdescriptionlength(MDL)principle
haltinggrowthofthetreewhentheencodingisminimized
33
Enhancements to basic decision
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
34/195
Enhancementstobasicdecision
treeinduction Allowforcontinuousvaluedattributes
Dynam ca y
e ne
new
screte
va ue
attr utes
t at
partitionthecontinuousattributevalueintoadiscreteset
o interva s
Handlemissin attributevalues Assignthemostcommonvalueoftheattribute
ss gnpro a tytoeac o t eposs eva ues
Attributeconstruction
Createnewattributesbasedonexistingonesthatare
CS490D 34
Thisreducesfragmentation,repetition,andreplication
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
35/195
ec s on reevs. u es
Treehasimpliedorder Ruleshavenoordering
performed.
.
Treecreatedbasedon
lookin atallclasses.
Onlyneedtolookat
rules.
35
Scalable Decision Tree Induction Methods
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
36/195
ScalableDecisionTreeInductionMethods
in
Data
Mining
Studies Classificationaclassicalproblemextensivelystudiedby
Scalability:Classifyingdatasetswithmillionsofexamplesand
un re so attr utesw t reasona espee
Why
decision
tree
induction
in
data
mining? relativelyfasterlearningspeed(thanotherclassificationmethods)
convertibletosimpleandeasytounderstandclassificationrules
canuseSQLqueriesforaccessingdatabases
comparableclassificationaccuracywithothermethods
36
Scalable Decision Tree Induction Methods
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
37/195
ScalableDecisionTreeInductionMethods
in
Data
Mining
Studies SLIQ(EDBT96 Mehtaetal.)
buildsanindexforeachattributeandonlyclasslistandthecurrent
attributelistresideinmemory SPRINT(VLDB96 J.Shaferetal.)
constructsanattributelistdatastructure
PUBLIC(VLDB98 Rastogi&Shim) integratestreesplittingandtreepruning:stopgrowingthetreeearlier
RainForest (VLDB98 Gehrke,Ramakrishnan&Ganti) separatest esca a tyaspects romt ecr ter at at eterm net e
qualityofthetree
37
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
38/195
ns ance ase e o s
Instancebasedlearning: Storetrainingexamplesanddelaytheprocessing(lazyevaluation)
untilanewinstancemustbeclassified Typicalapproaches
knearestneighborapproach
InstancesrepresentedaspointsinaEuclideanspace.
oca ywe g e regress on
Constructslocalapproximation
Usessymbolicrepresentationsandknowledgebasedinference
38
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
39/195
ass ca on s ng s ance
Placeitemsinclasstowhich theyareclosest.
Mustdeterminedistancebetweenaniteman ac ass.
Classesre resentedbCentroid:Centralvalue.
Medoid: Representativepoint.
Algorithm:KNN
39
The kNearest Nei hbor
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
40/195
Thek NearestNei hbor
Algorithm AllinstancescorrespondtopointsinthenDspace.
ThenearestneighboraredefinedintermsofEuclidean
distance. Thetargetfunctioncouldbediscreteorrealvalued.
Fordiscretevalued,thekNNreturnsthemostcommonvalueamongthektrainingexamplesnearesttoxq.
typicalsetoftrainingexamples.
_+_ _
+
_
.._ xq
+
_
+ .. .
40
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
41/195
eares e g or :
Trainingsetincludesclasses.
ExamineKitemsnearitemtobeclassified. ew emp ace nc assw emos
number
of
close
items. O(q)foreachtupletobeclassified. (Hereq
s es zeo e ra n ngse .
41
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
42/195
42
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
43/195
43
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
44/195
ayes an ass ca on: y
Probabilisticlearning: Calculateexplicitprobabilitiesforhypothesis,amongthemostpracticalapproachestocertainypeso earn ngpro ems
Incremental:Eachtrainingexamplecanincrementally.
Priorknowledgecanbecombinedwithobserveddata.
Probabilistic rediction: Predictmulti leh otheses,weightedbytheirprobabilities
Standard:EvenwhenBayesianmethodsarecomputationallyn rac a e, eycanprov eas an ar o op ma ec s on
makingagainstwhichothermethodscanbemeasured
44
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
45/195
ayes an eorem: as cs
LetXbeadatasamplewhoseclasslabelisunknown
Let H be a h othesis that X belon s to class C
Forclassificationproblems,determineP(H|X):theprobabilitythattheh othesisholds iventheobserveddatasam leX
P(H):priorprobabilityofhypothesisH(i.e.theinitial
probabilitybeforeweobserveanydata,reflectsthebackgroundknowledge)
P(X):probabilitythatsampledataisobserved
P(X|H):probabilityofobservingthesampleX,giventhatthe
hypothesisholds
45
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
46/195
ayes eorem
GiventrainingdataX,posterioriprobabilityofahypothesisH,P(H|X)followstheBayestheorem
)()()|( HPHXPXHP =
Informally,thiscanbewrittenas
osterior=likelihoodx rior/evidence MAP(maximumposteriori)hypothesis
=
.
HhHhAP
probabilities,significantcomputationalcost
46
a ve ayes ass er
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
47/195
a ve ayes ass er
Asimplifiedassumption:attributesareconditionallyindependent: n
== k CixkPCiXP 1 )|()|(
Theproductofoccurrenceofsay2elementsx1andx2,giventhecurrentclassisC,istheproductoftheprobabilitiesof
,P([y1,y2],C)=P(y1,C)*P(y2,C)
Nodependencerelationbetweenattributes
Greatlyreducesthecomputationcost,onlycounttheclassdistribution.
Oncet epro a i ityP X Ci is nown,assignXtot ec asswithmaximumP(X|Ci)*P(Ci)
47
ra n ng a ase
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
48/195
ra n ng a ase
age income student credit_rating buys_computer
40 low yes fair yes
>40 low es excellent no
_no
3140 low yes excellent yes
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
49/195
a ve ayes an ass er: xamp e
ComputeP(X/Ci)foreachclassP(age=
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
50/195
Comments Advantages: Easytoimplement
Goodresultsobtainedinmostofthecases Disadvantages
Assumption:classconditionalindependence,thereforelossofaccuracy
rac ca y, epen enc esex s amongvar a es
E.g., hospitals:patients:Profile:age,familyhistoryetc
, ., ,
Dependencies
among
these
cannot
be
modeled
by
Nave
Bayesian
Classifier
Howtodealwiththesedependencies? BayesianBeliefNetworks
50
ICDM06Panelon
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
51/195
Top10AlgorithmsinDataMining
Statistical Learning
51
Data Mining - From the Top 10 Algorithms to the
ew a enges
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
52/195
ew a enges
u ne
Top 10 Algorithms in Data Mining Research Introduction
Classification
Statistical Learning Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
52 Concluding Remarks
uppor vec or mac ne
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
53/195
uppor vec ormac ne
Classificationisessentiallyfindingthebest
oun ary
e ween
c asses.
boundarypointscalledsupportvectorsand
ui c assi ierontopo t em.
machine.
53
xamp e o genera
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
54/195
xamp eo genera
Thedotswithshadowaroundthemaresupportvectors.pointstorepresenttheboundary.
The
curve
is
the .
54
SVM S V M hi
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
55/195
SVM SupportVectorMachines
Support Vectors
p ma yper p ane separa e case
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
56/195
p ma yperp ane,separa ecase.
Inthiscase,class1and.
Therepresentingpoints 00=+Tx
themarginbetween X
maximized. X X Crosse pointsare
supportvectors. XC
56
on
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
57/195
on .
LinearSupportVectorMachine
Givenasetofpoints withlabelnix }1,1{yi TheSVMfindsahyperplanedefinedbythepair(w,b)
where wis the normal to the lane andbis thedistancefromtheorigin)
s.t. Nibwxy ii ,...,11)( =++x eature vector, - as, y- c ass a e , w - marg n
57
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
58/195
na ys s on .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
59/195
na ys s on .
3.Sotheproblemoffindingoptimalhyperplaneturns
Maximizing Con
,,0
=
Subjecttoconstrain:
T
4.Itsthesameas:.,...,,0ii
Minimizing
subject
to
||||.,...,1,1)( 0 Nixy ii =>+
59
enera
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
60/195
clearl donothavea ood
optimallinearclassifier.
A nonlinear boundar as
shownwilldofine.
60
onsepara ecase
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
61/195
p
Whenthedatasetis
nonsepara eas
shownintheri ht00=+Tx
figure,wewillassign X*we g ttoeac
su ortvectorwhichX
X
willbeshowninthe Xconstraint. C
61
on near
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
62/195
? 0x w b + >
In non linear case we can see this as?
,i
Kernel Can be thought of as doing dot productin some high dimensional space
62
enera on .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
63/195
Similartolinearcase,thesolutioncanbe
writtenas:
0'1
0 )(),()()( +=+= =
iii
i
i
Txhxhyxhxf
Butfunctionhisofveryhighdimension
sometimes
infinity,
does
it
mean
SVM
is
mprac ca
63
esu ng ur aces
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
64/195
g
64
epro uc ng erne .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
65/195
Lookat thedualproblem,thesolution
onlydependson . )(),( 'ii xhxh
need
to
onl
look
at
their
kernel
representation:K(X,X)= )(),( 'ii xhxh
Which
lies
in
a
much
smaller
dimension
pace an .
65
es r c onsan yp ca erne s.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
66/195
Kernelrepresentationdoesnotexistallthe
me, ercer scon on ouran an
Hilbert 1953 tellsustheconditionforthis
kindofexistence.
T ereareaseto erne sprovento e
effective suchas ol nomialkernelsand
radialbasiskernels.
66
xamp eo po ynom a erne .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
67/195
rdegreepolynomial:
x,x = + .
For a feature s ace with two in uts: x1 x2 and
apolynomialkernelofdegree2.
K(x,x)=(1+)2
251423121 )(,)(,2)(,2)(,1)( xxhxxhxxhxxhxh =====
and ,thenK(x,x)=.216 2)( xxxh =
67
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
68/195
penpro emso .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
69/195
HowdowechooseKernelfunctionfora.
havedifferentresults,althoughgenerallytheresu tsare ettert anus ng yperp anes.
classificationproblem.MinimumBayesianrisk.
achievetherisk.
69
penpro emso
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
70/195
Forverylargetrainingset,supportvectors
m g eo arges ze. pee us ecomesa
bottleneck.
AoptimaldesignformulticlassSVMclassifier.
70
e a e n s
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
71/195
http://svm.dcs.rhbnc.ac.uk/
. .
C. J. C.Burges.ATutorialonSupportVectorMachinesforPattern
. , , .
SVMlightSoftware(inC)http://ais.gmd.de/~thorsten/svm_light
BOOK: An Intro uction to upport Vector Mac ines
N. Cristianini and J. Shawe-Taylor
am r ge n vers y ress,
71
Data Mining - From the Top 10 Algorithms to the
ew a enges
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
72/195
u ne
Top 10 Algorithms in Data Mining Research Introduction
Classification
Statistical Learning
Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
72 Concluding Remarks
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
73/195
p g g
Bagging and Boosting
om n ngc ass ers
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
74/195
Examples:classificationtreesandneuralnetworks,, ,
etc.
Why? Betterclassificationperformancethanindividualclassifiers
Moreresiliencetonoise
Whynot?Time consumin
Overfitting
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
75/195
agg ngan oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
76/195
Bagging=Manipulationwith
data
set
Boosting=Manipulationwithmodel
76
agg ngan oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
77/195
Generalidea Classification method (CM)
ra n ng a a
AlteredTrainingdata Classifier C1
CM
Classifier C2
..
Aggregation. Classifier C*
77
agg ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
78/195
Breiman,1996
er ve rom ootstrap ron,
Createclassifiersusingtrainingsetsthatare
Averageresultsforeachcase
agg ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
79/195
GivenasetSofssamples
Generate a bootstra sam le T from S. Cases in S ma not
appear
in
T
or
may
appear
more
than
once.
Re eatthissam lin rocedure ettin ase uenceofkindependenttrainingsets
AcorrespondingsequenceofclassifiersC1,C2,,Ckisconstructedforeachofthesetrainingsets,byusingthesameclassificationalgorithm
ToclassifyanunknownsampleX,leteachclassifierpredictor
vote TheBaggedClassifierC*countsthevotesandassignsXtothe
classwiththemostvotes
79
(Opitz, 1999)
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
80/195
Bagging
Example(Opitz,1999)
ra n ng se
Training set 3 3 6 2 7 5 6 2 2
Training set 4 4 5 1 4 6 4 3 8
oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
81/195
Afamilyofmethods
equent a pro uct ono c ass ers
Eachclassifierisde endentonthe reviousone,and
focusesonthepreviousoneserrors
classifiersarechosenmoreoftenorweightedmore
eavi y
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
82/195
oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
83/195
Theidea
83
a oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
84/195
FreundandSchapire,1996
Two
approaches
classifier(morerepresentativesofmisclassified
casesarese ec e morecommonWeigherrorsofthemisclassifiedcaseshigher(all
casesareincorporated,butweightsaredifferent)
a oos ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
85/195
Define kasthesumoftheprobabilitiesforthe
Multiplyprobabilityofselectingmisclassifiedcases
k=(1k)/k Renormalizeprobabilities(i.e.,rescalesothatit
sumsto1)
CombineclassifiersC1
Ckusingweightedvoting
(Opitz,1999)
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
86/195
Boosting Example
ra n ng se
Training set 3 7 1 5 8 1 8 1 4
Training set 4 1 1 6 1 1 3 1 5
Data Mining - From the Top 10 Algorithms to the
ew a enges
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
87/195
u ne
Top 10 Algorithms in Data Mining Research
Introduction
Classification
Statistical Learning
Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining Top 10 Algorithms
10 Challen in Problems in Data Minin Research
87 Concluding Remarks
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
88/195
Clustering
88
us er ng ro em
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
89/195
GivenadatabaseD={t1,t2,,tn}oftuplesand,
todefineamappingf:D{1,..,k}whereeachtisass gne toonec uster j,
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
90/195
us er ng xamp e
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
91/195
91
us er ng eve s
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
92/195
Size Based
92
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
93/195
us er ng ssues
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
94/195
Outlierhandling
Dynamicdata
Evaluatin results
Numberofclusters
Datatobeused
ca a y
94
mpacto ut erson uster ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
95/195
95
ypeso us er ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
96/195
Hierarchical Nestedsetofclusterscreated.
ar ona nese o c us erscrea e .
In r m n l Each element handled one at atime.
mu aneous e emen s an e together.
Overlapping/Nonoverlapping
96
us er ng pproac es
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
97/195
Clustering
Hierarchical Partitional Categorical Large DB
Agglomerative Divisive Sampling Compression
97
us er arame ers
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
98/195
98
s ance e ween us ers
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
99/195
CompleteLink:largestdistancebetweenpoints
AverageLink:averagedistancebetweenpoints
Centroid:distancebetweencentroids
99
erarc ca us er ng
l d l l ll f
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
100/195
Clustersarecreatedinlevelsactuallycreatingsetsof.
Agglomerative n a yeac em n sownc us er
Iterativelyclustersaremergedtogether
BottomUp Divisive
Initiallyallitemsinonecluster
Largeclustersaresuccessivelydivided TopDown
100
erarc ca gor ms
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
101/195
SingleLink
MSTSingleLink
Avera eLink
101
en rogram
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
102/195
Dendrogram:atreedata
hierarchical
clustering
Eachlevelshowsclustersfor
. Leaf individualclusters
Aclusteratleveliistheunion
i+1.
102
eve so us er ng
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
103/195
103
ar ona us er ng
N hi hi l
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
104/195
Nonhierarchical
Createsclustersinonestepasopposedto
.
Sinceonlyonesetofclustersisoutput,the
usernormallyhastoinputthedesirednumber
, .
Usuallydealswithstaticsets.
104
ar ona gor ms
MST
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
105/195
MST
SquaredError
NearestNei hbor
PAM
BEA
105
eans
.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
106/195
. Iterativel itemsaremovedamon setsof
clustersuntilthedesiredsetisreached. Highdegreeofsimilarityamongelements
inaclusterisobtained.
GivenaclusterKi={ti1,ti2,,tim},thecluster
meanismi=(1/m)(ti1++tim)
106
=R d l i 3 4
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
107/195
Randomlyassignmeans:m1=3,m2=4
1= , , 2= , , , , , , ,
m1=2.5,m2=16 K1={2,3,4},K2={10,12,20,30,11,25},
m1=3,m2=18
K1={2,3,4,10},K2={12,20,30,11,25},m =4.75 m =19.6
K1
={2,3,4,10,11,12},K2
={20,30,25},
1 , 2 Stopastheclusterswiththesemeansare
107
esame.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
108/195
108
us er ng arge a a ases
Most clustering algorithms assume a large data
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
109/195
Mostclusteringalgorithmsassumealargedata
.
Clusteringmaybeperformedfirstonasampleoft e ata aset enapp e tot eent re ata ase.
Al orithms
BIRCH
CURE
109
DesiredFeaturesforLar e
Databases One scan (or less) of DB
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
110/195
Onescan(orless)ofDB
n ne
Sus endable sto able resumable
Incremental
Workwithlimitedmainmemory . .
Processeachtupleonce
110
BalancedIterativeReducingandClustering
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
111/195
g g
Incremental,hierarchical,onescan
Saveclusteringinformationinatree
Eachentryinthetreecontainsinformation
Newnodesinsertedinclosestentryintree
111
us er ng ea ure
CTTriple: (N,LS,SS)
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
112/195
N: Numberofpointsincluster
LS: Sumofpointsinthecluster
CFTree
NodehasCFtripleforeachchild
Leafnoderepresentscluster andhasCFvalueforeach
subcluster
in
it.Subclusterhasmaximumdiameter
112
gor m
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
113/195
113
ComparisonofClusteringTechniques
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
114/195
114
Data Mining - From the Top 10 Algorithms to the
ew a enges
u ne
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
115/195
u e Top 10 Algorithms in Data Mining Research
Introduction
Classification
Statistical Learning
Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining Top 10 Algorithms
10 Challen in Problems in Data Minin Research
115 Concluding Remarks
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
116/195
ssociation nalysis
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
117/195
Sequential Patterns
Gra h Minin
117
ssoc a on u e ro em
GivenasetofitemsI={I1,I2,,Im}andad b f i D
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
118/195
databaseoftransactionsD= t t t
where
ti={Ii1,Ii2,
,
Iik}
and
IijI,
the
associationrulesX Ywithaminimum
suppor an con ence. LinkAnal sis
NOTE:SupportofX Yissameassupporto .
118
ssoc a on u e e n ons
Setofitems:I={I1,I2,,Im}
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
119/195
f { 1, 2, , m}
Transactions:D={t1,t2,,tn},tjI
i1, i2,, ik
Su ort o anitemset:Percenta eof
transactionswhichcontainthatitemset.
Large(Frequent)itemset:Itemsetwhose
119
ssoc a on u e e n ons
Association Rule (AR): implication X Y
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
120/195
AssociationRule(AR):implicationX Y
whereX,YIandXY=;
transactionsthatcontainXY
Confidenceof
AR
()
X
Y:Ratioof
YtothenumberthatcontainX
120
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
121/195
ssoc a on u es xamp e
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
122/195
I = { Beer, Bread, Jelly, Milk, PeanutButter}Support of {Bread,PeanutButter} is 60%
122
ssoc a on u es x con
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
123/195
123
ssoc a on u e n ng as
GivenasetoftransactionsT,thegoalofassociation
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
124/195
supportminsupthresholdconfidence minconfthreshold
Bruteforceapproach:
Computethesupportandconfidenceforeachrule
Prunerulesthatfailtheminsupandminconfthresholds
Computationallyprohibitive!
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
125/195
ssoc a on u e ec n ques
1. FindLargeItemsets.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
126/195
2. Generaterulesfromfrequentitemsets.
126
n ng ssoc a on u es
Twostepapproach:
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
127/195
1. FrequentItemsetGeneration
Generateallitemsetswhosesu ortminsu
.
Generatehighconfidencerulesfromeachfrequent
temset,w ereeac ru e sa narypart t on ngo a
frequentitemset
Frequentitemsetgenerationisstill
Fre uentItemsetGenerationnull
A B C D E
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
128/195
AB AC AD AE BC BD BE CD CE DE
ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE
ABCD ABCE ABDE ACDE BCDE
Given d items, there
are 2d possible
Fre uentItemsetGeneration
Eachitemsetinthelatticeisacandidatefrequent
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
129/195
q
itemset
database Transactions
ems
1 Bread, Milk2 Bread, Diaper, Beer, Eggs3 Milk, Diaper, Beer, Coke4 Bread, Milk, Diaper, Beer
Match each transaction a ainst ever candidate
Complexity~O(NMw)=>ExpensivesinceM=2d !!!
Com utationalCom lexit
Totalnumberofitemsets=2d
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
130/195
Totalnumberofpossibleassociationrules:
11 1 =
=
= d
k
kd
j jkR
123 1 += +dd
If d=6, R = 602 rules
Fre uentItemsetGeneration
Strategies Reducethenumberofcandidates(M)
d
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
131/195
d
Use
pruning
techniques
to
reduce
M
Reducethenumberoftransactions(N) Reduce size of N as the size of itemset increases
UsedbyDHPandverticalbasedminingalgorithms
Reducethenumberofcomparisons(NM)
Useefficientdatastructurestostorethecandidatesortransactions
Noneedtomatcheverycandidateagainsteverytransaction
Reducin NumberofCandidates
A riori rinci le:
Ifanitemsetisfrequent,thenallofitssubsetsmustalsobe
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
132/195
thesupportmeasure:
)()()(:, YsXsYXYX
Supportofanitemsetneverexceedsthesupportofits
Thisisknownastheantimonotonepropertyofsupport
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
133/195
pr or
LargeItemsetProperty:
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
134/195
Anysubsetofalargeitemsetislarge. Contra ositive:
Ifanitemsetisnotlarge,
noneofitssupersetsarelarge.
134
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
135/195
pr or gor m
a es:Lk=Setofkitemsetswhicharefrequent
k f k h h ld b f
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
136/195
Ck=Setofkitemsetswhichcouldbefrequent
Method:Init.Letk=1
Repeatuntilnonewfrequentitemsetsareidentified
itemsets
DB
thosethatarefrequent
Illustratin A rioriPrinci le
Item Count
L1a) No need to generate
Bread 4Coke 2Milk 4
C2
B 3
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
137/195
Beer 3Diaper 4
Eggs 1
{Bread,Milk}
{Bread,Beer}
b) Counting{Bread,Milk} 3
{Bread,Beer} 2Bread Dia er 3
Minimum Support = 3
,{Milk, Beer}{Milk,Diaper}Beer,Dia er
{Milk,Beer} 2{Milk,Diaper} 3{Beer,Diaper} 3
c) Filter non-frequent
L2Itemset Count
Bread Milk 3
a) No need to generatecandidates involving
Bread Beer
Itemset Count
{Bread,Mi lk ,D iaper } 3
L3
{Bread,Diaper} 3{Milk,Diaper} 3{Beer,Diaper} 3Itemset
C3b) Counting
rea , , aper
AprioriGenExample
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
138/195
138
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
139/195
pr or v sa v
Advantages:
Uses large itemset property
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
140/195
Useslargeitemsetproperty.
Easytoimplement.
Disadvantages:
resident.
Requiresuptomdatabasescans.
140
amp ng
Largedatabases Sam lethedatabaseanda l A rioritothe
l
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
141/195
sample.
fromsample
GeneralizationofAprioriGenappliedtoitemsetsofvaryingsizes.
MinimalsetofitemsetswhicharenotinPL,butwhosesubsetsareallinPL.
141
ega ve or er xamp e
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
142/195
PL PL BD-(PL)
142
amp ng gor m
1. Ds=sampleofDatabaseD;
= arge emse s n us ng sma s;
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
143/195
. = arge emse s n sus ngsma s;
3. C = PLBD PL4. CountCinDatabaseusings;
5. ML=largeitemsetsinBD
(PL);.
7. else
C
=
repeated
application
of
BD
;
8. CountCinDatabase;
143
amp ng xamp e
FindARassumings=20% D = t t
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
144/195
Smalls=10%
= , , , , ,{Bread,PeanutButter},{Jelly,PeanutButter},
, ,
BD(PL)={{Beer},{Milk}}
ML={{Beer},{Milk}}
remainingitemsets
144
amp ng v sa v
Advantages:
Reduces number of database scans to one in the
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
145/195
Reducesnumberofdatabasescanstooneinthe
bestcaseandtwoinworst.Scalesbetter.
D sa vantages:
pass
145
ar on ng
DividedatabaseintopartitionsD1,D2,,Dp
Apply Apriori to each partition
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
146/195
ApplyAprioritoeachpartition
partition.
146
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
147/195
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
148/195
ar on ng v sa v
Advantages:
Adaptstoavailablemainmemory
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
149/195
dapts to a a ab e a e o y
Maximumnumberofdatabasescansistwo.
Disadvantages: .
149
ara e z ng gor ms
BasedonApriori
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
150/195
What
is
counted
at
each
site ow ata transact ons are str ute
DataParallelism
Datapartitioned CountDistributionAlgorithm
TaskParallelism
DataDistributionAlgorithm
150
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
151/195
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
152/195
1. Placedatapartitionateachsite.
.
3. Determinelocalcandidatesofsize1tocount;4. Broadcast local transactions to other sites
5. Countlocalcandidatesofsize1onalldata;
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
153/195
5 Cou t oca ca d dates o s e o a data;
6. Determinelargeitemsetsofsize1forlocalcan a es;7. Broadcastlargeitemsetstoallsites;
. 1
9. i=1;10. Repeat11. i=i+1;12. C
i=AprioriGen(L
i
1
);. eterm ne oca can ateso s ze tocount;
14. Count,broadcast,andfind Li;
153
.
xamp e
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
154/195
154
ompar ng ec n ques
arge Type
DataType
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
155/195
yp
Technique
Itemset
Strategy
and
Data
Structure TransactionStrate andDataStructure
Optimization
ParallelismStrategy
155
u
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
156/195
156
ncremen a ssoc a on u es
enera e s na ynam c a a ase.
database
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
157/195
database
Objective:
FindlargeitemsetsforD{D}
MustbelargeineitherDorD
ave ian counts
157
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
158/195
ssoc a onru es: va ua on
Associationrulealgorithmstendtoproducetoo
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
159/195
many
of
them
are
uninteresting
or
redundantRedundantif{A,B,C}{D}and{A,B}{D}
havesamesupport&confidence
,
support&confidencearetheonlymeasuresused
Interestin nessmeasurescanbeusedto rune/rank
the
derived
patterns
easur ng ua yo u es
Support
Confidence
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
160/195
Conviction
ChiSquaredTest
160
vance ec n ques
GeneralizedAssociationRules Multipleminimumsupports
MultipleLevelAssociationRules
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
161/195
Usingmultipleminimumsupports
orre a on u es
Sequentialpatternsmining
Graphmining
Minin
association
rules
in
stream
data Fuzzyassociationrules
161
x ens ons: an ng on nuous r u es
Differentkindsofrules:
Age[21,35)Salary[70k,120k)Buy
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
162/195
= =
Differentmethods: scre za on ase
Statistics
basedNondiscretizationbased
Extensions:Se uential atternminin
Sequence Sequence Element Event
Database (Transaction) (Item)Customer Purchase history of a given A set of items bought by Books, diary products,
,
Web Data Browsing activity of a A collection of files Home page, index
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
163/195
particular Web visitor viewed by a Web visitor page, contact info, etcafter a single mouse click
Event data History of events generated
b a iven sensor
Events triggered by a
sensor at time t
Types of alarms
enerated b sensors
Genomesequences
DNA sequence of aparticular species
An element of the DNAsequence
Bases A,T,G,C
Element
EventE1E2
E1E3
E2E3E4
E2ransac on
(Item)
Extensions:Se uential atternminin
Object Timestamp Events
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
164/195
j p
A 10 2 3 5A 20 6, 1A 23 1B 11 4, 5, 6B 17 2
B 21 7, 8, 1, 2B 28 1, 6C 14 1, 7, 8
Extensions:Se uential atternminin
Ase uence< a a a >iscontainedinanotherse uence(mn)ifthereexistintegersi1
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
165/195
Data sequence Subsequence Contain?< {2,4} {3,5,6} {8} > < {2} {3,5} > Yes
< > < >, ,
< {2,4} {2,4} {2,5} > < {2} {4} > Yes
Thesupportofasubsequencewisdefinedasthefractionof
datase uencesthatcontainw Asequentialpatternisafrequentsubsequence(i.e.,a
subse uencewhosesu ortisminsu
Data Mining - From the Top 10 Algorithms to theew a enges
u ne Top 10 Algorithms in Data Mining Research
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
166/195
Introduction Classification
Statistical Learning
Bagging and Boosting Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
166 Concluding Remarks
Extensions:Gra hminin
Extendassociationruleminin tofindin
frequentsubgraphs
UsefulforWebMining,computational
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
167/195
chemistr bioinformatics s atial data sets etcHomepage
Research
ArtificialIntelligence
a a n ng
Data Mining - From the Top 10 Algorithms to theew a enges
u ne Top 10 Algorithms in Data Mining Research
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
168/195
Introduction Classification
Statistical Learning
Bagging and Boosting Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
168 Concluding Remarks
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
169/195
Link Mining
169
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
170/195
n e a a
Heterogeneous,multirelationaldata
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
171/195
Nodes are objects Mayhavedifferentkindsofobjects Objectshaveattributes
Objects
may
have
labels
or
classesEd es are links
Mayhavedifferentkindsoflinks
Linksmaybedirected,arenotrequiredtobebinary
amp e oma ns
webdata web
bibliographicdata(cite)
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
172/195
epidimiological data(epi)
P1
P1
P1
P2
P3
I
P2
P3
I
P2
P3
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
173/195
P2I P2I P2
A1PapersPapers A1
P4Authors P4AuthorsCitationCitation
P4
Co-Citation
Author-of
Co-Citation
Author-ofAttributes:
Author-affiliationAuthor-affiliationCategories
n n ng as s
LinkbasedObjectClassification
LinkTypePrediction
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
174/195
re c ng n x s ence
ObjectIdentification
SubgraphDiscovery
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
175/195
175
e a a
Webpages
Intrapagestructures
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
176/195
Usa edata
SupplementaldataProfiles
Cookies
176
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
177/195
e ruc ure n ng
Minestructure(links,graph)oftheWeb
Techniques
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
178/195
CLEVER
CreateamodeloftheWeborganization. ay ecom ne w t contentm n ngto
more
effectivel
retrieve
im ortant
a es.
178
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
179/195
age an con
PR(p)=c(PR(1)/N1++PR(n)/Nn)
PR(i):PageRankforapageiwhichpointstotarget
page p
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
180/195
pagep.Ni: numberoflinkscomingoutofpagei
180
age an con
enera rincip e
A B C D
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
181/195
A B C D
Every page has some number of Outbounds links (forwardlinks) and Inbounds links (backlinks).
A page X has a high rank if:
- t as many n oun s in s- It has highly ranked Inbounds links
181
- age n ng to as ew ut oun s n s.
Basedonasetofkeywords,findsetofrelevant
pages R.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
182/195
. ExpandRtoabaseset,B,ofpageslinkedtoorfromR.
Calculateweightsforauthoritiesandhubs.
PageswithhighestranksinRarereturned. AuthoritativePages:
Highlyimportantpages.
Bestsourceforrequestedinformation.
HubPages:
182
.
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
183/195
183
e sage n ng
Extendsworkofbasicsearchengines earc ng nes
IR application
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
184/195
IRapplicationKeywordbased
Crawlers
Indexing
Linkanalysis
Prentice Hall 184
g g
Personalization
ImprovestructureofasitesWebpages
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
185/195
references
Improvedesignofindividualpages Improveeffectivenessofecommerce(sales
Prentice Hall 185
Cleanse
Sessionize .
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
186/195
PatternDiscovery oun pa erns a occur nsess ons
Patternissequenceofpagesreferencesinsession.
Similartoassociationrules Transaction:session
emse :pa ern orsu se Orderisimportant
Prentice Hall 186
ICDM06PanelonTop10AlgorithmsinDataMining
i i
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
187/195
Integrated Mining
n egra e n ng
Ontheuseofassociationrulealgorithmsforc ass ca on:
S b Di C t i ti f l
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
188/195
SubgroupDiscovery:Caracterizationofclasses
ven a popu at on o n v ua s an a property o
those individuals we are interested in, find populationsu groups a are s a s ca y mos n eres ng , e.g.,
are as large as possible and have the most unusual
interest.
188
oug e pproac
Roughsetsareusedtoapproximatelyorroughlydefineequivalentclasses
AroughsetforagivenclassCisapproximatedbytwosets:a
lower
approximation(certain
to
be
in
C)
and
an
upper
approximation (cannot be described as not belonging to C)
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
189/195
pp ( ) ppapproximation(cannotbedescribedasnotbelongingtoC)
Findingtheminimalsubsets(reducts)ofattributes(forfeaturereduction isNPhardbutadiscernibilitymatrixis
usedtoreducethecomputationintensity
189
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
190/195
190
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
191/195
191
ICDM06PanelonTop10AlgorithmsinDataMining
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
192/195
192
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
193/195
Data Mining - From the Top 10 Algorithms to theew a enges
u ne Top 10 Algorithms in Data Mining Research
Introduction Classification
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
194/195
Classification
Statistical Learning
Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
194 Concluding Remarks
Data Mining - From the Top 10 Algorithms to theew a enges
u ne Top 10 Algorithms in Data Mining Research
Introduction Classification
-
8/13/2019 DMSC4 PartI From the Top 10 Algorithms to the New Challenges in Data Mining
195/195
Classification
Statistical Learning
Bagging and Boosting
Clustering
Association Analysis
Link Mining Text Mining
Top 10 Algorithms
10 Challen in Problems in Data Minin Research
195 Concluding Remarks