What is Learning? Machine Learning: Introduction and Unsupervised Learning
Machine Learning: Introduction and Unsupervised Learning
Chapter 18.1, 18.2, 18.8.1 and "Introduction to Statistical Machine Learning"
Optional: “A Few Useful Things to Know about Machine Learning,” P. Domingos, Comm. ACM 55, 2012
What is Learning?
• "Learning is making useful changes in our minds" – Marvin Minsky
• "Learning is constructing or modifying representations of what is being experienced" – Ryszard Michalski
• "Learning denotes changes in a system that ... enable a system to do the same task more efficiently the next time" – Herbert Simon
Why do Machine Learning?
• Solve classification problems
• Learn models of data ("data fitting")
• Understand and improve the efficiency of human learning (e.g., Computer-Aided Instruction (CAI))
• Discover new things or structures that are unknown to humans ("data mining")
• Fill in skeletal or incomplete specifications about a domain
Major Paradigms of Machine Learning
• Rote Learning
• Induction
• Clustering
• Discovery
• Genetic Algorithms
• Reinforcement Learning
• Transfer Learning
• Learning by Analogy
• Multi-task Learning
Inductive Learning
• Generalize from a given set of (training) examples so that accurate predictions can be made about future examples
• Learn an unknown function: f(x) = y
– x: an input example (aka instance)
– y: the desired output, a discrete or continuous scalar value
• A hypothesis function h is learned that approximates f
Representing "Things" in Machine Learning
• An example or instance, x, represents a specific object ("thing")
• x is often represented by a D-dimensional feature vector x = (x1, ..., xD) ∈ R^D
• Each dimension is called a feature or attribute; values may be continuous or discrete
• x is a point in the D-dimensional feature space
• x is an abstraction of the object; it ignores all other aspects (e.g., two people having the same weight and height may be considered identical)
Feature Vector Representation
• Preprocess raw data
– extract a feature (attribute) vector, x, that describes all attributes relevant for an object
• Each x is a list of (attribute, value) pairs
x = [(Rank, queen), (Suit, hearts), (Size, big)]
– the number of attributes is fixed: Rank, Suit, Size
– the number of possible values for each attribute is fixed (if discrete)
Rank: 2, …, 10, jack, queen, king, ace
Suit: diamonds, hearts, clubs, spades
Size: big, small
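The card example above can be sketched in code. This is a minimal illustration, not part of the slides: the attribute names and domains follow the example, while the helper names (`make_example`, `DOMAINS`) are ours.

```python
# Attribute domains from the card example; each example is a fixed-length
# list of (attribute, value) pairs with values drawn from a fixed domain.
RANKS = ["2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king", "ace"]
SUITS = ["diamonds", "hearts", "clubs", "spades"]
SIZES = ["big", "small"]

DOMAINS = {"Rank": RANKS, "Suit": SUITS, "Size": SIZES}

def make_example(rank, suit, size):
    """Build a feature vector as a fixed list of (attribute, value) pairs."""
    x = [("Rank", rank), ("Suit", suit), ("Size", size)]
    for attr, value in x:
        assert value in DOMAINS[attr], f"illegal value {value!r} for {attr}"
    return x

x = make_example("queen", "hearts", "big")
```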
Types of Features
• A numerical feature has discrete or continuous values that are measurements, e.g., a person's weight
• A categorical feature is one that has two or more values (categories), but there is no intrinsic ordering of the values, e.g., a person's religion (aka nominal feature)
• An ordinal feature is similar to a categorical feature, but there is a clear ordering of the values, e.g., economic status, with three values: low, medium, and high
Feature Vector Representation
Each example can be interpreted as a point in a D-dimensional feature space, where D is the number of features/attributes
[Figure: playing cards plotted in a 2-D feature space; axes are Suit (spades, clubs, hearts, diamonds) and Rank (2–10, J, Q, K)]
Feature Vector Representation Example
• Text document
– Vocabulary of size D (~100,000): aardvark, …, zulu
• "Bag of words": counts of each vocabulary entry
– "To marry my true love" → (3531:1 13788:1 19676:1)
– "I wish that I find my soul mate this year" → (3819:1 13448:1 19450:1 20514:1)
• Often remove "stop words": the, of, at, in, …
• A special "out-of-vocabulary" (OOV) entry catches all unknown words
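A minimal bag-of-words sketch of the scheme above. The tiny vocabulary, its index numbers, and the stop-word list here are illustrative only; the slides' indices (3531, 13788, …) come from a real ~100,000-word vocabulary.

```python
# Toy stop-word list and vocabulary; real ones are far larger.
STOP_WORDS = {"the", "of", "at", "in", "to", "my", "that", "i", "this"}
VOCAB = {"marry": 0, "true": 1, "love": 2, "wish": 3, "find": 4,
         "soul": 5, "mate": 6, "year": 7}
OOV = len(VOCAB)  # special index catching all out-of-vocabulary words

def bag_of_words(text):
    """Map a document to a sparse {feature index: count} representation."""
    counts = {}
    for word in text.lower().split():
        if word in STOP_WORDS:
            continue  # stop words are dropped entirely
        idx = VOCAB.get(word, OOV)
        counts[idx] = counts.get(idx, 0) + 1
    return counts

bag_of_words("To marry my true love")  # {0: 1, 1: 1, 2: 1}
```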
More Feature Representations
• Image
– Color histogram
• Software
– Execution profile: the number of times each line is executed
• Bank account
– Credit rating, balance, # deposits in last day/week/month/year, # withdrawals, …
• Bioinformatics
– Medical test 1, test 2, test 3, …
Training Set
• A training set (aka training sample) is a collection of examples (aka instances), x1, ..., xn, which is the input to the learning process
• xi = (xi1, ..., xiD)
• Assume these instances are all sampled independently from the same, unknown (population) distribution, P(x)
• We denote this by xi ∼ P(x) i.i.d., where i.i.d. stands for independent and identically distributed
• Example: repeated throws of dice
Training Set
• A training set is the "experience" given to a learning algorithm
• What the algorithm can learn from it varies
• Two basic learning paradigms:
– unsupervised learning
– supervised learning
Inductive Learning
• Supervised vs. Unsupervised Learning
– supervised: a "teacher" gives a set of (x, y) pairs
– unsupervised: only the x's are given
• In either case, the goal is to estimate f so that it generalizes well to "correctly" deal with "future examples" in computing f(x) = y
– That is, find f that minimizes some measure of the error over a set of samples
Unsupervised Learning
• The training set is x1, ..., xn, and that's it!
• No "teacher" providing supervision as to how individual examples should be handled
• Common tasks:
– Clustering: separate the n examples into groups
– Discovery: find hidden or unknown patterns
– Novelty detection: find examples that are very different from the rest
– Dimensionality reduction: represent each example with a lower-dimensional feature vector while maintaining key characteristics of the training samples
Clustering
• Goal: Group the training samples into clusters such that examples in the same cluster are similar, and examples in different clusters are different
• How many clusters do you see?
• Many clustering algorithms exist
Oranges and Lemons
(from Iain Murray, http://homepages.inf.ed.ac.uk/imurray2/)
Google News
Digital Photo Collections
• You have 1000s of digital photos stored in various folders
• Organize them better by grouping into clusters
– Simplest idea: use image creation time (EXIF tag)
– More complicated: extract image features
Histogram-Based Image Segmentation
• Goal: Segment the image into K regions
– Reduce the number of gray levels to K and map each pixel to the closest gray level
Detecting Events on Twitter
• Use real-time text and images from tweets to discover new social events
• Clusters defined by similar words and word co-occurrences, plus similar image features
Google's Embedding Projector
Three Frequently Used Clustering Methods
• Hierarchical Agglomerative Clustering
– Build a binary tree over the dataset by repeatedly merging clusters
• K-Means Clustering
– Specify the desired number of clusters and use an iterative algorithm to find them
• Mean Shift Clustering
Hierarchical Agglomerative Clustering
• Initially every point is in its own cluster
• Find the pair of clusters that are the closest
• Merge the two into a single cluster
• Repeat… until the whole dataset is one giant cluster
• You get a binary tree (not shown here)
Hierarchical Agglomerative Clustering Algorithm
Hierarchical Agglomerative Clustering
How do you measure the closeness between two clusters? At least three ways:
– Single-linkage: the shortest distance from any member of one cluster to any member of the other cluster
– Complete-linkage: the largest distance from any member of one cluster to any member of the other cluster
– Average-linkage: the average distance between all pairs of members, one from each cluster
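The three linkage criteria above can be sketched directly. This is an illustrative sketch (function names and the 2-D example clusters are ours), using Euclidean distance between points.

```python
import math

def dist(p, q):
    """Euclidean distance between two points (tuples of coordinates)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def single_linkage(c1, c2):
    """Shortest distance from any member of c1 to any member of c2."""
    return min(dist(p, q) for p in c1 for q in c2)

def complete_linkage(c1, c2):
    """Largest distance from any member of c1 to any member of c2."""
    return max(dist(p, q) for p in c1 for q in c2)

def average_linkage(c1, c2):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p in c1 for q in c2) / (len(c1) * len(c2))

a = [(0.0, 0.0), (1.0, 0.0)]
b = [(3.0, 0.0), (5.0, 0.0)]
single_linkage(a, b)    # 2.0 (closest pair: (1,0) and (3,0))
complete_linkage(a, b)  # 5.0 (farthest pair: (0,0) and (5,0))
average_linkage(a, b)   # 3.5 (mean of distances 3, 5, 2, 4)
```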
Distance
• How to measure the distance between a pair of examples, X = (x1, …, xn) and Y = (y1, …, yn)?
– Euclidean: d(X, Y) = √( Σi (xi − yi)² )
– Manhattan / City-Block: d(X, Y) = Σi |xi − yi|
– Hamming: the number of features that are different between the two examples
– And many others
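The three distance measures above, written out as plain Python (function names are ours; the Hamming example reuses the card representation from earlier):

```python
import math

def euclidean(x, y):
    """sqrt of the sum of squared coordinate differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def hamming(x, y):
    """Number of features on which the two examples differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

euclidean((0, 0), (3, 4))  # 5.0
manhattan((0, 0), (3, 4))  # 7
hamming(("queen", "hearts", "big"), ("queen", "spades", "big"))  # 1
```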
Hierarchical Agglomerative Clustering
• The binary tree you get is often called a dendrogram, a taxonomy, or a hierarchy of data points
• The tree can be cut at any level to produce different numbers of clusters: if you want k clusters, just cut the (k−1) longest links
• Example: 6 Italian cities, single-linkage
Example created by Matteo Matteucci
Hierarchical Agglomerative Clustering Example
Iteration 1: Merge MI and TO
Recompute the minimum distance from the MI/TO cluster to all other cities
Iteration 2: Merge NA and RM
Iteration 3: Merge BA and NA/RM
Iteration 4: Merge FI and BA/NA/RM
Final Dendrogram
What Factors Affect the Outcome of Hierarchical Agglomerative Clustering?
• Features used
• Range of values for each feature
• Linkage method
• Distance metric used
• Weight of each feature
• …
Hierarchical Agglomerative Clustering Applet
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletH.html
Three Frequently Used Clustering Methods
• Hierarchical Agglomerative Clustering
– Build a binary tree over the dataset
• K-Means Clustering
– Specify the desired number of clusters and use an iterative algorithm to find them
• Mean Shift Clustering
K-Means Clustering
• Suppose I tell you the cluster centers, ci
– Q: How to determine which points to associate with each ci?
– A: For each point x, choose the closest ci
• Suppose I tell you the points in each cluster
– Q: How to determine the cluster centers?
– A: Choose ci to be the mean/centroid of all points in the cluster
K-Means Clustering
• The dataset. Input k = 5
• Randomly pick 5 positions as initial cluster centers (not necessarily data points)
• Each point finds which cluster center it is closest to; the point belongs to that cluster
• Each cluster computes its new centroid based on which points belong to it
• Repeat until convergence (i.e., no cluster center moves)
K-Means Demo
http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/AppletKM.html
K-Means Algorithm
• Input: x1, …, xn, and k, where each xi is a point in a d-dimensional feature space
• Step 1: Select k cluster centers, c1, …, ck
• Step 2: For each point xi, determine its cluster: find the closest center (using, say, Euclidean distance)
• Step 3: Update all cluster centers as the centroids
• Repeat Steps 2 and 3 until the cluster centers no longer change
ci = (1 / number of points in cluster i) · Σ_{x ∈ cluster i} x

Example: Image Segmentation
[Figure: input image; clusters on intensity; clusters on color]
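The k-means steps above can be sketched compactly. This is an illustrative sketch, not the slides' implementation: for simplicity the initial centers are the first k data points (Step 1 allows any choice), and the function names are ours.

```python
import math

def kmeans(points, k, max_iters=100):
    centers = [tuple(p) for p in points[:k]]                  # Step 1
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each point to its closest center (Euclidean distance)
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centers[j]))
            clusters[nearest].append(p)
        # Step 3: move each center to the centroid of its cluster
        new_centers = []
        for j, cl in enumerate(clusters):
            if cl:
                new_centers.append(tuple(sum(v) / len(cl) for v in zip(*cl)))
            else:
                new_centers.append(centers[j])  # empty cluster keeps its center
        if new_centers == centers:              # centers no longer change
            break
        centers = new_centers
    return centers, clusters

centers, clusters = kmeans([(0, 0), (1, 0), (10, 0), (11, 0)], k=2)
# centers converge to (0.5, 0.0) and (10.5, 0.0)
```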
K-Means Properties
• Will it always terminate?
– Yes (there is a finite number of ways of partitioning a finite number of points into k groups)
• Is it guaranteed to find an "optimal" clustering?
– No, but each iteration will reduce the distortion (error) of the clustering
Copyright © 2001, 2004, Andrew W. Moore
Non-Optimal Clustering
Say k = 3 and you are given the following points:
Non-Optimal Clustering
Given a poor choice of the initial cluster centers, the following result is possible:
Picking Starting Cluster Centers
Which local optimum k-Means goes to is determined solely by the starting cluster centers
– Idea 1: Run k-Means multiple times with different starting, random cluster centers (hill climbing with random restarts)
– Idea 2: Pick a random point x1 from the dataset
1. Find the point x2 farthest from x1 in the dataset
2. Find x3 farthest from the closer of x1, x2
3. … Pick k points like this, and use them as the starting cluster centers for the k clusters
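Idea 2 (farthest-point initialization) can be sketched as follows. This is an illustrative sketch with our own names and data; a real run would pick the first point at random rather than taking index 0.

```python
import math

def farthest_point_init(points, k, first=0):
    """Pick k starting centers: after the first, each new center is the
    point whose distance to its closest already-chosen center is largest."""
    centers = [points[first]]
    while len(centers) < k:
        nxt = max(points,
                  key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers

pts = [(0, 0), (1, 0), (9, 0), (10, 0), (5, 5)]
farthest_point_init(pts, 3)  # [(0, 0), (10, 0), (5, 5)]
```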
Picking the Number of Clusters
• Difficult problem
• Heuristic approaches depend on the number of points and the number of dimensions
Measuring Cluster Quality
• Distortion = sum of squared distances of each data point to its cluster center:
Distortion = Σi ‖xi − c(xi)‖², where c(xi) is the center of the cluster that xi belongs to
• The "optimal" clustering is the one that minimizes distortion (over all possible cluster center locations and assignments of points to clusters)
How to Pick k?
Try multiple values of k and pick the one at the "elbow" of the distortion curve
[Figure: distortion vs. number of clusters, k]
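The elbow heuristic can be illustrated numerically. This sketch uses a toy 1-D dataset and hand-chosen centers per k (both ours, purely for illustration): distortion drops steeply until k matches the three real groups, then barely improves.

```python
def distortion(points, centers):
    """Sum of squared distances of each data point to its closest center."""
    return sum(min((p - c) ** 2 for c in centers) for p in points)

data = [0.0, 0.1, 5.0, 5.1, 10.0, 10.1]  # three tight groups
curve = {
    1: distortion(data, [5.05]),
    2: distortion(data, [2.55, 10.05]),
    3: distortion(data, [0.05, 5.05, 10.05]),  # the "elbow": 3 real groups
    4: distortion(data, [0.05, 5.0, 5.1, 10.05]),
}
# curve drops steeply from k=1 to k=3, then flattens out
```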
Uses of K-Means
• Often used as an exploratory data analysis tool
• In one dimension, a good way to quantize real-valued variables into k non-uniform buckets
• Used on acoustic data in speech recognition to convert waveforms into one of k categories (known as Vector Quantization)
• Also used for choosing color palettes on graphical display devices
Three Frequently Used Clustering Methods
• Hierarchical Agglomerative Clustering
– Build a binary tree over the dataset
• K-Means Clustering
– Specify the desired number of clusters and use an iterative algorithm to find them
• Mean Shift Clustering
Mean Shift Clustering
1. Choose a search window size
2. Choose the initial location of the search window
3. Compute the mean location (centroid of the data) in the search window
4. Center the search window at the mean location computed in Step 3
5. Repeat Steps 3 and 4 until convergence

The mean shift algorithm seeks the mode, i.e., the point of highest density of a data distribution.
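The five steps above can be sketched for 1-D data with a flat (uniform) window. This is an illustrative sketch under those assumptions; the window size, starting location, and data are ours.

```python
def mean_shift(points, start, radius, max_iters=100):
    """Climb to the mode of a 1-D sample with a flat window of given radius."""
    center = start                                   # Step 2: initial window
    for _ in range(max_iters):
        # Step 3: mean of the data inside the current window
        window = [p for p in points if abs(p - center) <= radius]
        mean = sum(window) / len(window)
        if abs(mean - center) < 1e-9:                # Step 5: converged
            break
        center = mean                                # Step 4: re-center window
    return center

data = [1.0, 1.2, 1.4, 7.9, 8.0, 8.1, 8.2]
mean_shift(data, start=6.5, radius=2.0)  # climbs to the dense mode near 8.05
```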
Intuitive Description
[Figure, repeated over several slides: a region of interest slides over a distribution of identical points; at each step the mean shift vector moves the window's centroid toward the densest region, until convergence. Objective: find the densest region]
Results
(feature space is only gray level)
Supervised Learning
• A labeled training sample is a collection of examples: (x1, y1), ..., (xn, yn)
• Assume (xi, yi) ∼ P(x, y) i.i.d., and P(x, y) is unknown
• Supervised learning learns a function h: x → y in some function family, H, such that h(x) predicts the true label y on future data, x, where (x, y) ∼ P(x, y) i.i.d.
– Classification: if y is discrete
– Regression: if y is continuous
Labels
• Examples
– Predict gender (M, F) from weight, height
– Predict adult vs. juvenile (A, J) from weight, height
• A label y is the desired prediction for an instance x
• Discrete label: classes
– M, F; A, J: often encoded as 0, 1 or −1, 1
– Multiple classes: 1, 2, 3, …, C. No class order is implied.
• Continuous label: e.g., blood pressure
Concept Learning
• Determine if a given example is or is not an instance of the concept/class/category
– If it is, call it a positive example
– If not, call it a negative example
Example: Mushroom Classification
http://www.usask.ca/biology/fungi/
Edible or Poisonous?
Mushroom Features/Attributes
1. cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
2. cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
3. cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
4. bruises?: bruises=t, no=f
5. odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
6. gill-attachment: attached=a, descending=d, free=f, notched=n
7. …
Classes: edible=e, poisonous=p
Supervised Concept Learning by Induction
• Given a training set of positive and negative examples of a concept:
– {(x1, y1), (x2, y2), ..., (xn, yn)}, where each yi is either + or −
• Construct a description that accurately classifies whether future examples are positive or negative:
– h(xn+1) = yn+1, where yn+1 is the + or − prediction
Supervised Learning Methods
• k-nearest-neighbors (k-NN) (Chapter 18.8.1)
• Decision trees
• Neural networks (NN)
• Support vector machines (SVM)
• etc.
Inductive Learning by Nearest-Neighbor Classification
A simple approach:
– save each training example as a point in feature space
– classify a new example by giving it the same classification as its nearest neighbor in feature space
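The two steps above amount to a very small classifier. This sketch is illustrative: the (weight, height) training points and gender labels are made-up stand-ins for the slides' examples.

```python
import math

def nn_classify(train, query):
    """train: list of (feature_vector, label) pairs.
    Return the label of the stored example closest to the query."""
    nearest = min(train, key=lambda ex: math.dist(ex[0], query))
    return nearest[1]

# Toy (height_cm, weight_kg) examples labeled with gender
train = [((170, 60), "F"), ((160, 55), "F"),
         ((185, 85), "M"), ((180, 80), "M")]
nn_classify(train, (184, 84))  # "M"
nn_classify(train, (162, 57))  # "F"
```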
k-Nearest-Neighbors (k-NN)
• 1-NN: decision boundary
k-NN
• What if we want regression?
– Instead of a majority vote, take the average of the neighbors' y values
• How to pick k?
– Split the data into training and tuning sets
– Classify the tuning set with different values of k
– Pick the k that produces the smallest tuning-set error
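The majority-vote and averaging variants above can be sketched together. This is an illustrative sketch with our own 1-D toy data; a real use would pick k on a tuning set as described.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """The k training examples closest to the query (Euclidean distance)."""
    return sorted(train, key=lambda ex: math.dist(ex[0], query))[:k]

def knn_classify(train, query, k):
    """Classification: majority vote over the k nearest neighbors' labels."""
    votes = Counter(label for _, label in k_nearest(train, query, k))
    return votes.most_common(1)[0][0]

def knn_regress(train, query, k):
    """Regression: average of the k nearest neighbors' y values."""
    return sum(y for _, y in k_nearest(train, query, k)) / k

train = [((0.0,), "A"), ((1.0,), "A"), ((2.0,), "B"), ((3.0,), "B")]
knn_classify(train, (0.4,), k=3)  # "A": two of the three neighbors are "A"
```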
k-NN doesn't generalize well if the examples in each class are not well "clustered"
[Figure: card examples scattered in the Suit × Rank feature space, with classes interleaved]
k-NN Demo
http://www.cs.cmu.edu/~zhuxj/courseproject/knndemo/KNN.html
Inductive Bias
• Inductive learning is an inherently conjectural process. Why?
– any knowledge created by generalization from specific facts cannot be proven true
– it can only be proven false
• Hence, inductive inference is "falsity preserving," not "truth preserving"
Inductive Bias
• Learning can be viewed as searching the hypothesis space H of possible h functions
• Inductive bias
– is used when one h is chosen over another
– is needed to generalize beyond the specific training examples
• A completely unbiased inductive algorithm
– only memorizes the training examples
– can't predict anything about unseen examples
Inductive Bias
Biases commonly used in machine learning:
– Restricted Hypothesis Space Bias: allow only certain types of h's, not arbitrary ones
– Preference Bias: define a metric for comparing h's so as to determine whether one is better than another
Supervised Learning Methods
• k-nearest-neighbors (k-NN)
• Decision trees
• Neural networks (NN)
• Support vector machines (SVM)
• etc.