Dm week01 intro.handout


Transcript of Dm week01 intro.handout

Page 1: Dm week01 intro.handout

Christof Monz
Informatics Institute
University of Amsterdam

Data Mining
Week 1: Introduction

Today’s Class

▸ Overview of Data Mining
▸ Overview of Machine Learning
▸ Course administrivia

Page 2: Dm week01 intro.handout

What’s Data Mining?

▸ Data: Records, web pages, documents, etc.
▸ Mining: The process or business of extracting ore or minerals from the ground (The American Heritage)
▸ Data Mining: The nontrivial extraction of implicit, previously unknown, and potentially useful information from large amounts of data

Why Data Mining?

▸ There is an abundance of data resources: commercial databases, intranets, the Internet, ...
▸ These resources contain a large amount of valuable data
▸ The best way to structure the data depends on how one wants to exploit it
▸ Manual data organization is very laborious and expensive
▸ There is a need to automate this process

Page 3: Dm week01 intro.handout

Some Application Areas

▸ Customer analysis (what impacts customer behavior?)
▸ Medical research (what is the impact of lifestyle/drug effects?)
▸ Insurance (risk assessment)
▸ Stock investment (which factors impact stock performance?)
▸ Fraud detection (when is a transaction likely to be fraudulent?)

The Need for Automated Analysis

▸ Much of the available data is never analyzed!

Page 4: Dm week01 intro.handout

What is and isn’t Data Mining

▸ Look up John Doe’s phone number and address in an electronically available phone book (isn’t Data Mining but database management)
▸ Infer John Doe’s phone number from analyzing a number of web pages, although this information is not expressed explicitly (is Data Mining)

Situating Data Mining

▸ Data Mining lies at the intersection of a number of research areas

Page 5: Dm week01 intro.handout

Data Mining Tasks

▸ Prediction
  • Use some variables to predict unknown or future values of other variables
▸ Description
  • Find human-interpretable patterns that describe the data

Some Data Mining Tasks

▸ Classification (Predictive)
▸ Clustering (Descriptive)
▸ Association Rule Discovery (Descriptive)
▸ Sequential Pattern Discovery (Descriptive)
▸ Regression (Predictive)
▸ Deviation Detection (Predictive)

Page 6: Dm week01 intro.handout

Classification

▸ Given a collection of records (training set)
  • Each record contains a set of attributes; one of the attributes is the class.
▸ Find a model for the class attribute as a function of the values of the other attributes
▸ Goal: previously unseen records should be assigned a class as accurately as possible.
  • A test set is used to determine the accuracy of the model
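The training/test idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the weather-style records, the attribute names and the helper `fit_one_attribute_rule` are hypothetical and not from the lecture, and the "model" is deliberately trivial (it maps each value of one chosen attribute to the most frequent class seen in training).

```python
from collections import Counter, defaultdict

# Hypothetical toy records (not from the lecture): (attributes, class).
train = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "sunny", "windy": True}, "play"),
    ({"outlook": "rainy", "windy": True}, "stay"),
    ({"outlook": "rainy", "windy": False}, "stay"),
    ({"outlook": "overcast", "windy": False}, "play"),
]
test = [
    ({"outlook": "sunny", "windy": False}, "play"),
    ({"outlook": "rainy", "windy": True}, "stay"),
    ({"outlook": "overcast", "windy": True}, "stay"),
]


def fit_one_attribute_rule(records, attribute):
    """Learn a table: attribute value -> most frequent class in the training set."""
    counts = defaultdict(Counter)
    overall = Counter()
    for attrs, label in records:
        counts[attrs[attribute]][label] += 1
        overall[label] += 1
    default = overall.most_common(1)[0][0]  # fallback for unseen values
    table = {value: c.most_common(1)[0][0] for value, c in counts.items()}
    return lambda attrs: table.get(attrs[attribute], default)


def accuracy(model, records):
    """Fraction of records whose predicted class equals the true class."""
    correct = sum(model(attrs) == label for attrs, label in records)
    return correct / len(records)


model = fit_one_attribute_rule(train, "outlook")
train_accuracy = accuracy(model, train)
test_accuracy = accuracy(model, test)
```

On this toy data the model is perfect on the records it was trained on but not on the unseen ones, which is why accuracy is measured on a separate test set.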

Example: Direct Marketing

▸ Goal: Reduce cost of mailing by targeting a set of consumers likely to buy a new cell-phone product
▸ Approach:
  • Use the data for a similar product introduced before
  • We know which customers decided to buy and which decided otherwise. This buy/don’t buy decision forms the class attribute
  • Collect various demographic, lifestyle, and company-interaction related information about all such customers (where they stay, how much they earn, ...)
  • Use this information as input attributes to learn a classifier model
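As a rough sketch of the approach above, the past campaign can be turned into a very simple classifier: estimate the buy rate per customer segment and mail only to segments above a threshold. Everything concrete here is an assumption for illustration: the records, the two attributes (`region`, `income_band`), the helper names and the 0.5 threshold are hypothetical, and a real classifier would use many more attributes.

```python
from collections import defaultdict

# Hypothetical past-campaign records: (region, income_band, bought).
past_campaign = [
    ("urban", "high", True), ("urban", "high", True), ("urban", "high", False),
    ("urban", "low", False), ("urban", "low", False), ("urban", "low", True),
    ("rural", "high", True), ("rural", "high", False),
    ("rural", "low", False), ("rural", "low", False),
]


def segment_buy_rates(records):
    """Fraction of buyers in each (region, income_band) segment."""
    totals = defaultdict(int)
    buyers = defaultdict(int)
    for region, income, bought in records:
        totals[(region, income)] += 1
        buyers[(region, income)] += int(bought)
    return {segment: buyers[segment] / totals[segment] for segment in totals}


def select_targets(customers, rates, threshold=0.5):
    """Keep customers whose segment bought at a rate >= threshold."""
    return [c for c in customers if rates.get((c[1], c[2]), 0.0) >= threshold]


rates = segment_buy_rates(past_campaign)

# Hypothetical new customers: (customer_id, region, income_band).
new_customers = [
    ("c1", "urban", "high"), ("c2", "urban", "low"),
    ("c3", "rural", "high"), ("c4", "rural", "low"),
]
targets = select_targets(new_customers, rates)
```

Here the buy/don’t buy flag plays the role of the class attribute, and the mailing list shrinks from four customers to the two whose segments bought often enough in the earlier campaign.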

Page 7: Dm week01 intro.handout

Classify This!

Some Observations

▸ Training data (examples for which the class is known)
▸ Feature extraction (what are the ‘things’ that are relevant to predict a class?)
▸ Feature weight (how important is a feature?)
▸ Feature combination (sometimes features act together)
▸ Over-fitting (some features don’t generalize well)
▸ Evaluation (how accurate is the prediction?)

Page 8: Dm week01 intro.handout

Machine Learning

▸ The research area of machine learning investigates and formalizes the challenge of prediction and description by computer
▸ Machine learning plays a central role in data mining
▸ It is used for:
  • Building new models
  • Adapting existing models to new situations
  • Comparing the performance of competing models

Machine Learning is . . .

▸ ... the principles, methods, and algorithms for learning and prediction on the basis of past experience
▸ ... already everywhere: speech recognition, hand-written character recognition, computer vision, information retrieval, operating systems, compilers, fraud detection, security, defense applications, ...

Page 9: Dm week01 intro.handout

Learning

▸ Steps
  • entertain a (biased) set of possibilities
  • adjust predictions based on feedback
  • rethink the set of possibilities
▸ Principles of learning are ‘universal’
  • society (e.g., scientific community)
  • animal (e.g., human)
  • machine

Learning and Prediction

▸ We make predictions all the time but rarely investigate the processes underlying our predictions
▸ In carrying out scientific research we are also governed by how theories are evaluated
▸ To automate the process of making predictions we also need to understand how we search for and refine ‘theories’

Page 10: Dm week01 intro.handout

Learning: Key Steps

▸ Data and assumptions
  • What data is available for the learning task?
  • What can we assume about the problem?
▸ Representation
  • How should we represent the examples to be classified?
▸ Evaluation and Estimation
  • How well are we doing?
  • How do we adjust our predictions based on the feedback?
  • Can we rethink the approach to do even better?

Example

▸ A classification problem: predict the grades for students taking this course
▸ Key Steps:
  1. data
  2. assumptions
  3. representation
  4. estimation
  5. evaluation
  6. model selection

Page 11: Dm week01 intro.handout

Example

▸ Key Steps:
  1. data: what ‘past experience’ can we rely on?
  2. assumptions: what can we assume about the students or the course?
  3. representation: how do we ‘summarize’ a student?
  4. estimation: how do we construct a map from students to grades?
  5. evaluation: how well are we predicting?
  6. model selection: perhaps we can do even better?

Example: Data

▸ The data we have available (in principle):
  • Names and grades of students in past years’ ML courses
  • Academic record of past and current students
▸ Training data:

  Student   ML   course 1   course 2   ...
  Peter     A    B          A          ...
  David     B    A          A          ...

▸ Test data:

  Student   ML   course 1   course 2   ...
  Jack      ?    C          A          ...
  Kate      ?    A          A          ...

Page 12: Dm week01 intro.handout

Assumptions

▸ There are many assumptions we can make to facilitate predictions:
  • The course has remained roughly the same over the years
  • Each student performs independently from others

Example: Representation

▸ Academic records are rather diverse so we might limit the summaries to a select few courses
▸ For example, we can summarize the i-th student (say David) with a vector: x_i = [B A A]
▸ The available data in this representation:

        Training                  Testing
  Student   ML grade        Student   ML grade
  x_1       A               x'_1      ?
  x_2       B               x'_2      ?
  ...       ...             ...       ...
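One way to make such a representation concrete is to encode letter grades as numbers. The sketch below is an assumption-laden illustration, not the lecture's method: the grade-point scale `GRADE_POINTS`, the helper `represent` and the choice to summarize each student by "course 1" and "course 2" only (keeping the ML grade as the label) are all hypothetical. The student records are the ones from the "Example: Data" tables.

```python
# Assumed numeric encoding of letter grades; the lecture does not specify one.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}

# Records from the "Example: Data" tables ("?" marks the grade to predict).
records = {
    "Peter": {"ML": "A", "course 1": "B", "course 2": "A"},
    "David": {"ML": "B", "course 1": "A", "course 2": "A"},
    "Jack": {"ML": "?", "course 1": "C", "course 2": "A"},
    "Kate": {"ML": "?", "course 1": "A", "course 2": "A"},
}

# Assumed choice of the "select few courses" used for the summary.
FEATURE_COURSES = ("course 1", "course 2")


def represent(record, courses=FEATURE_COURSES):
    """Summarize one student as a list of grade points for the chosen courses."""
    return [GRADE_POINTS[record[course]] for course in courses]


# Training pairs (vector, ML grade) and unlabeled test vectors.
training = [(represent(r), r["ML"]) for r in records.values() if r["ML"] != "?"]
testing = [represent(r) for r in records.values() if r["ML"] == "?"]
```

Changing `FEATURE_COURSES` changes the representation without touching the rest of the pipeline, which is what makes the representation a separate design decision.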

Page 13: Dm week01 intro.handout

Example: Estimation

▸ Given the training data

  Student   ML grade
  x_1       A
  x_2       B
  ...       ...

  find a mapping from input vectors x to ‘labels’ y encoding the grades for the ML course.
▸ Possible solution (nearest neighbor classifier):
  1. For any student x in the test set find the ‘closest’ student x_i in the training set
  2. Predict y_i as the grade of the closest student
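The two steps of the nearest neighbor classifier can be sketched as follows. The lecture leaves ‘closest’ undefined, so this sketch assumes Euclidean distance over an assumed grade-point encoding (`GRADE_POINTS`, `encode` and `nearest_neighbor_predict` are hypothetical names), with each student summarized by the grades for "course 1" and "course 2" from the "Example: Data" tables.

```python
import math

# Assumed numeric encoding of letter grades; not specified in the lecture.
GRADE_POINTS = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}


def encode(grades):
    """Turn a string of letter grades such as "BA" into a numeric vector."""
    return [GRADE_POINTS[g] for g in grades]


# (course 1, course 2) -> ML grade, taken from the training table.
training = [
    (encode("BA"), "A"),  # Peter
    (encode("AA"), "B"),  # David
]


def nearest_neighbor_predict(x, training):
    """Step 1: find the closest training vector; step 2: return its label."""
    _, closest_label = min(training, key=lambda pair: math.dist(x, pair[0]))
    return closest_label


jack_prediction = nearest_neighbor_predict(encode("CA"), training)
kate_prediction = nearest_neighbor_predict(encode("AA"), training)
```

With only two training students the predictions are not meaningful in themselves; the point is that the classifier needs nothing beyond a distance function and the stored training set.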

Example: Evaluation

▸ How can we tell how good our predictions are?
  • We can wait till the end of this course
  • We can try to assess the accuracy based on the data we already have (part of the training data)
▸ Possible solution:
  • Divide the training set further into training and test sets
  • Evaluate the classifier constructed on the basis of only the smaller training set on the new test set
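A minimal sketch of this hold-out evaluation, assuming a nearest neighbor classifier with Euclidean distance. The labeled vectors are hypothetical (two grade-point features per student), and the split simply takes the first six examples for training; in practice the data would usually be shuffled before splitting.

```python
import math


def nearest_neighbor_predict(x, training):
    """Return the label of the training vector closest to x."""
    _, label = min(training, key=lambda pair: math.dist(x, pair[0]))
    return label


def accuracy(training, held_out):
    """Fraction of held-out examples whose predicted label is correct."""
    correct = sum(nearest_neighbor_predict(x, training) == y for x, y in held_out)
    return correct / len(held_out)


# Hypothetical labeled data: ([course 1 points, course 2 points], ML grade).
labeled = [
    ([4, 4], "A"), ([4, 3], "A"), ([3, 4], "B"),
    ([3, 3], "B"), ([2, 2], "C"), ([2, 3], "C"),
    ([4, 4], "A"), ([3, 3], "B"), ([2, 2], "C"), ([4, 2], "B"),
]

# Divide the labeled data into a smaller training set and a new test set.
split = 6
smaller_training, held_out = labeled[:split], labeled[split:]

held_out_accuracy = accuracy(smaller_training, held_out)
```

The classifier never sees the held-out examples during construction, so the resulting accuracy is an estimate of how well it would do on new students.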

Page 14: Dm week01 intro.handout

Example: Model Selection

▸ We can refine
  • the estimation algorithm (e.g., using a classifier other than the nearest neighbor classifier)
  • the representation (e.g., base the summaries on a different set of courses)
  • the assumptions (e.g., perhaps students work in groups)
  etc.
▸ We have to rely on the method of evaluating the accuracy of our predictions to select among the possible refinements
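As an illustration of selecting among refinements, the sketch below compares three candidate representations (subsets of the course features) by their held-out accuracy and keeps the best one. The data, the helper names and the candidate list are hypothetical; the classifier is a nearest neighbor classifier with Euclidean distance, which is an assumption about how ‘closest’ is measured.

```python
import math


def project(x, feature_indices):
    """Keep only the selected features of a vector."""
    return [x[i] for i in feature_indices]


def predict(x, training, feature_indices):
    """Nearest neighbor prediction using only the selected features."""
    query = project(x, feature_indices)
    _, label = min(
        training,
        key=lambda pair: math.dist(query, project(pair[0], feature_indices)),
    )
    return label


def held_out_accuracy(training, held_out, feature_indices):
    correct = sum(predict(x, training, feature_indices) == y for x, y in held_out)
    return correct / len(held_out)


# Hypothetical data: (course 1 points, course 2 points) -> ML grade.
training = [((4, 1), "A"), ((4, 3), "A"), ((1, 2), "C"), ((1, 4), "C")]
held_out = [((4, 4), "A"), ((1, 1), "C"), ((3, 4), "A")]

# Candidate representations: course 1 only, course 2 only, both courses.
candidates = [(0,), (1,), (0, 1)]
scores = {c: held_out_accuracy(training, held_out, c) for c in candidates}

# max() keeps the first candidate with the top score, so ties go to the
# representation listed earlier.
best = max(candidates, key=scores.get)
```

The same loop could range over different classifiers or parameter settings instead of feature subsets; in each case the held-out accuracy is the selection criterion.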

Types of Learning Approaches

▸ Supervised learning: where we get a set of training inputs and outputs
  • E.g., classification, regression
▸ Unsupervised learning: where we are interested in capturing inherent organization in the data
  • E.g., clustering, density estimation
▸ Reinforcement learning: where we only get feedback in the form of how well we are doing (not what we should be doing)
  • E.g., planning

Page 15: Dm week01 intro.handout

Challenges of Data Mining

▸ Scalability
▸ Dimensionality/Complexity
▸ Data quality
▸ Data ownership
▸ Privacy considerations
▸ Continually updated data

Recap

▸ Difference between data mining and other research areas
▸ Applications of data mining
▸ Need for automation and the use of machine learning
▸ Key steps in machine learning

Page 16: Dm week01 intro.handout

About This Course

▸ This course does not:
  • give a comprehensive introduction to data mining
  • cover how to adapt data mining to specific applications
  • cover feature extraction
  • cover evaluation issues in detail
▸ This course does:
  • focus on the predominant approach in data mining: machine learning
  • sketch some of the example applications
  • introduce a representative selection of machine learning techniques used in data mining
  • focus on the algorithmic fundamentals of machine learning

Approaches Covered

▸ Linear regression (regression)
▸ Decision Trees (classification)
▸ Neural Networks (classification)
▸ k-Nearest-Neighbors (classification)
▸ Naive Bayes (classification)
▸ K-Means (clustering)
▸ Hierarchical Clustering (clustering)

Page 17: Dm week01 intro.handout

What to get out of this Course

▸ At the end of this course you will have learned:
  • what type of problems can be addressed by data mining techniques
  • what the most common machine learning approaches in data mining are
  • which machine learning approaches are appropriate for a given type of data mining application
  • the algorithmic fundamentals of a number of relevant machine learning approaches

Course Administrivia

▸ Exam counts for 40%, homework counts for 20%, practical assignments count for 40%
▸ Lectures are on Tuesday 9-11am (D1.116)
  Tutorials (werk colleges) are on Thursday 9-11am (G0.05) and Friday 9-11am (G5.29)
  Labs are on Thursday 11am-1pm (G0.18) or Friday 11am-1pm (G0.18)

Page 18: Dm week01 intro.handout

Course Administrivia

▸ Teaching assistants:
  Yijin He (email: [email protected]) (English only!)
  Spyros Martzoukos (email: [email protected]) (English only!)
▸ Course web page: on Blackboard
▸ Check course web page regularly for announcements, slides, ...