What is Datamining? Which algorithms can be used for Datamining?


This presentation explains what data mining is and which techniques and algorithms are available for it. It is intended to help you understand the core concepts of data mining.


Page 1: What is Datamining? Which algorithms can be used for Datamining?

DATAMINING

Seval Ünver, E1900810 | CENG 553

Middle East Technical University, Computer Engineering Department

14.05.2013 CENG 553

In Summary


10.04.2023 Seval Ünver | CENG 553 2

Outline

• Introduction
• Data vs. Information
• Who uses datamining?
• Common uses of datamining
• Datamining is…
• Supervised and Unsupervised Learning
• Predictive Models
• Datamining Process
• Some Popular Datamining Algorithms
• Data Warehouse
• Conceptual Modelling of Data Warehouse
• Example of Star Schema, Snowflake Schema, Fact Constellation
• Evolution of OLTP, OLAP and Data Warehouse


Introduction

• Large data sets have become available thanks to advances in technology.

• As a result, there is growing interest across scientific communities in exploring emerging data mining techniques for analyzing these large data sets *.

• Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data **.

* Grossman et al., 2001

** Shmueli G, 2012


What is Datamining?

• Process of semi-automatically analyzing large databases to find patterns that are *:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the item
– understandable: humans should be able to interpret the pattern

• Also known as Knowledge Discovery in Databases

* Prof. S. Sudarshan, CSE Dept, IIT Bombay


Big data: Cash Register

• Past: It was a calculator.

• Now: It saves every detail of every action.
– The movements of each product.
– The movements of each user.


Data vs. Information

• Data is useless by itself.

• Data is not just numbers or letters. It consists of numbers, letters and their meaning. The meaning is called metadata.

• Information is interpreted data.

• Converting the data to information is called data processing.


Who uses Datamining?

• CapitalOne Bank
– future prediction
• Netflix (the largest DVD-by-mail rental company)
– recommendation (you might also be interested in…)
• Amazon.com
– recommendation
• British law enforcement
– crime trends or security threats
• Facebook
– predicting how active a user will be after 3 months
• Children's Hospital in Boston
– detecting domestic abuse
• Pandora (an Internet music radio)
– choosing the next song to play


Common uses of Datamining:

• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Gas & jewelry
• Bioinformatics
• Text analysis
– SAS lie detector
• Market basket analysis
– Beer & baby diapers
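Market basket analysis is usually described in terms of the support and confidence of a rule such as "diapers → beer". The slides do not define these measures, so the following is only a minimal sketch using the standard definitions; the function names and the five baskets are made up for illustration.

```python
def support(transactions, items):
    """Fraction of transactions that contain every item in `items`."""
    items = set(items)
    return sum(items <= basket for basket in transactions) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transactions."""
    both = set(antecedent) | set(consequent)
    return support(transactions, both) / support(transactions, antecedent)


# Hypothetical shopping baskets (illustrative data only).
baskets = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"milk", "bread"},
]

print(support(baskets, {"beer", "diapers"}))       # share of baskets with both items
print(confidence(baskets, {"diapers"}, {"beer"}))  # how often diapers come with beer
```

A rule is typically considered interesting only when both numbers exceed thresholds chosen by the analyst.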


Application Areas

Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value-added data
Utilities | Power usage analysis


Datamining is…


Datamining is not…

• Data warehousing
• SQL / Ad Hoc Queries / Reporting
• Software Agents
• Online Analytical Processing (OLAP)
• Data Visualization


Supervised vs. Unsupervised Learning

• Supervised:
– Problem solving
– Driven by real business problems and historical data
– Quality of results depends on quality of data

• Unsupervised:
– Exploration (aka clustering)
– Relevance is often an issue (e.g. beer and baby diapers)
– Useful when trying to get an initial understanding of the data
– Non-obvious patterns can sometimes pop out of a completed data analysis project


Predictive Models


Datamining Process


Some Popular Data Mining Algorithms

• Supervised
– Regression models
– Decision trees
– k-Nearest-Neighbor
– Neural networks
– Rule induction

• Unsupervised
– K-means clustering
– Self-organizing map


A very simple problem set


Regression Models


Regression Models
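The regression slides survive only as titles, so the model they showed is not known. As a minimal sketch, assuming the simplest case of one input variable and a straight line fitted by ordinary least squares, a regression model can be computed as follows; `fit_line` and the sample numbers are illustrative, not taken from the presentation.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance of x and y, and variance of x (both without the 1/n factor).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical historical data lying exactly on y = 2x + 1.
slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
prediction = slope * 5 + intercept  # predicted y for a new x = 5
print(slope, intercept, prediction)
```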


Decision Trees

A series of nested if/then rules.
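To make "nested if/then rules" concrete, here is a small hand-written tree. In practice the splits are learned from historical data; the loan scenario, the attribute names and the thresholds below are invented for illustration only.

```python
def approve_loan(income, has_default, years_employed):
    """A hypothetical, hand-written decision tree for a loan decision."""
    if has_default:                  # root node: past default?
        return "reject"
    else:
        if income >= 50_000:         # second level: income split
            return "approve"
        else:
            if years_employed >= 5:  # third level: job stability split
                return "approve"
            else:
                return "reject"


print(approve_loan(income=60_000, has_default=False, years_employed=1))
print(approve_loan(income=30_000, has_default=False, years_employed=2))
```

Each path from the root to a leaf reads as a single rule, which is why such models are comparatively easy to explain.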


Decision Tree Models


K-Nearest Neighbor Algorithm

• Find the nearest known data point(s) and do the same thing as you did for those records.
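The idea can be sketched in a few lines, assuming numeric features, Euclidean distance and a majority vote among the k nearest records. The function name `knn_predict` and the training records are illustrative assumptions, not part of the presentation.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Predict a label for `query` from the k closest training records.

    `train` is a list of (features, label) pairs.
    """
    neighbors = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labelled records: (feature vector, what was done).
train = [
    ((1, 1), "buy"),
    ((1, 2), "buy"),
    ((8, 8), "skip"),
    ((9, 8), "skip"),
    ((8, 9), "skip"),
]

print(knn_predict(train, (2, 1), k=1))
print(knn_predict(train, (7, 7), k=3))
```

Because every prediction scans the stored records, the "model" is effectively the whole training set.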


K-Nearest Neighbor Models


Neural Networks

• Set of nodes connected by directed weighted edges.
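A minimal sketch of that structure: the network is stored as a dictionary of directed edges with weights, and a forward pass pushes input values through it. The topology, the weights, the sigmoid activation and the absence of bias terms are all assumptions made for illustration.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Directed weighted edges: (source node, target node) -> weight (made-up values).
edges = {
    ("x1", "h1"): 0.5, ("x2", "h1"): -0.4,
    ("x1", "h2"): 0.3, ("x2", "h2"): 0.8,
    ("h1", "y"): 1.2,  ("h2", "y"): -0.7,
}
# Nodes to evaluate, layer by layer (hidden layer first, then output).
layers = [["h1", "h2"], ["y"]]


def forward(inputs, edges, layers):
    """Compute every node's value from the input node values."""
    values = dict(inputs)
    for layer in layers:
        for node in layer:
            # Weighted sum over all edges that point into this node.
            total = sum(w * values[src]
                        for (src, dst), w in edges.items() if dst == node)
            values[node] = sigmoid(total)
    return values


out = forward({"x1": 1.0, "x2": 0.0}, edges, layers)
print(out["y"])
```

Training consists of adjusting the edge weights so that the output matches known examples; that step is not shown here.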


Neural Networks Models


Neural Networks Models


Pros and Cons of Neural Networks

• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features

• Cons
- Slow training time
- Hard to interpret
- Hard to implement: trial and error for choosing number of nodes


Supervised Algorithm Summary

• Decision Trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries

• kNN
– Quick and easy
– Models tend to be very large

• Neural Networks
– Difficult to interpret
– Can require significant amounts of time to train


K-Means Clustering

• User starts by specifying the number of clusters (K)
• K datapoints are randomly selected
• Repeat until no change:
– Hyperplanes separating K points are generated
– K Centroids of each cluster are computed
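The loop above can be sketched as follows. Assigning each point to its nearest centroid is equivalent to cutting the space with hyperplanes between the K points, so the sketch uses nearest-centroid assignment with squared Euclidean distance. The function name, the `seed` and `max_iter` parameters and the sample points are assumptions for illustration.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K data points are randomly selected
    assignment = None
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its cluster members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment


# Hypothetical data: two well-separated groups of points.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(pts, k=2)
print(centroids, labels)
```

The result depends on the random starting points, so in practice the algorithm is often run several times with different seeds.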


Data Warehouse

A data warehouse is a database used for reporting and data analysis.


Data Mining works with Warehouse Data

• Data Mining provides the Enterprise with intelligence

• Data Warehousing provides the Enterprise with a memory


Conceptual Modeling of Data Warehouses

• Modeling data warehouses: dimensions & measures

– Star schema: A fact table in the middle connected to a set of dimension tables

– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to snowflake

– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation


Example of Star Schema

Sales fact table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales

Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city, state_or_province, country
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
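A typical query against a star schema joins the fact table to one or more dimension tables and aggregates the measures. The sketch below builds a cut-down version of the schema in an in-memory SQLite database; only the sales fact table and the location dimension are included, the table names `sales` and `location` and all rows are made up, and `avg_sales` is omitted for brevity.

```python
import sqlite3


def sales_by_city():
    """Total units and dollars sold per city, via a fact/dimension join."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE location (
            location_key INTEGER PRIMARY KEY,
            street TEXT, city TEXT, state_or_province TEXT, country TEXT
        );
        CREATE TABLE sales (
            time_key INTEGER, item_key INTEGER, branch_key INTEGER,
            location_key INTEGER REFERENCES location(location_key),
            units_sold INTEGER, dollars_sold REAL
        );
        -- Hypothetical rows, for illustration only.
        INSERT INTO location VALUES
            (1, 'Main St', 'Boston', 'MA', 'USA'),
            (2, 'Elm St', 'Hartford', 'CT', 'USA');
        INSERT INTO sales VALUES
            (1, 1, 1, 1, 10, 100.0),
            (2, 1, 1, 1, 5, 50.0),
            (1, 2, 1, 2, 20, 300.0);
    """)
    rows = con.execute("""
        SELECT l.city, SUM(s.units_sold), SUM(s.dollars_sold)
        FROM sales AS s
        JOIN location AS l ON s.location_key = l.location_key
        GROUP BY l.city
        ORDER BY l.city
    """).fetchall()
    con.close()
    return rows


print(sales_by_city())
```

Grouping by `country` or `state_or_province` instead of `city` gives the coarser levels of the same dimension.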


Example of Snowflake Schema

Sales fact table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales

Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city_key
– city: city_key, city, state_or_province, country
– item: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
– branch: branch_key, branch_name, branch_type


Example of Fact Constellation

Sales fact table
– Keys: time_key, item_key, branch_key, location_key
– Measures: units_sold, dollars_sold, avg_sales

Shipping fact table
– Fields: time_key, item_key, shipper_key, from_location, to_location, dollars_cost, units_shipped

Dimension tables
– time: time_key, day, day_of_the_week, month, quarter, year
– location: location_key, street, city, province_or_state, country
– item: item_key, item_name, brand, type, supplier_type
– branch: branch_key, branch_name, branch_type
– shipper: shipper_key, shipper_name, location_key, shipper_type


Evolution of OLTP, OLAP and Data Warehouse

[Figure; axis label: Time]


Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | "What was my total revenue in the last five years?" | computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | faster and cheaper computers with more storage, relational databases
Data Warehousing and Decision Support | "What were unit sales in New England last March? Drill down to Boston." | faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses
Data Mining | "What's likely to happen to Boston unit sales next month? Why?" | faster and cheaper computers with more storage, advanced computer algorithms


As a Result

• Applying data mining requires a large amount of quality data.

• The aim of datamining is to acquire rules and equations that can be used to predict the future.

• Success in such work depends on database experts and data mining specialists working together.

• The work may take longer than expected; it requires time and patience.


Thank You

If you have questions, you can contact me via email: [email protected]

Seval Ünver | METU CENG