What is Datamining? Which algorithms can be used for Datamining?
DATAMINING
Seval Ünver, E1900810 | CENG 553
Middle East Technical University, Computer Engineering Department
14.05.2013 CENG 553
In Summary
10.04.2023 Seval Ünver | CENG 553 2
Outline
• Introduction
• Data vs. Information
• Who uses datamining?
• Common uses of datamining
• Datamining is…
• Supervised and Unsupervised Learning
• Predictive Models
• Datamining Process
• Some Popular Datamining Algorithms
• Data Warehouse
• Conceptual Modelling of Data Warehouse
• Example of Star Schema, Snowflake Schema, Fact Constellation
• Evolution of OLTP, OLAP and Data Warehouse
Introduction
• Nowadays, large data sets have become available due to advances in technology.
• As a result, there is growing interest across scientific communities in exploring emerging data mining techniques to analyze these large data sets *.
• Data mining is the semi-automatic discovery of patterns, associations, changes, anomalies, and statistically significant structures and events in data **.
* Grossman et al., 2001
** Shmueli G, 2012
What is Datamining?
• Process of semi-automatically analyzing large databases to find patterns that are *:
– valid: hold on new data with some certainty
– novel: non-obvious to the system
– useful: should be possible to act on the item
– understandable: humans should be able to interpret the pattern
• Also known as Knowledge Discovery in Databases
* Prof. S. Sudarshan, CSE Dept, IIT Bombay
Big data: Cash Register
• Past: It was a calculator.
• Now: It saves every detail of every action.
– The movements of each product.
– The movements of each user.
Data vs. Information
• Data is useless by itself.
• Data is not just numbers or letters. It consists of numbers, letters and their meaning. The meaning is called metadata.
• Information is interpreted data.
• Converting the data to information is called data processing.
Who uses Datamining?
• CapitalOne Bank
– future prediction
• Netflix (the largest DVD-by-mail rental company)
– recommendation (you might also be interested in…)
• Amazon.com
– recommendation
• British law enforcement
– crime trends or security threats
• Facebook
– predicting how active a user will be after 3 months
• Children's Hospital in Boston
– detecting domestic abuse
• Pandora (an Internet music radio)
– choosing the next song to play
Common uses of Datamining:
• Direct mail marketing
• Web site personalization
• Credit card fraud detection
• Gas & jewelry
• Bioinformatics
• Text analysis
– SAS lie detector
• Market basket analysis
– Beer & baby diapers
Application Areas
Industry | Application
Finance | Credit card analysis
Insurance | Claims, fraud analysis
Telecommunication | Call record analysis
Transport | Logistics management
Consumer goods | Promotion analysis
Data service providers | Value-added data
Utilities | Power usage analysis
Datamining is…
Datamining is not…
• Data warehousing
• SQL / Ad Hoc Queries / Reporting
• Software Agents
• Online Analytical Processing (OLAP)
• Data Visualization
Supervised vs. Unsupervised Learning
• Supervised:
– Problem solving
– Driven by real business problems and historical data
– Quality of results depends on quality of data
• Unsupervised:
– Exploration (aka clustering)
– Relevance is often an issue (e.g. beer and baby diapers)
– Useful when trying to get an initial understanding of the data
– Non-obvious patterns can sometimes pop out of a completed data analysis project
Predictive Models
Datamining Process
Some Popular Data Mining Algorithms
Supervised
– Regression models
– Decision trees
– k-Nearest-Neighbor
– Neural networks
– Rule induction
Unsupervised
– K-means clustering
– Self-organizing map
A very simple problem set
Regression Models
Decision Trees
• A series of nested if/then rules.
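To make the "nested if/then rules" view concrete, here is a minimal hand-written sketch in Python. The features (income, years_employed, has_prior_default), the thresholds, and the labels are all hypothetical and chosen only for illustration. A real decision tree learner would derive the splits from historical data instead of having them typed in.

```python
def classify_applicant(income, years_employed, has_prior_default):
    """Classify a (hypothetical) credit application with nested if/then rules."""
    if has_prior_default:
        return "reject"
    else:
        if income >= 50_000:
            return "approve"
        else:
            # Lower income: fall back on job stability.
            if years_employed >= 3:
                return "approve"
            else:
                return "review"


print(classify_applicant(income=42_000, years_employed=5, has_prior_default=False))
```

Each path from the first test to a return statement corresponds to one root-to-leaf path in the tree, which is also why such a model is easy to read and to translate into query conditions.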
Decision Tree Models
K-Nearest Neighbor Algorithm
• Find the nearest known data point and predict the same outcome as for that record. With K > 1, take a majority vote among the K nearest records.
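The nearest-neighbor idea can be sketched in a few lines of Python. This is an illustrative sketch, not a tuned implementation: it assumes numeric features, Euclidean distance, and a tiny made-up training set, and the names knn_predict and train are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=1):
    """Return the majority label among the k training records nearest to query.

    train is a list of (features, label) pairs; features are numeric tuples.
    """
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up example records: (features, label).
train = [
    ((1.0, 1.0), "low"),
    ((1.5, 2.0), "low"),
    ((8.0, 8.0), "high"),
    ((9.0, 7.5), "high"),
]

print(knn_predict(train, (2.0, 1.5), k=1))
print(knn_predict(train, (8.5, 8.0), k=3))
```

Note that the "model" is the training data itself: every prediction scans the stored records, which is why kNN models tend to be large.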
K-Nearest Neighbor Models
Neural Networks
• Set of nodes connected by directed weighted edges.
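As a rough illustration of "nodes connected by directed weighted edges", the sketch below stores a tiny network as a dictionary of edges and pushes an input through it. The node names, the weights, the sigmoid activation, and the absence of bias terms are all assumptions made for the example. Training, that is, how the weights are found, is not shown.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Each directed edge (source node, target node) carries a weight.
# The weights are arbitrary illustrative values, not trained ones.
edges = {
    ("x1", "h1"): 0.5, ("x2", "h1"): -0.4,
    ("x1", "h2"): 0.3, ("x2", "h2"): 0.8,
    ("h1", "out"): 1.2, ("h2", "out"): -0.7,
}
layers = [["h1", "h2"], ["out"]]  # order in which nodes are evaluated


def forward(inputs):
    """Propagate input values along the weighted edges to the output node."""
    values = dict(inputs)
    for layer in layers:
        for node in layer:
            total = sum(w * values[src]
                        for (src, dst), w in edges.items() if dst == node)
            values[node] = sigmoid(total)
    return values["out"]


print(forward({"x1": 1.0, "x2": 0.0}))
```

Applying the network is just a few multiplications and additions per node, which matches the "fast application" point, whereas finding good weights is the slow part.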
Neural Networks Models
Pros and Cons of Neural Networks
• Pros
+ Can learn more complicated class boundaries
+ Fast application
+ Can handle large number of features
• Cons
– Slow training time
– Hard to interpret
– Hard to implement: trial and error for choosing number of nodes
Supervised Algorithm Summary
• Decision Trees
– Understandable
– Relatively fast
– Easy to translate into SQL queries
• kNN
– Quick and easy
– Models tend to be very large
• Neural Networks
– Difficult to interpret
– Can require significant amounts of time to train
K-Means Clustering
• User starts by specifying the number of clusters (K)
• K datapoints are randomly selected as the initial centroids
• Repeat until no change:
– Hyperplanes separating the K centroids are generated, which assigns every datapoint to its nearest centroid
– The centroid of each of the K clusters is recomputed
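The loop above can be sketched in plain Python as follows. This is a minimal sketch under some assumptions: points are numeric tuples, distance is Euclidean, "hyperplanes separating the K points" is implemented as assigning each point to its nearest centroid, and an empty cluster simply keeps its previous centroid. The function name, the seed parameter, and the sample points are hypothetical.

```python
import math
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster points (numeric tuples) into k groups; return (centroids, assignment)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # K datapoints are randomly selected
    assignment = None
    for _ in range(max_iter):
        # Assign every point to the index of its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:  # no change: converged
            break
        assignment = new_assignment
        # Recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its old centroid
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return centroids, assignment


# Two well-separated made-up groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids, assignment)
```

Because the starting centroids are random, the cluster numbering (and, on harder data, the clusters themselves) can differ between runs, so it is common to try several seeds.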
Data Warehouse
• A data warehouse is a database used for reporting and data analysis.
Data Mining works with Warehouse Data
• Data Mining provides the Enterprise with intelligence
• Data Warehousing provides the Enterprise with a memory
Conceptual Modeling of Data Warehouses
• Modeling data warehouses: dimensions & measures
– Star schema: A fact table in the middle connected to a set of dimension tables
– Snowflake schema: A refinement of star schema where some dimensional hierarchy is normalized into a set of smaller dimension tables, forming a shape similar to a snowflake
– Fact constellations: Multiple fact tables share dimension tables, viewed as a collection of stars, therefore called galaxy schema or fact constellation
Example of Star Schema
• Sales fact table
– keys: time_key, item_key, branch_key, location_key
– measures: units_sold, dollars_sold, avg_sales
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city, state_or_province, country
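A cut-down version of this star schema can be tried out with Python's built-in sqlite3 module. The sketch below is only an illustration: it keeps a subset of the columns, the table names (time_dim, item_dim, location_dim, sales_fact) are hypothetical, and all rows are made up. The query shows the typical pattern of joining the fact table to a dimension and aggregating a measure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE time_dim     (time_key INTEGER PRIMARY KEY, month INTEGER, year INTEGER);
CREATE TABLE item_dim     (item_key INTEGER PRIMARY KEY, item_name TEXT, brand TEXT);
CREATE TABLE location_dim (location_key INTEGER PRIMARY KEY, city TEXT, country TEXT);
-- The fact table holds one foreign key per dimension plus the measures.
CREATE TABLE sales_fact (
    time_key     INTEGER REFERENCES time_dim(time_key),
    item_key     INTEGER REFERENCES item_dim(item_key),
    location_key INTEGER REFERENCES location_dim(location_key),
    units_sold   INTEGER,
    dollars_sold REAL
);
""")

# Made-up rows, for illustration only.
conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?)",
                 [(1, 3, 2013), (2, 4, 2013)])
conn.executemany("INSERT INTO item_dim VALUES (?, ?, ?)",
                 [(1, "diapers", "BrandA"), (2, "beer", "BrandB")])
conn.executemany("INSERT INTO location_dim VALUES (?, ?, ?)",
                 [(1, "Ankara", "Turkey"), (2, "Boston", "USA")])
conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?, ?)", [
    (1, 1, 1, 10, 50.0),
    (1, 2, 1, 4, 12.0),
    (2, 1, 2, 7, 35.0),
    (2, 2, 2, 3, 9.0),
])

# Total units and dollars sold per city: fact table joined to one dimension.
rows = conn.execute("""
    SELECT l.city, SUM(f.units_sold), SUM(f.dollars_sold)
    FROM sales_fact AS f
    JOIN location_dim AS l ON l.location_key = f.location_key
    GROUP BY l.city
    ORDER BY l.city
""").fetchall()
print(rows)
```

Grouping by a different dimension column (for example the year in time_dim) only changes the join and the GROUP BY clause, which is the main practical appeal of keeping all measures in one central fact table.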
Example of Snowflake Schema
• Sales fact table
– keys: time_key, item_key, branch_key, location_key
– measures: units_sold, dollars_sold, avg_sales
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_key
– supplier: supplier_key, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city_key
– city: city_key, city, state_or_province, country
Example of Fact Constellation
• Sales fact table
– keys: time_key, item_key, branch_key, location_key
– measures: units_sold, dollars_sold, avg_sales
• Shipping fact table
– keys: time_key, item_key, shipper_key, from_location, to_location
– measures: dollars_cost, units_shipped
• time dimension: time_key, day, day_of_the_week, month, quarter, year
• item dimension: item_key, item_name, brand, type, supplier_type
• branch dimension: branch_key, branch_name, branch_type
• location dimension: location_key, street, city, province_or_state, country
• shipper dimension: shipper_key, shipper_name, location_key, shipper_type
Evolution of OLTP, OLAP and Data Warehouse
Evolutionary Step | Business Question | Enabling Technology
Data Collection (1960s) | "What was my total revenue in the last five years?" | computers, tapes, disks
Data Access (1980s) | "What were unit sales in New England last March?" | faster and cheaper computers with more storage, relational databases
Data Warehousing and Decision Support | "What were unit sales in New England last March? Drill down to Boston." | faster and cheaper computers with more storage, on-line analytical processing (OLAP), multidimensional databases, data warehouses
Data Mining | "What's likely to happen to Boston unit sales next month? Why?" | faster and cheaper computers with more storage, advanced computer algorithms
As a Result
• Applying data mining requires a large amount of high-quality data.
• The aim of datamining is to acquire rules and equations that can be used to predict the future.
• Success in such work depends on database experts and data mining specialists working together.
• The work may take longer than expected; it needs time and patience.
Thank You
If you have questions, you can contact me via email: [email protected]
Seval Ünver | METU CENG