Post on 24-May-2015
DBM630: Data Mining and Data Warehousing
MS.IT. Rangsit University
Lecture 1
Introduction to Data Mining and Data Warehousing
by Kritsada Sriphaew (sriphaew.k AT gmail.com)
Text: Data Mining: Concepts and Techniques, By Jiawei Han and Micheline Kamber, Morgan Kaufmann Publishers (2006). ISBN: 978-1558609013
Semester 2/2011
Administrative Matters
Data Mining and Data Warehousing by Kritsada Sriphaew 2
Course Syllabus
Lecture Notes & Assignments & Quizzes
Course Communication: announcements, discussion, lecture notes, etc.
Page: http://www.facebook.com/pages/Data-mining-MSIT-RSU/
How will we be evaluated?
Assessment Tasks
Tasks                                    % Scores
Quizzes (approx. 2 times)                20
Assignment (Discussion/Demonstration)    20
Final                                    60
To Pass: at least 60% of the overall score.
Text Books
Mandatory Book: Data Mining: Concepts and Techniques
By Jiawei Han and Micheline Kamber
Morgan Kaufmann Publishers (2006), Second Edition
ISBN-10: 1558609016, ISBN-13: 978-1558609013
Supplementary Book: Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations
By Ian H. Witten and Eibe Frank
Morgan Kaufmann Publishers (2005), Second Edition
ISBN-10: 0120884070, ISBN-13: 978-0120884070
Course Description (What We’ll Learn)
Introduction to data warehousing: characteristics of data warehousing, drawbacks and benefits of data warehousing, architecture of data warehousing, internal data structure for data warehousing, data integration, creating high-quality data, data marts, online analytical processing (OLAP). Introduction to data mining: types of data for mining, architecture of a typical data mining system, data preprocessing, association rule mining, classification and prediction, clustering, mining complex data, data mining applications, current trends in data mining, text mining, web mining, and tools for data mining analysis such as WEKA and SAS.
Course Schedule (tentative)
Week Date Topics
1 8 JAN Introduction to Data Mining and Data Warehousing
2 15 JAN Data Warehouse and OLAP Technology – I
3 22 JAN Data Warehouse and OLAP Technology – II
4 29 JAN Data Mining Concepts and Data Preparation
5 5 FEB Association Rule Mining
6 12 FEB Classification Model: Decision Tree, Classification Rules
7 19 FEB Classification Model: Naïve Bayes
8 26 FEB Prediction Model: Regression
9 4 MAR Clustering
10 11 MAR Data Mining Application: Text Mining, Web Mining, Social Network Analysis
11 18 MAR Introduction to Data Mining Tool: WEKA
12 25 MAR Tutorials
Final
Prerequisites
Basic Database Concepts
Basic Statistics:
Probability, Sampling, Logic, Linear Regression, …
Algorithms:
Basic Data Structures, Dynamic Programming, ...
We will provide some background, but the class can move at a faster pace if you already have some of the basics.
Introduction
Motivation: Why mine data?
KDD: Knowledge Discovery in Databases
What is Data Mining?
Data Mining: On What Kind of Data?
Data Mining Tasks
Data Mining Applications
Evolution of Database Technology
1960s:
Data collection, database creation, IMS and network DBMS
1970s:
Relational data model, relational DBMS implementation
1980s:
RDBMS, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, scientific, engineering, etc.)
1990s—2000s:
Data mining and data warehousing, multimedia databases, and Web databases
Large Data Sets: A Motivation
There is often information “hidden” in the data that is not readily evident.
Human analysts may take weeks to discover useful information.
Much of the data is never analyzed at all.
How do you explore millions of records, tens or hundreds of fields, and find patterns?
KDD Process (Knowledge Discovery in Databases)
[Figure: the KDD process. Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Data Mining -> Patterns -> Interpretation/Evaluation -> Knowledge]
Adapted from: U. Fayyad, et al. (1995), “From Knowledge Discovery to Data Mining: An Overview,” Advances in Knowledge Discovery and Data Mining, U. Fayyad et al. (Eds.), AAAI/MIT Press
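The stages of the KDD process can be sketched as a chain of small functions. This is only a minimal illustration under stated assumptions: the records, the field names (`customer`, `age`, `items`), the function names, and the `min_support` threshold are all hypothetical, and the "mining" step is a simple frequency count standing in for a real data mining algorithm.

```python
from collections import Counter

# Hypothetical raw records (made-up data for illustration only).
raw_data = [
    {"customer": "c1", "age": 25, "items": ["milk", "bread"]},
    {"customer": "c2", "age": None, "items": ["milk", "soda"]},
    {"customer": "c3", "age": 41, "items": ["Milk ", "bread", "soda"]},
    {"customer": "c4", "age": 33, "items": []},
]


def select(records):
    """Selection: keep only the field relevant to the task (target data)."""
    return [r["items"] for r in records]


def preprocess(baskets):
    """Preprocessing: normalize item names, remove duplicates, drop empty baskets."""
    cleaned = [sorted({item.strip().lower() for item in b}) for b in baskets]
    return [b for b in cleaned if b]


def mine(baskets):
    """Data mining (stand-in): count in how many baskets each item occurs."""
    return Counter(item for b in baskets for item in b)


def evaluate(patterns, n_baskets, min_support):
    """Interpretation/evaluation: keep items whose support meets the threshold."""
    return {
        item: count / n_baskets
        for item, count in patterns.items()
        if count / n_baskets >= min_support
    }


def run_kdd(records, min_support=0.5):
    """Run the four stages in order and return the retained patterns."""
    target = select(records)
    prepared = preprocess(target)
    patterns = mine(prepared)
    return evaluate(patterns, len(prepared), min_support)


if __name__ == "__main__":
    print(run_kdd(raw_data))
```

Keeping each stage as a separate function mirrors the figure: the output of one stage is the input of the next, and the evaluation threshold decides which patterns count as knowledge.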
Knowledge Discovery
Business Intelligence (BI) vs. Data Mining
Business intelligence is a term for the processes, techniques, and tools that support business decisions using information technology.
[Figure: BI pyramid. The potential to support business decisions increases from the bottom layer to the top layer.]
Layers, top to bottom:
Making Decisions
Data Presentation: Visualization Techniques
Data Mining: Knowledge Discovery
Data Exploration: OLAP, Statistical Analysis, Querying and Reporting
Data Warehouses / Data Marts
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP
Roles, top to bottom: End User, Business Analyst, Data Analyst, DBA
Terminology
Data Mining: A step in the knowledge discovery process consisting of particular algorithms (methods) that, under some acceptable objective, produce a particular enumeration of patterns (models) over the data.
Knowledge Discovery Process: The process of using data mining methods (algorithms) to extract (identify) what is deemed knowledge according to the specifications of measures and thresholds, using a database along with any necessary preprocessing or transformations.
Other definitions of Data Mining
Non‐trivial extraction of implicit, previously unknown and useful information from data
Automatic or semi-automatic process for analyzing large databases to find patterns that are:
valid: hold on new data with some certainty
novel: non‐obvious to the system
useful: should be possible to act on the item
understandable: humans should be able to interpret the pattern
Origins of Data Mining
Overlaps with various fields, but focuses on:
Scalability
Algorithm and Architecture
Automation to handle large data
Data Mining: On What Kind of Data?
Relational Databases
Data Warehouses
Transactional Databases
Advanced Database Systems: Object-Relational; Spatial and Temporal; Time-Series; Multimedia; Text; Heterogeneous, Legacy, and Distributed; WWW
Example data: GeneFilter Comparison Report
GeneFilter 1 Name: O2#1 8-20-99adjfinal
GeneFilter 1 Name: N2#1finaladj
Columns: ORF NAME, GENE NAME (blank for some rows), CHRM, F, G, R, one unlabeled index, raw intensities (GF1, GF2), normalized intensities (GF1, GF2), DIFFERENCE, RATIO
YAL001C TFC3 1 1 A 1 2 12.03 7.38 403.83 209.79 194.04 1.92
YBL080C PET112 2 1 A 1 3 53.21 35.62 "1,786.11" "1,013.13" 772.98 1.76
YBR154C RPB5 2 1 A 1 4 79.26 78.51 "2,660.73" "2,232.86" 427.87 1.19
YCL044C 3 1 A 1 5 53.22 44.66 "1,786.53" "1,270.12" 516.41 1.41
YDL020C SON1 4 1 A 1 6 23.80 20.34 799.06 578.42 220.64 1.38
YDL211C 4 1 A 1 7 17.31 35.34 581.00 "1,005.18" -424.18 -1.73
YDR155C CPH1 4 1 A 1 8 349.78 401.84 "11,741.98" "11,428.10" 313.88 1.03
YDR346C 4 1 A 1 9 64.97 65.88 "2,180.87" "1,873.67" 307.21 1.16
YAL010C MDM10 1 1 A 2 2 13.73 9.61 461.03 273.36 187.67 1.69
YBL088C TEL1 2 1 A 2 3 8.50 7.74 285.38 220.01 65.37 1.30
YBR162C 2 1 A 2 4 226.84 293.83 "7,614.82" "8,356.39" -741.57 -1.10
YCL052C PBN1 3 1 A 2 5 41.28 34.79 "1,385.79" 989.41 396.38 1.40
YDL028C MPS1 4 1 A 2 6 7.95 6.24 266.99 177.34 89.65 1.51
YDL219W 4 1 A 2 7 16.08 11.33 539.93 322.20 217.74 1.68
YDR163W 4 1 A 2 8 19.13 14.19 642.17 403.56 238.61 1.59
YDR354W TRP4 4 1 A 2 9 62.24 40.74 "2,089.48" "1,158.64" 930.84 1.80
YAL018C 1 1 A 3 2 10.72 8.81 359.75 250.60 109.15 1.44
YBL096C 2 1 A 3 3 10.91 8.98 366.40 255.40 111.00 1.43
YBR169C SSE2 2 1 A 3 4 17.33 27.81 581.80 790.84 -209.05 -1.36
YCL060C 3 1 A 3 5 17.99 24.75 603.96 703.75 -99.79 -1.17
YDL036C 4 1 A 3 6 14.22 8.86 477.39 251.94 225.44 1.89
YDL227C HO 4 1 A 3 7 25.61 31.52 859.71 896.46 -36.75 -1.04
YDR171W HSP42 4 1 A 3 8 102.08 98.37 "3,426.83" "2,797.58" 629.25 1.22
YDR362C 4 1 A 3 9 16.32 12.95 547.96 368.39 179.57 1.49
YAL026C DRS2 1 1 A 4 2 11.32 7.97 379.85 226.53 153.33 1.68
YBL102W SFT2 2 1 A 4 3 55.88 63.74 "1,875.82" "1,812.81" 63.02 1.03
YBR177C 2 1 A 4 4 63.31 29.03 "2,125.20" 825.60 "1,299.60" 2.57
YCL068C 3 1 A 4 5 8.33 4.47 279.51 127.16 152.35 2.20
YDL044C MTF2 4 1 A 4 6 11.73 6.96 393.88 198.07 195.81 1.99
YDL235C YPD1 4 1 A 4 7 38.71 30.20 "1,299.33" 858.83 440.50 1.51
YDR179C 4 1 A 4 8 12.77 11.05 428.60 314.12 114.48 1.36
YDR370C 4 1 A 4 9 16.70 15.30 560.62 435.13 125.49 1.29
YAL034C FUN19 1 1 A 5 2 20.89 24.21 701.32 688.59 12.73 1.02
YBL111C 2 1 A 5 3 22.38 13.67 751.39 388.69 362.70 1.93
YBR185C MBA1 2 1 A 5 4 38.42 19.96 "1,289.61" 567.78 721.83 2.27
YCLX03C 3 1 A 5 5 8.69 3.66 291.77 104.11 187.66 2.80
YDL052C SLC1 4 1 A 5 6 52.37 49.87 "1,758.05" "1,418.33" 339.73 1.24
YDL243C 4 1 A 5 7 15.56 12.95 522.24 368.30 153.94 1.42
YDR186C 4 1 A 5 8 16.48 15.01 553.30 426.75 126.55 1.30
YDR378C 4 1 A 5 9 31.13 28.08 "1,045.01" 798.50 246.50 1.31
YAL040C CLN3 1 1 A 6 2 126.65 107.34 "4,251.70" "3,052.61" "1,199.08" 1.39
YBR006W 2 1 A 6 3 22.74 11.10 763.49 315.55 447.94 2.42
YBR193C 2 1 A 6 4 14.81 15.55 497.07 442.20 54.87 1.12
YCLX11W 3 1 A 6 5 161.96 175.34 "5,436.86" "4,986.41" 450.44 1.09
YDL060W 4 1 A 6 6 29.84 37.13 "1,001.65" "1,055.98" -54.34 -1.05
YDR003W 4 1 A 6 7 23.99 23.22 805.48 660.25 145.22 1.22
YDR194C MSS116 4 1 A 6 8 66.58 47.16 "2,235.07" "1,341.29" 893.78 1.67
YDR386W 4 1 A 6 9 11.27 5.75 378.27 163.46 214.81 2.31
YAL047C 1 1 A 7 2 15.54 11.30 521.74 321.28 200.46 1.62
YBR012W-B 2 1 A 7 3 54.70 79.97 "1,836.29" "2,274.15" -437.86 -1.24
YBR201W DER1 2 1 A 7 4 21.67 19.57 727.49 556.64 170.85 1.31
YCR007C 3 1 A 7 5 25.02 15.96 840.01 453.76 386.25 1.85
YDL068W 4 1 A 7 6 18.32 13.11 614.83 372.78 242.05 1.65
Structure - 3D Anatomy
Function – 1D Signal
Metadata – Annotation
Data Mining Tasks
Classification
Clustering
Association Rule Mining
Sequential Pattern Discovery
Regression
Anomaly Detection
Ex: Classifying Galaxies
Ex: Market Basket Analysis
Where should detergents be placed in the store to maximize their sales?
Are window cleaning products purchased when detergents and orange juice are bought together?
How are the demographics of the neighborhood affecting what customers are buying?
Is soda typically purchased with bananas? Does the brand of soda make a difference?
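A question such as "Is soda typically purchased with bananas?" is commonly answered with the standard association rule measures of support, confidence, and lift. The slide does not define these measures, so the sketch below is only an illustration: the transactions are made-up data and the function names are hypothetical.

```python
# Hypothetical transactions (made-up data for illustration only).
transactions = [
    {"soda", "bananas", "bread"},
    {"soda", "bananas"},
    {"soda", "chips"},
    {"bananas", "milk"},
    {"detergent", "orange juice", "window cleaner"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def confidence(antecedent, consequent, transactions):
    """Estimate of P(consequent | antecedent); 0.0 if the antecedent never occurs."""
    base = support(antecedent, transactions)
    if base == 0:
        return 0.0
    return support(set(antecedent) | set(consequent), transactions) / base


def lift(antecedent, consequent, transactions):
    """Confidence relative to the consequent's overall frequency.

    A value above 1 suggests the items occur together more often than
    expected if they were independent.
    """
    base = support(consequent, transactions)
    if base == 0:
        return 0.0
    return confidence(antecedent, consequent, transactions) / base


if __name__ == "__main__":
    rule = ({"soda"}, {"bananas"})
    print("support:", support(rule[0] | rule[1], transactions))
    print("confidence:", confidence(*rule, transactions))
    print("lift:", lift(*rule, transactions))
```

In this toy data, soda and bananas appear together in 2 of 5 transactions, and 2 of the 3 soda baskets also contain bananas, so the rule has a lift slightly above 1.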
Ex: Anomaly Detection
Detect significant deviations from normal behavior
Applications:
Credit Card Fraud Detection
Network Intrusion Detection
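One simple way to flag "significant deviations from normal behavior" is to score each value by its distance from the median, scaled by the median absolute deviation (a modified z-score). This is only one of many possible approaches and is not necessarily the one used in the applications above. The transaction amounts, the function name, and the threshold of 3.5 (a commonly used cutoff) are assumptions for illustration.

```python
from statistics import median

# Hypothetical credit card transaction amounts (made-up data).
amounts = [12.5, 9.9, 15.0, 11.2, 13.4, 10.8, 950.0, 14.1]


def find_anomalies(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The modified z-score is 0.6745 * |x - median| / MAD, where MAD is the
    median absolute deviation. Using the median keeps a single extreme
    value from hiding itself by inflating the scale estimate.
    """
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        # All values are (nearly) identical, so nothing stands out.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - center) / mad > threshold
    ]


if __name__ == "__main__":
    print(find_anomalies(amounts))
```

A median-based score is used here instead of the mean and standard deviation because, in a small sample, one large outlier inflates the standard deviation enough that its own z-score can stay below a typical cutoff.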
Some Success Stories
Network intrusion detection using a combination of sequential rule discovery and classification trees on 4 GB of DARPA data
Won over the (manual) knowledge engineering approach
http://www.cs.columbia.edu/~sal/JAM/PROJECT/ provides a good, detailed description of the entire process
Major US bank: customer attrition prediction
Segment customers based on financial behavior: 3 segments
Build attrition models for each of the 3 segments
40-50% of attritions were predicted, a factor of 18 increase
Targeted credit marketing: major US banks
Find customer segments based on 13 months of credit balances
Build another response model based on surveys
Increased response 4 times (2%)
How You’ll Benefit
Confidently discuss the role and applicability of data warehousing and data mining to business and organizational problems.
Gain background knowledge to explore further in your thesis, independent study, or career projects, since data mining methods (for extracting knowledge from data) are useful in every field.
Assignment
Assignments will aim to test your detailed knowledge and understanding of the topics, as well as your critical thinking and research ability. Assignments may include tasks involving: writing detailed designs; reading research papers; learning and using specialist software/hardware.
Assessment: the assignment will be worth 20% of the total course assessment.
PreTest
1. Select only one of the following items to fill in each blank.
(a) Characterization/Discrimination
(b) Classification
(c) Numeric Prediction
(d) Clustering
(e) Association Analysis
(f) Trend Analysis
Which function matches each of the following tasks?
______(1) To estimate the price of stock A next month.
______(2) To display the proportion of products sold, according to their types.
______(3) To know which products are likely to be sold together with which other products.
______(4) To group customers into sets of similar customers based on their features.
______(5) To find the value of an experiment when a substance is tested.
______(6) To predict whether a customer tends to be a good customer or not.
2. Assume that we want to design a model to forecast tomorrow’s SET index. Please suggest the details of the model that we should construct, and recommend the inputs and outputs of the model.