Data Mining
“Application of Information and Communication Technology to Production and Dissemination of Official Statistics”
10 May – 11 July 2006
M Q Hasan
Lecturer/Statistician
UN Statistical Institute for Asia and the Pacific
Chiba, Japan
Email: [email protected]

Objectives
Understanding data mining
Basis for future planning and development

Contents
What is data mining
Evolution of data mining
Technology and techniques involved
Software packages
References
Exercises

What is “data mining”:
“The nontrivial extraction of implicit, previously unknown, and potentially useful information from data”
“The science of extracting useful information from large data sets or databases”
Source: Wikipedia, the free encyclopaedia

What is “data mining”:
Also termed “data discovery”
Process of analyzing data to identify patterns or relationships
Extraction of patterns or information from stored information

What is “data mining” ...
Prediction of future events and behaviors, estimation of values, etc.
– Accuracy
– Confidence level

What is “data mining” ...
Process of data mining
– the initial exploration of available data
– model building or pattern identification with validation
– the application of the model to new data in order to generate predictions
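
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the records are made-up (spend, sales) pairs, the "model" is a plain least-squares line, and the names `fit_line` and `predict` are hypothetical helpers, not part of any package named in these slides.

```python
import random
import statistics

# Hypothetical toy data: (advertising spend, unit sales) pairs, invented for illustration.
rng = random.Random(0)
records = [(x, 3.0 * x + 5.0 + rng.uniform(-1.0, 1.0)) for x in range(1, 41)]

# Stage 1: initial exploration of the available data.
xs = [x for x, _ in records]
ys = [y for _, y in records]
print("mean spend:", statistics.mean(xs), "mean sales:", round(statistics.mean(ys), 2))

# Stage 2: model building with validation (a quarter of the records is held out).
rng.shuffle(records)
split = int(len(records) * 0.75)
train, holdout = records[:split], records[split:]


def fit_line(pairs):
    """Ordinary least squares fit of y = slope * x + intercept."""
    mx = statistics.mean(x for x, _ in pairs)
    my = statistics.mean(y for _, y in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    slope = sxy / sxx
    return slope, my - slope * mx


slope, intercept = fit_line(train)
# Validation: mean absolute error on records the model has not seen.
mae = statistics.mean(abs(y - (slope * x + intercept)) for x, y in holdout)
print("holdout mean absolute error:", round(mae, 3))


# Stage 3: application of the model to new data to generate a prediction.
def predict(x):
    return slope * x + intercept


print("predicted sales at spend 50:", round(predict(50), 1))
```

The holdout error is what gives a rough idea of the accuracy to expect when the model is applied to new data.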

What is “data mining” ...
Requirements
– Data
– Concepts
– Instances
– Parameters

What is NOT data mining:
Data warehousing
SQL / ad hoc queries / reporting
Software agents
Online analytical processing (OLAP)
Data visualization

Why DM now?
Development and refinement of three technologies over the years:
– Massive data collection and storage facilities. Databases of terabyte order. Includes publicly available data.
– Powerful multiprocessor computers. Parallel processing technology, distributed technology, speed.
– Data mining algorithms. Statistical, data modeling, etc.

| Evolutionary Step | Business Question | Enabling Technologies | Characteristics |
|---|---|---|---|
| Data Collection (1960s) | “What was my total revenue in the last five years?” | Computers, tapes, disks | Retrospective, static data delivery |
| Data Access (1980s) | “What were unit sales in New England last March?” | RDBMS, SQL, ODBC | Retrospective, dynamic data delivery at record level |
| Data Warehousing & Decision Support (1990s) | “What were unit sales in New England last March? Drill down to Boston.” | On-line analytic processing (OLAP), multidimensional databases, data warehouses | Retrospective, dynamic data delivery at multiple levels |
| Data Mining (Emerged) | “What’s likely to happen to Boston unit sales next month? Why?” | Advanced algorithms, multiprocessor computers, massive databases | Prospective, proactive information delivery |

Tools
Case-based reasoning.
• Case-based reasoning tools provide a means to find records similar to a specified record or records. These tools let the user specify the "similarity" of retrieved records.
Data visualization.
• Data visualization tools let the user easily and quickly view graphical displays of information from different perspectives.

1 + 1 = 1
Is it possible?

Let $a = b$
Then $a^2 = ab$
Then $2a^2 = a^2 + ab$
Then $2a^2 - 2ab = a^2 - ab$
Then $2(a^2 - ab) = 1(a^2 - ab)$
Then $(1 + 1)(a^2 - ab) = 1(a^2 - ab)$
Canceling $(a^2 - ab)$ from both sides:
$1 + 1 = 1$
Where is the FALLACY?
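
One way to locate the flaw, written as a short worked equation: the starting assumption $a = b$ makes the cancelled factor equal to zero.

```latex
a = b \;\Longrightarrow\; a^2 - ab = a(a - b) = 0,
\qquad\text{so}\qquad
2\,(a^2 - ab) = 1\,(a^2 - ab)
\;\text{ only says }\;
2 \cdot 0 = 1 \cdot 0 .
```

Canceling $(a^2 - ab)$ therefore divides both sides by zero, which is undefined, so the step to $1 + 1 = 1$ does not follow.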

In data mining, think from all sides.
Avoid the FALLACIES

Thinking Hat techniques
White hat:
With this thinking hat you focus on the data available. Look at the information you have, and see what you can learn from it. Look for gaps in your knowledge, and either try to fill them or take account of them.
This is where you analyse past trends, and try to extrapolate from historical data.

Thinking Hat techniques
Red hat:
'Wearing' the red hat, you look at problems using intuition, gut reaction, and emotion. Also try to think how other people will react emotionally. Try to understand the responses of people who do not fully know your reasoning.

Thinking Hat techniques
Black hat: using black hat thinking.
Look at all the bad points of the decision.
Look at it cautiously and defensively.
Try to see why it might not work.
Helps to make plans 'tougher' and more resilient.
Helps you to spot fatal flaws and risks.
Helps because successful people sometimes get so used to thinking positively that they cannot see problems in advance.

Thinking Hat techniques
Yellow hat: using yellow hat thinking.
Helps “think positively.”
Helps you to see all the benefits of the decision and the value in it.
Helps you to keep going when everything looks gloomy and difficult.

Thinking Hat techniques
Green hat: the green hat stands for creativity.
This is the time to develop creative solutions to a problem.
Little criticism of ideas.
A whole range of creativity tools can help.

Thinking Hat techniques
Blue hat: the blue hat stands for process control.
This is the hat worn by people chairing meetings. When running into difficulties because ideas are running dry, they may direct activity into green hat thinking. When contingency plans are needed, they will ask for black hat thinking, etc.

Some DM terms:
Instances
Attributes
Objects
Class
Relationships
Rule induction

Machine learning

Some DM techniques:
Decision trees
Neural networks
Genetic algorithms
Nearest neighbor methods
Rule induction

Some DM techniques
Decision trees
– Tree-shaped structure with branches
– 2 main types:
  Classification trees label records and assign them to the proper class
  Regression trees estimate the value of a target variable
– Various algorithms:
  Chi-square automatic interaction detection (CHAID)
  Classification and regression trees (CART)
  Etc.
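
To make the idea of a classification tree concrete, here is a minimal sketch in Python. It is a simplified CART-style illustration (binary splits chosen by Gini impurity), not a full CART or CHAID implementation; the (age, income) records and the helper names `gini`, `best_split`, `build_tree` and `classify` are assumptions made up for this example.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def best_split(rows, labels):
    """Return (feature_index, threshold, weighted_gini) of the best binary split."""
    best = None
    n = len(rows)
    for f in range(len(rows[0])):
        values = sorted({row[f] for row in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2  # candidate threshold halfway between neighbours
            left = [lab for row, lab in zip(rows, labels) if row[f] <= t]
            right = [lab for row, lab in zip(rows, labels) if row[f] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if best is None or score < best[2]:
                best = (f, t, score)
    return best


def build_tree(rows, labels, depth=0, max_depth=3):
    """A leaf is a class label; an inner node is (feature, threshold, left, right)."""
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or depth == max_depth:
        return majority
    split = best_split(rows, labels)
    if split is None:  # no feature has two distinct values left
        return majority
    f, t, _ = split
    left = [(r, l) for r, l in zip(rows, labels) if r[f] <= t]
    right = [(r, l) for r, l in zip(rows, labels) if r[f] > t]
    return (
        f,
        t,
        build_tree([r for r, _ in left], [l for _, l in left], depth + 1, max_depth),
        build_tree([r for r, _ in right], [l for _, l in right], depth + 1, max_depth),
    )


def classify(tree, row):
    """Follow the branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree


# Hypothetical records: (age, income in thousands) and whether the person responded.
rows = [(25, 20), (30, 35), (45, 50), (50, 80), (35, 90), (23, 15), (52, 30), (40, 85)]
labels = ["no", "no", "yes", "yes", "yes", "no", "no", "yes"]

tree = build_tree(rows, labels)
print(tree)
print(classify(tree, (33, 60)))
```

A regression tree would follow the same branching logic but store an estimated value (for example the mean of the target variable) in each leaf instead of a class label.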

Some DM techniques
Neural networks
– Learn through training
– Resemble biological neural networks in structure
– Can produce very good predictions
– Not easy to use or to understand
– Cannot deal with missing data
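
"Learning through training" can be illustrated with the smallest possible case: a single artificial neuron (a perceptron) that adjusts its weights whenever it misclassifies an example. This is a deliberately reduced sketch, not a multilayer network of the kind the packages in these slides provide; the function names and the logical-AND training set are assumptions chosen for illustration.

```python
def train_perceptron(samples, epochs=20, learning_rate=1.0):
    """Train one neuron with the classic perceptron update rule."""
    weights = [0.0, 0.0]
    bias = 0.0
    for _ in range(epochs):
        for (x1, x2), target in samples:
            output = 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0
            error = target - output  # 0 when the neuron is already right
            weights[0] += learning_rate * error * x1
            weights[1] += learning_rate * error * x2
            bias += learning_rate * error
    return weights, bias


def predict(weights, bias, x1, x2):
    return 1 if weights[0] * x1 + weights[1] * x2 + bias > 0 else 0


# Training set: the logical AND of two binary inputs.
and_samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train_perceptron(and_samples)
for (x1, x2), target in and_samples:
    print((x1, x2), "->", predict(weights, bias, x1, x2), "expected", target)
```

The learned weights are just numbers with no obvious meaning, which hints at why larger networks are hard to interpret.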

Some DM techniques
Genetic algorithms
– Optimization techniques that use:
  Genetic combination
  Natural selection
  Concepts of evolution
  Etc.
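
The ingredients listed above (combination, selection, evolution over generations) can be sketched on a toy optimization problem. The example below assumes a made-up goal, maximizing the number of 1s in a bit string, and hypothetical parameter values; it is an illustration of the general idea rather than a production algorithm.

```python
import random


def fitness(bits):
    """Toy objective: count of 1s in the bit string (higher is better)."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        parents = population[: pop_size // 2]  # natural selection: fitter half survives
        children = [population[0][:]]  # elitism: carry the best individual over unchanged
        while len(children) < pop_size:
            mum, dad = rng.sample(parents, 2)
            cut = rng.randint(1, n_bits - 1)
            child = mum[:cut] + dad[cut:]  # genetic combination (one-point crossover)
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
    best = max(population, key=fitness)
    return best, history


best, history = evolve()
print("best fitness per generation:", history)
print("final best individual:", best, "fitness", fitness(best))
```

Because the best individual is always carried over, the best fitness never decreases from one generation to the next.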

Some DM techniques
Nearest neighbor methods
– K-nearest neighbor technique
– Classifies each record based on a combination of the classes of the k most similar records
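
A minimal k-nearest neighbor sketch, assuming Euclidean distance as the similarity measure and a simple majority vote as the way of combining classes. The historical records and the function name `knn_classify` are invented for illustration.

```python
import math
from collections import Counter


def knn_classify(history, query, k=3):
    """Label a query point by majority vote among its k closest historical records."""
    nearest = sorted(history, key=lambda record: math.dist(record[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical historical records: (feature vector, known class).
history = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.3), "A"),
    ((4.0, 4.2), "B"),
    ((4.5, 3.9), "B"),
    ((3.8, 4.4), "B"),
]

print(knn_classify(history, (1.1, 1.0)))
print(knn_classify(history, (4.1, 4.0)))
```

No model is built in advance: all the work happens at query time, which is why the method needs the historical records to stay available.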

Some DM techniques
Rule induction
– Extraction of if-then-else rules from data, based on statistical significance
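
As a rough sketch of how such rules might be extracted, the code below scans single-condition rules of the form "IF attribute = value THEN outcome" and keeps those with enough support and confidence. Support and confidence thresholds are used here as a simple stand-in for a formal significance test, which is an assumption of this example; the records and the name `induce_rules` are also made up.

```python
from collections import Counter


def induce_rules(records, target, min_support=2, min_confidence=0.8):
    """Return sorted (attribute, value, outcome, support, confidence) tuples."""
    condition_counts = Counter()  # how often each attribute=value occurs
    rule_counts = Counter()  # how often it occurs together with each outcome
    for record in records:
        for attribute, value in record.items():
            if attribute == target:
                continue
            condition_counts[(attribute, value)] += 1
            rule_counts[(attribute, value, record[target])] += 1
    rules = []
    for (attribute, value, outcome), hits in rule_counts.items():
        confidence = hits / condition_counts[(attribute, value)]
        if hits >= min_support and confidence >= min_confidence:
            rules.append((attribute, value, outcome, hits, confidence))
    return sorted(rules)


# Hypothetical records with a target attribute "play".
records = [
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "sunny", "windy": "yes", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
    {"outlook": "rainy", "windy": "no", "play": "no"},
    {"outlook": "sunny", "windy": "no", "play": "yes"},
    {"outlook": "rainy", "windy": "yes", "play": "no"},
]

rules = induce_rules(records, target="play")
for attribute, value, outcome, support, confidence in rules:
    print(f"IF {attribute} = {value} THEN play = {outcome} "
          f"(support {support}, confidence {confidence:.2f})")
```

Rules on "windy" are dropped here because they hold in only two of three matching records, below the assumed confidence threshold.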

How DM works?
Modeling
– Predicting the FUTURE!
– Build once, apply/use many times

How DM works?
Test the validity of the model
– Known cases with known data

Data Mining Software
Numap7, freeware for fast development, validation, and application of regression-type networks, including the multilayer perceptron, functional link net, piecewise linear network, self-organizing map and k-means.
– http://www-ee.uta.edu/eeweb/ip/Software/Software.htm

Data Mining Software
Tiberius, MLP Neural Network for classification and regression problems.
– http://www.philbrierley.com/

Data Mining Software
Eurostat-funded research projects
– SODAS – Symbolic Official Data Analysis System => ASSO
– KESO – Knowledge Extraction for Statistical Offices
– Spin! – Spatial Mining for Data of Public Interest

Data Mining Software
SAS data mining tools
– Enterprise Miner and Text Miner
– Applications relevant to national statistical offices
– Build a model of the real world based on various data
– Use the model to produce patterns
– Reveal trends
– Explain known outcomes
– Predict future outcomes
– Forecast resource demands
– Identify factors to secure a desired effect
– Produce new knowledge to better inform decision makers before they act
– Predict new opportunities

Data Mining Software
SAS data mining process: a framework for data mining: sample, explore, modify, model, assess
Integrated models and algorithms:
– Decision trees
– Neural networks
– Regression
– Memory-based reasoning
– Bagging and boosting ensembles
– Two-stage models
– Clustering
– Time series
– Associations

Data Mining Software
SPSS Clementine
– Data mining workbench
– Applications relevant to national statistical offices
  Find useful relationships in large data sets
  Develop predictive models
  Improve decision making
– Modeling
  Prediction and classification: neural networks, decision trees and rule induction, linear regression, logistic regression, multinomial logistic regression
  Clustering and segmentation: Kohonen network, K-means, and TwoStep
  Association detection: GRI, Apriori, and sequence
  Data reduction: factor analysis and principal components analysis
  Meta-modeling – combination of models

Data Mining Software
Open source data mining
– www.cs.waikato.ac.nz/ml/weka - Weka (Waikato Environment for Knowledge Analysis)
– Data mining software in Java
– Collection of machine learning algorithms for data mining tasks:
  Data pre-processing
  Classification
  Regression
  Clustering
  Association rules
  Visualization
– Platforms: Linux, Windows and Macintosh
– Apply directly to a dataset or call from Java code
– Online documentation:
  Tutorial
  User guide
  API documentation

References:
Statistical Data Mining Tutorials
– http://www-2.cs.cmu.edu/~awm/tutorials/
Data Mining Glossary
– http://www.twocrows.com/glossary.htm
Mind Tools - Decision Tree Analysis
– http://www.mindtools.com/dectree.html
Welcome to TheDataMine
– http://www.the-data-mine.com/
An Introduction to Data Mining - Discovering hidden value in your data warehouse
– http://www.thearling.com/text/dmwhite/dmwhite.htm
An Introduction to Data Mining
– http://www.thearling.com/dmintro/dmintro.pdf
Data Mining for Official Statistics, Phan Tuan Pham (UNSD)
– SIAP ICT, Chiba, 7 – 9 June 2004
Wikipedia, the free encyclopaedia
– http://en.wikipedia.org/wiki/Data_mining