10 Data Mining
-
7/30/2019 10 Data Mining
1/43
SQL Server 2008 for Business Intelligence
UTS Short Course
-
Eric Phan, SA @ SSW
w: ericphan.info | e: [email protected] | t: @ericphan
Loves C# and .NET
Specializes in
Application architecture and design
SQL performance tuning and optimization
Agile, Scrum
Certified Scrum Trainer
Technology aficionado
Silverlight
ASP.NET
Windows Forms
-
Admin Stuff
Attendance: you initial the sheet
Hands-on lab: you get me to initial the sheet
Homework
Certificate: at the end of the 5 sessions, if I confirm that you have completed the course successfully
-
Course Website
Course Timetable & Materials: http://bit.ly/UTSSQL
Resources: http://sharepoint.ssw.com.au/Training/UTSSQL/
-
Course Overview
Session | Date | Time | Topic
1 | Tuesday 01-05-2012 | 18:00 - 21:00 | SSIS and Creating a Data Warehouse
2 | Tuesday 08-05-2012 | 18:00 - 21:00 | OLAP Creating Cubes and Cube Issues
3 | Tuesday 15-05-2012 | 18:00 - 21:00 | Reporting Services
4 | Tuesday 22-05-2012 | 18:00 - 21:00 | Alternative Cube Browsers
5 | Tuesday 29-05-2012 | 18:00 - 21:00 | Data Mining
-
Last week(s)
1. Other cube browsers
Microsoft Data Analyzer
ProClarity
Excel 2003/2007/2010
Excel Services
Thinslicer
Performance Point
Power Pivot
-
The plan
-
Step by step to BI
1. Create Data Warehouse
2. Copy data to data warehouse
3. Create OLAP Cubes
4. Create Reports
5. Browse the cube
6. Do some Data Mining
Discover relationships
Predict future events
-
Agenda
1. What is Data Mining?
2. Why?
3. Uses
4. Algorithms
5. Demo
6. Hands on Lab
-
What is Data Mining?
Data mining is the use of powerful software tools
to discover significant traits or relationships
in databases or data warehouses;
it is often used to predict future events
-
What is Data Mining?
It exploits statistical algorithms
Once the knowledge is extracted it:
Can be used to discover
Can be used to predict values of other cases
-
Why Data Mining?
Marketing
Who picks the movie? The kids, the wife, me?
Who are our customers and what sort of films do they hire?
Is a 30-year-old woman with 2 children going to hire Arnie's latest film?
Validation
Is this data sensible? Terminator 2 and Toy Story
Prediction
Sales next year
-
Why? It's all about money
1. Get new information from data: future trends, past trends, outliers, maximums, minimums
2. Analyse data from different perspectives and summarize it into useful information
3. Use the new information to
increase revenue
cut costs
or both :-)
-
Which Questions are Data Mining?
Who are our biggest customers?
What are customers buying with cigars?
What are the customer retention levels of our branches?
Which customers have bought olives, feta cheese but no ciabatta bread?
Which regions have the highest male/female ratio of single 20 somethings?
Which region has lowest customer retention levels and list out lost customers?
-
What's not data mining
Ad hoc query
Drill through to details
Business Intelligence tool
-
Data - Uncover patterns in samples
Huge amount of data
Good raw material means good data mining
Samples should be representative
Samples "similar" to domain
Not an all-seeing crystal ball
Verify and validate!
-
OLAP versus Data Mining
OLAP
Is about fast ad hoc querying
Analysis by dimensions and measures
Gives precise answers
Data Mining
May use RDBMS or OLAP source
Is about discovering and predicting
Gives imprecise answers
OLAP is not a prerequisite for data mining, but it almost always comes first (like learning to ride a bike before a car)
-
Types of Data Mining Algorithms
Classification algorithms
predict one or more discrete variables, based on the other attributes in the dataset
Regression algorithms
predict one or more continuous variables, such as profit or loss, based on other attributes in the dataset
Segmentation algorithms
divide data into groups, or clusters, of items that have similar properties
Association algorithms
find correlations between different attributes in a dataset
Sequence analysis algorithms
summarize frequent sequences or episodes in data, such as a Web path flow
-
Complete Set Of Algorithms
Ways to analyze your data
Decision Trees
Clustering
Time Series
Neural Network
Association
Naive Bayes
Linear Regression
Logistic Regression
Sequence Clustering
-
Decision trees
Split data
Each branch is like an attribute
Brightness = amount of data
-
Decision Trees (1)
Decision trees assign (classify) each case to one of a few (discrete) broad categories of a selected attribute (variable) and explain the classification with a few selected input variables
The process of building the tree is recursive partitioning: splitting the data into partitions and then splitting those up further
Initially all cases are in one big box
-
Decision Trees (2)
The algorithm tries all possible breaks in classes using all possible values of each input attribute; it then selects the split that partitions the data into the purest classes of the searched variable
Several measures of purity exist
Then it repeats the splitting for each new class
Again testing all possible breaks
Branches of the tree that are not useful can be pre-pruned or post-pruned
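The split search described on the decision tree slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: Gini impurity is picked as the purity measure (the slides do not say which one is used), the function names `gini` and `best_split` are hypothetical, and the rental data is made up. It is not how SQL Server implements the Microsoft Decision Trees algorithm.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 0.0 means the partition is pure (one class only)."""
    total = len(labels)
    if total == 0:
        return 0.0
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())

def best_split(rows, target):
    """Try every (attribute, value) equality test and keep the purest split.

    rows is a list of dicts; target names the predicted attribute.
    Returns (attribute, value, weighted_impurity) for the best split found.
    """
    best = None
    attributes = [a for a in rows[0] if a != target]
    for attr in attributes:
        for value in sorted({r[attr] for r in rows}):
            left = [r[target] for r in rows if r[attr] == value]
            right = [r[target] for r in rows if r[attr] != value]
            if not left or not right:
                continue  # everything on one side: not a real split
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if best is None or weighted < best[2]:
                best = (attr, value, weighted)
    return best

# Made-up cases: does the customer hire action films?
cases = [
    {"age_group": "young", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "young", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "young", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
]

split = best_split(cases, "hires_action")
print(split)
```

A full tree would call `best_split` again on each resulting partition (the recursive partitioning step) and stop when a partition is pure or too small, which is one simple form of pre-pruning.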
-
Decision Trees (3)
Decision trees are used for classification and prediction
Typical questions:
Predict which customers will leave
Help in mailing and promotion campaigns
Explain reasons for a decision
What are the movies young female customers like to buy?
-
Decision Trees - Who Decides
-
Naive Bayes
Bayes formula
Uses statistics to say, with a probability, whether a case falls into a certain category or not
Spam filtering: score of spam (Bayes)
Testing only a particular attribute
-
Naive Bayes
Quickly builds mining models that can be used for classification and prediction
It calculates probabilities for each possible state of the input attribute, given each state of the predictable attribute
This can later be used to predict an outcome of the predicted attribute based on the known input attributes
This makes the model a good option for exploring the data
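As a rough illustration of the counting the Naive Bayes slides describe, the sketch below tallies how often each input state occurs with each state of the predictable attribute, then combines those probabilities with Bayes' formula. The helper names `train` and `predict`, the Laplace smoothing and the sample data are assumptions made for this example, not details taken from the Microsoft Naive Bayes algorithm.

```python
from collections import Counter, defaultdict

def train(rows, target):
    """Count how often each input state occurs with each target state."""
    class_counts = Counter(r[target] for r in rows)
    cond_counts = Counter()      # (attribute, value, target_state) -> count
    values = defaultdict(set)    # attribute -> set of states seen
    for r in rows:
        for attr, value in r.items():
            if attr != target:
                cond_counts[(attr, value, r[target])] += 1
                values[attr].add(value)
    return class_counts, cond_counts, values

def predict(model, case):
    """Return normalized P(target_state | case) for every target state."""
    class_counts, cond_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for state, n in class_counts.items():
        score = n / total  # prior probability of this target state
        for attr, value in case.items():
            # Laplace smoothing so unseen combinations do not zero the product
            score *= (cond_counts[(attr, value, state)] + 1) / (n + len(values[attr]))
        scores[state] = score
    norm = sum(scores.values())
    return {state: s / norm for state, s in scores.items()}

# Made-up cases: does the customer hire action films?
cases = [
    {"age_group": "young", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "young", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "young", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
    {"age_group": "older", "has_kids": "no",  "hires_action": "yes"},
    {"age_group": "older", "has_kids": "yes", "hires_action": "no"},
]

model = train(cases, "hires_action")
probs = predict(model, {"age_group": "young", "has_kids": "no"})
print(probs)
```

The "naive" part is the multiplication inside `predict`: each input attribute is treated as independent of the others given the target state, which is why the model is quick to build.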
-
Cluster Analysis (1)
Grouping data into clusters
Objects within a cluster have high similarity based on the attribute values
The class label of each object is not known
Several techniques
Partitioning methods
Hierarchical methods
Density-based methods
Model-based methods
And more
-
Cluster Analysis (2)
Segments a heterogeneous population into a number of more homogeneous subgroups or clusters
Some typical questions:
Discover distinct groups of customers
Identification of groups of houses in a city
In biology, derive animal and plant taxonomies
Find outliers
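One of the partitioning methods mentioned above is k-means. The sketch below is a deliberately small version, assuming customers described by age and annual income (in thousands) and a naive "first k points" initialization; the data and the `kmeans` helper are hypothetical, and the Microsoft Clustering algorithm may work differently.

```python
import math

def kmeans(points, k, iterations=20):
    """Very small k-means: returns (centroids, assignment)."""
    centroids = list(points[:k])  # naive, deterministic start (an assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centroids, assignment

# Made-up customers: (age, annual income in thousands)
customers = [(22, 25), (25, 30), (27, 28), (48, 90), (52, 95), (55, 88)]

centroids, assignment = kmeans(customers, k=2)
print(assignment)
```

No class label is supplied anywhere: the two groups emerge only from similarity of the attribute values. In practice the attributes would usually be scaled first so that income does not dominate age in the distance.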
-
Clustering
[Figure: clusters plotted by Age and Annual Income]
-
Time series
Time-based data prediction
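To make "time-based data prediction" concrete, here is a minimal sketch that fits a straight-line trend to past values and extends it one period ahead. A least-squares trend line is a stand-in chosen for brevity, not the method the Microsoft Time Series algorithm uses; the `linear_trend` helper and the sales figures are made up.

```python
def linear_trend(values):
    """Least-squares line through (0, values[0]), (1, values[1]), ...

    Returns (slope, intercept).
    """
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

yearly_sales = [100.0, 110.0, 120.0, 130.0]  # made-up figures

slope, intercept = linear_trend(yearly_sales)
forecast = intercept + slope * len(yearly_sales)  # next period
print(forecast)
```

Real series usually carry seasonality and noise that a single straight line cannot capture, which is why the forecast should be verified against held-back data.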
-
Sequence clustering
Numbers order the stronger associations
Associations have a direction (not necessarily valid in the other direction)
-
Association
If you own certain stocks, you may own other ones as well
Probability = thickness of line
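The probability behind each line in an association view can be approximated as the confidence of a rule "owns A, so also owns B". The sketch below is a minimal illustration with made-up portfolios and hypothetical stock codes; `confidence` is a helper written for this example, not a SQL Server function.

```python
def confidence(baskets, antecedent, consequent):
    """Share of baskets containing `antecedent` that also contain `consequent`.

    Returns None when the antecedent never occurs.
    """
    with_antecedent = [b for b in baskets if antecedent in b]
    if not with_antecedent:
        return None
    return sum(consequent in b for b in with_antecedent) / len(with_antecedent)

# Made-up portfolios; the stock codes are placeholders
portfolios = [
    {"AAA", "BBB"},
    {"AAA", "BBB", "CCC"},
    {"BBB"},
    {"BBB", "CCC"},
    {"AAA", "BBB"},
]

a_to_b = confidence(portfolios, "AAA", "BBB")
b_to_a = confidence(portfolios, "BBB", "AAA")
print(a_to_b, b_to_a)
```

The two values differ: every holder of AAA also holds BBB, but only some holders of BBB hold AAA. That asymmetry is why an association has a direction.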
-
Neural Nets
Let the system learn how to classify data
The neural network adapts to new data
Formulate a statement/hypothesis
The outcome is known (data / surveys)
1. 70% of the data to train the network (outcome is known)
2. 30% of the data to test the network (outcome is known)
3. New data (no survey needed, predict from the network)
Other example: OCR
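The 70% / 30% workflow can be sketched as follows. To keep the example short it trains a single perceptron, the simplest neural unit, rather than a full multi-layer network; the synthetic "survey" data, the labelling rule and the helper names are all assumptions made for illustration.

```python
import random

def train_perceptron(samples, epochs=50, rate=0.1):
    """samples: list of (features, label) with label 0 or 1.

    Returns (weights, bias).
    """
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for features, label in samples:
            activation = sum(w * x for w, x in zip(weights, features)) + bias
            error = label - (1 if activation > 0 else 0)
            # Nudge the weights toward the known outcome
            weights = [w + rate * error * x for w, x in zip(weights, features)]
            bias += rate * error
    return weights, bias

def predict(model, features):
    weights, bias = model
    return 1 if sum(w * x for w, x in zip(weights, features)) + bias > 0 else 0

# Synthetic cases with a known outcome: label is 1 when x + y > 1
rng = random.Random(42)
data = []
for _ in range(200):
    x, y = rng.random(), rng.random()
    data.append(((x, y), 1 if x + y > 1 else 0))
rng.shuffle(data)

cut = len(data) * 7 // 10                      # 70% to train
train_set, test_set = data[:cut], data[cut:]   # remaining 30% to test

model = train_perceptron(train_set)
accuracy = sum(predict(model, f) == label for f, label in test_set) / len(test_set)
print(f"test accuracy: {accuracy:.2f}")

# Step 3: a new case with no known outcome is predicted from the trained model
print(predict(model, (0.9, 0.8)))
```

The test set is never shown to the model during training, so its accuracy is an honest estimate of how well step 3 will work on new data.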
-
Conclusion: When To Use What
Task | Microsoft algorithms to use
Predicting a discrete attribute. For example, predict whether the recipient of a targeted mailing campaign will buy a product. | Microsoft Decision Trees Algorithm, Microsoft Naive Bayes Algorithm, Microsoft Clustering Algorithm, Microsoft Neural Network Algorithm
Predicting a continuous attribute. For example, forecast next year's sales. | Microsoft Decision Trees Algorithm, Microsoft Time Series Algorithm
Predicting a sequence. For example, perform a clickstream analysis of a company's Web site. | Microsoft Sequence Clustering Algorithm
Finding groups of common items in transactions. For example, use market basket analysis to suggest additional products to a customer for purchase. | Microsoft Association Algorithm, Microsoft Decision Trees Algorithm
Finding groups of similar items. For example, segment demographic data into groups to better understand the relationships between attributes. | Microsoft Clustering Algorithm, Microsoft Sequence Clustering Algorithm
-
There is more...
Visual Numerics
3rd party algorithms
http://www.vni.com/company/whitepapers/MicrosoftBIwithNumericalLibraries.pdf
-
Excel Data Mining
Microsoft SQL Server 2008 Data Mining Add-ins for Microsoft Office 2007
http://www.microsoft.com/downloads/en/details.aspx?familyid=896A493A-2502-4795-94AE-E00632BA6DE7&displaylang=en
-
Other usages of data mining
Find patterns - Profiling
Train station / airport
Who is the bad guy?
Farmers
Find the best crops
Supermarket
Figure out how to get you to buy more, and where to put the expensive items
-
Tip
SSIS 2008 - Data profiling task
Get a profile of the data in a table
potential candidate keys
length of data values in columns
Null percentage of rows
distribution of values
....
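The kind of profile listed above can be approximated outside SSIS with a few lines of Python. This is only a sketch of the idea, not the Data Profiling task itself: the `profile` helper, its output keys and the sample table are all hypothetical.

```python
from collections import Counter

def profile(rows):
    """Profile a table given as a list of dicts that share the same keys."""
    report = {}
    total = len(rows)
    for column in rows[0]:
        values = [r[column] for r in rows]
        non_null = [v for v in values if v is not None]
        report[column] = {
            "null_percentage": 100.0 * (total - len(non_null)) / total,
            "min_length": min((len(str(v)) for v in non_null), default=0),
            "max_length": max((len(str(v)) for v in non_null), default=0),
            "distribution": Counter(non_null),
            # A candidate key has no nulls and no repeated values
            "candidate_key": len(non_null) == total and len(set(non_null)) == total,
        }
    return report

# Made-up table
table = [
    {"customer_id": 1, "state": "NSW", "email": "a@example.com"},
    {"customer_id": 2, "state": "NSW", "email": None},
    {"customer_id": 3, "state": "VIC", "email": "c@example.com"},
    {"customer_id": 4, "state": "QLD", "email": "d@example.com"},
]

report = profile(table)
for column, stats in report.items():
    print(column, stats["null_percentage"], stats["candidate_key"])
```

Running a profile like this before mining is a cheap way to spot columns that are mostly empty or have only one value, which would add little to a model.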
-
Resources 1
Video: Simple data mining model
http://www.sqlservercentral.com/articles/Video/65055/
Video: Data mining and Reporting Services
http://www.sqlservercentral.com/articles/Video/64190/
Data Mining Algorithms
http://msdn.microsoft.com/en-us/library/ms175595.aspx
-
Resources 2
Jamie MacLennan
http://blogs.msdn.com/b/jamiemac/
Richard Lees on BI
http://richardlees.blogspot.com/
Book: Data Mining with Microsoft SQL Server 2008
http://www.amazon.com/gp/product/0470277742?ie=UTF8&tag=sqlserverda09-20&linkCode=as2&camp=1789&creative=9325&creativeASIN=0470277742
-
Summary
Why Data Mining?
Uses
Algorithms
Demo
Hands on Lab
-
3 things
http://ericphan.info/
twitter.com/ericphan
-
Thank You!
Gateway Court Suite 10
81 - 91 Military Road
Neutral Bay, Sydney NSW 2089
AUSTRALIA
ABN: 21 069 371 900
Phone: + 61 2 9953 3000
Fax: + 61 2 9953 3105
www.ssw.com.au