Data Mining
Transcript of Data Mining
Data Mining
Chase Repp

Overview of what I'm covering

What is Data Mining?
knowledge discovery
searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within

What is Data Mining?
Data mining differs from database querying in the following way: a database query asks "What company purchased $100,000 worth of widgets last year?" while data mining asks "What company is likely to purchase over $100,000 of widgets next year, and why?"

History of Data Mining
coined in the 1960s
Data mining was used to find basic information from collections of data, such as total revenue over the last three years.
classic statistics
artificial intelligence
machine learning
Had a different meaning: at that time, the term described the practice of manually wading through data and finding patterns, whether or not they were useful.

Knowledge Discovery Process
Phase 1 Data Integration - Collect data from sources
Phase 2 Data Selection - Select useful data
Phase 3 Data Cleaning - Rid data of errors, missing values, and inconsistent data
Phase 4 Data Transformation - Normalization, smoothing, and other forms appropriate for data mining
Phase 5 Data Mining - Apply mining techniques to discover patterns
Phase 6 Pattern Evaluation / Presentation - Visualization and removal of redundant patterns
Phase 7 Knowledge Discovery - Use the results to make decisions
The order of the first three phases is somewhat debatable. It depends on whether you want to clean the data before you integrate it.
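As a rough illustration of Phases 3 and 4, the sketch below drops incomplete and duplicate records and then min-max normalizes one numeric field. The records, the field names (customer, revenue), and the choice of min-max scaling are all assumptions made for the example; the slides do not prescribe a particular cleaning or normalization method.

```python
# Hypothetical raw records (not from the slides); None marks a missing value.
raw = [
    {"customer": "A", "revenue": 120.0},
    {"customer": "B", "revenue": None},   # missing value
    {"customer": "C", "revenue": 80.0},
    {"customer": "A", "revenue": 120.0},  # exact duplicate
    {"customer": "D", "revenue": 200.0},
]


def clean(records):
    """Phase 3 (sketch): drop records with missing values and exact duplicates."""
    seen = set()
    cleaned = []
    for rec in records:
        if any(value is None for value in rec.values()):
            continue
        key = tuple(sorted(rec.items()))
        if key in seen:
            continue
        seen.add(key)
        cleaned.append(rec)
    return cleaned


def min_max_normalize(records, field):
    """Phase 4 (sketch): rescale one numeric field to the range [0, 1]."""
    values = [rec[field] for rec in records]
    low, high = min(values), max(values)
    span = high - low
    return [
        {**rec, field: (rec[field] - low) / span if span else 0.0}
        for rec in records
    ]


cleaned = clean(raw)
normalized = min_max_normalize(cleaned, "revenue")
for rec in normalized:
    print(rec)
```

Cleaning before normalizing matters here: a missing or duplicated value would otherwise distort the minimum, maximum, or later counts.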

Categories
Predictive Data Mining
Target value
Future trends
Descriptive Data Mining
No target value
Focuses on relations

Predictive
focuses on discovering the relationship among independent variables and the relationship between dependent and independent variables
used to forecast specific things

Descriptive
describes a data set in a brief but comprehensive way and gives interesting characteristics of the data without having any predefined target
Focus on relations

Association
patterns are discovered based on a relationship of a specific item with other items in the same transaction
Descriptive
Example: groceries

Classification
classifies each item in a set of data into one of a predefined set of classes or groups
Often used with machine learning
Predictive
Example: cat or dog person?

Clustering
Unlike classification, the clustering technique also defines the classes itself and then puts objects in them
Descriptive
Example: a library

Regression
used to predict numbers from data sets that have known target values
Predictive
Example: sales, distance, temperature, value, etc
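To make the regression idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a number. The monthly sales figures are made up for illustration, and a straight-line model is only one of many regression techniques; the slides do not name a specific one.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data: month number -> sales (the known target values).
months = [1, 2, 3, 4]
sales = [10.0, 12.0, 15.0, 17.0]

slope, intercept = fit_line(months, sales)
forecast = slope * 5 + intercept  # predicted sales for month 5
print(slope, intercept, forecast)
```

The known target values (past sales) are what make this predictive rather than descriptive: the model is fitted to them and then applied to an unseen input.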

Sequential Patterns
discovers frequent sequences or subsequences as patterns in a sequence database
Descriptive
Derived from association mining
What is a sequence database? It stores a number of records as sequences of ordered events, which may or may not have a definite notion of time.

Sequential Pattern Mining
The main sequential pattern mining techniques fall into three categories:
Apriori-based
Pattern-growth
Early-pruning

Apriori-based
follow the apriori property: all nonempty subsets of a frequent itemset must also be frequent
if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
AprioriAll, GSP, PSP, and SPAM
frequent = a set must have support that is at least the specified minimum support
Support? = (number of transactions containing the itemset) / (total number of transactions)

Transaction data
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
Assume:
minsup = 30%
minconf = 80%
An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7], about 43%
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]

Apriori Algorithm
Two steps:
Find all itemsets that have minimum support (frequent itemsets).
Use frequent itemsets to generate rules.
E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
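The support and confidence figures for the transaction data t1 to t7 can be checked with a short sketch. The function names (support, confidence, rules_from) are illustrative and not from the slides; confidence is taken as the support of the whole itemset divided by the support of the rule's left-hand side, which is the standard definition and agrees with the conf = 3/3 values shown.

```python
from fractions import Fraction
from itertools import combinations

# Transaction data t1..t7 as listed on the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support(itemset):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


def rules_from(itemset, minconf):
    """All rules X -> (itemset - X) whose confidence is at least minconf."""
    rules = []
    for size in range(1, len(itemset)):
        for left in combinations(sorted(itemset), size):
            antecedent = frozenset(left)
            consequent = itemset - antecedent
            conf = confidence(antecedent, consequent)
            if conf >= minconf:
                rules.append((antecedent, consequent, conf))
    return rules


itemset = frozenset({"Chicken", "Clothes", "Milk"})
print(support(itemset))
rules = rules_from(itemset, Fraction(80, 100))  # minconf = 80%
for antecedent, consequent, conf in rules:
    print(sorted(antecedent), "->", sorted(consequent), conf)
```

With minconf = 80%, three rules from this itemset pass, including the two shown on the slide; rules with Chicken or Milk alone on the left-hand side fall below the threshold.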

Finding frequent itemsets
(itemset:count)
1. scan T
   C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   F1: {1}:2, {2}:3, {3}:3, {5}:3
   C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T
   C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   C3: {2,3,5}
3. scan T
   C3: {2,3,5}:2
   F3: {2,3,5}
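The scan-by-scan trace above can be reproduced with a small level-wise search over Dataset T (T100 to T400, minsup = 50%). This is a minimal sketch rather than an optimized Apriori implementation: the function name apriori is illustrative, and candidates are generated by a simple union-based join followed by the subset prune.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {frozenset: count} for every itemset with support >= minsup."""
    min_count = minsup * len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # One scan of T: count the transactions containing each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: cnt for c, cnt in counts.items() if cnt >= min_count}  # Fk
        frequent.update(level)
        # Join step: unions of two frequent k-itemsets that have k + 1 items.
        joined = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (apriori property): every k-subset must itself be frequent.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k))
        ]
        k += 1
    return frequent


# Dataset T from the slide: T100, T200, T300, T400.
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, 0.5)
for itemset, cnt in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), cnt)
```

The prune step is where the apriori property pays off: {1,2,3} and {1,3,5} are discarded without being counted, because {1,2} and {1,5} are not in F2, which leaves {2,3,5} as the only candidate in C3.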
Dataset T (minsup = 50%)
TID   Items
T100  1, 3, 4
T200  2, 3, 5
T300  1, 2, 3, 5
T400  2, 5

Pattern-growth
divide-and-conquer strategy
to focus the search on a restricted portion of the initial database and generate as few candidate sequences as possible
FreeSpan, PrefixSpan, WAP-mine, and FS-Miner
Sequence databases are recursively projected into a set of smaller projected databases based on the current sequential pattern(s). The sequential patterns are grown in each projected database by exploring only locally frequent fragments.
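The recursive projection described above can be sketched in the style of PrefixSpan. This is a simplified version that assumes each event in a sequence is a single item (no itemsets within an event), and the three input sequences are hypothetical, chosen only to keep the output small.

```python
def prefixspan(sequences, min_count):
    """Return {pattern tuple: support count} for sequences of single items."""
    results = {}

    def grow(prefix, projected):
        # Count each item once per projected sequence (locally frequent items).
        counts = {}
        for seq in projected:
            for item in set(seq):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            if counts[item] < min_count:
                continue
            pattern = prefix + (item,)
            results[pattern] = counts[item]
            # Project: keep only the suffix after the item's first occurrence.
            new_projected = [
                seq[seq.index(item) + 1:] for seq in projected if item in seq
            ]
            grow(pattern, new_projected)

    grow((), [tuple(s) for s in sequences])
    return results


# Hypothetical sequence database (not the one from the slides).
sequences = ["abdc", "acb", "abc"]
patterns = prefixspan(sequences, min_count=2)
for pattern, count in sorted(patterns.items()):
    print("".join(pattern), count)
```

No candidate sequences are generated and tested against the full database; each recursive call only looks at the suffixes that remain after the current pattern, which is the restricted portion of the database the slide refers to.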

Early-pruning
utilize a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible
LAPIN, HVSM, and DISC-all
If an item's last position is smaller than the current prefix position, the item cannot appear behind the current prefix in the same sequence.
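The last-position rule above can be illustrated with a small sketch. It is not the LAPIN algorithm itself, only the position check it relies on; the helper names and the three sequences are hypothetical, and the sketch assumes one item per position, so an item whose last position equals the prefix position is also treated as unable to follow the prefix.

```python
def last_positions(sequence):
    """Map each item to the index of its last occurrence in the sequence."""
    return {item: index for index, item in enumerate(sequence)}


def prefix_end(sequence, prefix):
    """Index where the earliest match of `prefix` ends, or None if absent."""
    pos = -1
    for item in prefix:
        try:
            pos = sequence.index(item, pos + 1)
        except ValueError:
            return None
    return pos


def extension_support(sequences, tables, prefix, item):
    """Count sequences where `item` can still appear behind `prefix`.

    Only the precomputed last-position tables are consulted; the part of
    each sequence after the prefix is never scanned.
    """
    count = 0
    for seq, table in zip(sequences, tables):
        end = prefix_end(seq, prefix)
        if end is not None and table.get(item, -1) > end:
            count += 1
    return count


# Hypothetical sequence database (not the one from the slides).
sequences = ["abdc", "acb", "abc"]
tables = [last_positions(s) for s in sequences]  # built once, reused for every check

print(extension_support(sequences, tables, "ab", "c"))
print(extension_support(sequences, tables, "a", "a"))
```

In the second sequence the last position of c (index 1) is smaller than the position where the prefix ab ends (index 2), so that sequence is ruled out by one comparison.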

Web Mining
searching for patterns in data through
content mining - Search engines
structure mining - Hyperlinks (HITS / PageRank)
usage mining - User's browser data and forms submitted
Content mining is used to examine data collected by search engines.
Structure mining works at the hyperlink level (HITS and PageRank).
Usage mining is used to examine data related to a particular user's browser as well as data gathered by forms that the user may have submitted during Web transactions.

Web Mining
One use is finding user navigational patterns on the World Wide Web by extracting knowledge from web logs

Example
An example of applying sequential pattern mining
S = {a, b, c, d, e, f}
[P1,] [P2,] [P3,] [P4,]
Frequent pattern of abac sites
Web access sequence database for 4 users

Visual Data Mining
combines traditional mining methods and information visualization techniques
user is directly involved
VDMS - simplicity, reliability, reusability, availability, and security
Simplicity is the most important aid in guiding the user through the process of knowledge discovery

Visual Data Mining
http://www.youtube.com/user/quiterian
http://www.youtube.com/watch?v=MtJ4Xa4-J8g
http://www.youtube.com/watch?v=_8HzwQCFFfw
Questions?