Data Mining

Chase Repp

Overview of what I'm covering

What is Data Mining?

Knowledge discovery

Searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within

Data mining differs from database querying in the following manner: database querying asks "What company purchased $100,000 worth of widgets last year?" while data mining asks "What company is likely to purchase over $100,000 of widgets next year, and why?"

History of Data Mining

The term was coined in the 1960s.

Data mining was used to find basic information from collections of data, such as total revenue over the last three years.

Classic statistics, artificial intelligence, machine learning

The term had a different meaning at first: during this time, it described the practice of manually wading through data and finding patterns, whether or not they were useful.

Knowledge Discovery Process

Phase 1 Data Integration - Collect data from sources
Phase 2 Data Selection - Select useful data
Phase 3 Data Cleaning - Rid data of errors, missing values, inconsistent data
Phase 4 Data Transformation - Normalization, smoothing, other forms appropriate for data mining
Phase 5 Data Mining - Apply mining techniques to discover patterns
Phase 6 Pattern Evaluation / Presentation - Visualization and removing redundant patterns
Phase 7 Knowledge Discovery - Use the discovered knowledge to make decisions

The order of the first three phases is somewhat debatable. It depends on whether you want to clean the data before you integrate it.
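As a rough illustration of Phases 3 and 4, the sketch below drops records with missing values and then applies min-max normalization. The function names and the sample records are hypothetical and not from the presentation; real cleaning steps depend heavily on the data.

```python
def clean(rows):
    """Phase 3 (sketch): keep only records whose values are all numeric."""
    return [row for row in rows
            if all(isinstance(v, (int, float)) for v in row)]


def min_max_normalize(rows):
    """Phase 4 (sketch): rescale every column to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0
         for v, lo, hi in zip(row, lows, highs)]
        for row in rows
    ]


# Hypothetical records: (age, yearly spend); None marks a missing value.
raw = [(25, 200.0), (40, None), (55, 800.0), (35, 500.0)]
normalized = min_max_normalize(clean(raw))
print(normalized)
```

Min-max scaling is only one of several possible transformations; it is used here because it is short enough to read in one glance.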

Categories

Predictive Data Mining
Target value
Future trends

Descriptive Data Mining
No target value
Focuses on relations

Predictive: focuses on discovering a relationship between independent variables and a relationship between dependent and independent variables. Used to forecast specific things.

Descriptive: describes a data set in a brief but comprehensive way and gives interesting characteristics of the data without having any predefined target. Focuses on relations.

Association

Patterns are discovered based on a relationship of a specific item with other items in the same transaction

Descriptive

Example: groceries

Classification

Used to classify each item in a set of data into one of a predefined set of classes or groups

Often used with machine learning

Predictive

Example: cat or dog person?

Clustering

Unlike classification, the clustering technique also defines the classes itself and then puts objects in them

Descriptive

Example: a library

Regression

Used to predict numbers from data sets that have known target values

Predictive

Example: sales, distance, temperature, value, etc
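To make the regression idea concrete, here is a minimal sketch that fits a straight line by ordinary least squares and uses it to predict a number. The `fit_line` helper and the spend/sales figures are made up for illustration; the presentation does not name a particular regression method.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical training data with known target values.
spend = [1.0, 2.0, 3.0, 4.0]   # independent variable
sales = [3.0, 5.0, 7.0, 9.0]   # known target values

slope, intercept = fit_line(spend, sales)
predicted = slope * 5.0 + intercept  # forecast for an unseen input
print(slope, intercept, predicted)
```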

Sequential Patterns

Discovers frequent sequences or subsequences as patterns in a sequence database

Descriptive

Derived from association mining

What is a sequence database? It stores a number of records as sequences of ordered events that may or may not have a definite notion of time.

Sequential Pattern Mining

There are three categories that the main sequential pattern mining techniques fall into:

Apriori-based
Pattern-growth
Early-pruning

Apriori-based

Follow the Apriori property: all nonempty subsets of a frequent itemset must also be frequent

If {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets

Examples: AprioriAll, GSP, PSP, and SPAM

Frequent = a set whose support is greater than or equal to the specified minimum support

Support? The fraction of the transactions that contain the itemset

Transaction data

t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes

Assume:
minsup = 30%
minconf = 80%

An example frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7], about 43%

Association rules from the itemset:
Clothes -> Milk, Chicken [sup = 3/7, conf = 3/3]
Clothes, Chicken -> Milk [sup = 3/7, conf = 3/3]

Apriori Algorithm

Two steps:
1. Find all itemsets that have minimum support (frequent itemsets).
2. Use frequent itemsets to generate rules.

E.g., a frequent itemset: {Chicken, Clothes, Milk} [sup = 3/7]

and one rule from the frequent itemset: Clothes -> Milk, Chicken [sup = 3/7, conf = 3/3]
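The support and confidence figures quoted for the seven transactions t1 to t7 can be checked with a short sketch. The helper names `support` and `confidence` are illustrative choices, and exact fractions are used so the values match the 3/7 and 3/3 on the slide.

```python
from fractions import Fraction

# The seven transactions t1..t7 from the transaction data slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent):
    """support(antecedent and consequent together) / support(antecedent)."""
    return support(antecedent | consequent) / support(antecedent)


rule_sup = support({"Chicken", "Clothes", "Milk"})
rule_conf = confidence({"Clothes"}, {"Milk", "Chicken"})
print(rule_sup, rule_conf)  # 3/7 1
```

Both values clear the assumed thresholds: 3/7 is above minsup = 30%, and a confidence of 1 is above minconf = 80%.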

Finding frequent itemsets (itemset:count)

1. scan T
   C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   F1: {1}:2, {2}:3, {3}:3, {5}:3
   C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T
   C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   C3: {2,3,5}
3. scan T
   C3: {2,3,5}:2
   F3: {2,3,5}
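The level-wise search traced above can be sketched in a few lines. This is a simplified version under stated assumptions: it reuses the four transactions of dataset T with minsup = 50%, and it generates candidates by taking unions of frequent itemsets and pruning with the Apriori property, which is simpler than the prefix-based join of the original algorithm but yields the same candidates here. The function name `apriori` is an illustrative choice.

```python
from itertools import combinations


def apriori(transactions, minsup):
    """Return {itemset: count} for every frequent itemset (step 1 only)."""
    min_count = minsup * len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan T: count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t)
                  for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_count}  # Fk
        frequent.update(level)
        # Build C(k+1): join frequent k-itemsets, then prune any candidate
        # that has an infrequent k-subset (the Apriori property).
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        candidates = [c for c in joined
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))]
    return frequent


# Dataset T from the slide: T100, T200, T300, T400.
T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, 0.5)
for itemset, count in sorted(result.items(),
                             key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)
```

The sketch covers only the first of the two steps (finding frequent itemsets); rule generation from those itemsets is left out.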

Dataset T (minsup = 50%)

TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

Pattern-growth

Divide-and-conquer strategy

Focuses the search on a restricted portion of the initial database and generates as few candidate sequences as possible

Examples: FreeSpan, PrefixSpan, WAP-mine, and FS-Miner

Sequence databases are recursively projected into a set of smaller projected databases based on the current sequential pattern(s). The sequential patterns are grown in each projected database by exploring only locally frequent fragments.
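The recursive projection described above can be illustrated with a simplified, PrefixSpan-style sketch. It assumes every event is a single item (no itemsets inside a sequence), and the three-sequence database is hypothetical. It is not a faithful implementation of any of the named algorithms.

```python
from collections import Counter


def prefix_span(database, min_count, prefix=()):
    """Yield (pattern, support count), growing patterns one item at a time."""
    # Count each item once per sequence of the current projected database.
    counts = Counter()
    for seq in database:
        counts.update(set(seq))
    for item, count in sorted(counts.items()):
        if count < min_count:
            continue  # not locally frequent, so this branch is never grown
        pattern = prefix + (item,)
        yield pattern, count
        # Project: keep only what follows the first occurrence of the item.
        projected = [seq[seq.index(item) + 1:]
                     for seq in database if item in seq]
        yield from prefix_span(projected, min_count, pattern)


# Hypothetical sequence database; each character is one event.
db = ["abc", "acb", "ab"]
patterns = dict(prefix_span(db, min_count=2))
print(patterns)
```

Each recursive call sees a smaller database, and no candidate sequence is generated unless its last item is frequent in that projected database.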

Early-pruning

Utilizes a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible

Examples: LAPIN, HVSM, and DISC-all

If an item's last position is smaller than the current prefix position, the item cannot appear behind the current prefix in the same sequence.
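The last-position check can be sketched as follows. This only illustrates the pruning test, not LAPIN or any other named algorithm; the helper names and the sample sequence are hypothetical. The sketch also treats an equal position as "cannot follow", since that position is occupied by the prefix itself.

```python
def last_positions(sequence):
    """Map each item to the index of its last occurrence in the sequence."""
    # Later indices overwrite earlier ones, so the last occurrence wins.
    return {item: pos for pos, item in enumerate(sequence)}


def can_follow(item, prefix_end, last_pos):
    """True if the item still occurs somewhere after index prefix_end."""
    return last_pos.get(item, -1) > prefix_end


sequence = "abcab"                   # hypothetical sequence of events
last_pos = last_positions(sequence)  # a -> 3, b -> 4, c -> 2
prefix_end = 2                       # prefix "ac" matched at indices 0 and 2

for item in "abcd":
    print(item, can_follow(item, prefix_end, last_pos))
```

Because the table of last positions is built once per sequence, the test is a single comparison and no rescan of the sequence is needed to rule an item out.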

Web Mining

Searching for patterns in data through:

Content mining - search engines
Structure mining - hyperlinks (HITS / PageRank)
Usage mining - user's browser data and forms submitted

Content mining is used to examine data collected by search engines. Structure mining works at the hyperlink level (HITS and PageRank). Usage mining is used to examine data related to a particular user's browser as well as data gathered by forms that the user may have submitted during Web transactions.

One use is finding user navigational patterns on the World Wide Web by extracting knowledge from web logs.

Example

An example of applying sequential pattern mining

S = {a, b, c, d, e, f}

[P1,] [P2,] [P3,] [P4,]

Frequent pattern of abac sites

Web access sequence database for 4 users

Visual Data Mining

Combines traditional mining methods and information visualization techniques

User is directly involved

VDMS - simplicity, reliability, reusability, availability, and security

Simplicity is the most important aid in guiding the user through the process of knowledge discovery

http://www.youtube.com/user/quiterian

http://www.youtube.com/watch?v=MtJ4Xa4-J8g

http://www.youtube.com/watch?v=_8HzwQCFFfw

Questions?
