Data mining 2012 generalwithmethods

Michael Gilman, Ph.D.

Copyright Data Mining Technologies Inc. 2012

www.data-mine.com 631-692-4400 ext. 100


What is Data Mining?

Data mining is an information extraction activity whose goal is to discover hidden facts contained in databases. It finds patterns and subtle relationships in data and infers rules and generalizations that allow the prediction of future results. To be a true knowledge discovery method, a data mining tool should unearth information automatically.

pres071911a 2

Overview

The purpose of this presentation is to introduce a new and powerful methodology and associated software that overcome many of the limitations of the other data mining methods in use today.


Background

The manual extraction of patterns from data has occurred for centuries. Early methods of identifying patterns in data included statistical methods such as Bayes' theorem (1700s) and regression analysis (1800s). The proliferation, ubiquity and increasing power of computer technology have increased data collection, storage and manipulation. As data sets have grown in size and complexity, direct hands-on data analysis has increasingly been augmented with indirect, automatic data processing. This has been aided by other discoveries in computer science, such as neural networks, clustering, genetic algorithms (1950s), decision trees (1960s) and support vector machines (1990s). Data mining is the process of applying these methods to data with the intention of uncovering hidden patterns. (Wikipedia)


How Does Data Mining Work?

Data mining involves building predictive models that give a better understanding of how to run an enterprise more effectively.

Building a predictive model takes several steps. Before we outline these steps, here is a real-world problem.


Question:

How can we keep healthcare quality high and keep costs down?

Input Data: File containing clinical data and costs


Steps in Data Mining

Define the problem goals

Identify data sources


Then build the model

Mine Data

Data Model

If-Then Rules


Results

A model containing rules that show the best-of-breed (BOB) treatment for each case, and why.

If diagnosis = Congestive HF and Age = 60-70 and previous bypass = yes and . . .

Then BOB Treatment = aortic stent
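
As an illustration only, an if-then rule like the one above could be held as plain data and checked against a patient case. The dictionary layout, the attribute names, and the helper `rule_fires` are assumptions made for this sketch, not the format the modeling software uses. Only the three conditions visible on the slide are included; the elided ones are left out.

```python
# Hypothetical representation of one if-then rule. Only the three
# conditions shown on the slide are listed; the rest are elided there.
rule = {
    "if": {
        "diagnosis": "Congestive HF",
        "age": "60-70",
        "previous_bypass": "yes",
    },
    "then": ("BOB Treatment", "aortic stent"),
}


def rule_fires(rule, case):
    """Return True when every condition in the rule matches the case."""
    return all(case.get(attr) == value for attr, value in rule["if"].items())


# Made-up case used only to exercise the rule.
case = {"diagnosis": "Congestive HF", "age": "60-70", "previous_bypass": "yes"}
if rule_fires(rule, case):
    target, value = rule["then"]
    print(f"{target} = {value}")
```

Because each condition is readable on its own, a rule that fires also explains why it fired.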


Steps in a Data Mining Project

1. Define the business or scientific problem

Example: Which of my current customers are likely to become inactive in the next 6 months?

2. Gather historical data file

Prepare a file of customers (present and past) that includes predictive descriptors such as start date, date of first sale, date of last sale, sales by month, how acquired, etc.

Include current status (active or inactive) for each customer.


Steps in a Data Mining Project (continued)

3. Cleanse the Data

Data cleansing reduces noisy and missing data and removes erroneous data.

4. Add Derived Attributes

Create additional variables from the original data if necessary (example: compute customer account duration from start date and current date).

5. Create Test and Holdout Files

Randomly separate the original file into two parts, called the test and holdout files.

Build the predictive model from the test file with modeling software; the holdout file is set aside for validation in the next step.
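
Steps 4 and 5 can be sketched in a few lines. This is a minimal illustration under stated assumptions: the customer records are made up, and the function names, the 30% holdout fraction, and the fixed seed are choices made for the example, not values from the presentation. `build_file` stands for the part the model is built from.

```python
import random
from datetime import date


def account_duration_days(start_date, current_date):
    """Derived attribute: account age in days."""
    return (current_date - start_date).days


def split_records(records, holdout_fraction=0.3, seed=0):
    """Randomly split records into (build, holdout) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    n_holdout = round(len(shuffled) * holdout_fraction)
    return shuffled[n_holdout:], shuffled[:n_holdout]


# Made-up customer file: ten customers with different start dates.
customers = [{"id": i, "start_date": date(2010, 1, 1 + i)} for i in range(10)]

today = date(2012, 1, 1)
for customer in customers:
    customer["duration_days"] = account_duration_days(customer["start_date"], today)

build_file, holdout_file = split_records(customers)
print(len(build_file), len(holdout_file))
```

The derived attribute is added before the split so that both files carry the same columns.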


Steps in a Data Mining Project (continued)

6. Validate the Model

Validation uses data that was not used when building the model: the holdout file defined previously. The learned patterns are applied to this holdout file and the resulting output is evaluated for accuracy.

For example, a data mining algorithm trying to distinguish spam from legitimate emails would be trained on a training set of sample emails. Once trained, the learned patterns would be applied to a separate set of emails on which the algorithm had not been trained. The accuracy of these patterns can then be measured from how many emails they correctly classify.
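
The accuracy measurement described above amounts to counting correct classifications on held-out examples. The sketch below assumes a stand-in `predict` function (a single keyword test, far simpler than any learned model) and four made-up emails; both are hypothetical and exist only to make the calculation concrete.

```python
def accuracy(predict, holdout):
    """Fraction of holdout examples whose predicted label matches the true one."""
    correct = sum(1 for text, label in holdout if predict(text) == label)
    return correct / len(holdout)


# Hypothetical stand-in for a learned model: flag emails containing "free".
def predict(email_text):
    return "spam" if "free" in email_text.lower() else "legitimate"


# Made-up holdout emails with their true labels.
holdout = [
    ("Claim your FREE prize now", "spam"),
    ("Meeting moved to 3pm", "legitimate"),
    ("Free shipping on your order", "legitimate"),
    ("Quarterly report attached", "legitimate"),
]

print(accuracy(predict, holdout))  # 3 of 4 correct -> 0.75
```

The third email is misclassified, which is the kind of error that only shows up when the model is scored on data it was not built from.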


Steps in a Data Mining Project (continued)

At this point a model has been created and it can now be used.

7. Use the Model to Make Predictions

The final step of knowledge discovery from data is to use the model produced by the data mining algorithms. As new data come in, the model is applied to them to make predictions.


Comparison of Methods

Nuggets offers benefits that the other methods don’t offer. Here are a few:

Handles missing data

Handles very large numbers of predictor attributes

Fast model development

Able to model small data patterns missed by other methods

Handles a wide variety of data types

Doesn’t require highly trained specialists

Each of the principal methods will now be compared to Nuggets.


Principal Data Mining Techniques: Industry Standard Methods

Statistics

Neural Nets

Decision Trees

Following is a comparison of Nuggets with these principal competitors.


Nuggets

Nuggets is a proprietary technology that uses proprietary search algorithms to intelligently prospect data for valid hypotheses.

In the act of searching, the algorithms “learn” about the training data as they proceed.

The result is a very fast and efficient discovery strategy that does not preclude any potential rule or generalization from being found. This document outlines its advantages over its competitors in providing useful and profitable information from the vast store of data that are being accumulated at an ever-increasing rate.


Statistics Methods Pros/Cons

Statistics Pros

Statistical analysis is sometimes a good ‘first step’ in understanding data. These methods deal well with numerical data where important mathematical facts, such as the underlying probability distributions of the data, are known. However, in today’s world these mathematical facts are rarely known. These methods are not as good with nominal data values such as “good”, “better”, “best” in the case of a preference attribute or “Europe”, “North America”, “Asia” or “South America” in the case of a location attribute.


Method Pros/Cons

Statistics (continued)

Some of the statistical methods commonly used are regression analysis, correlation, CHAID analysis, hypothesis testing, and discriminant analysis.

Statistical analysis is sometimes a good “first step” in understanding data. These methods deal well with numerical data where the underlying probability distributions of the data are known. This is not often the case in real-world problems.


Statistics Methods Pros/Cons (cont.)

Nuggets Advantages Over Statistics

Statistical methods require statistical expertise, or a project person well versed in statistics who is heavily involved. Such methods require difficult-to-verify statistical assumptions. They suffer from the “black box aversion syndrome”. This means that non-technical decision makers, those who will either accept or reject the results of the study, are often unwilling to make important decisions based on a technology that gives them answers but does not explain how it got the answers.


Statistics Methods Pros/Cons (cont.)

Nuggets Advantages Over Statistics

To tell a non-statistician CEO that she or he must make a crucial business decision because of a favorable R statistic or some other arcane statistical reason is not usually well received. With Nuggets® you can be told exactly how the conclusion was arrived at.

Another problem is that statistical methods are valid only if certain assumptions about the data are met. Some of these assumptions are: linear relationships between pairs of variables, non-multicollinearity, normal probability distributions and independence of samples. If you do not validate these assumptions, whether because of time limitations or because you are not familiar with them, your analysis may be faulty and your results may not be valid. Even if you know about them, you may not have the time or information to verify them.


Neural Networks

This is a popular technology, particularly in the financial community. This method was originally developed in the 1940s to model biological nervous systems in an attempt to mimic human thought processes.


Method Pros/Cons - Neural Nets


Pros

The end result of a Neural Net project is a mathematical model of the process. It deals primarily with numerical attributes such as age, income, height, etc., but not as well with nominal data such as state, brand preference, vehicle make, etc.



Nuggets Advantages

There is still much controversy regarding the efficacy of Neural Nets. One major objection to the method is that the development of a Neural Net model is partly an art and partly a science, in that the results often depend on the individual who built the model. That is, the model form (called the network topology), and hence the results, may differ from one researcher to another for the same data.



There is also the problem with Neural Nets of “overfitting”, which results in good prediction of the data used to build the model but bad results with new data. Neural Nets often use a sigmoid function in their computations. This is a mathematical function resembling the shape of the letter “S”. It is questionable whether there is any theoretical justification for this somewhat arbitrary choice, which makes the approach somewhat ad hoc.

Another issue is that the modeling results produced by a Neural Net method are not intuitive. The method is called a “black box” to indicate the lack of intuitive understanding of its results. Neural Nets are still in use but becoming less popular due to these issues.
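
For reference, the S-shaped function mentioned above can be written in a few lines. The slides do not say which sigmoid is meant; this sketch assumes the logistic function, 1 / (1 + e^(-x)), which is one widely used choice.

```python
import math


def sigmoid(x):
    """Logistic function: maps a real x into the interval (0, 1).

    Intended for moderate inputs; very large negative x would overflow
    math.exp in this simple form.
    """
    return 1.0 / (1.0 + math.exp(-x))


# The outputs rise smoothly from near 0 to near 1, passing 0.5 at x = 0.
for x in (-4, -2, 0, 2, 4):
    print(x, round(sigmoid(x), 3))
```

Plotting these values against x gives the “S” shape the slide refers to.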

Method Pros/Cons - Decision Trees

Decision Trees (CART, CHAID, etc.)

Decision tree methods are techniques for partitioning a training file into a tree representation. The starting node is called the root node. Depending upon the results of a test, this node is partitioned into two or more subsets. Each node is then further partitioned until a tree is built. This tree can be mapped into a set of rules. These rules, in the form of a data tree, are used to generate forecasts.
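
The mapping from a tree to a set of rules can be sketched as follows: each path from the root node to a leaf becomes one if-then rule. The nested-dictionary tree, its attribute names, and the function `tree_to_rules` are all hypothetical, chosen for illustration; they do not reproduce CART or CHAID, which also decide how to partition each node.

```python
# Hypothetical, already-built tree: internal nodes test one attribute,
# leaves hold a forecast.
tree = {
    "attribute": "how_acquired",
    "branches": {
        "referral": {"forecast": "active"},
        "advertisement": {
            "attribute": "sales_last_6_months",
            "branches": {
                "none": {"forecast": "inactive"},
                "some": {"forecast": "active"},
            },
        },
    },
}


def tree_to_rules(node, conditions=()):
    """Return one (conditions, forecast) pair per root-to-leaf path."""
    if "forecast" in node:
        return [(list(conditions), node["forecast"])]
    rules = []
    for value, child in node["branches"].items():
        rules += tree_to_rules(child, conditions + ((node["attribute"], value),))
    return rules


for conds, forecast in tree_to_rules(tree):
    tests = " and ".join(f"{attr} = {value}" for attr, value in conds)
    print(f"If {tests} Then status = {forecast}")
```

Every rule produced this way starts with a test on the root attribute, so the rule set is limited to patterns that share the tree's early splits.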



Nuggets Advantages

By far the most important negative for decision trees is that they are forced to make decisions along the way based on limited information, which implicitly leaves out of consideration the vast majority of potential patterns in the training file. This approach may leave valuable patterns undiscovered, since decisions made early in the process will preclude some good rules from being discovered later. This is called “greedy optimization” and lessens the accuracy of the resulting model. Furthermore, the large numbers of predictor attributes that exist in most of today’s data sets are not handled by decision trees.



Nuggets Advantages

Nuggets does not make these greedy decisions. Instead it “implicitly” searches all possible patterns and thus is able to find patterns that are useful but that wouldn’t be found with decision trees.


Summary of Comparison With Other Methods

Nuggets Advantages

Nuggets offers many advantages over other methods in common use. A few were presented here.

Nuggets’ advantages vary from method to method, and most are due to the assumptions these older methods require, which limit their effectiveness.

Nuggets is designed to circumvent these disadvantages and offer a superior methodology that can work with the challenges of the large number of complex databases that exist in today’s world.

