
University of Arkansas

Data Mining with Teradata™ Warehouse Miner

Jim Kashner, CTO Data Mining

11/30/2004 Copyright 2004 Teradata, a division of NCR 2

The Empirical Method and Decision Support

• all of the information in this presentation is “Jim’s opinions numbers 8 through 224” (for today) …

• a framework for making decisions in the presence of uncertainty

• seeks to shed light on the validity or plausibility of notions, suppositions, propositions, hypotheses

• is iterative and circular
– don’t ever finish
– just stop at some point

[Figure: cycle diagram with the labels Notion, Supposition, Proposition, Hypothesis, Refined Hypothesis, Data, Analysis, and Interpretation]

Teradata Warehouse Miner: Technology Enablers for the Data Mining Process

• the various releases of Teradata Warehouse Miner are intended to serve as very powerful technology enablers for the Data Mining Process

• but, Tools Don’t Build Models, Thoughtful People Do
– when a good tool between the ears drives the data mining process, good models are built
– when too much is asked of analytical software, the risk of spurious and invalid models rises proportionately

• but thoughtful people who build models can also be helped by having a proven and generic process to follow
– the formal Teradata Data Mining Method is one of several good processes used to conduct successful data mining projects
  • its foundation is the “tried and true” empirical method
  • it’s not a prescription, just a set of carefully constructed suggestions

Teradata Data Mining Method

[Figure: process diagram with the stages Business Issues, Architecture and Technology Preparation, Data Preparation, Analytical Modeling, and Knowledge Delivery and Deployment, together with Project Management and Knowledge Transfer]

• data mining is a very iterative process
– the linear process depicted above serves as a guide, and identifies the chunky bits of the process

Data Mining with Teradata Warehouse Miner: Teradata’s Data Mining Method – Our Process

[Figure: the phases of the method and the TWM components that support them]

• Business Question Identification and Qualification
• Architecture and Technology Preparation
• Data Preparation and PreProcessing (“Data Profiling”)
– Data Exploration
– Data Transformation
– TWM – Stats & ADS
• Model Construction and Evaluation (Analytic Modeling)
– Multivariate Statistics
– Machine Learning Algorithms
– TWM – Analytics
• Model Deployment and Maintenance (Model Deployment)
– Scoring & Evaluation
– Lifecycle Maintenance
– TWM – Deployment
• Project Management -and- Knowledge Transfer
• Highly Iterative Process

Data Mining and the Empirical Method

• data mining is not automated discovery of hidden patterns in your data

• data mining is thoughtful and technology enabled discovery of hidden patterns in your data

• welcome to the empirical method

Teradata as an Analytic Engine

• Teradata is especially well-suited to perform complex aggregations and evaluations of sets according to conditional logic
– native Teradata functions
– expressed as SQL
– where indexes cannot reasonably be expected to exist for any particular aggregation, set evaluation, or conditional logic

• analytical modeling algorithms require an engine that can perform complex aggregations and evaluations of sets according to conditional logic

• the very good fit of Teradata as an analytic engine is rather obvious after considering what analytical modeling algorithms actually do under the hood
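
To make "complex aggregations and evaluations of sets according to conditional logic" concrete, here is a minimal sketch of a single query that aggregates per customer while CASE expressions supply the conditions. It is only an illustration: Python's built-in sqlite3 stands in for Teradata so the example runs anywhere, and the table `txn` and its columns are invented for the purpose.

```python
import sqlite3

# In-memory SQLite database standing in for a Teradata table (assumption:
# the table and column names below are invented for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE txn (cust_id INTEGER, channel TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO txn VALUES (?, ?, ?)",
    [(1, "web", 20.0), (1, "store", 35.0), (2, "web", 10.0),
     (2, "web", 15.0), (3, "store", 50.0)],
)

# One pass over the table: aggregate per customer, with CASE supplying
# the conditional logic. No index on channel or amount is assumed.
sql = """
SELECT cust_id,
       COUNT(*)                                              AS n_txn,
       SUM(amount)                                           AS total_amount,
       SUM(CASE WHEN channel = 'web' THEN amount ELSE 0 END) AS web_amount,
       SUM(CASE WHEN amount >= 30 THEN 1 ELSE 0 END)         AS n_large
FROM txn
GROUP BY cust_id
ORDER BY cust_id
"""
rows = conn.execute(sql).fetchall()
for row in rows:
    print(row)
```

The query needs a full scan and a grouping step rather than an index lookup, which is the kind of work the slide attributes to an analytic engine.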

Said another way ...

Given: The following notation is used in virtually all statistical, artificial intelligence, and machine learning algorithms that denote equations used to represent and calculate data mining models:

∫ f(x) - which means sum - and - Σ f(x) - which means sum

Π f(x) - which means multiply

∈ and ∉ - which mean is, and is not, an element of (set theory)

Question: What do they all have in common?

Answer: All of these are what Teradata does better than any other engine on this planet.

Note: f(x) are other supported functions, mathematical and other, either as native Teradata functions, or those that can be expressed in SQL with Teradata extensions very efficiently.
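
As a rough illustration of how this notation maps onto SQL, the sketch below evaluates a sum, a product, and set membership tests as queries. It rests on several assumptions: sqlite3 is only a stand-in engine, the table `obs` is invented, and the `product` aggregate is registered by hand here because SQLite has no built-in one (in an engine that offers EXP and LN, a product of positive values can also be written as EXP(SUM(LN(x)))).

```python
import sqlite3

class Product:
    """User-defined aggregate: multiplies the non-NULL values in a group."""
    def __init__(self):
        self.value = 1.0

    def step(self, x):
        if x is not None:
            self.value *= x

    def finalize(self):
        return self.value

conn = sqlite3.connect(":memory:")
conn.create_aggregate("product", 1, Product)  # hypothetical helper, not a Teradata function
conn.execute("CREATE TABLE obs (id INTEGER, x REAL)")
conn.executemany("INSERT INTO obs VALUES (?, ?)",
                 [(1, 0.5), (2, 2.0), (3, 4.0), (4, 3.0)])

# Sum of f(x) becomes SUM(f(x)); product of f(x) becomes product(f(x)).
sigma, pi = conn.execute("SELECT SUM(x * x), product(x) FROM obs").fetchone()

# "is an element of" becomes IN; "is not an element of" becomes NOT IN.
members = conn.execute(
    "SELECT COUNT(*) FROM obs WHERE id IN (1, 3)").fetchone()[0]
non_members = conn.execute(
    "SELECT COUNT(*) FROM obs WHERE id NOT IN (1, 3)").fetchone()[0]

print(sigma, pi, members, non_members)
```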

Teradata Warehouse Miner is an ongoing experiment

• TeraMiner™ Stats
– June, 1999

• Teradata Warehouse Miner
– Stats, Analytics, & Deployment
– July, 2001

• Teradata Warehouse Miner
– Stats, Analytics, Deployment, & ADS (Analytical Data Set generation)
– June, 2004

• additional functionality continually in subsequent releases
– to each of these components of Teradata Warehouse Miner

• because of our success with this “experimental approach”, we continue to ask: “Why not?”
– Teradata continues to amaze us by what it can do
– our Teradata Warehouse Miner Software Engineering Team is quite amazing too

What is Teradata Warehouse Miner?

• TWM includes a set of .NET Interfaces and a User Interface
– generates and executes Teradata-specific SQL
  • ANSI SQL when possible
– instantiated by User Interface
– easily integrated into other applications (partners, custom)
– all analysis parameters, model definition, and analysis results stored in metadata
– select results or explain, or persist results in table, temporary table or view

• TWM includes several types of .NET Interfaces
– Registry independent application extensions or plug-ins
– Teradata Warehouse Miner Descriptive Statistics DLL
– Teradata Warehouse Miner ADS DLL
– Teradata Warehouse Miner Data Reorganization DLL
– Teradata Warehouse Miner Analytic Algorithm & Scoring DLLs (4)
– Teradata Warehouse Miner Matrix DLL
– Teradata Warehouse Miner Statistical Test DLL

• TWM includes a GUI for the desktop
– User interface to .NET Objects
– Queries Teradata Data Dictionary to aid in parameterizing functions
  • directly using HELP syntax
  • optionally, MDS DIM (Metadata Services Database Information Model)
– Interactive display of results – SQL, Data, Graphs, Reports

Teradata Warehouse Miner: High Level Architecture

Teradata Warehouse Miner

• Windows Interface
– build, maintain, and execute projects
– explore and manipulate results
  • tabular and graphical
– parameterize .NET APIs

• .NET APIs & ADO
– .NET Interfaces (APIs)
  • documented for developers
– ActiveX Data Objects
  • DLL interface “plug-ins”
– write all API parameters and all XML results in TWM metadata
  • stored in binary data type
– generate & submit SQL
– receive query results from Teradata and present them in user interface
– read model definition and results stored in TWM metadata to display XML reports and graphs
– read model definition in TWM metadata to score and evaluate

[Figure: architecture diagram]
Client Platform: Windows NT, 2000, XP, .NET 2003 Server
Teradata Platform: Teradata RDBMS Version 2 Release 4.1 or later
Tiers: User Interface Services; Business Services; Data Services
Components: User Interface; Visualizations; Manager; Algorithms (COM); Algorithms (.NET); Data Access; Metadata Access; Teradata ODBC; Teradata RDBMS; Teradata Metadata Services; Projects; Analyses

Teradata Warehouse Miner: Data Description Functions

• Univariate Statistics: Count; Minimum, Maximum; Modes; Mean; Standard Deviation; Standard Error; Variance; Coefficient of Variation; Skewness; Kurtosis; Uncorrected Sum of Squares; Corrected Sum of Squares

• Quantiles and Ranks: Top 10/Bottom 10 Percentiles; Deciles; Quartiles; Tertiles; Top 5/Bottom 5 Ranked Values with Counts

• Scatter Plot Analysis: 2-D and 3-D Plots of Continuous Variables

• Correlation Analysis: Quickly view pair-wise correlations among ‘n’ variables

• Values Analysis (basic data quality analysis): Data Types; Counts; # NULL Values; # Positive Values; # Negative Values; # Zeros; # Blanks; # Unique Values

• Frequency Analyses: Frequency of Discrete Variables; N-Way Cross-Tabulation; Pair-wise Cross-Tabs

• Histogram Analyses: Histograms of Continuous Variables; Options for Even Width, User Defined Widths/Boundaries, Quantile, Adaptive Binning, Overlay columns, Statistics within bins

• Overlap Analysis: Index/Key Column Consistency

Data Explorer

• Performs basic statistical analysis on a set of tables and selected columns within any Teradata database

• Intelligent decisions about which functions to perform

• Most criteria for “Intelligent” decisions can be modified by user

• Values Analysis - Every column in the set of input tables

• Univariate Statistical Analysis - Every column of numeric or date type

• Frequency Analysis - Every column whose number of unique values is at or below a threshold

• Histogram Analysis - Every numeric or date type column whose number of unique values is above a threshold

Data Visualizations

• 2D & 3D Histograms

• 2D & 3D Frequency Bar Charts

• Values Bar Charts & Circular Graphs

• Box and Whisker Plots

• Scatter Plots

• Integrated Data Explorer Graphics
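
The Values Analysis counts listed on this slide can be gathered in one SELECT. The following is a minimal sketch under stated assumptions, not the SQL that Teradata Warehouse Miner generates: sqlite3 stands in for Teradata, and the helper `values_analysis_sql`, the table `account`, and the column `balance` are all invented for illustration.

```python
import sqlite3

def values_analysis_sql(table, column):
    """Build one SELECT that profiles a numeric column.

    Hypothetical helper: identifiers are interpolated without escaping,
    so it is only suitable for trusted names in a demonstration.
    """
    return f"""
    SELECT COUNT(*)                                          AS n_rows,
           SUM(CASE WHEN {column} IS NULL THEN 1 ELSE 0 END) AS n_null,
           SUM(CASE WHEN {column} > 0 THEN 1 ELSE 0 END)     AS n_positive,
           SUM(CASE WHEN {column} < 0 THEN 1 ELSE 0 END)     AS n_negative,
           SUM(CASE WHEN {column} = 0 THEN 1 ELSE 0 END)     AS n_zero,
           COUNT(DISTINCT {column})                          AS n_unique
    FROM {table}
    """

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE account (balance REAL)")
conn.executemany("INSERT INTO account VALUES (?)",
                 [(100.0,), (-20.0,), (0.0,), (None,), (100.0,), (5.5,)])

profile = conn.execute(values_analysis_sql("account", "balance")).fetchone()
print(profile)
```

Note that COUNT(DISTINCT ...) ignores NULLs, so the unique-value count covers only the non-NULL values.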

Teradata Warehouse Miner: Data Derivation and Transformation Functions

Variable Creation

• Aggregations: Count, Average, Sum, etc.
• Windowed Aggregates/OLAP: Rank, Quantiles, Moving Sums, etc.
• Arithmetic operators/functions: +, -, *, /, MOD, **; ABS, EXP, LN, LOG, SQRT, etc.
• Trigonometric & Hyperbolic functions: COS, SIN, TAN, ACOS, etc.; COSH, SINH, TANH, ACOSH, etc.
• CASE expressions and NULL operators: valued and searched types; NULLIF, COALESCE
• Comparison operators: =, >, <, <>, <=, >=
• Logical predicates: BETWEEN…AND…, IN (expression list), etc.
• Calendar functions: day_of_week, day_of_calendar, quarter_of_year, etc.
• String functions: LOWER, UPPER, TRIM, ||, etc.
• Data Type conversion
• SQL predicates: TRUE, FALSE, NULL

Variable Dimensioning

• Simple Dimensions: Specific values; Range of values
• Combined Dimensions
• Hierarchical Dimensions: SysCalendar, etc.

Variable Transformation

• Bin Coding
• Design Coding
• Recoding
• Rescaling
• Derive: Hook to Variable Creation
• Statistical Transformations: Z-Score; Sigmoid
• NULL Value Replacement: Literal value; Mean value; Median value; Mode; Imputed values
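
Three of the transformations named above (NULL replacement by the mean, Z-Score, and Sigmoid) can be sketched in a few lines. The function names are invented, and two choices are assumptions rather than documented TWM behaviour: the population standard deviation is used for the Z-Score, and the sigmoid is taken to be the logistic function 1 / (1 + e^-x).

```python
import math
import statistics

def replace_nulls_with_mean(values):
    """Replace None entries with the mean of the values that are present."""
    present = [v for v in values if v is not None]
    mean = statistics.fmean(present)
    return [mean if v is None else v for v in values]

def z_score(values):
    """Centre on the mean and scale by the population standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # assumption: population, not sample, form
    return [(v - mean) / sd for v in values]

def sigmoid(values):
    """Squash each value into (0, 1) with the logistic function."""
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

raw = [2.0, None, 4.0, 6.0]
filled = replace_nulls_with_mean(raw)
z = z_score(filled)
s = sigmoid(z)
print(filled, z, s)
```

After the Z-Score step the column has mean 0 and standard deviation 1, and a value equal to the mean maps to 0.5 under the sigmoid.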

Teradata Warehouse Miner: Data Reorganization, Build ADS, Matrix Functions

Data Reorganization

Random Sample and Stratified Random

Partitioning

Denormalize/Pivoting

Joining: Inner, Left Outer, Right Outer, Full Outer

Build ADS

Create Final ADS

Create Metadata for Refresh

Matrix Functions

Correlation

Covariance

SSCP

Corrected SSCP
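
The four matrix functions are closely related: the SSCP matrix holds sums of squares and cross-products, the corrected SSCP subtracts the contribution of the means, and covariance and correlation are rescalings of the corrected SSCP. The sketch below shows those relationships in plain Python; the function names are invented, and the n - 1 divisor for covariance is an assumption (the sample form).

```python
import math

def sscp(rows):
    """Sums of squares and cross-products: S[i][j] = sum of x_i * x_j over rows."""
    p = len(rows[0])
    return [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]

def corrected_sscp(rows):
    """SSCP about the means: S[i][j] - (sum x_i)(sum x_j) / n."""
    n, p = len(rows), len(rows[0])
    sums = [sum(r[i] for r in rows) for i in range(p)]
    s = sscp(rows)
    return [[s[i][j] - sums[i] * sums[j] / n for j in range(p)] for i in range(p)]

def covariance(rows):
    """Sample covariance: corrected SSCP divided by n - 1 (assumed divisor)."""
    n = len(rows)
    return [[v / (n - 1) for v in row] for row in corrected_sscp(rows)]

def correlation(rows):
    """Pearson correlation: corrected SSCP scaled by its own diagonal."""
    c = corrected_sscp(rows)
    p = len(c)
    return [[c[i][j] / math.sqrt(c[i][i] * c[j][j]) for j in range(p)]
            for i in range(p)]

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 7.0)]
print(sscp(data))
print(corrected_sscp(data))
print(covariance(data))
print(correlation(data))
```

Every entry of the SSCP matrix is a plain sum over rows, so the whole family can be built from SUM aggregates.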

Teradata Warehouse Miner: Analytical Techniques, Scoring, Visualizations (1)

Analytic Algorithms (Multivariate Statistical Techniques)

• Linear Regression
– model statistics
– variable coefficients, standard errors, confidence intervals, etc.
– incremental R²
– step-wise variable selection options: forward & forward only; backward & backward only

• Factor Analysis
– Principal Component Analysis
– Principal Axis Factors
– Maximum Likelihood Factors
– Orthogonal & Oblique Rotations

• Logistic Regression
– Logit Model Coefficients, Odds Ratios and Statistics
– Model Success Analysis and Lift Tables
– step-wise variable selection options: forward & forward only; backward & backward only

Model Scoring

• Linear Regression
• Logistic Regression
• Factor Analysis
• SQL-based model scoring
– all scoring SQL is provided

Supporting Visualizations

• Scatter Plot
• Lift Chart
• Regression Plots
• Factor Pattern
• Scree Plot

Multivariate Diagnostics

• Extensive Collinearity Diagnostics
• Automated Identification of Constants
• Row level diagnostics, and much more…
• SQL-based model evaluation
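
"SQL-based model scoring" means a fitted model is applied by a query, so the rows never leave the database. The sketch below shows the idea for a linear regression; it is not the SQL that TWM emits. The helper `linear_scoring_sql`, the table `customer`, its columns, and the coefficients are all invented, and sqlite3 stands in for Teradata.

```python
import sqlite3

def linear_scoring_sql(table, key, intercept, coefficients, score_name="score"):
    """Build a SELECT that scores every row with a fitted linear model.

    Hypothetical helper: `coefficients` maps column name to coefficient,
    and identifiers are interpolated without escaping.
    """
    terms = [repr(float(intercept))]
    terms += [f"{float(b)!r} * {col}" for col, b in coefficients.items()]
    return f"SELECT {key}, {' + '.join(terms)} AS {score_name} FROM {table}"

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (cust_id INTEGER, income REAL, age REAL)")
conn.executemany("INSERT INTO customer VALUES (?, ?, ?)",
                 [(1, 10.0, 30.0), (2, 20.0, 40.0)])

# Invented coefficients standing in for a model read from metadata.
sql = linear_scoring_sql("customer", "cust_id", 1.5, {"income": 0.2, "age": -0.05})
scores = conn.execute(sql + " ORDER BY cust_id").fetchall()
print(sql)
print(scores)
```

A logistic regression score follows the same pattern, with the linear predictor z wrapped as 1 / (1 + EXP(-z)).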

Teradata Warehouse Miner: Analytical Techniques, Scoring, Visualizations (2)

Analytic Algorithms (AI and Machine Learning Techniques)

• Decision Tree/Rule Induction
– gini / regression (i.e., CART)
– Entropy (i.e., C4.5 / C5.0)
– CHAID
– pruning: gini algorithm pruning; gain ratio algorithm pruning; manual pruning

• Clustering
– K-Means
– Nearest Neighbor Linkage
– Expectation Maximization: Gaussian Mixture Model; Poisson Mixture Model
– variable importance report

• Affinity and Sequence Analyses
– Feature Rich Implementations: Support; Confidence; Lift; z-Score

Model Scoring

• Decision Trees
• Clustering
• Affinity and Sequence Analyses
• SQL-based model scoring
– all scoring SQL is provided

Supporting Visualizations

• Graphical Tree Browser: Interactive Pruning; Text Rules; Distributions
• Lift Charts
• Cluster Sizes / Distance / Measures
• Association Color Map

Model Evaluation

• truth table (confusion matrix)
• model statistics & indices
• SQL-based model evaluation
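
Support, confidence, and lift are the standard measures for an affinity rule "antecedent implies consequent". The sketch below computes them from a handful of invented baskets using the usual textbook definitions; it does not reproduce TWM's implementation and leaves out the z-Score measure.

```python
def rule_measures(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(transactions)
    n_a = sum(1 for t in transactions if a <= set(t))         # contain antecedent
    n_c = sum(1 for t in transactions if c <= set(t))         # contain consequent
    n_ac = sum(1 for t in transactions if (a | c) <= set(t))  # contain both
    support = n_ac / n                # share of all baskets with both sides
    confidence = n_ac / n_a           # share of antecedent baskets with consequent
    lift = confidence / (n_c / n)     # confidence relative to the base rate
    return support, confidence, lift

# Invented example data.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
    {"bread", "milk"},
]

support, confidence, lift = rule_measures(baskets, {"bread"}, {"milk"})
print(support, confidence, lift)
```

A lift below 1, as in this toy data, says that bread buyers are slightly less likely than the average shopper to buy milk.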

Teradata Warehouse Miner: Statistical Tests

• Binomial Tests: Binomial; Sign

• Rank Tests: Mann-Whitney (Kruskal-Wallis); Wilcoxon; Friedman

• Contingency Table Tests: Chi-square; Median

• Parametric Tests: F (Two Way) Unequal Sample Size; F (N-Way) Equal Sample Size; T

• Normality/Equality Tests: Kolmogorov-Smirnov; Lilliefors Test; Shapiro-Wilk; D’Agostino & Pearson Omnibus; Smirnov
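
As one worked instance from this list, the sign test reduces to a binomial test on the number of positive differences between paired observations. The sketch below uses the exact two-sided form with ties dropped; the function names and data are invented, and it is not claimed to match the options TWM offers for this test.

```python
from math import comb

def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def sign_test(before, after):
    """Exact two-sided sign test on paired samples.

    Returns (number of positive differences, number of non-tied pairs, p-value).
    """
    diffs = [a - b for b, a in zip(before, after) if a != b]  # ties are dropped
    n = len(diffs)
    n_pos = sum(1 for d in diffs if d > 0)
    k = min(n_pos, n - n_pos)
    # Under the null hypothesis each sign is equally likely (p = 0.5).
    p_value = min(1.0, 2.0 * binomial_cdf(k, n, 0.5))
    return n_pos, n, p_value

# Invented paired measurements.
before = [10, 12, 9, 11, 14, 8, 10, 13]
after = [12, 15, 9, 13, 13, 11, 12, 16]

n_pos, n, p_value = sign_test(before, after)
print(n_pos, n, p_value)
```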

Why Did We Build Teradata Warehouse Miner? Integrated Data Mining Environment

Other Technologies: Inefficient Environment
– Elapsed and Execution Times
– Continual Data Movement
– Data Redundancy
– Metadata Inconsistencies
– “Many Versions of The Truth”

Teradata and TWM: Efficiently Architected Environment
– MPP Performance and Scalability
– No Data Movement
– No Data Redundancy
– Shared Metadata
– “One Version of The Truth”

[Figure labels, shown for each environment: Modelers Build Models; Business Deploys Models]

Why are Integrated Analytics Important? Efficiency, Performance & Scalability

[Figure labels: Modelers Build Models; Business Deploys Models; Source Data; Analytic Metadata; Analytic Data Set]

• Mine data in an integrated environment
– Huge data volumes – leverages the parallelism of Teradata
– Minimize data redundancy
– Eliminate proprietary data structures
– Simplify data & system management
– Better results using larger amounts of detailed data
– Eliminate potential errors during data movement & external sampling
– Integrated model building and scoring
– Reduced overall modeling time

Many resulting elapsed and execution time improvements have been astronomical!

The Teradata Warehouse Miner Goal: Enable Entire Data Mining Process In Teradata

[Figure labels: Teradata Data Warehouse; Source Data; Analytic Data Set; Scored Data Set; Analytic Metadata; Data Pre-Processing; Analytical Modeling; Model Deployment]

• data starts and ends in the database
• open to accommodate 3rd party partner tools

Teradata Warehouse Miner: Projects and Analytic Modules

• Teradata Warehouse Miner Projects contain one or more tasks

• each task is called an Analytic Module
– eight categories of analytic modules
  • ADS (Analytical Data Set generation)
    – Variable Creation
    – Variable Transformation
    – Build ADS
  • Analytics (Analytic Algorithms)
  • Descriptive Statistics
  • Matrix Functions (correlation, …)
  • Miscellaneous
    – free form SQL, …
  • Reorganization (Structure of Data)
  • Scoring (and Model Evaluation)
  • Statistical Tests

• Analytic Modules are the fundamental building blocks used to conduct data analysis in Teradata Warehouse Miner
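
The relationship described above (a Project holds one or more Analytic Modules, each belonging to one of eight categories) can be pictured as a small data structure. This is purely an illustrative model with invented class and field names; it does not reflect TWM's .NET interfaces or its metadata layout.

```python
from dataclasses import dataclass, field
from enum import Enum

class ModuleCategory(Enum):
    """The eight Analytic Module categories named on the slide."""
    ADS = "ADS (Analytical Data Set generation)"
    ANALYTICS = "Analytics (Analytic Algorithms)"
    DESCRIPTIVE_STATISTICS = "Descriptive Statistics"
    MATRIX_FUNCTIONS = "Matrix Functions"
    MISCELLANEOUS = "Miscellaneous"
    REORGANIZATION = "Reorganization (Structure of Data)"
    SCORING = "Scoring (and Model Evaluation)"
    STATISTICAL_TESTS = "Statistical Tests"

@dataclass
class AnalyticModule:
    """One task in a project (hypothetical fields)."""
    name: str
    category: ModuleCategory
    parameters: dict = field(default_factory=dict)

@dataclass
class Project:
    """A named collection of Analytic Modules (hypothetical fields)."""
    name: str
    modules: list = field(default_factory=list)

    def add(self, module):
        self.modules.append(module)
        return self

    def is_runnable(self):
        # A project needs at least one Analytic Module before anything can run.
        return len(self.modules) >= 1

project = Project("churn study")
project.add(AnalyticModule("profile balances", ModuleCategory.DESCRIPTIVE_STATISTICS))
project.add(AnalyticModule("fit logistic model", ModuleCategory.ANALYTICS,
                           {"dependent": "churned"}))
print(project.is_runnable(), [m.category.name for m in project.modules])
```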

Teradata Warehouse Miner: Elements in the Primary Window

Project Icon

Analytic Module Icon

ODBC Connection Icon

Connection Properties Icon

Run and Stop Icons

Runtime Message Area

Data Source Status

Project Area

Analysis Set-up and Results Viewing Area

hmmm… I wonder what else might fill this large gray area some day...

Main Menus

Main Toolbar

Open, Save, and Save All Icons

Teradata Warehouse Miner: The 7 Steps to Results

• there are 7 basic steps in the use of Teradata Warehouse Miner*
– connect to an ODBC data source with appropriate permissions
– create a new (or open an existing) Project
– add at least one Analytic Module to the Project
– set input and analytic options
  • select table(s) and column(s) to be analyzed
  • set Analytic Module parameters**
  • set other Analytic Module options as necessary**
– set output and results options
– execute the Analytic Module (using the run icon)
  • optionally, save the Project(s) and Analyses
– examine, interpret, and use results of interest**

• that’s it

* use these steps after you or a system administrator has set up an ODBC Data Source (DSN) on your PC. The DSN must point to source, result, and metadata Teradata databases for which you have appropriate permissions.

** setting Analytic Module options, and interpreting and using results appropriately, requires expertise specific to the Analytic Module chosen


Using Teradata Warehouse Miner

The 7 Steps to Results

An Example

Teradata Warehouse Miner: Step 1 - connect to an ODBC data source

Teradata Warehouse Miner: Step 2 - create a new Project

Teradata Warehouse Miner: Step 3 - add an Analytic Module to the Project

Teradata Warehouse Miner: Step 4 – set input and analytic options (select table and columns to be analyzed)

Teradata Warehouse Miner: Step 4 – set input and analytic options (set Analytic Module parameters)

Teradata Warehouse Miner: Step 4 – set input and analytic options (set other Analytic Module options as necessary)

Teradata Warehouse Miner: Step 5 – set output and results options

**Note: This screen-shot is from a Scoring Module for the analytic algorithm module used in this example

Teradata Warehouse Miner: Step 6 - execute the Analytic Module

Teradata Warehouse Miner: Step 6 - execute the Analytic Module (optionally, save the Project(s) and Analyses)

Teradata Warehouse Miner: Step 7 - examine, interpret, and use results (1)

Teradata Warehouse Miner: Step 7 - examine, interpret, and use results (2)

Tips for Navigating the Teradata Warehouse Miner Interface

• on-line help and user’s guide
– very extensive and thorough
– tutorials for each function
– describes many of the analytical techniques in detail
– many reference formulae are provided
– use these liberally

• menus and toolbar

• runtime message area

• setting program options and preferences
– global
– run-time

• setting up Project Directories for files on PC client
– optionally, for local HTML reports and associated graphics


Teradata Warehouse Miner

Demo

TWM, an enabling technology to assist in addressing qualified business questions that are well suited to the processes of decision support and data mining

(data exploration – data transformation – exploratory modeling – model building and validation – scoring and evaluation – lifecycle maintenance – …)


University of Arkansas

Data Mining with Teradata™ Warehouse Miner

Questions and Discussion