tm1-iihmr-DM

DATA MINING IN HEALTHCARE

Prof. (Dr.) T. Muthukumar M.Sc; M.C.A; M.B.A; M.Phil; Ph.D.

Professor – Healthcare IT, International Institute of Health Management Research (IIHMR - New Delhi).

Agenda

• Data Mining – Introduction & Background
• Data Warehouse and OLAP – Introduction
• Data Mining Primitive – Data Preprocessing
• Types of Data Mining & Techniques
• Mining Complex Types of Data
• Application – Tools, Case Studies, Healthcare Application

Recommended Reference Books

• E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011

• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002

• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2nd ed., Wiley-Interscience, 2000

• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003

• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996

• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001

• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 (3rd ed., 2011)

• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009

• B. Liu, Web Data Mining, Springer 2006.

• T. M. Mitchell, Machine Learning, McGraw Hill, 1997

• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005

• S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998

• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005

Introduction to Data Mining

• Why Data Mining?

• What Is Data Mining?

• A Multi-Dimensional View of Data Mining

• What Kind of Data Can Be Mined?

• What Kinds of Patterns Can Be Mined?

• What Technologies Are Used?

• What Kinds of Applications Are Targeted?

• Major Issues in Data Mining

• A Brief History of Data Mining and Data Mining Society

• Summary

Why Data Mining?

• The Explosive Growth of Data: from terabytes to petabytes

– Data collection and data availability

• Automated data collection tools, database systems, Web, computerized society

– Major sources of abundant data

• Business: Web, e-commerce, transactions, stocks, …

• Science: Remote sensing, bioinformatics, scientific simulation, …

• Society and everyone: news, digital cameras, YouTube

• We are drowning in data, but starving for knowledge!

• “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets

Data pyramid (from base to apex)

• Data
• Information = data + context
• Knowledge = information + rules
• Wisdom = knowledge + experience

Why Do We Need Data Mining?

• Large number of records (cases): 10^8 to 10^12 bytes
– One thousand (10^3) bytes = 1 kilobyte (KB)
– One million (10^6) bytes = 1 megabyte (MB)
– One billion (10^9) bytes = 1 gigabyte (GB)
– One trillion (10^12) bytes = 1 terabyte (TB)

• High-dimensional data (variables): 10 to 10^4 attributes

• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.

Scientific Viewpoint

• Data collected and stored at enormous speeds (GB/hour)
– Remote sensor on a satellite
– Telescope scanning the skies
– Scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible
• Data reduction
• Cataloging, classifying, segmenting data
• Helps scientists in hypothesis formation

Evolution of Sciences: New Data Science Era

• Before 1600: Empirical science
• 1600-1950s: Theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
• 1950s-1990s: Computational science
– Over the last 50 years, most disciplines have grown a third, computational branch (e.g., empirical, theoretical, and computational ecology, or physics, or linguistics).
– Computational science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now: Data science
– The flood of data from new scientific instruments and simulations
– The ability to economically store and manage petabytes of data online
– The Internet and computing Grid that makes all these archives universally accessible
– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
– Data mining is a major new challenge!



What Is Data Mining?

• Data mining (knowledge discovery from data)
– Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) patterns or knowledge from huge amounts of data
• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: is everything “data mining”? The following are not:
– Simple search and query processing
– (Deductive) expert systems
• Another definition: the automated extraction of predictive information from large databases
– Key terms: automated, extraction, predictive, large databases

Business Intelligence: Improving Business Insight

“A broad category of applications and technologies for gathering, storing, analyzing, sharing and providing access to data to help enterprise users make better business decisions.”

– Gartner

Relationships and Acronyms...

What does Data Mining Do?

Explores Your Data

Finds Patterns

Performs Predictions

DM and BI

• BI is geared toward end users, such as business owners and knowledge workers.

• DM is a technology that, today, is generally geared toward more advanced users.

• By the way: who is qualified to use DM today?

Knowledge Discovery (KDD) Process

• This is a view from typical database systems and data warehousing communities

• Data mining plays an essential role in the knowledge discovery process

[Figure: KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]

Knowledge Discovery Process

[Figure: Raw Data → Integration and Understanding → Data Warehouse → Selection & Cleaning → Target Data → Transformation → Transformed Data → Data Mining → Patterns and Rules → Interpretation & Evaluation → Knowledge]

Data Mining in Business Intelligence

[Figure: pyramid of layers. The potential to support business decisions increases from bottom to top. Roles shown alongside, from top to bottom: End User, Business Analyst, Data Analyst, DBA.]

Layers, from top to bottom:

• Decision Making
• Data Presentation: Visualization Techniques
• Data Mining: Information Discovery
• Data Exploration: Statistical Summary, Querying, and Reporting
• Data Preprocessing/Integration, Data Warehouses
• Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems

KDD Process: A Typical View from ML and Statistics

[Figure: Input Data → Data Pre-Processing → Data Mining → Post-Processing]

• This is a view from typical machine learning and statistics communities

• Data Pre-Processing: data integration, normalization, feature selection, dimension reduction
• Data Mining: pattern discovery, association & correlation, classification, clustering, outlier analysis, …
• Post-Processing: pattern evaluation, pattern selection, pattern interpretation, pattern visualization

Example: Medical Data Mining

• Health care and medical data mining often adopts this view from statistics and machine learning

• Preprocessing of the data (including feature extraction and dimension reduction)

• Classification and/or clustering processes

• Post-processing for presentation
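
The three stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the patient records, the choice of min-max scaling, and the nearest-centroid model are all hypothetical and are not taken from the slides.

```python
from statistics import mean

# Hypothetical patient records: (age, systolic_bp, diagnosis); None marks a missing value.
records = [
    (34, 118, "healthy"),
    (61, 162, "hypertensive"),
    (47, None, "healthy"),  # incomplete record, dropped during pre-processing
    (58, 155, "hypertensive"),
    (29, 121, "healthy"),
]

def preprocess(rows):
    """Pre-processing: drop incomplete rows, then min-max scale the two numeric features."""
    complete = [r for r in rows if None not in r]
    scaled = []
    for col in range(2):
        values = [r[col] for r in complete]
        lo, hi = min(values), max(values)
        scaled.append([(v - lo) / (hi - lo) for v in values])
    return [(a, b, r[2]) for a, b, r in zip(scaled[0], scaled[1], complete)]

def mine(rows):
    """Mining: build one centroid per class label (a nearest-centroid model)."""
    centroids = {}
    for label in {r[2] for r in rows}:
        members = [r for r in rows if r[2] == label]
        centroids[label] = (mean(m[0] for m in members), mean(m[1] for m in members))
    return centroids

def postprocess(centroids):
    """Post-processing: turn the model into a short, readable summary."""
    return [f"{label}: centroid=({x:.2f}, {y:.2f})"
            for label, (x, y) in sorted(centroids.items())]

clean = preprocess(records)
model = mine(clean)
summary = postprocess(model)
for line in summary:
    print(line)
```

Each stage is a separate function so that a different cleaning rule or a different mining method could be swapped in without touching the others.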

The Evolution of Data Analysis

Data Collection (1960s)
– Business question: "What was my total revenue in the last five years?"
– Enabling technologies: Computers, tapes, disks
– Product providers: IBM, CDC
– Characteristics: Retrospective, static data delivery

Data Access (1980s)
– Business question: "What were unit sales in New England last March?"
– Enabling technologies: Relational databases (RDBMS), Structured Query Language (SQL), ODBC
– Product providers: Oracle, Sybase, Informix, IBM, Microsoft
– Characteristics: Retrospective, dynamic data delivery at record level

Data Warehousing & Decision Support (1990s)
– Business question: "What were unit sales in New England last March? Drill down to Boston."
– Enabling technologies: On-line analytic processing (OLAP), multidimensional databases, data warehouses
– Product providers: SPSS, Comshare, Arbor, Cognos, Microstrategy, NCR
– Characteristics: Retrospective, dynamic data delivery at multiple levels

Data Mining (Emerging Today)
– Business question: "What’s likely to happen to Boston unit sales next month? Why?"
– Enabling technologies: Advanced algorithms, multiprocessor computers, massive databases
– Product providers: SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
– Characteristics: Prospective, proactive information delivery

When Is DM Useful?

• Data-rich world
• Large data (dimensionality and size)
– Image data (size)
– Gene chip data (dimensionality)
• Little knowledge about the data (exploratory data analysis)
– What if we have some knowledge?

Data Mining Versus Statistical Analysis

• Data Analysis
– Tests for statistical correctness of models
  • Are the statistical assumptions of the models correct? E.g., is the R-squared good?
– Hypothesis testing
  • Is the relationship significant? Use a t-test to validate significance
– Tends to rely on sampling
– Techniques are not optimised for large amounts of data
– Requires strong statistical skills

• Data Mining
– Originally developed to act as expert systems to solve problems
– Less interested in the mechanics of the technique
– If it makes sense, then let’s use it
– Does not require assumptions to be made about data
– Can find patterns in very large amounts of data
– Requires understanding of data and business problem

Data Mining versus OLAP

• OLAP: On-line Analytical Processing

– Provides a very good view of what is happening, but cannot predict what will happen in the future or explain why it is happening

Data Mining vs. Database

• A DB user knows what they are looking for.
• A DM user might or might not know what they are looking for.
• A DB answer to a query is 100% accurate, if the data are correct.
• DM aims to get an answer that is as accurate as possible.
• DB data are retrieved as stored.
• DM data need to be cleaned (somewhat) before producing results.
• DB results are a subset of the data.
• DM results are an analysis of the data.
• The meaningfulness of the results is not the concern of a database, whereas it is the main issue in data mining.

Data Mining vs. KDD

• Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in the data.

• Data Mining is the use of algorithms to find the useful information in the KDD process.

• The KDD process is:
» Data cleaning & integration (data pre-processing)
» Creating a common data repository for all sources, such as a data warehouse
» Data mining
» Visualization of the generated results

Data Mining Is Not

• Brute-force crunching of bulk data
• “Blind” application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science

Data Mining: On What Kinds of Data?

• Database-oriented data sets and applications

– Relational database, data warehouse, transactional database

• Advanced data sets and advanced applications

– Data streams and sensor data

– Time-series data, temporal data, sequence data (incl. bio-sequences)

– Structure data, graphs, social networks and multi-linked data

– Object-relational databases

– Heterogeneous databases

– Spatial data and spatiotemporal data

– Multimedia database

– Text databases & The World-Wide Web

Data Mining Algorithms

[Figure: taxonomy of analysis methods. Recoverable labels: Online Analytical Processing; SQL Query Tools; Discovery Driven Methods, divided into Description and Prediction; Classification; Regressions; Decision Trees; Neural Networks; Visualization; Clustering; Association; Sequential Analysis]

Data Mining Function: (1) Generalization

• Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
– Scalable methods for computing (i.e., materializing) multidimensional aggregates
– OLAP (online analytical processing)
• Multidimensional concept description: characterization and discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region

Concept Description

• Characterization: provides a concise and succinct summarization of the given collection of data

• Discrimination: provides descriptions comparing two or more collections of data

• can handle complex data types of the attributes

• a more automated process

Concept description: Characterization

Initial Relation

Name            Gender  Major    Birth-Place            Birth_date  Residence                 Phone #   GPA
Jim Woodman     M       CS       Vancouver, BC, Canada  8-12-76     3511 Main St., Richmond   687-4598  3.67
Scott Lachance  M       CS       Montreal, Que, Canada  28-7-75     345 1st Ave., Richmond    253-9106  3.70
Laura Lee       F       Physics  Seattle, WA, USA       25-8-70     125 Austin Ave., Burnaby  420-5232  3.83
…               …       …        …                      …           …                         …         …

Generalization applied to each attribute:
– Name: removed
– Gender: retained
– Major: Sci, Eng, Bus
– Birth-Place: country
– Birth_date: age range
– Residence: city
– Phone #: removed
– GPA: Excl, VG, ..

Generalized Relation

Gender  Major    Birth_region  Age_range  Residence  GPA        Count
M       Science  Canada        20-25      Richmond   Very-good  16
F       Science  Foreign       25-30      Burnaby    Excellent  22
…       …        …             …          …          …          …

Cross-tabulation of Gender by Birth_Region

Gender  Canada  Foreign  Total
M       16      14       30
F       10      22       32
Total   26      36       62
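
The generalization step can be sketched as follows, using only the three sample tuples shown in the initial relation. Several things here are assumptions made for illustration: the major-to-field mapping, the 3.75 cut-off between "Very-good" and "Excellent", and the omission of the birth-date and phone attributes (the age range depends on a reference date the slide does not give).

```python
from collections import Counter

# The three sample tuples from the initial relation:
# (name, gender, major, birth_place, residence, gpa)
initial_relation = [
    ("Jim Woodman", "M", "CS", "Vancouver, BC, Canada", "3511 Main St., Richmond", 3.67),
    ("Scott Lachance", "M", "CS", "Montreal, Que, Canada", "345 1st Ave., Richmond", 3.70),
    ("Laura Lee", "F", "Physics", "Seattle, WA, USA", "125 Austin Ave., Burnaby", 3.83),
]

# Hypothetical concept hierarchy for majors.
MAJOR_TO_FIELD = {"CS": "Science", "Physics": "Science"}

def generalize(row):
    """Map one tuple to its generalized form; the name is removed, gender is retained."""
    _name, gender, major, birth_place, residence, gpa = row
    field = MAJOR_TO_FIELD.get(major, "Other")
    country = birth_place.split(",")[-1].strip()
    region = "Canada" if country == "Canada" else "Foreign"
    city = residence.split(",")[-1].strip()
    grade = "Excellent" if gpa >= 3.75 else "Very-good"  # assumed cut-off
    return (gender, field, region, city, grade)

# Identical generalized tuples are merged and counted.
generalized_relation = Counter(generalize(r) for r in initial_relation)
for row, count in generalized_relation.items():
    print(*row, count)
```

The count column of the generalized relation falls out of merging identical generalized tuples, which is what the `Counter` does here.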

Data Mining Function: (2) Association and Correlation Analysis

• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
– A typical association rule
– Are strongly associated items also strongly correlated?
• How to mine such patterns and rules efficiently in large datasets?

Association rule

• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) → buys(X, “PC”) [support = 2%, confidence = 60%]
• Association rule mining
– Finding frequent patterns, associations, correlations among sets of items or objects in transaction databases, relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data
– What products were often purchased together?

Example: Association rule

• Itemsets A1 and A2 drawn from the items {a1, …, ak}
• Find all the rules A1 → A2 with minimum confidence and support
– support, s: probability that a transaction contains both A1 and A2
– confidence, c: conditional probability that a transaction having A1 also contains A2

Transaction-id  Items bought
10              a1, a2, a3
20              a1, a3
30              a1, a4
40              a2, a5, a6

Let min_support = 50%, min_conf = 50%:
a1 → a3 (50%, 66.7%)
a3 → a1 (50%, 100%)
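
The support and confidence figures for this example can be checked with a short sketch. The four transactions are the ones listed on the slide; the function names are illustrative only.

```python
transactions = {
    10: {"a1", "a2", "a3"},
    20: {"a1", "a3"},
    30: {"a1", "a4"},
    40: {"a2", "a5", "a6"},
}

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for items in transactions.values() if itemset <= items)
    return hits / len(transactions)

def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent) from the transaction table."""
    return support(antecedent | consequent) / support(antecedent)

for lhs, rhs in [({"a1"}, {"a3"}), ({"a3"}, {"a1"})]:
    s = support(lhs | rhs)
    c = confidence(lhs, rhs)
    print(f"{sorted(lhs)} -> {sorted(rhs)}: support={s:.1%}, confidence={c:.1%}")
# ['a1'] -> ['a3']: support=50.0%, confidence=66.7%
# ['a3'] -> ['a1']: support=50.0%, confidence=100.0%
```

Both rules share the same support because support depends only on the combined itemset {a1, a3}. The confidences differ because a1 occurs in three transactions while a3 occurs in two.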

Data Mining Function: (3) Classification

• Classification and label prediction
– Describe and distinguish classes or concepts for future prediction
  • E.g., classify countries based on climate, or classify cars based on mileage
– Predict some unknown class labels
• Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern-based classification, logistic regression, …
• Typical applications
– Credit card fraud detection, direct marketing, classifying stars, diseases, web pages, …

Classification (1): Model Construction

[Figure: Training Data → Classification Algorithms → Classifier (Model)]

Training Data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Classifier (Model)

IF rank = ‘professor’ OR years > 6
THEN tenured = ‘yes’

Classification (2): Prediction Using the Model

[Figure: the Classifier is applied to the Testing Data and then to Unseen Data]

Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Unseen Data

(Jeff, Professor, 4)

Tenured?
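
As a minimal sketch, the rule learned in the model-construction step can be written as a function, scored against the testing data, and applied to the unseen tuple. The rule and the data come from the slides; the function name is illustrative.

```python
def predict_tenured(rank, years):
    """The slide's rule: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'."""
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# Accuracy = share of testing tuples whose predicted label matches the known label.
correct = sum(predict_tenured(rank, years) == actual
              for _, rank, years, actual in testing_data)
accuracy = correct / len(testing_data)

jeff = predict_tenured("Professor", 4)
print(f"accuracy on testing data: {accuracy:.0%}")  # 75%
print(f"(Jeff, Professor, 4) -> tenured = {jeff}")  # yes
```

The rule fits all six training tuples but misclassifies Merlisa (7 years, not tenured) in the testing data. This is why accuracy is estimated on data the model has not seen.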

Classification Techniques

• Decision Tree Induction
• Bayesian Classification
• Neural Networks
• Genetic Algorithms
• Fuzzy Set and Logic

Data Mining Function: (4) Cluster Analysis

• Unsupervised learning (i.e., Class label is unknown)

• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns

• Principle: Maximizing intra-class similarity & minimizing interclass similarity

• Many methods and applications

Clustering

• Cluster: a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Clustering
– Grouping a set of data objects into clusters based on the principle: maximizing the intra-class similarity and minimizing the interclass similarity
• Example
– Land use: identification of areas of similar land use in an earth observation database
– City planning: identifying groups of houses according to their house type, value, and geographical location
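
One way to illustrate the principle is k-means, which is only one of many clustering methods and is not prescribed by the slides. The sketch below groups hypothetical house locations; the coordinates and the choice of starting centroids are assumptions made for the example.

```python
from math import dist

# Hypothetical houses as (x, y) map coordinates in km.
houses = [(1.0, 1.2), (1.3, 0.9), (0.8, 1.0), (5.1, 4.9), (4.8, 5.3), (5.2, 5.0)]

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids

clusters, centroids = kmeans(houses, centroids=[houses[0], houses[3]])
for members, center in zip(clusters, centroids):
    print(f"center=({center[0]:.2f}, {center[1]:.2f}) members={members}")
```

Points within a cluster end up close to their shared centroid (high intra-class similarity), while the two centroids are far apart (low interclass similarity).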

Data Mining Function: (5) Outlier Analysis

• Outlier analysis
– Outlier: a data object that does not comply with the general behavior of the data
– Noise or exception? One person’s garbage could be another person’s treasure
– Methods: by-product of clustering or regression analysis, …
– Useful in fraud detection, rare events analysis
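
The slides name clustering and regression as sources of outliers. As a simpler stand-in, the sketch below uses the interquartile-range (IQR) rule, a common statistical screen that is not taken from the slides. The transaction amounts are hypothetical.

```python
from statistics import quantiles

# Hypothetical card transaction amounts; one value is far from the rest.
amounts = [52, 48, 50, 47, 53, 49, 51, 500]

def iqr_outliers(values, k=1.5):
    """Flag values more than k * IQR below the first or above the third quartile."""
    q1, _median, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

outliers = iqr_outliers(amounts)
print(outliers)  # [500]
```

Whether the flagged value is noise to discard or an exception worth investigating (for example, possible fraud) is a judgment the rule itself cannot make.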

Time and Ordering: Sequential Pattern, Trend and Evolution Analysis

• Sequence, trend and evolution analysis
– Trend, time-series, and deviation analysis, e.g., regression and value prediction
– Sequential pattern mining
  • e.g., first buy digital camera, then buy large SD memory cards
– Periodicity analysis
– Biological sequence analysis
• Mining data streams
– Ordered, time-varying, potentially infinite data streams

Sequential Pattern Analysis

• Given a set of sequences, find the complete set of frequent subsequences
• Applications of sequential patterns
– Customer shopping sequences
  • First buy computer, then CD-ROM, and then digital camera, within 3 months
– Weblog click streams
– Telephone calling patterns

SID  Sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

Given support threshold min_sup = 2, <(ab)c> is a sequential pattern
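
The claim about <(ab)c> can be verified with a small containment check. This sketch only counts the support of one given pattern; it is not a mining algorithm that enumerates all frequent subsequences. Each sequence is written as a list of itemsets, and the function names are illustrative.

```python
sequences = {
    10: [{"a"}, {"a", "b", "c"}, {"a", "c"}, {"d"}, {"c", "f"}],
    20: [{"a", "d"}, {"c"}, {"b", "c"}, {"a", "e"}],
    30: [{"e", "f"}, {"a", "b"}, {"d", "f"}, {"c"}, {"b"}],
    40: [{"e"}, {"g"}, {"a", "f"}, {"c"}, {"b"}, {"c"}],
}

def contains(sequence, pattern):
    """True if the pattern's elements appear, in order, as subsets of the sequence's elements."""
    position = 0
    for element in sequence:
        if position < len(pattern) and pattern[position] <= element:
            position += 1
    return position == len(pattern)

def support(pattern):
    """Number of sequences in the database that contain the pattern."""
    return sum(contains(seq, pattern) for seq in sequences.values())

pattern = [{"a", "b"}, {"c"}]  # <(ab)c>
print(support(pattern))  # 2
```

Sequences 10 and 30 contain an element holding both a and b, followed later by an element holding c. Sequences 20 and 40 never have a and b in the same element, so the support is exactly 2.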

Structure and Network Analysis

• Graph mining
– Finding frequent subgraphs (e.g., chemical compounds), trees (XML), substructures (web fragments)
• Information network analysis
– Social networks: actors (objects, nodes) and relationships (edges)
  • e.g., author networks in CS, terrorist networks
– Multiple heterogeneous networks
  • A person could be in multiple information networks: friends, family, classmates, …
• Web mining
– Web is a big information network: Google
– Analysis of Web information networks
  • Web community discovery, opinion mining, usage mining, …

Evaluation of Knowledge

• Is all mined knowledge interesting?

– One can mine tremendous amount of “patterns” and knowledge

– Some may fit only certain dimension space (time, location, …)

– Some may not be representative, may be transient, …

• Evaluation of mined knowledge → directly mine only interesting knowledge?

– Descriptive vs. predictive

– Coverage

– Typicality vs. novelty

– Accuracy

– Timeliness

Regression

• Regression is similar to classification
– First, construct a model
– Second, use the model to predict unknown values
• Methods
– Linear and multiple regression
– Non-linear regression
• Regression is different from classification
– Classification predicts categorical class labels
– Regression models continuous-valued functions
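
The two steps (construct a model, then predict an unknown value) can be sketched with ordinary least squares for a single predictor. The data below are hypothetical, chosen only to show a continuous-valued prediction.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical data: length of stay (days) vs. hospital bill (arbitrary units).
days = [1, 2, 3, 4, 5]
bill = [12.0, 19.0, 31.0, 42.0, 51.0]

# Step 1: construct the model.
slope, intercept = fit_line(days, bill)

# Step 2: use the model to predict an unknown, continuous value.
predicted = slope * 6 + intercept
print(f"bill = {slope:.1f} * days + {intercept:.1f}; predicted bill for 6 days = {predicted:.1f}")
```

Unlike the tenure classifier, the output here is a number on a continuous scale rather than one of a fixed set of labels.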











Data Mining: Confluence of Multiple Disciplines

[Figure: Data Mining at the center, drawing on Machine Learning, Statistics, Applications, Algorithm, Pattern Recognition, High-Performance Computing, Visualization, and Database Technology]

Why Confluence of Multiple Disciplines?

• Tremendous amount of data
– Algorithms must be highly scalable to handle data volumes such as terabytes
• High dimensionality of data
– Micro-array data may have tens of thousands of dimensions
• High complexity of data
– Data streams and sensor data
– Time-series data, temporal data, sequence data
– Structure data, graphs, social networks and multi-linked data
– Heterogeneous databases and legacy databases
– Spatial, spatiotemporal, multimedia, text and Web data
– Software programs, scientific simulations
• New and sophisticated applications

Applications of Data Mining

• Web page analysis: from web page classification, clustering algorithms

• Collaborative analysis & recommender systems

• Basket data analysis to targeted marketing

• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis

• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)

• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

Major Issues in Data Mining (1)

• Mining Methodology

– Mining various and new kinds of knowledge

– Mining knowledge in multi-dimensional space

– Data mining: An interdisciplinary effort

– Boosting the power of discovery in a networked environment

– Handling noise, uncertainty, and incompleteness of data

– Pattern evaluation and pattern- or constraint-guided mining

• User Interaction

– Interactive mining

– Incorporation of background knowledge

– Presentation and visualization of data mining results


• Efficiency and Scalability

– Efficiency and scalability of data mining algorithms

– Parallel, distributed, stream, and incremental mining methods


• Diversity of data types

– Handling complex types of data

– Mining dynamic, networked, and global data repositories

• Data mining and society

– Social impacts of data mining

– Privacy-preserving data mining

– Invisible data mining

A Brief History of Data Mining Society

• 1989 IJCAI Workshop on Knowledge Discovery in Databases

– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

• 1991-1994 Workshops on Knowledge Discovery in Databases

– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

– Journal of Data Mining and Knowledge Discovery (1997)

• ACM SIGKDD conferences since 1998 and SIGKDD Explorations

• More conferences on data mining

– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.

• ACM Transactions on KDD (2007)

Conferences and Journals on Data Mining

• KDD conferences
– ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)
– (IEEE) Int. Conf. on Data Mining (ICDM)
– European Conf. on Machine Learning and Principles and Practices of Knowledge Discovery and Data Mining (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining (WSDM)

• Other related conferences
– DB conferences: ACM SIGMOD, VLDB, ICDE, EDBT, ICDT, …
– Web and IR conferences: WWW, SIGIR, WSDM
– ML conferences: ICML, NIPS
– PR conferences: CVPR

• Journals
– Data Mining and Knowledge Discovery (DAMI or DMKD)
– IEEE Trans. on Knowledge and Data Eng. (TKDE)
– KDD Explorations
– ACM Trans. on KDD

Where to Find References? DBLP, Google

• Data mining and KDD (SIGKDD: CDROM)
– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
– Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology, CD-ROM)
– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning
– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
– Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
• Web and IR
– Conferences: SIGIR, WWW, CIKM, etc.
– Journals: WWW: Internet and Web Information Systems
• Statistics
– Conferences: Joint Stat. Meeting, etc.
– Journals: Annals of Statistics, etc.
• Visualization
– Conference proceedings: CHI, ACM-SIGGraph, etc.
– Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Summary

• Data mining: Discovering interesting patterns and knowledge from massive amounts of data

• A natural evolution of science and information technology, in great demand, with wide applications

• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation

• Mining can be performed on a variety of data

• Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.

• Data mining technologies and applications

• Major issues in data mining

Data Mining in Association Rule

And now discussion

• Contact: Prof. T. Muthukumar, (0-9871969455)
