Research on Data Mining and Knowledge Discovery at WPI


Transcript of Research on Data Mining and Knowledge Discovery at WPI

Page 1: Research on Data Mining and Knowledge Discovery at WPI

WPI Center for Research in Exploratory Data and Information Analysis CREDIA

Research on Data Mining and Knowledge Discovery at WPI

Prof. Carolina Ruiz

Department of Computer Science, Worcester Polytechnic Institute

Page 2: Research on Data Mining and Knowledge Discovery at WPI


Outline of this talk

• Short tutorial on Data Mining and Knowledge Discovery in Databases (KDD)

• Sample ongoing KDD research projects at WPI

Page 3: Research on Data Mining and Knowledge Discovery at WPI

Need for Data Mining

• Data are being gathered and stored extremely fast
– Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000).

• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data

Page 4: Research on Data Mining and Knowledge Discovery at WPI

What is Data Mining? Or, more generally, Knowledge Discovery in Databases (KDD)

“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]

• Raw Data → Data Mining → Patterns
» Analytical and Statistical Patterns (rules, decision trees, …)
» Visual Patterns

Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.

Page 5: Research on Data Mining and Knowledge Discovery at WPI

Data Analysis (KDD) Process

[Diagram: the KDD process, with the following labeled stages]
• data sources
• data management: databases, data warehouses
• data “pre”-processing: noisy/missing data, dim. reduction (yields clean data)
• data analysis / data mining: analytical, statistical, visual (yields models)
• model/pattern evaluation: quantitative, qualitative (yields a “good” model)
• model/patterns deployment: prediction, decision support (applied to new data)

Page 6: Research on Data Mining and Knowledge Discovery at WPI

KDD is Interdisciplinary: techniques come from multiple fields

• Machine Learning (AI)
– Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
– Contributes language, framework, and techniques
• Pattern Recognition
– Contributes pattern extraction and pattern matching techniques
• Databases
– Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
– Contributes visual data displays and data exploration
• High Performance Comp.
– Contributes techniques for efficiently handling complexity
• Application Domain
– Contributes domain knowledge

Page 7: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Modes

• Confirmatory (verification)
– Given a hypothesis, verify its validity against the data
• Exploratory (discovery)
– Prescriptive patterns
  • Patterns for predicting behavior of newly encountered entities
– Descriptive patterns
  • Patterns for presenting the behavior of observed entities in a human-understandable format

Page 8: Research on Data Mining and Knowledge Discovery at WPI

Analytical and Visual Data Mining

• Analytical
– A model that represents the data is constructed using computational methods
• Visual
– Data are displayed on computer screen using colors and shapes
– Patterns in the data are identified by the human (user) eye.

Page 9: Research on Data Mining and Knowledge Discovery at WPI

What do you want to learn from your data?

KDD approaches applied to the data:
• classification
• regression
• clustering
• summarization
• dependency/assoc. analysis
• change/deviation detection

[Figure: sample outputs of these approaches, including rules (IF a & b & c THEN d & k; IF k & a THEN e) and associations (A, B -> C 80%; C, D -> A 22%)]

Page 10: Research on Data Mining and Knowledge Discovery at WPI


Commercial Data Mining Systems

Page 11: Research on Data Mining and Knowledge Discovery at WPI


Closer Look: IBM’s Intelligent Miner

Page 12: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Academic Systems

• CBA: Liu et al., National Univ. of Singapore
• WPI WEKA: Our Temporal/Spatial Association Rules
• WEKA: Frank et al., University of Waikato, New Zealand
• ARMiner: Cristofor et al., UMass/Boston

Page 13: Research on Data Mining and Knowledge Discovery at WPI

Some Current Analytical Data Mining Research Projects at WPI

• Mining Complex Data: Set and Sequence Mining
– Systems performance Data
– Sleep Data
– Financial Data
– Web Data
• Data Mining for Genetic Analysis
– Correlating genetic information with diseases
– Predicting gene expression patterns
• Data Mining for Electronic Commerce
– Collaborative and Content-Based Filtering
  • Using Association Rules and using Neural Networks

Page 14: Research on Data Mining and Knowledge Discovery at WPI

Mining Complex Data

   | names/aliases | bank account | age | felonies | gender | iris scan | …
P1 | {joe smith, greg jones} | | 27 | <burglary 2/86, fraud 11/93, murder 3/99> | M | |
P2 | {kathy pearls, kathy dow, susan harris} | 97,72,67,80,… | 53 | <child abuse 9/98, kidnapping 2/03> | F | |
P3 | {drew harris} | 10,29,37,16,… | 49 | < > | M | |
…  | … | … | … | … | … | … |

Based partially on work w/ Norfolk County Sheriff Office

Page 15: Research on Data Mining and Knowledge Discovery at WPI

Sample Complex Patterns

Potential temporal/spatial association:
– Teenage males from Eastern Massachusetts who are convicted of burglary are likely (7%) to commit violent crimes when they are adults.

Page 16: Research on Data Mining and Knowledge Discovery at WPI

Analyzing Sleep Data
WPI, UMass Medical, BC

(Source: http://www.blsc.com)

DATA SET
• Clinical (sequential): Electro-encephalogram (EEG), Electro-oculogram (EOG), Electro-myogram (EMG), probe measuring flow of oxygen in blood, etc.
• Diagnostic (tabular): questionnaire responses, patient’s demographic info., patient’s medical history

Purpose: Associations between sleep patterns and health/pathology
Obtain patterns of different sleep stages (4 sleep + REM + Wake)

Potential Rules:
(A) Association Rules
(Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy; confidence = 92%, support = 13%
(B) Classification Rules
(snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian); confidence = 70%, support = 8%

*AHI = Apnea-Hypopnea index, **OSA = Obstructive Sleep Apnea

Page 17: Research on Data Mining and Knowledge Discovery at WPI

Input Data
• Each instance: [Tabular | set | sequential] * attributes

   | attr1 | attr2 | attr3 | attr4 | attr5 | [class]
   | illnesses | heart rate | age | oxygen | gender | Epworth
P1 | {depression, fatigue} | | 27 | | M | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,… | 73 | 90,92,96,89,86,… | F | 23
P3 | {arthritis} | 102,99,87,96,… | 49 | 97,100,82,80,70,… | M | 14
…  | … | … | … | … | … | …

Page 18: Research on Data Mining and Knowledge Discovery at WPI

Analyzing Financial Data

• Sequential data
– daily stock values
• “Normal” (tabular/relational) data
– sector (computers, agricultural, educational, …), type of government, product releases, companies awards, …
• Desired rules:
– If DELL’s stock value increases & 1999<year<2002 => IBM’s stock value decreases

Page 19: Research on Data Mining and Knowledge Discovery at WPI

Financial Data Analysis

                | AMD | Cisco | …
Stock values    | … | … | …
Products        | Athlon XP 2200+ (Nov 11, 02) | Aironet 1100 Series (Oct 2, 02) | …
Awards          | Lifetime Achievement (Oct 31, 02) | None | …
Neg. Events     | Reduce workforce (Nov 14, 02) | None | …
Expansion/Merge | None | None | …

Page 20: Research on Data Mining and Knowledge Discovery at WPI

Events – Sleep Data
6 basic sleep events/stages: W, S1, S2, S3, S4, REM

[Figure: sleep recording showing]
• SaO2: the mean oxygen saturation (SaO2), around 90%
• heart rate shown by ECG, in beats per minute
• the sleep stages: W or Wake; 1 or Stage 1; 2, 3, 4; and REM or Rapid Eye Movement stage
Also shown, as brown markings, are:
• Epoch (of duration 30 sec)
• Clock time (indicating total sleep time)

Page 21: Research on Data Mining and Knowledge Discovery at WPI

Events – Financial Data

• Basic events: 16 or so financial templates [Little&Rhodes78]
• difficult pattern matching – alignments and time warping

[Figure: example templates: Rounding Top Reversal, Descending Triangle Reversal, Panic Reversal, Head & Shoulders Reversal]

Page 22: Research on Data Mining and Knowledge Discovery at WPI

Example: Event Identification

• Templates = increase, decrease, sustain
• Confidence = 90%, support = 15%, class = Epworth

   | illnesses | heart rate | age | oxygen | gender | Epworth
P1 | {depression, fatigue} | | 27 | | M | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,… | 73 | 90,92,96,89,86,… | F | 23
P3 | {arthritis} | 102,99,87,96,… | 49 | 97,100,82,80,70,… | M | 14
…  | … | … | … | … | … | …
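One way to read the three templates on this slide is as labels for maximal runs in a numeric sequence. The sketch below is only an illustration under that assumption: the slides do not say how templates are matched, and the function name `identify_events`, the sample-to-sample comparison, and the `tolerance` parameter are all hypothetical.

```python
from typing import List, Sequence, Tuple

# An event is (template, start_index, end_index) over the sample positions.
Event = Tuple[str, int, int]


def identify_events(values: Sequence[float], tolerance: float = 0.0) -> List[Event]:
    """Label maximal runs of a sequence as increase, decrease, or sustain.

    Assumption (not from the slides): consecutive samples are compared, and a
    change whose magnitude is at most `tolerance` counts as "sustain".
    """
    events: List[Event] = []
    for i in range(1, len(values)):
        delta = values[i] - values[i - 1]
        if delta > tolerance:
            label = "increase"
        elif delta < -tolerance:
            label = "decrease"
        else:
            label = "sustain"
        if events and events[-1][0] == label:
            # Same template as the previous step: extend the current run.
            events[-1] = (label, events[-1][1], i)
        else:
            events.append((label, i - 1, i))
    return events


# Leading values of the oxygen and heart-rate sequences shown for patient P2.
oxygen_events = identify_events([90, 92, 96, 89, 86])
heart_rate_events = identify_events([97, 72, 67, 80])
```

Each run ends at the index where the next one starts, so consecutive events of one attribute share an endpoint.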

Page 23: Research on Data Mining and Knowledge Discovery at WPI

Temporal Relations between two Events

[Diagram: possible relations between the time intervals of event1 and event2]
• before / after
• meets
• overlaps
• is equal to
• starts
• during
• finishes
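The relations listed on this slide can be decided by comparing interval endpoints. The following is a minimal sketch, assuming each event is a closed interval `(start, end)` with `start < end`. The function name `temporal_relation` is hypothetical, and the inverse names (`met-by`, `started-by`, `finished-by`, `contains`, `overlapped-by`) are not on the slide; they are added here only so that every pair of intervals gets exactly one label.

```python
from typing import Tuple

Interval = Tuple[float, float]  # (start, end) with start < end


def temporal_relation(a: Interval, b: Interval) -> str:
    """Return the relation of interval a with respect to interval b."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_start >= a_end or b_start >= b_end:
        raise ValueError("intervals must satisfy start < end")
    if a_end < b_start:
        return "before"
    if b_end < a_start:
        return "after"
    if a_end == b_start:
        return "meets"
    if b_end == a_start:
        return "met-by"
    if a_start == b_start and a_end == b_end:
        return "is equal to"
    if a_start == b_start:
        return "starts" if a_end < b_end else "started-by"
    if a_end == b_end:
        return "finishes" if a_start > b_start else "finished-by"
    if b_start < a_start and a_end < b_end:
        return "during"
    if a_start < b_start and b_end < a_end:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a_start < b_start else "overlapped-by"
```

For example, `temporal_relation((0, 1), (1, 2))` gives `meets`, the situation in which one event starts exactly when the other ends.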

Page 24: Research on Data Mining and Knowledge Discovery at WPI

Example: temporal association rules

• Heart rate decreases immediately after oxygen stops increasing & gender=M => epworth=10 (conf=95%, supp=23%)
– HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender=M => epworth=10
• Heart rate sustains while oxygen increases & patient suffers from dementia => ethnicity=white (conf=99%, supp=16%)
• Patient suffers from dementia and depression & gender=F & REM[t0,t2] => oxygen-inc[t1,t3] (conf=91%, supp=17%)

[Timeline: t0, t1, t2, t3]
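To make the first rule concrete, the sketch below checks its body (an oxygen increase that ends exactly where a heart-rate decrease begins, for a male patient) against a few records and then computes support and confidence by counting. Everything here is an assumption for illustration: the record layout, the event tuples `(template, start, end)`, the function names, and the four toy patients are hypothetical and are not taken from the WPI system or its data.

```python
from typing import Dict, List, Tuple

Event = Tuple[str, int, int]  # (template, start, end); hypothetical layout


def body_holds(patient: Dict) -> bool:
    """Check HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender=M for one patient."""
    if patient["gender"] != "M":
        return False
    # End times of all oxygen increases; a heart-rate decrease must start at one.
    oxygen_inc_ends = {end for label, _, end in patient["oxygen"] if label == "increase"}
    return any(
        label == "decrease" and start in oxygen_inc_ends
        for label, start, _ in patient["heart_rate"]
    )


def support_and_confidence(patients: List[Dict], epworth_value: int) -> Tuple[float, float]:
    """Support = P(body and head); confidence = P(head | body)."""
    n_body = sum(1 for p in patients if body_holds(p))
    n_both = sum(1 for p in patients if body_holds(p) and p["epworth"] == epworth_value)
    support = n_both / len(patients) if patients else 0.0
    confidence = n_both / n_body if n_body else 0.0
    return support, confidence


# Toy records, invented for this example only.
toy_patients = [
    {"gender": "M", "oxygen": [("increase", 0, 2)], "heart_rate": [("decrease", 2, 4)], "epworth": 10},
    {"gender": "M", "oxygen": [("increase", 0, 3)], "heart_rate": [("decrease", 3, 5)], "epworth": 7},
    {"gender": "F", "oxygen": [("increase", 0, 2)], "heart_rate": [("decrease", 2, 4)], "epworth": 10},
    {"gender": "M", "oxygen": [("decrease", 0, 2)], "heart_rate": [("decrease", 2, 4)], "epworth": 10},
]
support, confidence = support_and_confidence(toy_patients, epworth_value=10)
```

On these toy records the body holds for two of four patients and the full rule for one, so the support is 0.25 and the confidence is 0.5.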

Page 25: Research on Data Mining and Knowledge Discovery at WPI

Closer Look: WPI Weka

Tool for mining complex temporal/spatial associations

Page 26: Research on Data Mining and Knowledge Discovery at WPI

Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)

• SNP analysis
– discovering correlations between sequence variations and diseases
• Gene expression
– discovering patterns that cause a gene to be expressed in a particular cell

Page 27: Research on Data Mining and Knowledge Discovery at WPI


Correlating Genetics with Diseases

• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research

• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.

Page 28: Research on Data Mining and Knowledge Discovery at WPI

Genomic Data Resources

Patient Gender | SMA Type (Severity) | SNP Location | C212 Father / Mother | AG1-CA Father / Mother
Female | Severe | Y272C | 31 / 28 29 | 102 / 108 112
Male | Mild | Y272C | 28 29 / 25 | 108 112 / 114

Wirth, B. et al. Journal of Human Molecular Genetics

Page 29: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Techniques

• Association Rule Mining
• Metrics for evaluation of mined rules
– Confidence: P(Consequent | Premise)
– Support: P(Consequent ∩ Premise)
– Lift: P(Consequent | Premise) / P(Consequent)
• Example:
(Ag1-CA, 110 = absent) & (Ag1-CA, 108 = associated) & (Gender = Female) => (SMA Type = Severe)
Confidence: 100%, Support: 9.364%, Lift: 2.39
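The three metrics on this slide can be estimated by counting records. The sketch below is a minimal illustration, assuming each record is a set of items and a rule is a pair of item sets; the function name `rule_metrics` and the toy records are hypothetical and unrelated to the SMA data.

```python
from typing import Dict, Iterable, List, Set


def rule_metrics(
    records: List[Set[str]], premise: Iterable[str], consequent: Iterable[str]
) -> Dict[str, float]:
    """Estimate support, confidence, and lift of premise => consequent."""
    n = len(records)
    if n == 0:
        raise ValueError("at least one record is required")
    premise, consequent = set(premise), set(consequent)
    n_premise = sum(1 for r in records if premise <= r)
    n_consequent = sum(1 for r in records if consequent <= r)
    n_both = sum(1 for r in records if premise <= r and consequent <= r)
    support = n_both / n                                   # P(Consequent ∩ Premise)
    confidence = n_both / n_premise if n_premise else 0.0  # P(Consequent | Premise)
    p_consequent = n_consequent / n                        # P(Consequent)
    lift = confidence / p_consequent if p_consequent else 0.0
    return {"support": support, "confidence": confidence, "lift": lift}


# Toy records, invented for this example only.
toy_records = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"c"}]
metrics = rule_metrics(toy_records, premise={"a", "b"}, consequent={"c"})
```

Here the premise occurs in 2 of 5 records and the whole rule in 1, so support is 0.2 and confidence is 0.5. The consequent alone occurs in 4 of 5 records, so lift is 0.5 / 0.8 = 0.625. A lift below 1 means the premise makes the consequent less likely than its base rate.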

Page 30: Research on Data Mining and Knowledge Discovery at WPI

Mining Gene Expression Patterns

• Different cells require different proteins
• DNA uses a four letter alphabet (ATCG)
• Cell expression pattern depends on motifs

[Figure: promoter sequences with motifs of 10 and 20 basepairs, and an expression grid for neurons (n1, n2, n3) and muscle (m1, m2, m3); red=ON, white=OFF]
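As a rough illustration of how motifs might separate expression patterns, the sketch below looks for short substrings (k-mers) that occur in every promoter of genes that are ON in a cell type and in no promoter of genes that are OFF. This is a deliberately simple stand-in and not the method used by the WPI system; the function names and the toy sequences are hypothetical.

```python
from typing import List, Set


def kmers(sequence: str, k: int) -> Set[str]:
    """All distinct substrings of length k in a DNA sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def discriminative_kmers(on_seqs: List[str], off_seqs: List[str], k: int) -> Set[str]:
    """k-mers found in every ON promoter and in no OFF promoter."""
    if not on_seqs:
        return set()
    shared = set.intersection(*(kmers(s, k) for s in on_seqs))
    seen_in_off = set().union(*(kmers(s, k) for s in off_seqs))
    return shared - seen_in_off


# Toy promoters, invented for this example only.
on_promoters = ["ATTGACGT", "CCGACGTA"]
off_promoters = ["ATTGTTTA"]
candidates = discriminative_kmers(on_promoters, off_promoters, k=4)
```

With these toy sequences the candidates are `GACG` and `ACGT`, the two 4-mers shared by both ON promoters and absent from the OFF promoter. Real motifs tolerate mismatches and variable spacing, which exact k-mer matching ignores.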

Page 31: Research on Data Mining and Knowledge Discovery at WPI

Gene Expression Analysis

[Figure: promoter sequences PR1 to PR9 for Gene 1 to Gene 9, annotated with motif labels M1 to M5 and with the cell type in which each gene is expressed]

Promoter sequences, in the order shown on the slide:
ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC
ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA
TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA
GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC
ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA
CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA
GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA
AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA
CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT

Cell types, in the order shown on the slide: neural, neural, muscle, neural, muscle, neural, neural, neural, muscle

Page 32: Research on Data Mining and Knowledge Discovery at WPI

Our System: CAGE

To predict gene expression based on DNA sequences.

[Figure: CAGE predicts whether Gene 1, Gene 2, and Gene 3 are On or Off in each cell type: Muscle Cell, Neural Cell, Seam Cells]

Page 33: Research on Data Mining and Knowledge Discovery at WPI

Summary

• KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
• The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of those patterns
• Data mining is the discovery and extraction of patterns from data, not the extraction of data
• Important challenges in data mining: privacy, security, scalability, real-time, and handling non-conventional data

Page 34: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Resources – Books

1. Advances in Knowledge Discovery and Data Mining. Eds.: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
2. Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann Publishers. 2001.
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I. Witten and E. Frank. Morgan Kaufmann Publishers. 2000.
4. Data Mining. Technologies, Techniques, Tools, and Trends. B. Thuraisingham. CRC, 1998.
5. Principles of Data Mining. D. J. Hand, H. Mannila, and P. Smyth. MIT Press, 2000.
6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Springer Verlag, 2001.
7. Data Mining Cookbook, modeling data for marketing, risk, and CRM. O. Parr Rud. Wiley, 2001.
8. Data Mining. A hands-on approach for business professionals. R. Groth. Prentice Hall, 1998.
9. Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann, 1999.
10. Data Mining Methods for Knowledge Discovery. Cios, Pedrycz, & Swiniarski. Kluwer, 1998.

Page 35: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Resources – Books (cont.)

11. Mastering Data Mining. M. Berry & G. Linoff. John Wiley & Sons, 2000.
12. Data Mining Techniques for Marketing, Sales and Customer Support. Berry & Linoff. John Wiley & Sons, 1997.
13. Decision Support using Data Mining. S. Anand and A. Buchner. Financial Times Pitman Publishing, 1998.
14. Feature Selection for Knowledge Discovery and Data Mining. Liu and Motoda. Kluwer, 1998.
15. Feature Extraction, Construction and Selection: A Data Mining Perspective. Eds: Motoda and Liu. Kluwer, 1998.
16. Knowledge Acquisition from Databases. Xindong Wu.
17. Mining Very Large Databases with Parallel Processing. A. Freitas & S. Lavington. Kluwer, 1998.
18. Predictive Data-Mining: A Practical Guide. Weiss & Indurkhya. Morgan Kaufmann, 1998.
19. Machine Learning and Data Mining: Methods and Applications. Michalski, Bratko, and Kubat. John Wiley & Sons, 1998.
20. Rough Sets and Data Mining: Analysis of Imprecise Data. Eds: Lin and Cercone. Kluwer.
21. Seven Methods for Transforming Corporate Data into Business Intelligence. Vasant Dhar and Roger Stein. Prentice-Hall, 1997.

Page 36: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Resources – Journals

• Data Mining and Knowledge Discovery Journal

Newsletters:
• ACM SIGKDD Explorations Newsletter

Related Journals:
• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Intl. Journal of Intelligent Information Systems

Page 37: Research on Data Mining and Knowledge Discovery at WPI

Data Mining Resources – Conferences

• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
• ICDM: IEEE International Conference on Data Mining
• SIAM International Conference on Data Mining
• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery

Related Conferences:
• ICML: Intl. Conf. on Machine Learning
• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
• IJCAI: International Joint Conference on Artificial Intelligence
• AAAI: American Association for Artificial Intelligence Conference
• SIGMOD/PODS: ACM Intl. Conference on Data Management
• ICDE: International Conference on Data Engineering
• VLDB: International Conference on Very Large Data Bases