Research on Data Mining and Knowledge Discovery at WPI
WPI Center for Research in Exploratory Data and Information Analysis CREDIA
Research on Data Mining and Knowledge Discovery at WPI
Prof. Carolina Ruiz
Department of Computer Science
Worcester Polytechnic Institute
Outline of this talk
• Short tutorial on Data Mining and Knowledge Discovery in Databases (KDD)
• Sample ongoing KDD research projects at WPI
Need for Data Mining
• Data are being gathered and stored extremely fast
  – Currently, the amount of new data stored in digital computer systems every day is roughly equivalent to 3000 pages of text for every person on Earth (estimate based on a projection to 2003 of a study led by Lyman & Varian at UC-Berkeley in 2000).
• Computational tools and techniques are needed to help humans in summarizing, understanding, and taking advantage of accumulated data
What is Data Mining? Or, more generally, Knowledge Discovery in Databases (KDD)
“Non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data” [Fayyad et al. 1996]
• Raw Data → Data Mining → Patterns
  – Analytical and Statistical Patterns (rules, decision trees, …)
  – Visual Patterns
Fayyad, U., Piatetsky-Shapiro, G., and Smyth, P. "From Data Mining to Knowledge Discovery in Databases." AI Magazine, pp. 37-54. Fall 1996.
Data Analysis (KDD) Process
[Diagram: stages of the KDD process]
• data sources → data
• data management
  – databases
  – data warehouses
• data “pre”-processing → clean data
  – noisy/missing data
  – dim. reduction
• data analysis / data mining → models
  – analytical
  – statistical
  – visual
• model/pattern evaluation → “good” model
  – quantitative
  – qualitative
• model/patterns deployment (applied to new data)
  – prediction
  – decision support
KDD is Interdisciplinary: techniques come from multiple fields
• Machine Learning (AI)
  – Contributes (semi-)automatic induction of empirical laws from observations & experimentation
• Statistics
  – Contributes language, framework, and techniques
• Pattern Recognition
  – Contributes pattern extraction and pattern matching techniques
• Databases
  – Contributes efficient data storage, data cleansing, and data access techniques
• Data Visualization
  – Contributes visual data displays and data exploration
• High Performance Comp.
  – Contributes techniques to efficiently handle complexity
• Application Domain
  – Contributes domain knowledge
Data Mining Modes
• Confirmatory (verification)
  – Given a hypothesis, verify its validity against the data
• Exploratory (discovery)
  – Prescriptive patterns
    • Patterns for predicting behavior of newly encountered entities
  – Descriptive patterns
    • Patterns for presenting the behavior of observed entities in a human-understandable format
Analytical and Visual Data Mining
• Analytical
  – A model that represents the data is constructed using computational methods
• Visual
  – Data are displayed on computer screen using colors and shapes
  – Patterns in the data are identified by the human (user) eye.
What do you want to learn from your data?
KDD approaches
• classification
• regression
• clustering
• summarization
• dependency/assoc. analysis
• change/deviation detection
[Figure: sample outputs of each approach, including a bar chart, rules such as "IF a & b & c THEN d & k" and "IF k & a THEN e", association rules "A, B -> C 80%" and "C, D -> A 22%", and a decision tree]
Commercial Data Mining Systems
Closer Look: IBM’s Intelligent Miner
Data Mining Academic Systems
• CBA: Liu et al., National Univ. of Singapore
• WPI WEKA: Our Temporal/Spatial Association Rules
• WEKA: Frank et al., University of Waikato, New Zealand
• ARMiner: Cristofor et al., UMass/Boston
Some Current Analytical Data Mining Research Projects at WPI
• Mining Complex Data: Set and Sequence Mining
  – Systems performance Data
  – Sleep Data
  – Financial Data
  – Web Data
• Data Mining for Genetic Analysis
  – Correlating genetic information with diseases
  – Predicting gene expression patterns
• Data Mining for Electronic Commerce
  – Collaborative and Content-Based Filtering
    • Using Association Rules and using Neural Networks
Mining Complex Data
   | names/aliases | bank account | age | felonies | gender | iris scan | …
P1 | {joe smith, greg jones} | | 27 | <burglary 2/86, fraud 11/93, murder 3/99> | M | |
P2 | {kathy pearls, kathy dow, susan harris} | 97,72,67,80,… | 53 | <child abuse 9/98, kidnapping 2/03> | F | |
P3 | {drew harris} | 10,29,37,16,… | 49 | < > | M | |
…  | … | … | … | … | … | … |
Based partially on work w/ Norfolk County Sheriff Office
Sample Complex Patterns
Potential temporal/spatial association:
  – Teenage males from Eastern Massachusetts who are convicted of burglary are likely (7%) to commit violent crimes when they are adults.
Analyzing Sleep Data
WPI, UMass Medical, BC
(Source: http://www.blsc.com)
DATA SET
• Clinical (sequential)
  – Electro-encephalogram (EEG)
  – Electro-oculogram (EOG)
  – Electro-myogram (EMG)
  – Probe measuring flow of oxygen in blood, etc.
• Diagnostic (tabular)
  – Questionnaire responses
  – Patient’s demographic info.
  – Patient’s medical history
Purpose: Associations between sleep patterns and health/pathology
• Obtain patterns of different sleep stages (4 sleep + REM + Wake)
Potential Rules:
(A) Association Rules
  (Sleep latency < 3 min) & (hereditary disorder) => Narcolepsy (confidence = 92%, support = 13%)
(B) Classification Rules
  (snoring = HEAVY) & (AHI* > 30/hour): severe OSA** => (Race = Caucasian) (confidence = 70%, support = 8%)
*AHI = Apnea-Hypopnea index, **OSA = Obstructive Sleep Apnea
Input Data
• Each instance: [Tabular | set | sequential] * attributes
   | attr1 | attr2 | attr3 | attr4 | attr5 | [class]
   | illnesses | heart rate | age | oxygen | gender | Epworth
P1 | {depression, fatigue} | | 27 | | M | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,… | 73 | 90,92,96,89,86,… | F | 23
P3 | {arthritis} | 102,99,87,96,… | 49 | 97,100,82,80,70,… | M | 14
…  | … | … | … | … | … | …
Analyzing Financial Data
• Sequential data
  – daily stock values
• “Normal” (tabular/relational) data
  – sector (computers, agricultural, educational, …), type of government, product releases, company awards, …
• Desired rules:
  – If DELL’s stock value increases & 1999<year<2002 => IBM’s stock value decreases
Financial Data Analysis
                | AMD | Cisco | …
Stock values    | … | … | …
Products        | Athlon XP 2200+ (Nov 11, 02) | Aironet 1100 Series (Oct 2, 02) | …
Awards          | Lifetime Achievement (Oct 31, 02) | None | …
Neg. Events     | Reduce workforce (Nov 14, 02) | None | …
Expansion/Merge | None | None | …
Events – Sleep Data
6 basic sleep events/stages: W, S1, S2, S3, S4, REM
• SaO2: the mean oxygen saturation, around 90%
• Heart rate, shown by ECG in beats per minute
• The sleep stages: W or Wake, 1 or Stage 1, 2, 3, 4, and REM or Rapid Eye Movement stage
Also shown, in brown markings, are:
• Epoch (of duration 30 sec)
• Clock time (indicating total sleep time)
Events – Financial Data
• Basic events: 16 or so financial templates [Little&Rhodes78]
• Difficult pattern matching: alignments and time warping
[Figure: example templates: Rounding Top Reversal, Descending Triangle Reversal, Panic Reversal, Head & Shoulders Reversal]
Example: Event Identification
• Templates = increase, decrease, sustain
• Confidence = 90%, support = 15%, class = Epworth
   | illnesses | heart rate | age | oxygen | gender | Epworth
P1 | {depression, fatigue} | | 27 | | M | 5
P2 | {stroke, dementia, fatigue} | 97,72,67,80,… | 73 | 90,92,96,89,86,… | F | 23
P3 | {arthritis} | 102,99,87,96,… | 49 | 97,100,82,80,70,… | M | 14
…  | … | … | … | … | … | …
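One way the increase / decrease / sustain templates could be matched against a sequential attribute is sketched below in Python. This is an illustration only, not the WPI system's method: the function name `identify_events`, the `tolerance` threshold, and the (label, start, end) output format are all assumptions introduced here.

```python
from itertools import groupby


def identify_events(values, tolerance=1):
    """Segment a numeric sequence into (label, start, end) events.

    Hypothetical helper: each step between consecutive samples is labeled
    'increase' if it rises by more than `tolerance`, 'decrease' if it falls
    by more than `tolerance`, and 'sustain' otherwise. Consecutive steps
    with the same label are merged into one event spanning sample indices
    start..end.
    """
    labels = []
    for prev, curr in zip(values, values[1:]):
        diff = curr - prev
        if diff > tolerance:
            labels.append("increase")
        elif diff < -tolerance:
            labels.append("decrease")
        else:
            labels.append("sustain")

    events = []
    position = 0
    for label, run in groupby(labels):
        length = len(list(run))
        events.append((label, position, position + length))
        position += length
    return events


# Oxygen readings listed for patient P2 on this slide.
oxygen_p2 = [90, 92, 96, 89, 86]
for event in identify_events(oxygen_p2):
    print(event)
```

The resulting events carry start and end positions, so they can serve as the intervals that temporal rules refer to (for example, oxygen-inc[t0,t1]).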
Temporal Relations between two Events
[Figure: timeline diagrams of event1 and event2 for each relation]
• meets
• before / after
• overlaps
• is equal to
• starts
• during
• finishes
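The relations listed above can be decided from the start and end times of the two events. The following Python sketch assumes each event is a (start, end) pair with start < end; the function names and the "inverse of ..." labels for cases the slide does not name are assumptions made for illustration.

```python
def _direct_relation(a, b):
    """Return the slide's name for how interval a relates to b, or None."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a_end < b_start:
        return "before"
    if a_end == b_start:
        return "meets"
    if a_start == b_start and a_end == b_end:
        return "is equal to"
    if a_start == b_start and a_end < b_end:
        return "starts"
    if a_start > b_start and a_end == b_end:
        return "finishes"
    if a_start > b_start and a_end < b_end:
        return "during"
    if a_start < b_start < a_end < b_end:
        return "overlaps"
    return None


def temporal_relation(a, b):
    """Classify the temporal relation of event a to event b.

    Events are (start, end) pairs with start < end. When only the reverse
    relation holds, the result is 'after' or 'inverse of <relation>'.
    """
    for start, end in (a, b):
        if start >= end:
            raise ValueError("each event needs start < end")
    direct = _direct_relation(a, b)
    if direct is not None:
        return direct
    inverse = _direct_relation(b, a)
    return "after" if inverse == "before" else "inverse of " + inverse


# oxygen-inc[t0,t1] followed immediately by HR-dec[t1,t2], with t0=0, t1=1, t2=2
print(temporal_relation((0, 1), (1, 2)))
```

With this encoding, "heart rate decreases immediately after oxygen stops increasing" corresponds to the oxygen event meeting the heart-rate event.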
Example: temporal association rules
• Heart rate decreases immediately after oxygen stops increasing & gender=M => epworth=10 (conf=95%, supp=23%)
  – HR-dec[t1,t2] & oxygen-inc[t0,t1] & gender=M => epworth=10
• Heart rate sustains while oxygen increases & patient suffers from dementia => ethnicity=white (conf=99%, supp=16%)
• Patient suffers from dementia and depression & gender=F & REM[t0,t2] => oxygen-inc[t1,t3] (conf=91%, supp=17%)
Timeline: t0, t1, t2, t3
Closer Look: WPI Weka
Tool for mining complex temporal/spatial associations
Data Mining for Genetic Analysis
w/ Profs. Ryder (BB, WPI), Krushkal (BB, U. Tennessee), Ward (CS, WPI), and Alvarez (CS, BC)
• SNP analysis
  – discovering correlations between sequence variations and diseases
• Gene expression
  – discovering patterns that cause a gene to be expressed in a particular cell
Correlating Genetics with Diseases
• Utilize Data Mining Techniques with Actual Genetic Data Sampled from Research
• Spinal Muscular Atrophy: inherited disease that results in progressive muscle degeneration and weakness.
Genomic Data Resources
Patient Gender | SMA Type (Severity) | SNP Location | C212 Father / Mother | AG1-CA Father / Mother
Female | Severe | Y272C | 31 / 28 29 | 102 / 108 112
Male | Mild | Y272C | 28 29 / 25 | 108 112 / 114
Wirth, B. et al. Journal of Human Molecular Genetics
Data Mining Techniques
• Association Rule Mining
• Metrics for evaluation of mined rules
  – Confidence: P(Consequent | Premise)
  – Support: P(Consequent & Premise)
  – Lift: P(Consequent | Premise) / P(Consequent)
• Example:
  Ag1-CA, 110 = absent & Ag1-CA, 108 = associated & Gender = Female => SMA Type = Severe
  Confidence: 100%, Support: 9.364%, Lift: 2.39
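The three metrics can be computed by counting records. The Python sketch below follows the definitions on this slide; the function name `rule_metrics`, the attribute names, and the eight toy records are made up for illustration and are not the SMA data.

```python
def rule_metrics(records, premise, consequent):
    """Return (support, confidence, lift) of the rule premise => consequent.

    `records` is a list of dicts; `premise` and `consequent` map attribute
    names to required values. Probabilities are estimated as record counts
    divided by the number of records.
    """
    def matches(record, conditions):
        return all(record.get(key) == value for key, value in conditions.items())

    total = len(records)
    n_premise = sum(matches(r, premise) for r in records)
    n_consequent = sum(matches(r, consequent) for r in records)
    n_both = sum(matches(r, premise) and matches(r, consequent) for r in records)
    if total == 0 or n_premise == 0 or n_consequent == 0:
        raise ValueError("premise and consequent must each match at least one record")

    support = n_both / total                      # P(Consequent & Premise)
    confidence = n_both / n_premise               # P(Consequent | Premise)
    lift = confidence / (n_consequent / total)    # confidence / P(Consequent)
    return support, confidence, lift


# Hypothetical toy records, not taken from the study.
toy_records = (
    [{"gender": "F", "sma_type": "Severe"}] * 2
    + [{"gender": "F", "sma_type": "Mild"}] * 2
    + [{"gender": "M", "sma_type": "Severe"}] * 1
    + [{"gender": "M", "sma_type": "Mild"}] * 3
)
support, confidence, lift = rule_metrics(
    toy_records, {"gender": "F"}, {"sma_type": "Severe"}
)
print(f"support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

A lift above 1 indicates that the consequent is more frequent among records matching the premise than in the data overall.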
Mining Gene Expression Patterns
• Different cells require different proteins
• DNA uses a four letter alphabet (ATCG)
• Cell expression pattern depends on motifs
[Figure: promoter sequences, with spans of 20 base pairs and 10 base pairs labeled, next to an expression grid for neurons (n1, n2, n3) and muscle cells (m1, m2, m3); red = ON, white = OFF]
Gene Expression Analysis
PROMOTER(S) | sequence | gene | CELL TYPES
PR1 | ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC | Gene 1 | neural
PR2 | ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA | Gene 2 | neural
PR3 | TGCGATTCATGGGCTAGTTATATAGGTAGTACGTCTAAGAAA | Gene 3 | muscle
PR4 | GTTGTGTAGTGGGCCCCGACTATAGTCTGTATTCCGTCGAAC | Gene 4 | neural
PR5 | ACTGTCTAATGGCTAACTTATATAGTGACTACGTCGTCGAGA | Gene 5 | muscle
PR6 | CCTGGACTATGGGCCCCTTCTAAAGTCTGTACGTCGTCGATA | Gene 6 | neural
PR7 | GGCCTAAAATGTAGTCCTTATATAGTCTGATTCTCGTCGAAA | Gene 7 | neural
PR8 | AGTGTCCTAAGGGCGACTTATCTAGTCTGTATTCCGTCGACA | Gene 8 | neural
PR9 | CTTGTCTAATGGGCCGACTATATAGTCTGTACGATTCCGAAT | Gene 9 | muscle
[Figure: occurrences of motifs M1 to M5 are marked on the promoter sequences]
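A first step toward relating motifs to expression is to record which motifs occur in which promoter. The Python sketch below does this by plain substring search. The slide does not give the actual sequences of motifs M1 to M5, so the motif strings used here are hypothetical placeholders, as are the function names; the two promoter sequences are PR1 and PR2 from this slide.

```python
def find_motif(sequence, motif):
    """Return every start index at which `motif` occurs (overlaps allowed)."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    positions = []
    start = sequence.find(motif)
    while start != -1:
        positions.append(start)
        start = sequence.find(motif, start + 1)
    return positions


def motif_profile(promoters, motifs):
    """Map each promoter name to the set of motifs found in its sequence."""
    return {
        name: {motif for motif in motifs if find_motif(sequence, motif)}
        for name, sequence in promoters.items()
    }


promoters = {
    "PR1": "ATTGTCTATAGTCCCCTGACTTAGTCATTCTGTACTCGATATC",
    "PR2": "ATTGTGACTTGGGCGTAGTATATAGTCTGTACGTCGTCGAAA",
}
# Hypothetical motif strings chosen only to exercise the code.
example_motifs = ["TATAG", "GGGC"]

for name, found in motif_profile(promoters, example_motifs).items():
    print(name, sorted(found))
```

Each promoter's motif set, paired with the cell type in which its gene is expressed, gives the kind of instance from which motif-based expression rules could be mined.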
Our System: CAGE
To predict gene expression based on DNA sequences.
[Figure: for each cell type (Muscle Cell, Neural Cell, Seam Cells), CAGE predicts whether Gene 1, Gene 2, and Gene 3 are On or Off]
Summary
• KDD is the “non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data”
• The KDD process includes data collection and pre-processing, data mining, and evaluation and validation of the discovered patterns
• Data mining is the discovery and extraction of patterns from data, not the extraction of data
• Important challenges in data mining: privacy, security, scalability, real-time, and handling non-conventional data
Data Mining Resources – Books
1. Advances in Knowledge Discovery and Data Mining. Eds.: Fayyad, Piatetsky-Shapiro, Smyth, and Uthurusamy. The MIT Press, 1995.
2. Data Mining: Concepts and Techniques. J. Han and M. Kamber. Morgan Kaufmann Publishers, 2001.
3. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. I. Witten and E. Frank. Morgan Kaufmann Publishers, 2000.
4. Data Mining: Technologies, Techniques, Tools, and Trends. B. Thuraisingham. CRC, 1998.
5. Principles of Data Mining. D. J. Hand, H. Mannila, and P. Smyth. MIT Press, 2000.
6. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. T. Hastie, R. Tibshirani, J. Friedman. Springer Verlag, 2001.
7. Data Mining Cookbook: Modeling Data for Marketing, Risk, and CRM. O. Parr Rud. Wiley, 2001.
8. Data Mining: A Hands-On Approach for Business Professionals. R. Groth. Prentice Hall, 1998.
9. Data Preparation for Data Mining. Dorian Pyle. Morgan Kaufmann, 1999.
10. Data Mining Methods for Knowledge Discovery. Cios, Pedrycz, & Swiniarski. Kluwer, 1998.
11. Mastering Data Mining. M. Berry & G. Linoff. John Wiley & Sons, 2000.
12. Data Mining Techniques for Marketing, Sales and Customer Support. Berry & Linoff. John Wiley & Sons, 1997.
13. Decision Support using Data Mining. S. Anand and A. Buchner. Financial Times Pitman Publishing, 1998.
14. Feature Selection for Knowledge Discovery and Data Mining. Liu and Motoda. Kluwer, 1998.
15. Feature Extraction, Construction and Selection: A Data Mining Perspective. Eds.: Motoda and Liu. Kluwer, 1998.
16. Knowledge Acquisition from Databases. Xindong Wu.
17. Mining Very Large Databases with Parallel Processing. A. Freitas & S. Lavington. Kluwer, 1998.
18. Predictive Data-Mining: A Practical Guide. Weiss & Indurkhya. Morgan Kaufmann, 1998.
19. Machine Learning and Data Mining: Methods and Applications. Michalski, Bratko, and Kubat. John Wiley & Sons, 1998.
20. Rough Sets and Data Mining: Analysis of Imprecise Data. Eds.: Lin and Cercone. Kluwer.
21. Seven Methods for Transforming Corporate Data into Business Intelligence. Vasant Dhar and Roger Stein. Prentice-Hall, 1997.
Data Mining Resources – Journals
• Data Mining and Knowledge Discovery Journal
Newsletters:
• ACM SIGKDD Explorations Newsletter
Related Journals:
• TKDE: IEEE Transactions on Knowledge and Data Engineering
• TODS: ACM Transactions on Database Systems
• JACM: Journal of the ACM
• Data and Knowledge Engineering
• JIIS: Intl. Journal of Intelligent Information Systems
Data Mining Resources – Conferences
• KDD: ACM SIGKDD Intl. Conf. on Knowledge Discovery and Data Mining
• ICDM: IEEE International Conference on Data Mining
• SIAM International Conference on Data Mining
• PKDD: European Conference on Principles and Practice of Knowledge Discovery in Databases
• PAKDD: Pacific-Asia Conference on Knowledge Discovery and Data Mining
• DaWaK: Intl. Conference on Data Warehousing and Knowledge Discovery
Related Conferences:
• ICML: Intl. Conf. on Machine Learning
• IDEAL: Intl. Conf. on Intelligent Data Engineering and Automated Learning
• IJCAI: International Joint Conference on Artificial Intelligence
• AAAI: American Association for Artificial Intelligence Conference
• SIGMOD/PODS: ACM Intl. Conference on Data Management
• ICDE: International Conference on Data Engineering
• VLDB: International Conference on Very Large Data Bases