tm1-iihmr-DM
-
Upload
gaurav-bhati -
Category
Documents
-
view
36 -
download
3
Transcript of tm1-iihmr-DM
DATA MINING IN HEALTHCARE
Professor – Healthcare IT,International Institute of Health Management Research
(IIHMR - New Delhi).
Prof. (Dr.) T. Muthukumar M.Sc; M.C.A; M.B.A; M.Phil; Ph.D.
Agenda• Data Mining – Introduction & Background• Data Warehouse and OLAP – Introduction• Data Mining Primitive – Data Preprocessing• Types of Data Mining & Techniques• Mining Complex Types of Data• Application – Tools, Case Studies, Healthcare
Application
Recommended Reference Books• E. Alpaydin. Introduction to Machine Learning, 2nd ed., MIT Press, 2011
• S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002
• R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000
• T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
• U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
• U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001
Recommended Reference Books
• J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2nd ed., 2006 (3ed. 2011)
• T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed., Springer-Verlag, 2009
• B. Liu, Web Data Mining, Springer 2006.
• T. M. Mitchell, Machine Learning, McGraw Hill, 1997
• P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005
Recommended Reference Books• S. M. Weiss and N. Indurkhya, Predictive Data
Mining, Morgan Kaufmann, 1998
• I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2nd ed. 2005
66
Introduction of Data Mining• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Why Data Mining?
• The Explosive Growth of Data: from terabytes to petabytes
– Data collection and data availability
• Automated data collection tools, database systems, Web, computerized society
– Major sources of abundant data
• Business: Web, e-commerce, transactions, stocks, …
Why Data Mining? • Science: Remote sensing,
bioinformatics, scientific simulation …
• Society and everyone: news, digital cameras, YouTube
Why Data Mining?
• We are drowning in data, but starving for knowledge!
• “Necessity is the mother of invention”—Data mining—Automated analysis of massive data sets
Data pyramid
Data
Information
Knowledge
Wisdom
Data + context
Information + rules
Knowledge + experience
Why we need Data Mining?• Large number of records(cases) (108-1012 bytes)
– One thousand (103) bytes = 1 kilobyte (KB)– One million (106) bytes = 1 megabyte (MB)– One billion (109) bytes = 1 gigabyte (GB)– One trillion (1012) bytes = 1 terabyte (TB)
• High dimensional data (variables)– 10-104 attributes
• Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
• Data collected and stored at enormous speeds (Gbyte/hour)– remote sensor on a satellite– telescope scanning the skies– scientific simulations generating terabytes of data
• Classical modeling techniques are infeasible• Data reduction• Cataloging, classifying, segmenting data• Helps scientists in Hypothesis Formation
Scientific Viewpoint
Evolution of Sciences: New Data Science Era
• Before 1600: Empirical science• 1600-1950s: Theoretical science
– Each discipline has grown a theoretical component. Theoretical models often motivate experiments and generalize our understanding.
Evolution of Sciences: New Data Science Era
• 1950s-1990s: Computational science– Over the last 50 years, most disciplines have
grown a third, computational branch (e.g. empirical, theoretical, and computational ecology, or physics, or linguistics.)
– Computational Science traditionally meant simulation. It grew out of our inability to find closed-form solutions for complex mathematical models.
• 1990-now: Data science– The flood of data from new scientific
instruments and simulations– The ability to economically store and
manage petabytes of data online– The Internet and computing Grid that
makes all these archives universally accessible
Evolution of Sciences: New Data Science Era
– Scientific info. management, acquisition, organization, query, and visualization tasks scale almost linearly with data volumes
– Data mining is a major new challenge!
Evolution of Sciences: New Data Science Era
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
What Is Data Mining?
• Data mining (knowledge discovery from data) – Extraction of interesting (non-trivial,
implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data.
What Is Data Mining?• Alternative names
– Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
• Watch out: Is everything “data mining”? – Simple search and query processing – (Deductive) expert systems
What is Data Mining?
• The automated extraction of predictive information from large databases – Automated – Extraction – Predictive In large Database.
Business IntelligenceImproving Business Insight
“A broad category of applications and technologies for gathering, storing, analyzing, sharing and providing access to data to help enterprise users make better business decisions.”
– Gartner
RelationshipsAnd Acronyms...
What does Data Mining Do?
Explores Your Data
Finds Patterns
Performs Predictions
DM and BI
• BI is geared at an end user, such as a business owner, knowledge worker etc.
• DM is an IT technology generally geared towards a more advanced user – today
• By the way: who is qualified to use DM today?
Knowledge Discovery (KDD) Process
• This is a view from typical database systems and data warehousing communities
• Data mining plays an essential role in the knowledge discovery process
Data Cleaning
Data Integration
Databases
Data Warehouse
Task-relevant Data
Selection
Data Mining
Pattern Evaluation
______
______
______
Transformed Data
Patternsand
Rules
Target Data
RawData
KnowledgeData MiningTransformation
Interpretation& Evaluation
Selection& Cleaning
IntegrationUnderstanding
Knowledge Discovery Process
DATAWarehouse
Knowledge
Data Mining in Business Intelligence
Increasing potentialto support
business decisions End User
Business Analyst
DataAnalyst
DBA
Decision Making
Data PresentationVisualization Techniques
Data MiningInformation Discovery
Data ExplorationStatistical Summary, Querying, and Reporting
Data Preprocessing/Integration, Data Warehouses
Data SourcesPaper, Files, Web documents, Scientific experiments, Database Systems
KDD Process: A Typical View from ML and Statistics
Input Data Data Mining
Data Pre-Processing
Post-Processing
• This is a view from typical machine learning and statistics communities
Data integrationNormalization
Feature selectionDimension reduction
Pattern discoveryAssociation & correlation
ClassificationClustering
Outlier analysis… … … …
Pattern evaluationPattern selection
Pattern interpretationPattern visualization
Example: Medical Data Mining
• Health care & medical data mining – often adopted such a view in statistics and machine learning
• Preprocessing of the data (including feature extraction and dimension reduction)
• Classification or/and clustering processes
• Post-processing for presentation
The Evolution of Data AnalysisEvolutionary Step
Business Question
Enabling Technologies
Product Providers
Characteristics
Data Collection (1960s)
"What was my total revenue in the last five years?"
Computers, tapes, disks
IBM, CDC
Retrospective, static data delivery
Data Access (1980s)
"What were unit sales in New England last March?"
Relational databases (RDBMS), Structured Query Language (SQL), ODBC
Oracle, Sybase, Informix, IBM, Microsoft
Retrospective, dynamic data delivery at record level
Data Warehousing & Decision Support (1990s)
"What were unit sales in New England last March? Drill down to Boston."
On-line analytic processing (OLAP), multidimensional databases, data warehouses
SPSS, Comshare, Arbor, Cognos, Microstrategy,NCR
Retrospective, dynamic data delivery at multiple levels
Data Mining (Emerging Today)
"What’s likely to happen to Boston unit sales next month? Why?"
Advanced algorithms, multiprocessor computers, massive databases
SPSS/Clementine, Lockheed, IBM, SGI, SAS, NCR, Oracle, numerous startups
Prospective, proactive information delivery
When is DM useful
• Data rich world• Large data (dimensionality and size)
– Image data (size)– Gene chip data (dimensionality)
• Little knowledge about data (exploratory data analysis)– What if we have some knowledge?
Data Mining Versus Statistical Analysis
•Data Analysis– Tests for statistical correctness of
models• Are statistical assumptions of
models correct?– Eg Is the R-Square good?
– Hypothesis testing• Is the relationship significant?
– Use a t-test to validate significance
– Tends to rely on sampling– Techniques are not optimised for
large amounts of data– Requires strong statistical skills
•Data Mining– Originally developed to act as
expert systems to solve problems
– Less interested in the mechanics of the technique
– If it makes sense then let’s use it
– Does not require assumptions to be made about data
– Can find patterns in very large amounts of data
– Requires understanding of data and business problem
Data Mining versus OLAP
•OLAP - On-line Analytical Processing
– Provides you with a very good view of what is happening, but can not predict what will happen in the future or why it is happening
Data Mining vs. Database
• DB’s user knows what is looking for.• DM’s user might/might not know what is looking for.• DB’s answer to query is 100% accurate, if data correct.• DM’s effort is to get the answer as accurate as possible.• DB’s data are retrieved as stored.• DM’s data need to be cleaned (some what) before producing results.• DB’s results are subset of data.• DM’s results are the analysis of the data.• The meaningfulness of the results is not the concern of Database as• it is the main issue in Data Mining.
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD) is the process of finding useful information and patterns in the data.
• Data Mining is the use of algorithms to find the usefulinformation in the KDD process.
• KDD process is:» Data cleaning & integration (Data Pre-processing)» Creating a common data repository for all sources, such as data warehouse.
Data mining» Visualization for the generated results
Data mining is not• Brute-force crunching of bulk data • “Blind” application of algorithms• Going to find relationships where
none exist• Presenting data in different ways• A database intensive task• A difficult to understand
technology requiring an advanced degree in computer science
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Data Mining: On What Kinds of Data?
• Database-oriented data sets and applications
– Relational database, data warehouse, transactional database
• Advanced data sets and advanced applications
– Data streams and sensor data
– Time-series data, temporal data, sequence data (incl. bio-sequences)
Data Mining: On What Kinds of Data?
– Structure data, graphs, social networks and multi-linked data
– Object-relational databases
– Heterogeneous databases
– Spatial data and spatiotemporal data
– Multimedia database
– Text databases & The World-Wide Web
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Data Mining Algorithms
Online AnalyticalProcessing
Discovery Driven Methods
SQL Query ToolsDescription Prediction
Classification Regressions
Decision Trees
Neural Networks
Visualization
Clustering
Association
Sequential Analysis
Data Mining Function: (1) Generalization
• Information integration and data warehouse construction
– Data cleaning, transformation, integration, and multidimensional data model
• Data cube technology
– Scalable methods for computing (i.e., materializing) multidimensional aggregates
– OLAP (online analytical processing)• Multidimensional concept description: Characterization and
discrimination
– Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet region
Concept Description
• Characterization: provides a concise and succinct summarization of the given collection of data
• Discrimination: provides descriptions comparing two or more collections of data
• can handle complex data types of the attributes
• a more automated process
Name Gender Major Birth-Place Birth_date Residence Phone # GPA
Jim Woodman
M CS Vancouver,BC,Canada
8-12-76 3511 Main St., Richmond
687-4598 3.67
Scott Lachance
M CS Montreal, Que, Canada
28-7-75 345 1st Ave., Richmond
253-9106 3.70
Laura Lee …
F …
Physics …
Seattle, WA, USA …
25-8-70 …
125 Austin Ave., Burnaby …
420-5232 …
3.83 …
Removed Retained Sci,Eng,Bus
Country Age range City Removed Excl, VG,..
Gender Major Birth_region Age_range Residence GPA Count M Science Canada 20-25 Richmond Very-good 16 F Science Foreign 25-30 Burnaby Excellent 22 … … … … … … …
Birth_Region
GenderCanada Foreign Total
M 16 14 30 F 10 22 32
Total 26 36 62
Generalized Relation
Initial Relation
Concept description: Characterization
Data Mining Function: (2) Association and Correlation Analysis• Frequent patterns (or frequent itemsets)
– What items are frequently purchased together in your Walmart?
• Association, correlation vs. causality
– A typical association rule– Are strongly associated items also strongly
correlated?• How to mine such patterns and rules efficiently in large
datasets?
Association rule• Association (correlation and causality)
– age(X, “20..29”) ^ income(X, “20..29K”) buys(X, “PC”) [support = 2%, confidence = 60%]
• Association rule mining– Finding frequent patterns, associations, correlations
among sets of items or objects in transaction databases, relational databases, and other information repositories
– Frequent pattern: pattern (set of items, sequence, etc.) that occurs frequently in a database
• Motivation: finding regularities in data– What products were often purchased together?
Example: Association rule
• Itemset A1,A2={a1, …, ak}• Find all the rules A1A2 with min
confidence and support– support, s, probability that a
transaction contains A1A2– confidence, c, conditional
probability that a transaction having A1 also contains A2.Let min_support = 50%,
min_conf = 50%:a1 a3 (50%, 66.7%)a3 a1 (50%, 100%)
Transaction-id Items bought10 a1,a2, a3
20 a1, a3
30 a1, a4
40 a2, a5, a6
Data Mining Function: (3) Classification
• Classification and label prediction – Describe and distinguish classes or concepts
for future prediction• E.g., classify countries based on (climate),
or classify cars based on (mileage)– Predict some unknown class labels
Data Mining Function: (3) Classification• Typical methods
– Decision trees, naïve Bayesian classification, support vector machines, neural networks, rule-based classification, pattern – based classification, logistic regression, …
• Typical applications:– Credit card fraud detection, direct marketing,
classifying stars, diseases, web-pages, …
Classification (1): Model Construction
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = ‘professor’OR years > 6THEN tenured = ‘yes’
Classifier(Model)
TrainingData
NAME RANK YEARS TENUREDMike Assistant Prof 3 noMary Assistant Prof 7 yesBill Professor 2 yesJim Associate Prof 7 yesDave Assistant Prof 6 noAnne Associate Prof 3 no
ClassificationAlgorithms
IF rank = ‘professor’OR years > 6THEN tenured = ‘yes’
Classifier(Model)
Classification (2): Prediction Using the Model
Classifier
TestingData
NAME RANK YEARS TENUREDTom Assistant Prof 2 noMerlisa Associate Prof 7 noGeorge Professor 5 yesJoseph Assistant Prof 7 yes
Unseen Data
(Jeff, Professor, 4)
Tenured?
Classification Techniques• Decision Tree Induction• Bayesian Classification• Neural Networks• Genetic Algorithms• Fuzzy Set and Logic
Data Mining Function: (4) Cluster Analysis
• Unsupervised learning (i.e., Class label is unknown)
• Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns
• Principle: Maximizing intra-class similarity & minimizing interclass similarity
• Many methods and applications
Clustering• Cluster: a collection of data objects
– Similar to one another within the same cluster– Dissimilar to the objects in other clusters
• Clustering– Grouping a set of data objects into clusters based on the
principle: maximizing the intra-class similarity and minimizing the interclass similarity
• Example– Land use: Identification of areas of similar land use in
an earth observation database– City-planning: Identifying groups of houses according
to their house type, value, and geographical location
Data Mining Function: (5) Outlier Analysis
• Outlier analysis– Outlier: A data object that does not comply
with the general behavior of the data– Noise or exception? ― One person’s garbage
could be another person’s treasure– Methods: by product of clustering or
regression analysis, …– Useful in fraud detection, rare events analysis
Time and Ordering: Sequential Pattern, Trend and Evolution Analysis
• Sequence, trend and evolution analysis– Trend, time-series, and deviation analysis: e.g.,
regression and value prediction– Sequential pattern mining
• e.g., first buy digital camera, then buy large SD memory cards
– Periodicity analysis– Biological sequence analysis
• Mining data streams– Ordered, time-varying, potentially infinite, data
streams
Sequential Pattern Analysis• Given a set of sequences, find the complete set of
frequent subsequences
• Applications of sequential pattern– Customer shopping sequences:
• First buy computer, then CD-ROM, and then digital camera, within 3 months.
– Weblog click streams– Telephone calling patterns
SID sequence10 <a(abc)(ac)d(cf)>20 <(ad)c(bc)(ae)>30 <(ef)(ab)(df)cb>40 <eg(af)cbc>
Given support threshold min_sup =2, <(ab)c> is a sequential pattern
Structure and Network Analysis
• Graph mining– Finding frequent subgraphs (e.g., chemical
compounds), trees (XML), substructures (web fragments)
• Information network analysis– Social networks: actors (objects, nodes) and
relationships (edges)• e.g., author networks in CS, terrorist
networks
Structure and Network Analysis– Multiple heterogeneous networks
• A person could be multiple information networks: friends, family, classmates ..
• Web mining– Web is a big information network: Google– Analysis of Web information networks
• Web community discovery, opinion mining, usage mining, …
Evaluation of Knowledge
• Are all mined knowledge interesting?
– One can mine tremendous amount of “patterns” and knowledge
– Some may fit only certain dimension space (time, location, …)
– Some may not be representative, may be transient, …
Evaluation of Knowledge
• Evaluation of mined knowledge → directly mine only interesting knowledge?
– Descriptive vs. predictive
– Coverage
– Typicality vs. novelty
– Accuracy
– Timeliness
Regression• Regression is similar to classification
– First, construct a model– Second, use model to predict unknown
value• Methods
– Linear and multiple regression– Non-linear regression
• Regression is different from classification– Classification refers to predict
categorical class label– Regression models continuous-valued
functions
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Data Mining: Confluence of Multiple Disciplines
Data Mining
MachineLearning
Statistics
Applications
Algorithm
PatternRecognition
High-PerformanceComputing
Visualization
Database Technology
Why Confluence of Multiple Disciplines?
• Tremendous amount of data– Algorithms must be highly scalable to handle such as tera-bytes of data
• High-dimensionality of data – Micro-array may have tens of thousands of dimensions
• High complexity of data– Data streams and sensor data– Time-series data, temporal data, sequence data – Structure data, graphs, social networks and multi-linked data– Heterogeneous databases and legacy databases– Spatial, spatiotemporal, multimedia, text and Web data– Software programs, scientific simulations
• New and sophisticated applications
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Applications of Data Mining
• Web page analysis: from web page classification, clustering algorithms
• Collaborative analysis & recommender systems
• Basket data analysis to targeted marketing
• Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis
Applications of Data Mining
• Data mining and software engineering (e.g., IEEE Computer, Aug. 2009 issue)
• From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Major Issues in Data Mining (1)
• Mining Methodology
– Mining various and new kinds of knowledge
– Mining knowledge in multi-dimensional space
– Data mining: An interdisciplinary effort
– Boosting the power of discovery in a networked environment
Major Issues in Data Mining (1)– Handling noise, uncertainty, and incompleteness of
data
– Pattern evaluation and pattern- or constraint-guided mining
• User Interaction
– Interactive mining
– Incorporation of background knowledge
– Presentation and visualization of data mining results
Major Issues in Data Mining (2)
• Efficiency and Scalability
– Efficiency and scalability of data mining algorithms
– Parallel, distributed, stream, and incremental mining methods
Major Issues in Data Mining (2)
• Diversity of data types
– Handling complex types of data
– Mining dynamic, networked, and global data repositories
• Data mining and society
– Social impacts of data mining
– Privacy-preserving data mining
– Invisible data mining
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
A Brief History of Data Mining Society
• 1989 IJCAI Workshop on Knowledge Discovery in Databases
– Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
• 1991-1994 Workshops on Knowledge Discovery in Databases
– Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
• 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
– Journal of Data Mining and Knowledge Discovery (1997)
• ACM SIGKDD conferences since 1998 and SIGKDD Explorations
• More conferences on data mining
– PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), WSDM (2008), etc.
• ACM Transactions on KDD (2007)
Conferences and Journals on Data Mining
• KDD Conferences– ACM SIGKDD Int. Conf. on
Knowledge Discovery in Databases and Data Mining (KDD)
– SIAM Data Mining Conf. (SDM)– (IEEE) Int. Conf. on Data Mining
(ICDM)– European Conf. on Machine Learning
and Principles and practices of Knowledge Discovery and Data Mining (ECML-PKDD)
– Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
– Int. Conf. on Web Search and Data Mining (WSDM)
Other related conferences DB conferences: ACM SIGMOD,
VLDB, ICDE, EDBT, ICDT, … Web and IR conferences: WWW,
SIGIR, WSDM ML conferences: ICML, NIPS
PR conferences: CVPR, Journals
Data Mining and Knowledge Discovery (DAMI or DMKD)
IEEE Trans. On Knowledge and Data Eng. (TKDE)
KDD Explorations ACM Trans. on KDD
Where to Find References? DBLP, Google
• Data mining and KDD (SIGKDD: CDROM)– Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.– Journal: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
• Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)– Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA– Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
• AI & Machine Learning– Conferences: Machine learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.– Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-
PAMI, etc.
• Web and IR – Conferences: SIGIR, WWW, CIKM, etc.– Journals: WWW: Internet and Web Information Systems,
• Statistics– Conferences: Joint Stat. Meeting, etc.– Journals: Annals of statistics, etc.
• Visualization– Conference proceedings: CHI, ACM-SIGGraph, etc.– Journals: IEEE Trans. visualization and computer graphics, etc.
Chapter 1. Introduction• Why Data Mining?
• What Is Data Mining?
• A Multi-Dimensional View of Data Mining
• What Kind of Data Can Be Mined?
• What Kinds of Patterns Can Be Mined?
• What Technology Are Used?
• What Kind of Applications Are Targeted?
• Major Issues in Data Mining
• A Brief History of Data Mining and Data Mining Society
• Summary
Summary
• Data mining: Discovering interesting patterns and knowledge from massive amount of data
• A natural evolution of science and information technology, in great demand, with wide applications
• A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
• Mining can be performed in a variety of data
• Data mining functionalities: characterization, discrimination, association, classification, clustering, trend and outlier analysis, etc.
• Data mining technologies and applications
• Major issues in data mining
Data Mining in Association Rule
And now discussion