Data Mining-Current Status and Research Directions

Transcript of Data Mining-Current Status and Research Directions

Page 1: Data Mining-Current Status and Research Directions

April 12, 2023 Data Mining: Status and Directions 1

Data Mining: Current Status and Research Directions

Jiawei Han

Intelligent Database Systems Research Lab

School of Computing Science

Simon Fraser University, Canada

http://www.cs.sfu.ca/~han

Page 2: Data Mining-Current Status and Research Directions

Outline

Why is data mining hot?

Current status: major technical progress

Is data mining flying high, or not?

How to fly data mining high? Research directions on data mining

Page 3: Data Mining-Current Status and Research Directions

Why Is Data Mining Hot?

Data mining (knowledge discovery in databases): extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information (knowledge) or patterns from data in large databases or other information repositories

Necessity is the mother of invention

Data is everywhere, and data mining should be everywhere, too!

Understanding and using data is an imminent task!

Page 4: Data Mining-Current Status and Research Directions

Data, Data, Everywhere!!

Relational databases: a commodity of every enterprise

Huge data warehouses are under construction

POS (point of sale): transactional DBs in terabytes

Object-relational, distributed, heterogeneous, and legacy databases

Spatial databases (GIS), remote sensing databases (EOS), and scientific/engineering databases

Time-series data (e.g., stock trading) and temporal data

Text (documents, emails) and multimedia databases

WWW: a huge, hyper-linked, dynamic, global information system

Page 5: Data Mining-Current Status and Research Directions

Data Mining Is Everywhere, too!—A Multi-Dimensional View of Data Mining

Databases to be mined: relational, transactional, object-relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW, etc.

Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend, deviation and outlier analysis, etc.

Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, neural networks, etc.

Applications adapted: retail, telecommunication, banking, fraud analysis, DNA mining, stock market analysis, Web mining, Weblog analysis, etc.

Page 6: Data Mining-Current Status and Research Directions

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining at the confluence of database technology, statistics, machine learning (AI), visualization, information science, and other disciplines]

Page 7: Data Mining-Current Status and Research Directions

Data Mining—One Can Trace Back to Early Civilization

Most scientific discoveries involve “data mining”: Kepler’s laws, Newton’s laws, the periodic table of chemical elements, …, from the “big bang” to DNA

Statistics: a discipline dedicated to data analysis

Then why data mining? What are the differences?

Huge amounts of data: gigabytes to terabytes

Fast computers: quick response, interactive analysis

Multi-dimensional, powerful, thorough analysis

High-level, “declarative”: user’s ease and control

Automated or semi-automated: mining functions hidden or built into many systems

Page 8: Data Mining-Current Status and Research Directions

A Brief History of Data Mining Activities

1989: IJCAI Workshop on Knowledge Discovery in Databases

Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)

1991-1994: Workshops on Knowledge Discovery in Databases

Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)

1995-1998: International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)

Journal of Data Mining and Knowledge Discovery (1997)

1998: ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations

More conferences on data mining: PAKDD, PKDD, SIAM-Data Mining, (IEEE) ICDM, DaWaK, SPIE-DM, etc.

Page 9: Data Mining-Current Status and Research Directions

Research Progress in the Last Decade

Multi-dimensional data analysis: Data warehouse and OLAP (on-line analytical processing)

Association, correlation, and causality analysis

Classification: scalability and new approaches

Clustering and outlier analysis

Sequential patterns and time-series analysis

Similarity analysis: curves, trends, images, texts, etc.

Text mining, Web mining, and Weblog analysis

Spatial, multimedia, scientific data analysis

Data preprocessing and database compression

Data visualization and visual data mining

Many others, e.g., collaborative filtering

Page 10: Data Mining-Current Status and Research Directions

Multi-Dimensional Data Analysis

Data warehousing: integration from heterogeneous or semi-structured databases

Multi-dimensional modeling of data: star & snowflake schemas

Efficient and scalable computation of data cubes or iceberg cubes

OLAP (on-line analytical processing): drilling, dicing, slicing, etc.

Discovery-driven exploration of data cubes

From OLAP to OLAM: a multi-dimensional view for on-line analytical mining
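The OLAP operations listed above (rolling up, slicing) can be illustrated on a tiny fact table. This is a minimal sketch in plain Python, not a data warehouse implementation; the `FACTS` rows, the dimension names, and the helper names `roll_up` and `slice_` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical fact table: (region, product, quarter, sales).
FACTS = [
    ("West", "TV", "Q1", 100),
    ("West", "TV", "Q2", 150),
    ("West", "PC", "Q1", 200),
    ("East", "TV", "Q1", 80),
    ("East", "PC", "Q2", 120),
]
DIMS = ("region", "product", "quarter")


def roll_up(facts, keep):
    """Sum sales over every dimension that is not listed in `keep`."""
    idx = [DIMS.index(d) for d in keep]
    cube = defaultdict(int)
    for row in facts:
        cube[tuple(row[i] for i in idx)] += row[-1]
    return dict(cube)


def slice_(facts, dim, value):
    """Keep only the facts whose dimension `dim` equals `value`."""
    i = DIMS.index(dim)
    return [row for row in facts if row[i] == value]


# Roll up to the region level, then slice on Q1 and roll up to product.
by_region = roll_up(FACTS, ("region",))
q1_by_product = roll_up(slice_(FACTS, "quarter", "Q1"), ("product",))
```

A real system would precompute or index such aggregates (data cubes, iceberg cubes) rather than rescan the fact table on every query.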

Page 11: Data Mining-Current Status and Research Directions

Association and Frequent Pattern Analysis

Efficient mining of frequent patterns and association rules: Apriori and FP-growth algorithms

Multi-level, multi-dimensional, quantitative association mining

From association to correlation, sequential patterns, partial periodicity, cyclic rules, ratio rules, etc.

Query and constraint-based association analysis
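The level-wise idea behind Apriori, named above, can be sketched in a few lines: candidates of size k are built only from frequent itemsets of size k-1, and any candidate with an infrequent subset is pruned before counting. This is an illustrative, unoptimized sketch; the `baskets` data and the absolute `min_support` count are assumptions made for the example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support_count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)
    k = 2
    while frequent:
        prev = set(frequent)
        # Join step: unions of frequent (k-1)-itemsets that have exactly k items.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in prev for s in combinations(c, k - 1))
        }
        # Count step: one scan of the transactions per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: c for s, c in counts.items() if c >= min_support}
        result.update(frequent)
        k += 1
    return result


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer", "eggs"},
    {"milk", "diapers", "beer", "cola"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "cola"},
]
frequent_itemsets = apriori(baskets, min_support=3)
```

FP-growth reaches the same frequent itemsets without candidate generation by compressing the transactions into a prefix tree; it is not shown here.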

Page 12: Data Mining-Current Status and Research Directions

Classification: Scalable Methods and Handling of Complex Types of Data

Classification has been an essential theme in machine learning and statistics research

Decision trees, Bayesian classification, neural networks, k-nearest neighbors, etc.

Tree pruning, boosting, and bagging techniques

Efficient and scalable classification methods: exploration of attribute-class pairs; SLIQ, SPRINT, RainForest, BOAT, etc.

Classification of semi-structured and non-structured data: classification by clustering association rules (ARCS); association-based classification; Web document classification
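Of the classifiers listed above, k-nearest neighbors is the simplest to show end to end. The sketch below assumes 2-D numeric points and a majority vote; the training data, the labels, and the function name `knn_predict` are hypothetical.

```python
from collections import Counter


def knn_predict(train, query, k=3):
    """Majority label among the k training points closest to `query`.

    `train` is a list of ((x, y), label) pairs.
    """
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical labeled training points.
train = [
    ((1, 1), "low"), ((1, 2), "low"), ((2, 1), "low"),
    ((8, 8), "high"), ((9, 8), "high"), ((8, 9), "high"),
]
label_near_origin = knn_predict(train, (2, 2))
label_far = knn_predict(train, (7, 8))
```

This scans all training points per query, which is exactly the scalability problem that the disk-aware methods on this slide (SLIQ, SPRINT, RainForest, BOAT) address for decision trees.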

Page 13: Data Mining-Current Status and Research Directions

Clustering and Outlier Analysis

Partitioning methods: k-means, k-medoids, CLARANS

Hierarchical methods (micro-clusters): BIRCH, CURE, Chameleon

Density-based methods: DBSCAN, OPTICS, DENCLUE

Grid-based methods: STING, CLIQUE, WaveCluster

Outlier analysis: statistics-based, distance-based, deviation-based

Constraint-based clustering: COD (Clustering with Obstructed Distance); user-specified constraints
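As an example of the partitioning methods above, here is a minimal sketch of k-means (Lloyd's algorithm) on 2-D points. To keep the run deterministic, the caller supplies the initial centers; the sample points and starting centers are assumptions for illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two 2-D points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, centers, iterations=10):
    """Alternate assignment and update steps for a fixed number of rounds."""
    centers = list(centers)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        labels = [
            min(range(len(centers)), key=lambda j: squared_distance(p, centers[j]))
            for p in points
        ]
        # Update step: each center moves to the mean of its members.
        for j in range(len(centers)):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centers[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return centers, labels


# Hypothetical data: two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, labels = kmeans(points, centers=[(0, 0), (10, 10)])
```

k-medoids follows the same loop but restricts each center to an actual data point, which makes it less sensitive to outliers.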

Page 14: Data Mining-Current Status and Research Directions

Sequential Patterns and Time-Series Analysis

Trend analysis: trend movement vs. cyclic variations, seasonal variations, and random fluctuations

Similarity search in time-series databases: handling gaps, scaling, etc.; indexing methods and query languages for time series

Sequential pattern mining: various kinds of sequences, various methods; from GSP to PrefixSpan

Periodicity analysis: full periodicity, partial periodicity, cyclic association rules
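One common way to separate a trend movement from seasonal variation, as described under trend analysis above, is a moving average whose window equals the seasonal period. The sketch below assumes a hypothetical series built from a linear trend plus a repeating three-step seasonal pattern.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical series: trend 10, 11, 12, ... plus seasonal offsets 0, +3, -3.
series = [10, 14, 9, 13, 17, 12, 16, 20, 15]
trend = moving_average(series, window=3)
```

Because the seasonal offsets sum to zero over one period, the smoothed values rise by exactly one per step, which exposes the underlying trend.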

Page 15: Data Mining-Current Status and Research Directions

Similarity Search: Similar Curves, Trends, Images, and Texts

Various kinds of data, various similarity mining methods

Discovery of similar trends in time-series data: data transformation and high-dimensional structures

Finding similar images based on color, texture, etc.: content-based vs. keyword-based retrieval; color histogram-based signature; multi-feature composed signature

Finding documents with similar texts: similar keywords (synonymy and polysemy); term frequency matrix; latent semantic indexing
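The term frequency representation mentioned above can be made concrete with a short sketch that compares documents by the cosine of the angle between their term-frequency vectors. Whitespace tokenization and the three sample documents are simplifying assumptions; latent semantic indexing, which also handles synonymy, is not shown.

```python
import math
from collections import Counter


def term_frequencies(text):
    """Map each lowercase whitespace-separated term to its count."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents.
d1 = term_frequencies("data mining finds patterns in data")
d2 = term_frequencies("mining patterns in large data")
d3 = term_frequencies("the weather is sunny today")

sim_related = cosine_similarity(d1, d2)
sim_unrelated = cosine_similarity(d1, d3)
```

The same function works for color-histogram signatures of images if each histogram bin is treated as a "term".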

Page 16: Data Mining-Current Status and Research Directions

Spatial, Multimedia, Scientific Data Analysis

Multi-dimensional analysis of spatial, multimedia, and scientific data: geo-spatial data cube and spatial OLAP; the curse of dimensionality problem

Association analysis: a progressive refinement methodology; micro-clustering can be used for preprocessing in the analysis of complex types of data

Classification: association-based methods for handling high-dimensional and sparse data

Page 17: Data Mining-Current Status and Research Directions

Data Mining Industry and Applications

From research prototypes to data mining products, languages, and standards

IBM Intelligent Miner, SAS Enterprise Miner, SGI MineSet, Clementine, MS/SQLServer 2000, DBMiner, BlueMartini, MineIt, DigiMine, etc.

A few data mining languages and standards (esp. MS OLEDB for Data Mining)

Application achievements in many domains: market analysis, trend analysis, fraud detection, outlier analysis, Web mining, etc.

Page 18: Data Mining-Current Status and Research Directions

Is Data Mining Flying? Or Not??

Data mining is flying: R&D has been striding forward greatly; applications have been broadened substantially

But not as high as some may have hoped. Why not?

Hope to see billions of dollars within years? It is a young, up-and-coming technology, not hype!

Not bread-and-butter but a value-added service: DBMS, WWW, and other information systems will still be the “data mining” aircraft carrier

Not off-the-shelf in nature: needs training, understanding, and customizing (re-development)

Young technology: needs much R&D to fly high

Much research, development, and real problem solving!

Page 19: Data Mining-Current Status and Research Directions

How to Fly Data Mining High?—Research Directions

Web mining

Towards integrated data mining environments and tools

“Vertical” (or application-specific) data mining

Invisible data mining

Towards intelligent, efficient, and scalable data mining methods

Page 20: Data Mining-Current Status and Research Directions

Web Mining: A Fast Expanding Frontier in Data Mining

Mine what Web search engine finds

Automatic classification of Web documents

Discovery of authoritative Web pages, Web structures, and Web communities

Meta-Web warehousing: Web yellow page service

Web usage mining

Page 21: Data Mining-Current Status and Research Directions

Mine What Web Search Engine Finds

Current Web search engines: a convenient source for mining

Keyword-based; return too many, often low-quality answers; still miss a lot; not customized; etc.

Data mining will help:

Coverage: “enlarge and then shrink,” using synonyms and conceptual hierarchies

Better search primitives: user preferences/hints

Linkage analysis: authoritative pages and clusters

Web-based languages: XML + WebSQL + WebML

Customization: home page + Weblog + user profiles

Page 22: Data Mining-Current Status and Research Directions

Discovery of Authoritative Pages in WWW

Page-rank method (Brin and Page, 1998): rank the "importance" of Web pages, based on a model of a "random browser."

Hub/authority method (Kleinberg, 1998):

Prominent authorities often do not endorse one another directly on the Web.

Hub pages have a large number of links to many relevant authorities.

Thus hubs and authorities exhibit a mutually reinforcing relationship.

Both the page-rank and hub/authority methodologies have been shown to provide qualitatively good search results for broad query topics on the WWW.
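Both methods can be sketched as simple iterations over a link graph. The code below is a minimal illustration, not the published algorithms in full: the damping factor of 0.85, the even redistribution of rank from pages without outgoing links, the fixed iteration count, and the five-page `links` graph are all assumptions chosen for the example.

```python
import math


def all_pages(links):
    """Every page that appears as a source or a target of a link."""
    return sorted(set(links) | {q for outs in links.values() for q in outs})


def pagerank(links, damping=0.85, iterations=50):
    """Power iteration for the 'random browser' model."""
    pages = all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for p in pages:
            outs = links.get(p, [])
            if outs:
                share = damping * rank[p] / len(outs)
                for q in outs:
                    new[q] += share
            else:
                # Page without outgoing links: spread its rank over all pages.
                for q in pages:
                    new[q] += damping * rank[p] / n
        rank = new
    return rank


def hits(links, iterations=50):
    """Mutually reinforcing hub and authority scores, normalized each round."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A page's authority is the sum of the hub scores pointing at it.
        auth = {p: sum(hub[q] for q in pages if p in links.get(q, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        # A page's hub score is the sum of the authorities it points at.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return hub, auth


# Hypothetical graph: two hub pages pointing at three authority pages.
links = {"h1": ["a1", "a2"], "h2": ["a1", "a2", "a3"], "a1": [], "a2": [], "a3": []}
rank = pagerank(links)
hub, auth = hits(links)
```

In this graph the authorities never link to each other, yet they score highly because the hubs endorse them, which is the mutual reinforcement described above.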

Page 23: Data Mining-Current Status and Research Directions

Automatic Classification of Web Documents

Web document classification:

Good human classification: Yahoo!, CS term hierarchies

These classifications can be used as training sets to build up a learning model

Keyword-based classification is different from multi-dimensional classification

Association or clustering-based classification is often more effective

Multi-level classification is important

Page 24: Data Mining-Current Status and Research Directions

A Multiple Layered Meta-Web Architecture

[Figure: a multiple layered meta-Web architecture with layers Layer 0, Layer 1, ..., Layer n, labeled "Generalized Descriptions" and "More Generalized Descriptions"]

Page 25: Data Mining-Current Status and Research Directions

Web Yellow Page Service: A Multi-Layer, Meta-Web Approach

XML: facilitates structured and meta-information extraction

Automatic classification of Web documents: based on Yahoo!, etc. as training set + keyword-based correlation/classification analysis (IR/AI assistance)

Automatic ranking of important Web pages: authoritative site recognition and clustering of Web pages

Generalization-based multi-layer meta-Web construction: with the assistance of clustering and classification analysis

Meta-Web can be warehoused and incrementally updated

Querying and mining can be performed on or assisted by the meta-Web

Page 26: Data Mining-Current Status and Research Directions

Importance of Constructing Multi-Layer Meta Web

Benefits of a multi-layer meta-Web:

Multi-dimensional Web info summary analysis

Approximate and intelligent query answering

Web high-level query answering (WebSQL, WebML)

Web content and structure mining

Observing the dynamics/evolution of the Web

Is it realistic to construct such a meta-Web?

It is beneficial even if it is only partially constructed

The benefit may justify the cost of tool development, standardization, and partial restructuring

Page 27: Data Mining-Current Status and Research Directions

Web Usage (Click-Stream) Mining

Weblog provides rich information about Web dynamics

Multidimensional Weblog analysis: disclose potential customers, users, markets, etc.

Plan mining (mining general Web accessing regularities): Web linkage adjustment, performance improvements

Web accessing association/sequential pattern analysis: Web caching, prefetching, swapping

Trend analysis: dynamics of the Web (what has been changing?)

Customized to individual users
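The link between sequential access patterns and prefetching can be shown with a small sketch: count which page most often follows each page in the logged sessions, and treat that successor as the prefetch candidate. The session data and the function names are hypothetical, and a first-order successor count is a deliberate simplification of sequential pattern mining.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count page -> next-page transitions across all click-stream sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current, following in zip(session, session[1:]):
            counts[current][following] += 1
    return counts


def prefetch_candidate(counts, page):
    """Most frequent successor of `page`, or None if it was never followed."""
    if page not in counts or not counts[page]:
        return None
    return counts[page].most_common(1)[0][0]


# Hypothetical sessions extracted from a Weblog.
sessions = [
    ["home", "products", "cart"],
    ["home", "products", "reviews"],
    ["home", "about"],
    ["products", "cart", "checkout"],
]
counts = transition_counts(sessions)
next_after_home = prefetch_candidate(counts, "home")
```

The same transition table, aggregated per user instead of globally, is one simple route to the per-user customization mentioned above.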

Page 28: Data Mining-Current Status and Research Directions

Towards Integrated Data Mining Environments and Tools

OLAP Mining: Integration of Data Warehousing and Data Mining

Querying and Mining: An Integrated Information Analysis Environment

Basic Mining Operations and Mining Query Optimization

“Vertical” (or application-specific) data mining

Invisible data mining

Page 29: Data Mining-Current Status and Research Directions

OLAP Mining: An Integration of Data Mining and Data Warehousing

Coupling of data mining systems with DBMS and data warehouse systems: no coupling, loose coupling, semi-tight coupling, tight coupling

On-line analytical mining of data: integration of mining and OLAP technologies

Interactive mining of multi-level knowledge: necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.

Integration of multiple mining functions: characterized classification, first clustering and then association

Page 30: Data Mining-Current Status and Research Directions

An OLAM Architecture

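[Figure: an OLAM architecture in four layers. Layer 1, Data Repository: databases and data warehouse, populated through data cleaning, data integration, and filtering. Layer 2, MDDB: multi-dimensional database with meta data, reached through the database API. Layer 3, OLAP/OLAM: OLAP engine and OLAM engine on top of the data cube API. Layer 4, User Interface: user GUI API, which accepts mining queries and returns mining results.]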

Page 31: Data Mining-Current Status and Research Directions

Querying and Mining: An Integrated Information Analysis Environment

Data mining as a component of a DBMS, data warehouse, or Web information system

Integrated information processing environment:

MS/SQLServer-2000 (Analysis service)

IBM IntelligentMiner on DB2

SAS EnterpriseMiner: data warehousing + mining

Query-based mining:

Querying database/DW/Web knowledge

Efficiency and flexibility: preprocessing, on-line processing, optimization, integration, etc.

Page 32: Data Mining-Current Status and Research Directions

Basic Mining Operations and Mining Query Optimization

Relational databases: there is a set of basic relational operations and a standard query language, SQL

E.g., selection, projection, join, set difference, intersection, Cartesian product, etc.

Is there a set of standard data mining operations on which optimizations can be done?

Difficulty: different definitions of operations

Importance: optimization can be performed on them systematically; standardization facilitates information exchange and system interoperability

Page 33: Data Mining-Current Status and Research Directions

“Vertical” Data Mining

Generic data mining tools? —Too simple to match domain-specific, sophisticated applications

Expert knowledge and business logic represent many years of work in their own fields!

Data mining + business logic + domain experts

A multi-dimensional view of data miners:

Complexity of data: Web, sequence, spatial, multimedia, …

Complexity of domains: DNA, astronomy, market, telecom, …

Domain-specific data mining tools:

Provide concrete, killer solutions to specific problems

Feedback to build more powerful tools

Page 34: Data Mining-Current Status and Research Directions

Invisible Data Mining

Build mining functions into daily information services

Web search engines (link analysis, authoritative pages, user profiles), adaptive Web sites, etc.

Improvement of query processing: history + data

Making services smart and efficient

Benefits from/to data mining research

Data mining research has produced many scalable, efficient, novel mining solutions

Applications feed new challenge problems to research

Page 35: Data Mining-Current Status and Research Directions

Towards Intelligent Tools for Data Mining

Integration paves the way to intelligent mining

Smart interface brings intelligence: easy to use, understand, and manipulate

One picture may be worth 1,000 words: visual and audio data mining

Human-centered data mining

Towards self-tuning, self-managing, self-triggering data mining

Page 36: Data Mining-Current Status and Research Directions

Integrated Mining: A Booster for Intelligent Mining

Integration paves the way to intelligent mining

Data mining integrates with DBMS, DW, WebDB, etc.

Integration inherits the power of up-to-date information technology: querying, MD analysis, similarity search, etc.

Mining can be viewed as querying database knowledge

Integration leads to standard interface/language, function/process standardization, utility, and reachability

Efficiency and scalability bring intelligent mining to reality

Page 37: Data Mining-Current Status and Research Directions

One Picture May Be Worth 1000 Words!

Visual data mining:

Visualization of data

Visualization of data mining results

Visualization of data mining processes

Interactive data mining: visual classification

One melody may be worth 1000 words too!

Audio data mining: turn data into music and melody!

Uses audio signals to indicate the patterns of data or the features of data mining results

Page 38: Data Mining-Current Status and Research Directions

Visualization of data mining results in SAS Enterprise Miner: scatter plots

Page 39: Data Mining-Current Status and Research Directions

Visualization of association rules in MineSet 3.0

Page 40: Data Mining-Current Status and Research Directions

Visualization of a decision tree in MineSet 3.0

Page 41: Data Mining-Current Status and Research Directions

Visualization of Data Mining Processes by Clementine

Page 42: Data Mining-Current Status and Research Directions

Interactive Visual Mining by Perception-Based Classification (PBC)

Page 43: Data Mining-Current Status and Research Directions

Human-Centered Data Mining

Finding all the patterns autonomously in a database? Unrealistic, because the patterns could be too many yet uninteresting

Data mining should be an interactive process: the user directs what is to be mined

Users must be provided with a set of primitives for communicating with the data mining system, using a data mining query language

Users should provide constraints on what is to be mined

The system should use such constraints to guide the mining process (constraint-based mining or mining query optimization)

Page 44: Data Mining-Current Status and Research Directions

Constraint-Based Mining

What kinds of constraints can be used in mining?

Knowledge type constraint: classification, association, etc.

Data constraint: SQL-like queries, e.g., find products sold together in Vancouver in Feb. ’01.

Dimension/level constraints: in relevance to region, price, brand, customer category.

Rule constraints: e.g., small sales (price < $10) trigger big sales (sum > $200).

Interestingness constraints: e.g., strong rules (min_support ≥ 3%, min_confidence ≥ 60%, min_lift > 3.0).
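The interestingness thresholds above can be checked directly once support, confidence, and lift are computed for a rule. The sketch below uses the standard definitions of the three measures; the transactions, the item names, and the choice of `>=` for support and confidence and `>` for lift are assumptions for illustration.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    a, c = set(antecedent), set(consequent)
    n_a = sum(1 for t in transactions if a <= t)
    n_c = sum(1 for t in transactions if c <= t)
    n_ac = sum(1 for t in transactions if (a | c) <= t)
    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


def is_strong(metrics, min_support=0.03, min_confidence=0.60, min_lift=3.0):
    """Apply the interestingness thresholds from the slide."""
    support, confidence, lift = metrics
    return support >= min_support and confidence >= min_confidence and lift > min_lift


# Hypothetical transactions: a rare but tightly linked pair, and a common pair.
transactions = [{"caviar", "vodka"}] * 2 + [{"bread", "milk"}] * 8
rare_rule = rule_metrics(transactions, {"caviar"}, {"vodka"})
common_rule = rule_metrics(transactions, {"bread"}, {"milk"})
```

Both rules have 100% confidence, but only the rare pair has a lift above 3.0, which shows why lift is useful alongside support and confidence.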

Page 45: Data Mining-Current Status and Research Directions

Rule Constraints: A Classification

Succinctness

Anti-monotonicity

Monotonicity

Convertible constraints

Inconvertible constraints
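Anti-monotonicity is the property that makes early pruning safe: once an itemset violates the constraint, every superset violates it too. A total-price budget over non-negative prices is a standard example. The sketch below enumerates itemsets level by level and extends only those that still satisfy the constraint; the `PRICES` table and the budget are hypothetical.

```python
# Hypothetical item prices (all non-negative, so sum(price) <= budget
# is an anti-monotone constraint).
PRICES = {"pen": 2, "notebook": 5, "lamp": 30, "desk": 150}


def satisfies(itemset, budget):
    """True if the total price of the itemset is within the budget."""
    return sum(PRICES[i] for i in itemset) <= budget


def itemsets_within_budget(items, budget):
    """Level-wise enumeration that never extends a violating itemset."""
    level = [frozenset([i]) for i in items if satisfies([i], budget)]
    result = []
    while level:
        result.extend(level)
        next_level = set()
        for s in level:
            for i in items:
                if i not in s:
                    candidate = s | {i}
                    if satisfies(candidate, budget):
                        next_level.add(candidate)
        level = list(next_level)
    return result


found = itemsets_within_budget(list(PRICES), budget=40)
```

A monotone constraint such as sum(price) >= 200 behaves the other way around: once satisfied it stays satisfied for all supersets, so it cannot prune candidates this way but never needs to be rechecked.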

Page 46: Data Mining-Current Status and Research Directions

Constraint-Based Clustering Analysis

User-specified constraints: e.g., no cluster has fewer than 1000 gold customers

Resource allocation (clustering) with obstacles

Page 47: Data Mining-Current Status and Research Directions

Towards Automated Data Mining?

It is not realistic to automatically find all the knowledge in a large database

Thus we promote human-centered, constraint-based mining

However, to achieve genuinely intelligent data mining, the data mining process should be self-tuning, self-managing, and self-triggering

Functions should be developed to achieve such performance

Page 48: Data Mining-Current Status and Research Directions

Conclusions

Data mining: a promising research frontier

Data mining research has been striding forward greatly in the last decade

However, data mining, as an industry, has not been flying as high as expected

Much research and application exploration are needed:

Web mining

Towards integrated data mining environments and tools

Towards intelligent, efficient, and scalable data mining methods

Page 49: Data Mining-Current Status and Research Directions

http://www.cs.sfu.ca/~han http://db.cs.sfu.ca

Thank you!!!

Page 50: Data Mining-Current Status and Research Directions

References

J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2001.

J. Han, L. V. S. Lakshmanan, and R. T. Ng, "Constraint-Based, Multidimensional Data Mining", COMPUTER (special issues on Data Mining), 32(8): 46-50, 1999.