01 Introduction to Data Mining

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 1 #1

1IntroductionThis book is an introduction to the young and fast-growing field of data mining (also known

as knowledge discovery from data, or KDD for short). The book focuses on fundamentaldata mining concepts and techniques for discovering interesting patterns from data invarious applications. In particular, we emphasize prominent techniques for developingeffective, efficient, and scalable data mining tools.

This chapter is organized as follows. In Section 1.1, you will learn why data mining isin high demand and how it is part of the natural evolution of information technology.Section 1.2 defines data mining with respect to the knowledge discovery process. Next,you will learn about data mining from many aspects, such as the kinds of data that canbe mined (Section 1.3), the kinds of knowledge to be mined (Section 1.4), the kinds oftechnologies to be used (Section 1.5), and targeted applications (Section 1.6). In thisway, you will gain a multidimensional view of data mining. Finally, Section 1.7 outlinesmajor data mining research and development issues.

1.1 Why Data Mining?Necessity, who is the mother of invention. Plato

We live in a world where vast amounts of data are collected daily. Analyzing such datais an important need. Section 1.1.1 looks at how data mining can meet this need byproviding tools to discover knowledge from data. In Section 1.1.2, we observe how datamining can be viewed as a result of the natural evolution of information technology.

1.1.1 Moving toward the Information AgeWe are living in the information age is a popular saying; however, we are actually livingin the data age. Terabytes or petabytes1 of data pour into our computer networks, theWorld Wide Web (WWW), and various data storage devices every day from business,

1A petabyte is a unit of information or computer storage equal to 1 quadrillion bytes, or a thousandterabytes, or 1 million gigabytes.

c 2012 Elsevier Inc. All rights reserved.Data Mining: Concepts and Techniques 1

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 2 #2

2 Chapter 1 Introduction

society, science and engineering, medicine, and almost every other aspect of daily life.This explosive growth of available data volume is a result of the computerization ofour society and the fast development of powerful data collection and storage tools.Businesses worldwide generate gigantic data sets, including sales transactions, stocktrading records, product descriptions, sales promotions, company profiles and perfor-mance, and customer feedback. For example, large stores, such as Wal-Mart, handlehundreds of millions of transactions per week at thousands of branches around theworld. Scientific and engineering practices generate high orders of petabytes of data ina continuous manner, from remote sensing, process measuring, scientific experiments,system performance, engineering observations, and environment surveillance.

Global backbone telecommunication networks carry tens of petabytes of data trafficevery day. The medical and health industry generates tremendous amounts of data frommedical records, patient monitoring, and medical imaging. Billions of Web searchessupported by search engines process tens of petabytes of data daily. Communities andsocial media have become increasingly important data sources, producing digital pic-tures and videos, blogs, Web communities, and various kinds of social networks. Thelist of sources that generate huge amounts of data is endless.

This explosively growing, widely available, and gigantic body of data makes ourtime truly the data age. Powerful and versatile tools are badly needed to automaticallyuncover valuable information from the tremendous amounts of data and to transformsuch data into organized knowledge. This necessity has led to the birth of data mining.The field is young, dynamic, and promising. Data mining has and will continue to makegreat strides in our journey from the data age toward the coming information age.

Example 1.1 Data mining turns a large collection of data into knowledge. A search engine (e.g.,Google) receives hundreds of millions of queries every day. Each query can be viewedas a transaction where the user describes her or his information need. What novel anduseful knowledge can a search engine learn from such a huge collection of queries col-lected from users over time? Interestingly, some patterns found in user search queriescan disclose invaluable knowledge that cannot be obtained by reading individual dataitems alone. For example, Googles Flu Trends uses specific search terms as indicators offlu activity. It found a close relationship between the number of people who search forflu-related information and the number of people who actually have flu symptoms. Apattern emerges when all of the search queries related to flu are aggregated. Using aggre-gated Google search data, Flu Trends can estimate flu activity up to two weeks fasterthan traditional systems can.2 This example shows how data mining can turn a largecollection of data into knowledge that can help meet a current global challenge.

1.1.2 Data Mining as the Evolution of Information TechnologyData mining can be viewed as a result of the natural evolution of information tech-nology. The database and data management industry evolved in the development of

2This is reported in [GMP+09].

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 3 #3

1.1 Why Data Mining? 3

Data Collection and Database Creation(1960s and earlier)

Primitive file processing

Database Management Systems(1970s to early 1980s)

Hierarchical and network database systemsRelational database systemsData modeling: entity-relationship models, etc.Indexing and accessing methodsQuery languages: SQL, etc.User interfaces, forms, and reportsQuery processing and optimizationTransactions, concurrency control, and recoveryOnline transaction processing (OLTP)

Advanced Database Systems(mid-1980s to present)

Advanced data models: extended-relational,object relational, deductive, etc.Managing complex data: spatial, temporal,multimedia, sequence and structured,scientific, engineering, moving objects, etc.Data streams and cyber-physical data systemsWeb-based databases (XML, semantic web)Managing uncertain data and data cleaningIntegration of heterogeneous sourcesText database systems and integration withinformation retrievalExtremely large data managementDatabase system tuning and adaptive systemsAdvanced queries: ranking, skyline, etc.Cloud computing and parallel data processingIssues of data privacy and security

Advanced Data Analysis(late- 1980s to present)

Data warehouse and OLAPData mining and knowledge discovery:classification, clustering, outlier analysis,association and correlation, comparativesummary, discrimination analysis, patterndiscovery, trend and deviation analysis, etc.Mining complex types of data: streams,sequence, text, spatial, temporal, multimedia,Web, networks, etc.Data mining applications: business, society,retail, banking, telecommunications, scienceand engineering, blogs, daily life, etc.Data mining and society: invisible datamining, privacy-preserving data mining,mining social and information networks,recommender systems, etc.

Future Generation of Information Systems(Present to future)

Figure 1.1 The evolution of database system technology.

several critical functionalities (Figure 1.1): data collection and database creation, datamanagement (including data storage and retrieval and database transaction processing),and advanced data analysis (involving data warehousing and data mining). The earlydevelopment of data collection and database creation mechanisms served as a prerequi-site for the later development of effective mechanisms for data storage and retrieval,as well as query and transaction processing. Nowadays numerous database systemsoffer query and transaction processing as common practice. Advanced data analysis hasnaturally become the next step.

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 4 #4


Since the 1960s, database and information technology has evolved systematicallyfrom primitive file processing systems to sophisticated and powerful database systems.The research and development in database systems since the 1970s progressed fromearly hierarchical and network database systems to relational database systems (wheredata are stored in relational table structures; see Section 1.3.1), data modeling tools,and indexing and accessing methods. In addition, users gained convenient and flexibledata access through query languages, user interfaces, query optimization, and transac-tion management. Efficient methods for online transaction processing (OLTP), where aquery is viewed as a read-only transaction, contributed substantially to the evolution andwide acceptance of relational technology as a major tool for efficient storage, retrieval,and management of large amounts of data.

After the establishment of database management systems, database technologymoved toward the development of advanced database systems, data warehousing, anddata mining for advanced data analysis and web-based databases. Advanced databasesystems, for example, resulted from an upsurge of research from the mid-1980s onward.These systems incorporate new and powerful data models such as extended-relational,object-oriented, object-relational, and deductive models. Application-oriented databasesystems have flourished, including spatial, temporal, multimedia, active, stream andsensor, scientific and engineering databases, knowledge bases, and office informationbases. Issues related to the distribution, diversification, and sharing of data have beenstudied extensively.

Advanced data analysis sprang up from the late 1980s onward. The steady anddazzling progress of computer hardware technology in the past three decades led tolarge supplies of powerful and affordable computers, data collection equipment, andstorage media. This technology provides a great boost to the database and informationindustry, and it enables a huge number of databases and information repositories to beavailable for transaction management, information retrieval, and data analysis. Datacan now be stored in many different kinds of databases and information repositories.

One emerging data repository architecture is the data warehouse (Section 1.3.2).This is a repository of multiple heterogeneous data sources organized under a uni-fied schema at a single site to facilitate management decision making. Data warehousetechnology includes data cleaning, data integration, and online analytical processing(OLAP)that is, analysis techniques with functionalities such as summarization, con-solidation, and aggregation, as well as the ability to view information from differentangles. Although OLAP tools support multidimensional analysis and decision making,additional data analysis tools are required for in-depth analysisfor example, data min-ing tools that provide data classification, clustering, outlier/anomaly detection, and thecharacterization of changes in data over time.

Huge volumes of data have been accumulated beyond databases and data ware-houses. During the 1990s, the World Wide Web and web-based databases (e.g., XMLdatabases) began to appear. Internet-based global information bases, such as the WWWand various kinds of interconnected, heterogeneous databases, have emerged and playa vital role in the information industry. The effective and efficient analysis of data fromsuch different forms of data by integration of information retrieval, data mining, andinformation network analysis technologies is a challenging task.

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 5 #5

1.2 What Is Data Mining? 5

How can I analyze these data?

Figure 1.2 The world is data rich but information poor.

In summary, the abundance of data, coupled with the need for powerful data analysistools, has been described as a data rich but information poor situation (Figure 1.2). Thefast-growing, tremendous amount of data, collected and stored in large and numerousdata repositories, has far exceeded our human ability for comprehension without power-ful tools. As a result, data collected in large data repositories become data tombsdataarchives that are seldom visited. Consequently, important decisions are often madebased not on the information-rich data stored in data repositories but rather on a deci-sion makers intuition, simply because the decision maker does not have the tools toextract the valuable knowledge embedded in the vast amounts of data. Efforts havebeen made to develop expert system and knowledge-based technologies, which typicallyrely on users or domain experts to manually input knowledge into knowledge bases.Unfortunately, however, the manual knowledge input procedure is prone to biases anderrors and is extremely costly and time consuming. The widening gap between data andinformation calls for the systematic development of data mining tools that can turn datatombs into golden nuggets of knowledge.

1.2 What Is Data Mining?It is no surprise that data mining, as a truly interdisciplinary subject, can be definedin many different ways. Even the term data mining does not really present all the majorcomponents in the picture. To refer to the mining of gold from rocks or sand, we say goldmining instead of rock or sand mining. Analogously, data mining should have been more

HAN 08-ch01-001-038-9780123814791 2011/6/1 3:12 Page 6 #6


Knowledge

Figure 1.3 Data miningsearching for knowledge (interesting patterns) in data.

appropriately named knowledge mining from data, which is unfortunately somewhatlong. However, the shorter term, knowledge mining may not reflect the emphasis onmining from large amounts of data. Nevertheless, mining is a vivid term characterizingthe process that finds a small set of precious nuggets from a great deal of raw material(Figure 1.3). Thus, such a misnomer carrying both data and mining became a pop-ular choice. In addition, many other terms have a similar meaning to data miningforexample, knowledge mining from data, knowledge extraction, data/pattern analysis, dataarchaeology, and data dredging.

Many people treat data mining as a synonym for another popularly used term,knowledge discovery from data, or KDD, while others view data mining as merely anessential step in the process of knowledge discovery. The knowledge discovery process isshown in Figure 1.4 as an iterative sequence of the following steps:

1. Data cleaning (to remove noise and inconsistent data)

2. Data integration (where multiple data sources may be combined)3

3A popular trend in the information industry is to perform data cleaning and data integration as apreprocessing step, where the resulting data are stored in a data warehouse.

Front Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

Chapter 2. Getting to Know Your Data2.1 Data Objects and Attribute Types2.2 Basic Statistical Descriptions of Data2.3 Data Visualization2.4 Measuring Data Similarity and Dissimilarity2.5 Summary2.6 Exercises2.7 Bibliographic Notes

Chapter 3. Data Preprocessing3.1 Data Preprocessing: An Overview3.2 Data Cleaning3.3 Data Integration3.4 Data Reduction3.5 Data Transformation and Data Discretization3.6 Summary3.7 Exercises3.8 Bibliographic Notes

Chapter 4. Data Warehousing and Online Analytical Processing4.1 Data Warehouse: Basic Concepts4.2 Data Warehouse Modeling: Data Cube and OLAP4.3 Data Warehouse Design and Usage4.4 Data Warehouse Implementation4.5 Data Generalization by Attribute-Oriented Induction4.6 Summary4.7 Exercises4.8 Bibliographic Notes

Chapter 5. Data Cube Technology5.1 Data Cube Computation: Preliminary Concepts5.2 Data Cube Computation Methods5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology5.4 Multidimensional Data Analysis in Cube Space5.5 Summary5.6 Exercises5.7 Bibliographic Notes

Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods6.1 Basic Concepts6.2 Frequent Itemset Mining Methods6.3 Which Patterns Are Interesting?Pattern Evaluation Methods6.4 Summary6.5 Exercises6.6 Bibliographic Notes

Chapter 7. Advanced Pattern Mining7.1 Pattern Mining: A Road Map7.2 Pattern Mining in Multilevel, Multidimensional Space7.3 Constraint-Based Frequent Pattern Mining7.4 Mining High-Dimensional Data and Colossal Patterns7.5 Mining Compressed or Approximate Patterns7.6 Pattern Exploration and Application7.7 Summary7.8 Exercises7.9 Bibliographic Notes

Chapter 8. Classification: Basic Concepts8.1 Basic Concepts8.2 Decision Tree Induction8.3 Bayes Classification Methods8.4 Rule-Based Classification8.5 Model Evaluation and Selection8.6 Techniques to Improve Classification Accuracy8.7 Summary8.8 Exercises8.9 Bibliographic Notes

Chapter 9. Classification: Advanced Methods9.1 Bayesian Belief Networks9.2 Classification by Backpropagation9.3 Support Vector Machines9.4 Classification Using Frequent Patterns9.5 Lazy Learners (or Learning from Your Neighbors)9.6 Other Classification Methods9.7 Additional Topics Regarding Classification9.8 Summary9.9 Exercises9.10 Bibliographic Notes

Chapter 10. Cluster Analysis: Basic Concepts and Methods10.1 Cluster Analysis10.2 Partitioning Methods10.3 Hierarchical Methods10.4 Density-Based Methods10.5 Grid-Based Methods10.6 Evaluation of Clustering10.7 Summary10.8 Exercises10.9 Bibliographic Notes

Chapter 11. Advanced Cluster Analysis11.1 Probabilistic Model-Based Clustering11.2 Clustering High-Dimensional Data11.3 Clustering Graph and Network Data11.4 Clustering with Constraints11.5 Summary11.6 Exercises11.7 Bibliographic Notes

Chapter 12. Outlier Detection12.1 Outliers and Outlier Analysis12.2 Outlier Detection Methods12.3 Statistical Approaches12.4 Proximity-Based Approaches12.5 Clustering-Based Approaches12.6 Classification-Based Approaches12.7 Mining Contextual and Collective Outliers12.8 Outlier Detection in High-Dimensional Data12.9 Summary12.10 Exercises12.11 Bibliographic Notes

Chapter 13. Data Mining Trends and Research Frontiers13.1 Mining Complex Data Types13.2 Other Methodologies of Data Mining13.3 Data Mining Applications13.4 Data Mining and Society13.5 Data Mining Trends13.6 Summary13.7 Exercises13.8 Bibliographic Notes

BibliographyIndexFront Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

































































BibliographyIndex

01 Introduction to Data Mining

Documents

Transcript of 01 Introduction to Data Mining