06 data Mining-Data Preprocessing-Integration

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 93 #11

3.3 Data Integration 93

trace. Most errors, however, will require data transformations. That is, once we finddiscrepancies, we typically need to define and apply (a series of) transformations tocorrect them.

Commercial tools can assist in the data transformation step. Data migration toolsallow simple transformations to be specified such as to replace the string gender bysex. ETL (extraction/transformation/loading) tools allow users to specify transformsthrough a graphical user interface (GUI). These tools typically support only a restrictedset of transforms so that, often, we may also choose to write custom scripts for this stepof the data cleaning process.

The two-step process of discrepancy detection and data transformation (to correctdiscrepancies) iterates. This process, however, is error-prone and time consuming. Sometransformations may introduce more discrepancies. Some nested discrepancies may onlybe detected after others have been fixed. For example, a typo such as 20010 in a yearfield may only surface once all date values have been converted to a uniform format.Transformations are often done as a batch process while the user waits without feedback.Only after the transformation is complete can the user go back and check that no newanomalies have been mistakenly created. Typically, numerous iterations are requiredbefore the user is satisfied. Any tuples that cannot be automatically handled by a giventransformation are typically written to a file without any explanation regarding the rea-soning behind their failure. As a result, the entire data cleaning process also suffers froma lack of interactivity.

New approaches to data cleaning emphasize increased interactivity. Potters Wheel,for example, is a publicly available data cleaning tool that integrates discrepancy detec-tion and transformation. Users gradually build a series of transformations by composingand debugging individual transformations, one step at a time, on a spreadsheet-likeinterface. The transformations can be specified graphically or by providing examples.Results are shown immediately on the records that are visible on the screen. The usercan choose to undo the transformations, so that transformations that introduced addi-tional errors can be erased. The tool automatically performs discrepancy checking inthe background on the latest transformed view of the data. Users can gradually developand refine transformations as discrepancies are found, leading to more effective andefficient data cleaning.

Another approach to increased interactivity in data cleaning is the development ofdeclarative languages for the specification of data transformation operators. Such workfocuses on defining powerful extensions to SQL and algorithms that enable users toexpress data cleaning specifications efficiently.

As we discover more about the data, it is important to keep updating the metadatato reflect this knowledge. This will help speed up data cleaning on future versions of thesame data store.

3.3 Data IntegrationData mining often requires data integrationthe merging of data from multiple datastores. Careful integration can help reduce and avoid redundancies and inconsistencies

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 94 #12

94 Chapter 3 Data Preprocessing

in the resulting data set. This can help improve the accuracy and speed of the subsequentdata mining process.

The semantic heterogeneity and structure of data pose great challenges in data inte-gration. How can we match schema and objects from different sources? This is theessence of the entity identification problem, described in Section 3.3.1. Are any attributescorrelated? Section 3.3.2 presents correlation tests for numeric and nominal data. Tupleduplication is described in Section 3.3.3. Finally, Section 3.3.4 touches on the detectionand resolution of data value conflicts.

3.3.1 Entity Identification ProblemIt is likely that your data analysis task will involve data integration, which combines datafrom multiple sources into a coherent data store, as in data warehousing. These sourcesmay include multiple databases, data cubes, or flat files.

There are a number of issues to consider during data integration. Schema integrationand object matching can be tricky. How can equivalent real-world entities from multipledata sources be matched up? This is referred to as the entity identification problem.For example, how can the data analyst or the computer be sure that customer id in onedatabase and cust number in another refer to the same attribute? Examples of metadatafor each attribute include the name, meaning, data type, and range of values permittedfor the attribute, and null rules for handling blank, zero, or null values (Section 3.2).Such metadata can be used to help avoid errors in schema integration. The metadatamay also be used to help transform the data (e.g., where data codes for pay type in onedatabase may be H and S but 1 and 2 in another). Hence, this step also relates todata cleaning, as described earlier.

When matching attributes from one database to another during integration, specialattention must be paid to the structure of the data. This is to ensure that any attributefunctional dependencies and referential constraints in the source system match those inthe target system. For example, in one system, a discount may be applied to the order,whereas in another system it is applied to each individual line item within the order.If this is not caught before integration, items in the target system may be improperlydiscounted.

3.3.2 Redundancy and Correlation AnalysisRedundancy is another important issue in data integration. An attribute (such as annualrevenue, for instance) may be redundant if it can be derived from another attributeor set of attributes. Inconsistencies in attribute or dimension naming can also causeredundancies in the resulting data set.

Some redundancies can be detected by correlation analysis. Given two attributes,such analysis can measure how strongly one attribute implies the other, based on theavailable data. For nominal data, we use the 2 (chi-square) test. For numeric attributes,we can use the correlation coefficient and covariance, both of which access how oneattributes values vary from those of another.

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 95 #13


2 Correlation Test for Nominal DataFor nominal data, a correlation relationship between two attributes, A and B, can bediscovered by a2 (chi-square) test. Suppose A has c distinct values, namely a1,a2, . . .ac .B has r distinct values, namely b1,b2, . . .br . The data tuples described by A and B can beshown as a contingency table, with the c values of A making up the columns and the rvalues of B making up the rows. Let (Ai ,Bj) denote the joint event that attribute A takeson value ai and attribute B takes on value bj , that is, where (A= ai ,B = bj). Each andevery possible (Ai ,Bj) joint event has its own cell (or slot) in the table. The 2 value(also known as the Pearson 2 statistic) is computed as

2 =c

i=1

rj=1

(oij eij)2eij

, (3.1)

where oij is the observed frequency (i.e., actual count) of the joint event (Ai ,Bj) and eij isthe expected frequency of (Ai ,Bj), which can be computed as

eij =count(A= ai) count(B = bj)

n, (3.2)

where n is the number of data tuples, count(A= ai) is the number of tuples having valueai for A, and count(B = bj) is the number of tuples having value bj for B. The sum inEq. (3.1) is computed over all of the r c cells. Note that the cells that contribute themost to the 2 value are those for which the actual count is very different from thatexpected.

The 2 statistic tests the hypothesis that A and B are independent, that is, there is nocorrelation between them. The test is based on a significance level, with (r 1) (c 1)degrees of freedom. We illustrate the use of this statistic in Example 3.1. If the hypothesiscan be rejected, then we say that A and B are statistically correlated.

Example 3.1 Correlation analysis of nominal attributes using 2. Suppose that a group of 1500people was surveyed. The gender of each person was noted. Each person was polled asto whether his or her preferred type of reading material was fiction or nonfiction. Thus,we have two attributes, gender and preferred reading. The observed frequency (or count)of each possible joint event is summarized in the contingency table shown in Table 3.1,where the numbers in parentheses are the expected frequencies. The expected frequen-cies are calculated based on the data distribution for both attributes using Eq. (3.2).

Using Eq. (3.2), we can verify the expected frequencies for each cell. For example,the expected frequency for the cell (male, fiction) is

e11 = count(male) count(fiction)n

= 300 4501500

= 90,

and so on. Notice that in any row, the sum of the expected frequencies must equal thetotal observed frequency for that row, and the sum of the expected frequencies in anycolumn must also equal the total observed frequency for that column.

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 96 #14


Table 3.1 Example 2.1s 2 2 Contingency Table Datamale female Total

fiction 250 (90) 200 (360) 450

non fiction 50 (210) 1000 (840) 1050

Total 300 1200 1500

Note: Are gender and preferred reading correlated?

Using Eq. (3.1) for 2 computation, we get

2 = (250 90)2

90+ (50 210)

2

210+ (200 360)

2

360+ (1000 840)

2

840

= 284.44+ 121.90+ 71.11+ 30.48= 507.93.For this 2 2 table, the degrees of freedom are (2 1)(2 1)= 1. For 1 degree of free-dom, the 2 value needed to reject the hypothesis at the 0.001 significance level is 10.828(taken from the table of upper percentage points of the 2 distribution, typically avail-able from any textbook on statistics). Since our computed value is above this, we canreject the hypothesis that gender and preferred reading are independent and concludethat the two attributes are (strongly) correlated for the given group of people.

Correlation Coefficient for Numeric DataFor numeric attributes, we can evaluate the correlation between two attributes, A and B,by computing the correlation coefficient (also known as Pearsons product momentcoefficient, named after its inventer, Karl Pearson). This is

rA,B =

ni=1(ai A)(bi B)

nAB=

ni=1(aibi) nAB

nAB, (3.3)

where n is the number of tuples, ai and bi are the respective values of A and B in tuple i,A and B are the respective mean values of A and B, A and B are the respective standarddeviations of A and B (as defined in Section 2.2.2), and 6(aibi) is the sum of the ABcross-product (i.e., for each tuple, the value for A is multiplied by the value for B in thattuple). Note that 1 rA,B +1. If rA,B is greater than 0, then A and B are positivelycorrelated, meaning that the values of A increase as the values of B increase. The higherthe value, the stronger the correlation (i.e., the more each attribute implies the other).Hence, a higher value may indicate that A (or B) may be removed as a redundancy.

If the resulting value is equal to 0, then A and B are independent and there is nocorrelation between them. If the resulting value is less than 0, then A and B are negativelycorrelated, where the values of one attribute increase as the values of the other attributedecrease. This means that each attribute discourages the other. Scatter plots can also beused to view correlations between attributes (Section 2.2.3). For example, Figure 2.8s

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 97 #15


scatter plots respectively show positively correlated data and negatively correlated data,while Figure 2.9 displays uncorrelated data.

Note that correlation does not imply causality. That is, if A and B are correlated, thisdoes not necessarily imply that A causes B or that B causes A. For example, in analyzing ademographic database, we may find that attributes representing the number of hospitalsand the number of car thefts in a region are correlated. This does not mean that onecauses the other. Both are actually causally linked to a third attribute, namely, population.

Covariance of Numeric DataIn probability theory and statistics, correlation and covariance are two similar measuresfor assessing how much two attributes change together. Consider two numeric attributesA and B, and a set of n observations {(a1,b1), . . . ,(an,bn)}. The mean values of A and B,respectively, are also known as the expected values on A and B, that is,

E(A)= A=n

i=1 ain

and

E(B)= B =n

i=1 bin

.

The covariance between A and B is defined as

Cov(A,B)= E((A A)(B B))=n

i=1(ai A)(bi B)n

. (3.4)

If we compare Eq. (3.3) for rA,B (correlation coefficient) with Eq. (3.4) for covariance,we see that

rA,B = Cov(A,B)AB

, (3.5)

where A and B are the standard deviations of A and B, respectively. It can also beshown that

Cov(A,B)= E(A B) AB. (3.6)This equation may simplify calculations.

For two attributes A and B that tend to change together, if A is larger than A (theexpected value of A), then B is likely to be larger than B (the expected value of B).Therefore, the covariance between A and B is positive. On the other hand, if one ofthe attributes tends to be above its expected value when the other attribute is below itsexpected value, then the covariance of A and B is negative.

If A and B are independent (i.e., they do not have correlation), then E(A B)= E(A) E(B). Therefore, the covariance is Cov(A,B)= E(A B) AB = E(A) E(B) AB = 0.However, the converse is not true. Some pairs of random variables (attributes) may havea covariance of 0 but are not independent. Only under some additional assumptions

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 98 #16


Table 3.2 Stock Prices for AllElectronics and HighTech

Time point AllElectronics HighTech

t1 6 20

t2 5 10

t3 4 14

t4 3 5

t5 2 5

(e.g., the data follow multivariate normal distributions) does a covariance of 0 implyindependence.

Example 3.2 Covariance analysis of numeric attributes. Consider Table 3.2, which presents a sim-plified example of stock prices observed at five time points for AllElectronics andHighTech, a high-tech company. If the stocks are affected by the same industry trends,will their prices rise or fall together?

E(AllElectronics)= 6+ 5+ 4+ 3+ 25

= 205= $4

and

E(HighTech)= 20+ 10+ 14+ 5+ 55

= 545= $10.80.

Thus, using Eq. (3.4), we compute

Cov(AllElectroncis,HighTech)= 6 20+ 5 10+ 4 14+ 3 5+ 2 55

4 10.80= 50.2 43.2= 7.

Therefore, given the positive covariance we can say that stock prices for both companiesrise together.

Variance is a special case of covariance, where the two attributes are identical (i.e., thecovariance of an attribute with itself). Variance was discussed in Chapter 2.

3.3.3 Tuple DuplicationIn addition to detecting redundancies between attributes, duplication should also bedetected at the tuple level (e.g., where there are two or more identical tuples for a givenunique data entry case). The use of denormalized tables (often done to improve per-formance by avoiding joins) is another source of data redundancy. Inconsistencies oftenarise between various duplicates, due to inaccurate data entry or updating some but notall data occurrences. For example, if a purchase order database contains attributes for

HAN 10-ch03-083-124-9780123814791 2011/6/1 3:16 Page 99 #17

3.4 Data Reduction 99

the purchasers name and address instead of a key to this information in a purchaserdatabase, discrepancies can occur, such as the same purchasers name appearing withdifferent addresses within the purchase order database.

3.3.4 Data Value Conflict Detection and ResolutionData integration also involves the detection and resolution of data value conflicts. Forexample, for the same real-world entity, attribute values from different sources may dif-fer. This may be due to differences in representation, scaling, or encoding. For instance,a weight attribute may be stored in metric units in one system and British imperialunits in another. For a hotel chain, the price of rooms in different cities may involvenot only different currencies but also different services (e.g., free breakfast) and taxes.When exchanging information between schools, for example, each school may have itsown curriculum and grading scheme. One university may adopt a quarter system, offerthree courses on database systems, and assign grades from A+ to F, whereas anothermay adopt a semester system, offer two courses on databases, and assign grades from 1to 10. It is difficult to work out precise course-to-grade transformation rules betweenthe two universities, making information exchange difficult.

Attributes may also differ on the abstraction level, where an attribute in one sys-tem is recorded at, say, a lower abstraction level than the same attribute in another.For example, the total sales in one database may refer to one branch of All Electronics,while an attribute of the same name in another database may refer to the total salesfor All Electronics stores in a given region. The topic of discrepancy detection is furtherdescribed in Section 3.2.3 on data cleaning as a process.

3.4 Data ReductionImagine that you have selected data from the AllElectronics data warehouse for analysis.The data set will likely be huge! Complex data analysis and mining on huge amounts ofdata can take a long time, making such analysis impractical or infeasible.

Data reduction techniques can be applied to obtain a reduced representation of thedata set that is much smaller in volume, yet closely maintains the integrity of the originaldata. That is, mining on the reduced data set should be more efficient yet produce thesame (or almost the same) analytical results. In this section, we first present an overviewof data reduction strategies, followed by a closer look at individual techniques.

3.4.1 Overview of Data Reduction StrategiesData reduction strategies include dimensionality reduction, numerosity reduction, anddata compression.

Dimensionality reduction is the process of reducing the number of random variablesor attributes under consideration. Dimensionality reduction methods include wavelet

Front Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes

Chapter 2. Getting to Know Your Data2.1 Data Objects and Attribute Types2.2 Basic Statistical Descriptions of Data2.3 Data Visualization2.4 Measuring Data Similarity and Dissimilarity2.5 Summary2.6 Exercises2.7 Bibliographic Notes

Chapter 3. Data Preprocessing3.1 Data Preprocessing: An Overview3.2 Data Cleaning3.3 Data Integration3.4 Data Reduction3.5 Data Transformation and Data Discretization3.6 Summary3.7 Exercises3.8 Bibliographic Notes

Chapter 4. Data Warehousing and Online Analytical Processing4.1 Data Warehouse: Basic Concepts4.2 Data Warehouse Modeling: Data Cube and OLAP4.3 Data Warehouse Design and Usage4.4 Data Warehouse Implementation4.5 Data Generalization by Attribute-Oriented Induction4.6 Summary4.7 Exercises4.8 Bibliographic Notes

Chapter 5. Data Cube Technology5.1 Data Cube Computation: Preliminary Concepts5.2 Data Cube Computation Methods5.3 Processing Advanced Kinds of Queries by Exploring Cube Technology5.4 Multidimensional Data Analysis in Cube Space5.5 Summary5.6 Exercises5.7 Bibliographic Notes

Chapter 6. Mining Frequent Patterns, Associations, and Correlations: Basic Concepts and Methods6.1 Basic Concepts6.2 Frequent Itemset Mining Methods6.3 Which Patterns Are Interesting?Pattern Evaluation Methods6.4 Summary6.5 Exercises6.6 Bibliographic Notes

Chapter 7. Advanced Pattern Mining7.1 Pattern Mining: A Road Map7.2 Pattern Mining in Multilevel, Multidimensional Space7.3 Constraint-Based Frequent Pattern Mining7.4 Mining High-Dimensional Data and Colossal Patterns7.5 Mining Compressed or Approximate Patterns7.6 Pattern Exploration and Application7.7 Summary7.8 Exercises7.9 Bibliographic Notes

Chapter 8. Classification: Basic Concepts8.1 Basic Concepts8.2 Decision Tree Induction8.3 Bayes Classification Methods8.4 Rule-Based Classification8.5 Model Evaluation and Selection8.6 Techniques to Improve Classification Accuracy8.7 Summary8.8 Exercises8.9 Bibliographic Notes

Chapter 9. Classification: Advanced Methods9.1 Bayesian Belief Networks9.2 Classification by Backpropagation9.3 Support Vector Machines9.4 Classification Using Frequent Patterns9.5 Lazy Learners (or Learning from Your Neighbors)9.6 Other Classification Methods9.7 Additional Topics Regarding Classification9.8 Summary9.9 Exercises9.10 Bibliographic Notes

Chapter 10. Cluster Analysis: Basic Concepts and Methods10.1 Cluster Analysis10.2 Partitioning Methods10.3 Hierarchical Methods10.4 Density-Based Methods10.5 Grid-Based Methods10.6 Evaluation of Clustering10.7 Summary10.8 Exercises10.9 Bibliographic Notes

Chapter 11. Advanced Cluster Analysis11.1 Probabilistic Model-Based Clustering11.2 Clustering High-Dimensional Data11.3 Clustering Graph and Network Data11.4 Clustering with Constraints11.5 Summary11.6 Exercises11.7 Bibliographic Notes

Chapter 12. Outlier Detection12.1 Outliers and Outlier Analysis12.2 Outlier Detection Methods12.3 Statistical Approaches12.4 Proximity-Based Approaches12.5 Clustering-Based Approaches12.6 Classification-Based Approaches12.7 Mining Contextual and Collective Outliers12.8 Outlier Detection in High-Dimensional Data12.9 Summary12.10 Exercises12.11 Bibliographic Notes

Chapter 13. Data Mining Trends and Research Frontiers13.1 Mining Complex Data Types13.2 Other Methodologies of Data Mining13.3 Data Mining Applications13.4 Data Mining and Society13.5 Data Mining Trends13.6 Summary13.7 Exercises13.8 Bibliographic Notes

BibliographyIndexFront Cover Data Mining: Concepts and TechniquesCopyrightDedicationTable of ContentsForewordForeword to Second EditionPrefaceAcknowledgmentsAbout the AuthorsChapter 1. Introduction1.1 Why Data Mining?1.2 What Is Data Mining?1.3 What Kinds of Data Can Be Mined?1.4 What Kinds of Patterns Can Be Mined?1.5 Which Technologies Are Used?1.6 Which Kinds of Applications Are Targeted?1.7 Major Issues in Data Mining1.8 Summary1.9 Exercises1.10 Bibliographic Notes














































































BibliographyIndex

06 data Mining-Data Preprocessing-Integration

Documents

Transcript of 06 data Mining-Data Preprocessing-Integration