Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out
Written By: Putten, Kok, Gupta
Presented By: Ernesto Ochandio
DSCI 5240
November Dec 7, 2005
2
Problem Definition
• Exponential growth in data capture leads to data fragmentation.
  – POS customer tracking
  – Corporate Data Warehouse
  – Advanced Analytics
• Increased popularity of personalized messages.
• Prohibitive attitudinal data costs.
3
Data Fusion Overview
• Data Fusion is the combination of information from different sources.
• Also known as: Micro Data Set Merging, Statistical Record Linkage, and Multi-Source Imputation
• Example:
  – Demographic and psychographic data aggregated at geographical level.
  – Same characteristics for people in the same region.
• Motivation:
  – Algorithms can create generalized fusions providing richer data sets for use in applications or future data mining projects.
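The geographical example above can be sketched in a few lines of Python. This is a minimal illustration only: the region names, the profile fields (`avg_income`, `pct_urban`), the customer records, and the function `fuse_by_region` are all hypothetical and do not come from the paper.

```python
# Hypothetical region-level profiles (illustrative values, not from the paper).
region_profiles = {
    "north": {"avg_income": 42000, "pct_urban": 0.35},
    "south": {"avg_income": 51000, "pct_urban": 0.80},
}

# Hypothetical customer records that only carry a region code.
customers = [
    {"id": "C1", "region": "north"},
    {"id": "C2", "region": "south"},
    {"id": "C3", "region": "north"},
]


def fuse_by_region(customers, region_profiles):
    """Attach the region-level profile to every customer in that region."""
    fused = []
    for customer in customers:
        # Customers in an unknown region keep only their own fields.
        profile = region_profiles.get(customer["region"], {})
        fused.append({**customer, **profile})
    return fused


fused_customers = fuse_by_region(customers, region_profiles)
for row in fused_customers:
    print(row)
```

Every customer in the same region ends up with identical characteristics, which is the limitation the slide points out: the match is only as fine-grained as the region.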
4
Data Fusion Terminology
• Recipient, Donor, Fused Variables, Common Variables, Critical Common Variables
[Diagram: Recipient + Donor = Fused Dataset, with columns labeled Common Variables and Fused Variables]
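One way to read the terminology, sketched with Python sets. The column names are hypothetical, and the reading used here is an assumption: Common Variables are those present in both data sets, Fused Variables are those only the Donor has, and Critical Common Variables are a chosen subset of the Common Variables that must match exactly.

```python
# Hypothetical column names for each data set (not from the paper).
recipient_columns = {"age", "gender", "region", "purchases"}
donor_columns = {"age", "gender", "region", "brand_attitude", "media_usage"}

# Common Variables: present in both Recipient and Donor.
common_variables = recipient_columns & donor_columns

# Fused Variables: present only in the Donor, to be added to the Recipient.
fused_variables = donor_columns - recipient_columns

# Critical Common Variables: an analyst's choice (assumed here), always a
# subset of the Common Variables.
critical_common_variables = {"gender"}
assert critical_common_variables <= common_variables

# The Fused Dataset keeps the Recipient's columns and gains the Fused Variables.
fused_dataset_columns = recipient_columns | fused_variables

print(sorted(common_variables))
print(sorted(fused_variables))
print(sorted(fused_dataset_columns))
```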
5
Data Fusion Algorithm
• Find best Donor elements that match the Recipient element.
• Ensure Critical Variable exact match.
• Limit Donor element usage.
• Use averages from the Donor set to estimate the Fused variables for the Recipient set.
[Figure: Recipient + Donor = Fused Dataset]
Recipient: rows C1 to C18, each with common variables X, Y, Z.
Donor: five rows, each with common variables X, Y, Z and four fused values:
  X, Y, Z  10 10 10 10
  X, Y, Z  20 20 20 20
  X, Y, Z  10 10 10 10
  X, Y, Z  20 20 20 20
  X, Y, Z  30 30 30 30
Fused Dataset: rows C1 to C18 with common variables X, Y, Z. C1 receives fused values 15 15 15 15 and C10 receives 20 20 20 20; the fused values of the remaining rows are masked in the source.
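The four steps above can be sketched as a nearest-neighbour match. This is a rough sketch under stated assumptions, not the authors' implementation: Euclidean distance over the common variables, the parameters `k` (donors averaged per recipient) and `max_uses` (cap on how often one donor may be used), the function names, and all data values are hypothetical.

```python
import math
from collections import Counter


def distance(a, b, variables):
    """Euclidean distance between two records over the given variables."""
    return math.sqrt(sum((a[v] - b[v]) ** 2 for v in variables))


def fuse(recipients, donors, common, critical, fused, k=2, max_uses=3):
    """Estimate fused variables for each recipient from its best donors."""
    usage = Counter()  # how many times each donor index has been used
    result = []
    for rec in recipients:
        # Keep donors that match exactly on the critical variables and
        # have not yet reached the usage limit.
        candidates = [
            (distance(rec, don, common), idx)
            for idx, don in enumerate(donors)
            if all(rec[c] == don[c] for c in critical) and usage[idx] < max_uses
        ]
        candidates.sort()  # closest donors first, ties broken by index
        best = [idx for _, idx in candidates[:k]]
        row = dict(rec)
        for var in fused:
            # Average the donors' values; None when no donor qualifies.
            row[var] = sum(donors[i][var] for i in best) / len(best) if best else None
        usage.update(best)
        result.append(row)
    return result


# Hypothetical data: gender is the critical variable, attitude is fused.
donors = [
    {"gender": "F", "age": 30, "income": 40, "attitude": 10},
    {"gender": "F", "age": 34, "income": 44, "attitude": 20},
    {"gender": "M", "age": 31, "income": 41, "attitude": 30},
]
recipients = [
    {"id": "C1", "gender": "F", "age": 32, "income": 42},
    {"id": "C2", "gender": "M", "age": 50, "income": 60},
]

fused_rows = fuse(recipients, donors, ["age", "income"], ["gender"], ["attitude"])
for row in fused_rows:
    print(row["id"], row["attitude"])
```

In this sketch C1 is matched to both female donors and receives the average of their attitude scores, while C2 can only use the single male donor even though that donor is far away on age and income. The usage cap is what keeps a few popular donors from being over-represented in the fused data set.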
6
Conclusion
• Data Fusion increases the value of Data Mining by creating more data to mine while reducing costs and ensuring the best matches possible without over-representing elements in the Donor set.