Data Mining: Crossing the Chasm
![Page 1: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/1.jpg)
Data Mining: Crossing the Chasm
Rakesh Agrawal
IBM Almaden Research Center
![Page 2: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/2.jpg)
Thesis
• The greatest challenge facing data mining is to make the transition from being an early market technology to mainstream technology
• We have the opportunity to make this transition successful
![Page 3: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/3.jpg)
Outline
• Chasm in the technology adoption life cycle, à la Geoffrey Moore†
• Experience with Quest/Intelligent Miner
• Ideas for successful chasm crossing
† Geoffrey A Moore. Crossing the Chasm. Harper Business. http://www.chasmgroup.com
![Page 4: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/4.jpg)
Technology Adoption Life Cycle
Adoption curve, left to right:
• Innovators (Techies): Try it!
• Early Adopters (Visionaries): Get ahead of the herd!
• Early Majority (Pragmatists): Stick with the herd!
• Late Majority (Conservatives): Hold on!
• Laggards (Skeptics): No way!
Psychographic profile of each group is different
![Page 5: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/5.jpg)
Innovators: Technology Enthusiasts
• Intrigued by any fundamental advance in technology
• Like to alpha test new products
• Can ignore the missing elements
• Want access to top technologists
• Want no-profit pricing (preferably free)
Gatekeepers to early adopters
![Page 6: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/6.jpg)
Early Adopters: Visionaries
• Driven by vision of dramatic competitive advantage via revolutionary breakthroughs
• Great imagination for strategic applications
• Not so price-sensitive
• Want rapid time to market
• Demand high degree of customization
Fund the development of early market
![Page 7: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/7.jpg)
Early Majority: Pragmatists
• Want sustainable productivity improvement through evolutionary change
• Astute managers of mission-critical apps
• Understand real-world issues and tradeoffs
• Focus on proven applications; want to see the solution in production
Bulwark of the mainstream market
![Page 8: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/8.jpg)
Late Majority: Conservatives
• Want to stay even with the competition
• Risk averse
• Price sensitive
• Need completely pre-assembled solutions
Extend technology life cycles
![Page 9: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/9.jpg)
Laggards: Skeptics
• Driven to maintain status quo
• Good at debunking marketing hype
• Disbelieve productivity-improvement arguments
• Can be formidable opposition to early adoption of a technology
Retard the development of high-tech markets
![Page 10: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/10.jpg)
Crack in the curve
Early Market | Chasm | Mainstream Market
The greatest peril in the development of a high-tech market lies in making the transition from an early market dominated by a few visionaries to a mainstream market dominated by pragmatists.
![Page 11: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/11.jpg)
Visionaries vs. Pragmatists
Visionaries:
• Adventurous
• First strike capability
• Early buy-in
• State of the art
• Think big
• Spend big
Pragmatists:
• Prudent
• Staying power
• Wait-and-see
• Industry standard
• Manage expectations
• Spend to budget
![Page 12: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/12.jpg)
Is data mining following this curve?
• Yes!!!
• My personal viewpoint based on Quest/Intelligent Miner experience
![Page 13: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/13.jpg)
Quest
• Started as a skunkworks project in the early nineties
• Inspired by needs articulated by industry visionaries:
– Transaction data collected over a long period
– Current tools/SQL don’t cut it
– About ready to throw data away
![Page 14: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/14.jpg)
Approach
• Examine “real” applications
• Identify operations that cut across applications
• Design fast, scalable algorithms for each operation
• Develop applications by composing operations
![Page 15: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/15.jpg)
Operations
New operations (completeness, scalability):
• Associations
• Sequential Patterns
• Similar time series
Adopted from Statistics/Learning (scalability):
• Classification
• Clustering
• Deviations
http://www.almaden.ibm.com/cs/quest
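To make the "associations" operation and the "completeness" property concrete, here is a minimal Python sketch. It is not the Quest implementation: it is a brute-force levelwise search over a toy data set, and the function names and the thresholds `min_support` and `min_confidence` are assumptions chosen for illustration. "Complete" here means that every rule meeting both thresholds is returned, not just a heuristic sample.

```python
from itertools import combinations


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset whose support >= min_support."""
    n = len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]
    frequent = {}
    size = 1
    while candidates:
        level = {}
        for candidate in candidates:
            support = sum(1 for t in transactions if candidate <= t) / n
            if support >= min_support:
                level[candidate] = support
        frequent.update(level)
        # Every frequent set of size k+1 is the union of two frequent sets of size k.
        size += 1
        candidates = list({a | b for a in level for b in level if len(a | b) == size})
    return frequent


def association_rules(frequent, min_confidence):
    """Return (lhs, rhs, support, confidence) for every rule meeting min_confidence."""
    rules = []
    for itemset, support in frequent.items():
        for k in range(1, len(itemset)):
            for lhs in combinations(sorted(itemset), k):
                lhs = frozenset(lhs)
                # Subsets of a frequent itemset are frequent, so the lookup succeeds.
                confidence = support / frequent[lhs]
                if confidence >= min_confidence:
                    rules.append((lhs, itemset - lhs, support, confidence))
    return rules


if __name__ == "__main__":
    baskets = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
    for rule in association_rules(frequent_itemsets(baskets, 0.5), 0.9):
        print(rule)
```

The brute-force support counting rescans all transactions for each candidate, which is exactly the cost that the "fast, scalable algorithms" bullet on the Approach slide is about reducing.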
![Page 16: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/16.jpg)
Bringing Quest to market
• Visionaries who inspired Quest did not become first customers:
– Wanted evidence that the technology “worked”
• Frustrating attempts to interest major IBM customers:
– Integration with existing applications
– Too-far-out technology
– Resistance from in-house analytic groups
![Page 17: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/17.jpg)
First hits
• Small information-based companies who provided data in exchange for free results
• CIO who wanted to be seen as the technology pioneer in his industry
• CIO who wanted the success story to feature in the company’s annual report
Led to the formation of a group offering services using Quest
![Page 18: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/18.jpg)
Characteristics of engagements
• Mostly associations and sequential patterns
• Completeness a big plus
• Unanticipated uses
• Feedback for further development
![Page 19: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/19.jpg)
Into the product land
• Formation of a small “out-of-plan” product group to productize Quest
• Facilitated by a closet mathematician
• Successes of the services group used for market validation
• Continued development and infusion of technology
![Page 20: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/20.jpg)
Intelligent Miner
• Serious product
• Integrates technologies from various groups
• Fast, scalable, runs on multiple platforms
• Several “early market” success stories
http://www.software.ibm.com/data/iminer/
![Page 21: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/21.jpg)
Are we in the chasm?
• Perceived to be sophisticated technology, usable only by specialists
• Long, expensive projects
• Stand-alone, loosely-coupled with data infrastructures
• Difficult to infuse into existing mission-critical applications
![Page 22: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/22.jpg)
Chasm Crossing
• Personal speculations on some technical challenges
• Do not imply IBM research/product directions
![Page 23: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/23.jpg)
XML-based Data Mining Standard (1)
• Model Building:
– A pair of standard DTDs for each operation
– Interchangeable library of operator implementations
[Diagram: Parameters and Data Specs (standard DTD) → Operator (from Library) → Model (standard DTD)]
Ack: Mattos, Pirahesh, Schwenkreis
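As a rough illustration of the model-building idea, the Python sketch below serializes an association-rule model to XML and reads it back. The slides do not define the DTDs, so every element and attribute name here (`MiningModel`, `Parameters`, `Rule`, `If`, `Then`) is a hypothetical placeholder, and no DTD validation is performed.

```python
import xml.etree.ElementTree as ET


def model_to_xml(operation, parameters, rules):
    """Serialize a rule model; tag names are illustrative, not a real standard."""
    root = ET.Element("MiningModel", {"operation": operation})
    params = ET.SubElement(root, "Parameters")
    for name, value in parameters.items():
        ET.SubElement(params, "Parameter", {"name": name, "value": str(value)})
    for lhs, rhs, confidence in rules:
        rule = ET.SubElement(root, "Rule", {"confidence": str(confidence)})
        for item in lhs:
            ET.SubElement(rule, "If").text = item
        for item in rhs:
            ET.SubElement(rule, "Then").text = item
    return ET.tostring(root, encoding="unicode")


def model_from_xml(text):
    """Parse the XML back into (operation, rules)."""
    root = ET.fromstring(text)
    rules = []
    for rule in root.findall("Rule"):
        lhs = [element.text for element in rule.findall("If")]
        rhs = [element.text for element in rule.findall("Then")]
        rules.append((lhs, rhs, float(rule.get("confidence"))))
    return root.get("operation"), rules


if __name__ == "__main__":
    xml_text = model_to_xml("associations", {"min_support": 0.5}, [(["c"], ["a"], 1.0)])
    print(xml_text)
    print(model_from_xml(xml_text))
```

Because the model is plain XML, any operator implementation that emits the same structure could be swapped in, which is the point of the "interchangeable library" bullet.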
![Page 24: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/24.jpg)
XML-based Data Mining Standard (2)
• Model Deployment:
– Mapping XML object provides mapping between names and format in the model object and the data record
– Model could have been developed on a different system
[Diagram: Model, Mapping, and Data Record (standard DTDs) → Application (using Library) → Result (standard DTD)]
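The deployment step can be sketched as follows, assuming a mapping document that pairs each field name used by the model with the field name and type found in the application's data record. The `Mapping`/`Field` vocabulary, the field names, and the stand-in scoring rule are all invented for illustration; they are not taken from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping document: model field name -> record field name and type.
MAPPING_XML = """
<Mapping>
  <Field model="age" record="cust_age" type="int"/>
  <Field model="income" record="annual_income_usd" type="float"/>
</Mapping>
"""

CASTS = {"int": int, "float": float, "str": str}


def load_mapping(text):
    """Return {model_name: (record_name, cast_function)}."""
    return {
        field.get("model"): (field.get("record"), CASTS[field.get("type")])
        for field in ET.fromstring(text).findall("Field")
    }


def to_model_input(record, mapping):
    """Rename and convert record fields into the names and formats the model expects."""
    return {
        model_name: cast(record[record_name])
        for model_name, (record_name, cast) in mapping.items()
    }


def score(model_input):
    """Stand-in for a model that was built elsewhere (hypothetical threshold rule)."""
    if model_input["age"] >= 40 and model_input["income"] >= 50000.0:
        return "high"
    return "low"


if __name__ == "__main__":
    record = {"cust_id": "A17", "cust_age": "45", "annual_income_usd": "72000.50"}
    print(score(to_model_input(record, load_mapping(MAPPING_XML))))
```

Keeping the mapping separate from the model is what would let a model developed on one system be applied to records with a different naming scheme on another.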
![Page 25: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/25.jpg)
Implications
• Standard interfaces for application developers to incorporate data mining
• Coupling with relational databases
– mappings from DTDs to relational schemas
– implementation using existing infrastructure
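One way to picture a mapping from a model structure to a relational schema is the Python sketch below, which stores association rules in two tables and retrieves them with ordinary SQL. The table and column names are assumptions, and SQLite is used only because it ships with Python; the talk does not prescribe a schema or a database.

```python
import sqlite3


def store_rules(conn, rules):
    """Persist (lhs, rhs, support, confidence) rules in a hypothetical two-table schema."""
    conn.execute(
        "CREATE TABLE rule (rule_id INTEGER PRIMARY KEY, support REAL, confidence REAL)"
    )
    conn.execute(
        "CREATE TABLE rule_item ("
        "rule_id INTEGER REFERENCES rule(rule_id), "
        "side TEXT CHECK (side IN ('if', 'then')), "
        "item TEXT)"
    )
    for rule_id, (lhs, rhs, support, confidence) in enumerate(rules, start=1):
        conn.execute("INSERT INTO rule VALUES (?, ?, ?)", (rule_id, support, confidence))
        rows = [(rule_id, "if", item) for item in lhs]
        rows += [(rule_id, "then", item) for item in rhs]
        conn.executemany("INSERT INTO rule_item VALUES (?, ?, ?)", rows)
    conn.commit()


def rules_predicting(conn, item, min_confidence):
    """Return (rule_id, confidence) for rules whose consequent contains the item."""
    return conn.execute(
        "SELECT r.rule_id, r.confidence "
        "FROM rule r JOIN rule_item ri ON ri.rule_id = r.rule_id "
        "WHERE ri.side = 'then' AND ri.item = ? AND r.confidence >= ? "
        "ORDER BY r.confidence DESC, r.rule_id",
        (item, min_confidence),
    ).fetchall()


if __name__ == "__main__":
    connection = sqlite3.connect(":memory:")
    store_rules(connection, [(["c"], ["a"], 0.5, 1.0), (["b"], ["a"], 0.5, 0.67)])
    print(rules_predicting(connection, "a", 0.6))
```

Once the model lives in tables, an existing application can consult it with a join, without linking against the mining library.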
![Page 26: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/26.jpg)
Data Mining Benchmarks
• UC Irvine repository
• Generating synthetic benchmarks modeled after real data sets is a hard problem
– How to map names into meaningful literals
– How to preserve empirical distributions
Ack: Srikant, Ullman
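A naive baseline shows why preserving empirical distributions is hard. The Python sketch below samples each column independently from its observed value frequencies; the function names and the toy rows are assumptions. It reproduces the per-column (marginal) distributions approximately, but it discards correlations between columns, which are often exactly what a mining benchmark needs to keep.

```python
import random
from collections import Counter


def fit_marginals(rows):
    """Count how often each value occurs in each column of the real data."""
    return {column: Counter(row[column] for row in rows) for column in rows[0]}


def synthesize(marginals, n, seed=0):
    """Draw n synthetic rows, sampling every column independently of the others."""
    rng = random.Random(seed)
    columns = {}
    for column, counts in marginals.items():
        values = list(counts)
        weights = [counts[value] for value in values]
        columns[column] = rng.choices(values, weights=weights, k=n)
    return [{column: columns[column][i] for column in columns} for i in range(n)]


if __name__ == "__main__":
    real = [
        {"region": "west", "tier": "gold"},
        {"region": "west", "tier": "basic"},
        {"region": "east", "tier": "basic"},
    ]
    for row in synthesize(fit_marginals(real), 5):
        print(row)
```

In the toy data, "gold" only ever occurs with "west", yet the synthetic rows can pair "gold" with "east"; a benchmark generator that matters for associations would have to model such joint structure.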
![Page 27: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/27.jpg)
Auto-focus data mining
• Automatic parameter tuning
• Automatic algorithm selection (à la join method selection in database query optimization)
Ack: Andreas Arning
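The query-optimizer analogy can be sketched as cost-based selection: estimate a cost for each candidate algorithm from simple statistics about the data, then pick the cheapest. Everything in this Python sketch is hypothetical, including the two strategy names, the statistics, and the cost formulas with their constants; a real system would need calibrated cost models.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataStats:
    rows: int
    avg_items_per_row: float
    rows_fitting_in_memory: int


def cost_in_memory(stats):
    """Hypothetical cost: one scan, but infeasible if the data exceeds memory."""
    if stats.rows > stats.rows_fitting_in_memory:
        return float("inf")
    return stats.rows * stats.avg_items_per_row


def cost_multi_pass_on_disk(stats, passes=4, io_penalty=10.0):
    """Hypothetical cost: several scans, each slowed by an assumed I/O penalty."""
    return passes * io_penalty * stats.rows * stats.avg_items_per_row


COST_MODELS = {
    "in_memory": cost_in_memory,
    "multi_pass_on_disk": cost_multi_pass_on_disk,
}


def select_algorithm(stats):
    """Pick the strategy with the lowest estimated cost, as a join planner would."""
    return min(COST_MODELS, key=lambda name: COST_MODELS[name](stats))


if __name__ == "__main__":
    print(select_algorithm(DataStats(1_000, 5.0, 10_000)))
    print(select_algorithm(DataStats(1_000_000, 5.0, 10_000)))
```

The user never names an algorithm; the choice follows from the data, which is the sense in which the tool "auto-focuses".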
![Page 28: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/28.jpg)
Web: Greatest opportunity
• Huge collection of data (e.g. Yahoo collecting ~50GB every day)
• Universal digital distribution medium makes data mining results actionable in fundamentally new ways
• But watch for privacy pitfall
![Page 29: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/29.jpg)
Privacy-preserving data mining
• Technical vs. legislated solutions
• Implication for data mining algorithms when some fields of a data record have been fudged according to the user’s privacy sensitivity
Ack: R. Srikant
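To see how an algorithm might still work on fudged fields, consider randomized response, a classical survey technique used here as a stand-in; the talk does not say which perturbation scheme is intended. Each user reports a sensitive yes/no value truthfully with probability `keep_prob` and otherwise reports a fair coin flip. No single report can be trusted, yet the population rate can be estimated by inverting the known noise. Function and parameter names in this Python sketch are assumptions.

```python
import random


def randomize(values, keep_prob, rng):
    """Report each boolean truthfully with probability keep_prob, else a fair coin flip."""
    return [
        value if rng.random() < keep_prob else (rng.random() < 0.5)
        for value in values
    ]


def estimate_true_rate(reported, keep_prob):
    """Invert the noise: observed = keep_prob * true + (1 - keep_prob) * 0.5."""
    observed = sum(reported) / len(reported)
    return (observed - (1 - keep_prob) * 0.5) / keep_prob


if __name__ == "__main__":
    rng = random.Random(42)
    truth = [True] * 3000 + [False] * 7000  # true rate is 0.3
    reported = randomize(truth, keep_prob=0.6, rng=rng)
    print(round(estimate_true_rate(reported, keep_prob=0.6), 3))
```

A lower `keep_prob` gives each user more deniability but widens the error of the estimate, so the user's privacy sensitivity trades directly against model accuracy.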
![Page 30: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/30.jpg)
Personalization
• Internet might provide for the first time tools necessary for users to capture information about themselves and to selectively release this information†
• Will we be providing these tools?
† John Hagel, Marc Singer. Net Worth. Harvard Business School Press.
![Page 31: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/31.jpg)
What about Association Rules?
• Very long patterns
• Separating wheat from chaff
• Principled introduction of domain knowledge
![Page 32: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/32.jpg)
What else?
• Formal foundations of data mining
![Page 33: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/33.jpg)
Summary
• Closely couple data mining with database systems
• Embed data mining into applications
• Focus on web
• Standard interfaces
• Benchmarks
• Auto focussing
• Personalization
• Privacy
![Page 34: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/34.jpg)
Concluding remarks
• Data mining, a great technology
– Combination of intriguing theoretical questions with large commercial interest in the technology
• Poised for transitioning into mainstream technology
• Will we rise to the challenge as a community?
![Page 35: Data Mining: Crossing the Chasm](https://reader035.fdocuments.us/reader035/viewer/2022062810/56815c12550346895dc9edc1/html5/thumbnails/35.jpg)
Acknowledgments
Arning Arnold Bayardo Baur Bollinger Brodbeck
Baune Carey Chandra Cody Faloutsos Gardner
Gehrke Ghosh Greissl Gruhl Grove Gunopulos
Gupta Haas Ho Imielinski Iyer Lent
Leyman Lin Lingenfelder Mason McPherson Megiddo
Mehta Miranda Psaila Raghavan Rissanen Sawhney
Sarawagi Schwenkreis Schkolnick Shafer Shim Somani
Srikant Staub Swami Traiger Vu Zait