Idea Engineering
Oct’13
0. algorithm mining
1. landscape mining
2. decision mining
3. discussion mining
Timeline: yesterday, today, tomorrow, future
The Premises of PROMISE (2005)
– Wanted: predictions
  • Nope. Users want decisions, or engagement
– Data mining will reveal “the truth” about SE
  • [Dejaeger: TSE’11], [Hall: TSE’12], [Shepperd: COW’13]
  • Not (better learners = better conclusions)
– Sooner or later: enough data for general conclusions
  • Found more differences than generalities
  • Special issues: [IST’13], [ESEj’13]
  • Best papers, ASE’11, MSR’12
  • Menzies, Zimmermann et al. [TSE’13]
  • Lots of local models
Landscape mining: look before you leap
• Report what is true about the data
  – Not trivia on how algorithms walk that data
• Map the landscape
  – Reason on each part of the map
• E.g. landscape mining
  – Unsupervised iterative dichotomization
  – Cluster, prune
  – Then generate rules
• Different to “leap before you look”
  – i.e. skew learning by class variable
  – then study the results
• E.g. C4.5, CART, Fayyad-Irani, etc.
  – Supervised iterative dichotomization
• E.g. 61% of 300+ effort estimation papers
  – Algorithm tinkering, without end
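As a rough illustration of "unsupervised iterative dichotomization", the sketch below splits rows in two, recursively, without ever looking at a class variable. The splitting rule used here (cut at the median of the highest-variance attribute) and the names `dichotomize` and `min_size` are assumptions made for this example; the slides do not say which rule is used.

```python
from statistics import median, pvariance

def dichotomize(rows, min_size=4):
    """Recursively split rows (lists of numbers) in two, ignoring any class variable.

    Hypothetical splitting rule: cut at the median of the attribute with
    the largest variance. Returns a list of leaf clusters.
    """
    if len(rows) < 2 * min_size:
        return [rows]
    n_cols = len(rows[0])
    # Pick the attribute that spreads the data the most.
    col = max(range(n_cols), key=lambda c: pvariance([r[c] for r in rows]))
    cut = median(r[col] for r in rows)
    left = [r for r in rows if r[col] < cut]
    right = [r for r in rows if r[col] >= cut]
    if not left or not right:  # no usable split on this attribute: stop
        return [rows]
    return dichotomize(left, min_size) + dichotomize(right, min_size)

# Toy data: 16 rows, x in 0..7, y in {0, 1}.
rows = [[x, y] for x in range(8) for y in (0, 1)]
leaves = dichotomize(rows, min_size=4)
```

Pruning and rule generation would then operate on the returned leaves; neither step is shown here.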
Find landscape = cluster data, assign “heights”
Find decisions = report deltas between highs and lows
Monitor discussions = watch and help communities explore deltas
IDEA Engineering = <landscape, decisions, discussion>
Spectral Landscape Mining
• Spectrum = condition that is not limited to a specific set of values but varies in a continuum.
• Groups together a broad range of conditions or behaviors under one single title
• In mathematics, the spectrum of a (finite-dimensional) matrix is the set of its eigenvalues.
• Nyström algorithms: approximations to eigenvalues
  – FASTMAP: linear time
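To make the "linear time" claim concrete, here is a minimal sketch of one FastMap projection step, assuming Euclidean distance over numeric rows. It picks two distant "pole" rows and places every other row on the line between them using the cosine rule. The function names (`dist`, `farthest`, `fastmap_1d`) are illustrative and not taken from the slides.

```python
import math
import random

def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def farthest(rows, start):
    """Return the row farthest from `start` (one pass over the data)."""
    return max(rows, key=lambda r: dist(r, start))

def fastmap_1d(rows, seed=1):
    """Project rows onto the line between two distant poles.

    Uses three passes over the data, so the number of distance
    computations grows linearly with the number of rows.
    """
    rng = random.Random(seed)
    anyone = rng.choice(rows)
    east = farthest(rows, anyone)   # first pole
    west = farthest(rows, east)     # second pole
    c = dist(east, west)
    if c == 0:                      # all rows identical
        return [0.0] * len(rows), east, west
    xs = []
    for r in rows:
        a, b = dist(r, east), dist(r, west)
        # Cosine rule: distance of r from `east` along the east-west line.
        xs.append((a * a + c * c - b * b) / (2 * c))
    return xs, east, west

rows = [[0, 0], [1, 0], [2, 0], [3, 0], [4, 0]]
xs, east, west = fastmap_1d(rows)
```

Repeating the step on the residual distances would give a second dimension; that part is omitted to keep the sketch short.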
Project data on first 2 PCA components; grid that data
e.g. Nasa93dem
1) project 23 dimensions into 2
2a) cluster
2b) replace clusters with centroids
MOEA: score = effort + defects + months
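Steps 2a and 2b can be sketched as follows, assuming the rows have already been projected to two dimensions (the projection itself is not shown). The equal-width grid and the names `grid_centroids` and `bins` are assumptions for illustration only.

```python
from collections import defaultdict

def grid_centroids(points, bins=4):
    """Bucket 2-D points into a bins x bins grid; return one centroid per non-empty cell.

    points: list of (x, y) pairs, assumed to be already projected to 2 dimensions.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x_lo, y_lo = min(xs), min(ys)
    x_span = (max(xs) - x_lo) or 1.0   # avoid division by zero
    y_span = (max(ys) - y_lo) or 1.0
    cells = defaultdict(list)
    for x, y in points:
        # Clamp so the maximum value falls in the last cell.
        i = min(int((x - x_lo) / x_span * bins), bins - 1)
        j = min(int((y - y_lo) / y_span * bins), bins - 1)
        cells[(i, j)].append((x, y))
    return {
        cell: (sum(p[0] for p in members) / len(members),
               sum(p[1] for p in members) / len(members))
        for cell, members in cells.items()
    }

points = [(0.0, 0.0), (1.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centroids = grid_centroids(points, bins=2)
```

Each centroid then stands in for all the rows in its cell, which is where the compression comes from.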
Sanity check: what information loss?
• E.g. POI-3
  – 400+ examples
  – 20 centroids
• Prediction via:
  – Extrapolation between two nearest centroids
• Works as well as
  – Random forest, Naïve Bayes
    • For defect prediction (10 data sets)
  – Linear regression, M5’
    • For effort estimation (10 data sets)
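One plausible reading of "prediction via the two nearest centroids" is sketched below. It assumes each centroid carries a numeric score and blends the two nearest scores by inverse distance; the exact scheme behind the slide may differ, and `predict` is a hypothetical name.

```python
import math

def dist(a, b):
    """Euclidean distance between two numeric rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def predict(centroids, row):
    """Estimate a score for `row` from its two nearest centroids.

    centroids: list of (features, score) pairs; needs at least two entries.
    Assumption: the nearer centroid gets the larger weight.
    """
    ranked = sorted(centroids, key=lambda c: dist(c[0], row))
    (f1, s1), (f2, s2) = ranked[0], ranked[1]
    d1, d2 = dist(f1, row), dist(f2, row)
    if d1 + d2 == 0:               # row coincides with both centroids
        return (s1 + s2) / 2
    return (s1 * d2 + s2 * d1) / (d1 + d2)

centroids = [([0.0, 0.0], 10.0), ([4.0, 0.0], 30.0), ([20.0, 20.0], 99.0)]
estimate = predict(centroids, [1.0, 0.0])
```

Only the centroids are consulted at prediction time, so the cost per query depends on the number of centroids (20 in the POI-3 example), not on the number of original examples.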
Planning = Inter-cluster contrast sets
• Find delta between neighbors that go from worse to better
• Very small rules, found in log-linear time
• Menzies et al. [TSE’13]
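A contrast set between neighboring clusters can be sketched as a dictionary diff. The example assumes each cluster is summarized by a centroid of discrete attribute values plus a score where lower is better; the attribute names and the functions `contrast` and `plans` are made up for illustration.

```python
def contrast(worse, better):
    """Return {attribute: (old, new)} for attributes whose values differ."""
    return {k: (worse[k], better[k])
            for k in worse
            if k in better and worse[k] != better[k]}

def plans(clusters, neighbors):
    """For each cluster, find the best-scoring better neighbor and the delta to reach it.

    clusters: {name: (features dict, score)}, lower score assumed better.
    neighbors: {name: [adjacent cluster names]}.
    """
    out = {}
    for name, (features, score) in clusters.items():
        better = [n for n in neighbors.get(name, []) if clusters[n][1] < score]
        if better:
            target = min(better, key=lambda n: clusters[n][1])
            out[name] = (target, contrast(features, clusters[target][0]))
    return out

# Hypothetical centroids; the attributes are illustrative only.
clusters = {
    "A": ({"team": "large", "reuse": "low", "tools": "basic"}, 90),
    "B": ({"team": "large", "reuse": "high", "tools": "basic"}, 40),
    "C": ({"team": "small", "reuse": "high", "tools": "basic"}, 25),
}
neighbors = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
result = plans(clusters, neighbors)
```

Because only attributes that differ between adjacent centroids are reported, the resulting rules stay small: here each plan changes a single attribute.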
Applications
• Prediction
• Planning
• Monitoring
• Multi-objective optimization
  – Cluster first on N objectives
• Anomaly detection
• Incremental theory revision
• Compression
• Privacy
• etc.
Idea Engineering
0. algorithm mining
1. landscape mining
2. decision mining
3. discussion mining
Timeline: yesterday, today, tomorrow, future
Beyond Data Mining, T. Menzies, IEEE Software, 2013, to appear
Q: why call it mining?
• A1: because all the primitives for the above are in the data mining literature
  – So we know how to get from here to there
• A2: because data mining scales