Lies, Damned Lies and Software Analytics: Why Big Data Needs Thick Data Margaret-Anne (Peggy) Storey University of Victoria @margaretstorey ACM SIGSOFT Webinar, May 4th 2016

Page 1: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Lies, Damned Lies and Software Analytics: Why Big Data Needs Thick Data

Margaret-Anne (Peggy) Storey

University of Victoria

@margaretstorey

ACM SIGSOFT Webinar, May 4th 2016

Page 2: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

My research…

Human and social aspects in software engineering:

Software visualization

The social programmer and a participatory culture in software engineering

Qualitative research and mixed methods in software engineering

Page 3: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Acknowledgements:

Alexey Zagalsky, Daniel German, Matthieu Foucault (UVic)

Jacek Czerwonka, Brendan Murphy (Microsoft Research)

http://www.slideshare.net/mastorey/lies-damned-lies-and-software-analytics-why-big-data-needs-rich-data

Page 4: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Dashboards for developer awareness:

Treude and Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” ICSE 2010.

Page 5: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

[Timeline figure: 1968, 1970, 1980, 1990, 2000, 2010]

Developer tools…

Page 6: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

How developers stay up to date using Twitter

How developers assess each other based on their development and networking activity

How a crowd of developers documents open source APIs through Stack Overflow

How developers share tacit knowledge on

How developers coordinate which code is committed and accepted through GitHub

Page 7: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

[Figure: timeline of developer communication channels and tools, 1968 to 2010, labelled Nondigital, Digital, and Digital & Socially Enabled. Items shown: Telephone, Face2Face, Project Workbook, Documents, Punchcards, Books, Conferences, Societies, Email, Email Lists, Usenet, IRC, ICQ, Skype, VisualAge, Visual Studio, NetBeans, Eclipse, TFS, Jazz, SourceForge, Wikis, Google Groups, Podcasts, Blogs, Trello, Basecamp, Slack, Google Hangouts, Stack Overflow, Twitter, GitHub, LinkedIn, Facebook, Slashdot, Hacker News, Masterbranch, Coderwall, Meetups, Yammer]

Page 8: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

[Figure: the same timeline of channels and tools as on the previous page, 1968 to 2010, labelled Nondigital, Digital, and Digital & Socially Enabled]

Surveyed over 2,500 devs

Page 9: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Ecosystem of tools and activities

Page 10: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Learning: code hosting, Q&A sites, web search

Ecosystem of tools and activities

Page 11: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Coordination: code hosting, coordination tools, private chat, private discuss...

Ecosystem of tools and activities

Page 12: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Connecting: face to face, microblogging, private discuss..., code hosting

Ecosystem of tools and activities

Page 13: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Social tools facilitate a participatory development culture in software engineering, with support for the social creation and sharing of content, informal mentorship, and awareness that contributions matter to one another

Storey, M.-A., L. Singer, F. Figueira Filho, B. Cleary and A. Zagalsky, The (R)evolutionary Role of Social Media in Software Engineering, ICSE 2014 Future of Software Engineering.

Page 14: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

How to study a participatory culture?

Page 15: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

(Competing) concerns in software engineering…

Code: faster, cheaper, more features, more reliable/secure

Developers: more productive, more skilled, happier, better connected

Organizations/communities: attract/retain contributors, encourage a participatory culture, increase value

Page 16: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

https://www.flickr.com/photos/opensourceway/5755219017

Do the answers lie in here?

Page 17: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

“The machine does not isolate us from the great problems of nature but plunges us more deeply into them.”

Antoine de Saint Exupéry

Page 18: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Thick data…

Page 19: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Talk outline…

History of software analytics in software engineering

Risks of software analytics

Why big data needs thick data

Consider both researchers and practitioners….

Page 21: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Role of data science in software engineering

Metrics (late 1960’s)

Mining software repositories (mid 2000’s)

Software analytics (early 2010’s)

Page 23: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

The dawn of software metrics

“The realization came over me with full force that a good part of the remainder of my life was going to be spent in finding errors in my own programs.” Maurice Wilkes, 1949

“If you can't measure it, you can't manage it” Tom DeMarco, 1982

Page 24: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Why use metrics?

To discover facts about the world

To steer our actions

To modify human behaviour

[DeMarco]

Used by individuals, teams, companies, external organizations…

Page 25: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Software metrics

Product: KLOC, complexity measures (cyclomatic complexity, function points), OO metrics, #defects

Process metrics: Testing, code review, deployment, agile practices (e.g., #sprints, burndown rate)

Productivity: KLOC, Mean time to repair, #commits

Developer metrics: Skills, followers, biometrics

Estimation: cost metrics and models
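To make two of the product metrics above concrete, here is a minimal sketch that counts lines of code and approximates cyclomatic complexity for a Python snippet. It is an illustration only and not a tool from the talk: the function names, the `SAMPLE` snippet, and the choice of which syntax nodes count as branches are all assumptions, and the complexity figure is a simplification of McCabe's definition.

```python
import ast

# Assumption: these node types each add one decision point.
# A whole boolean expression (a and b and c) is counted once,
# which is a simplification of McCabe's original definition.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler, ast.BoolOp, ast.IfExp)


def lines_of_code(source: str) -> int:
    """Count lines that are neither blank nor pure comments."""
    return sum(
        1
        for line in source.splitlines()
        if line.strip() and not line.strip().startswith("#")
    )


def cyclomatic_complexity(source: str) -> int:
    """Approximate complexity as 1 + number of branching nodes."""
    tree = ast.parse(source)
    return 1 + sum(isinstance(node, BRANCH_NODES) for node in ast.walk(tree))


# Hypothetical snippet used only to exercise the two functions.
SAMPLE = '''
def classify(n):
    # toy function used only for illustration
    if n < 0:
        return "negative"
    for d in range(2, n):
        if n % d == 0:
            return "composite"
    return "other"
'''

print(lines_of_code(SAMPLE), cyclomatic_complexity(SAMPLE))
```

Both numbers are cheap to compute, which is part of their appeal; neither says anything about why the code looks the way it does.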

Page 26: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Research success?

Page 27: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Success in industry?

• Adoption at large, small companies (e.g., HP)

• Integrated in CASE tools

• Initial focus on product rather than process

• Initial poor use of metrics led to the Goal Question Metric Approach [Basili et al.]

Page 28: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Lines of Code

• Easy to calculate, to understand, to visualize

• Descriptive of the product, and developer productivity

• Correlates with complexity measures and # of bugs

Page 29: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

“Measuring programming progress by lines of code is like measuring aircraft building progress by weight.”

Page 30: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Role of data science in software engineering

Metrics (late 1960’s)

Mining software repositories (mid 2000’s)

Software analytics (early 2010’s)

Page 31: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Mining software repositories

“We have all this data, the problem is what to do with it.” [A Software Engineering Researcher]

Mining Software Repositories (MSR) conference series established in 2004

“Outcroppings of past human behaviour.” [McGrath]

Page 32: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Data, data, everywhere…

Program data: runtime traces, program logs, system events, failure logs, performance logs, continuous deployment,…

User data: usage logs, user surveys, user forums, A/B testing, Twitter, blogs, …

Development data: source code versions, bug data, check-in information, test cases and results, communication between developers, social media

Page 33: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Techniques

Association rules and frequency patterns

Classification

Clustering

Text mining/natural language processing

Searching and mining

Qualitative analysis

See papers from the Mining Software Repositories Conference!
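As one illustration of the first technique listed above, the sketch below mines simple co-change association rules ("when file A changes, file B usually changes too") from a list of commits. It is a toy under stated assumptions: the function name, thresholds, and `COMMITS` data are hypothetical, and real studies typically use dedicated frequent-itemset algorithms over far larger histories.

```python
from collections import Counter
from itertools import combinations


def co_change_rules(commits, min_support=2, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) tuples.

    support    = number of commits touching both files
    confidence = support / number of commits touching the antecedent
    """
    file_counts = Counter()
    pair_counts = Counter()
    for files in commits:
        unique = sorted(set(files))
        file_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), support in pair_counts.items():
        if support < min_support:
            continue
        # Each pair can yield a rule in either direction.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = support / file_counts[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    # Strongest rules first; ties broken alphabetically.
    return sorted(rules, key=lambda r: (-r[3], r[0], r[1]))


# Hypothetical commit history: each entry lists the files in one commit.
COMMITS = [
    ["parser.py", "lexer.py"],
    ["parser.py", "lexer.py", "README.md"],
    ["parser.py"],
    ["README.md"],
]

RULES = co_change_rules(COMMITS)
for lhs, rhs, support, confidence in RULES:
    print(f"{lhs} -> {rhs}: support={support}, confidence={confidence:.2f}")
```

A rule like this describes what happened in the repository; it does not explain why the two files are coupled.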

Page 34: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Benefits of mining trace data

Low interference

Low reactivity

Records made by the participants

Data is easy to collect

Page 35: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

“Only metric worth counting is defects” [DeMarco, 1997]

Why mine and measure information about bugs? Personal discovery, evaluation by managers, understand product status, predict reliability

Page 36: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Bug prediction

• Models to predict bugs show promise (ownership, churn, tangled code changes)

• Poor replication across organizations!

• Poor actionability (practitioners know which modules are buggy!)

• The secret life of bugs [Aranda et al.]
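Two of the signals named above, churn and ownership, can be sketched as simple aggregates over change records. This is not a prediction model and not the method of any cited study: the record layout, the function name, and the `CHANGES` data are assumptions made for illustration, with ownership reduced to a count of distinct authors per file.

```python
from collections import defaultdict


def churn_and_ownership(changes):
    """Aggregate per-file churn and number of distinct authors.

    `changes` is an iterable of (path, author, lines_added, lines_deleted).
    Returns (path, churn, author_count) rows, highest churn first.
    """
    churn = defaultdict(int)
    authors = defaultdict(set)
    for path, author, added, deleted in changes:
        churn[path] += added + deleted
        authors[path].add(author)
    rows = [(path, churn[path], len(authors[path])) for path in churn]
    return sorted(rows, key=lambda row: (-row[1], row[0]))


# Hypothetical change records, e.g. as parsed from a version control log.
CHANGES = [
    ("core/scheduler.py", "ana", 120, 40),
    ("core/scheduler.py", "ben", 15, 30),
    ("docs/intro.md", "ana", 5, 0),
    ("core/queue.py", "cy", 60, 10),
]

RANKED = churn_and_ownership(CHANGES)
for path, churn, author_count in RANKED:
    print(f"{path}: churn={churn}, authors={author_count}")
```

A ranking like this often tells practitioners what they already know, which is the actionability concern raised above.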

Page 37: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Role of data science in software engineering

Metrics (late 1960’s)

Mining software repositories (mid 2000’s)

Software analytics (early 2010’s)

Page 38: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...
Page 39: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...
Page 40: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Data science movement…

http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science

Page 41: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Goals of software analytics

Improve:

quality of the software

experience of the users

developer productivity

Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf

Page 42: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Data Science Spectrum

                Past        Present     Future
Explore         trends      alerts      forecasting
Analyze         summarize   compare     what-if
Experiment      model       benchmark   simulate

The Art and Science of Analyzing Software Data, by Bird, Menzies, Zimmermann, Elsevier 2015.

Page 43: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Software Analytics and its role in Automation

• Scaling to 1000’s of developers: automation is required! [Jacek Czerwonka]

• Goal is to optimize competing concerns of quality, time, resources

• Data Scientists manage and measure impacts of automation and software analytics [Kim et al., 2016]

Page 44: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...
Page 45: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Does increasing test code coverage increase reliability?

Page 46: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

No!

Wasting time testing simple code may increase the presence of bugs! [Mockus et al.]

Does increasing test code coverage increase reliability?

Page 47: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Role of data science in software engineering

Metrics (late 1960’s)

Mining software repositories (mid 2000’s)

Software analytics (early 2010’s)

Page 48: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Talk outline…

History of software analytics in software engineering

Risks of software analytics

Why big data needs thick data

Consider both researchers and practitioners….

Page 49: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Five Risks

1) Data and construct trustworthiness

2) Reliability of the results

3) Ethical concerns

4) Unintended and unexpected consequences

5) Big data can’t answer big questions

Page 50: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #1: Trustworthiness of the data

Data representativeness (construct validity)

Data completeness

Inaccuracies in profiles, exaggerations, skewed opinions

Treating humans as “rational” animals [Harper et al.]

Page 51: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Perils from using GitHub data:

A repository is not necessarily a (development) project

Most projects are inactive or have few commits

Most projects are for personal use only

Only 10% of projects use pull requests

History can be rewritten on GitHub

A lot happens outside of GitHub

The Promises and Perils of Mining GitHub, Eirini Kalliamvakou et al., MSR 2014.
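Some of the perils above (inactive repositories, personal projects, forks standing in for projects) are often handled by filtering a sample before analysis. The sketch below shows one such filter. Every threshold, the `Repo` fields, and the sample data are hypothetical choices made for illustration and are not recommendations from the cited paper; the filter also does nothing about rewritten history or activity that happens outside GitHub.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Repo:
    """Hypothetical summary record for one repository."""
    name: str
    commits: int
    committers: int
    is_fork: bool
    last_commit: date


def looks_like_active_project(repo, today, min_commits=50,
                              min_committers=2, max_idle_days=180):
    """Heuristic filter; all thresholds are arbitrary assumptions."""
    idle_days = (today - repo.last_commit).days
    return (
        not repo.is_fork
        and repo.commits >= min_commits
        and repo.committers >= min_committers
        and idle_days <= max_idle_days
    )


TODAY = date(2016, 5, 4)
REPOS = [
    Repo("team-service", 900, 12, False, date(2016, 4, 20)),
    Repo("dotfiles", 30, 1, False, date(2016, 5, 1)),
    Repo("old-lib", 400, 5, False, date(2013, 1, 10)),
    Repo("fork-of-team-service", 900, 12, True, date(2016, 4, 20)),
]

KEPT = [r.name for r in REPOS if looks_like_active_project(r, TODAY)]
print(KEPT)
```

Whatever thresholds are chosen, they shape the conclusions, so they belong in the threats-to-validity discussion rather than in a footnote.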

Page 52: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #2: Trustworthiness of the results

Researcher bias [Shepperd et al., 2014]

Confusing correlations with cause and effect

Big data and small effects [Marcus et al.]

Inappropriate generalization

Conclusion instability [Menzies et al.]
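The "big data and small effects" point can be illustrated with the standard t-statistic for a Pearson correlation, t = r * sqrt((n - 2) / (1 - r^2)). In the sketch below the correlation r = 0.01 and both sample sizes are made-up numbers chosen for illustration; 1.96 is the usual two-sided 5% cutoff for large samples.

```python
import math


def correlation_t_statistic(r: float, n: int) -> float:
    """t-statistic for testing whether a Pearson correlation r differs from 0."""
    return r * math.sqrt((n - 2) / (1 - r * r))


R = 0.01            # hypothetical, practically negligible correlation
CRITICAL = 1.96     # approximate two-sided 5% cutoff for large n

T_SMALL = correlation_t_statistic(R, 100)
T_BIG = correlation_t_statistic(R, 1_000_000)

print(f"n=100:       t={T_SMALL:.2f}, significant={abs(T_SMALL) > CRITICAL}")
print(f"n=1,000,000: t={T_BIG:.2f}, significant={abs(T_BIG) > CRITICAL}")
print(f"variance explained: {R * R:.4%}")
```

The same tiny correlation is "statistically significant" at a million data points while explaining about 0.01% of the variance, so significance alone says little about whether an effect matters.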

Page 53: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

“all models are wrong, but some are useful” [Box, 1976]

http://www.dataists.com/2010/09/a-taxonomy-of-data-science/

Page 54: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #3: Ethical concerns

Private, public, blurred spaces

Surveillance at the level of the individual

Opaque algorithms, opaque biases [Tufekci, CSCW Keynote, 2015]

Page 55: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

http://www.informationweek.com/big-data/big-data-analytics/data-scientists-want-big-data-ethics-standards/d/d-id/1315798

Page 56: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #4: Unexpected consequences

Negative side effects [Gender studies]

Gaming the gamification

Incentives? Handle with care!

Page 57: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Assessing and watching developers

Singer, Filho, Cleary, Treude, Storey, Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.

Page 58: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Contributing graphs considered harmful

https://github.com/isaacs/github/issues/627

http://www.hanselman.com/blog/GitHubActivityGuiltAndTheCodersFitBit.aspx

Page 59: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Most unwise questions!

Analyze This! 145 Questions for Data Scientists in Software Engineering Andrew Begel and Thomas Zimmermann

Page 60: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #5: Big Data can’t answer Big Questions

Or

Page 62: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Risk #5: Big Data can’t answer Big Questions alone

Page 63: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Examples of big questions?

• What is a good architecture to solve problem x? [Devanbu]

• What makes a really awesome programmer? [Software managers]

• How to build a great development team? [Google]

• How is program knowledge distributed? [Naur]

• What is the ideal software engineering process? [Facebook, Microsoft, IBM,…]

• What tools/practices support a participatory development process? [Storey et al.]

Page 64: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Five Risks

1) Data and construct trustworthiness

2) Reliability of the results

3) Ethical concerns

4) Unintended and unexpected consequences

5) Big data can’t answer big questions

Page 65: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Talk outline…

History of software analytics in software engineering

Risks of software analytics

Why big data needs thick data, and why thick data needs big data!

Consider both researchers and practitioners….

Page 66: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Data scientists…

“Typically start with the data, rather than starting with the problem.” [Forbes]

“I love data” “I love patterns” [Kim et al., ICSE 2016]

http://www.forbes.com/sites/gilpress/2013/05/28/a-very-short-history-of-data-science/print/

Page 67: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

John Snow’s theory about cholera came from talking to people [1850’s]

Page 68: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Danger zones…

http://blogs.lse.ac.uk/impactofsocialsciences/2015/02/12/philosophy-of-data-science-emma-uprichard/

“Most big data is social data – the analytics need serious interrogation”

Social Science+

“It doesn’t matter how much or how good our data is if the approach to modelling social systems is backwards.”

Page 69: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

What is “thick” data?

Researcher generated “thick” data

Explanations, motivations, recommendations

Questions rather than answers

Variables for a model

Future challenges

Limitations: Self reporting, researcher bias, ambiguity in instruments and collected data

Page 70: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Beyond “Mixed Methods”: Ethnomining

Combines the ethos of ethnography interleaved with data mining techniques around behavioral/social data

Storytelling (to support the numbers)

Leverages visualization within tight loops of eliciting/reporting results

http://ethnographymatters.net/blog/2013/04/02/april-2013-ethnomining-and-the-combination-of-qualitative-quantitative-data/

Page 71: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Tagging work items in

Page 72: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

ConcernLines

Page 73: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Research challenges ahead

Big data! (of trace and thick data!)

Rapid pace of change (increased automation, participatory culture)

Studying unstable objects [Rogers]

Poor boundaries of study contexts

Page 74: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Kevin Kelly, Futurist: “You’ll be paid in the future based on how well you work with robots.”

Page 75: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Key Takeaway: Big Data needs Thick Data

Page 76: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

Future of data science in software engineering?

Metrics (late 1960’s)

Mining software repositories (mid 2000’s)

Software analytics (early 2010’s)

Big Data meets Thick Data

@margaretstorey

Page 77: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

References:

“Mad about Measurement”, DeMarco, http://ca.wiley.com/WileyCDA/WileyTitle/productCd-0818676450.html

Van Solingen, Rini, et al. "Goal question metric (GQM) approach." Encyclopedia of software engineering (2002).

The Emerging Role of Data Scientists on Software Development Teams, Miryung Kim, Thomas Zimmermann, Robert DeLine, and Andrew Begel, ICSE May 2016.

Analyze This! 145 Questions for Data Scientists in Software Engineering, Andrew Begel and Thomas Zimmermann, ICSE June 2014.

Dongmei Zhang & Tao Xie, http://research.microsoft.com/en-us/groups/sa/softwareanalyticsinpractice_minitutorial_icse2012.pdf

Rules of Data Science in SE, see www.slideshare.net/timmenzies/the-art-and-science-of-analyzing-software-data

Audris Mockus, Nachiappan Nagappan, Trung T. Dinh-Trong, Test coverage and post-verification defects: A multiple case study. ESEM 2009: 291-301

Shepperd, Martin, David Bowes, and Tracy Hall. "Researcher bias: The use of machine learning in software defect prediction." Software Engineering, IEEE Transactions on 40.6 (2014): 603-616.

Page 78: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

M. Storey, The Evolution of the Social Programmer, Mining Software Repositories (MSR) 2012 Keynote http://www.slideshare.net/mastorey/msr-2012-keynote-storey-slideshare

M. Storey et al., The (R)evolution of Social Media in Software Engineering, ICSE Future of Software Engineering 2014, http://www.slideshare.net/mastorey/icse2014-fose-social-media

H. Jenkins, K. Clinton, R. Purushotma, A. J. Robison, and M. Weigel. Confronting the challenges of participatory culture: Media education for the 21st century, 2006. http://digitallearning.macfound.org/atf/cf/%7B7E45C7E0-A3E0-4B89-AC9C-E807E1B0AE4E%7D/JENKINS_WHITE_PAPER.PDF

L. Singer, F. F. Filho, B. Cleary, C. Treude, M.-A. Storey, K. Schneider. Mutual Assessment in the Social Programmer Ecosystem: An Empirical Investigation of Developer Profile Aggregators, CSCW 2013.

Treude, C., and M.-A. Storey, “Awareness 2.0: staying aware of projects, developers and tasks using dashboards and feeds,” in ICSE’10: Proc. of the 32nd ACM/IEEE Int. Conference on Software Engineering, ACM.

C. Treude and M.-A. Storey. Work Item Tagging: Communicating Concerns in Collaborative Software Development. In IEEE Transactions on Software Engineering 38, 1 (January/February 2012). pp. 19-34

Page 79: Lies, Damned Lies and Software Analytics: Why Big Data Needs ...

[Marcus2014] Gary Marcus and Ernest Davis, "Eight (No, Nine!) Problems with Big Data", New York Times, April 6, 2014

[Harper2013] Richard Harper, Christian Bird, Thomas Zimmermann, and Brendan Murphy, "Dwelling in Software: Aspects of the felt-life of engineers in large software projects", Proceedings of the 13th European Conference on Computer Supported Cooperative Work (ECSCW '13), Springer, September 2013.

P. Naur and B. Randell. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, Germany, Oct.1968. NATO

McGrath, J. E. "Methodology matters: Doing research in the behavioral and social sciences." Readings in Human-Computer Interaction: Toward the Year 2000 (2nd ed.), 1995.

Aranda, Jorge, and Gina Venolia. "The secret life of bugs: Going past the errors and omissions in software repositories." Proceedings of the 31st International Conference on Software Engineering. IEEE Computer Society, 2009.

Ethno-Mining: Integrating Numbers and Words from the Ground Up: http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-125.pdf

How Google builds a really development team, New York Times, 2016.

[Tufekci2015] Zeynep Tufekci, "Algorithms in our Midst: Information, Power and Choice when Software is Everywhere", Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, pp.1918-1918, ACM 2015.