The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data...

12
The Comment Density of Open Source Software Code Oliver Arafat Siemens AG, Corporate Technology [email protected] Dirk Riehle SAP Research, SAP Labs LLC [email protected]http://dirkriehle.com http://twitter.com/dirkriehle Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195-198. [Link: http://dirkriehle.com/2009/02/04/the-comment-density-of-open-source-software-code/]

Transcript of The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data...

Page 1: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

The Comment Densityof Open Source Software Code

Oliver ArafatSiemens AG, Corporate Technology

[email protected]

Dirk RiehleSAP Research, SAP Labs LLC

[email protected], http://dirkriehle.comhttp://twitter.com/dirkriehle 

Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference 

on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198.[Link: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]

Page 2: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

OverviewSummary (blog post): The Sweet Spot of Code Commenting in Open Source[Link: http://dirkriehle.com/2009/02/04/the­sweet­spot­of­code­commenting­in­open­source/]

Abstract: The development processes of open source software are different from traditional closed source development processes. Still, open source software is frequently of high quality. Thus, we are investigating how open source software creates high quality and whether it can maintain this quality for ever larger project sizes. In this paper, we look at one particular quality indicator, the density of comments in open source software code. In a large­scale study of more than 5,000 projects, we find that active open source projects document their source code, and we find that the comment density is independent of team and project size, but not of project age. In future work, we intend to correlate comment density with project success or failure. 

Reference: Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198. [Link: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/]

2April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 3: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Analysis Process and Tool Chain

3

Raw Data(ohloh.net)

Aggr. Data(RDBMS)

12 3 4

5

SQL, Java

Excel / Calc,R, spec. tools

Raw data source• Local database  (ohloh.net snapshot , crawled sources )• Web services access  (ohloh.net, sourceforge .net, others)

Pre­processing• Database querying using SQL and scripts• Java library for computationally heavyweight filters , aggregation

Aggregated data source• Output of pre ­processing stage for specific analytical tasks• Aggregated data significantly improves analysis speed

Analytical processing• Mines aggregated  (and raw) data for insights , hypothesis testing• At present basic processing  (Excel), machine learning next

Analysis output• Results of analytical processing : averages , distributions , correlations• Presented as models , tables, graphs, charts, etc.

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 4: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Commenting Practice in Open Source

4

y = 0.1596xR2 = 0.872

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09

Project Size in Lines of Code (LoC = CL + SLoC)

Com

men

t Lin

es

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 5: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Comment Density

• Comment density– Percentage of lines of code that are comments– Measure of how well documented code is– Indicator of likelihood of survival

• Definitions– CL = comment lines– SLoC = source lines of code– CD = CL / (CL + SLoC)

5

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08 1.E+09

Project Size in Lines of Code (LoC = CL + SLoC)

Com

men

t Den

sity

mean = 0.1867median = 0.1674stdev = 0.1088correl = ­0.00787

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 6: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Variation by Programming LanguageJava

y = 0.2798x; R2 = 0.9334; mean = 0.2587; median = 0.2566; stdev = 0.1111; population size = 1085

PERL

y = 0.1602x; R2 = 0.9464; mean = 0.1044; median = 0.0902; stdev = 0.0713; population size = 273

6

y = 0.2798xR2 = 0.9334

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

1.E+08

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08

Project Size in Lines of Code (CL + SLoC)

Com

men

t Lin

es

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08

Project Size in Lines of Code (CL + SLoC)

Com

men

t Den

sity

mean = 0.2587median = 0.2566stdev = 0.1111

0.0841

0.2450

0.3327

0.2340

0.0841

0.0165 0.0037 0.0000 0.00000%

10%

20%

30%

40%

50%

60%

70%

[0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[

Comment Density

Per

cent

age 

of O

ccur

renc

es

0.5855

0.3382

0.0582

0.0073 0.0109 0.0000 0.0000 0.0000 0.00000%

10%

20%

30%

40%

50%

60%

70%

[0.0, 0.1[ [0.1, 0.2[ [0.2, 0.3[ [0.3, 0.4[ [0.4, 0.5[ [0.5, 0.6[ [0.6, 0.7[ [0.7, 0.8[ [0.8, 0.9[

Comment Density

Per

cent

age 

of O

ccur

renc

es

y = 0.1602xR2 = 0.9464

1.E+00

1.E+01

1.E+02

1.E+03

1.E+04

1.E+05

1.E+06

1.E+07

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08

Project Size in Lines of Code (CL + SLoC)

Co

mm

ent L

ines

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

1.E+00 1.E+01 1.E+02 1.E+03 1.E+04 1.E+05 1.E+06 1.E+07 1.E+08

Project Size in Lines of Code (CL + SLoC)

Com

men

t Den

sity

mean = 0.1044median = 0.0902stdev = 0.0713

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

More here: How Open Source Comments (by Programming Language)[Link: http://dirkriehle.com/2008/11/10/how­open­source­comments­by­programming­language/]

Page 7: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Comment Density by Commit Size

7

0%

10%

20%

30%

40%

50%

60%

70%

80%

90%

100%

0 10 20 30 40 50 60 70 80 90 100

Source Lines of Code in Commit [SLoC]

Com

men

t Den

sity

Comment Density by SLoC Size

Total Average Comment Density

sloc size = 1­100mean = 0.2513median = 0.2340stdev = 0.0626

sloc size = 50­100mean = 0.2234median = 0.2190stdev = 0.0169

sloc size = 80­100mean = 0.2224median = 0.2171stdev = 0.0209

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 8: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Comment Density by Team Size

8

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%

0 10 20 30 40 50 60 70 80 90 100

Team Size [Number of Committers]

Com

men

t Den

sity

0%

5%

10%

15%

20%

25%

30%

35%

40%

45%

50%Comment Density by Team Size

Total Average Comment Density

Percent Projects with Team Size

team size = 1­20mean = 0.1914median = 0.1878stdev = 0.0255

team size = 1­50mean = 0.1922median = 0.1906stdev = 0.0425

team size = 1­100mean = 0.1856median = 0.1857stdev = 0.0641correl = ­0.0550

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 9: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Comment Density by Project Age

9

17.50%

17.75%

18.00%

18.25%

18.50%

18.75%

19.00%

19.25%

19.50%

0 3 6 9 12 15 18 21 24 27 30 33 36 39 42 45 48

Project Age [months]

Ave

rage

 Com

men

t Den

sity

correl = ­0.9054

April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 10: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Commenting Practice Summary• Successful open source projects follow a consistent commenting practice

– 1 out of 20 commits serves commenting purposes only– 1 out of 5 non­empty lines is a comment line

• Average comment density is independent of project size

• Average comment density is independent of team size

• Average comment density slowly falls with project age

10April 6, 2009. Version 1.1. © 2009 Dirk Riehle. All Rights Reserved.

Page 11: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

References• Dirk Riehle, John Ellenberger, Tamir Menahem, Boris Mikhailovski, Yuri Natchetoi, Barak Naveh, Thomas 

Odenwald. “Bringing Open Source Best Practices into Corporations Using a Software Forge.” IEEE Software, 2009. See: http://dirkriehle.com/2009/02/11/open­collaboration­within­corporations­using­software­forges/ 

• Dirk Riehle. “The Economic Motivation of Open Source: Stakeholder Perspectives.” IEEE Computer, vol. 40, no. 4 (April 2007). Page 25­32. See: http://dirkriehle.com/computer­science/research/2007/computer­2007.html 

• Oliver Arafat, Dirk Riehle. “The Commit Size Distribution of Open Source Software.” In Proceedings of the 42nd Hawaiian International Conference on System Sciences (HICSS 42). IEEE Press, 2009. See: http://dirkriehle.com/2008/09/23/the­commit­size­distribution­of­open­source­software/ 

• Amit Deshpande, Dirk Riehle. “The Total Growth of Open Source.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 197­209. See: http://dirkriehle.com/2008/03/14/the­total­growth­of­open­source/ 

• Amit Deshpande, Dirk Riehle. “Continuous Integration in Open Source Software Development.” In Proceedings of the Fourth Conference on Open Source Systems (OSS 2008). Springer Verlag, 2008. Page 273­280. See: http://dirkriehle.com/2008/03/08/continuous­integration­in­open­source­software­development/ 

• Philipp Hofmann, Dirk Riehle. “Estimating Commit Sizes Efficiently.” In Proceedings of the 5th International Conference on Open Source Systems (OSS 2009). Springer Verlag, 2009. Page 105­115. See: http://dirkriehle.com/2009/02/11/estimating­commit­sizes­efficiently/ 

• Oliver Arafat, Dirk Riehle. “The Comment Density of Open Source Software Code.” In Companion to Proceedings of the 31st International Conference on Software Engineering (ICSE 2009). IEEE Press, 2009. Page 195­198. See: http://dirkriehle.com/2009/02/04/the­comment­density­of­open­source­software­code/  11April 6, 2009. Version 1.1. 

© 2009 Dirk Riehle. All Rights Reserved.

Page 12: The Comment Density of Open Source Software Code · SQL, Java Excel / Calc, R, spec. tools Raw data source •Local database (ohloh.net snapshot, crawled sources ) •Web services

Thank you! Questions?

For any feedback or questions, please email the authors:

Oliver ArafatSiemens AG, Corporate Technology

[email protected]

Dirk RiehleSAP Research, SAP Labs LLC

[email protected], http://dirkriehle.comhttp://twitter.com/dirkriehle