Analysis and Modeling of the Open Source Software Community
Yongqin Gao, Greg Madey
Computer Science & Engineering
University of Notre Dame
Vincent Freeh
Computer Science Dept.
NCSU
NAACSOS Conference
Pittsburgh, PA
June 25, 2003
Supported in part by the National Science Foundation – Digital Science & Technology
Overview
What is OSS
  Free to use and distribute
  Unlimited users and usage
  Source code available and modifiable
Potential advantages over commercial software
  Higher quality
  Faster development
  Lower cost
Our goal
  Understanding the OSS phenomenon
Approach
  SourceForge is the source of our empirical data
  Modeling as a social network
  Analysis of topological statistics
Data Collection: Monthly
Web crawler (scripts)
  Python
  Perl
  AWK
  Sed
Monthly
  Since Jan 2001
  ProjectID
  DeveloperID
  Almost 2 million records
  Relational database
PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
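As a rough illustration of how records in this PROJ|DEVELOPER layout could be loaded for analysis, the sketch below parses the sample rows into two lookup tables. The function name `load_memberships` and the in-memory dictionaries are assumptions made for this example; the slides only state that the crawled data is kept in a relational database.

```python
from collections import defaultdict

# Sample rows in the PROJ|DEVELOPER layout shown on the slide.
RAW = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) from pipe-delimited rows."""
    by_project = defaultdict(set)
    by_developer = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        project, developer = line.split("|")
        by_project[project].add(developer)
        by_developer[developer].add(project)
    return by_project, by_developer


by_project, by_developer = load_memberships(RAW)
print(sorted(by_project["8007"]))       # developers on project 8007
print(sorted(by_developer["dev8972"]))  # projects that dev8972 belongs to
```

In this sample, dev8972 appears under two projects, which is the kind of shared membership that connects projects to each other.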
Modeling as collaboration network
What is a collaboration network
  A social network representing collaborative relationships
  Examples: movie actor network and scientist collaboration network
Differences of the SourceForge collaboration network
  Detachment
  Virtual collaboration
  Voluntary
  Global
Bipartite property of collaboration network
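The bipartite property means the raw data links two kinds of vertices, projects and developers, and each one-mode network is obtained by projection. A minimal sketch of that projection, assuming memberships are given as (project, developer) pairs; `project_bipartite` and the sample pairs are hypothetical names and data, not part of the original study:

```python
from collections import defaultdict
from itertools import combinations


def project_bipartite(memberships):
    """Project (project, developer) pairs onto two one-mode networks.

    Returns (developer_edges, project_edges), each a set of frozensets:
    two developers are linked if they share a project, and two projects
    are linked if they share a developer.
    """
    devs_of = defaultdict(set)
    projs_of = defaultdict(set)
    for project, developer in memberships:
        devs_of[project].add(developer)
        projs_of[developer].add(project)
    developer_edges = {
        frozenset(pair)
        for devs in devs_of.values()
        for pair in combinations(sorted(devs), 2)
    }
    project_edges = {
        frozenset(pair)
        for projs in projs_of.values()
        for pair in combinations(sorted(projs), 2)
    }
    return developer_edges, project_edges


# Hypothetical memberships: developer "b" works on both projects.
pairs = [("p1", "a"), ("p1", "b"), ("p2", "b"), ("p2", "c")]
developer_edges, project_edges = project_bipartite(pairs)
print(developer_edges)
print(project_edges)
```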
SourceForge developer network
15850  dev[46]  dev[83]
15850  dev[46]  dev[48]
15850  dev[46]  dev[56]
15850  dev[46]  dev[58]
6882   dev[58]  dev[47]
6882   dev[47]  dev[79]
6882   dev[47]  dev[52]
6882   dev[47]  dev[55]
7028   dev[46]  dev[99]
7028   dev[46]  dev[51]
7028   dev[46]  dev[57]
7597   dev[46]  dev[45]
7597   dev[46]  dev[72]
7597   dev[46]  dev[55]
7597   dev[46]  dev[58]
7597   dev[46]  dev[61]
7597   dev[46]  dev[64]
7597   dev[46]  dev[67]
7597   dev[46]  dev[70]
9859   dev[46]  dev[49]
9859   dev[46]  dev[53]
9859   dev[46]  dev[54]
9859   dev[46]  dev[59]
[Figure: part of the OSS developer network. Nodes are labeled with developers such as dev[46] and dev[47]; links are labeled Project 6882, Project 7028, Project 7597, Project 9859 and Project 15850.]
OSS Developer Network (Part)
  Developers are nodes / Projects are links
  24 Developers
  5 Projects
  2 hub Developers
  1 Cluster
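To make the "developers are nodes, projects are links" idea concrete, the sketch below rebuilds an adjacency map from the project/developer/developer rows listed for this partial network and ranks developers by degree. The rows listed in the transcript may not cover every node drawn in the figure, and the helper name `build_adjacency` is an assumption for this example.

```python
from collections import defaultdict

# (project, developer, developer) rows listed for the partial network.
ROWS = [
    (15850, 46, 83), (15850, 46, 48), (15850, 46, 56), (15850, 46, 58),
    (6882, 58, 47), (6882, 47, 79), (6882, 47, 52), (6882, 47, 55),
    (7028, 46, 99), (7028, 46, 51), (7028, 46, 57),
    (7597, 46, 45), (7597, 46, 72), (7597, 46, 55), (7597, 46, 58),
    (7597, 46, 61), (7597, 46, 64), (7597, 46, 67), (7597, 46, 70),
    (9859, 46, 49), (9859, 46, 53), (9859, 46, 54), (9859, 46, 59),
]


def build_adjacency(rows):
    """Undirected developer network: one link per pair of co-developers."""
    adj = defaultdict(set)
    for _project, a, b in rows:
        adj[a].add(b)
        adj[b].add(a)
    return adj


adj = build_adjacency(ROWS)
degree = {dev: len(neighbors) for dev, neighbors in adj.items()}
# The two developers with the most distinct collaborators.
hubs = sorted(degree, key=degree.get, reverse=True)[:2]
print(hubs, [degree[h] for h in hubs])
```

Using a set per developer means a pair that collaborates on several projects still counts as a single link when computing degree.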
Topological analysis
Statistics inspected
  Diameter
  Average degree
  Clustering coefficient
  Degree distribution
  Cluster size distribution
  Relative size of major cluster
  Fitness and life cycle
Evolution of these statistics
Diameter of developer network vs. time
  Diameter: the average length of the shortest paths between all pairs of vertices
  The values for the developer network (30,000 – 70,000) are between 6 and 8
Diameter of project network vs. time
  The values for the project network (20,000 – 50,000) are between 6 and 7
  Diameter decreases with time for both the developer network and the project network
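The diameter as defined here (average shortest-path length) can be computed with a breadth-first search from every vertex. The sketch below is a minimal version under two assumptions not stated in the slides: the network is given as an adjacency dictionary, and only pairs that are actually connected are averaged. On networks with tens of thousands of vertices an exhaustive search like this is slow, so sampling source vertices may be needed in practice.

```python
from collections import deque


def average_path_length(adj):
    """Mean shortest-path length over all connected ordered pairs of vertices."""
    total = 0
    pairs = 0
    for source in adj:
        # Breadth-first search gives hop counts from `source`.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for neighbor in adj[node]:
                if neighbor not in dist:
                    dist[neighbor] = dist[node] + 1
                    queue.append(neighbor)
        total += sum(dist.values())
        pairs += len(dist) - 1  # exclude the source itself
    return total / pairs if pairs else 0.0


# Hypothetical three-vertex path a - b - c.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
print(average_path_length(path))
```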
Average degree vs. time
  The values for the developer network are between 7 and 8
  The values for the project network are only between 3 and 4
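For an undirected network the average degree is 2|E| / |V|. The sketch below applies that formula to monthly snapshots to show how the statistic could be tracked over time. The snapshot data is hypothetical, and vertices with no links are not counted because they never appear in an edge list.

```python
def average_degree(edges):
    """Average degree = 2 * |E| / |V| for an undirected edge list."""
    # Deduplicate edges and ignore self-loops.
    unique_edges = {frozenset(e) for e in edges if e[0] != e[1]}
    vertices = {v for e in unique_edges for v in e}
    return 2 * len(unique_edges) / len(vertices) if vertices else 0.0


# Hypothetical monthly snapshots (not SourceForge data).
snapshots = {
    "month 1": [("a", "b"), ("b", "c")],
    "month 2": [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")],
}
for month, edges in snapshots.items():
    print(month, round(average_degree(edges), 3))
```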
Cluster size distribution
Cluster distribution of developer network
  R² with the major cluster is 0.7426
  R² without the major cluster is 0.9799
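The slides do not say how the R² values were obtained. One common approach, assumed here, is a least-squares line through the cluster size distribution on log-log axes, fitted once with and once without the largest cluster. The function `loglog_r2` and the counts below are hypothetical.

```python
import math


def loglog_r2(size_counts, drop_largest=False):
    """Fit log10(count) against log10(size); return (slope, R^2).

    Needs at least two distinct sizes and counts that are not all equal.
    """
    items = sorted(size_counts.items())
    if drop_largest:
        items = items[:-1]  # leave out the largest cluster size
    xs = [math.log10(size) for size, _ in items]
    ys = [math.log10(count) for _, count in items]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, (sxy * sxy) / (sxx * syy)


# Hypothetical counts: cluster size -> number of clusters of that size.
counts = {1: 1000, 10: 100, 100: 10, 5000: 1}
print(loglog_r2(counts))
print(loglog_r2(counts, drop_largest=True))
```

A single very large cluster sits far from the line that the small clusters follow, which is one plausible reason the fit improves when it is left out.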
Relative size of major cluster vs. time
  Steady increase in the relative size of the major cluster
  Appears to be converging slowly to a fixed percentage of around 35%
May be an indication of the network evolution
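The relative size of the major cluster is the share of all vertices that fall in the largest connected component. A minimal sketch, assuming the network is an adjacency dictionary in which every vertex appears as a key; the sample network is hypothetical:

```python
from collections import deque


def component_sizes(adj):
    """Sizes of all connected components, largest first."""
    seen = set()
    sizes = []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        queue = deque([start])
        size = 0
        while queue:
            node = queue.popleft()
            size += 1
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        sizes.append(size)
    return sorted(sizes, reverse=True)


def major_cluster_share(adj):
    """Fraction of vertices that belong to the largest component."""
    sizes = component_sizes(adj)
    return sizes[0] / sum(sizes) if sizes else 0.0


# Hypothetical network: a three-vertex cluster, a pair, and an isolated vertex.
network = {
    "a": {"b"}, "b": {"a", "c"}, "c": {"b"},
    "d": {"e"}, "e": {"d"},
    "f": set(),
}
print(component_sizes(network), major_cluster_share(network))
```

Computing this share for each monthly snapshot would give the curve plotted against time.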
Existence of fitness
  Investigating the development of single projects can verify the existence of the “young upcomer” phenomenon
  We tracked the development of every project new in July 2001 up to now (1660 projects in total)
  The maximal monthly growth per project is 13, while the average monthly growth per project is just 0.3639
Summary of results
Power law rules
  Degree distributions
  Cluster distribution
Average degree increasing with time
Diameter decreasing with time
Clustering coefficient decreasing with time
Fitness exists in SourceForge
Projects have life cycle behaviors
Conclusion
  Studying the SourceForge collaboration network can help us understand the OSS community
  We investigate not only the topological statistics but also the evolution of these statistics
  Simulation is needed for further investigation of the SourceForge collaboration network
Terminology
Degree
  The count of edges connected to a given vertex
Degree distribution
  The distribution of degrees throughout a network
Cluster
  A connected component of the network
Diameter
  Average length of shortest paths between all pairs of vertices
Clustering coefficient (CC)
  CCi: fraction representing the number of links actually present relative to the total possible number of links among the vertices in the neighborhood of vertex i
  CC: average of all CCi in a network
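The clustering coefficient definition translates directly into code: for each vertex, count the links present among its neighbors and divide by the number of possible links, k(k-1)/2 for k neighbors. In the sketch below, treating vertices with fewer than two neighbors as having CCi = 0 is an assumption; the slides do not say how such vertices are handled. The sample network is hypothetical.

```python
from itertools import combinations


def local_cc(adj, v):
    """CCi: links present among the neighbors of v, over links possible."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: undefined cases are counted as 0
    links = sum(1 for a, b in combinations(neighbors, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)


def average_cc(adj):
    """CC: mean of CCi over all vertices in the network."""
    return sum(local_cc(adj, v) for v in adj) / len(adj) if adj else 0.0


# Hypothetical network: triangle a-b-c with one extra vertex d attached to a.
network = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
print(local_cc(network, "a"), average_cc(network))
```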