Topology and Evolution of the Open Source Software Community

Advisors:

Dr. Vincent W. Freeh

Dr. Kevin Bowyer

Supported in part by the National Science Foundation – Digital Science & Technology

Yongqin Gao

Outline

→ Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions

Overview (about OSS)

• What is OSS

– Free to use, free to distribute

– Unlimited user and usage

– Source code available and modifiable

• Potential advantages over commercial software

– Higher quality

– Faster development

– Lower cost

– Transparent

Overview (about our research)

• Our goal

– Understanding the OSS phenomenon

• Approach

– SourceForge is the source of our empirical data

– Modeling as a social network

– Analysis of topological statistics

– Use simulation to verify and validate the model

Outline

• Overview

→ Data collection

• Network modeling

• Topological statistical analysis

• Simulations

• Publications

• Conclusions

Data Collection — Monthly

• Web crawler (scripts)

– Python

– Shell

– AWK

– Sed

• Monthly

• Since Jan 2001

• ProjectID

• DeveloperID

• Almost 2 million records

• Relational database

PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972
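The sample records above could be loaded into two lookup tables, one per side of the project/developer relation. This is a minimal sketch, not the crawler or database code used in the thesis: the function name `load_memberships` is hypothetical, and the pipe-delimited text layout with a header row is assumed from the slide.

```python
from collections import defaultdict

# Sample rows copied from the slide. Treating them as pipe-delimited text
# with one header row is an assumption about the stored format.
RAW = """PROJ|DEVELOPER
8001|dev348
8001|dev8972
8001|dev9922
8002|dev27650
8005|dev31351
8006|dev12409
8007|dev19935
8007|dev4262
8007|dev36711
8008|dev8972"""


def load_memberships(text):
    """Return (project -> developers, developer -> projects) as dicts of sets."""
    project_devs = defaultdict(set)
    dev_projects = defaultdict(set)
    lines = text.strip().splitlines()
    for line in lines[1:]:  # skip the header row
        project_id, developer_id = line.split("|")
        project_devs[project_id].add(developer_id)
        dev_projects[developer_id].add(project_id)
    return dict(project_devs), dict(dev_projects)


project_devs, dev_projects = load_memberships(RAW)
print(sorted(project_devs["8001"]))     # developers on project 8001
print(sorted(dev_projects["dev8972"]))  # projects that dev8972 works on
```

In this sample, dev8972 appears on both project 8001 and project 8008, which is the kind of shared membership that connects projects in the network model.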

Outline

• Overview

• Data collection

→ Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusions

Modeling as Collaboration Network

• What is a collaboration network?

– A social network representing collaborative relationships

– Examples: the movie actor network and the scientist collaboration network

• Differences of the SourceForge collaboration network

– Link detachment

– Virtual collaboration

– Voluntary

– Global

• Bipartite property of collaboration networks

Collaboration Network - Bipartite

Adapted from Newman, Strogatz and Watts, 2001

SourceForge Developer Network

[Figure: OSS Developer Network (Part). Developers are nodes; projects are links. Shown: projects 6882, 7028, 7597, 9859 and 15850, with developer nodes such as dev[46] and dev[47].]

• 24 developers, 5 projects

• 2 hub developers, 1 cluster
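The "developers are nodes, projects are links" view is a one-mode projection of the bipartite membership data. A minimal sketch of that projection follows, together with the dual projection onto projects. The membership data and the function names `developer_network` and `project_network` are hypothetical illustrations, not taken from the thesis.

```python
from itertools import combinations

# Hypothetical membership data (project -> developers); not from the deck.
project_devs = {
    "p1": {"a", "b", "c"},
    "p2": {"c", "d"},
    "p3": {"e"},
}


def developer_network(project_devs):
    """Link two developers when they share at least one project."""
    adj = {}
    for devs in project_devs.values():
        for dev in devs:
            adj.setdefault(dev, set())  # keep isolated developers as nodes
        for u, v in combinations(sorted(devs), 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def project_network(project_devs):
    """Dual view: link two projects when they share at least one developer."""
    adj = {p: set() for p in project_devs}
    for p, q in combinations(sorted(project_devs), 2):
        if project_devs[p] & project_devs[q]:
            adj[p].add(q)
            adj[q].add(p)
    return adj


dev_net = developer_network(project_devs)
proj_net = project_network(project_devs)
print(sorted(dev_net["c"]))    # developers who share a project with "c"
print(sorted(proj_net["p1"]))  # projects that share a developer with "p1"
```

Developer "c" sits on two projects here, so it acts as a small hub that joins both teams into one cluster, while "e" stays isolated.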

Outline

• Overview

• Data collection

• Network modeling

→ Topological statistical analysis (real data)

• Simulations

• Publications

• Conclusion

Topological Analysis

• Statistics inspected

– Diameter

– Average degree

– Clustering coefficient

– Degree distribution

– Cluster size distribution

– Relative size of major cluster

– Fitness and life cycle

• Evolution of these statistics

• Dual networks

– Developer network and project network

Terminology

• Diameter

– Average length of the shortest paths between all pairs of vertices

• Degree

– The number of edges connected to a given vertex

• Average degree

– Average of the degrees of all vertices in the network

• Cluster

– A connected component of the network

• Clustering coefficient (CC)

– CCi: the number of links actually present among the vertices in the neighborhood of vertex i, as a fraction of the total possible number of such links

– CC: the average of all CCi in the network

• Degree distribution

– The distribution of degrees throughout the network

• Major cluster

– The largest cluster in the network
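The definitions above can be computed directly on a small network. This is a minimal, exact sketch on hypothetical data. Two conventions are assumptions, because the slides do not state them: vertices with fewer than two neighbours contribute a CCi of 0, and the diameter averages only over pairs that are actually connected. An all-pairs breadth-first search like this would be slow on the full SourceForge network.

```python
from collections import deque

# Hypothetical undirected network as an adjacency map (node -> neighbours).
adj = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
    "e": set(),
}


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def clustering_coefficient(adj):
    """Average of CCi; vertices with fewer than two neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for u in nbrs for v in nbrs if u < v and v in adj[u])
        total += links / (k * (k - 1) / 2)
    return total / len(adj)


def bfs_distances(adj, source):
    """Shortest-path length from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def clusters(adj):
    """Connected components, largest (the major cluster) first."""
    seen, found = set(), []
    for node in adj:
        if node not in seen:
            component = set(bfs_distances(adj, node))
            seen |= component
            found.append(component)
    return sorted(found, key=len, reverse=True)


def diameter(adj):
    """Average shortest-path length over connected pairs of vertices."""
    total = pairs = 0
    for source in adj:
        for target, d in bfs_distances(adj, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


print(average_degree(adj), clustering_coefficient(adj), diameter(adj))
print(clusters(adj)[0])  # the major cluster
```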

Diameter of Developer Network vs. Time

• Network size increased from 30,000 to 70,000

Diameter of Project Network vs. Time

• Network size increased from 20,000 to 50,000.

• Diameter decreasing with time both for developer network and project network

Clustering Coefficient of Developer Network vs. Time

Clustering Coefficient of Project Network vs. Time

Degree Distribution (developers)

Degree Distribution (projects)

Cluster Size Distribution

• R² with major cluster is 0.7426

• R² without major cluster is 0.9799
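One common way to obtain R² values like these is a least-squares line through the distribution on log-log axes. The slides do not say how the fit was done, so this is a sketch under that assumption. The function name `fit_power_law` and the histogram data are hypothetical.

```python
import math


def fit_power_law(sizes, counts):
    """Least-squares line through (log size, log count).

    Returns (slope, r2); the slope is the estimated power-law exponent.
    """
    xs = [math.log10(s) for s in sizes]
    ys = [math.log10(c) for c in counts]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, 1.0 - ss_res / ss_tot


# Hypothetical cluster-size histogram. The last point stands in for a single
# very large major cluster that sits far off the trend of the other points.
sizes = [1, 2, 4, 8, 16, 10000]
counts = [1000, 250, 60, 16, 4, 1]

slope_all, r2_all = fit_power_law(sizes, counts)
slope_rest, r2_rest = fit_power_law(sizes[:-1], counts[:-1])
print(f"with major cluster:    R^2 = {r2_all:.4f}")
print(f"without major cluster: R^2 = {r2_rest:.4f}")
```

With data of this shape, the single large cluster acts as an outlier and lowers R², which matches the direction of the difference reported on the slide.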

Relative Size of Major Cluster vs. Time

• The relative size of the major cluster increases

• The rate of increase is decreasing

• May be an indication of the network evolution

Existence of Fitness

• Investigating the development of single projects can verify the existence of the “newcomer” phenomenon

• We tracked the development of every new project in July 2001 until now (1660 projects in total)

• The maximal monthly growth per project is 13, while the average monthly growth per project is just 0.3639

Life Cycle of Project

Summary

Summary of Results

• Power law rules

– Degree distributions, cluster distribution

• Average degree increasing with time

• Diameter decreasing with time

• Clustering coefficient decreasing with time

• Fitness exists in SourceForge

• Projects have life cycle behaviors

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

→ Simulations

• Publications

• Conclusion

Conceptual Framework

[Diagram. Labels: empirical data, characterization, description, model, generation, simulation, verification, validation, adjustment.]

Agent-based Modeling

• EBM vs. ABM

– Heterogeneous individuals

– Complex network

• Experiment environment

– Hardware: computer cluster

– Software:

  • Simulation toolkit: Swarm

  • Database: Oracle

  • Languages: Java, PL/SQL

Model for SourceForge

• ABM based on bipartite graph

• Model description

– Agent: developer

– Behaviors: create, join, abandon and idle

– Preference: developer’s and project’s

– Fitness

• Four models in iterations

– ER, BA, BA with constant fitness and BA with dynamic fitness

• Comparison of empirical and simulated data
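The agent behaviors listed above (create, join, abandon, idle) can be illustrated with a toy simulation loop. This is a rough sketch and not the Swarm/Java model from the thesis. The function `simulate`, its probabilities, and the one-new-developer-per-step schedule are all made-up assumptions. A uniform project choice loosely stands in for the ER variant and a size-proportional choice for BA-style preferential attachment.

```python
import random


def simulate(steps, preferential, seed=0,
             p_create=0.1, p_join=0.3, p_abandon=0.05):
    """Toy run: one new developer per step, then every developer acts once.

    All probabilities are illustration values, not the thesis's parameters.
    Returns a dict mapping project id -> set of member developer ids.
    """
    rng = random.Random(seed)
    projects = {}
    developers = []
    next_project = 0
    for step in range(steps):
        developers.append(step)
        for dev in developers:
            roll = rng.random()
            if roll < p_create or not projects:
                # create: start a new project with this developer on it
                projects[next_project] = {dev}
                next_project += 1
            elif roll < p_create + p_join:
                # join: pick a project uniformly or proportionally to size
                ids = list(projects)
                if preferential:
                    weights = [len(projects[p]) + 1 for p in ids]
                    chosen = rng.choices(ids, weights=weights)[0]
                else:
                    chosen = rng.choice(ids)
                projects[chosen].add(dev)
            elif roll < p_create + p_join + p_abandon:
                # abandon: leave one of the developer's current projects
                mine = [p for p in projects if dev in projects[p]]
                if mine:
                    projects[rng.choice(mine)].discard(dev)
            # otherwise the developer idles this step
    return projects


for label, preferential in (("uniform", False), ("preferential", True)):
    result = simulate(100, preferential)
    largest = max(len(members) for members in result.values())
    print(label, "projects:", len(result), "largest project:", largest)
```

The resulting membership sets have the same bipartite shape as the empirical data, so the same topological statistics can be computed on simulated and empirical networks for comparison.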

ER Model - Diameter

• Average degree is decreasing while it is increasing in empirical data

• Diameter is increasing while it is decreasing in empirical data

ER Model – Clustering Coefficient

• Clustering coefficient is relatively low, under 0.3, while it is around 0.7 in empirical data.

ER Model – Degree Distribution

• Degree distribution is a normal distribution, while it is a power law in empirical data

ER Model – Cluster Size Distribution

• Power law distribution with R² as 0.6667 (0.9653 without the major cluster) while R² in empirical data is 0.7426 (0.9799 without the major cluster)

• The actual distribution is different from empirical data

BA Model – Diameter and Clustering Coefficient

• Small diameter and high clustering coefficient like empirical data

• Diameter and clustering coefficient are both decreasing like empirical data

BA Model – Degree Distribution

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² as 0.9798 and empirical data has R² as 0.9714.

• For project distribution: simulated data has R² as 0.6650 and empirical data has R² as 0.9838.

BA Model with Constant Fitness

• Power laws in degree distributions, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² as 0.9742 and empirical data has R² as 0.9714.

• For project distribution: simulated data has R² as 0.7253 and empirical data has R² as 0.9838.

BA Model with Dynamic Fitness

• Power laws in degree distribution, similar to empirical data (o for simulated data and x for empirical data).

• For developer distribution: simulated data has R² as 0.9695 and empirical data has R² as 0.9714.

• For project distribution: simulated data has R² as 0.8051 and empirical data has R² as 0.9838.

Advantage of Dynamic Fitness

• Intuition: fitness should decrease with time.

• Statistics: projects show life cycle behavior, which cannot be replicated by the BA model with constant fitness but can be replicated by the BA model with dynamic fitness
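The difference between constant and dynamic fitness can be sketched as a weighting rule for the join decision. The slides do not give the decay function, so the exponential half-life used here is an assumption. The names `join_weight`, `pick_project`, `base_fitness` and `half_life` are hypothetical.

```python
import random


def join_weight(size, age, dynamic, base_fitness=1.0, half_life=6.0):
    """Attractiveness of a project to a developer choosing what to join.

    size: current number of members; age: months since creation.
    The exponential decay and the half_life value are assumptions.
    """
    fitness = base_fitness
    if dynamic:
        fitness *= 0.5 ** (age / half_life)
    return fitness * (size + 1)


def pick_project(projects, dynamic, rng):
    """projects: list of (project_id, size, age) tuples; returns one id."""
    weights = [join_weight(size, age, dynamic) for _, size, age in projects]
    return rng.choices(projects, weights=weights)[0][0]


# Hypothetical snapshot: an old, large project and a new, small one.
projects = [("old", 20, 36), ("new", 2, 0)]
for dynamic in (False, True):
    w_old = join_weight(20, 36, dynamic)
    w_new = join_weight(2, 0, dynamic)
    share = w_old / (w_old + w_new)
    print("dynamic" if dynamic else "constant", "share of joins to old:", share)
```

Under constant fitness the old project keeps attracting most newcomers because of its size alone. With decaying fitness its pull fades as it ages, which is one way a growth-then-decline life cycle could arise.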

Summary

Summary of Results

• We use ABM to model and simulate the SourceForge collaboration network.

• A conceptual framework is proposed for agent-based modeling and simulation.

• Case study of this framework: the SourceForge study through ER, BA, BA with constant fitness and BA with dynamic fitness.

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

→ Publications

• Conclusion

Publications To-date

• Yongqin Gao, "Modeling and Simulation of the OSS Community", Seventh Annual Swarm Researchers Meeting (Swarm2003), Notre Dame, IN, 2003.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Analysis and Modeling of the Open Source Software Community", NAACSOS Conference 2003, Pittsburgh.

• Yongqin Gao, Vince Freeh, and Greg Madey, "Conceptual Framework for Agent-based Modeling and Simulation", NAACSOS Conference 2003, Pittsburgh.

• Greg Madey, Vincent Freeh, Renee Tynan, Yongqin Gao, Chris Hoffman, "Agent-based Modeling and Simulation of Collaborative Social Networks", AMCIS 2003, Tampa, FL.

Possible Journals

• Chapter 3

– Physica A: Statistical Mechanics and its Applications

– Journal of Social Structure (JSS)

• Chapter 4

– Journal of Artificial Societies and Social Simulation (JASSS)

– Journal of Statistical Computation and Simulation (JSCS)

Outline

• Overview

• Data collection

• Network modeling

• Topological statistical analysis (real data)

• Simulations

• Publications

→ Conclusion

Conclusion

• Studying the SourceForge collaboration network can help us understand the OSS community

• We investigate not only the topological statistics but also the evolution of these statistics.

• Simulation is used to investigate the SourceForge collaboration network.

Contribution

• Statistical study of the SourceForge community (snapshot and evolution)

• Verification of the approximate method to calculate the diameter and CC

• Proposal of a model for the SourceForge community

• Addition of dynamic fitness as an improvement to the BA model

Future Work

• Data collection

– Database dump from SourceForge (PostgreSQL, 8 GB)

– All the possible attributes

– Database schema in UML

• More topology analysis (with more attributes)

– Discussion forum

– Task assignment

– Project management

– Active testing

• Behavior-based analysis

– Interaction between agents

– H. Peyton Young’s model

• Information entropy analysis

Acknowledgements

• Committee

• Advisors

• Colleagues

• SourceForge

• NSF

• Others

Thank you