Beefing IT up for Your Investor? Open Sourcing and Startup Funding: Evidence from GitHub
Annamaria Conti, Christian Peukert, Maria Roche
Working Paper 22-001
Copyright © 2021 by Annamaria Conti, Christian Peukert, and Maria Roche.
Working papers are in draft form. This working paper is distributed for purposes of comment and discussion only. It may not be reproduced without permission of the copyright holder. Copies of working papers are available from the author.
Funding for this research was provided in part by Harvard Business School and the Swiss National Science Foundation (Project ID: 100013 188998 and 100013 197807).
Beefing IT up for your Investor?
Open Sourcing and Startup Funding: Evidence from GitHub∗
Annamaria Conti¹, Christian Peukert¹, and Maria Roche²
¹ HEC Lausanne
² Harvard Business School
September 28, 2021
Abstract
We study the participation of nascent firms in open-source communities and its implications for
attracting funding. To do so, we exploit unique data on 160,065 US startups linking information
from Crunchbase to firms’ GitHub accounts. Estimating a within-startup model saturated with fixed
effects, we show that startups accelerate their activities on the platform in the twelve months prior to
raising their first financing round. The intensity of their involvement on GitHub declines in the twelve
months after. Startups intensify those activities that rely on external technology sources above and
beyond the technologies they themselves control. Exploiting a shock that reduced the relative cost of
internal collaborations we provide evidence that startups’ decision to integrate external sources of
knowledge in their production function hinges on the relative cost vis-a-vis internal collaboration.
Applying machine learning to classify GitHub projects, we further unveil that the most prevalent
among these external activities are related to software development, data analytics, and integration.
Our results indicate that VCs and renowned investors are the most responsive to these activities.
Keywords: Startups, Technology Strategy, GitHub, Machine Learning, Venture Capital
∗Conti: [email protected], Peukert: [email protected], Roche: [email protected]. We thank Kimon Protopapas and Ilia Azizi for excellent research assistance. This manuscript benefited from many helpful comments provided at the CAS Platform Seminar Munich, Digital Economy Workshop, the Intellectual Property & Innovation Virtual seminar series, SCECR, and the West Coast Symposium. We are grateful for the suggestions provided by Karim Lakhani and members of the Laboratory for Innovation Science at the onset of this work, as well as participants of seminars at HEC Paris, and LUISS Guido Carli University. Annamaria Conti and Christian Peukert acknowledge funding from the Swiss National Science Foundation (Project ID: 100013 188998 and 100013 197807). Maria Roche acknowledges funding from the Harvard Business School Division of Research and Faculty Development.
1 Introduction
Innovation is no longer the sole prerogative of large corporate and government laboratories,
but can originate from anywhere (Jeppesen and Lakhani, 2010). As a consequence, open-
source communities have become a fundamental point of supply of innovation for established
firms (Chesbrough, 2006b). Access to external sources of technologies through involvement
with open-source communities allows firms to not only reduce innovation costs, but also to
increase the number and the quality of innovative ideas (Lakhani et al., 2013b). While the
open innovation model with its costs and benefits has been widely studied in the context of
established firms, we know little about the involvement of nascent firms with open-source
communities and its implications for attracting funding.
The financing environment fundamentally shapes strategic choices very early in the life of
a new venture (Hellmann and Puri, 2002). A key milestone for firms is raising funding, where
prior literature stresses the importance of both the team (Bernstein et al., 2017; Gompers
et al., 2020) and the underlying technology of the venture (Kaplan et al., 2009) in doing so
successfully. Although studies have provided evidence indicating the importance of technology
in raising funding, technology and, particularly, its idea sources still remain a black box.
In order to open this box, we use novel data on startups’ digital technology activities
provided on the online development and hosting platform GitHub, which has become the
open-source platform “where the world builds software” (GitHub.com). We thereby examine
how intensively startups utilize this platform prior to raising a funding round, the type of
activities in which they engage, and their sources of knowledge. To do so, we exploit data
encompassing 160,065 US startups listed on Crunchbase that were founded between 2005
and 2019 as our initial sample. We then match these firms to their organization accounts
on GitHub using their respective website domains. We access public GitHub records logged
since 2011 from the GHArchive. The combination of these two data sets provides us with
information on the industry, investors, and total amount of funds a firm raises as well as the
type and nature of activities a firm engages in on GitHub.
In our first step, we establish a strong, positive correlation between having a GitHub
organization account and raising financing. This association holds even after including
information on the entrepreneurship-specific human capital of the team. Building on these
exploratory correlations, we use our rich panel data to estimate a within startup model that
relates the dynamics of technology strategy to achieving funding milestones. To address
obvious omitted variable concerns, we saturate the model with a wide range of fixed effects
pertaining to the firm, age, industry-year, region-year, and lead-investor. We show that
GitHub activity takes off in the twelve months preceding a first funding round, and then
levels off in the twelve months after. Specifically, our data reveal that moving twelve
months closer to a first round increases the probability of being active on GitHub by 4
percentage points, a 50% increase relative to the mean.
To shed more light on these initial findings, we show that the effects are driven by
startups operating in the IT/Software sector, where entrepreneurial GitHub activities are
most intense. Additionally, we find that these dynamics dissipate over the course of subsequent
financing rounds suggesting that startups’ technology investments are less relevant for follow-
on investors. Taken together, these results provide an indication that earliest-round investors
do value a startup’s involvement with open source communities to produce and improve its
own technology.
We further unveil crucial heterogeneity among activity and investor types. Specifically,
we find that startups appear to intensify those activities on GitHub that rely on external
technology sources (repositories they do not control themselves) suggesting that shared
development and an integration of non-proprietary technologies can be important for attracting
funds. This finding is surprising, as it appears to contrast both with what a new venture
traditionally sets out to do and with what key theories in strategic management suggest. Namely, under the
established definition in the literature, technology entrepreneurship is understood as the
commercial exploitation of proprietary innovation generated in-house (Shane, 2004) and not as
the joint creation and (open) sourcing of technologies from outside the venture. Furthermore,
internal resources that are rare, inimitable, valuable, and non-substitutable are viewed as a
primary source of competitive advantage (Barney, 1991; Wernerfelt, 1984), which would seem
to preclude publicly shared development and integration of non-proprietary technologies.
We enrich our analysis to assess how a startup’s reliance on external technology sources
depends on their relative cost. To this end, we examine an exogenous change in GitHub’s
pricing model, introduced in October 2015, that led to a substantial decline in the costs
of internal collaborations through the GitHub platform, while leaving the cost of accessing
external technology sources unchanged. With the new pricing model, we observe that startups
rely relatively less on external technology sources and, instead, intensify internal collaborations
through the GitHub platform. This finding suggests that startups’ decision to integrate
external sources of knowledge in their production function crucially depends on their relative
cost: startups opt for these external sources when the cost of accessing them is comparatively
low.
We delve deeper into our findings and apply machine learning methods to classify external
activities by the following functionalities: Software Development/Backend (SD/BE), Machine
Learning (ML), Application Programming Interfaces (API), and User Interfaces (UI). We find
that all functionalities but those related to UI intensify prior to receiving a funding round and
decelerate afterwards. These results suggest that startups rely on external technology sources
to develop, scale, and integrate their digital technologies. The totality of these activities
matters for attracting investments. To complement these findings, we further show that
startups resort to GitHub to access best practices on human resource (HR) management,
highlighting the importance of GitHub for upgrading human capital resources.
One question these results unveil is whether the increased involvement of startups on
GitHub improves a startup’s innovation pipeline or simply acts as a signal for potential
investors (Conti et al., 2013a,b; Hsu and Ziedonis, 2013). The fact that startups – prior to
receiving a first round – intensify their GitHub involvement along the full spectrum of digital
technology production serves as a first indication that signaling is unlikely to be the sole driver of
our findings. To examine this potential mechanism in depth, we test whether startups are
more likely to make their previously private repositories public before raising a financing
round, which should occur if the main purpose of engagement on GitHub is signaling. Our
analyses reveal that startups are as likely to make a repository public before and after raising
a round of financing.
While this evidence speaks against the explanation that signaling is entirely driving our
results, it additionally suggests that a startup’s deceleration of its involvement on GitHub
post-round is unlikely to be the sole consequence of its investors’ distaste for their investees’ public
engagement. However, we do show that after a startup raises a financing round, the likelihood
that it chooses a permissive license – a license with only minimal restrictions on how software
can be used, modified, and redistributed – significantly declines, which may indicate
that investors are sensitive to appropriability concerns. On the whole, these findings
provide suggestive evidence that startups are, indeed, “beefing up” their IT before they
receive early-stage financing by relying on open-source communities.
To bring our results full circle, we consider heterogeneity in the response of investors to a
startup’s engagement on GitHub. We find that startups intensify their activities on GitHub
prior to raising a first round, particularly when the participating investors in the round are
VCs, as well as more experienced and more successful investors. Although the VC industry
only funds approximately one thousand ventures per year in the United States, VC-financed
startups account for between 30 and 70 percent of startup IPOs (Kaplan and Lerner, 2010;
Ritter, 2016), suggesting that precisely the more selective investors with higher returns to
investment are those that value startups’ activities on GitHub (Hellmann and Puri, 2002).
Taken together, our findings contribute to several streams of literature. The first analyzes
firm commercialization strategies of open source software (Fosfuri et al., 2008), and the role
of open source for firm productivity (Nagle, 2018, 2019; Shah and Nagle, 2019). In general,
these studies find a strong positive relationship between the usage of open source software and
firm productivity. Our study is distinct from this work as it focuses on technology startups.
By showing that startups rely on external sources to develop their technologies, we highlight
a novel channel through which startups benefit from using and actively engaging in open
source software endeavors. Their involvement matters for attracting investors, particularly
VCs and successful investors, during startups’ early stages.
Contrary to the common wisdom that startups fully control all aspects of their underlying
technologies, we find that nascent firms extensively rely on external repositories to develop,
scale, and integrate their technologies. Hence, these findings speak to the literature on
firm boundaries and innovation (Di Stefano et al., 2012), in particular to seminal work on
complementarities between internal and external knowledge sources (Arora and Gambardella,
1990; Cohen and Levinthal, 1990) and technology licensing (Arora et al., 2001).
By showing that startups accelerate their contributions on GitHub prior to raising a first
round, we contribute to the entrepreneurial finance literature that investigates whether VCs
invest in the “jockey” (founding team) or the “horse” (technology) (Bernstein et al., 2017;
Gompers et al., 2020; Kaplan et al., 2009). Our study provides evidence that above and
beyond the “jockey” and the “horse”, relying on open-source communities matters. Finally,
our use of machine learning algorithms to classify startups’ activities on GitHub contributes
to a budding line of research that applies sophisticated data techniques to shed light on and
categorize firm strategies (Conti et al., 2020; Guzman and Li, 2019).
The remainder of the paper is organized as follows. Section 2 provides the conceptual
framework that will guide our empirical analysis. Section 3 discusses the data. Sections 4
and 5 present the empirical specification and the results, respectively. Section 6 concludes.
2 Conceptual framework
Innovation is widely understood as a process of recombinant search whereby innovators
experiment with new components or new combinations of existing components (Fleming,
2001). Increasingly, scholars suggest that innovating in isolation is extremely difficult (Laursen
and Salter, 2006) and that embracing an “open innovation paradigm” using internal and
external components is crucial for firms to advance their technology pipeline (Chesbrough,
2006a; Grimpe and Kaiser, 2010). While openness can increase the number of components
available and the variability of these components – both of which are fundamental for
producing breakthrough innovations – this approach is not exempt from drawbacks (Almirall
and Casadesus-Masanell, 2010). For example, relative to a hierarchical organization with direct
internal chains of command, open innovation can heighten coordination costs (Greenstein,
1996) and further give rise to appropriability concerns (David and Greenstein, 1990). Because
firms relying on open-source communities for components to recombine do not control all
intellectual inputs, their ability to appropriate value from their innovations may suffer as a
consequence (Buss and Peukert, 2015; Teece, 1986).
This trade-off between open-sourcing and appropriability is particularly relevant for
technology startups (Gans and Stern, 2003). Because of their youth and small size, startups
are more likely to both gain by drawing from external sources of ideas, and to struggle with
appropriability issues when they do so. As highlighted by the existing literature (Bernstein
et al., 2017; Gompers et al., 2020; Kaplan et al., 2009; Roche et al., 2020), solving this
trade-off is crucial for startups given that beyond the founding team itself, the underlying
technology of a venture is an important predictor for success in financing. Building on these
studies, it remains an open question whether and to what extent startups rely on open-source
communities as sources to produce and improve their own technologies and ultimately attract
funding.
Startups may deal with the trade-off between open-sourcing and appropriability depending
on the type of investors they plan to attract. The existing literature has shown that investor
types differ in their preferences and that VCs tend to value a startup’s technology relatively
more (Conti et al., 2013b). Another source of heterogeneity may be a startup’s position in its
life cycle. For example, Wasserman (2003) has provided evidence that the relative importance
of a startup’s technology decreases as the startup becomes more mature. Following this, we
may detect important heterogeneity in the use of open-source communities as a function of
investor type and development stage of a startup.
Finally, another important aspect to consider is the type of technology that a startup
develops. The extant literature has highlighted that digital technology has increasingly
become organized into loosely coupled layers of different technologies (Yoo et al., 2012).
Especially in the case of software, platforms, where control over knowledge is distributed
across multiple firms and heterogeneous communities, take a central position in the innovation
process (Lakhani and von Hippel, 2003). Therefore, it is possible that software startups
rely relatively more on open-source communities than other startups given differences in the
production process of innovation.
In what follows, we set out to empirically assess whether and how startups rely on open-
source communities to attract funds in the context of GitHub. Here, we are in the unique
position of observing how organizations are using technology components both within and
outside of their control.
3 Empirical setting and data
3.1 GitHub and its pricing model
GitHub is a hosting service for software development and collaborative version control
based on Git. With 40 million public repositories in April 2021, GitHub is the largest host
of source code1 and has come to be known as the place “where the world builds software”
(GitHub.com). Individuals and organizations use GitHub to improve and upgrade their own
projects by accessing code and information from other existing projects and to contribute
to other projects. GitHub also has a history of being backed by a number of high-profile
investors (e.g. through a $100m investment by Andreessen Horowitz) and was acquired by
Microsoft for $7.5b in 2018.2
GitHub offers personal user accounts and organization accounts for teams. Personal
accounts that are linked to organization accounts appear as members. The source code on
GitHub is organized in repositories, that is, folders containing projects.
1 See https://GitHub.com/search?q=is:public, accessed April 25, 2021.
2 See https://techcrunch.com/2018/06/04/microsoft-has-acquired-GitHub-for-7-5b-in-microsoft-stock/, accessed April 25, 2021.
The version control system Git allows users to save snapshots of files within a repository. When
a user issues a commit, a snapshot of the file’s contents is created and associated with a
timestamp. Commits can be related to repositories of other users by initiating a request to
collaborate on a repository and recommending changes, which is called a pull request. The
owner of the repository can then decide whether to integrate the proposed changes into the
codebase. The owner of a repository can also add members, i.e. GitHub user accounts, and
grant permissions to view (private) repositories or add contents without having to issue a
pull request. In our analysis below, we use the addition of members to internally controlled
repositories as a measure of increased internal collaboration. There is another, and more
frequently used, way to interact with the repositories of other users, which is called forking.
With forking, a user makes a copy of another user’s repository, which is then integrated into
the initial user’s account. Forks are used to propose changes to someone else’s project by
commit and pull requests, or to use someone else’s project as an input to a new project.
GitHub’s pricing model is based on subscription fees. While either type of account could
always have an unlimited number of public repositories, before October 2015, the subscription
fee included 5 private repositories for individual accounts and 10 private repositories for
organization accounts. Adding additional private repositories was possible but nearly doubled
the monthly subscription fee. After October 2015, GitHub introduced new pricing plans that
included unlimited private repositories for all types of accounts while the fixed subscription
did not increase markedly. For organization accounts, the new price policy implied that it
became drastically cheaper to add members and repositories, effectively lowering the marginal
cost of using GitHub as a development platform for internal (private) repositories.3 We will
make use of this sharp discontinuity in the analysis below.
3 For more information, refer to https://docs.github.com/en/billing/managing-billing-for-your-github-account/about-per-user-pricing and https://www.infoworld.com/article/3069275/github-ushers-in-unlimited-private-repositories.html.
3.2 Data
To build our dataset, we combine data on US startups and their investors, which are
available on Crunchbase, with information on their respective GitHub activities available
from GHArchive and through the GitHub API.
3.3 Crunchbase
Crunchbase serves as our first source of information on startups. This online directory
records fine-grained information on the startups, their founders, and their investors. As
described by Conti and Roche (2021), a considerable portion of the data are entered by
Crunchbase staff, while the remaining part is crowdsourced. Registered members can enter
information into the database, which the Crunchbase staff subsequently reviews. Relative
to databases such as VentureXpert and VentureSource, Crunchbase has the advantage of
providing a larger coverage of technology startups as it also encompasses startups that did
not raise venture capital. From Crunchbase, we extracted information pertaining to all the
recorded US startups that were founded between 2005 and 2020. This amounts to 160,065
startups, for which we have data encompassing their founding dates, industry and industry
keywords, location, financing rounds and participating investors, as well as exit outcomes.
As shown in Table 1a, approximately half of the startups (46%) are located in California,
Massachusetts, and New York, reflecting the comparative advantage of these regions in
entrepreneurship. Thirty-six percent of them raised at least one round of financing. Addition-
ally, 8% of the startups were acquired as of December 2020 and 1.2% went public through an
IPO.
While Crunchbase does not categorize startups into sectors, it provides industry group
information for each of them. There are approximately 40 distinct industry groups, and,
on average, a startup is assigned three industry group keywords. Using this information,
we computed a variable to measure the relatedness of a startup’s technology to software,
where we expect most of the GitHub activities to be concentrated. This index is defined as
the share of a startup’s industry groups that are related to software. The groups related
to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics,
Design, Financial Services, Gaming, Information Technology, Internet Services, Messaging
and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software.
As shown in Table 1a, the mean of this index is 0.46.
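The construction of this index can be illustrated with a minimal sketch. The set of software-related groups is taken from the list above; the function name, the handling of startups without any industry group, and the example group "Health Care" are assumptions made for illustration and are not taken from our replication code.

```python
# Industry groups treated as software-related, as listed in the text.
SOFTWARE_GROUPS = {
    "Apps", "Artificial Intelligence", "Consumer Electronics",
    "Data and Analytics", "Design", "Financial Services", "Gaming",
    "Information Technology", "Internet Services",
    "Messaging and Telecommunications", "Mobile", "Payments",
    "Platforms", "Privacy and Security", "Software",
}


def software_relatedness(industry_groups):
    """Share of a startup's industry groups that are software-related."""
    groups = [g.strip() for g in industry_groups if g.strip()]
    if not groups:
        # Assumption: the index is left undefined when no group is recorded.
        return None
    related = sum(1 for g in groups if g in SOFTWARE_GROUPS)
    return related / len(groups)


# Hypothetical startup with three industry groups, two of them software-related.
example = software_relatedness(["Software", "Health Care", "Mobile"])
```

Under these assumptions, a startup tagged with three groups of which two are software-related receives an index of two thirds.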
〈 Insert Table 1 about here 〉
3.4 GitHub
We use the GitHub API to collect all organization accounts on GitHub. From these
accounts, we extract the websites of the account owners. We used this information to link
GitHub organization accounts to 10,482 Crunchbase company profiles. We further gather
time-variant information on the public events of all startups through GHArchive, which is a
non-profit service that provides a full record of the public timeline on GitHub since February
2011. This archive includes, among others, time-stamped data on events such as commits,
pull requests, and forks related to all own or external public repositories with which an
organization account interacts. In addition, we observe events related to the addition of new
members of repositories, and the creation of new public repositories or the transitioning of a
private repository to the public domain.
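A domain-based link of this kind can be sketched as follows. The field names (`homepage_url`, `website`, `login`, `name`), the normalization rules, and the decision to keep only unambiguous matches are hypothetical choices made for illustration; the actual matching procedure may differ in its details.

```python
from urllib.parse import urlparse


def normalize_domain(url):
    """Reduce a website URL to a lowercase host without a leading 'www.'."""
    if not url:
        return None
    url = url.strip().lower()
    if "://" not in url:
        url = "http://" + url  # urlparse needs a scheme to locate the host
    host = urlparse(url).hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def match_accounts(crunchbase_rows, github_orgs):
    """Link Crunchbase profiles to GitHub organizations that share a domain."""
    by_domain = {}
    for org in github_orgs:
        domain = normalize_domain(org.get("website"))
        if domain:
            by_domain.setdefault(domain, []).append(org["login"])
    matches = {}
    for row in crunchbase_rows:
        domain = normalize_domain(row.get("homepage_url"))
        logins = by_domain.get(domain, [])
        if len(logins) == 1:  # assumption: keep only unambiguous matches
            matches[row["name"]] = logins[0]
    return matches
```

Normalizing both sides before joining avoids missing matches that differ only in scheme, capitalization, or a `www.` prefix.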
3.5 Classification of GitHub repositories
We further employed the GitHub API to collect meta information on all public repositories
in a startup’s organization account, as well as all personal accounts that are members of a
given organization account.4
Building on this data, we distinguish between a startup’s public activities related to its
own repositories and its engagement with external repositories. As reported in Figure 1,
we observe a stark increase over time in both the number of public activities related to the
startups’ own repositories and the number of engagements with external repositories.
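The distinction between own and external repositories can be illustrated with a short sketch. It assumes that each event has already been attributed to a startup and is stored as a flat record with a `repo` field of the form `owner/name` and a `type` field; this record layout is a hypothetical simplification of the archive data, and the function names are illustrative.

```python
from collections import Counter


def repo_owner(repo_full_name):
    """Return the owner part of an 'owner/name' repository identifier."""
    return repo_full_name.split("/", 1)[0].lower()


def tally_events(org_login, events):
    """Count a startup's events by (own or external, event type)."""
    counts = Counter()
    for event in events:
        # A repository counts as "own" when its owner is the organization account.
        own = repo_owner(event["repo"]) == org_login.lower()
        counts[("own" if own else "external", event["type"])] += 1
    return counts
```

Tallies of this form are the kind of input from which shares of event types within each group of repositories can be computed.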
〈 Insert Figure 1 about here 〉
4 Note that account-specific meta-information is time-invariant. We collected it through separate crawls in October 2020, January 2021, and March 2021.
Descriptive statistics reported in Table 1b show the top five events through which startups
engage with external and internal repositories. In the case of external repositories, the share
of forks is very large (96%). In contrast, contributions to external repositories in the form
of pushing (that is, making commits accessible to others), commenting on issues, and pull
requests are rare (1% or less). This distribution provides some evidence that startups access
external repositories to upgrade their technologies and not just to provide comments or
contribute to other users’ projects. As for the distribution of internal events, it appears to be
relatively more spread out. The most popular events concern adding members to repositories
(46%) and making repositories public (17%), but pushing and creating new events are also
frequent.
Using supervised and unsupervised machine learning methods which we describe in detail
in the Appendix, we classify the public repositories of all organizations, as well as the
external repositories with which the organizations interacted through commits, pull requests
or forks according to their type. We distinguish between repositories that pertain to software
development/back end (SD/BE), machine learning (ML), application programming interface
(API), and user interface (UI). By doing so, we consider a comprehensive set of aspects that
are relevant for the development of a digital technology. Because founders increasingly rely on
GitHub to upgrade their human resources, we generate an additional category encompassing
public repositories related to best practices on HR management. For example, these include
guidelines for coding interviews and templates for company policies on issues such as working
from home, diversity and inclusion.
We report a descriptive representation of the output obtained from the algorithm in Figure
2. Here, we display the most common words for each of the categories we consider. Figure
3, instead, captures the relevance of each category across industry groups. As shown, there
is variation across industry groups in the relevance of the different categories we consider.
For instance, the share of commits, pull requests, and forks related to UI repositories is
smallest for startups in privacy/security and highest in more consumer-facing internet services.
Startups in software/IT have the highest share of interactions with API-related repositories,
but the lowest share with productivity-related repositories. This suggests that firm- and
industry-specific unobservables are important factors to control for in our econometric models.
〈 Insert Figures 2 and 3 about here 〉
4 Empirical specification
We begin our empirical investigation by descriptively assessing the correlation between a
startup having an account on GitHub – which is our proxy for a startup’s involvement in an
open-source community – and attracting funds from VCs and other investors. In a similar
vein as Bernstein et al. (2017), we also evaluate how the effect of a startup’s involvement
on GitHub compares to the effect of a startup’s human capital – as defined by whether a
startup’s founder or CXO5 is highly ranked on Crunchbase’s list of top people.6 To achieve
these goals, we estimate the following linear probability model:
$$Y_{ifjs} = \alpha + \beta_i\,\mathit{GitHub}_i + \gamma_i\,\mathit{TopTeam}_i + \eta_f + \nu_j + \psi_s + \varepsilon_{ifjs}, \qquad (1)$$
where Yifjs is an indicator that takes the value 1 if a startup i, founded in year f , developing
a technology in industry group j, and located in region s, raised at least one financing round
as of January 2021 and zero otherwise. The variable GitHubi is an indicator that takes
the value 1 if a startup has a GitHub account, while TopTeami is an indicator identifying
prominent founders and CXOs. The latter measure equals one if an employee is ranked
among the top 500 by Crunchbase and zero otherwise. In this regression, we include fixed
effects for a startup’s founding year (ηf) and for whether the startup is located in either
Massachusetts, New York, or California (ψs). We additionally include industry group fixed
effects (νj). The industry groups we consider are Information Technology, Software, Data
5 By CXOs we refer to Chief Executive Officers (CEOs), Chief Technology Officers (CTOs), Chief Financial Officers (CFOs), and Chief Marketing Officers (CMOs).
6 See here: https://www.crunchbase.com/discover/people.
Analytics, Internet Services, and Artificial Intelligence. As mentioned earlier, a startup can
be described by more than one industry group.
In the second part of the empirical analysis, we evaluate how a startup’s participation on
GitHub varies just before and after raising a round of financing. If investors value a startup’s
open innovation model, then we should observe that the startup is more active on GitHub in
the months preceding a given round and less active in the subsequent months. To test this
conjecture, we estimate a within-startup model, where we assess how the probability that a
startup engages in public activity on GitHub in month t varies during the twelve months
preceding and succeeding a startup’s financing round. To pursue this analysis, we restrict
the sample to startups that raised at least one round of financing and for which we could
identify a public GitHub account. Table 1c reports descriptive statistics for this sample. The
main model we estimate is:
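$$Y_{itjs} = \alpha + \beta_{it}\,\mathit{Post}_{it} + \gamma_{it}\,\tau_{it} + \delta_{it}\,\mathit{Post}_{it} \times \tau_{it} + \omega_i + \mu_{it} + \nu_{jt} + \psi_{st} + \rho_{it} + \varepsilon_{itjs}, \qquad (2)$$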
where Yitjs is an indicator that takes the value 1 if a startup i, developing a technology in
industry group j and located in region s, engages in at least one public activity – across
own and external repositories – on GitHub in month t and zero otherwise. In extensions to
our baseline analyses, we examine variants of this outcome to better understand the type
of GitHub activities in which a startup may engage. The indicator Postit is equal to one
for the twelve months that follow a startup’s given financing round, and zero in the twelve
months preceding it. While the focus of our analysis is on a startup’s first financing round,
we will also examine the reaction of startups as they relate to subsequent financing rounds.
The variable τit is the count of months to/from a given round. The coefficients of interest are
γ, which is associated with τit and δ, which is associated with the interaction between Postit
and τit. The γ represents the effect of an extra month towards a given financing round on
the likelihood that a startup engages in a public activity on GitHub, while δ represents the
effect of an extra month from a given financing round.
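As an illustration of how the regressors in Eq. (2) can be assembled, the following is a minimal sketch rather than the estimation code used for the paper. It assumes that τ is a signed month index relative to the round (negative before, positive after), absorbs only the startup fixed effect through within-startup demeaning, and uses a noise-free simulated outcome with hypothetical coefficient values so that recovery can be checked exactly. The names `build_design`, `within_ols`, and `true_coefs` are hypothetical.

```python
import numpy as np

def build_design(event_month):
    """Return columns [Post, tau, Post x tau] from signed months relative to the round.

    Assumption: event_month is negative before the round and positive after it.
    """
    tau = event_month.astype(float)
    post = (event_month > 0).astype(float)
    return np.column_stack([post, tau, post * tau])

def within_ols(y, X, groups):
    """OLS after demeaning y and X within each group (absorbs a group fixed effect)."""
    y_d = y.astype(float).copy()
    X_d = X.astype(float).copy()
    for g in np.unique(groups):
        mask = groups == g
        y_d[mask] -= y_d[mask].mean()
        X_d[mask] -= X_d[mask].mean(axis=0)
    coef, *_ = np.linalg.lstsq(X_d, y_d, rcond=None)
    return coef

# Balanced toy panel: 50 startups observed 12 months before and after the round.
months = np.array([m for m in range(-12, 13) if m != 0])
n_startups = 50
event_month = np.tile(months, n_startups)
startup = np.repeat(np.arange(n_startups), months.size)

rng = np.random.default_rng(0)
omega = rng.normal(0.08, 0.02, n_startups)[startup]  # startup fixed effects

X = build_design(event_month)
true_coefs = np.array([0.01, 0.0034, -0.002])  # hypothetical beta, gamma, delta
y = omega + X @ true_coefs  # noise-free outcome, for checking recovery only

beta_hat = within_ols(y, X, startup)
```

With real data the outcome is a binary indicator and the model carries the additional fixed effects described in the text, so a dedicated panel estimator would be the natural tool; the sketch only shows how Post, τ, and their interaction enter the design.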
Ideally, we would be able to isolate the effect of engagement on GitHub from other
confounding factors such as a startup’s technology, the life cycle of a given technology, and
technology shocks. This is crucial given that these simultaneously occurring features may be
responsible for any relationship we detect. To get as close as possible to an ideal experiment,
we saturate the model with a fine-grained set of relevant fixed effects. In particular, ωi
is a focal startup’s fixed effect, which controls for time-invariant differences across our sample
companies. For instance, it is possible that some startups are intrinsically more likely to
rely on GitHub, given the characteristics of the technologies they develop. A startup’s age
fixed effect is denoted by µit. The inclusion of this fixed effect addresses the concern that any
variation in the engagement of a startup on GitHub may be driven by a startup’s position in
the life cycle rather than by its intent to attract funding. In a more stringent specification,
we add age times startup and Postit times startup fixed effects given that life cycle effects
may vary by company and round raised. Postit times startup fixed effects also control for
potential investor requests post-investment, which may systematically vary from one startup
to another. We additionally include industry group times year fixed effects (νjt) to mitigate
the concern that the coefficient δ may be confounded by technology shocks, which could
happen at around the time a startup raises a financing round and induce the startup to
rely more intensely on GitHub. Moreover, we include location times year fixed effects (ψst),
distinguishing between California, Massachusetts, and New York (that is, the US technology
and entrepreneurial hubs) and all other states. The rationale is that possible technology
shocks happening in any given period should originate from these hubs. Finally, we include
fixed effects for whether a startup raised a subsequent round in any of the t months following
the observed round (ρit).
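The grouping behind the industry group times year and location times year fixed effects can be sketched as follows. This is an illustrative sketch only: the two-letter state codes, the function names, and the key format are assumptions, not the variable definitions used in the estimation.

```python
# Assumption: states are given as two-letter postal codes.
HUB_STATES = {"CA", "MA", "NY"}

def region_group(state):
    """Map a state to itself if it is one of the three hubs, else to 'Other'."""
    return state if state in HUB_STATES else "Other"

def fixed_effect_keys(state, industry_groups, year):
    """Build the region-by-year key and one industry-group-by-year key per group.

    A startup can be described by more than one industry group, so the
    industry keys are returned as a list.
    """
    return {
        "region_year": (region_group(state), year),
        "industry_year": [(group, year) for group in industry_groups],
    }

keys = fixed_effect_keys("TX", ["Software", "Data Analytics"], 2016)
```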
5 Results
This section explores (i) the correlation between having a GitHub account and obtaining
funds as well as (ii) the effect of raising a financing round on the evolution of a startup’s
activities on GitHub.
5.1 Raising a financing round
The results from estimating Eq. (1) are reported in Table 2. For this analysis, we consider
the full set of 160,065 startups. We cluster standard errors by founding year. The dependent
variable is the likelihood of having raised a financing round as of December 2020. As shown
in column (1), startups with a GitHub account are 0.21 percentage points more likely to have
raised a financing round. The magnitude of this correlation is smaller than the one between
having a prominent founder or CXO and raising a financing round (0.35 percentage points),
but remains economically relevant. While these correlations must be interpreted with caution
because there are several confounding factors that could bias our estimates, they suggest
that investors value a startup’s involvement with GitHub in addition to its founding team.
These descriptive relationships are confirmed in column (2) where we interact the GitHub
indicator with the TopTeam dummy to try to net out the impact of those confounding
factors that could be related to both a startup’s participation on GitHub and its founding
and/or top management team. Here, our results imply that having a GitHub account is
associated with a 0.21 percentage points higher likelihood of raising a financing round for
those startups without top founders and CXOs, which is equivalent to a 58% increase relative
to the unconditional probability of raising a round. For comparison, having a prominent
founder and/or CXO is associated with a 0.39 percentage points higher likelihood of raising a
financing round for those startups without a GitHub account, equivalent to a 109% increase
relative to the unconditional probability of raising a round. The implied magnitudes remain
similar in column (3) where we replace industry group fixed effects with the share of industry
groups that are related to software.
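The relative magnitudes quoted above follow from dividing each coefficient by the unconditional probability of raising a round. As a back-of-the-envelope check (a sketch that uses only the figures in the text; `implied_baseline` is a hypothetical helper), both pairs of numbers imply roughly the same baseline, expressed in the same units as the coefficients:

```python
def implied_baseline(coefficient, relative_increase):
    """Baseline probability implied by a coefficient and its relative increase."""
    return coefficient / relative_increase

# Column (2): GitHub account, startups without top founders and CXOs.
github_baseline = implied_baseline(0.21, 0.58)
# Column (2): prominent founder and/or CXO, startups without a GitHub account.
team_baseline = implied_baseline(0.39, 1.09)
```

Both values are close to 0.36, which suggests that the two percentage changes are computed against the same unconditional probability.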
Overall, these results are in line with prior work (Bernstein et al., 2017; Gompers et al.,
2020; Kaplan et al., 2009) and suggest that while investors value a startup’s team, they also
positively consider its technology investments. In contrast to these studies, our focus in
the next section will be on startups’ engagement with open-source communities. We will
assess whether new ventures accelerate these investments prior to raising a financing round
and, if so, the types of investments in which they engage and the heterogeneity in investor
type preferences.
〈 Insert Table 2 about here 〉
5.2 Startup GitHub activities before and after raising a financing round
In this section, we examine how a startup’s activities on GitHub vary before and after
a given financing round. For this analysis, we consider the sample of 10,482 startups that
raised at least one financing round and had an account on GitHub. We begin by reporting
the main results and successively explore the mechanisms driving them.
5.2.1 Main results
Do investors value a startup’s engagement on GitHub? As we mentioned earlier, while such
engagement allows a firm to upgrade its technology, it is not exempt from drawbacks, which
include appropriability threats and coordination costs. If investors were to value a startup’s
participation on GitHub, we should observe the startup to become more active on GitHub
during the months preceding a given round relative to the subsequent months. If, instead,
investors do not value a startup’s reliance on the GitHub community, or at least do not value
it as much as they would value other startup aspects, then a startup may not intensify its
engagement on GitHub prior to raising a round. To assess startup open-source strategies
prior to raising a financing round, we first provide visual evidence for the relationship in
the raw data, and then we estimate Eq. (2). We report the results in Figure 4 and Table 3.
Here, we specifically focus on a funded startup’s first financing round. In the next section,
we extend the analysis to a startup’s subsequent rounds. We cluster standard errors at the
startup level.
Figure 4 offers a first glance into the raw relationship between GitHub engagement and
the time before and after raising a first round of financing. We depict startups’ average
likelihood of having an activity recorded by months prior to and following a first round. As
shown, the proportion of startups engaging in public activity on GitHub increases from 0.06
to 0.1 in the twelve months preceding a first financing round and remains stable around 0.1
to 0.11 in the subsequent twelve months.
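The raw pattern described above is a simple average of the activity indicator by month relative to the first round. A minimal sketch of that aggregation, with hypothetical toy records and a hypothetical function name, could look like this:

```python
from collections import defaultdict

def share_active_by_event_month(records):
    """Average a 0/1 activity indicator by month relative to the round.

    records: iterable of (event_month, active) pairs, one per startup-month.
    """
    totals = defaultdict(int)
    actives = defaultdict(int)
    for event_month, active in records:
        totals[event_month] += 1
        actives[event_month] += int(bool(active))
    return {m: actives[m] / totals[m] for m in sorted(totals)}

# Toy data (hypothetical): four startup-months observed at each of two event months.
toy_records = [(-12, 0), (-12, 0), (-12, 0), (-12, 1),
               (-1, 1), (-1, 0), (-1, 1), (-1, 0)]
shares = share_active_by_event_month(toy_records)
```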
〈 Insert Figure 4 about here 〉
In Table 3 we move from the raw pattern to the parametric relationship. Column (1)
includes startup and year fixed effects in addition to the fixed effects for whether a startup
raised a second round in any of the twelve months following the first round. We further
control for a startup’s position in the life cycle by including the natural logarithm of a
startup’s age. The positive coefficient associated with the indicator Postit suggests that, all
else equal, startups become more active on GitHub after they raise their first financing round.
The coefficient associated with the time trend τit is positive, indicating that the likelihood
that a startup engages in a public activity on GitHub increases as the startup gets closer
to raising its first round. The magnitude of the coefficient can be interpreted such that an
extra month closer to the first round is associated with a 0.34 percentage point increment in
the likelihood that a startup engages in a public activity on GitHub. This implies that an
extra twelve months towards raising a first round increases the probability of being active on
GitHub by 4 percentage points. This effect represents a 50% increase relative to the outcome
mean. Moreover, the negative coefficient of the interaction between Postit and τit suggests
that a startup’s participation in GitHub becomes less intensive the further in time it is from
the first round. The magnitude of the coefficient suggests that an extra month away from
a startup’s first round reduces the positive slope of the probability that a startup engages
in a public activity on GitHub by 0.2 percentage points. This effect entails that an extra
twelve months away from a first round lowers the positive increment in the probability of
being active on GitHub by 2.4 percentage points.
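The arithmetic behind these magnitudes can be written out as follows. This is a worked restatement of the figures quoted in the text, where $\hat{\gamma}$ and $\hat{\delta}$ denote the column (1) estimates in percentage points per month. The last two lines are implied values rather than reported estimates.

```latex
\begin{align*}
12 \times \hat{\gamma} &= 12 \times 0.34 = 4.08 \approx 4
  && \text{(twelve-month increment before the round)} \\
12 \times \hat{\delta} &= 12 \times (-0.2) = -2.4
  && \text{(change in the twelve-month increment after the round)} \\
\hat{\gamma} + \hat{\delta} &= 0.34 - 0.2 = 0.14
  && \text{(implied monthly slope after the round)} \\
\bar{Y} &\approx 4 / 0.5 = 8
  && \text{(outcome mean implied by the 50\% figure)}
\end{align*}
```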
In column (2) of Table 3, we add region by year and industry group by year fixed effects.
As shown, the magnitudes of the coefficients remain very similar to those displayed in column
(1). In column (3), we replace our industry group keywords with the share of a startup’s
industry group keywords that are related to software. Having interacted the newly generated
variable with year fixed effects, we show that the results remain invariant.
In column (4), we estimate a similar model as the one in column (2), this time replacing
the natural logarithm of a startup’s age with age fixed effects. This specification, which is
our preferred one, delivers very similar results to those reported in columns (1) to (3).7 In
column (5), we estimate a more stringent specification than in column (4), allowing for the
possibility that life cycle effects may vary by startup and by whether the startup has raised a
first round or not. For this purpose, we include age times startup and Postit times startup
fixed effects. The magnitudes of the effects are stronger than those displayed in the previous
columns and the results continue to show that a startup’s engagement on GitHub intensifies
prior to raising a first round and fades afterwards.
Finally, in column (6) of Table 3, we add lead investor fixed effects to the specification in
column (5). The reason for including these investor fixed effects is that a startup’s technology
trajectory may vary depending on the type of investors it attracts. To identify lead investors,
we considered the categorization of lead investors provided by Crunchbase. When this was
missing, we considered the investor who backed the highest number of a focal startup’s
financing rounds as the lead investor. By adopting this procedure, we identified lead investors
for 62% of the startups. Reassuringly, the results continue to hold.8
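The two-step rule for identifying lead investors can be sketched as below. The data layout (a list of rounds, each with an optional `lead` field and a list of `investors`), the choice of the first flagged round, and the alphabetical tie-break are all assumptions made for illustration. The text does not specify how ties were handled.

```python
from collections import Counter

def identify_lead_investor(rounds):
    """Return the flagged lead investor, else the investor backing the most rounds.

    rounds: list of dicts with an optional 'lead' name and an 'investors' list.
    Returns None if no investor information is available.
    """
    # Step 1: use the lead-investor categorization when it is present.
    for financing_round in rounds:
        if financing_round.get("lead"):
            return financing_round["lead"]
    # Step 2: fall back to the investor who backed the highest number of rounds,
    # counting each investor at most once per round.
    counts = Counter(
        name for financing_round in rounds
        for name in set(financing_round.get("investors", []))
    )
    if not counts:
        return None
    top = max(counts.values())
    # Assumption: ties are broken alphabetically to keep the result deterministic.
    return min(name for name, n in counts.items() if n == top)

flagged = identify_lead_investor(
    [{"investors": ["Fund A", "Fund B"], "lead": "Fund B"}]
)
fallback = identify_lead_investor(
    [{"investors": ["Fund A", "Fund B"]}, {"investors": ["Fund A", "Angel C"]}]
)
```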
〈 Insert Table 3 about here 〉
While the findings presented so far suggest that investors value a startup’s participation
on GitHub, all else equal, it is possible that such participation reflects a startup’s technology
life cycle rather than its reliance on an open-source community. The inclusion of startup
and startup by age fixed effects should at least in part control for a startup’s technology
aspects, but some of them may remain unaccounted for. To further address this concern,
7 Figure 9 graphically displays the results from this specification, having substituted τit with indicators for each month preceding and following a given round. We observe an acceleration in a startup’s participation in GitHub as the first round approaches and a deceleration thereafter.
8 Results available upon request show that the effects do not change by adding lead investor times age and lead investor times Postit fixed effects.
we delve deeper into the type of GitHub activities in which a startup engages. For this
purpose, we modify the dependent variable in Eq.(2) and examine whether a startup engages
in public activities related to its own repositories. This measure proxies for a startup’s
internal technology investments. Additionally, we investigate whether a startup engages with
external repositories. Since a startup does not have direct control over these repositories,
this additional outcome provides an indication of whether a startup builds on repositories
controlled by other GitHub users to develop its technologies. The results from this analysis
are reported in Table 4 and in Figure 5. They show that, prior to raising a first round, startups
accelerate their engagement with external repositories more than they do for their internal
repositories. These findings support our interpretation that, all else equal, investors value a
startup’s engagement with open source communities and provide an indication that reliance
on external knowledge is at least as important as internally developed knowledge to upgrade
a startup’s technology and attract investments.9
〈 Insert Table 4 and Figure 5 about here 〉
We enrich our analysis by further assessing how a startup’s reliance on external technology
sources depends on their relative cost. For this purpose, we exploit a source of exogenous
variation in GitHub’s pricing model which occurred in October 2015. While the old pricing
model was repository based, implying that users had to pay for each new private repository,
the new pricing scheme was a user-based one. Upon paying their subscription, companies
would get unlimited private repositories allowing their employees to more efficiently organize
their collaborations within the company. While the marginal costs of internal collaborations
declined, the new pricing scheme left the marginal cost of interacting with external repositories
unchanged. As displayed in Figure 6, the decline in the marginal cost of internal collaborations
had a considerable positive impact on the number of such internal collaborations – proxied
9 In results left unreported, we find that the results remain similar when we use the ratio of external to internal activities as an outcome. With this specification we de-trend the trajectory of a startup’s technology. Further, we obtain similar results when we only retain the most common events through which startups access external repositories, that is, the forks.
by member events – starting from the moment the new pricing model was implemented.
Conversely, startups’ engagement with external repositories or other internal repositories
unrelated to internal collaborations remained unchanged.
〈 Insert Figure 6 about here 〉
The important feature of this pricing reform is that it is uncorrelated with the technology
life cycle of a startup. As a result, we can assess how exogenous variations in the relative
cost of accessing external sources of knowledge affect startups’ willingness to rely on
open-source communities to develop their technologies and attract funds. For this purpose, we
modify Eq. (2) including an indicator – NewPriceSchemet – identifying the period following
the introduction of the new pricing scheme. We additionally introduce interaction terms
between NewPriceSchemet and the following variables: Postit, τit, and Postit × τit.
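One way to assemble the additional regressors is sketched below. It is illustrative only and assumes that the indicator switches on from October 2015 onward and that τ is a signed month index relative to the round. The function name and column order are hypothetical.

```python
import numpy as np

def price_reform_design(year, month, event_month):
    """Columns: Post, tau, Post x tau, NewPriceScheme, and NewPriceScheme
    interacted with each of the first three columns (seven columns in total).
    """
    # Assumption: the new scheme applies to calendar months from October 2015 on.
    new_scheme = ((year > 2015) | ((year == 2015) & (month >= 10))).astype(float)
    post = (event_month > 0).astype(float)
    tau = event_month.astype(float)
    base = np.column_stack([post, tau, post * tau])
    return np.column_stack([base, new_scheme, base * new_scheme[:, None]])

# Three toy startup-months: September 2015, October 2015, and March 2016.
year = np.array([2015, 2015, 2016])
month = np.array([9, 10, 3])
event_month = np.array([-2, 3, -5])
design = price_reform_design(year, month, event_month)
```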
The results are reported in Table 5 and Figure 7. The dependent variables we consider
are: an indicator that equals one if a startup engages in a member event in month t and zero
otherwise (column 1); an indicator for whether a startup engages in a non-member, internal
event (column 2); and an indicator for whether a startup engages with an external repository
(column 3). To ensure that the introduction of the new pricing scheme is not correlated with
a startup’s characteristics and the life cycle of its technology, we control for the full set of
fixed effects.
As shown, prior to the pricing reform, startups would engage with external repositories
to develop their technologies on the verge of raising their first financing round. Conversely,
the number of member events was close to zero and invariant in the months preceding and
following the financing round. After the pricing reform, we observe that firms accelerate
their internal collaborations, but less so their engagement with external repositories, in the
months preceding their first financing round. Overall, these findings suggest that a startup’s
decision to integrate external sources of knowledge in the production of its technologies
crucially depends on their relative cost: startups opt for these external sources when the cost
of accessing them is comparatively low.
〈 Insert Table 5 and Figure 7 about here 〉
Delving deeper into the heterogeneity of a startup’s digital technology investments, we
next categorize a startup’s activities on GitHub according to whether they pertain to SD/BE,
ML, API, and UI. By doing so, we consider all the aspects related to the production and
development of a digital technology, ranging from the back end, data analysis, the front end,
and the interconnection between front end and back end. Since several practitioner reports
have pointed out that founders increasingly rely on GitHub to upgrade their human resources,
we additionally examine whether a startup resorts to GitHub to access best practices on HR
management (Jain, 2018).
The results are reported in Table 6 and graphically depicted in Figure 8. In Panel A of
Table 6, we focus on GitHub public activities related to a startup’s own repositories, while in
Panel B we examine a startup’s engagement with external repositories. In column (1) we show
that, prior to raising a first round, startups intensify both their internal activities and their
engagement with external repositories related to SD/BE. However, while a startup’s external
engagement significantly fades after raising a first round, the increment in the likelihood of
investing in SD/BE does not significantly change after raising a first round. In column (2),
we show that a startup’s investment in ML prior to raising a first round predominantly rests
on a startup’s engagement with external repositories. In column (3), where we consider a
startup’s investment in API, we again find a steady increase in the likelihood that a startup
engages with both internal and external repositories. However, the engagement with external
repositories fades after the round is raised. In column (4), we do not observe significant
variations in the likelihood that a startup engages in a UI activity, pre- and post-round, both
in Panel A and in Panel B. This may be due to the fact that the importance of UI rises
towards the later phases of a startup’s digital development, which may take place well after a
first round is raised. Finally, we show that startups rely on external repositories to access best
practices on HR management. These patterns are graphically depicted in Figure 8. Overall,
these results suggest that startups accelerate their investments in the different aspects of
their digital technologies in order to attract funds by, especially, relying on external sources
of knowledge.
〈 Insert Table 6 and Figure 8 about here 〉
One possibility is that the increased involvement of startups on GitHub simply acts as a
signal for potential investors rather than helping to improve a startup’s innovation stack (as we
have interpreted our findings so far). The fact that startups – prior to receiving a first round
– intensify their GitHub involvement along the full spectrum of digital technology production
provides initial evidence that signaling is unlikely to be the sole driver of our findings.
To provide further insight on the matter, we test whether startups are more likely to
make their new or previously private repositories public before raising a financing round,
which should occur if the main purpose of engagement on GitHub is signaling. In Table
7, we examine the likelihood of making at least one new or previously private repository
public, as measured by “Public Events” on the GitHub timeline of an organization account.
Across all specifications, which we produce by estimating the same models as displayed in
Table 3, we observe an increasing trend in turning private repositories public over time. Most
importantly, we do not detect any significant change in the trend after receiving a first round
of financing. These results suggest that startups are at least as likely to make their new or
previously private repositories public before and after raising a financing round.
〈 Insert Table 7 about here 〉
While the evidence reported in Table 7 speaks against the potential explanation that
our main results are entirely driven by signaling, it additionally suggests that a startup’s
deceleration of its involvement on GitHub post-round is unlikely to be the sole consequence of
its investors’ distaste for their investees’ public engagement. Nevertheless, it is still possible
that investors may require their startups to modify the modalities of participation on GitHub.
To explore this issue further, we examine the types of licenses that startups choose for their
new repositories before and after raising a financing round. In particular, we distinguish
between more or less permissive strategies based on whether or not they grant use rights,
including the right to re-license.10 The results from this analysis are reported in Table 8.
Here, we show that startups become significantly less likely to adopt permissive licenses for
the new repositories they create, the further in time they are from the first round. This result
suggests that investors may be sensitive to appropriability concerns, inducing their portfolio
startups to engage on GitHub through less permissive licenses.
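A license outcome of this kind can be constructed along the following lines. The sketch uses a small, illustrative set of SPDX-style identifiers for the permissive families named in the paper (BSD, MIT, Apache, and CC-BY). The exact identifiers, the helper names, and the treatment of missing licenses as non-permissive are assumptions, not the classification used in the analysis.

```python
# Illustrative subset of permissive licenses, written as SPDX-style identifiers.
PERMISSIVE_LICENSES = {
    "MIT", "Apache-2.0", "BSD-2-Clause", "BSD-3-Clause", "CC-BY-4.0",
}

def is_permissive(license_id):
    """True if the identifier is in the illustrative permissive set.

    Assumption: repositories without a license are treated as non-permissive.
    """
    return license_id is not None and license_id.strip() in PERMISSIVE_LICENSES

def share_permissive(license_ids):
    """Share of new repositories that carry a permissive license."""
    if not license_ids:
        return 0.0
    return sum(is_permissive(lic) for lic in license_ids) / len(license_ids)

share = share_permissive(["MIT", "GPL-3.0-only", None, "Apache-2.0"])
```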
On the whole, these findings – together with our baseline results reported in Table 3 and in
Figure 4 – indicate that startups are, indeed, “beefing up” their IT before they receive early-
stage financing by relying on open-source communities rather than merely showcasing their
technology to potential investors. They additionally indicate that – post-round – startups
become increasingly aware of appropriability issues, choosing less permissive open source
licenses for their repositories.
〈 Insert Table 8 about here 〉
5.2.2 Mechanisms
In this section, we explore the mechanisms driving our results. We begin by assessing
whether a startup’s engagement on GitHub varies depending on the type of technology a
startup develops. We then evaluate whether such engagement is also detected prior to raising
a second or a third financing round. Finally, we examine heterogeneity in startup response
depending on the type of investors participating in a given round.
In Table 9, we investigate how the engagement on GitHub prior to raising a first financing
round varies among startups that we classify as being software intensive according to the
industry groups specified on Crunchbase. The rationale for this analysis is that GitHub is
primarily used by companies for which software development is a core business activity and,
thus, we may expect stronger effects for these companies. Consistent with this conjecture,
10 Examples of permissive licenses are BSD, MIT, Apache, and CC-BY. For a comprehensive list, refer to https://gist.github.com/nicolasdao/a7adda51f2f185e8d2700e1573d8a633.
the results in columns (1) and (2) show that “software startups” become relatively more
involved in GitHub prior to raising funds than the other startups. After both sets of startups
raise their first round, their activities on the GitHub platform decelerate.
〈 Insert Table 9 about here 〉
The relevance of a startup’s engagement on GitHub may vary depending on whether
a startup seeks to raise a first financing round or subsequent rounds. While investors
participating in a startup’s first round may value a startup’s technology investments relatively
more, investors participating in follow-on rounds may prefer other aspects, such as a startup’s
marketing efforts (Wasserman, 2003). As a result, a startup’s involvement with the GitHub
community may matter more during a startup’s earliest rounds and progressively fade as
additional rounds are raised. To assess this conjecture, we re-estimate Eq. (2) for a startup’s
second and third round, respectively. The results are reported in Table 10 and in Figure 9.
We adopt the same equation specification as reported in column (4) of Table 3. As shown,
the increasing trend in the likelihood that a startup engages in a public activity on GitHub
exhibits the steepest slope in the twelve months prior to raising a first round and is flat in
the twelve months prior to raising a third round.
〈 Insert Table 10 and Figure 9 about here 〉
We next assess how a startup’s engagement on GitHub varies with the type of investor
participating in a startup’s first round. Specifically, we generate an indicator for whether
at least one of the participating investors in a startup’s first round is a VC. The reason
for focusing on VCs is that they have been found to positively value a startup’s technology
relative to other investors (Conti et al., 2013b) and to be especially active in screening and
nurturing their portfolio startups (Bernstein et al., 2016; Sørensen, 2007). To investigate this
potential source of heterogeneity, we modify Eq. (2) adding the interactions between V C and
Postit and V C and τit, as well as the triple interaction between V C, Postit, and τit. The
results are reported in column (1) of Table 11.
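To make the role of the interactions explicit, the modified specification can be written schematically as below. The coefficient names $\theta_1$, $\theta_2$, and $\theta_3$ are hypothetical labels introduced here for illustration, and the fixed effects are those of Eq. (2). The sketch further assumes that $VC_i$ is fixed within a startup for the first round, so its main effect is absorbed by $\omega_i$.

```latex
\begin{align*}
Y_{itjs} &= \alpha + \beta\,Post_{it} + \gamma\,\tau_{it} + \delta\,(Post_{it} \times \tau_{it}) \\
         &\quad + \theta_1\,(VC_i \times Post_{it}) + \theta_2\,(VC_i \times \tau_{it})
           + \theta_3\,(VC_i \times Post_{it} \times \tau_{it}) \\
         &\quad + \omega_i + \mu_{it} + \nu_{jt} + \psi_{st} + \rho_{it} + \varepsilon_{itjs}
\end{align*}
```

Under this notation, the implied monthly slopes are $\gamma$ (pre-round, no VC), $\gamma + \theta_2$ (pre-round, VC), $\gamma + \delta$ (post-round, no VC), and $\gamma + \delta + \theta_2 + \theta_3$ (post-round, VC).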
As shown, the likelihood that a startup engages in a public activity on GitHub increases
as the startup approaches the first round date, regardless of whether a VC participates in
the round or not. However, the increment is significantly larger when a VC participates in
the round. Post-round, the slope of the time trend declines for both VC-led and non-VC-led
rounds. This suggests that VCs value, more than other investors, the fact that their portfolio
startups rely on open-source communities to develop their digital technologies.
A related aspect we consider is whether startups react differently depending on how
successful the investors participating in a startup’s first round are. We examine two measures
of investor success. The first is the number of investments an investor made in the five years
prior to an observed startup’s first round. The second is the number of successful investments
made during the same time period. A successful investment is the backing of a startup
that ultimately exits via an acquisition or an IPO. For each observed round, we retained
the maximum number of investments (or successful investments) made by the investors
participating in the round. The results are reported in columns (2) and (3) of Table 11. As
shown, the slope of the likelihood that a startup engages in a public activity on GitHub
prior to raising a first financing round is steeper the larger the number of past (successful)
investments made by the most successful investor in a round.
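The construction of the two investor-success measures can be sketched as follows. The data layout, the function names, and the exact window boundaries (five calendar years up to, but excluding, the round date) are assumptions made for illustration.

```python
from datetime import date

def years_before(d, years):
    """Return the date `years` calendar years before d (29 Feb falls back to 28 Feb)."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)

def max_prior_investments(round_date, investor_histories, successful_only=False):
    """Maximum, across a round's investors, of investments made in the prior five years.

    investor_histories: {investor: [(investment_date, exited), ...]}, where
    `exited` marks a startup that ultimately exited via acquisition or IPO.
    """
    window_start = years_before(round_date, 5)
    best = 0
    for deals in investor_histories.values():
        n = sum(
            1 for deal_date, exited in deals
            if window_start <= deal_date < round_date
            and (exited or not successful_only)
        )
        best = max(best, n)
    return best

# Hypothetical investors in a round raised on 1 May 2018.
histories = {
    "Fund A": [(date(2014, 3, 1), True), (date(2016, 6, 1), False),
               (date(2010, 1, 1), True)],
    "Angel B": [(date(2017, 1, 15), True)],
}
n_all = max_prior_investments(date(2018, 5, 1), histories)
n_successful = max_prior_investments(date(2018, 5, 1), histories, successful_only=True)
```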
〈 Insert Table 11 about here 〉
6 Discussion and Conclusions
In this paper, we analyze unique data linking Crunchbase profiles to accounts on the
software development platform GitHub. We investigate the role of open-source technology
investments among nascent firms in raising funds. Our results suggest that participation in
open-source communities plays an important role in achieving funding milestones. Specifically,
we observe a stark increase in activities on GitHub prior to a firm raising its first financing
round, an effect that dissipates in the months thereafter. These results, which we obtain by
controlling for fixed differences among startups and their technologies, life cycle effects,
and technology shocks, provide an indication that earliest-round investors value a startup’s
involvement with open source communities.
Particularly noteworthy is that startups intensify those GitHub activities that rely on
external technology sources above and beyond the technologies they themselves control,
particularly when the marginal cost of accessing these sources is reduced by an exogenous price
change for using the GitHub platform. The most prevalent among these external activities
are related to software development, data analytics, and integration of their technologies.
Additionally, startups resort to GitHub to access best practices on HR management, suggesting
that investors also value the ability of startups to upgrade their HR resources through GitHub.
We provide further evidence showing that the increased involvement of startups on
GitHub improves a startup’s innovation pipeline and does not merely act as a signal for
potential investors. Startups appear to be at least as likely to make new or previously private
repositories public before the first financing round as they are after the first round. In addition
to suggesting that signaling is not the only driver of our findings, this evidence speaks against
the potential explanation that startups decelerate their involvement on GitHub post-round
because of investors’ distaste for their investees’ public engagement. However, we do find that
after a startup raises a financing round, the likelihood that it chooses a permissive license for
its repositories significantly declines, providing an indication that investors may be sensitive
to appropriability concerns. Finally, our results further indicate that VCs and renowned
investors are the most responsive to these GitHub activities.
This study contributes to increasing our understanding of the different channels through
which technology can be sourced, which technology aspects matter for attracting funding and
how, and investor preferences. As such, our findings extend the literature that has
investigated how open innovation influences a firm’s ability to innovate and appropriate
the benefits of innovation (Dahlander and Gann, 2010), and the role of open source for
firm productivity (Nagle, 2018, 2019; Shah and Nagle, 2019). We highlight a novel channel
through which startups benefit from using and actively engaging in open source software
endeavors. In our context, technology startups rely heavily on external sources to develop –
“beef up” – their own technologies and their involvement in open-source communities matters
for attracting investors, particularly VCs and successful investors, during startups’ early
stages. Further, we contribute to the entrepreneurial finance literature that has investigated
whether VCs invest in the founding team or the technology (Bernstein et al., 2017; Gompers
et al., 2020; Kaplan et al., 2009). We provide evidence that open-source engagement in
development, scaling, and integration matters more than investing in the user interface, at least
for raising the first round of financing. Finally, our use of machine learning algorithms to
classify startups’ activities on GitHub builds on an emerging line of research that applies
sophisticated data techniques to categorize firm strategies (Conti et al., 2020; Guzman and
Li, 2019).
The external validity of our approach may be limited given that we focus on a specific
open-source platform. However, GitHub is the largest host of source code with over 40 million
public repositories to date and anecdotal evidence suggests that investors take public GitHub
activities into consideration in their due diligence efforts (Jain, 2018). Although our results
are based on particular activities on a specific online platform, we believe they have broader
implications, especially for early stage ventures.
In conclusion, this paper contributes to our understanding of the impact of early-stage tech-stack
investments on achieving funding milestones. By opening the technology “black-box”, we
reveal important nuances that have been largely overlooked in the literature. In particular, our
finding that nascent firms extensively rely on external sources to develop, scale, and integrate
their own technologies may provide promising avenues for future research, as it appears
contradictory to the classical notions of firm creation (Coase, 1937; Lippman and Rumelt,
2003; Zenger et al., 2011), organizational boundaries (Felin and Zenger, 2014; Lakhani et al.,
2013a), resources (Barney, 1991; Wernerfelt, 1984), and the need for intellectual property
protection to spur innovation (Baldwin and Hippel, 2011; Schumpeter, 1934).
References
Almirall, E. and Casadesus-Masanell, R. (2010). Open versus closed innovation: A model of
discovery and divergence. Academy of Management Review, 35(1):27–47.
Arora, A., Fosfuri, A., and Gambardella, A. (2001). Markets for technology and their
implications for corporate strategy. Industrial and Corporate Change, 10(2):419–451.
Arora, A. and Gambardella, A. (1990). Complementarity and external linkages: The strategies
of the large firms in Biotechnology. Journal of Industrial Economics, pages 361–379.
Baldwin, C. and Hippel, E. v. (2011). Modeling a paradigm shift: From producer innovation
to user and open collaborative innovation. Organization Science, 22(6):1399–1417.
Barney, J. (1991). Firm resources and sustained competitive advantage. Journal of Manage-
ment, 17(1):99–120.
Bernstein, S., Giroud, X., and Townsend, R. R. (2016). The impact of venture capital
monitoring. Journal of Finance, 71(4):1591–1622.
Bernstein, S., Korteweg, A., and Laws, K. (2017). Attracting early-stage investors: Evidence
from a randomized field experiment. Journal of Finance, 72(2):509–538.
Buss, P. and Peukert, C. (2015). R&D outsourcing and intellectual property infringement.
Research Policy, 44(4):977–989.
Chesbrough, H. (2006a). New puzzles and new findings. In Chesbrough, H., Vanhaverbeke, W.,
and West, I., editors, Open Innovation: Researching a New Paradigm. Oxford University
Press, Oxford, UK.
Chesbrough, H. (2006b). Open business models: How to thrive in the new innovation landscape.
Harvard Business Press.
Coase, R. H. (1937). The nature of the firm. Economica, 4(16):386–405.
Cohen, W. M. and Levinthal, D. A. (1990). Absorptive capacity: A new perspective on
learning and innovation. Administrative Science Quarterly, pages 128–152.
Conti, A., Guzman, J., and Rabi, R. (2020). Information frictions in the market for startup
acquisitions. Available at SSRN 3678676.
Conti, A. and Roche, M. P. (2021). Lowering the bar? External conditions, opportunity
costs, and high-tech start-up outcomes. Organization Science.
Conti, A., Thursby, J., and Thursby, M. (2013a). Patents as signals for startup financing.
Journal of Industrial Economics, 61(3):592–622.
Conti, A., Thursby, M., and Rothaermel, F. T. (2013b). Show me the right stuff: Signals for
high-tech startups. Journal of Economics & Management Strategy, 22(2):341–364.
Dahlander, L. and Gann, D. M. (2010). How open is innovation? Research Policy, 39(6):699–
709.
David, P. A. and Greenstein, S. (1990). The economics of compatibility standards: An
introduction to recent research. Economics of Innovation and New Technology, 1(1-2):3–41.
Di Stefano, G., Gambardella, A., and Verona, G. (2012). Technology push and demand
pull perspectives in innovation studies: Current findings and future research directions.
Research Policy, 41(8):1283–1295.
Felin, T. and Zenger, T. R. (2014). Closed or open innovation? Problem solving and the
governance choice. Research Policy, 43(5):914–925.
Fleming, L. (2001). Recombinant uncertainty in technological search. Management Science,
47(1):117–132.
Fosfuri, A., Giarratana, M. S., and Luzzi, A. (2008). The penguin has entered the building:
The commercialization of open source software products. Organization Science, 19(2):292–
305.
Gans, J. S. and Stern, S. (2003). The product market and the market for “ideas”: Commer-
cialization strategies for technology entrepreneurs. Research Policy, 32(2):333–350.
Gompers, P. A., Gornall, W., Kaplan, S. N., and Strebulaev, I. A. (2020). How do venture
capitalists make decisions? Journal of Financial Economics, 135(1):169–190.
Greenstein, S. (1996). Invisible hands versus invisible advisors: Coordination mechanisms in
economic networks. In Noam, E. and Nishuilleabhain, A., editors, Public Networks, Public
Objectives. Elsevier Science, Amsterdam.
Grimpe, C. and Kaiser, U. (2010). Balancing internal and external knowledge acquisition: the
gains and pains from R&D outsourcing. Journal of Management Studies, 47(8):1483–1509.
Guzman, J. and Li, A. (2019). Measuring founding strategy. Available at SSRN 3489585.
Hellmann, T. and Puri, M. (2002). Venture capital and the professionalization of start-up
firms: Empirical evidence. Journal of Finance, 57(1):169–197.
Hsu, D. H. and Ziedonis, R. H. (2013). Resources as dual sources of advantage: Implications
for valuing entrepreneurial-firm patents. Strategic Management Journal, 34(7):761–781.
Jain, V. (2018). Investor due diligence: Beyond the obvious. https://startupflux.com/
investor-due-diligence-beyond-the-obvious/amp/. Accessed: 2021-6-29.
Jeppesen, L. B. and Lakhani, K. R. (2010). Marginality and problem-solving effectiveness in
broadcast search. Organization Science, 21(5):1016–1033.
Kaplan, S. N. and Lerner, J. (2010). It ain’t broke: The past, present, and future of venture
capital. Journal of Applied Corporate Finance, 22(2):36–47.
Kaplan, S. N., Sensoy, B. A., and Stromberg, P. (2009). Should investors bet on the jockey
or the horse? Evidence from the evolution of firms from early business plans to public
companies. Journal of Finance, 64(1):75–115.
Lakhani, K. R., Boudreau, K. J., Loh, P.-R., Backstrom, L., Baldwin, C., Lonstein, E., Lydon,
M., MacCormack, A., Arnaout, R. A., and Guinan, E. C. (2013a). Prize-based contests can
provide solutions to computational biology problems. Nature Biotechnology, 31(2):108–111.
Lakhani, K. R., Lifshitz-Assaf, H., and Tushman, M. L. (2013b). Open innovation and
organizational boundaries: Task decomposition, knowledge distribution and the locus of
innovation. In Handbook of Economic Organization. Edward Elgar Publishing.
Lakhani, K. R. and von Hippel, E. (2003). How open source software works: “free” user-to-user
assistance. Research Policy, 32(6):923–943.
Laursen, K. and Salter, A. (2006). Open for innovation: the role of openness in explaining
innovation performance among UK manufacturing firms. Strategic Management Journal,
27(2):131–150.
Lippman, S. A. and Rumelt, R. P. (2003). A bargaining perspective on resource advantage.
Strategic Management Journal, 24(11):1069–1086.
Nagle, F. (2018). Learning by contributing: Gaining competitive advantage through contri-
bution to crowdsourced public goods. Organization Science, 29(4):569–587.
Nagle, F. (2019). Open source software and firm productivity. Management Science,
65(3):1191–1215.
Ritter, J. R. (2016). Initial public offerings: VC-backed IPO statistics through 2015.
Roche, M. P., Conti, A., and Rothaermel, F. T. (2020). Different founders, different venture
outcomes: A comparative analysis of academic and non-academic startups. Research Policy,
49(10):104062.
Schumpeter, J. A. (1934). The theory of economic development; An inquiry into profits,
capital, credit, interest, and the business cycle. Harvard University Press, Cambridge, Mass.
Shah, S. and Nagle, F. (2019). Why do user communities matter for strategy? Harvard
Business School Strategy Unit Working Paper, (19-126).
Shane, S. A. (2004). Academic Entrepreneurship: University Spinoffs and Wealth Creation.
Cheltenham: Edward Elgar Publishing.
Sørensen, M. (2007). How smart is smart money? A two-sided matching model of venture
capital. Journal of Finance, 62(6):2725–2762.
Teece, D. J. (1986). Profiting from technological innovation: Implications for integration,
collaboration, licensing and public policy. Research Policy, 15(6):285–305.
Wasserman, N. (2003). Founder-CEO succession and the paradox of entrepreneurial success.
Organization Science, 14(2):149–172.
Wernerfelt, B. (1984). A resource-based view of the firm. Strategic Management Journal,
5(2):171–180.
Yoo, Y., Boland Jr, R. J., Lyytinen, K., and Majchrzak, A. (2012). Organizing for innovation
in the digitized world. Organization Science, 23(5):1398–1408.
Zenger, T. R., Felin, T., and Bigelow, L. (2011). Theories of the firm–market boundary.
Academy of Management Annals, 5(1):89–133.
Figure 1: GitHub activities related to own and external repositories over time
Notes: In this figure, we distinguish between a startup’s public activities related to its own repositories and its engagement with external repositories.
Figure 2: Repository classification: word clouds
[Word-cloud panels: Software (Backend), I; Software (Backend), II; Machine Learning; API; User Interface (UI); Human Resources (HR).]
Notes: This figure displays the most common words for each of the categories we have considered. Note that these word clouds are based on the full classified sample, for which we used the hand-coded training sample as starting point for k-nearest neighbors. To construct this figure, word counts for a given category were discounted if the words also appeared frequently in other categories.
Figure 3: GitHub activity types by industry groups
[Bar chart. Horizontal axis: percent, from 0 to 50. Industry groups: Software/IT, Mobile, Internet Services, Privacy/Security, Data/Analytics, Other. Legend: ML; UI; API; Productivity.]
Notes: This figure reports the relevance of each technology category across industry groups.
Figure 4: Raw Data – Trends over time
[Line plot. Vertical axis ticks: .04 to .12. Horizontal axis: time, from -10 to 10. Legend: 90% CI; fitted values; (mean) activity_binary.]
Notes: This figure depicts the raw relationship between time to and from a funding round and startups’ average activity on GitHub. Activity binary equals one if the startup logged an activity on GitHub. We include 90% confidence intervals.
Figure 5: GitHub activity over time for external and internal activities
[Coefficient plot, titled “By External vs. Internal”. Vertical axis: coefficient, from -.04 to .04. Horizontal axis: time (12, 6, 2, 2, 6, 12). Legend: 90% CI and External; 90% CI and Internal.]
Notes: This figure displays how the engagement on GitHub prior to raising a first financing round and after varies among startups depending on whether the activities are external or internal. The solid line presents the results for all external activities and the dotted line displays results for internal activities. The respective starting point coefficients are displayed in t-12. Confidence intervals are at the 90% level.
Figure 6: GitHub activities around the change in GitHub’s pricing scheme
[Time-series plot, 2015m1 to 2017m1. Vertical axes: External (25000 to 50000); Member Events (0 to 8000); Other Internal (16000 to 24000). Legend: External; Member Events; Other Internal.]
Notes: This figure displays the general changes in activity on GitHub, by type, around the time of the change in GitHub’s pricing scheme. In this graph we distinguish between external activity, internal activity without new member activity (“Other Internal”), and internal new member activity only (“Member Events”).
Figure 7: The effect of the change in GitHub’s pricing scheme
[Three coefficient plots. Panels: A: Internal Member Events; B: Other Internal Events; C: External Events. Vertical axis: coefficient. Horizontal axis: time (12, 6, 2, 2, 6, 12). Legend: 90% CI and estimates before the price change; 90% CI and estimates after the price change.]
Notes: This figure displays the effects of the change in GitHub’s pricing scheme from a repository-based one to a user-based one. With the new pricing model, paying customers can create unlimited private repositories. We distinguish between internal member events (A), internal events without new member activity (B), and external events (C). Confidence intervals are at the 90% level.
Figure 8: GitHub external activity types over time
[Four coefficient plots. Panels: SD/BE; ML; API; UI. Vertical axis: coefficient. Horizontal axis: time (12, 6, 2, 2, 6, 12).]
Notes: This figure displays the results reported in Table 6, Panel B, over time, where we categorize a startup’s external activities on GitHub according to whether they pertain to SD/BE, ML, API, or UI. Confidence intervals are at the 90% level.
Figure 9: GitHub activity by round
[Coefficient plot, titled “By Round”. Vertical axis: coefficient, from -.04 to .04. Horizontal axis: time (12, 6, 2, 2, 6, 12). Legend: 90% CI and Round 1; 90% CI and Round 2; 90% CI and Round 3.]
Notes: This figure displays how the engagement on GitHub before and after raising a financing round varies among startups depending on the round in question. The respective starting point coefficients are displayed in t-12. Confidence intervals are at the 90% level.
Table 1a: Summary statistics - Full sample of startups
                          Obs      Mean    Std. Dev.  Min  Max  p50
Raised funds              160065   0.357   0.479      0    1    0
IPO                       160065   0.012   0.107      0    1    0
Acquired                  160065   0.081   0.273      0    1    0
GitHub                    160065   0.093   0.29       0    1    0
Top Team                  160065   0.002   0.039      0    1    0
AI                        160065   0.039   0.194      0    1    0
Data Analytics            160065   0.098   0.297      0    1    0
Information Technology    160065   0.182   0.386      0    1    0
Internet Services         160065   0.200   0.400      0    1    0
Software                  160065   0.356   0.479      0    1    0
N. Industry Groups        160065   2.917   1.729      1    19   3
Software share            154885   0.464   0.387      0    1    0.5
California                160065   0.296   0.457      0    1    0
Massachusetts             160065   0.045   0.207      0    1    0
New York                  160065   0.128   0.334      0    1    0
Table 1b: Top 5 external and internal GitHub events
External                               Internal
Type                Total    Share     Type            Total    Share
ForkEvent           194518   0.96      MemberEvent     180338   0.46
WatchEvent          3486     0.02      PublicEvent     64355    0.17
PushEvent           1797     0.01      PushEvent       57196    0.15
IssueCommentEvent   1316     0.01      CreateEvent     41648    0.11
PullRequestEvent    645      0.003     TeamAddEvent    21001    0.05
Total Events        203565             Total Events    388001
Table 1c: Summary statistics - Startups that raised funds and have a GitHub account
                              Obs     Mean    Std. Dev.  Min  Max  p50
Engage with internal repos    10482   0.014   0.116      0    1    0
Engage with external repos    10482   0.025   0.157      0    1    0
Software (back end)           10482   0.008   0.091      0    1    0
ML                            10482   0       0.020      0    1    0
API                           10482   0.007   0.084      0    1    0
UI                            10482   0.002   0.045      0    1    0
HR                            10482   0       0.020      0    1    0
AI                            10482   0.110   0.313      0    1    0
Data Analytics                10482   0.227   0.419      0    1    0
Information Technology        10482   0.265   0.441      0    1    0
Internet Services             10482   0.284   0.451      0    1    0
Software                      10482   0.615   0.487      0    1    1
N. Industry Groups            10482   3.711   1.803      1    15   3
Software share                10482   0.610   0.332      0    1    0.667
California                    10482   0.434   0.496      0    1    0
Massachusetts                 10482   0.059   0.236      0    1    0
New York                      10482   0.159   0.366      0    1    0
Notes: The proportions reported for the various GitHub activities are relative to the period starting twelve months prior to a startup’s first financing round and ending twelve months after.
Table 2: Raising a financing round: Determinants
                           (1)           (2)           (3)
                           Raising a financing round
GitHub_i                   0.208***      0.209***      0.235***
                           (0.0102)      (0.0103)      (0.0101)
Top Team_i                 0.351***      0.387***      0.362***
                           (0.0161)      (0.0347)      (0.0155)
GitHub_i × Top Team_i                    -0.0760
                                         (0.0608)
Software share_i                                       -0.0232***
                                                       (0.00409)
Founding year FE           Y             Y             Y
Industry Group FE          Y             Y
Observations               160065        160065        154885
Mean DV                    0.357         0.357         0.363
Notes: This table reports the results from estimating variants of Eq. (1). The dependent variable is the likelihood a startup raised at least one financing round as of January 2021. The variable GitHub_i is an indicator that takes the value 1 if a startup has a GitHub account, while Top Team_i is an indicator identifying prominent founders and CXOs. The latter measure equals one if an employee is ranked among the top 500 by Crunchbase and zero otherwise. We include fixed effects for a startup’s founding year and for whether the startup is located in Massachusetts, New York, or California. In columns (1) and (2), we include industry group fixed effects. The industry groups we consider are Information Technology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace our industry group fixed effects with the share of a startup’s industry group keywords that are related to software (Software share_i). The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. Standard errors, reported in parentheses, are clustered at the startup founding year level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
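As an illustration of the Software share measure described in the notes to Table 2, the following minimal sketch computes the share of a startup's industry group keywords that fall in the software-related list. The keyword list is copied from the notes; the function name, the input format, and the handling of startups without keywords are assumptions made for this example and are not taken from the paper.

```python
# Keyword list copied from the notes to Table 2.
SOFTWARE_KEYWORDS = frozenset({
    "Apps", "Artificial Intelligence", "Consumer Electronics",
    "Data and Analytics", "Design", "Financial Services", "Gaming",
    "Hardware", "Information Technology", "Internet Services",
    "Messaging and Telecommunications", "Mobile", "Payments", "Platforms",
    "Privacy and Security", "Software",
})


def software_share(industry_groups):
    """Share of a startup's industry group keywords that are software-related.

    Returns None when the startup has no keywords. This is an assumption
    about missing values; the paper does not spell out how they arise.
    """
    groups = list(industry_groups)
    if not groups:
        return None
    matches = sum(1 for g in groups if g in SOFTWARE_KEYWORDS)
    return matches / len(groups)


# Hypothetical startup with three industry groups, two of them software-related.
example = software_share(["Software", "Mobile", "Health Care"])
print(round(example, 3))  # 0.667
```

Under these assumptions, a startup whose keywords are all software-related has a share of one, and a startup with none has a share of zero.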
Table 3: Startup engagement on GitHub
                             (1)          (2)          (3)          (4)          (5)          (6)
                             Engaging in a public activity on GitHub
Post_it                      0.0418***    0.0431***    0.0430***    0.0456***
                             (0.00550)    (0.00550)    (0.00550)    (0.00554)
τ_it                         0.00340***   0.00345***   0.00344***   0.00365***   0.00421***   0.00445***
                             (0.000247)   (0.000248)   (0.000248)   (0.000249)   (0.000262)   (0.000356)
Post_it × τ_it               -0.00198***  -0.00208***  -0.00207***  -0.00242***  -0.00346***  -0.00297***
                             (0.000377)   (0.000377)   (0.000378)   (0.000381)   (0.000474)   (0.000643)
Age_it                       0.0743***    0.0672***    0.0659***
                             (0.00673)    (0.00674)    (0.00673)
Startup FE                   Y            Y            Y            Y            Y            Y
Year × Region FE                          Y            Y            Y            Y            Y
Year × Industry Group FE                  Y                         Y            Y            Y
Year × Software share_i                                Y
Startup Age FE                                                      Y
Startup Age × Startup FE                                                         Y            Y
Post_it × Startup FE                                                             Y            Y
Lead Investor FE                                                                              Y
Observations                 246462       246462       246462       246462       244617       150972
R2                           0.273        0.274        0.274        0.275        0.423        0.424
Notes: This table reports the results from estimating variants of Eq. (2). Post_it is equal to one for the twelve months that follow a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. In column (1), we control for the natural logarithm of a startup’s age. We additionally include startup and year fixed effects, as well as fixed effects controlling for whether a startup raised a second round in any of the twelve months following the first round. In column (2), we add region by year and industry group by year fixed effects. The industry groups we consider are Information Technology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace our industry groups with the share of a startup’s industry group keywords that are related to software (Software share_i). The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. In column (4), we estimate a similar model as the one in column (3), this time replacing the natural logarithm of a startup’s age with age fixed effects. In column (5), we estimate a similar model as the one in column (4) where we add age by startup and Post_it by startup fixed effects. Finally, in column (6), we add lead investor fixed effects to the specification in column (5). Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
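As a reading aid for the coefficients in Table 3, the sketch below shows how the interaction terms combine. In any linear specification that includes the month count, the post-round indicator, and their interaction, the slope of the outcome with respect to the month count equals the coefficient on the month count before the round, and the sum of that coefficient and the interaction coefficient after it. The inputs are the column (4) estimates from Table 3; the function and variable names are ours and are only illustrative.

```python
def implied_slopes(beta_tau: float, beta_post_tau: float) -> tuple[float, float]:
    """Return (pre-round slope, post-round slope) with respect to tau."""
    pre = beta_tau                   # Post = 0: only the tau coefficient applies
    post = beta_tau + beta_post_tau  # Post = 1: the interaction coefficient is added
    return pre, post


# Column (4) of Table 3: coefficients on tau and on Post x tau.
pre_slope, post_slope = implied_slopes(0.00365, -0.00242)
print(f"pre-round slope:  {pre_slope:.5f}")   # 0.00365
print(f"post-round slope: {post_slope:.5f}")  # 0.00123
```

The calculation only restates the point estimates; it says nothing about the standard error of the post-round slope, which would require the covariance between the two coefficients.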
Table 4: Startup engagement on GitHub - Engagement with own repositories vs. engagement with external repositories
                             (1)                 (2)
                             Engaging with:
                             Own repositories    External repositories
Post_it                      0.0124***           0.0385***
                             (0.00365)           (0.00509)
τ_it                         0.00136***          0.00295***
                             (0.000160)          (0.000226)
Post_it × τ_it               -0.000525**         -0.00213***
                             (0.000247)          (0.000349)
Startup FE                   Y                   Y
Year × Region FE             Y                   Y
Year × Industry Group FE     Y                   Y
Startup Age FE               Y                   Y
Observations                 246462              246462
R2                           0.237               0.243
Notes: This table reports the results from estimating variants of Eq. (2). The outcome in column (1) is whether a startup engages in public activities related to its own repositories in t. This measure proxies a startup’s internal technology investments. The outcome in column (2) is whether a startup engages with external repositories in t. This outcome provides an indication of whether a startup builds on repositories controlled by other GitHub users to develop its technologies. Post_it is equal to one for the twelve months that follow a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. We include startup and startup age fixed effects, as well as fixed effects controlling for whether a startup raised a subsequent round in any of the twelve months following a first round. Additionally, we include region by year and industry group by year fixed effects. Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 5: Startup reaction to an exogenous change in the GitHub pricing scheme
                                    (1)                   (2)                   (3)
                                    Member event          Other internal event  External event
Post_it × τ_it                      0.000201              -0.00134***           -0.00255***
                                    (0.000131)            (0.000468)            (0.000625)
τ_it                                0.000125***           0.00189***            0.00402***
                                    (0.0000456)           (0.000232)            (0.000315)
NewPriceScheme_t                    0.00507               0.00775**             0.00605
                                    (0.00346)             (0.00357)             (0.00581)
Post_it × NewPriceScheme_t          0.0167**              -0.00647              0.0184
                                    (0.00806)             (0.00955)             (0.0143)
NewPriceScheme_t × τ_it             0.000902***           -0.000886***          -0.00140***
                                    (0.000232)            (0.000322)            (0.000481)
Post_it × NewPriceScheme_t × τ_it   -0.000860**           0.000690              -0.000519
                                    (0.000394)            (0.000585)            (0.000830)
Startup FE                          Y                     Y                     Y
Year × Region FE                    Y                     Y                     Y
Year × Industry Group FE            Y                     Y                     Y
Startup Age × Startup FE            Y                     Y                     Y
Post_it × Startup FE                Y                     Y                     Y
Observations                        244856                244856                244856
R2                                  0.295                 0.390                 0.400
Notes: In this table, we assess how the introduction of a new GitHub pricing scheme in October 2015 affected the startups’ willingness to rely on open-source communities to develop their technologies and attract funds. We modify Eq. (2) by including an indicator (NewPriceScheme_t) identifying the period following the introduction of the new pricing scheme. We additionally introduce interaction terms between NewPriceScheme_t and the following variables: Post_it, τ_it, and Post_it × τ_it. The dependent variables we consider are: an indicator that equals one if a startup engages in a member event in month t and zero otherwise (column 1); an indicator for whether a startup engages in a non-member, internal event (column 2); and an indicator for whether a startup engages with an external repository (column 3). To ensure that the introduction of the new pricing scheme is not correlated with a startup’s characteristics and the life cycle of its technology, we control for the full set of fixed effects. Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
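To make the triple interaction in Table 5 easier to read, the following sketch adds up the coefficients that involve the month count for each combination of the post-round and new-pricing-scheme indicators. It uses the column (1) point estimates (member events); the dictionary keys and the function name are our own labels, and the output is simple arithmetic on the reported coefficients rather than an additional result.

```python
from itertools import product

# Column (1) of Table 5 (member events): coefficients on terms involving tau.
COEF = {
    "tau": 0.000125,
    "post_x_tau": 0.000201,
    "new_x_tau": 0.000902,
    "post_x_new_x_tau": -0.000860,
}


def tau_slope(post: int, new_scheme: int, coef=COEF) -> float:
    """Slope of the outcome with respect to tau in a (Post, NewPriceScheme) cell."""
    return (
        coef["tau"]
        + post * coef["post_x_tau"]
        + new_scheme * coef["new_x_tau"]
        + post * new_scheme * coef["post_x_new_x_tau"]
    )


# Print the implied slope for each of the four indicator combinations.
for post, new_scheme in product((0, 1), repeat=2):
    slope = tau_slope(post, new_scheme)
    print(f"Post={post}, NewPriceScheme={new_scheme}: {slope:.6f}")
```

The same function can be applied to columns (2) and (3) by swapping in the corresponding coefficients.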
Table 6: Startup engagement on GitHub - By GitHub activity
Panel A: Own repositories
                  (1)           (2)           (3)           (4)           (5)
                  SD/BE         ML            API           UI            HR
Post_it           0.00336*      0.000710      0.00157       -0.000480     0.000151
                  (0.00182)     (0.000554)    (0.00163)     (0.000834)    (0.000273)
τ_it              0.000407***   0.0000247     0.000431***   0.0000497*    -0.0000117
                  (0.0000769)   (0.0000189)   (0.0000682)   (0.0000270)   (0.0000129)
Post_it × τ_it    -0.000128     -0.0000357    -0.000115     0.0000668     0.00000127
                  (0.000118)    (0.0000333)   (0.000109)    (0.0000518)   (0.0000178)
Observations      246462        246462        246462        246462        246462
R2                0.123         0.0704        0.121         0.0761        0.0486

Panel B: External repositories
                  (1)           (2)           (3)           (4)           (5)
                  SD/BE         ML            API           UI            HR
Post_it           0.0102***     0.00116*      0.00956***    0.00170       0.00140***
                  (0.00252)     (0.000698)    (0.00238)     (0.00132)     (0.000484)
τ_it              0.000449***   0.000116***   0.000354***   0.0000902     0.0000686***
                  (0.000114)    (0.0000330)   (0.000100)    (0.0000558)   (0.0000231)
Post_it × τ_it    -0.000559***  -0.000123***  -0.000545***  -0.0000373    -0.000110***
                  (0.000164)    (0.0000469)   (0.000157)    (0.0000863)   (0.0000312)
Observations      246462        246462        246462        246462        246462
R2                0.136         0.0664        0.122         0.0966        0.0533
Notes: This table reports the results from estimating variants of Eq. (2). The outcome in: column (1) is whether a startup engages in public activities related to software development/back end (SD/BE); column (2) is whether a startup engages in public activities related to machine learning (ML); column (3) is whether a startup engages in public activities related to Application Programming Interface (API); column (4) is whether a startup engages in public activities related to user interface (UI); column (5) is whether a startup engages in public activities related to human resources (HR). In Panel A, we consider public activities on GitHub related to a startup’s own repositories. In Panel B, we focus on a startup’s engagement with external repositories. Post_it is equal to one for the twelve months that follow a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. We include startup and startup age fixed effects, as well as fixed effects controlling for whether a startup raised a subsequent round in any of the twelve months following a first round. Additionally, we include region by year and industry group by year fixed effects. Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 7: Startup engagement on GitHub - Making new or existing private repositories public
                             (1)          (2)          (3)          (4)          (5)          (6)
                             Making a repository public
Post_it                      0.00519*     0.00561**    0.00561**    0.00604**
                             (0.00269)    (0.00269)    (0.00269)    (0.00271)
τ_it                         0.000815***  0.000838***  0.000834***  0.000876***  0.000937***  0.000952***
                             (0.000117)   (0.000117)   (0.000118)   (0.000118)   (0.000120)   (0.000160)
Post_it × τ_it               -0.000103    -0.000139    -0.000139    -0.000204    -0.000185    -0.0000330
                             (0.000179)   (0.000179)   (0.000179)   (0.000180)   (0.000227)   (0.000319)
Age_it                       0.0175***    0.0149***    0.0141***
                             (0.00304)    (0.00304)    (0.00303)
Startup FE                   Y            Y            Y            Y            Y            Y
Year × Region FE                          Y            Y            Y            Y            Y
Year × Industry Group FE                  Y                         Y            Y            Y
Year × Software share_i                                Y
Startup Age FE                                                      Y
Startup Age × Startup FE                                                         Y            Y
Post_it × Startup FE                                                             Y            Y
Lead Investor FE                                                                              Y
Observations                 246462       246462       246462       246462       244617       150972
R2                           0.273        0.274        0.274        0.275        0.423        0.424
Notes: This table reports the results from estimating variants of Eq. (2). The dependent variable is an indicator for whether a startup makes a new or a previously private repository public. Post_it is equal to one for the twelve months that follow a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. In column (1), we control for the natural logarithm of a startup’s age. We additionally include startup and year fixed effects, as well as fixed effects controlling for whether a startup raised a second round in any of the twelve months following the first round. In column (2), we add region by year and industry group by year fixed effects. The industry groups we consider are Information Technology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace our industry groups with the share of a startup’s industry group keywords that are related to software (Software share_i). The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. In column (4), we estimate a similar model as the one in column (3), this time replacing the natural logarithm of a startup’s age with age fixed effects. In column (5), we estimate a similar model as the one in column (4) where we add age by startup and Post_it by startup fixed effects. Finally, in column (6), we add lead investor fixed effects to the specification in column (5). Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 8: Startup engagement on GitHub - New repositories with permissive licenses
                             (1)          (2)          (3)          (4)          (5)          (6)
                             New repositories with permissive licenses
Post_it                      0.0196***    0.0203***    0.0201***    0.0213***
                             (0.00410)    (0.00409)    (0.00410)    (0.00413)
τ_it                         0.00131***   0.00134***   0.00133***   0.00143***   0.00182***   0.00197***
                             (0.000175)   (0.000175)   (0.000175)   (0.000176)   (0.000180)   (0.000244)
Post_it × τ_it               -0.000747*** -0.000803*** -0.000789*** -0.000945*** -0.00155***  -0.00162***
                             (0.000278)   (0.000277)   (0.000277)   (0.000280)   (0.000341)   (0.000465)
Age_it                       0.0481***    0.0446***    0.0446***
                             (0.00464)    (0.00464)    (0.00463)
Startup FE                   Y            Y            Y            Y            Y            Y
Year × Region FE                          Y            Y            Y            Y            Y
Year × Industry Group FE                  Y                         Y            Y            Y
Year × Software share_i                                Y
Startup Age FE                                                      Y
Startup Age × Startup FE                                                         Y            Y
Post_it × Startup FE                                                             Y            Y
Lead Investor FE                                                                              Y
Observations                 246762       246761       246762       246761       244856       151129
R2                           0.187        0.188        0.187        0.188        0.337        0.331
Notes: This table reports the results from estimating variants of Eq. (2). The dependent variable is an indicator for whether a startup chooses a permissive license (BSD, MIT, Apache, CC-BY) for at least one new repository. Post_it is equal to one for the twelve months that follow a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. In column (1), we control for the natural logarithm of a startup’s age. We additionally include startup and year fixed effects, as well as fixed effects controlling for whether a startup raised a second round in any of the twelve months following the first round. In column (2), we add region by year and industry group by year fixed effects. The industry groups we consider are Information Technology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace our industry groups with the share of a startup’s industry group keywords that are related to software (Software share_i). The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. In column (4), we estimate a similar model as the one in column (3), this time replacing the natural logarithm of a startup’s age with age fixed effects. In column (5), we estimate a similar model as the one in column (4) where we add age by startup and Post_it by startup fixed effects. Finally, in column (6), we add lead investor fixed effects to the specification in column (5). Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 9: Startup engagement on GitHub - Distinguishing between startups developing software and others
                                    (1)           (2)
                                    Engaging in a public activity on GitHub
Post_it                             0.0260***     0.0115
                                    (0.00775)     (0.00941)
Post_it × Software_j                0.0318***
                                    (0.0104)
τ_it                                0.00260***    0.00142***
                                    (0.000352)    (0.000401)
τ_it × Software_j                   0.00171***
                                    (0.000481)
Post_it × τ_it                      -0.00122**    -0.000207
                                    (0.000514)    (0.000630)
Post_it × τ_it × Software_j         -0.00196***
                                    (0.000691)
Post_it × Software share_i                        0.0562***
                                                  (0.0148)
τ_it × Software share_i                           0.00367***
                                                  (0.000625)
Post_it × τ_it × Software share_i                 -0.0036455***
                                                  (0.000996)
Startup FE                          Y             Y
Year × Region FE                    Y             Y
Year × Industry Group FE            Y             Y
Startup Age FE                      Y             Y
Observations                        246462        246462
R2                                  0.275         0.276
Notes: This table reports the results from estimating variants of Eq. (2). In column (1), we interact Post_it, τ_it, and Post_it × τ_it with an indicator that equals 1 if a startup is assigned to the “software” industry group by Crunchbase and zero otherwise. In column (2), we interact Post_it, τ_it, and Post_it × τ_it with the share of a startup’s industry keywords that are related to software. The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. The models in both columns include startup, age, region by year, and industry group by year fixed effects. We also include fixed effects controlling for whether a startup raised a second round in any of the twelve months following the first round. Standard errors, reported in parentheses, are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 10: Startup engagement on GitHub - By round

Dependent variable: Engaging in a public activity on GitHub

                                 (1)             (2)             (3)
                               Round I         Round II        Round III
Post_it                       0.0456***       0.0311***       0.0188*
                             (0.00554)       (0.00474)       (0.00998)
τ_it                          0.00365***      0.00334***     -0.0000236
                             (0.000249)      (0.000719)      (0.00270)
Post_it × τ_it               -0.00242***     -0.00213***      0.000321
                             (0.000381)      (0.000738)      (0.00271)

Startup FE                        Y               Y               Y
Year × Region FE                  Y               Y               Y
Year × Industry Group FE          Y               Y               Y
Startup Age FE                    Y               Y               Y

Observations                  246462          168695          115082
R²                            0.275           0.302           0.318
Notes: This table reports the results from estimating variants of Eq. (2). Post_it is equal to one for the twelve months that succeed a startup’s given financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s round. In column (1), we examine a startup’s engagement on GitHub before and after a first financing round. In column (2), we examine a startup’s engagement on GitHub before and after a second financing round. Finally, in column (3), we examine a startup’s engagement on GitHub before and after a third financing round. We include startup and startup age fixed effects, as well as fixed effects controlling for whether a startup raised a subsequent round in any of the twelve months following the examined round. Additionally, we include region by year and industry group by year fixed effects. Standard errors - reported in parentheses - are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
Table 11: Startup engagement on GitHub - By investor type

Dependent variable: Engaging in a public activity on GitHub

                                      (1)             (2)             (3)
                                    VC Round      Successful Investor Round
Post_it                            0.0449***       0.0412***       0.0334***
                                  (0.00674)       (0.00727)       (0.00668)
Post_it × VC_it                   -0.000350
                                  (0.0108)
Post_it × Success Inv._it                          0.000866        0.00159
                                                  (0.00217)       (0.00289)
τ_it                               0.00280***      0.00250***      0.00265***
                                  (0.000287)      (0.000305)      (0.000289)
τ_it × VC_it                       0.00190***
                                  (0.000419)
τ_it × Success Inv._it                             0.000416***     0.000545***
                                                  (0.0000840)     (0.000113)
Post_it × τ_it                    -0.00235***     -0.00227***     -0.00182***
                                  (0.000456)      (0.000494)      (0.000455)
Post_it × τ_it × VC_it            -0.00000433
                                  (0.000726)
Post_it × τ_it × Success Inv._it                   0.00000681      0.0000485
                                                  (0.000146)      (0.000194)

Startup FE                             Y               Y               Y
Year × Region FE                       Y               Y               Y
Year × Industry Group FE               Y               Y               Y
Startup Age FE                         Y               Y               Y

Observations                       246462          246462          246462
R²                                 0.276           0.276           0.276
Notes: This table reports the results from estimating variants of Eq. (2). Post_it is equal to one for the twelve months that succeed a startup’s first financing round, and zero in the twelve months preceding it. The variable τ_it, instead, is the count of months to/from a startup’s first round. We interact Post_it, τ_it, and Post_it × τ_it with: an indicator that equals 1 if a startup had a VC participating in its first round (column (1)); the maximum number of investments in which any of a startup’s investors participated in the five years preceding t (column (2)); the maximum number of successful investments made by a startup’s investors in the five years preceding t (column (3)). A successful investment is one made in startups that ultimately exited via an acquisition or an IPO. We include startup and startup age fixed effects, as well as fixed effects controlling for whether a startup raised a subsequent round in any of the twelve months following a first round. Additionally, we include region by year and industry group by year fixed effects. Standard errors - reported in parentheses - are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.
7 Appendix
As mentioned in the main text, we classify the public repositories of all organizations, as
well as the external repositories with which the organizations interact through commits, pull
requests or forks according to their type. We distinguish between repositories that pertain to
software development/back end (SD/BE), machine learning (ML), application programming
interface (API), and user interface (UI). To do so, we implement a combination of the TF-IDF
vectorization and K-Nearest Neighbors prediction methods.
7.0.1 TF-IDF vectorization
The first step in the categorization process is vectorization using the TF-IDF method.
Each text document (that is, a collection of all the text in a single repository) is transformed
into a vector of numbers. These numbers are the “scores” that the TF-IDF method assigns
to each word in a document. A TF-IDF score is defined as the frequency with which a given
word occurs in a certain document divided by the fraction of documents in which the word
is present (footnote 11). Thus, if a word is frequently used in a given document then it will have a
high score. Similarly, if a word is used in very few other documents it will also have a high
score. This vectorization step is fundamental as the vectors that are generated can be used
to calculate the similarity between two repositories by comparing the vectors that represent
them.
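
As an illustration only, the scoring described above can be sketched in a few lines of Python. The sketch assumes whitespace tokenization and the common variant tf × log(N / df), where N is the number of documents and df the number of documents containing the word; the function name `tfidf_vectors` and the toy documents are hypothetical and not taken from our pipeline.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return one {word: score} dict per document, using tf * log(N / df).

    Assumption: idf is computed as N / df (a common variant); the exact
    normalization used in the paper's pipeline is not specified here.
    """
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # df[w] = number of documents that contain word w at least once
    df = Counter(word for tokens in tokenized for word in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        })
    return vectors

# Hypothetical toy "repositories", each reduced to a short text document.
docs = [
    "neural network training loop",
    "rest api endpoint routing",
    "neural network inference api",
]
vectors = tfidf_vectors(docs)
```

In this toy example, "training" appears in only one document and therefore scores higher in the first vector than "neural", which appears in two.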
7.0.2 K-Nearest Neighbors prediction
Using the vectors produced by the TF-IDF method, we are able to compare any two
repositories by calculating the dot product of the vectors that represent them. This calculation
yields a similarity score between 0 and 1 (0 meaning totally different and 1 meaning totally
similar).
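
A minimal sketch of this comparison follows, under the assumption that each vector is scaled to unit length before taking the dot product (which is what bounds the score between 0 and 1 for non-negative scores). The helper names and the toy vectors are hypothetical.

```python
import math

def normalize(vec):
    """Scale a sparse {word: score} vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0:
        return dict(vec)  # an all-zero vector stays as it is
    return {w: v / norm for w, v in vec.items()}

def similarity(a, b):
    """Dot product of the unit-length versions of a and b (cosine similarity)."""
    a, b = normalize(a), normalize(b)
    return sum(score * b.get(word, 0.0) for word, score in a.items())

# Hypothetical TF-IDF vectors for three repositories.
repo_a = {"neural": 0.1, "training": 0.3}
repo_b = {"neural": 0.2, "api": 0.4}
repo_c = {"routing": 0.5}

sim_ab = similarity(repo_a, repo_b)  # shares one word: strictly between 0 and 1
sim_ac = similarity(repo_a, repo_c)  # no shared words: 0
sim_aa = similarity(repo_a, repo_a)  # identical vectors: 1 (up to rounding)
```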
Given that we can compare any two repositories, it is possible to classify any repository
by examining similar repositories that have already been classified. To do so, we manually
classify a subset of 150 repositories. We then use the KNN algorithm to classify the rest of
the repositories using the 150 classified repositories as training data. The KNN algorithm
consists of finding the k (in this case 5) most similar repositories in the training data and
selecting the most common category to classify any new repository (each of the 5 similar
repositories “votes” on a category for the new repository).

Footnote 11: More precisely, the formula used is: tf ∗ log(idf), where tf is the frequency of the word in a given document and idf is the inverse of the number of documents where the word appears.
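
The voting rule can be sketched as follows. This is an illustrative Python example with unweighted votes and k = 5, assuming the vectors are already unit-length sparse dictionaries; `knn_classify` and the toy training data are hypothetical.

```python
from collections import Counter

def knn_classify(new_vec, training, k=5):
    """Classify new_vec by majority vote among its k most similar neighbors.

    training: list of (vector, label) pairs; vectors are {word: score} dicts
    assumed to be unit length, so the dot product acts as a similarity score.
    """
    def dot(a, b):
        return sum(s * b.get(w, 0.0) for w, s in a.items())

    ranked = sorted(training, key=lambda item: dot(new_vec, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0], votes

# Hypothetical manually classified repositories (unit-length vectors).
training = [
    ({"model": 1.0}, "ML"),
    ({"model": 0.8, "tensor": 0.6}, "ML"),
    ({"tensor": 1.0}, "ML"),
    ({"endpoint": 1.0}, "API"),
    ({"endpoint": 0.6, "model": 0.8}, "API"),
    ({"button": 1.0}, "UI"),
    ({"server": 1.0}, "SD/BE"),
]
new_repo = {"model": 0.6, "tensor": 0.8}
label, votes = knn_classify(new_repo, training)
```

Here the three ML repositories are the closest neighbors of `new_repo`, so ML wins the vote among the five nearest repositories.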
In order to accurately classify the large number of repositories we have available, we
iteratively expand the training set applying the following steps:
1. Using the current training set, classify the rest of the repositories using KNN;
2. Retain only the repositories with the top N most confident classifications, and manually
review them;
3. Add these repositories to the training set, and repeat the procedure until all repositories
have been classified.
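
The three steps can be sketched as a loop. The Python example below is a simplified illustration: `classify` stands for any classifier returning a label and a confidence, `review` is a hypothetical stand-in for the manual check, and the starting value and growth of N (`n_start`, `n_growth`) are arbitrary choices, not the values used in our procedure.

```python
def expand_training_set(training, unlabeled, classify, review, n_start=2, n_growth=2):
    """Iteratively grow a labeled training set from the most confident predictions.

    classify(item, training) -> (label, confidence)
    review(item, label) -> label   # hypothetical stand-in for manual review
    n_start must be at least 1, otherwise the loop would never terminate.
    """
    training = list(training)
    remaining = list(unlabeled)
    n = n_start
    while remaining:
        # Step 1: classify everything still unlabeled with the current training set.
        scored = [(i, *classify(item, training)) for i, item in enumerate(remaining)]
        # Step 2: retain the N most confident predictions and review them.
        scored.sort(key=lambda row: row[2], reverse=True)
        batch = scored[:n]
        # Step 3: add the reviewed items to the training set, then repeat.
        training.extend((remaining[i], review(remaining[i], label)) for i, label, _ in batch)
        chosen = {i for i, _, _ in batch}
        remaining = [item for i, item in enumerate(remaining) if i not in chosen]
        n += n_growth  # N grows as the training set grows
    return training

# Toy one-dimensional example: nearest labeled point decides the label,
# and confidence decays with the distance to that point.
def toy_classify(x, training):
    nearest_x, nearest_label = min(training, key=lambda pair: abs(pair[0] - x))
    return nearest_label, 1.0 / (1.0 + abs(nearest_x - x))

seed = [(0.0, "A"), (10.0, "B")]
pool = [1.0, 2.0, 3.0, 7.0, 8.0, 9.0]
final = expand_training_set(seed, pool, toy_classify, review=lambda x, label: label)
```

In the toy run, the points closest to the seeds are labeled first, and the points further away are labeled in the next round using the enlarged training set.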
We increase the number N throughout the procedure as the training set grows, and with
it, the accuracy of KNN. We calculate confidence using the number of “votes”: if all 5 similar
repositories “agree” on a category, then confidence is 100%. Conversely, if only 2 out of the 5
similar repositories “agree”, then the confidence is low (footnote 12).

Footnote 12: Due to the high number of categories, the “voting” process was weighted by distance, meaning votes from very similar documents counted more than votes from less similar documents.
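
To make the confidence measure concrete, the sketch below computes the winning category and its share of the votes, with and without weighting. It assumes that each vote is weighted by the neighbor's similarity score, which is one possible way to implement distance weighting; the function name and the numbers are hypothetical.

```python
from collections import defaultdict

def vote_confidence(neighbors, weighted=False):
    """Return (winning label, share of the total vote it received).

    neighbors: list of (label, similarity) pairs for the k nearest repositories.
    With weighted=True each vote counts in proportion to its similarity score
    (an assumption about how the distance weighting could be implemented).
    """
    tally = defaultdict(float)
    for label, sim in neighbors:
        tally[label] += sim if weighted else 1.0
    total = sum(tally.values())
    best = max(tally, key=tally.get)
    return best, (tally[best] / total if total > 0 else 0.0)

# Hypothetical neighbors: two very similar ML repositories, three distant API ones.
neighbors = [("ML", 0.9), ("ML", 0.8), ("API", 0.3), ("API", 0.2), ("API", 0.1)]
plain = vote_confidence(neighbors)                    # API wins 3 of 5 votes
by_distance = vote_confidence(neighbors, weighted=True)  # ML wins once similarity counts
unanimous = vote_confidence([("UI", 0.5)] * 5)        # all 5 agree: confidence 1.0
```

The example shows why weighting matters: with equal votes the three distant API repositories outvote the two close ML ones, whereas with similarity weights the close neighbors dominate.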