Beefing IT up for Your Investor? Open Sourcing and Startup ...

Beefing IT up for Your Investor? Open Sourcing and Startup Funding: Evidence from GitHub Annamaria Conti Christian Peukert Maria Roche

Working Paper 22-001

Copyright © 2021 by Annamaria Conti, Christian Peukert, and Maria Roche.

Working papers are in draft form. This working paper is distributed for purposes of comment and discussion only. It may not be reproduced without permission of the copyright holder. Copies of working papers are available from the author.

Funding for this research was provided in part by Harvard Business School and the Swiss National Science Foundation (Project ID: 100013 188998 and 100013 197807).

Beefing IT up for Your Investor? Open Sourcing and Startup Funding: Evidence from GitHub

Annamaria Conti University of Lausanne

Christian Peukert University of Lausanne

Maria Roche Harvard Business School

Working Paper 22-001

Beefing IT up for your Investor?

Open Sourcing and Startup Funding: Evidence from GitHub∗

Annamaria Conti1, Christian Peukert1, and Maria Roche2

1HEC Lausanne

2Harvard Business School

September 28, 2021

Abstract

We study the participation of nascent firms in open-source communities and its implications for

attracting funding. To do so, we exploit unique data on 160,065 US startups linking information

from Crunchbase to firms’ GitHub accounts. Estimating a within-startup model saturated with fixed

effects, we show that startups accelerate their activities on the platform in the twelve months prior to

raising their first financing round. The intensity of their involvement on GitHub declines in the twelve

months after. Startups intensify those activities that rely on external technology sources above and

beyond the technologies they themselves control. Exploiting a shock that reduced the relative cost of

internal collaborations we provide evidence that startups’ decision to integrate external sources of

knowledge in their production function hinges on the relative cost vis-a-vis internal collaboration.

Applying machine learning to classify GitHub projects, we further unveil that the most prevalent

among these external activities are related to software development, data analytics, and integration.

Our results indicate that VCs and renowned investors are the most responsive to these activities.

Keywords: Startups, Technology Strategy, GitHub, Machine Learning, Venture Capital

∗Conti: [email protected], Peukert: [email protected], Roche: [email protected]. We thankKimon Protopapas and Ilia Azizi for excellent research assistance. This manuscript benefited from manyhelpful comments provided at the CAS Platform Seminar Munich, Digital Economy Workshop, the IntellectualProperty & Innovation Virtual seminar series, SCECR, and the West Coast Symposium. We are gratefulfor the suggestions provided by Karim Lakhani and members of the Laboratory for Innovation Science atthe onset of this work, as well as participants of seminars at HEC Paris, and LUISS Guido Carli University.Annamaria Conti and Christian Peukert acknowledge funding from the Swiss National Science Foundation(Project ID: 100013 188998 and 100013 197807). Maria Roche acknowledges funding from the HarvardBusiness School Division of Research and Faculty Development.

1 Introduction

Innovation is no longer the sole prerogative of large corporate and government laboratories,

but can originate from anywhere (Jeppesen and Lakhani, 2010). As a consequence, open-

source communities have become a fundamental point of supply of innovation for established

firms (Chesbrough, 2006b). Access to external sources of technologies through involvement

with open-source communities allows firms to not only reduce innovation costs, but also to

increase the number and the quality of innovative ideas (Lakhani et al., 2013b). While the

open innovation model with its costs and benefits has been widely studied in the context of

established firms, we know little about the involvement of nascent firms with open-source

communities and its implications for attracting funding.

The financing environment fundamentally shapes strategic choices very early in the life of

a new venture (Hellmann and Puri, 2002). A key milestone for firms is raising funding, where

prior literature stresses the importance of both the team (Bernstein et al., 2017; Gompers

et al., 2020) and the underlying technology of the venture (Kaplan et al., 2009) in doing so

successfully. Although studies have provided evidence indicating the importance of technology

in raising funding, technology and, particularly, its idea sources still remain a black box.

In order to open this box, we use novel data on startups’ digital technology activities

provided on the online development and hosting platform GitHub, which has become the

open-source platform “where the world builds software” (GitHub.com). We thereby examine

how intensively startups utilize this platform prior to raising a funding round, the type of

activities in which they engage, and their sources of knowledge. To do so, we exploit data

encompassing 160,065 US startups listed on Crunchbase that were founded between 2005

and 2019 as our initial sample. We then match these firms to their organization accounts

on GitHub using their respective website domains. We access public GitHub records logged

since 2011 from the GHArchive. The combination of these two data sets provides us with

information on the industry, investors, and total amount of funds a firm raises as well as the

type and nature of activities a firm engages in on GitHub.

1

In our first step, we establish a strong, positive correlation between having a GitHub

organization account and raising financing. This association holds even after including

information on the entrepreneurship-specific human capital of the team. Building on these

exploratory correlations, we use our rich panel data to estimate a within startup model that

relates the dynamics of technology strategy to achieving funding milestones. To address

obvious omitted variable concerns, we saturate the model with a wide range of fixed effects

pertaining to the firm, age, industry-year, region-year, and lead-investor. We show that

GitHub activity takes off in the twelve months preceding a first funding round, and then

levels off in the twelve months after. Specifically, our data reveal that an extra twelve

months towards raising a first round increases the probability of being active on GitHub by 4

percentage points, or a 50% increase in the mean.

To shed more light on these initial findings, we show that the effects are driven by

startups operating in the IT/Software sector, where entrepreneurial GitHub activities are

most intense. Additionally, we find that these dynamics dissipate over the course of subsequent

financing rounds suggesting that startups’ technology investments are less relevant for follow-

on investors. Taken together, these results provide an indication that earliest-round investors

do value a startup’s involvement with open source communities to produce and improve its

own technology.

We further unveil crucial heterogeneity among activity and investor types. Specifically,

we find that startups appear to intensify those activities on GitHub that rely on external

technology sources (repositories they do not control themselves) suggesting that shared

development and an integration of non-proprietary technologies can be important for attracting

funds. This finding is surprising as it appears to be in contrast with what a new venture

traditionally sets out to do and key theories in strategic management suggest. Namely, per

established definition in the literature, technology entrepreneurship is understood as the

commercial exploitation of proprietary innovation generated in-house (Shane, 2004) and not as

the joint creation and (open) sourcing of technologies from outside the venture. Furthermore,

2

internal resources that are rare, inimitable, valuable, and non-substitutable are viewed as a

primary source of competitive advantage (Barney, 1991; Wernerfelt, 1984), which would seem

to preclude publicly shared development and integration of non-proprietary technologies.

We enrich our analysis to assess how a startup’s reliance on external technology sources

depends on their relative cost. For this scope, we examine an exogenous change in GitHub’s

pricing model, introduced in October 2015, that led to a substantial decline in the costs

of internal collaborations through the GitHub platform, while leaving the cost of accessing

external technology sources unchanged. With the new pricing model, we observe that startups

rely relatively less on external technology sources and, instead, intensify internal collaborations

through the GitHub platform. This finding suggests that startups’ decision to integrate

external sources of knowledge in their production function crucially depends on their relative

cost: startups opt for these external sources when the cost of accessing them is comparatively

low.

We delve deeper into our findings and apply machine learning methods to classify external

activities by the following functionalities: Software Development/Backend (SD/BE), Machine

Learning (ML), Application Programming Interfaces (API), and User Interfaces (UI). We find

that all functionalities but those related to UI intensify prior to receiving a funding round and

decelerate afterwards. These results suggest that startups rely on external technology sources

to develop, scale, and integrate their digital technologies. The totality of these activities

matters for attracting investments. To complement these findings, we further show that

startups resort to GitHub to access best practices on human resource (HR) management,

highlighting the importance of GitHub for upgrading human capital resources.

One question these results unveil is whether the increased involvement of startups on

GitHub improves a startup’s innovation pipeline or simply acts as a signal for potential

investors (Conti et al., 2013a,b; Hsu and Ziedonis, 2013). The fact that startups – prior to

receiving a first round – intensify their GitHub involvement along the full spectrum of digital

technology production serves as a first indication that signaling is unlikely the sole driver of

3

our findings. To examine this potential mechanism in depth, we test whether startups are

more likely to make their previously private repositories public before raising a financing

round, which should occur if the main purpose of engagement on GitHub is signaling. Our

analyses reveal that startups are as likely to make a repository public before and after raising

a round of financing.

While this evidence speaks against the explanation that signaling is entirely driving our

results, it additionally suggests that a startup’s deceleration of its involvement on GitHub

post-round is unlikely the sole consequence of its investors’ distaste for their investees’ public

engagement. However, we do show that after a startup raises a financing round, the likelihood

that it chooses a permissive license – a license with only minimal restrictions on how software

can be used, modified, and redistributed – significantly declines, which may be interpreted

such that investors are sensitive to appropriability concerns. On the whole, these findings

provide suggestive evidence that startups are, indeed, “beefing up” their IT before they

receive early-stage financing by relying on open-source communities.

To bring our results full circle, we consider heterogeneity in the response of investors to a

startup’s engagement on GitHub. We find that startups intensify their activities on GitHub

prior to raising a first round, particularly when the participating investors in the round are

VCs, as well as more experienced and more successful investors. Although the VC industry

only funds approximately one thousand ventures per year in the United States, VC-financed

startups account for between 30 to 70 percent of startup IPOs (Kaplan and Lerner, 2010;

Ritter, 2016), suggesting that precisely the more selective investors with higher returns to

investment are those that value startups’ activities on GitHub (Hellmann and Puri, 2002).

Taken together, our findings contribute to several streams of literature. The first analyzes

firm commercialization strategies of open source software (Fosfuri et al., 2008), and the role

of open source for firm productivity (Nagle, 2018, 2019; Shah and Nagle, 2019). In general,

these studies find a strong positive relationship between the usage of open source software and

firm productivity. Our study is distinct from this work as it focuses on technology startups.

4

By showing that startups rely on external sources to develop their technologies, we highlight

a novel channel through which startups benefit from using and actively engaging in open

source software endeavors. Their involvement matters for attracting investors, particularly

VCs and successful investors, during startups’ early stages.

Contrary to the common wisdom that startups fully control all aspects of their underlying

technologies, we find that nascent firms extensively rely on external repositories to develop,

scale, and integrate their technologies. Hence, these findings speak to the literature on

firm boundaries and innovation (Di Stefano et al., 2012), in particular to seminal work on

complementarities between internal and external knowledge sources (Arora and Gambardella,

1990; Cohen and Levinthal, 1990) and technology licensing (Arora et al., 2001).

By showing that startups accelerate their contributions on GitHub prior to raising a first

round, we contribute to the entrepreneurial finance literature that investigates whether VCs

invest in the “jockey” (founding team) or the “horse” (technology) (Bernstein et al., 2017;

Gompers et al., 2020; Kaplan et al., 2009). Our study provides evidence that above and

beyond the “jockey” and the “horse”, relying on open-source communities matters. Finally,

our use of machine learning algorithms to classify startups’ activities on GitHub contributes

to a budding line of research that applies sophisticated data techniques to shed light on and

categorize firm strategies (Conti et al., 2020; Guzman and Li, 2019).

The remainder of the paper is organized as follows. Section II provides the conceptual

framework that will guide our empirical analysis. Section III discusses the data. Sections IV

and V present the empirical specification and the results, respectively. Section VI concludes.

2 Conceptual framework

Innovation is widely understood as a process of recombinant search whereby innovators

experiment with new components or new combinations of existing components (Fleming,

2001). Increasingly, scholars suggest that innovating in isolation is extremely difficult (Laursen

and Salter, 2006) and that embracing an “open innovation paradigm” using internal and

external components is crucial for firms to advance their technology pipeline (Chesbrough,

5

2006a; Grimpe and Kaiser, 2010). While openness can increase the number of components

available and the variability of these components – both of which are fundamental for

producing breakthrough innovations – this approach is not exempt from drawbacks (Almirall

and Casadesus-Masanell, 2010). In fact, e.g., relative to a hierarchical organization with direct

internal chains of command, open innovation can heighten coordination costs (Greenstein,

1996) and further give rise to appropriability concerns (David and Greenstein, 1990). Because

firms relying on open-source communities for components to recombine do not control all

intellectual inputs, their ability to appropriate value from their innovations may suffer as a

consequence (Buss and Peukert, 2015; Teece, 1986).

This trade-off between open-sourcing and appropriability is particularly relevant for

technology startups (Gans and Stern, 2003). Because of their youth and small size, startups

are more likely to both gain by drawing from external sources of ideas, and to struggle with

appropriability issues when they do so. As highlighted by the existing literature (Bernstein

et al., 2017; Gompers et al., 2020; Kaplan et al., 2009; Roche et al., 2020), solving this

trade-off is crucial for startups given that beyond the founding team itself, the underlying

technology of a venture is an important predictor for success in financing. Building on these

studies, it remains an open question whether and to what extent startups rely on open-source

communities as sources to produce and improve their own technologies and ultimately attract

funding.

Startups may deal with the trade-off between open-sourcing and appropriability depending

on the type of investors they plan to attract. The existing literature has shown that investor

types differ in their preferences and that VCs tend to value a startup’s technology relatively

more (Conti et al., 2013b). Another source of heterogeneity may be a startup’s position in its

life cycle. For example, Wasserman (2003) has provided evidence that the relative importance

of a startup’s technology decreases as the startup becomes more mature. Following this, we

may detect important heterogeneity in the use of open-source communities as a function of

investor type and development stage of a startup.

6

Finally, another important aspect to consider is the type of technology that a startup

develops. The extant literature has highlighted that digital technology has increasingly

become organized into loosely coupled layers of different technologies (Yoo et al., 2012).

Especially in the case of software, platforms, where control over knowledge is distributed

across multiple firms and heterogeneous communities, take a central position in the innovation

process (Lakhani and von Hippel, 2003). Therefore, it is possible that software startups

rely relatively more on open-source communities than other startups given differences in the

production process of innovation.

In what follows, we set out to empirically assess whether and how startups rely on open-

source communities to attract funds in the context of GitHub. Here, we are in the unique

position of observing how organizations are using technology components both within and

outside of their control.

3 Empirical setting and data

3.1 GitHub and its pricing model

GitHub is a hosting service for software development and collaborative version control

based on Git. With 40 million public repositories in April 2021, GitHub is the largest host

of source code1 and has come to be known as the place “where the world builds software”

(GitHub.com). Individuals and organizations use GitHub to improve and upgrade their own

projects by accessing code and information from other existing projects and to contribute

to other projects. GitHub also has a history of being backed by a number of high-profile

investors (e.g. through a $100m investment by Andreesen & Horowitz) and was acquired by

Microsoft for $7.5b in 2018.2

GitHub offers personal user accounts and organization accounts for teams. Personal

accounts that are linked to organization accounts appear as members. The source code on

GitHub is organized in repositories, that is, folders containing projects.

1See https://GitHub.com/search?q=is:public, accessed April 25, 2021.2See https://techcrunch.com/2018/06/04/microsoft-has-acquired-GitHub-for-7-5b-in-microsoft-stock/,accessed April 25, 2021.

7

https://GitHub.com/search?q=is:public

https://techcrunch.com/2018/06/04/microsoft-has-acquired-GitHub-for-7-5b-in-microsoft-stock/

The version control system Git allows to save snapshots of files within a repository. When

a user issues a commit, a snapshot of the file’s contents is created and associated with a

timestamp. Commits can be related to repositories of other users by initiating a request to

collaborate on a repository and recommending changes, which is called a pull request. The

owner of the repository can then decide whether to integrate the proposed changes into the

codebase. The owner of a repositories can also add members, i.e. GitHub user accounts, and

grant permissions to view (private) repositories or add contents without having to issue a

pull request. In our analysis below, we use the addition of members to internally controlled

repositories as a measure of increased internal collaboration. There is another, and more

frequently used, way to interact with the repositories of other users, which is called forking.

With forking, a user makes a copy of another user’s repository, which is then integrated into

the initial user’s account. Forks are used to propose changes to someone else’s project by

commit and pull requests, or to use someone else’s project as an input to a new project.

GitHub’s pricing model is based on subscription fees. While either type of account could

always have an unlimited number of public repositories, before October 2015, the subscription

fee included 5 private repositories for individual accounts and 10 private repositories for

organization accounts. Adding additional private repositories was possible but nearly doubled

the monthly subscription fee. After October 2015, GitHub introduced new pricing plans that

included unlimited private repositories for all types of accounts while the fixed subscription

did not increase markedly. For organization accounts, the new price policy implied that it

became drastically cheaper to add members and repositories, effectively lowering the marginal

cost of using GitHub as a development platform for internal (private) repositories.3 We will

make use of this sharp discontinuity in the analysis below.

3For more information refer to https://docs.github.com/en/billing/managing-billing-for-your-github-account/about-per-user-pricing and https://www.infoworld.com/article/3069275/github-ushers-in-unlimited-private-repositories.html.

8

3.2 Data

To build our dataset, we combine data on US startups and their investors, which are

available on Crunchbase, with information on their respective GitHub activities available

from GHArchive and through the GitHub API.

3.3 Crunchbase

Crunchbase serves as our first source of information on startups. This online directory

records fine-grained information on the startups, their founders, and their investors. As

described by Conti and Roche (2021), a considerable portion of the data are entered by

Crunchbase staff, while the remaining part is crowdsourced. Registered members can enter

information to the database, which the Crunchbase staff successively reviews. Relative

to databases such as VentureXpert and VentureSource, Crunchbase has the advantage of

providing a larger coverage of technology startups as it also encompasses startups that did

not raise venture capital. From Crunchbase, we extracted information pertaining to all the

recorded US startups that were founded between 2005 and 2020. This amounts to 160,065

startups, for which we have data encompassing their founding dates, industry and industry

keywords, location, financing rounds and participating investors, as well as exit outcomes.

As shown in Table 1a, approximately half of the startups (46%) are located in California,

Massachusetts, and New York, reflecting the comparative advantage of these regions in

entrepreneurship. Thirty-six percent of them raised at least one round of financing. Addition-

ally, 8% of the startups were acquired as of December 2020 and 1.2% went public through an

IPO.

While Crunchbase does not categorize startups into sectors, it provides industry group

information for each of them. There are approximately 40 distinct industry groups, and,

on average, a startup is assigned three industry group keywords. Using this information,

we computed a variable to measure the relatedness of a startup’s technology to software,

where we expect most of the GitHub activities to be concentrated. This index is defined as

the share of a startup’s industry groups that are related to software. The groups related

9

to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics,

Design, Financial Services, Gaming, Information Technology, Internet Services, Messaging

and Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software.

As shown in Table 1a, the mean of this index is 0.46.

〈 Insert Table 1 about here 〉

3.4 GitHub

We use the GitHub API to collect all organization accounts on GitHub. From these

accounts, we extract the websites of the account owners. We used this information to link

GitHub organization accounts to 10,482 Crunchbase company profiles. We further gather

time-variant information on the public events of all startups through GHArchive, which is a

non-profit service that provides a full record of the public timeline on GitHub since February

2011. This archive includes, among others, time-stamped data on events such as commits,

pull requests, and forks related to all own or external public repositories with which an

organization account interacts. In addition, we observe events related to the addition of new

members of repositories, and the creation of new public repositories or the transitioning of a

private repository to the public domain.

3.5 Classification of GitHub repositories

We further employed the GitHub API to collect meta information on all public repositories

in a startup’s organization account, as well as all personal accounts that are members of a

given organization account.4

Building on this data, we distinguish between a startup’s public activities related to its

own repositories and its engagement with external repositories. As reported in Figure 1,

we observe a stark increase over time in both the number of public activities related to the

startups’ own repositories and the number of engagements with external repositories.

〈 Insert Figure 1 about here 〉4Note that account-specific meta-information is time-invariant. We collected it through separate crawls inOctober 2020, January 2021, and March 2021.

10

Descriptive statistics reported in Table 1b show the top five events through which startups

engage with external and internal repositories. In the case of external repositories, the share

of forks is very large (96%). In contrast, contributions to external repositories in the form

of pushing (that is, making commits accessible to others), commenting on issues, and pull

requests are rare (1% or less). This distribution provides some evidence that startups access

external repositories to upgrade their technologies and not just to provide comments or

contribute to other users’ projects. As for the distribution of internal events, it appears to be

relatively more spread out. The most popular events concern adding members to repositories

(46%) and making repositories public (17%), but pushing and creating new events are also

frequent.

Using supervised and unsupervised machine learning methods which we describe in detail

in the Appendix, we classify the public repositories of all organizations, as well as the

external repositories with which the organizations interacted through commits, pull requests

or forks according to their type. We distinguish between repositories that pertain to software

development/back end (SD/BE), machine learning (ML), application programming interface

(API), and user interface (UI). By doing so, we consider a comprehensive set of aspects that

are relevant for the development of a digital technology. Because founders increasingly rely on

GitHub to upgrade their human resources, we generate an additional category encompassing

public repositories related to best practices on HR management. For example, these include

guidelines for coding interviews and templates for company policies on issues such as working

from home, diversity and inclusion.

We report a descriptive representation of the output obtained from the algorithm in Figure

2. Here, we display the most common words for each of the categories we consider. Figure

3, instead, captures the relevance of each category across industry groups. As shown, there

is variation across industry groups in the relevance of the different categories we consider.

For instance, the share of commits, pull requests, and forks related to UI repositories is

smallest for startups in privacy/security and highest in more consumer-facing internet services.

11

Startups in software/IT have the highest share of interactions with API-related repositories,

but the lowest share with productivity-related repositories. This suggests that firm- and

industry-specific unobservables are important factors to control for in our econometric models.

〈 Insert Figures 2 and 3 about here 〉

4 Empirical specification

We begin our empirical investigation by descriptively assessing the correlation between a

startup having an account on GitHub – which is our proxy for a startup’s involvement in an

open-source community – and attracting funds, from VCs and other investors. In a similar

vein as Bernstein et al. (2017), we also evaluate how the effect of a startup’s involvement

on GitHub compares to the effect of a startup’s human capital – as defined by whether a

startup’s founder or CXO5 is highly ranked on Crunchbase’s list of top people.6 To achieve

these goals, we estimate the following linear probability model:

Yifjs = α + βiGitHubi + γiTopTeami + ηf + νj + ψs + εifjs, (1)

where Yifjs is an indicator that takes the value 1 if a startup i, founded in year f , developing

a technology in industry group j, and located in region s, raised at least one financing round

as of January 2021 and zero otherwise. The variable GitHubi is an indicator that takes

the value 1 if a startup has a GitHub account, while TopTeami is an indicator identifying

prominent founders and CXOs. The latter measure equals one if an employee is ranked

among the top 500 by Crunchbase and zero otherwise. In this regression, we include fixed

effects for a startup’s founding year (ηf) and for whether the startup is located in either

Massachusetts, New York, or California (ψs). We additionally include industry group fixed

effects (νj). The industry groups we consider are Information Technology, Software, Data

5By CXOs we refer to Chief Executive Officers (CEOs), Chief Technology Officers (CTOs), Chief FinancialOfficers (CFOs), and Chief Marketing Officers (CMOs).

6See here: https://www.crunchbase.com/discover/people.

12

https://www.crunchbase.com/discover/people

Analytics, Internet Services, and Artificial Intelligence. As mentioned earlier, a startup can

be described by more than one industry group.

In the second part of the empirical analysis, we evaluate how a startup’s participation on

GitHub varies just before and after raising a round of financing. If investors value a startup’s

open innovation model, then we should observe that the startup is more active on GitHub in

the months preceding a given round and less active in the subsequent months. To test this

conjecture, we estimate a within-startup model, where we assess how the probability that a

startup engages in public activity on GitHub in month t varies during the twelve months

preceding and succeeding a startup’s financing round. To pursue this analysis, we restrict

the sample to startups that raised at least one round of financing and for which we could

identify a public GitHub account. Table 1c reports descriptive statistics for this sample. The

main model we estimate is:

Yitjs = α + βi,tPosti,t + γitτit + δitPosti,t × τit + ωi + +µit + νjt + ψst + ρit + εitjs, (2)

where Yitjs is an indicator that takes the value 1 if a startup i, developing a technology in

industry group j and located in region s, engages in at least one public activity – across

own and external repositories – on GitHub in month t and zero otherwise. In extensions to

our baseline analyses, we examine variants of this outcome to better understand the type

of GitHub activities in which a startup may engage. The indicator Postit is equal to one

for the twelve months that follow a startup’s given financing round, and zero in the twelve

months preceding it. While the focus of our analysis is on a startup’s first financing round,

we will also examine the reaction of startups as they relate to subsequent financing rounds.

The variable τit is the count of months to/from a given round. The coefficients of interest are

γ, which is associated with τit and δ, which is associated with the interaction between Postit

and τit. The γ represents the effect of an extra month towards a given financing round on

the likelihood that a startup engages in a public activity on GitHub, while δ represents the

13

effect of an extra month from a given financing round.

Ideally, we would be able to isolate the effect of engagement on GitHub from other

confounding factors such as a startup’s technology, the life cycle of a given technology, and

technology shocks. This is crucial given that these simultaneously occurring features may be

responsible for any relationship we detect. To get as close as possible to an ideal experiment,

we saturate the model with a fine-grained set of relevant fixed effects. In particular, ωi

is a focal startup’s fixed effect, which controls for unvarying differences across our sample

companies. For instance, it is possible that some startups are intrinsically more likely to

rely on GitHub, given the characteristics of the technologies they develop. A startup’s age

fixed effect is denoted by µi. The inclusion of this fixed effect addresses the concern that any

variation in the engagement of a startup on GitHub may be driven by a startup’s position in

the life cycle rather than by its intent to attract funding. In a more stringent specification,

we add age times startup and Postit times startup fixed effects given that life cycle effects

may vary by company and round raised. Postit times startup fixed effects also control for

potential investor requests post-investment, which may systematically vary from one startup

to another. We additionally include industry group times year fixed effects (νjt) to mitigate

the concern that the coefficient δ may be confounded by technology shocks, which could

happen at around the time a startup raises a financing round and induce the startup to

rely more intensely on GitHub. Moreover, we include location times year fixed effects (ψst),

distinguishing between California, Massachusetts, and New York (that is, the US technology

and entrepreneurial hubs) and all other states. The rationale is that possible technology

shocks happening in any given period should originate from these hubs. Finally, we include

fixed effects for whether a startup raised a subsequent round in any of the t months following

the observed round (ρit).

5 Results

This section explores (i) the correlation between having a GitHub account and obtaining

funds as well as (ii) the effect of raising a financing round on the evolution of a startup’s

14

activities on GitHub.

5.1 Raising a financing round

The results from estimating Eq. (1) are reported in Table 2. For this analysis, we consider

the full set of 160,065 startups. We cluster standard errors by founding year. The dependent

variable is the likelihood of having raised a financing round as of December 2020. As shown

in column (1), startups with a GitHub account are 0.21 percentage points more likely to have

raised a financing round. The magnitude of this correlation is smaller than the one between

having a prominent founder or CXO and raising a financing round (0.35 percentage points),

but remains economically relevant. While these correlations must be interpreted with caution

because there are several confounding factors that could bias our estimates, they suggest

that investors value a startup’s involvement with GitHub in addition to its founding team.

These descriptive relationships are confirmed in column (2) where we interact the GitHub

indicator with the TopTeam dummy to try to net out the impact of those confounding

factors that could be related with both a startup’s participation on GitHub and its founding

and/or top management team. Here, our results imply that having a GitHub account is

associated with a 0.21 percentage points higher likelihood of raising a financing round for

those startups without top founders and CXOs, which is equivalent to a 58% increase relative

to the unconditioned probability of raising a round. For comparison, having a prominent

founder and/or CXO is associated with a 0.39 percentage points higher likelihood of raising a

financing round for those startups without a GitHub account, equivalent to a 109% increase

relative to the unconditioned probability of raising a round. The implied magnitudes remain

similar in column (3) where we replace industry group fixed effects with the share of industry

groups that are related to software.

Overall, these results are in line with prior work (Bernstein et al., 2017; Gompers et al.,

2020; Kaplan et al., 2009) and suggest that while investors value a startup’s team, they also

positively consider its technology investments. In contrast to these studies, our focus in

the next section will be on startups’ engagement with open-source communities. We will

15

assess whether new ventures accelerate these investments prior to raising a financing round,

if so, the typologies of investments in which they engage, and heterogeneity in investor type

preferences.


5.2 Startup GitHub activities before and after raising a financing round

In this section, we examine how a startup’s activities on GitHub vary before and after

a given financing round. For this analysis, we consider the sample of 10,482 startups that

raised at least one financing round and had an account on GitHub. We begin by reporting

the main results and successively explore the mechanisms driving them.

5.2.1 Main results

Do investors value a startup’s engagement on GitHub? As we mentioned earlier, while such

engagement allows a firm to upgrade its technology, it is not exempt from drawbacks, which

include appropriability threats and coordination costs. If investors were to value a startup’s

participation on GitHub, we should observe the startup to become more active on GitHub

during the months preceding a given round relative to the subsequent months. If, instead,

investors do not value a startup’s reliance on the GitHub community, or at least do not value

it as much as they would value other startup aspects, then a startup may not intensify its

engagement on GitHub prior to raising a round. To assess startup open-source strategies

prior to raising a financing round, we first provide visual evidence for the relationship in

the raw data, and then we estimate Eq. (2). We report the results in Figure 4 and Table 3.

Here, we specifically focus on a funded startup’s first financing round. In the next section,

we extend the analysis to a startup’s subsequent rounds. We cluster standard errors at the

startup level.

Figure 4 offers a first glance into the raw relationship between GitHub engagement and

the time before and after raising a first round of financing. We depict startups’ average

likelihood of having an activity recorded by months prior to and following a first round. As

16

shown, the proportion of startups engaging in public activity on GitHub increases from 0.06

to 0.1 in the twelve months preceding a first financing round and remains stable around 0.1

to 0.11 in the subsequent twelve months.

〈 Insert Figure 4 about here 〉

In Table 3 we move from the raw pattern to the parametric relationship. Column (1)

includes startup and year fixed effects in addition to the fixed effects for whether a startup

raised a second round in any of the twelve months following the first round. We further

control for a startup’s position in the life cycle by including the natural logarithm of a

startup’s age. The positive coefficient associated with the indicator Postit suggests that, all

else equal, startups become more active on GitHub after they raise their first financing round.

The coefficient associated with the time trend τit is positive, indicating that the likelihood

that a startup engages in a public activity on GitHub increases as the startup gets closer

to raising its first round. The magnitude of the coefficient can be interpreted such that an

extra month in a startup’s life cycle is associated with a 0.34 percentage points increment in

the likelihood that a startup engages in a public activity on GitHub. This implies that an

extra twelve months towards raising a first round increases the probability of being active on

GitHub by 4 percentage points. This effect represents a 50% increase relative to the outcome

mean. Moreover, the negative coefficient of the interaction between Postit and τit suggests

that a startup’s participation in GitHub becomes less intensive the further in time it is from

the first round. The magnitude of the coefficient suggests that an extra month away from

a startup’s first round reduces the positive slope of the probability that a startup’s engage

in a public activity on GitHub by 0.2 percentage points. This effect entails that an extra

twelve months away from a first round lowers the positive increment in the probability of

being active on GitHub by 2.4 percentage points.

In column (2) of Table 3, we add region by year and industry group by year fixed effects.

As shown, the magnitudes of the coefficients remain very similar to those displayed in column

(1). In column (3), we replace our industry group keywords with the share of a startup’s

17

industry group keywords that are related to software. Having interacted the newly generated

variable with year fixed effects, we show that the results remain invariant.

In column (4), we estimate a similar model as the one in column (2), this time replacing

the natural logarithm of a startup’s age with age fixed effects. This specification, which is

our preferred one, delivers very similar results to those reported in columns (1) to (3).7 In

column (5), we estimate a more stringent specification than in column (4), allowing for the

possibility that life cycle effects may vary by startup and by whether the startup has raised a

first round or not. For this purpose, we include age times startup and Postit times startup

fixed effects. The magnitudes of the effects are stronger than those displayed in the previous

columns and the results continue to show that a startup’s engagement on GitHub intensifies

prior to raising a first round and fades afterwards.

Finally, in column (6) of Table 3, we add lead investor fixed effects to the specification in

column (5). The reason for including these investor fixed effects is that a startup’s technology

trajectory may vary depending on the type of investors it attracts. To identify lead investors,

we considered the categorization of lead investors provided by Crunchbase. When this was

missing, we considered the investor who backed the highest number of a focal startup’s

financing rounds as the lead investor. By adopting this procedure, we identified lead investors

for 62% of the startups. Reassuringly, the results continue to hold.8


While the findings presented so far suggest that investors value a startup’s participation

on GitHub, all else equal, it is possible that such participation reflects a startup’s technology

life cycle rather than its reliance on an open-source community. The inclusion of startup

and startup by age fixed effects should at least in part control for a startup’s technology

aspects, but some of them may remain unaccounted for. To further address this concern,

7Figure 9 graphically displays the results from this specification, having substituted τit with indicators foreach month preceding and following a given round. We observe an acceleration in a startup’s participation inGitHub as the first round approaches and a deceleration thereafter.

8Results available upon request show that the effects do not change by adding lead investor times age andlead investor times Postit fixed effects.

18

we delve deeper into the type of GitHub activities in which a startup engages. For this

purpose, we modify the dependent variable in Eq.(2) and examine whether a startup engages

in public activities related to its own repositories. This measure proxies for a startup’s

internal technology investments. Additionally, we investigate whether a startup engages with

external repositories. Since a startup does not have direct control over these repositories,

this additional outcome provides an indication of whether a startup builds on repositories

controlled by other GitHub users to develop its technologies. The results from this analysis

are reported Table 4 and in Figure 5. They show that, prior to raising a first round, startups

accelerate their engagement with external repositories more than they do for their internal

repositories. These findings support our interpretation that, all else equal, investors value a

startup’s engagement with open source communities and provide an indication that reliance

on external knowledge is at least as important as internally developed knowledge to upgrade

a startup’s technology and attract investments.9

〈 Insert Table 4 and Figure 5 about here 〉

We enrich our analysis by further assessing how a startup’s reliance on external technology

sources depends on their relative cost. For this purpose, we exploit a source of exogenous

variation in GitHub’s pricing model which occurred in October 2015. While the old pricing

model was repository based, implying that users had to pay for each new private repository,

the new pricing scheme was a user-based one. Upon paying their subscription, companies

would get unlimited private repositories allowing their employees to more efficiently organize

their collaborations within the company. While the marginal costs of internal collaborations

declined, the new pricing scheme left the marginal cost of interacting with external repositories

unchanged. As displayed in Figure 6, the decline in the marginal cost of internal collaborations

had a considerable positive impact on the number of such internal collaborations – proxied

9In results left unreported, we find that the results remain similar when we use the ratio of external to internalactivities as an outcome. With this specification we de-trend the trajectory of a startup’s technology. Further,we obtain similar results when we only retain the most common events through which startups access externalrepositories, that is, the forks.

19

by member events –, starting from the moment the new pricing model was implemented.

Conversely, startups’ engagement with external repositories or other internal repositories

unrelated to internal collaborations remained unchanged.

〈 Insert Figure 6 about here 〉

The important feature of this pricing reform is that it is uncorrelated with the technology

life cycle of a startup. As a result, we can assess how exogenous variations in the relative

cost of accessing external sources of knowledge affects the startups’ willingness to rely on

open-source communities to develop their technologies and attract funds. For this scope, we

modify Eq. (2) including an indicator – NewPriceSchemet – identifying the period following

the introduction of the new pricing scheme. We additionally introduce interaction terms

between NewPriceSchemet and the following variables: Postit, τit, and Postit × τit.

The results are reported in Table 5 and Figure 7. The dependent variables we consider

are: an indicator that equals one if a startup engages in a member event in month t and zero

otherwise (column 1); an indicator for whether a startup engages in a non-member, internal

event (column 2); and an indicator for whether a startup engages with an external repository

(column 3). To ensure that the introduction of the new pricing scheme is not correlated with

a startup’s characteristics and the life cycle of its technology, we control for the full set of

fixed effects.

As shown, prior to the pricing reform, startups would engage with external repositories

to develop their technologies on the verge of raising their first financing round. Conversely,

the number of member events was close to zero and invariant in the months preceding and

following the financing round. After the pricing reform, we observe that firms accelerate

their internal collaborations, but less so their engagement with external repositories, in the

months preceding their first financing round. Overall, these findings suggest that a startup’s

decision to integrate external sources of knowledge in the production of its technologies

crucially depends on their relative cost: startups opt for these external sources when the cost

of accessing them is comparatively low.

20


Delving deeper into the heterogeneity of a startup’s digital technology investments, we

next categorize a startup’s activities on GitHub according to whether they pertain to SD/BE,

ML, API, and UI. By doing so, we consider all the aspects related to the production and

development of a digital technology, ranging from the back end, data analysis, the front end,

and the interconnection between front end and back end. Since several practitioner reports

have pointed out that founders increasingly rely on GitHub to upgrade their human resources,

we additionally examine whether a startup resorts to GitHub to access best practices on HR

management (Jain, 2018).

The results are reported in Table 6 and graphically depicted in Figure 8. In Panel A of

Table 6, we focus on GitHub public activities related to a startup’s own repositories, while in

Panel B we examine a startup’s engagement with external repositories. In column (1) we show

that, prior to raising a first round, startups intensify both their internal activities and their

engagement with external repositories related to SD/BE. However, while a startup’s external

engagement significantly fades after raising a first round, the increment in the likelihood of

investing in SD/BE does not significantly change after raising a first round. In column (2),

we show that a startup’s investment in ML prior to raising a first round predominantly rests

on a startup’s engagement with external repositories. In column (3), where we consider a

startup’s investment in API, we again find a steady increase in the likelihood that a startup’s

engages with both internal and external repositories. However, the engagement with external

repositories fades after the round is raised. In column (4), we do not observe significant

variations in the likelihood that a startup engages in a UI activity, pre- and post-round, both

in Panel A and in Panel B. This may be due to the fact that the importance of UI rises

towards the later phases of a startup’s digital development, which may take place well after a

first round is raised. Finally, we show that startups rely on external repositories to access best

practices on HR management. These patterns are graphically depicted in Figure 6. Overall,

these results suggest that startups accelerate their investments in the different aspects of

21

their digital technologies in order to attract funds by, especially, relying on external sources

of knowledge.


One possibility is that the increased involvement of startups on GitHub simply acts as a

signal for potential investors rather than it helps improve a startup’s innovation stack (as we

have interpreted our findings so far). The fact that startups – prior to receiving a first round

– intensify their GitHub involvement along the full spectrum of digital technology production

provides initial evidence that signaling is unlikely the sole driver of our findings.

To provide further insight on the matter, we test whether startups are more likely to

make their new or previously private repositories public before raising a financing round,

which should occur if the main purpose of engagement on GitHub is signaling. In Table

7, we examine the likelihood of making at least one new or previously private repository

public, as measured by “Public Events” on the GitHub timeline of an organization account.

Across all specifications, which we produce by estimating the same models as displayed in

Table 3, we observe an increasing trend in turning private repositories public over time. Most

importantly, we do not detect any significant change in the trend after receiving a first round

of financing. These results suggest that startups are at least as likely to make their new or

previously private repositories public before and after raising a financing round.


While the evidence reported in Table 7 speaks against the potential explanation that

our main results are entirely driven by signaling, it additionally suggests that a startup’s

deceleration of its involvement on GitHub post-round is unlikely to be the sole consequence of

its investors’ distaste for their investees’ public engagement. Nevertheless, it is still possible

that investors may require their startups to modify the modalities of participation on GitHub.

To explore this issue further, we examine the types of licenses that startups choose for their

22

new repositories before and after raising a financing round. In particular, we distinguish

between more or less permissive strategies based on whether or not they grant use rights,

including the right to re-license.10 The results from this analysis are reported in Table 8.

Here, we show that startups become significantly less likely to adopt permissive licenses for

the new repositories they create, the further in time they are from the first round. This result

suggests that investors may be sensitive to appropriability concerns, inducing their portfolio

startups to engage on GitHub through less permissive licenses.

On the whole, these findings – together with our baseline results reported in Table 3 and in

Figure 4 – indicate that startups are, indeed, “beefing up” their IT before they receive early-

stage financing by relying on open-source communities rather than merely showcasing their

technology to potential investors. They additionally indicate that – post-round – startups

become increasingly aware of appropriability issues, choosing less permissive open source

licenses for their repositories.


5.2.2 Mechanisms

In this section, we explore the mechanisms driving our results. We begin by assessing

whether a startup’s engagement on GitHub varies depending on the type of technology a

startup develops. We then evaluate whether such engagement is also detected prior to raising

a second or a third financing round. Finally, we examine heterogeneity in startup response

depending on the type of investors participating in a given round.

In Table 9, we investigate how the engagement on GitHub prior to raising a first financing

round varies among startups that we classify as being software intensive according to the

industry groups specified on Crunchbase. The rationale for this analysis is that GitHub is

primarily used by companies for which software development is a core business activity and,

thus, we may expect stronger effects for these companies. Consistent with this conjecture,

10Examples of permissive licenses are BSD, MIT, Apache, and CC-BY. For a comprehensive list, refer tohttps://gist.github.com/nicolasdao/a7adda51f2f185e8d2700e1573d8a633.

23

the results in columns (1) and (2) show that “software startups” become relatively more

involved in GitHub prior to raising funds than the other startups. After both sets of startups

raise their first round, their activities on the GitHub platform decelerate.


The relevance of a startup’s engagement on GitHub may vary depending on whether

a startup seeks to raise a first financing round or subsequent rounds. While investors

participating in a startup’s first round may value a startup’s technology investments relatively

more, investors participating in follow-on rounds may prefer other aspects, such as a startup’s

marketing efforts (Wasserman, 2003). As a result, a startup’s involvement with the GitHub

community may matter more during a startup’s earliest rounds and progressively fade as

additional rounds are raised. To assess this conjecture, we re-estimate Eq. (2) for a startup’s

second and third round, respectively. The results are reported in Table 10 and in Figure 9.

We adopt the same equation specification as reported in column (4) of Table 3. As shown,

the increasing trend in the likelihood that a startup engages in a public activity on GitHub

exhibits the steepest slope in the twelve months prior to raising a first round and is flat in

the twelve months prior to raising a third round.


We next assess how a startup’s engagement on GitHub varies with the type of investor

participating in a startup’s first round. Specifically, we generate an indicator for whether

at least one of the participating investors in a startup’s first round is a VC. The reason

for focusing on VCs is that they have been found to positively value a startup’s technology

relative to other investors (Conti et al., 2013b) and to be especially active in screening and

nurturing their portfolio startups (Bernstein et al., 2016; Sørensen, 2007). To investigate this

potential source of heterogeneity, we modify Eq. (2) adding the interactions between V C and

Postit and V C and τit, as well as the triple interaction between V C, Postit, and τit. The

results are reported in column (1) of Table 11.

24

As shown, the likelihood that a startup engages in a public activity on GitHub increases

as the startup approaches the first round date, regardless of whether a VC participates in

the round or not. However the increment is significantly larger when a VC participates in

the round. Post-round, the slope of the time trend declines for both VC-led and non-VC-led

rounds. This suggests that VCs value, more than other investors, the fact that their portfolio

startups rely on open-source communities to develop their digital technologies.

A related aspect we consider is whether startups react differently depending on how

successful the investors participating in a startup’s first round are. We examine two measures

of investor success. The first is the number of investments an investor made in the five years

prior to an observed startup’s first round. The second is the number of successful investments

made during the same time period. A successful investment is the backing of a startup

that ultimately exits via an acquisition or an IPO. For each observed round, we retained

the maximum number of investments (or successful investments) made by the investors

participating in the round. The results are reported in columns (2) and (3) of Table 11. As

shown, the slope of the likelihood that a startup engages in a public activity on GitHub

prior to raising a first financing round is steeper the larger the number of past (successful)

investments made by the most successful investor in a round.


6 Discussion and Conclusions

In this paper, we analyze unique data linking Crunchbase profiles to accounts on the

software development platform GitHub. We investigate the role of open-source technology

investments among nascent firms in raising funds. Our results suggest that participation in

open-source communities plays an important role in achieving funding milestones. Specifically,

we observe a stark increase in activities on GitHub prior to a firm raising its first financing

round; an effect that dissipates in months thereafter. These results, which we obtain by

controlling for fixed differences among startups and their technologies, life cycle effects,

25

and technology shocks provide an indication that earliest-round investors value a startup’s

involvement with open source communities.

Particularly noteworthy is that startups intensify those GitHub activities that rely on

external technology sources above and beyond the technologies they themselves control,

particularly when the marginal cost of accessing these sources is reduced by a exogenous price

change for using the GitHub platform. The most prevalent among these external activities

are related to software development, data analytics, and integration of their technologies.

Additionally, startups resort to GitHub to access best practices on HR management, suggesting

that investors also value the ability of startups to upgrade their HR resources through GitHub.

We provide further evidence showing that the increased involvement of startups on

GitHub improves a startup’s innovation pipeline and does not merely act as a signal for

potential investors. Startups appear to be at least as likely to make new or previously private

repositories public before the first financing round as they are after the first round. In addition

to suggesting that signaling is not the only driver of our findings, this evidence speaks against

the potential explanation that startups decelerate their involvement on GitHub post-round

because of investors’ distaste for their investees’ public engagement. However, we do find that

after a startup raises a financing round, the likelihood that it chooses a permissive license for

its repositories significantly declines, providing an indication that investors may be sensitive

to appropriability concerns. Finally, our results further indicate that VCs and renowned

investors are the most responsive to these GitHub activities.

This study contributes to increasing our understanding of the different channels through

which technology can be sourced, what and how technology aspects matter for attracting

funding, and investor preferences. As such, our findings extend the literature that has

investigated how open innovation influences a firm’s ability to innovate and appropriate

the benefits of innovation (Dahlander and Gann, 2010), and the role of open source for

firm productivity (Nagle, 2018, 2019; Shah and Nagle, 2019). We highlight a novel channel

through which startups benefit from using and actively engaging in open source software

26

endeavors. In our context, technology startups rely heavily on external sources to develop –

“beef up” – their own technologies and their involvement in open-source communities matters

for attracting investors, particularly VCs and successful investors, during startups’ early

stages. Further, we contribute to the entrepreneurial finance literature that has investigated

whether VCs invest in the founding team or the technology (Bernstein et al., 2017; Gompers

et al., 2020; Kaplan et al., 2009). We provide evidence that open-source engagement in

development, scale and integration, matter more than investing in the user interface, at least

for raising the first round of financing. Finally, our use of machine learning algorithms to

classify startups’ activities on GitHub builds on an emerging line of research that applies

sophisticated data techniques to categorize firm strategies (Conti et al., 2020; Guzman and

Li, 2019).

The external validity of our approach may be limited given that we focus on a specific

open-source platform. However, GitHub is the largest host of source code with over 40 million

public repositories to date and anecdotal evidence suggests that investors take public GitHub

activities into consideration in their due diligence efforts (Jain, 2018). Although our results

are based on particular activities on a specific online platform, we believe they have broader

implications, especially for early stage ventures.

In conclusion, this paper contributes to our understanding of the impact of early-stage tech-

stack investments to achieving funding milestones. By opening the technology “black-box”, we

reveal important nuances that have been largely overlooked in the literature. Particularly our

finding that nascent firms extensively rely on external sources to develop, scale, and integrate

their own technologies may provide promising avenues for future research as it appears

contradictory to the classical notion of firm creation (Coase, 1937; Lippman and Rumelt,

2003; Zenger et al., 2011), organizational boundaries (Felin and Zenger, 2014; Lakhani et al.,

2013a), resources (Barney, 1991; Wernerfelt, 1984), and the need for intellectual property

protection to spur innovation (Baldwin and Hippel, 2011; Schumpeter, 1934).

27

References

Almirall, E. and Casadesus-Masanell, R. (2010). Open versus closed innovation: A model of

discovery and divergence. Academy of Management Review, 35(1):27–47.

Arora, A., Fosfuri, A., and Gambardella, A. (2001). Markets for technology and their

implications for corporate strategy. Industrial and Corporate Change, 10(2):419–451.

Arora, A. and Gambardella, A. (1990). Complementarity and external linkages: The strategies

of the large firms in Biotechnology. Journal of Industrial Economics, pages 361–379.

Baldwin, C. and Hippel, E. v. (2011). Modeling a paradigm shift: From producer innovation

to user and open collaborative innovation. Organization Science, 22(6):1399–1417.

Barney, J. (1991). Firm resources and sustained competitive advantage. Journal of Manage-

ment, 17(1):99–120.

Bernstein, S., Giroud, X., and Townsend, R. R. (2016). The impact of venture capital

monitoring. Journal of Finance, 71(4):1591–1622.

Bernstein, S., Korteweg, A., and Laws, K. (2017). Attracting early-stage investors: Evidence

from a randomized field experiment. Journal of Finance, 72(2):509–538.

Buss, P. and Peukert, C. (2015). R&D outsourcing and intellectual property infringement.

Research Policy, 44(4):977–989.

Chesbrough, H. (2006a). New puzzles and new findings. In Chesbrough, H., Vanhaverbeke, W.,

and West, I., editors, Open Innovation: Researching a New Paradigm. Oxford University

Press, Oxford, UK.

Chesbrough, H. (2006b). Open business models: How to thrive in the new innovation landscape.

Harvard Business Press.

Coase, R. H. (1937). The nature of the firm. Economica, 4(16):386–405.

Cohen, W. M. and Levinthal, D. A. (1990). Absorptive capacity: A new perspective on

learning and innovation. Administrative Science Quarterly, pages 128–152.

Conti, A., Guzman, J., and Rabi, R. (2020). Information frictions in the market for startup

28

acquisitions. Available at SSRN 3678676.

Conti, A. and Roche, M. P. (2021). Lowering the bar? External conditions, opportunity

costs, and high-tech start-up outcomes. Organization Science.

Conti, A., Thursby, J., and Thursby, M. (2013a). Patents as signals for startup financing.

Journal of Industrial Economics, 61(3):592–622.

Conti, A., Thursby, M., and Rothaermel, F. T. (2013b). Show me the right stuff: Signals for

high-tech startups. Journal of Economics & Management Strategy, 22(2):341–364.

Dahlander, L. and Gann, D. M. (2010). How open is innovation? Research Policy, 39(6):699–

709.

David, P. A. and Greenstein, S. (1990). The economics of compatibility standards: An

introduction to recent research. Economics of Innovation and New Technology, 1(1-2):3–41.

Di Stefano, G., Gambardella, A., and Verona, G. (2012). Technology push and demand

pull perspectives in innovation studies: Current findings and future research directions.

Research Policy, 41(8):1283–1295.

Felin, T. and Zenger, T. R. (2014). Closed or open innovation? Problem solving and the

governance choice. Research Policy, 43(5):914–925.

Fleming, L. (2001). Recombinant uncertainty in technological search. Management Science,

47(1):117–132.

Fosfuri, A., Giarratana, M. S., and Luzzi, A. (2008). The penguin has entered the building:

The commercialization of open source software products. Organization Science, 19(2):292–

305.

Gans, J. S. and Stern, S. (2003). The product market and the market for “ideas”: Commer-

cialization strategies for technology entrepreneurs. Research Policy, 32(2):333–350.

Gompers, P. A., Gornall, W., Kaplan, S. N., and Strebulaev, I. A. (2020). How do venture

capitalists make decisions? Journal of Financial Economics, 135(1):169–190.

Greenstein, S. (1996). Invisible hands versus invisible advisors: Coordination mechanisms in

29

economic networks. In Noam, E. and Nishuilleabhain, A., editors, Public Networks, Public

Objectives. Elsevier Science, Amsterdam.

Grimpe, C. and Kaiser, U. (2010). Balancing internal and external knowledge acquisition: the

gains and pains from R&D outsourcing. Journal of Management Studies, 47(8):1483–1509.

Guzman, J. and Li, A. (2019). Measuring founding strategy. Available at SSRN 3489585.

Hellmann, T. and Puri, M. (2002). Venture capital and the professionalization of start-up

firms: Empirical evidence. Journal of Finance, 57(1):169–197.

Hsu, D. H. and Ziedonis, R. H. (2013). Resources as dual sources of advantage: Implications

for valuing entrepreneurial-firm patents. Strategic Management Journal, 34(7):761–781.

Jain, V. (2018). Investor due diligence: Beyond the obvious. https://startupflux.com/

investor-due-diligence-beyond-the-obvious/amp/. Accessed: 2021-6-29.

Jeppesen, L. B. and Lakhani, K. R. (2010). Marginality and problem-solving effectiveness in

broadcast search. Organization Science, 21(5):1016–1033.

Kaplan, S. N. and Lerner, J. (2010). It ain’t broke: The past, present, and future of venture

capital. Journal of Applied Corporate Finance, 22(2):36–47.

Kaplan, S. N., Sensoy, B. A., and Stromberg, P. (2009). Should investors bet on the jockey

or the horse? Evidence from the evolution of firms from early business plans to public

companies. Journal of Finance, 64(1):75–115.

Lakhani, K. R., Boudreau, K. J., Loh, P.-R., Backstrom, L., Baldwin, C., Lonstein, E., Lydon,

M., MacCormack, A., Arnaout, R. A., and Guinan, E. C. (2013a). Prize-based contests can

provide solutions to computational biology problems. Nature Biotechnology, 31(2):108–111.

Lakhani, K. R., Lifshitz-Assaf, H., and Tushman, M. L. (2013b). Open innovation and

organizational boundaries: Task decomposition, knowledge distribution and the locus of

innovation. In Handbook of Economic Organization. Edward Elgar Publishing.

Lakhani, K. R. and von Hippel, E. (2003). How open source software works: “free” user-to-user

assistance. Research Policy, 32(6):923–943.

Laursen, K. and Salter, A. (2006). Open for innovation: the role of openness in explaining

30

https://startupflux.com/investor-due-diligence-beyond-the-obvious/amp/

https://startupflux.com/investor-due-diligence-beyond-the-obvious/amp/

innovation performance among uk manufacturing firms. Strategic Management Journal,

27(2):131–150.

Lippman, S. A. and Rumelt, R. P. (2003). A bargaining perspective on resource advantage.

Strategic Management Journal, 24(11):1069–1086.

Nagle, F. (2018). Learning by contributing: Gaining competitive advantage through contri-

bution to crowdsourced public goods. Organization Science, 29(4):569–587.

Nagle, F. (2019). Open source software and firm productivity. Management Science,

65(3):1191–1215.

Ritter, J. R. (2016). Initial public offerings: VC-backed IPO statistics through 2015.

Roche, M. P., Conti, A., and Rothaermel, F. T. (2020). Different founders, different venture

outcomes: A comparative analysis of academic and non-academic startups. Research Policy,

49(10):104062.

Schumpeter, J. A. (1934). The theory of economic development; An inquiry into profits,

capital, credit, interest, and the business cycle. Harvard University Press, Cambridge, Mass.

Shah, S. and Nagle, F. (2019). Why do user communities matter for strategy? Harvard

Business School Strategy Unit Working Paper, (19-126).

Shane, S. A. (2004). Academic Entrepreneurship: University Spinoffs and Wealth Creation.

Cheltenham: Edward Elgar Publishing.

Sørensen, M. (2007). How smart is smart money? A two-sided matching model of venture

capital. Journal of Finance, 62(6):2725–2762.

Teece, D. J. (1986). Profiting from technological innovation: Implications for integration,

collaboration, licensing and public policy. Research Policy, 15(6):285–305.

Wasserman, N. (2003). Founder-CEO succession and the paradox of entrepreneurial success.

Organization Science, 14(2):149–172.

Wernerfelt, B. (1984). A resource-based view of the firm. Strategic Management Journal,

5(2):171–180.

31

Yoo, Y., Boland Jr, R. J., Lyytinen, K., and Majchrzak, A. (2012). Organizing for innovation

in the digitized world. Organization Science, 23(5):1398–1408.

Zenger, T. R., Felin, T., and Bigelow, L. (2011). Theories of the firm–market boundary.

Academy of Management Annals, 5(1):89–133.

32

Figure 1: GitHub activities related to own and external repositories over time

Notes: In this figure, we distinguish between a startup’s public activities related to its own repositories andits engagement with external repositories.

33

Fig

ure

2:R

epos

itor

ycl

assi

fica

tion

:w

ord

clou

ds

Software

(Backend),

ISoftware

(Backend),

IIM

achin

eLea

rnin

g

API

Use

rIn

terface

(UI)

Human

Reso

urces(H

R)

Notes:

This

figu

redis

pla

ys

the

mos

tco

mm

onw

ords

for

each

ofth

eca

tego

ries

we

hav

eco

nsi

der

ed.

Not

eth

atth

ese

wor

dcl

ouds

are

bas

edon

the

full

clas

sifi

edsa

mp

le,

for

wh

ich

we

use

dth

eh

and

-cod

edtr

ain

ing

sam

ple

asst

arti

ng

poi

nt

for

k-n

eare

stn

eigh

bor

s.T

oco

nst

ruct

this

figu

re,

wor

dco

unts

for

agi

ven

cate

gory

wer

ed

isco

unte

dif

the

wor

ds

also

ap

pea

red

freq

uen

tly

inoth

erca

tegori

es.

34

Figure 3: GitHub activity types by industry groups

0 10 20 30 40 50percent

Software/IT

Mobile

Internet Services

Privacy/Security

Data/Analytics

Other

ML UIAPI Productivity

Notes: This figure reports the relevance of each technology category across industry groups.

35

Figure 4: Raw Data – Trends over time

.04

.06

.08

.1.1

2

-10 -5 0 5 10time

90% CI Fitted valuesFitted values (mean) activity_binary

Notes: This figure depicts the raw relationship between time to and from a founding round and startups’average activity on GitHub. Activity binary equals one if the startup logged an activity on GitHub. Weinclude 90% confidence intervals.

36

Figure 5: GitHub activity over time for external and internal activities

-.04

-.02

0.0

2.0

4C

oeffi

cien

t

12 6 2 2 6 12time

90%CI External90%CI Internal

By External vs. Internal

Notes: This figure displays how the engagement on GitHub prior to raising a first financing round and aftervaries among startups depending on whether the activities are external or internal. The solid line presentsthe results for all external activities and the dotted line displays results for internal activities. The respectivestarting point coefficients are displayed in t-12. Confidence intervals are at the 90% level.

37

Figure 6: GitHub activities at around the change of the GitHub’s pricing scheme

1600

018

000

2000

022

000

2400

0O

ther

Inte

rnal

020

0040

0060

0080

00M

embe

r Eve

nts

2500

030

000

3500

040

000

4500

050

000

Exte

rnal

2015m1 2015m7 2016m1 2016m7 2017m1

External Member Events Other Internal

Notes: This figure displays the general changes of activity on GitHub depending on type at around thetime of the GitHub’s change in the pricing scheme. In this graph we distinguish between external, internalwithout new member activity (”Other Internal”) and internal new member activity only (”Member Events”).

Figure 7: The effect of the change in the GitHub’s pricing scheme

-.02

-.01

0.0

1C

oeffi

cien

t

12 6 2 2 6 12time

90%CI Before price change90%CI After price change

A: Internal Member Events

-.03

-.02

-.01

0.0

1C

oeffi

cien

t

12 6 2 2 6 12time


B: Other Internal Events

-.06

-.04

-.02

0.0

2C

oeffi

cien

t

12 6 2 2 6 12time


C: External Events

Notes: This figure displays the effects of the change in the GitHub’s pricing scheme from a repository-basedone to a user-based one. With the new pricing model, paying customers can create unlimited privaterepositories. We distinguish between internal member events (A), internal events without new memberactivity (B), and external events (C). Confidence intervals are at the 90% level.

38

Figure 8: GitHub external activity types over time

-0.0

080.

011

Coe

ffici

ent

12 126 2 2 6 -0.

002

0.00

2C

oeffi

cien

t

12 6 2 2 6 12ML

-0.0

070.

006

Coe

ffici

ent

12 6 2 2 6 12API

-0.0

030.

005

Coe

ffici

ent

12 6 2 2 6 12UI

SD/BE

Notes: This figure displays the results reported in Table 5, Panel B, over time, where we categorize astartup’s external activities on GitHub according to whether they pertain to SD/BE, ML, API, and UI.Confidence intervals are at the 90% level.

39

Figure 9: GitHub activity by round

-.04

-.02

0.0

2.0

4C

oeffi

cien

t

12 6 2 2 6 12time

90%CI Round 190%CI Round 290%CI Round 3

By Round

Notes: This figure displays how the engagement on GitHub prior to raising a first financing round and aftervaries among startups depending on the round in question. The respective starting point coefficients aredisplayed in t-12. Confidence intervals are at the 90% level.

40

Table 1a: Summary statistics - Full sample of startups

Obs Mean Std. Dev. Min Max p50Raised funds 160065 0.357 0.479 0 1 0IPO 160065 0.012 0.107 0 1 0Acquired 160065 0.081 0.273 0 1 0GitHub 160065 0.093 0.29 0 1 0Top Team 160065 0.002 0.039 0 1 0AI 160065 0.039 0.194 0 1 0Data Analytics 160065 0.098 0.297 0 1 0Information Technology 160065 0.182 0.386 0 1 0Internet Services 160065 0.200 0.400 0 1 0Software 160065 0.356 0.479 0 1 0N. Industry Groups 160065 2.917 1.729 1 19 3Software share 154885 0.464 0.387 0 1 .5California 160065 0.296 0.457 0 1 0Massachusetts 160065 0.045 0.207 0 1 0New York 160065 0.128 0.334 0 1 0

Table 1b: Top 5 external and internal GitHub events

External InternalType Total Share Type Total ShareForkEvent 194518 0.96 MemberEvent 180338 0.46WatchEvent 3486 0.02 PublicEvent 64355 0.17PushEvent 1797 0.01 PushEvent 57196 0.15IssueCommentEvent 1316 0.01 CreateEvent 41648 0.11PullRequestEvent 645 0.003 TeamAddEvent 21001 0.05Total Events 203565 Total Events 388001

41

Table 1c: Summary statistics - Startups that raised funds and have a GitHub account

Obs Mean Std. Dev. Min Max p50Engage with internal repos 10482 0.014 0.116 0 1 0Engage with external repos 10482 0.025 0.157 0 1 0Software (back end) 10482 0.008 0.091 0 1 0ML 10482 0 0.020 0 1 0API 10482 0.007 0.084 0 1 0UI 10482 0.002 0.045 0 1 0HR 10482 0 0.020 0 1 0AI 10482 0.110 0.313 0 1 0Data Analytics 10482 0.227 0.419 0 1 0Information Technology 10482 0.265 0.441 0 1 0Internet Services 10482 0.284 0.451 0 1 0Software 10482 0.615 0.487 0 1 1N. Industry Groups 10482 3.711 1.803 1 15 3Software share 10482 0.610 0.332 0 1 0.667California 10482 0.434 0.496 0 1 0Massachusetts 10482 0.059 0.236 0 1 0New York 10482 0.159 0.366 0 1 0

Notes The proportions reported for the various GitHub activities are relative the period starting twelvemonths prior to a startup’s first financing round and ending twelve months after.

42

Table 2: Raising a financing round: Determinants

(1) (2) (3)Raising a financing round

GitHubi 0.208∗∗∗ 0.209∗∗∗ 0.235∗∗∗

(0.0102) (0.0103) (0.0101)

Top Teami 0.351∗∗∗ 0.387∗∗∗ 0.362∗∗∗

(0.0161) (0.0347) (0.0155)

GitHubi × Top Teami -0.0760(0.0608)

Software sharei -0.0232∗∗∗

(0.00409)

Founding year FE Y Y YIndustry Group FE Y Y

Observations 160065 160065 154885Mean DV 0.357 0.357 0.363

Notes : This table reports the results from estimating variants of Eq.(1). The dependent variable is the likelihood a startup raised at leastone financing round as of January 2021. The variable GitHubi is anindicator that takes the value 1 if a startup has a GitHub account,while TopTeami is an indicator identifying prominent founders andCXOs. The latter measure equals one if an employee is ranked amongthe top 500 by Crunchbase and zero otherwise. We include fixedeffects for a startup’s founding year and for whether the startup islocated in Massachusetts, New York, or California. In columns (1)and (2), we include industry group fixed effects. The industry groupswe consider are Information Technology, Software, Data Analytics, In-ternet Services, and Artificial Intelligence. In column (3), we replaceour industry groups fixed effects with the share of a startup’s indus-try group keywords that are related to software (Software sharei).The keywords related to software are: Apps, Artificial Intelligence,Consumer Electronics, Data and Analytics, Design, Financial Ser-vices, Gaming, Hardware, Information Technology, Internet Services,Messaging and Telecommunications, Mobile, Payments, Platforms,Privacy and Security, and Software. Standard errors - reported inparentheses - are clustered at the startup founding year level. Signifi-cance noted as: *p<0.10; **p<0.05; ***p<0.01.

43

Table 3: Startup engagement on GitHub

(1) (2) (3) (4) (5) (6)Engaging in a public activity on GitHub

Postit 0.0418∗∗∗ 0.0431∗∗∗ 0.0430∗∗∗ 0.0456∗∗∗

(0.00550) (0.00550) (0.00550) (0.00554)

τit 0.00340∗∗∗ 0.00345∗∗∗ 0.00344∗∗∗ 0.00365∗∗∗ 0.00421∗∗∗ 0.00445∗∗∗

(0.000247) (0.000248) (0.000248) (0.000249) (0.000262) (0.000356)

Postit × τit -0.00198∗∗∗ -0.00208∗∗∗ -0.00207∗∗∗ -0.00242∗∗∗ -0.00346∗∗∗ -0.00297∗∗∗

(0.000377) (0.000377) (0.000378) (0.000381) (0.000474) (0.000643)

Ageit 0.0743∗∗∗ 0.0672∗∗∗ 0.0659∗∗∗

(0.00673) (0.00674) (0.00673)

Startup FE Y Y Y Y Y YYear × Region FE Y Y Y Y YYear × Industry Group FE Y Y Y YYear × Software sharei YStartup Age FE YStartup Age × Startup FE Y YPostit × Startup FE Y YLead Investor FE Y

Observations 246462 246462 246462 246462 244617 150972R2 0.273 0.274 0.274 0.275 0.423 0.424

Notes: This table reports the results from estimating variants of Eq. (2). Postit is equal to one for the twelvemonths that succeed a startup’s first financing round, and zero in the twelve points preceding it. The variable τit,instead, is the count of months to/from a startup’s first round. In column (1), we control for the natural logarithmof a startup’s age. We additionally include startup and year fixed effects, as well as fixed effects controlling forwhether a startup raised a second round in any of the twelve months following the first round. In column (2), weadd region by year and industry group by year fixed effects. The industry groups we consider are InformationTechnology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replaceour industry groups with the share of a startup’s industry group keywords that are related to software (Softwaresharei). The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data andAnalytics, Design, Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messagingand Telecommunications, Mobile, Payments, Platforms, Privacy and Security, and Software. In column (4), weestimate a similar model as the one in column (3), this time replacing the natural logarithm of a startup’s age withage fixed effects. In column (5), we estimate a similar model as the one in column (4) where we add age by startupand Postit by startup fixed effects. Finally, in column (6), we add lead investor fixed effects to the specification incolumn (5). Standard errors - reported in parentheses - are clustered at the startup level. Significance noted as:*p<0.10; **p<0.05; ***p<0.01.

44

Table 4: Startup engagement on GitHub - Engagement with own repositories vs. engagementwith external repositories

(1) (2)Engaging with:

Own repositories External repositories

Postit 0.0124∗∗∗ 0.0385∗∗∗

(0.00365) (0.00509)

τit 0.00136∗∗∗ 0.00295∗∗∗

(0.000160) (0.000226)

Postit × τit -0.000525∗∗ -0.00213∗∗∗

(0.000247) (0.000349)

Startup FE Y YYear × Region FE Y YYear × Industry Group FE Y YStartup Age FE Y Y

Observations 246462 246462R2 0.237 0.243

Notes : This table reports the results from estimating variants of Eq. (2). The outcomein column (1) is whether a startup engages in public activities related to its ownrepositories in t. This measure proxies a startup’s internal technology investments.The outcome in column (2) is whether a startup engages with external repositories int. This outcome provides an indication of whether a startup builds on repositoriescontrolled by other GitHub users to develop its technologies. Postit is equal to onefor the twelve months that succeed a startup’s first financing round, and zero in thetwelve points preceding it. The variable τit, instead, is the count of months to/froma startup’s first round. We include startup and startup age fixed effects, as well asfixed effects controlling for whether a startup raised a subsequent round in any of thetwelve months following a first round. Additionally, we include region by year andindustry group by year fixed effects. Standard errors - reported in parentheses - areclustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

45

Table 5: Startup reaction to an exogenous change in the GitHub pricing scheme

(1) (2) (3)Member event Other internal event External event

Postit × τit 0.000201 -0.00134∗∗∗ -0.00255∗∗∗

(0.000131) (0.000468) (0.000625)

τit 0.000125∗∗∗ 0.00189∗∗∗ 0.00402∗∗∗

(0.0000456) (0.000232) (0.000315)

NewPriceSchemet 0.00507 0.00775∗∗ 0.00605(0.00346) (0.00357) (0.00581)

Postit × NewPriceSchemet 0.0167∗∗ -0.00647 0.0184(0.00806) (0.00955) (0.0143)

NewPriceSchemet × τit 0.000902∗∗∗ -0.000886∗∗∗ -0.00140∗∗∗

(0.000232) (0.000322) (0.000481)

Postit × NewPriceSchemet × τit -0.000860∗∗ 0.000690 -0.000519(0.000394) (0.000585) (0.000830)

Startup FE Y Y YYear × Region FE Y Y YYear × Industry Group FE Y Y YStartup Age × Startup FE Y Y YPostit × Startup FE Y Y Y

Observations 244856 244856 244856R2 0.295 0.390 0.400

Notes: In this table, we assess how the introduction if a new GitHub pricing scheme in October 2015 affected the startups’willingness to rely on open-source communities to develop their technologies and attract funds. We modify Eq. (2) includingan indicator -NewPriceSchemet- identifying the period following the introduction of the new pricing scheme. We additionallyintroduce interaction terms between NewPriceSchemet and the following variables: Postit, τit, and Postit×τit. The dependentvariables we consider are: an indicator that equals one if a startup engages in a member event in month t and zero otherwise(column 1); an indicator for whether a startup engages in a non-member, internal event (column 2); and an indicator for whethera startup engages with an external repository (column 3). To ensure that the introduction of the new pricing scheme is notcorrelated with a startup’s characteristics and the life cycle of its technology, we control for the full set of fixed effects. Standarderrors - reported in parentheses - are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

46

Table 6: Startup engagement on GitHub - By GitHub activity

Panel A: Own repositories(1) (2) (3) (4) (5)

SD/BE ML API UI HR

Postit 0.00336∗ 0.000710 0.00157 -0.000480 0.000151(0.00182) (0.000554) (0.00163) (0.000834) (0.000273)

τit 0.000407∗∗∗ 0.0000247 0.000431∗∗∗ 0.0000497∗ -0.0000117(0.0000769) (0.0000189) (0.0000682) (0.0000270) (0.0000129)

Postit × τit -0.000128 -0.0000357 -0.000115 0.0000668 0.00000127(0.000118) (0.0000333) (0.000109) (0.0000518) (0.0000178)

Observations 246462 246462 246462 246462 246462R2 0.123 0.0704 0.121 0.0761 0.0486

Panel B: External repositories(1) (2) (3) (4) (5)

SD/BE ML API UI HR

Postit 0.0102∗∗∗ 0.00116∗ 0.00956∗∗∗ 0.00170 0.00140∗∗∗

(0.00252) (0.000698) (0.00238) (0.00132) (0.000484)

τit 0.000449∗∗∗ 0.000116∗∗∗ 0.000354∗∗∗ 0.0000902 0.0000686∗∗∗

(0.000114) (0.0000330) (0.000100) (0.0000558) (0.0000231)

Postit × τit -0.000559∗∗∗ -0.000123∗∗∗ -0.000545∗∗∗ -0.0000373 -0.000110∗∗∗

(0.000164) (0.0000469) (0.000157) (0.0000863) (0.0000312)

Observations 246462 246462 246462 246462 246462R2 0.136 0.0664 0.122 0.0966 0.0533

Notes: This table reports the results from estimating variants of Eq. (2). The outcome in: column (1) is whether astartup engages in public activities related to software development/back end (SD/BE); column (2) is whether a startupengages in public activities related to machine learning (ML); column (3) is whether a startup engages in public activitiesrelated to Application Programming Interface (API); column (4) is whether a startup engages in public activities relatedto user interface (UI); column (5) is whether a startup engages in public activities related to human resources (HR). InPanel A, we consider public activities on GitHub related to a startup’s own repositories. In Panel B, we focus on astartup’s engagement with external repositories. Postit is equal to one for the twelve months that succeed a startup’sfirst financing round, and zero in the twelve points preceding it. The variable τit, instead, is the count of months to/froma startup’s first round. We include startup and startup age fixed effects, as well as fixed effects controlling for whethera startup raised a subsequent round in any of the twelve months following a first round. Additionally, we include regionby year and industry group by year fixed effects. Standard errors - reported in parentheses - are clustered at the startuplevel. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

47

Table 7: Startup engagement on GitHub - Making new or existing private repositories public

(1) (2) (3) (4) (5) (6)Making a repository public

Postit 0.00519∗ 0.00561∗∗ 0.00561∗∗ 0.00604∗∗

(0.00269) (0.00269) (0.00269) (0.00271)

τit 0.000815∗∗∗ 0.000838∗∗∗ 0.000834∗∗∗ 0.000876∗∗∗ 0.000937∗∗∗ 0.000952∗∗∗

(0.000117) (0.000117) (0.000118) (0.000118) (0.000120) (0.000160)

Postit × τit -0.000103 -0.000139 -0.000139 -0.000204 -0.000185 -0.0000330(0.000179) (0.000179) (0.000179) (0.000180) (0.000227) (0.000319)

Ageit 0.0175∗∗∗ 0.0149∗∗∗ 0.0141∗∗∗

(0.00304) (0.00304) (0.00303)


Observations 246462 246462 246462 246462 244617 150972R2 0.273 0.274 0.274 0.275 0.423 0.424

Notes: This table reports the results from estimating variants of Eq. (2). The dependent variable is an indicator forwhether a startup make a new or a previously private repository public. Postit is equal to one for the twelve monthsthat succeed a startup’s first financing round, and zero in the twelve points preceding it. The variable τit, instead, isthe count of months to/from a startup’s first round. In column (1), we control for the natural logarithm of a startup’sage. We additionally include startup and year fixed effects, as well as fixed effects controlling for whether a startupraised a second round in any of the twelve months following the first round. In column (2), we add region by yearand industry group by year fixed effects. The industry groups we consider are Information Technology, Software,Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace our industry groups with theshare of a startup’s industry group keywords that are related to software (Software sharei). The keywords related tosoftware are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design, Financial Services,Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications, Mobile, Payments,Platforms, Privacy and Security, and Software. In column (4), we estimate a similar model as the one in column(3), this time replacing the natural logarithm of a startup’s age with age fixed effects. In column (5), we estimatea similar model as the one in column (4) where we add age by startup and Postit by startup fixed effects. Finally,in column (6), we add lead investor fixed effects to the specification in column (5). Standard errors - reported inparentheses - are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

48

Table 8: Startup engagement on GitHub - New repositories with permissive licenses

(1) (2) (3) (4) (5) (6)New repositories with permissive licenses

Postit 0.0196∗∗∗ 0.0203∗∗∗ 0.0201∗∗∗ 0.0213∗∗∗

(0.00410) (0.00409) (0.00410) (0.00413)

τit 0.00131∗∗∗ 0.00134∗∗∗ 0.00133∗∗∗ 0.00143∗∗∗ 0.00182∗∗∗ 0.00197∗∗∗

(0.000175) (0.000175) (0.000175) (0.000176) (0.000180) (0.000244)

Postit × τit -0.000747∗∗∗ -0.000803∗∗∗ -0.000789∗∗∗ -0.000945∗∗∗ -0.00155∗∗∗ -0.00162∗∗∗

(0.000278) (0.000277) (0.000277) (0.000280) (0.000341) (0.000465)

Ageit 0.0481∗∗∗ 0.0446∗∗∗ 0.0446∗∗∗

(0.00464) (0.00464) (0.00463)


Observations 246762 246761 246762 246761 244856 151129R2 0.187 0.188 0.187 0.188 0.337 0.331

Notes: This table reports the results from estimating variants of Eq. (2). The dependent variable is an indicator forwhether a startup chooses a permissive license (BSD, MIT, Apache, CC-BY) for at least one new repository. Postit isequal to one for the twelve months that succeed a startup’s first financing round, and zero in the twelve points precedingit. The variable τit, instead, is the count of months to/from a startup’s first round. In column (1), we control for thenatural logarithm of a startup’s age. We additionally include startup and year fixed effects, as well as fixed effectscontrolling for whether a startup raised a second round in any of the twelve months following the first round. In column(2), we add region by year and industry group by year fixed effects. The industry groups we consider are InformationTechnology, Software, Data Analytics, Internet Services, and Artificial Intelligence. In column (3), we replace ourindustry groups with the share of a startup’s industry group keywords that are related to software (Software sharei).The keywords related to software are: Apps, Artificial Intelligence, Consumer Electronics, Data and Analytics, Design,Financial Services, Gaming, Hardware, Information Technology, Internet Services, Messaging and Telecommunications,Mobile, Payments, Platforms, Privacy and Security, and Software. In column (4), we estimate a similar model as theone in column (3), this time replacing the natural logarithm of a startup’s age with age fixed effects. In column (5),we estimate a similar model as the one in column (4) where we add age by startup and Postit by startup fixed effects.Finally, in column (6), we add lead investor fixed effects to the specification in column (5). Standard errors - reported inparentheses - are clustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

49

Table 9: Startup engagement on GitHub - Distinguishing between startups developingsoftware and others

(1) (2)Engaging in a public activity on GitHub

Postit 0.0260∗∗∗ 0.0115(0.00775) (0.00941)

Postit × Softwarej 0.0318∗∗∗

(0.0104)

τit 0.00260∗∗∗ 0.00142∗∗∗

(0.000352) (0.000401)

τit × Softwarej 0.00171∗∗∗

(0.000481)

Postit × τit -0.00122∗∗ -0.000207(0.000514) (0.000630)

Postit × τit × Softwarej -0.00196∗∗∗

(0.000691)

Postit × Software sharei 0.0562∗∗∗

(0.0148)

τit × Software sharei 0.00367∗∗∗

(0.000625)

Postit × τit × Software sharei -.0036455∗∗∗

(0.000996)

Startup FE Y YYear × Region FE Y YYear × Industry Group FE Y YStartup Age FE Y Y

Observations 246462 246462R2 0.275 0.276

Notes : This table reports the results from estimating variants of Eq. (2). In column (1), weinteract Postit, τit, and Postit × τit with an indicator that equals 1 if a startups is assignedto the ”software” industry group by Crunchbase and zero otherwise. In column (2), weinteract Postit, τit, and Postit × τit with the share of a startup’s industry keywords thatare related to software. The keywords related to software are: Apps, Artificial Intelligence,Consumer Electronics, Data and Analytics, Design, Financial Services, Gaming, Hardware,Information Technology, Internet Services, Messaging and Telecommunications, Mobile,Payments, Platforms, Privacy and Security, and Software. The models in both columnsinclude startup, age, region by year, and industry group by year fixed effects. We alsoinclude fixed effects controlling for whether a startup raised a second round in any of thetwelve months following the first round. Standard errors - reported in parentheses - areclustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

50

Table 10: Startup engagement on GitHub - By round

(1) (2) (3)Engaging in a public activity on GitHub

Round I Round II Round III

Postit 0.0456∗∗∗ 0.0311∗∗∗ 0.0188∗

(0.00554) (0.00474) (0.00998)

τit 0.00365∗∗∗ 0.00334∗∗∗ -0.0000236(0.000249) (0.000719) (0.00270)

Postit × τit -0.00242∗∗∗ -0.00213∗∗∗ 0.000321(0.000381) (0.000738) (0.00271)

Startup FE Y Y YYear × Region FE Y Y YYear × Industry Group FE Y Y YStartup Age FE Y Y Y

Observations 246462 168695 115082R2 0.275 0.302 0.318

Notes: This table reports the results from estimating variants of Eq. (2). Postit isequal to one for the twelve months that succeed a startup’s given financing round,and zero in the twelve points preceding it. The variable τit, instead, is the count ofmonths to/from a startup’s round. In column (1), we examine a startup’s engagementon GitHub before and after a first financing round. In column (2), we examine astartup’s engagement on GitHub before and after a second financing round. Finally, incolumn (3), we examine a startup’s engagement on GitHub before and after a thirdfinancing round. We include startup and startup age fixed effects, as well as fixedeffects controlling for whether a startup raised a subsequent round in any of the twelvemonths following the examined round. Additionally, we include region by year andindustry group by year fixed effects. Standard errors - reported in parentheses - areclustered at the startup level. Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

51

Table 11: Startup engagement on GitHub - By investor type

(1) (2) (3)Engaging in a public activity on GitHub

VC Round Successful Investor Round

Postit 0.0449∗∗∗ 0.0412∗∗∗ 0.0334∗∗∗

(0.00674) (0.00727) (0.00668)

Postit × VCit -0.000350(0.0108)

Postit × Success Inv.it 0.000866 0.00159(0.00217)) ((0.00289)

τit 0.00280∗∗∗ 0.00250∗∗∗ 0.00265∗∗∗

(0.000287) (0.000305) (0.000289)

τit × VCit 0.00190∗∗∗

(0.000419)

τit × Success Inv.it 0.000416∗∗∗ 0.000545∗∗∗

(0.0000840) (0.000113)

Postit × τit -0.00235∗∗∗ -0.00227∗∗∗ -0.00182∗∗∗

(0.000456) (0.000494) (0.000455)

Postit × τit × VCit -0.00000433(0.000726)

Postit × τit× Success Inv.it 0.00000681 0.0000485(0.000146) (0.000194)

Startup FE Y Y YYear × Region FE Y Y YYear × Industry Group FE Y Y YStartup Age FE Y Y Y

Observations 246462 246462 246462R2 0.276 0.276 0.276

Notes: This table reports the results from estimating variants of Eq. (2). Postit is equalto one for the twelve months that succeed a startup’s first financing round, and zero inthe twelve points preceding it. The variable τit, instead, is the count of months to/froma startup’s first round. We interact Postit, τit, and Postit × τit with: an indicatorthat equals 1 if a startup had a VC participating in its first round (column (1)); themaximum number of investments in which any of a startup’s investors participated inthe five years preceding t (column (2)); the maximum number of successful investmentsmade by a startup’s investors in the five years preceding t (column (3)). A successfulinvestment is one made in startups that ultimately exited via an acquisition or an IPO.We include startup and startup age fixed effects, as well as fixed effects controlling forwhether a startup raised a subsequent round in any of the twelve months following afirst round. Additionally, we include region by year and industry group by year fixedeffects. Standard errors - reported in parentheses - are clustered at the startup level.Significance noted as: *p<0.10; **p<0.05; ***p<0.01.

52

7 Appendix

As mentioned in the main text, we classify the public repositories of all organizations, as

well as the external repositories with which the organizations interacts through commits, pull

requests or forks according to their type. We distinguish between repositories that pertain to

software development/back end (SD/BE), machine learning (ML), application programming

interface (API), and user interface (UI). To do so, we implement a combination of the TF-IDF

vectorization and K-Nearest Neighbors prediction methods.

7.0.1 TF-IDF vectorization

The first step in the categorization process is vectorization using the TF-IDF method.

Each text document (that is, a collection of all the text in a single repository) is transformed

into a vector of numbers. These numbers are the ”scores” that the TF-IDF method assigns

to each word in a document. A TF-IDF score is defined as the frequency with which a given

word occurs in a certain document divided by the fraction of documents in which the word

is present.11 Thus, if a word is frequently used in a given document then it will have a

high score. Similarly, if a word is used in very few other documents it will also have a high

score. This vectorization step is fundamental as the vectors that are generated can be used

to calculate the similarity between two repositories by comparing the vectors that represent

them.

7.0.2 K-Nearest Neighbors prediction

Using the vectors produced by the TF-IDF method, we are able to compare any two

repositories by calculating the dot product of the vectors that represent them. This calculation

yields a similarity score between 0 and 1 (0 meaning totally different and 1 meaning totally

similar).

Given that we can compare any two repositories, it is possible to classify any repository

by examining similar repositories that have already been classified. To do so, we manually

11More precisely, the formula used is: tf ∗ log(idf) where tf is the frequency of the word in a given documentand idf is the inverse of the number of documents where the word appears.

53

classify a subset of 150 repositories. We then use the KNN algorithm to classify the rest of

the repositories using the 150 classified repositories as training data. The KNN algorithm

consists of finding the k (in this case 5) most similar repositories in the training data and

selecting the most common category to classify any new repository (each of the 5 similar

repositories “votes” on a category for the new repository).

In order to accurately classify the large number of repositories we have available, we

iteratively expand the training set applying the following steps:

1. Using the current training set, classify the rest of the repositories using KNN;

2. Retain only the repositories with the top N most confident classifications, and manually

review them;

3. Add these repositories to the training set, and repeat the procedure until all repositories

have been classified.

We increase the number N throughout the procedure as the training set grows, and with

it, the accuracy of KNN. We calculate confidence using the number of “votes”: if all 5 similar

repositories “agree” on a category, then confidence is 100%. Conversely, if only 2 out of the 5

similar repositories “agree”, then the confidence is low.12

12Due to the high number of categories, the “voting” process was weighted by distance, meaning votes fromvery similar documents counted more than votes from less similar documents.

54

Beefing IT up for Your Investor? Open Sourcing and Startup ...

Documents

Transcript of Beefing IT up for Your Investor? Open Sourcing and Startup ...