Disrupting with Data: Lessons from Silicon Valley

Data-Driven Disruption: Lessons from Silicon Valley Anand Rajaraman

The Rise of Data Driven Disruption

50-fold Growth from 2010 to 2020
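
For scale, the 50-fold figure can be turned into an implied yearly rate. The short calculation below assumes steady compound growth over the decade, which the slide does not state; it comes out a little under 48% per year.

```python
# Implied compound annual growth for a 50-fold increase over ten years.
# Assumption (not from the slide): growth was steady year over year.
growth_factor = 50       # from the slide: 50-fold growth
years = 2020 - 2010      # from the slide: 2010 to 2020

annual_multiplier = growth_factor ** (1 / years)
annual_growth_rate = annual_multiplier - 1

print(f"Implied annual growth: {annual_growth_rate:.1%}")
```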

2014: More bits in the digital universe than stars in the physical universe

Sources of Data

• The world creates 1.7MB of data per minute per person

Source: The Digital Universe, IDC Report, 2014

Data-Driven Applications

Talk outline
• The evolution of data-driven applications
  • 5 generations
• Lessons and Opportunities
  • From the intersection of startups, venture capital, and research
• Key theme: Disruption vs Optimization
• Conclusion

THE EVOLUTION OF DATA-DRIVEN APPS

Follow the Data!
• Value-creation has followed the most valuable data sources available!
• 5 overlapping generations

Data driven apps: The First Generation

• All about leveraging private, structured data assets for competitive advantage
  • E.g., Sales, inventory, payroll, …

Data-driven apps: The Second Generation
• Harnessing the power of public data

Data-Driven Apps: The Third Generation

• Leveraging the power of “semi-public” Social + Mobile Data
  • Personal data shared in a frictionless manner with user’s consent

Third Generation Examples

Data-driven apps: The Fourth Generation

• Combining public, semi-public, and private data

4G Example: Paysa

• Am I being compensated fairly?
• 2012 Stanford CS grad
• Java, C++, Ruby, and Machine Learning
• Software Eng II at Google

4G Example: Paysa

• Salaries: 35M+ salary datapoints
• Companies: 500k+ companies
• People: Professional DNA of 15M tech employees
• Jobs: Millions of job postings updated daily

Data sources (spanning private and public): Local/National Government Databases, Partnerships (e.g., Udacity), Recruiters, Companies, Web Crawl, Social Media

The Fifth Generation: Just add AI!

• Companies generate massive amounts of training data
• New class of proprietary data

The Fifth Generation

Fifth Generation Examples

Summary: Follow the Data!

LESSONS AND OPPORTUNITIES

Lessons and Opportunities
1. Startup and Investment Landscape
2. Disruption vs Optimization
3. Human-Machine Collaboration
4. Rise of the Cyborg
5. The Data is not a Given

Lessons and Opportunities: 1. Startup and Investment Landscape

3 broad categories: Infrastructure, Analytics, Intelligent Applications

Infrastructure
• Accessed primarily by developers

Analytics
• Data exploration and modeling for data scientists and business people

Vertical Analytics: Cuberon

The “Why?” Question

• Why are signups down this week?
• Why did this marketing campaign do so well?
• Why did this A/B test not perform?

Consumer Behavior Analytics: Cuberon

Build data cube

Identify anomalous subcubes
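
The two steps on this slide can be illustrated with a small sketch. It is not Cuberon's actual method, which the slides do not describe: here the "cube" is simply a count of events for every combination of dimension values, and a subcube is called anomalous when its count changes by more than a relative threshold between two periods. The function names, the `min_count` and `threshold` parameters, and the signup data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_cube(rows, dims):
    """Count rows in every subcube, i.e. every combination of dimension values."""
    cube = Counter()
    for row in rows:
        for k in range(len(dims) + 1):
            for subset in combinations(dims, k):
                key = tuple((d, row[d]) for d in subset)
                cube[key] += 1
    return cube


def anomalous_subcubes(baseline, current, min_count=5, threshold=0.5):
    """Return subcubes whose count moved by more than `threshold` (relative change)."""
    flagged = []
    for key, base in baseline.items():
        if base < min_count:
            continue  # too little data to judge
        now = current.get(key, 0)
        change = (now - base) / base
        if abs(change) > threshold:
            flagged.append((key, base, now, change))
    # Largest relative change first.
    return sorted(flagged, key=lambda item: abs(item[3]), reverse=True)


def signups(platform, country, n):
    """Hypothetical helper: n identical signup events."""
    return [{"platform": platform, "country": country}] * n


dims = ("platform", "country")
last_week = (signups("ios", "US", 10) + signups("android", "US", 10)
             + signups("ios", "IN", 10) + signups("android", "IN", 10))
this_week = (signups("ios", "US", 10) + signups("android", "US", 10)
             + signups("ios", "IN", 10) + signups("android", "IN", 2))

flagged = anomalous_subcubes(build_cube(last_week, dims), build_cube(this_week, dims))
for key, base, now, change in flagged:
    print(key, base, now, f"{change:+.0%}")
```

In this toy data, overall signups fall by 20%, but only the Android/India subcube crosses the threshold, which is the kind of answer the "Why are signups down this week?" question is after.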

Intelligent Applications

Source: Matt Turck, Jim Hao & FirstMark Capital

More Intelligent Applications…

Source: Matt Turck, Jim Hao & FirstMark Capital

Intelligent App Example: Descartes Labs

Another example: Zillow

Trends and Takeaways
• Infrastructure is available and solid
• Major transition from Hadoop to Spark
• Investment focus on “Vertical” analytics plays
  • e.g., Cuberon, Ayasdi
• The Age of the Intelligent App has dawned
  • Major opportunities and investment dollars flowing here!
  • e.g., Troo.ly, Descartes Labs, DocsApp

Lessons and Opportunities: 2. Disruption vs Optimization

Data-driven Optimization

Source: EMC, Understanding Data Lakes

Data-driven Disruption

Beware the Hippo

HiPPO = Highest Paid Person’s Opinion

Why does disruption happen?
• Data scientist as advisor, not decision maker
• Domain expertise and experience often win out over data
• Data-driven approach enables a completely different business model
  • E.g., A la carte streaming vs fixed number of channels
  • Cannibalization concerns
• Fear of making mistakes
  • Algorithms can make mistakes
  • But algorithms can learn and improve much faster with data!

Why does disruption happen?
• Classic Innovator’s Dilemma with a turbo-boost: data network effects
• Accelerates the pace of disruption

Disruption Example: Venture Capital
• Venture Capital has been an established industry for several decades
• Process has not changed much since early days
• VC firms expect entrepreneurs to approach them with pitches
• Some VC firms have tried using data
  • Data scientists in advisory role
  • Not partners who make investment decisions
• High concentration in Silicon Valley
  • And a few other places…

Sets the stage for…

rocketship.vc
Venture Investing through Data Science

More Global Startups

Reduced costs to launch a startup

Large consolidating markets; smartphone ubiquity

Emerging Market Opportunities

Untapped talent pools

Beyond Human Scale

2.1 Million “Startups”

115K need funding at any time

90% outside Silicon Valley

12.8 Million Companies

Why Data-Driven? Geography

[Chart: Number of $B companies by year, 2003 to 2014, Silicon Valley vs Outside Silicon Valley; y-axis: Count, 0 to 100]

The Company Model

Company Model: Traction, Team, Market, Competition, Customer Feedback

Business Model Innovation
• Proactively identify interesting companies and reach out to them at the appropriate moment

[Pie chart: shares by region]
• South America: 9%
• East Europe: 11%
• China: 13%
• India: 7%
• Other East Asia: 11%
• Other Europe: 5%
• Other North America: 7%
• US SF: 11%
• US Other: 22%
• Unknown: 4%

Optimize or Disrupt?
• Key question for every entrepreneur (and researcher too!)
  • Often difference between success and failure
• Hard to answer in general, but look out for disruption cues
  • Established, fragmented industry
  • Slow to adopt latest technology trend
  • Asset-heavy models
• Risk/reward tradeoff
  • Disruption is much riskier but the rewards compensate

Lessons and Opportunities: 3. Human-Machine Collaboration

Current view of Human-Machine Collaboration

Source: 10Clouds Blog

But what about…

rocketship.vc

Peripheral Vision
• To make optimal decisions, humans must provide “peripheral vision” to model
• Is this data point an outlier or does it fit the model?
  • e.g., Geo or category in VC
• Is there bias in the model?
  • e.g., historical racial gap in sentencing and parole decisions
• Has the world changed in a way that invalidates the assumption of the model?
  • e.g., flash crash on Wall Street

The Problem
• Must judges, policemen, doctors, bureaucrats understand the nuances of the data and the model?
• Even trickier when we consider complex workflows involving multiple decision makers
  • e.g., a drug trial

The Opportunity
• Systems that include humans and models as peers
  • Can also be complex workflows that involve many humans and models
• How best to structure such systems to produce optimal decisions?
  • Model might need to be tuned to work with specific human
• Model Invalidation
  • Can models know when they are no longer valid?
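
One very simple way to approach the model-invalidation question is to have the model watch its own recent error and raise a flag when that error drifts well above what was seen at training time. The sketch below is only an illustration of that idea, not a method from the talk; the class name, the `window` and `tolerance` parameters, and the numbers are hypothetical.

```python
from collections import deque


class ValidityMonitor:
    """Flags a model as possibly invalid when recent error exceeds a baseline."""

    def __init__(self, baseline_error, window=50, tolerance=2.0):
        self.baseline_error = baseline_error  # mean absolute error at training time
        self.window = window                  # number of recent outcomes to keep
        self.tolerance = tolerance            # allowed multiple of the baseline
        self.errors = deque(maxlen=window)

    def record(self, predicted, actual):
        """Store the absolute error of one prediction once the outcome is known."""
        self.errors.append(abs(predicted - actual))

    def is_valid(self):
        """Return False when the rolling mean error is above tolerance * baseline."""
        if len(self.errors) < self.window:
            return True  # not enough evidence yet
        recent_error = sum(self.errors) / len(self.errors)
        return recent_error <= self.tolerance * self.baseline_error


monitor = ValidityMonitor(baseline_error=1.0, window=50, tolerance=2.0)

# The world behaves as it did at training time: errors stay near the baseline.
for _ in range(50):
    monitor.record(predicted=10.0, actual=11.0)
valid_before = monitor.is_valid()

# The world changes: errors jump, and the flag should drop so a human can step in.
for _ in range(50):
    monitor.record(predicted=10.0, actual=15.0)
valid_after = monitor.is_valid()

print(valid_before, valid_after)
```

A check like this only catches changes that show up as prediction error; bias in the labels themselves would pass unnoticed, which is where human "peripheral vision" still matters.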

Is it time to disrupt Mechanical Turk?
• The world has changed a lot since Mechanical Turk was introduced in 2005
• Can we move closer to true hybrid human-machine computing?
  • Harness both human initiative and computing power
  • Harness sensors in phones
• Reimagine problems, tasks and incentives

Lessons and Opportunities: 4. Rise of the Cyborg

Data-driven software all around us…

The Agency Problem
• Each model is optimized for the good of the company that owns it
• Often our goals and the company’s goals are in alignment but not always!

Problems
• Privacy
  • Everyone has your data and is modeling your actions
• Pricing and Discovery disadvantage
  • You discover only what they choose to show you
• You are not a population
  • Each service models its population of users
  • And is optimizing for its own ends
• Would you rather be explored or exploited?

We have helped create this situation

vs

Wooden weapons against guns and steel

Image: Conquistadors and Incas, painting by John Everett Millais

Or if you prefer…

Image: South Park

Enter the Cyborg

Cyborg Layer mediates interactions

Cyborg Layer Services
• Privacy protection
  • e.g., using Differential Privacy techniques
  • Or by strategically spreading interactions across services
  • e.g., watch some movies on Netflix and some on Amazon
• Discovery and Pricing
  • Looks at a larger selection and picks items for you
  • Acts strictly as your agent; no conflict
• Combine personal and population models
  • Cyborg has complete access to all my data
  • External services have population data, but only limited window
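
As an illustration of the differential-privacy bullet, here is a minimal sketch of the standard Laplace mechanism applied to a count query: the cyborg layer reports a noisy count to an external service instead of the exact one. The talk does not specify a mechanism, so this choice, the function name `private_count`, and the viewing-history data are assumptions made for the example.

```python
import random


def private_count(values, predicate, epsilon, rng):
    """Count matching items and add Laplace noise with scale 1/epsilon.

    A count query changes by at most 1 when one item is added or removed
    (sensitivity 1), so noise of scale 1/epsilon gives epsilon-differential
    privacy for this single query.
    """
    true_count = sum(1 for v in values if predicate(v))
    # The difference of two independent exponentials with rate epsilon is
    # Laplace-distributed with mean 0 and scale 1/epsilon.
    noise = rng.expovariate(epsilon) - rng.expovariate(epsilon)
    return true_count + noise


rng = random.Random(0)  # seeded only to make the example repeatable
history = ["drama", "comedy", "drama", "thriller", "drama"]  # hypothetical data

noisy = private_count(history, lambda genre: genre == "drama", epsilon=1.0, rng=rng)
print(f"Reported drama count: {noisy:.2f}")
```

A smaller `epsilon` means more noise and stronger privacy. Each additional query spends more of the privacy budget, which this sketch does not track.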

Combining Personal and Population Models

Lessons and Opportunities: 5. The Data is not a Given

How to build a Model: Conventional View

• Use ground truth to build the best model possible
  • Feature engineering + model selection
  • Maybe some data cleaning and integration

Example: Troo.ly

2005: TRANSACTIONS

2015: EXPERIENCES

Need for online trust has grown dramatically!

Would you rent your house to this stranger?

Troo.ly Problem Statement

WHAT WE ARE GIVEN: KNOWN BAD, KNOWN GOOD, NOT KNOWN

Can you trust the ground truth?

• Bad users might have a good label if they haven’t engaged in bad activity yet
• Labels may be incorrect if they are coming from bad internal models
• Labels may be incorrect because of wrong attributions in bad transactions

Rocketship.vc: company data

• How to trade off data sources based on Coverage, Accuracy, Depth, Freshness, and Cost?
• Which subset of data sources yields the best model?
• Which subset of data sources will identify promising companies most quickly?
• Promising start
  • Dong et al, VLDB 2012
  • Rekatsinas et al, SIGMOD 2014
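
To make the source-selection question concrete, the sketch below uses a simple greedy heuristic: repeatedly add the source with the best newly-covered-companies-per-cost ratio until the budget runs out. It looks at Coverage and Cost only and ignores Accuracy, Depth, and Freshness; it is not the algorithm from the cited papers, and the source names, costs, and company ids are hypothetical.

```python
def select_sources(sources, budget):
    """Greedily pick sources by marginal coverage gained per unit cost.

    sources: dict mapping source name -> (set of covered company ids, cost > 0)
    budget:  total cost allowed
    """
    chosen, covered, spent = [], set(), 0.0
    remaining = dict(sources)
    while remaining:
        best_name, best_ratio = None, 0.0
        for name, (items, cost) in remaining.items():
            if spent + cost > budget:
                continue  # cannot afford this source
            gain = len(items - covered)  # companies not yet covered
            ratio = gain / cost
            if ratio > best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            break  # nothing affordable adds coverage
        items, cost = remaining.pop(best_name)
        chosen.append(best_name)
        covered |= items
        spent += cost
    return chosen, covered, spent


sources = {
    "web_crawl": ({"a", "b", "c", "d"}, 2.0),
    "gov_db": ({"c", "d", "e"}, 1.0),
    "paid_feed": ({"a", "b", "c", "d", "e", "f"}, 10.0),
}
chosen, covered, spent = select_sources(sources, budget=4.0)
print(chosen, sorted(covered), spent)
```

With this budget the two cheap sources are selected and the paid feed is skipped, at the price of missing one company. That is the quality/cost tradeoff in miniature.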

Algorithmic Law Enforcement

Source: The Economist, August 20, 2016

But what about perpetuating bias against minorities?

Summary
• Cannot trust the given data completely
  • Ground truth is often neither true nor grounded
  • Data may have bias
• Look for additional data that can improve model
  • Quality/cost tradeoff?
• Generate your own training data!
  • E.g., Polarr photo-editing app
  • Data Programming (Ratner et al, 2016)
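
The idea behind generating your own training data can be sketched with a few hand-written labeling functions whose votes are combined into a weak label. Data programming as described by Ratner et al. learns how much to trust each function with a generative model; the plain majority vote below is a deliberate simplification. The labeling functions, field names, and users are hypothetical, loosely modeled on the online-trust example.

```python
GOOD, BAD, ABSTAIN = 1, -1, 0


def lf_many_chargebacks(user):
    """Heuristic: repeated chargebacks suggest a bad actor."""
    return BAD if user["chargebacks"] >= 2 else ABSTAIN


def lf_verified_id(user):
    """Heuristic: a verified identity suggests a good actor."""
    return GOOD if user["verified_id"] else ABSTAIN


def lf_new_account_large_order(user):
    """Heuristic: a brand-new account placing a large order is suspicious."""
    if user["account_age_days"] < 7 and user["order_value"] > 1000:
        return BAD
    return ABSTAIN


LABELING_FUNCTIONS = [lf_many_chargebacks, lf_verified_id, lf_new_account_large_order]


def weak_label(user, labeling_functions=LABELING_FUNCTIONS):
    """Majority vote over non-abstaining labeling functions; ties abstain."""
    total = sum(lf(user) for lf in labeling_functions)
    if total > 0:
        return GOOD
    if total < 0:
        return BAD
    return ABSTAIN


users = [
    {"chargebacks": 0, "verified_id": True, "account_age_days": 400, "order_value": 50},
    {"chargebacks": 3, "verified_id": False, "account_age_days": 2, "order_value": 5000},
    {"chargebacks": 2, "verified_id": True, "account_age_days": 90, "order_value": 20},
]
labels = [weak_label(u) for u in users]
print(labels)
```

The third user shows why ties matter: one function votes bad and one votes good, so the sketch abstains rather than guess. The resulting weak labels would then be used to train a model, which is outside this example.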

CONCLUSION

Summary
• 5 generations of data-driven applications
• Lessons and Opportunities
  1. The Age of the Intelligent App
  2. Disruption vs Optimization
  3. Human-Machine Collaboration
  4. Rise of the Cyborg
  5. The Data is not a Given

Identity Crisis?

Data Management, Semantic Web, Machine Learning, Data Mining, Information Retrieval, AI, Systems

Panel at NorCal DB Day, 2016

Marketing Myopia

Source: Marketing Myopia, Theodore Levitt. Harvard Business Review, 1960

Data impacts every human endeavor

[Diagram: Data at the center, surrounded by Entertainment, Transportation, Government, Manufacturing, Sciences, Education, Security, Commerce]

Data + X
• Core identity of the field is to create value from data
  • Never a better time for it!
• Data is now a key part of every field of human endeavor
  • Stanford CS+X
• The value of being an outsider

Go Forth And Disrupt!

ANNOUNCEMENT

IIT Madras CS Visiting Chair Program
• Focus area: data-driven approaches to tackle important problems
• Leading faculty/researchers from around the world welcome!
• Flexible time commitment
  • Minimum 2 weeks
• Endowed by Venky Harinarayan and Anand Rajaraman

Confirmed Visiting Chairs so far…

• Jeff Ullman, Professor Emeritus, CS, Stanford
• Randy Katz, Distinguished Professor, EECS, UC Berkeley
• Hari Balakrishnan, Professor, EECS, MIT

For more information

[email protected]

Prof. Nagarajan

Thanks!

Anand Rajaraman

[email protected]

@anand_raj