DSSG Speaker Series: Paco Nathan


description

An invited talk by Paco Nathan in the speaker series at the University of Chicago's Data Science for Social Good fellowship (2013-08-12): http://dssg.io/2013/05/21/the-fellowship-and-the-fellows.html

Learnings generalized from trends in Data Science: a 30-year retrospective on Machine Learning, a 10-year summary of Leading Data Science Teams, and a 2-year survey of Enterprise Use Cases.

http://www.eventbrite.com/event/7476758185

Transcript of DSSG Speaker Series: Paco Nathan

Page 1: DSSG Speaker Series: Paco Nathan

DSSG Speaker Series, 2013-08-12:

Learnings generalized from trends in Data Science:

a 30-year retrospective on Machine Learning,

a 10-year summary of Leading Data Science Teams,

and a 2-year survey of Enterprise Use Cases

Paco Nathan @pacoid – Chief Scientist, Mesosphere

1

Page 2: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

DSSG, 2013-08-12

Page 3: DSSG Speaker Series: Paco Nathan

employing a mode of thought which includes both logical and analytical reasoning: evaluating the whole of a problem, as well as its component parts; attempting to assess the effects of changing one or more variables

this approach attempts to understand not just problems and solutions, but also the processes involved and their variances

particularly valuable in Big Data work when combined with hands-on experience in physics – roughly 50% of my peers come from physics or physical engineering…

programmers typically don’t think this way… however, both systems engineers and data scientists must

Statistical Thinking

[Diagram: Statistical Thinking at the intersection of Process, Variation, Data, and Tools]

3

Page 4: DSSG Speaker Series: Paco Nathan

Modeling

back in the day, we worked with practices based on data modeling

1. sample the data

2. fit the sample to a known distribution

3. ignore the rest of the data

4. infer, based on that fitted distribution

that served well with ONE computer, ONE analyst, ONE model… just throw away annoying “extra” data
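a minimal sketch of that recipe, in plain Java with invented numbers (not from the talk): draw a sample, fit a Gaussian, ignore the rest, then infer from the fitted parameters

import java.util.Arrays;
import java.util.Random;

public class FitSampleSketch {
  public static void main( String[] args ) {
    Random rng = new Random( 1 );

    // pretend this is "all" the data
    double[] population = new double[ 1000000 ];
    for (int i = 0; i < population.length; i++)
      population[ i ] = 40.0 + 10.0 * rng.nextGaussian();

    // 1. sample the data (3. the rest gets ignored)
    double[] sample = Arrays.copyOf( population, 1000 );

    // 2. fit the sample to a known distribution – here a Gaussian (mean, stddev)
    double mean = Arrays.stream( sample ).average().orElse( 0.0 );
    double variance = Arrays.stream( sample ).map( x -> (x - mean) * (x - mean) ).average().orElse( 0.0 );
    double stddev = Math.sqrt( variance );

    // 4. infer from the fitted distribution, e.g., a rough 95% interval
    System.out.printf( "fitted N(%.2f, %.2f); ~95%% of values expected in [%.2f, %.2f]%n",
      mean, stddev, mean - 2.0 * stddev, mean + 2.0 * stddev );
  }
}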

circa late 1990s: machine data, aggregation, clusters, etc. – algorithmic modeling displaced the prior practices of data modeling

because the data won’t fit on one computer anymore

4

Page 5: DSSG Speaker Series: Paco Nathan

Two Cultures

“A new research community using these tools sprang up. Their goal was predictive accuracy. The community consisted of young computer scientists, physicists and engineers plus a few aging statisticians. They began using the new tools in working on complex prediction problems where it was obvious that data models were not applicable: speech recognition, image recognition, nonlinear time series prediction, handwriting recognition, prediction in financial markets.”

Statistical Modeling: The Two Cultures – Leo Breiman, 2001 – bit.ly/eUTh9L

chronicled a sea change from data modeling (silos, manual process) to the rising use of algorithmic modeling (machine data for automation/optimization) which led in turn to the practice of leveraging inter-disciplinary teams

5

Page 6: DSSG Speaker Series: Paco Nathan

approximately 80% of the costs for data-related projects get spent on data preparation – mostly on cleaning up data quality issues (ETL, log files, etc.), generally by socializing the problem

unfortunately, data-related budgets tend to go into frameworks that can only be used after clean up

most valuable skills:

‣ learn to use programmable tools that prepare data (see the sketch after the figure below)

‣ learn to understand the audience and their priorities

‣ learn to socialize the problems, knocking down silos

‣ learn to generate compelling data visualizations

‣ learn to estimate the confidence for reported results

‣ learn to automate work, making process repeatable

What is needed most?

[Figure: word cloud of raw log event types, e.g., “Unique Registration”, “Launched games lobby”, “NUI:TutorialMode”, “Feed Pet”, “Chat Now”, “Website Login”, “Customer Made Purchase Cart Page Step 2”, “Address space remaining: 512M”, etc.]
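as a tiny illustration of “programmable tools that prepare data” applied to raw events like the ones above – a Java sketch; the tab-delimited record layout and the sample values are invented for illustration

import java.util.Arrays;
import java.util.List;

public class LogScrub {
  // keep only rows shaped like: epoch-seconds <tab> user-id <tab> event-name
  static boolean wellFormed( String[] fields ) {
    if (fields.length != 3 || fields[ 2 ].trim().isEmpty()) return false;
    try { Long.parseLong( fields[ 0 ] ); } catch (NumberFormatException e) { return false; }
    return true;
  }

  public static void main( String[] args ) {
    List<String> raw = Arrays.asList(
      "1376300000\tu42\tNUI:ChatMode",
      "1376300001\tu17\tFeed Pet",
      "garbage line with no tabs",
      "1376300002\tu42\t" );  // missing event name – gets dropped

    raw.stream()
       .map( line -> line.split( "\t" ) )
       .filter( LogScrub::wellFormed )
       .forEach( f -> System.out.println( f[ 0 ] + "," + f[ 1 ] + "," + f[ 2 ].trim() ) );
  }
}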

6

Page 7: DSSG Speaker Series: Paco Nathan

Team Process = Needs

• discovery – help people ask the right questions

• modeling – allow automation to place informed bets

• integration – build smarts into product features

• apps – deliver data products at scale to LOB (line of business) end uses

• systems – keep infrastructure running, cost-effective

these needs span a spectrum from analysts to engineers, tied together by inter-disciplinary leadership

7

Page 8: DSSG Speaker Series: Paco Nathan

Team Composition = Roles

Domain Expert – business process, stakeholder

Data Scientist – data prep, discovery, modeling, etc.

App Dev – software engineering, automation

Ops – systems engineering, availability

introduced capability: data science

leverage non-traditional pairing among roles, to complement skills and tear down silos

8

Page 9: DSSG Speaker Series: Paco Nathan

Team Composition = Needs × Roles

[Matrix: the needs (discovery, modeling, integration, apps, systems) crossed with the roles (Domain Expert, Data Scientist, App Dev, Ops) and their responsibilities from the previous slide]

9

Page 11: DSSG Speaker Series: Paco Nathan

Learning Curves

difficulties in the commercial use of distributed systems often get represented as issues of managing complexity

much of the risk in managing a data science team is about budgeting for learning curve: some orgs practice a kind of engineering “conservatism”, with highly structured process and strictly codified practices – people learn a few things well, then avoid having to struggle with learning many new things perpetually…

that anti-pattern leads to big teams, low ROI

[Chart: complexity vs. scale]

ultimately, the challenge is about managing learning curves within a social context

11

Page 12: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

DSSG, 2013-08-12

Page 13: DSSG Speaker Series: Paco Nathan

Business Disruption through Data

Geoffrey Moore – Mohr Davidow Ventures, author of Crossing the Chasm – @ Hadoop Summit, 2012: what Amazon did to the retail sector… has put the entire Global 1000 on notice over the next decade… data as the major force… mostly through apps – verticals, leveraging domain expertise

Michael Stonebraker – INGRES, PostgreSQL, Vertica, VoltDB, Paradigm4, etc. – @ XLDB, 2012: complex analytics workloads are now displacing SQL as the basis for Enterprise apps

13

Page 14: DSSG Speaker Series: Paco Nathan

Data Categories

Three broad categories of data – Curt Monash, 2010

dbms2.com/2010/01/17/three-broad-categories-of-data

• Human/Tabular data – human-generated data which fits into tables/arrays

• Human/Nontabular data – all other data generated by humans

• Machine-Generated data

let’s now add other useful distinctions:

• Open Data

• Curated Metadata

• A/D conversion for sensors (IoT)

14

Page 15: DSSG Speaker Series: Paco Nathan

Open Data notes

successful apps incorporate three components:

• Big Data (consumer interest, personalization)

• Open Data (monetizing public data)

• Curated Metadata

most of the largest Cascading deployments leverage some Open Data components: Climate Corp, Factual, Nokia, etc.

consider buildingeye.com, aggregate building permits:

• pricing data for home owners looking to remodel

• sales data for contractors

• imagine joining data with building inspection history, for better insights about properties for sale…

research notes about Open Data use cases: goo.gl/cd995T

15

Page 16: DSSG Speaker Series: Paco Nathan

Trends in Public Administration

late 1880s – late 1920s (Woodrow Wilson): as hierarchy, bureaucracy → only for the most educated, elite

late 1920s – late 1930s: as a business, relying on the “Scientific Method”, gov as a process

late 1930s – late 1940s (Robert Dahl): relationships, behavioral-based → policy not separate from politics

late 1940s – 1980s: yet another form of management → less “command and control”

1980s – 1990s (David Osborne, Ted Gaebler): New Public Management → service efficiency, more private sector

1990s – present (Janet & Robert Denhardt): Digital Age → transparency, citizen-based “debugging”, bankruptcies

Adapted from: The Roles, Actors, and Norms Necessary to Institutionalize Sustainable Collaborative Governance – Peter Pirnejad, USC Price School of Public Policy, 2013-05-02

Drivers, circa 2013

• governments have run out of money, cannot increase staff and services

• better data infra at scale (cloud, OSS, etc.)

• machine learning techniques to monetize

• viable ecosystem for data products, APIs

• mobile devices enabling use cases

16

Page 17: DSSG Speaker Series: Paco Nathan

Open Data ecosystem

• municipal departments – e.g., Palo Alto, Chicago, DC, etc.

• publishing platforms – e.g., Junar, Socrata, etc.

• aggregators – e.g., OpenStreetMap, WalkScore, etc.

• data product vendors – e.g., Factual, Marinexplore, etc.

• end use cases – e.g., Facebook, Climate, etc.

Data feeds structured for public-private partnerships

17

Page 18: DSSG Speaker Series: Paco Nathan

Open Data ecosystem – caveats for agencies


Required Focus

• respond to viable use cases

• not budgeting hackathons

18

Page 19: DSSG Speaker Series: Paco Nathan

Open Data ecosystem – caveats for publishers


Required Focus

• surface the metadata

• curate, allowing for joins/aggregation

• not scans as PDFs

19

Page 20: DSSG Speaker Series: Paco Nathan

Open Data ecosystem – caveats for aggregators


Required Focus

• make APIs consumable by automation

• allow for probabilistic usage

• not OSS licensing for data

20

Page 21: DSSG Speaker Series: Paco Nathan

Open Data ecosystem – caveats for data vendors


Required Focus

• supply actionable data

• track data provenance carefully

• provide feedback upstream, i.e., cleaned data at source

• focus on core verticals

21

Page 22: DSSG Speaker Series: Paco Nathan

Open Data ecosystem – caveats for end uses


Required Focus

• address consumer needs

• identify community benefits of the data

22

Page 23: DSSG Speaker Series: Paco Nathan

algorithmic modeling + machine data (Big Data) + curation, metadata + Open Data

⇒ data products, as feedback into automation

⇒ evolution of feedback loops

less about “bigness”, more about complexity

internet of things + A/D conversion + more complex analytics ⇒ accelerated evolution, additional feedback loops

⇒ orders of magnitude higher data rates

Recipes for Success

“A kind of Cambrian explosion” – source: National Geographic

23

Page 24: DSSG Speaker Series: Paco Nathan

Internet of Things

24

Page 25: DSSG Speaker Series: Paco Nathan

Trendlines

Big Data? we’re just getting started:

• ~12 exabytes/day, jet turbines on commercial flights

• Google self-driving cars, ~1 Gb/s per vehicle

• National Instruments initiative: Big Analog Data™

• 1 m resolution satellites – skyboximaging.com

• open resource monitoring – reddmetrics.com

• Sensing XChallenge – nokiasensingxchallenge.org

consider the implications of Jawbone, Nike, etc., plus the effects of Google Glass

7+ billion people, instrumented better than … how we have Nagios instrumenting our web servers right now
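for a sense of scale (my arithmetic, not from the slide): a sustained 1 Gb/s works out to 1 Gb/s × 86,400 s/day = 86,400 Gb ≈ 10.8 terabytes per day, per vehicle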

technologyreview.com/...

25

Page 26: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

DSSG, 2013-08-12

Page 27: DSSG Speaker Series: Paco Nathan

in general, apps alternate between learning patterns/rules and retrieving similar things…

machine learning – scalable, arguably quite ad-hoc, generally “black box” solutions, enabling you to make billion dollar mistakes, with oh so much commercial emphasis (i.e., the “heavy lifting”)

statistics – rigorous, much slower to evolve, confidence and rationale become transparent, preventing you from making billion dollar mistakes; any good commercial project has ample stats work used in QA (i.e., “CYA, cover your analysis”)

once Big Data projects get beyond merely digesting log files, optimization will likely become the next overused buzzword :)

Learning Theory

27

Page 28: DSSG Speaker Series: Paco Nathan

Generalizations about Machine Learning…

great introduction to ML, plus a proposed categorization for comparing different machine learning approaches:

A Few Useful Things to Know about Machine Learning – Pedro Domingos, U Washington – homes.cs.washington.edu/~pedrod/papers/cacm12.pdf

toward a categorization for Machine Learning algorithms:

• representation: classifier must be represented in some formal language that computers can handle (algorithms, data structures, etc.)

• evaluation: evaluation function (objective function, scoring function) is needed to distinguish good classifiers from bad ones

• optimization: method to search among the classifiers in the language for the highest-scoring one
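a toy sketch to make those three components concrete (my example, not Domingos’): a perceptron classifier in plain Java, where the representation is a weight vector, the evaluation function is training accuracy, and the optimization is the perceptron update rule

public class PerceptronSketch {
  // representation: a weight vector (w[0], w[1]) plus a bias term w[2]
  static double[] w = new double[ 3 ];

  static int predict( double[] x ) {
    double score = w[ 0 ] * x[ 0 ] + w[ 1 ] * x[ 1 ] + w[ 2 ];
    return score >= 0.0 ? 1 : -1;
  }

  // evaluation: fraction of training examples classified correctly
  static double accuracy( double[][] xs, int[] ys ) {
    int correct = 0;
    for (int i = 0; i < xs.length; i++)
      if (predict( xs[ i ] ) == ys[ i ]) correct++;
    return (double) correct / xs.length;
  }

  public static void main( String[] args ) {
    // a tiny, linearly separable toy set: label = sign(x0 - x1)
    double[][] xs = { {2, 1}, {3, 0}, {1, 2}, {0, 3}, {4, 1}, {1, 4} };
    int[] ys = { 1, 1, -1, -1, 1, -1 };

    // optimization: perceptron updates, i.e., searching for a high-scoring classifier
    for (int epoch = 0; epoch < 20; epoch++)
      for (int i = 0; i < xs.length; i++)
        if (predict( xs[ i ] ) != ys[ i ]) {
          w[ 0 ] += ys[ i ] * xs[ i ][ 0 ];
          w[ 1 ] += ys[ i ] * xs[ i ][ 1 ];
          w[ 2 ] += ys[ i ];
        }

    System.out.println( "training accuracy: " + accuracy( xs, ys ) );
  }
}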

28

Page 29: DSSG Speaker Series: Paco Nathan

Something to consider about Algorithms…

many algorithm libraries used today are based on implementations from back when people used DO loops in FORTRAN, 30+ years ago

MapReduce is Good Enough? – Jimmy Lin, U Maryland – umiacs.umd.edu/~jimmylin/publications/Lin_BigData2013.pdf

astrophysics and genomics are light years ahead of e-commerce in terms of data rates and sophisticated algorithms; that work – as Breiman suggested in 2001 – may take a few years to percolate into industry

other game-changers:

• streaming algorithms, sketches, probabilistic data structures (see the sketch after this list)

• significant “Big O” complexity reduction (e.g., skytree.net)

• better architectures and topologies (e.g., GPUs and CUDA)

• partial aggregates – parallelizing workflows
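as one concrete example of a probabilistic data structure for streaming analytics – a minimal count-min sketch in plain Java (a sketch in both senses; the sizing, hashing, and event names are mine, not from the talk): it estimates event frequencies in bounded memory, over-counting slightly on collisions but never under-counting

import java.util.Random;

public class CountMinSketch {
  private final int depth, width;
  private final long[][] table;
  private final int[] seeds;

  public CountMinSketch( int depth, int width, long seed ) {
    this.depth = depth;
    this.width = width;
    this.table = new long[ depth ][ width ];
    this.seeds = new int[ depth ];
    Random rng = new Random( seed );
    for (int i = 0; i < depth; i++) seeds[ i ] = rng.nextInt();
  }

  private int bucket( int row, String item ) {
    int h = item.hashCode() ^ seeds[ row ];
    h ^= (h >>> 16);  // mix the bits a little
    return Math.floorMod( h, width );
  }

  public void add( String item, long count ) {
    for (int i = 0; i < depth; i++) table[ i ][ bucket( i, item ) ] += count;
  }

  // returns an estimate >= the true count; collisions can only inflate it
  public long estimate( String item ) {
    long best = Long.MAX_VALUE;
    for (int i = 0; i < depth; i++) best = Math.min( best, table[ i ][ bucket( i, item ) ] );
    return best;
  }

  public static void main( String[] args ) {
    CountMinSketch cms = new CountMinSketch( 5, 4096, 42L );
    for (int i = 0; i < 1000; i++) cms.add( "NUI:ChatMode", 1 );
    cms.add( "Website Login", 3 );
    System.out.println( cms.estimate( "NUI:ChatMode" ) );   // ~1000
    System.out.println( cms.estimate( "Website Login" ) );  // ~3
  }
}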

29

Page 30: DSSG Speaker Series: Paco Nathan

Make It Sparse…

also, take a moment to check this out… (and related work on sparse Cholesky, etc.)

QR factorization of a “tall-and-skinny” matrix

• used to solve many data problems at scale, e.g., PCA, SVD, etc.

• numerically stable with efficient implementation on large-scale Hadoop clusters

suppose that you have a sparse matrix of customer interactions where there are 100MM customers, with a limited set of outcomes…
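for context, the basic shape of that computation in standard notation (my summary, not quoted from the slides): a thin QR factorization of a tall-and-skinny A turns least-squares and PCA-style problems into small n × n work

\[ A \in \mathbb{R}^{m \times n},\; m \gg n: \qquad A = QR, \quad Q^{\mathsf{T}} Q = I, \quad R \in \mathbb{R}^{n \times n} \text{ upper triangular} \]
\[ \min_x \| A x - b \|_2 \;\Longrightarrow\; R x = Q^{\mathsf{T}} b, \qquad A^{\mathsf{T}} A = R^{\mathsf{T}} R \]
\[ \text{TSQR: } A = \begin{bmatrix} A_1 \\ \vdots \\ A_k \end{bmatrix}, \quad A_i = Q_i R_i \;\text{(map)}, \qquad \begin{bmatrix} R_1 \\ \vdots \\ R_k \end{bmatrix} = \tilde{Q} R \;\text{(reduce)} \]

the reduce side only ever touches small n × n factors, which is what makes the large-scale Hadoop implementation practical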

cs.purdue.edu/homes/dgleich

stanford.edu/~arbenson

github.com/ccsevers/scalding-linalg

David Gleich, slideshare.net/dgleich

30

Page 31: DSSG Speaker Series: Paco Nathan

Sparse Matrix Collection

for those times when you really, really need a wide variety of sparse matrix examples…

University of Florida Sparse Matrix Collection – cise.ufl.edu/research/sparse/matrices/

Tim Davis, U Florida – cise.ufl.edu/~davis/welcome.html

Yifan Hu, AT&T Research – www2.research.att.com/~yifanhu/

31

Page 32: DSSG Speaker Series: Paco Nathan

A Winning Approach…

consider that if you know priors about a system, then you may be able to leverage low dimensional structure within high dimensional data… what impact does that have on sampling rates?

1. real-world data ⇒

2. graph theory for representation ⇒

3. sparse matrix factorization for production work ⇒

4. cost-effective parallel processing for machine learning app at scale

32

Page 33: DSSG Speaker Series: Paco Nathan

Just Enough Mathematics?

having a solid background in statistics becomes vital, because it provides formalisms for what we’re trying to accomplish at scale

along with that, some areas of math help – regardless of the “calculus threshold” invoked at many universities…

linear algebra – e.g., calculating algorithms for large-scale apps efficiently

graph theory – e.g., representation of problems in a calculable language

abstract algebra – e.g., probabilistic data structures in streaming analytics

topology – e.g., determining the underlying structure of the data

operations research – e.g., techniques for optimization… in other words, ROI

33

Page 34: DSSG Speaker Series: Paco Nathan

ADMM: a general approach for optimizing learners

Distributed Optimization and Statistical Learning via the Alternating Direction Method of Multipliers – Stephen Boyd, Neal Parikh, et al., Stanford – stanford.edu/~boyd/papers/admm_distr_stats.html

“Throughout, the focus is on applications rather than theory, and a main goal is to provide the reader with a kind of ‘toolbox’ that can be applied in many situations to derive and implement a distributed algorithm of practical use. Though the focus here is on parallelism, the algorithm can also be used serially, and it is interesting to note that with no tuning, ADMM can be competitive with the best known methods for some problems.”

“While we have emphasized applications that can be concisely explained, the algorithm would also be a natural fit for more complicated problems in areas like graphical models. In addition, though our focus is on statistical learning problems, the algorithm is readily applicable in many other cases, such as in engineering design, multi-period portfolio optimization, time series analysis, network flow, or scheduling.”
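for reference, the scaled-form ADMM iterations from that paper, in standard notation (a summary, not a quote from the slides):

\[ \text{minimize } f(x) + g(z) \quad \text{subject to } A x + B z = c \]
\[ x^{k+1} = \arg\min_x \left( f(x) + \tfrac{\rho}{2} \left\| A x + B z^{k} - c + u^{k} \right\|_2^2 \right) \]
\[ z^{k+1} = \arg\min_z \left( g(z) + \tfrac{\rho}{2} \left\| A x^{k+1} + B z - c + u^{k} \right\|_2^2 \right) \]
\[ u^{k+1} = u^{k} + A x^{k+1} + B z^{k+1} - c \]

in the consensus form used for distributed learning, the x-update splits across data partitions and only the small z and u vectors are exchanged each round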

34

Page 35: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

DSSG, 2013-08-12

Page 36: DSSG Speaker Series: Paco Nathan

Enterprise Data Workflows

middleware for Big Data applications is evolving, with commercial examples that include:

• Cascading, Lingual, Pattern, etc. – Concurrent

• ParAccel Big Data Analytics Platform – Actian

• Anaconda, supporting IPython Notebook, Pandas, Augustus, etc. – Continuum Analytics

[Diagram: data sources → ETL / data prep → predictive model → end uses]

36

Page 37: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app

definition of a typical Enterprise workflow which crosses through multiple departments, languages, and technologies…


ANSI SQL for ETL

37

Page 38: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app

J2EE for business logic

38

Page 39: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app


SAS for predictive models

39

Page 40: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app

SAS for predictive models + ANSI SQL for ETL – most of the licensing costs…

40

Page 41: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app

J2EE for business logic – most of the project costs…

41

Page 42: DSSG Speaker Series: Paco Nathan

Anatomy of an Enterprise app

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

[Diagram: data sources → ETL / data prep → predictive model → end uses]

• Lingual: DW → ANSI SQL

• Pattern: SAS, R, etc. → PMML

• business logic in Java, Clojure, Scala, etc.

• source taps for Cassandra, JDBC, Splunk, etc.

• sink taps for Memcached, HBase, MongoDB, etc.

a compiler sees it all… one connected DAG:

• optimization

• troubleshooting

• exception handling

• notifications

cascading.org

42

Page 43: DSSG Speaker Series: Paco Nathan

a compiler sees it all…


FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );

cascading.org

43

Page 44: DSSG Speaker Series: Paco Nathan

a compiler sees it all…


FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );

44

Page 45: DSSG Speaker Series: Paco Nathan

Cascading – functional programming

Key insight: MapReduce is based on functional programming – going back to LISP in the 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

to ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages JVM and Java-based tools without any need to create new languages

• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters

Edgar Codd alluded to this (DSLs for structuring data) in his original paper about the relational model
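to make the “pipelines are functional” point concrete, a minimal sketch in plain Java 8 streams (single machine, hypothetical data – not from the talk): the same tokenize → group → count shape that the MapReduce and Cascading examples later in this deck express at cluster scale

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

public class PipelineSketch {
  public static void main( String[] args ) {
    List<String> docs = Arrays.asList( "rain shadow", "shadow of a rain cloud" );

    Map<String, Long> counts = docs.stream()
      .flatMap( doc -> Arrays.stream( doc.toLowerCase().split( "\\W+" ) ) )  // tokenize
      .filter( token -> !token.isEmpty() )                                   // scrub empty tokens
      .collect( Collectors.groupingBy( Function.identity(), Collectors.counting() ) ); // group + count

    counts.forEach( (token, n) -> System.out.println( token + "\t" + n ) );
  }
}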

45

Page 46: DSSG Speaker Series: Paco Nathan

Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010) – github.com/nathanmarz/cascalog/wiki

Scalding in Scala (2012) – github.com/twitter/scalding/wiki

Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology – Dan Woods, Forbes, 2013-04-17

forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

46

Page 47: DSSG Speaker Series: Paco Nathan

Functional Programming for Big Data

WordCount with token scrubbing…

Apache Hive: 52 lines HQL + 8 lines Python (UDF)

compared to

Scalding: 18 lines Scala/Cascading

functional programming languages help reduce software engineering costs at scale, over time

47

Page 48: DSSG Speaker Series: Paco Nathan

Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.

48

Page 49: DSSG Speaker Series: Paco Nathan

Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in Java to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[Flow diagram: Document Collection → Tokenize (Regex token) → Scrub token → HashJoin (Left) with a Stop Word List on the RHS → GroupBy token → Count → Word Count, with the map (M) / reduce (R) boundary marked]

data is represented as flows of tuples

operations in the flows bring functional programming aspects into Java

A Pattern Language – Christopher Alexander, et al. – amazon.com/dp/0195019199

49

Page 50: DSSG Speaker Series: Paco Nathan

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps – great for cross-team collaboration


Literate Programming – Don Knuth – literateprogramming.com

50

Page 51: DSSG Speaker Series: Paco Nathan

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale

51

Page 52: DSSG Speaker Series: Paco Nathan

The Ubiquitous Word Count

Definition: count how often each word appears in a collection of text documents

void map (String doc_id, String text):
  for each word w in segment(text):
    emit(w, "1");

void reduce (String word, Iterator group):
  int count = 0;
  for each pc in group:
    count += Int(pc);
  emit(word, String(count));

this simple program provides an excellent test case for parallel processing:

• requires a minimal amount of code

• demonstrates use of both symbolic and numeric values

• shows a dependency graph of tuples as an abstraction

• is not many steps away from useful search indexing

• serves as a “Hello World” for Hadoop apps

a distributed computing framework that runs Word Count efficiently in parallel at scale can handle much larger and more interesting compute problems
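a concrete counterpart to the pseudocode above: the canonical Hadoop word count in Java (a sketch assuming the Hadoop 2.x mapreduce API – not code shown in the talk); compare its plumbing with the Cascading version a couple of slides ahead

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable( 1 );
    private Text word = new Text();
    public void map( Object key, Text value, Context context ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer( value.toString() );
      while (itr.hasMoreTokens()) {
        word.set( itr.nextToken() );
        context.write( word, one );   // emit(w, 1)
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();
    public void reduce( Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) sum += val.get();   // count += pc
      result.set( sum );
      context.write( key, result );   // emit(word, count)
    }
  }

  public static void main( String[] args ) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance( conf, "word count" );
    job.setJarByClass( WordCount.class );
    job.setMapperClass( TokenizerMapper.class );
    job.setCombinerClass( IntSumReducer.class );
    job.setReducerClass( IntSumReducer.class );
    job.setOutputKeyClass( Text.class );
    job.setOutputValueClass( IntWritable.class );
    FileInputFormat.addInputPath( job, new Path( args[ 0 ] ) );
    FileOutputFormat.setOutputPath( job, new Path( args[ 1 ] ) );
    System.exit( job.waitForCompletion( true ) ? 0 : 1 );
  }
}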

52

Page 53: DSSG Speaker Series: Paco Nathan

WordCount – conceptual flow diagram

[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; 1 map, 1 reduce; 18 lines of code – gist.github.com/3900702]

cascading.org/category/impatient

53

Page 54: DSSG Speaker Series: Paco Nathan

WordCount – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];

Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into a token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );

// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );

// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef()
  .setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );

// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();


54

Page 55: DSSG Speaker Series: Paco Nathan

WordCount – generated flow diagram

[Generated flow diagram (DOT output): Hfs source tap [‘doc_id’, ‘text’] on data/rain.txt → Each(‘token’) RegexSplitGenerator → map/reduce boundary → GroupBy(‘wc’) by ‘token’ → Every(‘wc’) Count → Hfs sink tap [‘token’, ‘count’] on output/wc]

55

Page 56: DSSG Speaker Series: Paco Nathan

A Thought Exercise

Consider that when a company like Caterpillar moves into data science, they won’t be building the world’s next search engine or social network

They will be optimizing supply chain, optimizing fuel costs, automating data feedback loops integrated into their equipment…

Operations Research – crunching amazing amounts of data

$50B company, in a $250B market segment

Upcoming: tractors as drones – guided by complex, distributed data apps

56

Page 57: DSSG Speaker Series: Paco Nathan

Alternatively…

climate.com

57

Page 58: DSSG Speaker Series: Paco Nathan

Two Avenues to the App Layer…

[Chart axes: complexity vs. scale]

Enterprise: must contend with complexity at scale everyday…

incumbents extend current practices and infrastructure investments – using J2EE, ANSI SQL, SAS, etc. – to migrate workflows onto Apache Hadoop while leveraging existing staff

Start-ups: crave complexity and scale to become viable…

new ventures move into Enterprise space to compete using relatively lean staff, while leveraging sophisticated engineering practices, e.g., Cascalog and Scalding

58

Page 59: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

DSSG, 2013-08-1259

Page 60: DSSG Speaker Series: Paco Nathan

Q3 1997: inflection point

four independent teams were working toward horizontal scale-out of workflows based on commodity hardware

this effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this period

60

Page 61: DSSG Speaker Series: Paco Nathan

Circa 1996: pre-inflection point

[Diagram: Customers, transactions, Web App, RDBMS, SQL query result sets, BI Analysts, Excel pivot tables / PowerPoint slide decks, Stakeholder, Product (strategy), Engineering (requirements), optimized code]

61

Page 62: DSSG Speaker Series: Paco Nathan

Circa 1996: pre-inflection point – “throw it over the wall”

[Same diagram as on the previous slide]

62

Page 63: DSSG Speaker Series: Paco Nathan

Circa 2001: post big ecommerce successes

[Diagram: Customers, customer transactions, Web Apps (recommenders + classifiers), Middleware (servlets, models), Logs / event history, aggregation, DW / ETL, Algorithmic Modeling, dashboards, RDBMS, SQL query result sets, Stakeholder, Product, Engineering, UX]

63

Page 64: DSSG Speaker Series: Paco Nathan

Circa 2001: post big ecommerce successes – “data products”

[Same diagram as on the previous slide]

64

Page 65: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere

[Diagram, “Use Cases Across Topologies”: Customers, Data Products, Web Apps / Mobile, transactions / content, social interactions, Log Events, History, RDBMS, DW, In-Memory Data Grid, Hadoop, etc., Cluster Scheduler, Workflow Planner, near time / batch services, dashboard metrics, business process, optimized capacity, taps; roles – Data Scientist, App Dev, Ops, Domain Expert, s/w dev, discovery + modeling, data science as the introduced capability, the existing SDLC, Prod, Eng]

65

Page 66: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – “optimize topologies”

[Same diagram as on the previous slide]

66

Page 67: DSSG Speaker Series: Paco Nathan

Amazon: “Early Amazon: Splitting the website” – Greg Linden – glinden.blogspot.com/2006/02/early-amazon-splitting-website.html

eBay: “The eBay Architecture” – Randy Shoup, Dan Pritchett – addsimplicity.com/adding_simplicity_an_engi/2006/11/you_scaled_your.html – addsimplicity.com.nyud.net:8080/downloads/eBaySDForum2006-11-29.pdf

Inktomi (YHOO Search): “Inktomi’s Wild Ride” – Eric Brewer (0:05:31 ff) – youtu.be/E91oEn1bnXM

Google: “Underneath the Covers at Google” – Jeff Dean (0:06:54 ff) – youtu.be/qsan-GQaeyk – perspectives.mvdirona.com/2008/06/11/JeffDeanOnGoogleInfrastructure.aspx

MIT Media Lab: “Social Information Filtering for Music Recommendation” – Pattie Maes – pubs.media.mit.edu/pubs/papers/32paper.ps – ted.com/speakers/pattie_maes.html

Primary Sources

67

Page 68: DSSG Speaker Series: Paco Nathan

Cluster Computing’s Dirty Little Secret

people like me make a good living by leveraging high ROI apps based on clusters, and so the execs agree to build out more data centers…

clusters for Hadoop/HBase, for Storm, for MySQL, for Memcached, for Cassandra, for Nginx, etc.

this becomes expensive!

a single class of workloads on a given cluster is simpler to manage; but terrible for utilization… various notions of “cloud” help

Cloudera, Hortonworks, probably EMC soon: sell a notion of “Hadoop as OS” ⇒ All your workloads are belong to us

regardless of how architectures change, death and taxes will endure: servers fail, and data must move

Google Data Center, Fox News

~2002

68

Page 69: DSSG Speaker Series: Paco Nathan

Three Laws, or more?

meanwhile, architectures evolve toward much, much larger data…

pistoncloud.com/ ...

Rich Freitas, IBM Research

Q:what kinds of disruption in topologies could this imply? because there’s no such thing as RAM anymore…

69

Page 70: DSSG Speaker Series: Paco Nathan

Topologies

Hadoop and other topologies arose from a need for fault-tolerant workloads, leveraging horizontal scale-out based on commodity hardware

because the data won’t fit on one computer anymore

a variety of Big Data technologies has since emerged, which can be categorized in terms of topologies and the CAP Theorem

[Diagram: CAP triangle – strong Consistency (C), high Availability (A), Partition tolerance (P), with eventual consistency as the trade-off]

“You can have at most two of these properties for any shared-data system… the choice of which feature to discard determines the nature of your system.” – Eric Brewer, 2000 (Inktomi/YHOO)

cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf

julianbrowne.com/article/viewer/brewers-cap-theorem

70

Page 72: DSSG Speaker Series: Paco Nathan

“Return of the Borg”

consider that Google is generations ahead of Hadoop, etc., with much improved ROI on its data centers…

Borg serves as a kind of “secret sauce” for data center OS, with Omega as its next evolution:

“2011 GAFS Omega” – John Wilkes, et al. – youtu.be/0ZFMlO98Jkc

72

Page 73: DSSG Speaker Series: Paco Nathan

“Return of the Borg”

Return of the Borg: How Twitter Rebuilt Google’s Secret Weapon – Cade Metz – wired.com/wiredenterprise/2013/03/google-borg-twitter-mesos

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines – Luiz André Barroso, Urs Hölzle – research.google.com/pubs/pub35290.html

73

Page 74: DSSG Speaker Series: Paco Nathan

Mesos – definitions

a common substrate for cluster computing

heterogeneous assets in your data center or cloud made available as a homogeneous set of resources

• top-level Apache project

• scalability to 10,000s of nodes

• obviates the need for virtual machines

• isolation between tasks with Linux Containers (pluggable)

• fault-tolerant replicated master using ZooKeeper

• multi-resource scheduling (memory and CPU aware)

• APIs in C++, Java, Python

• web UI for inspecting cluster state

• available for Linux, Mac OSX, OpenSolaris

74

Page 75: DSSG Speaker Series: Paco Nathan

Mesos – simplifies app development

[Diagram: frameworks such as Chronos, Spark, Hadoop, DPark, and MPI running on Mesos, with support for the JVM (Java, Scala, Clojure, JRuby), Python, and C++]

75

Page 76: DSSG Speaker Series: Paco Nathan

Mesos – data center OS stack

[Diagram: data center OS stack – Mesos as the kernel; OS-level services such as a capacity planning GUI, security, smarter scheduling, and telemetry; apps such as Hadoop, Storm, Chronos, Rails, JBoss]

76

Page 77: DSSG Speaker Series: Paco Nathan

Mesos – architecture

[Diagram: the Mesos kernel at the base; meta-frameworks such as Chronos and Marathon; frameworks such as Hadoop, Spark, Storm, Kafka, MPI, Hive, Scalding, Rails, JBoss; APIs for the JVM, Python, and C++; workloads spanning batch, streaming, and web apps]

77

Page 78: DSSG Speaker Series: Paco Nathan

Prior Practice: Dedicated Servers

DATACENTER

• low utilization rates

• longer time to ramp up new services

78

Page 79: DSSG Speaker Series: Paco Nathan

Prior Practice: Virtualization

DATACENTER PROVISIONED VMS

• even more machines to manage

• substantial performance decrease due to virtualization

• VM licensing costs

79

Page 80: DSSG Speaker Series: Paco Nathan

Prior Practice: Static Partitioning

DATACENTER STATIC PARTITIONING

• even more machines to manage

• substantial performance decrease due to virtualization

• VM licensing costs

• static partitioning limits elasticity

80

Page 81: DSSG Speaker Series: Paco Nathan

MESOS

Mesos: One Large Pool Of Resources

DATACENTER

“We wanted people to be able to program for the data center just like they program for their laptop."

Ben Hindman

81

Page 83: DSSG Speaker Series: Paco Nathan

What are the costs of Single Tenancy?

[Figure: CPU load over time (0–100%) for Rails, Memcached, and Hadoop individually, and the combined CPU load (Rails + Memcached + Hadoop)]

83

Page 84: DSSG Speaker Series: Paco Nathan

Compelling arguments for Data Center OS

• obviates the need for VMs (licensing, adios VMware)

• provides OS-level building blocks for developing new distributed frameworks (learning curve, adios Hadoop)

• removes significant VM overhead (performance)

• requires less h/w to buy (CapEx), power and fix (OpEx)

• implies less VMs, thus less Ops overhead (staff)

• removes the complexity of Chef/Puppet (staff)

• allows higher utilization rates (ROI)

• reduces latency for data updates (OLTP + OLAP on same server)

• reshapes cluster resources dynamically (100’s ms vs. minutes)

• runs dev/test clusters on same h/w as production (flexibility)

• evaluates multiple versions without more h/w (vendor lock-in)

84

Page 85: DSSG Speaker Series: Paco Nathan

Opposite Ends of the Spectrum, One Substrate

Built-in /bare metal

Hypervisors

Solaris Zones

Linux CGroups

85

Page 86: DSSG Speaker Series: Paco Nathan

Opposite Ends of the Spectrum, One Substrate

Request /Response Batch

86

Page 87: DSSG Speaker Series: Paco Nathan

Case Study: Twitter (bare metal / on premise)

“Mesos is the cornerstone of our elastic compute infrastructure – it’s how we build all our new services and is critical for Twitter’s continued success at scale. It's one of the primary keys to our data center efficiency."

Chris Fry, SVP Engineering – blog.twitter.com/2013/mesos-graduates-from-apache-incubation

• key services run in production: analytics, typeahead, ads

• Twitter engineers rely on Mesos to build all new services

• instead of thinking about static machines, engineers think about resources like CPU, memory and disk

• allows services to scale and leverage a shared pool of servers across data centers efficiently

• reduces the time between prototyping and launching

87

Page 88: DSSG Speaker Series: Paco Nathan

Case Study: Airbnb (fungible cloud infrastructure)

“We think we might be pushing data science in the field of travel more so than anyone has ever done before… a smaller number of engineers can have higher impact through automation on Mesos."

Mike Curtis, VP Engineering – gigaom.com/2013/07/29/airbnb-is-engineering-itself-into-a-data-driven...

• improves resource management and efficiency

• helps advance engineering strategy of building small teams that can move fast

• key to letting engineers make the most of AWS-based infrastructure beyond just Hadoop

• allowed company to migrate off Elastic MapReduce

• enables use of Hadoop along with Chronos, Spark, Storm, etc.

88

Page 90: DSSG Speaker Series: Paco Nathan

Learnings generalized from trends in Data Science:

1. the practice of leading data science teams

2. strategies for leveraging data at scale

3. machine learning and optimization

4. large-scale data workflows

5. the evolution of cluster computing

SUMMARY…

DSSG, 2013-08-12

Page 91: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – Four-Part Harmony

[Same “Use Cases Across Topologies” diagram as on Page 65]

91

Page 92: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – Four-Part Harmony

[Same “Use Cases Across Topologies” diagram as on Page 65]

1. End Use Cases, the drivers

92

Page 93: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – Four-Part Harmony

[Same “Use Cases Across Topologies” diagram as on Page 65]

2. A new kind of team process

93

Page 94: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – Four-Part Harmony

[Same “Use Cases Across Topologies” diagram as on Page 65]

3. Abstraction layer as optimizing middleware, e.g., Cascading

94

Page 95: DSSG Speaker Series: Paco Nathan

Circa 2013: clusters everywhere – Four-Part Harmony

[Same “Use Cases Across Topologies” diagram as on Page 65]

4. Data Center OS, e.g., Mesos

95