July Clojure Users Group Meeting: "Using Cascalog with Palo Alto Open Data"

Paco Nathan, liber118.com/pxn/, “Using Cascalog with Palo Alto Open Data”. Licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License. LA Clojure User Group, Friday, 19 July 13


Cascading / Cascalog / Scalding

Enterprise Data Workflows with Cascading

Cluster Computing with Mesos

Using Cascalog with Palo Alto Open Data


Cascading – origins

API author Chris Wensel worked as a system architect at an Enterprise firm well-known for many popular data products.

Wensel was following the Nutch open source project – where Hadoop started.

Observation: it would be difficult to find Java developers to write complex Enterprise apps in MapReduce – a potential blocker for leveraging new open source technology.


Cascading – functional programming

Key insight: MapReduce is based on functional programming – back to LISP in 1970s. Apache Hadoop use cases are mostly about data pipelines, which are functional in nature.

To ease staffing problems as “Main Street” Enterprise firms began to embrace Hadoop, Cascading was introduced in late 2007, as a new Java API to implement functional programming for large-scale data workflows:

• leverages JVM and Java-based tools without any need to create new languages

• allows programmers who have J2EE expertise to leverage the economics of Hadoop clusters


Cascading – functional programming

• Twitter, eBay, LinkedIn, Nokia, YieldBot, uSwitch, etc., have invested in open source projects atop Cascading – used for their large-scale production deployments

• new case studies for Cascading apps are mostly based on domain-specific languages (DSLs) in JVM languages which emphasize functional programming:

Cascalog in Clojure (2010): github.com/nathanmarz/cascalog/wiki

Scalding in Scala (2012): github.com/twitter/scalding/wiki

Why Adopting the Declarative Programming Practices Will Improve Your Return from Technology
Dan Woods, 2013-04-17, Forbes
forbes.com/sites/danwoods/2013/04/17/why-adopting-the-declarative-programming-practices-will-improve-your-return-from-technology/

Cascading – integrations

[Diagram: a Data Workflow on a Hadoop Cluster, with source taps, sink taps, and a trap tap. Labels: customer profile DBs, Customer Prefs, logs, Cache, Customers, Support, Web App, Reporting, Analytics Cubes, Modeling, PMML]

• partners: Microsoft Azure, Hortonworks, Amazon AWS, MapR, EMC, SpringSource, Cloudera

• taps: Memcached, Cassandra, MongoDB, HBase, JDBC, Parquet, etc.

• serialization: Avro, Thrift, Kryo, JSON, etc.

• topologies: Apache Hadoop, tuple spaces, local mode


Cascading – deployments

• case studies: Climate Corp, Twitter, Etsy, Williams-Sonoma, uSwitch, Airbnb, Nokia, YieldBot, Square, Harvard, Factual, etc.

• use cases: ETL, marketing funnel, anti-fraud, social media, retail pricing, search analytics, recommenders, eCRM, utility grids, telecom, genomics, climatology, agronomics, etc.


workflow abstraction addresses: • staffing bottleneck; • system integration; • operational complexity; • test-driven development

WordCount – conceptual flow diagram

[Diagram: Document Collection → Tokenize → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

1 map, 1 reduce; 18 lines of code: gist.github.com/3900702

cascading.org/category/impatient
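The conceptual flow can be sketched without Cascading or Hadoop at all. The following is a minimal plain-JDK illustration, not the app from the gist: the class name `WordCountSketch` and the sample sentences are hypothetical, and dropping empty tokens is an assumption added here for tidiness. It shows only the shape of the pipeline: tokenize each line, group by token, count each group.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-JDK sketch of the conceptual flow; no Cascading or Hadoop involved.
final class WordCountSketch {

    // Tokenize -> GroupBy token -> Count, mirroring the boxes in the diagram.
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // "map" side: split each line into a stream of tokens
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // assumption for this sketch: empty tokens are discarded
            .filter(token -> !token.isEmpty())
            // "reduce" side: group identical tokens, then count each group
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical sample input
        List<String> docs = List.of(
            "A rain shadow is a dry area.",
            "rain falls on the windward side");
        System.out.println(wordCount(docs));
    }
}
```

On a cluster the same three steps are spread across map and reduce tasks; here they run as a single in-memory stream.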


WordCount – Cascading app in Java

String docPath = args[ 0 ];
String wcPath = args[ 1 ];
Properties properties = new Properties();
AppProps.setApplicationJarClass( properties, Main.class );
HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

// create source and sink taps
Tap docTap = new Hfs( new TextDelimited( true, "\t" ), docPath );
Tap wcTap = new Hfs( new TextDelimited( true, "\t" ), wcPath );

// specify a regex to split "document" text lines into token stream
Fields token = new Fields( "token" );
Fields text = new Fields( "text" );
RegexSplitGenerator splitter = new RegexSplitGenerator( token, "[ \\[\\]\\(\\),.]" );
// only returns "token"
Pipe docPipe = new Each( "token", text, splitter, Fields.RESULTS );
// determine the word counts
Pipe wcPipe = new Pipe( "wc", docPipe );
wcPipe = new GroupBy( wcPipe, token );
wcPipe = new Every( wcPipe, Fields.ALL, new Count(), Fields.ALL );

// connect the taps, pipes, etc., into a flow
FlowDef flowDef = FlowDef.flowDef().setName( "wc" )
  .addSource( docPipe, docTap )
  .addTailSink( wcPipe, wcTap );
// write a DOT file and run the flow
Flow wcFlow = flowConnector.connect( flowDef );
wcFlow.writeDOT( "dot/wc.dot" );
wcFlow.complete();

WordCount – generated flow diagram

[Flow diagram, from head to tail, with the map and reduce phases marked:
[head]
Hfs['TextDelimited[['doc_id', 'text']->[ALL]]']['data/rain.txt']']
Each('token')[RegexSplitGenerator[decl:'token'][args:1]]
GroupBy('wc')[by:['token']]
Every('wc')[Count[decl:'count']]
Hfs['TextDelimited[[UNKNOWN]->['token', 'count']]']['output/wc']']
[tail]]

(ns impatient.core
  (:use [cascalog.api]
        [cascalog.more-taps :only (hfs-delimited)])
  (:require [clojure.string :as s]
            [cascalog.ops :as c])
  (:gen-class))

(defmapcatop split [line]
  "reads in a line of string and splits it by regex"
  (s/split line #"[\[\]\\\(\),.)\s]+"))

(defn -main [in out & args]
  (?<- (hfs-delimited out)
       [?word ?count]
       ((hfs-delimited in :skip-header? true) _ ?line)
       (split ?line :> ?word)
       (c/count ?count)))

; Paul Lam
; github.com/Quantisan/Impatient

WordCount – Cascalog / Clojure

github.com/nathanmarz/cascalog/wiki

• implements Datalog in Clojure, with predicates backed by Cascading – for a highly declarative language

• run ad-hoc queries from the Clojure REPL – approx. 10:1 code reduction compared with SQL

• composable subqueries, used for test-driven development (TDD) practices at scale

• Leiningen build: simple, no surprises, in Clojure itself

• more new deployments than other Cascading DSLs – Climate Corp is largest use case: 90% Clojure/Cascalog

• has a learning curve, limited number of Clojure developers

• aggregators are the magic, and those take effort to learn


import com.twitter.scalding._

class WordCount(args : Args) extends Job(args) {
  Tsv(args("doc"), ('doc_id, 'text), skipHeader = true)
    .read
    .flatMap('text -> 'token) { text : String => text.split("[ \\[\\]\\(\\),.]") }
    .groupBy('token) { _.size('count) }
    .write(Tsv(args("wc"), writeHeader = true))
}

WordCount – Scalding / Scala

github.com/twitter/scalding/wiki

• extends the Scala collections API so that distributed lists become “pipes” backed by Cascading

• code is compact, easy to understand

• nearly 1:1 between elements of conceptual flow diagram and function calls

• extensive libraries are available for linear algebra, abstract algebra, machine learning – e.g., Matrix API, Algebird, etc.

• significant investments by Twitter, Etsy, eBay, etc.

• great for data services at scale

• gentler learning curve than Cascalog


Workflow Abstraction – pattern language

Cascading uses a “plumbing” metaphor in the Java API, to define workflows out of familiar elements: Pipes, Taps, Tuple Flows, Filters, Joins, Traps, etc.

[Diagram: Document Collection → Tokenize → Scrub token → HashJoin Left (RHS: Stop Word List, via Regex token) → GroupBy token → Count → Word Count; M and R mark the map and reduce phases]

Data is represented as flows of tuples. Operations within the flows bring functional programming aspects into Java
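As a rough illustration of that idea, the extended WordCount flow in the diagram can be mimicked with a plain-JDK stream, where each stage stands in for one pipe. This is a sketch under stated assumptions, not Cascading code: the class name `StopWordFlowSketch` is hypothetical, the scrub step is assumed to trim and lowercase, and the left join against the stop word list is approximated by a set lookup that keeps only unmatched tokens.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-JDK sketch: each stream stage stands in for one pipe in the diagram.
final class StopWordFlowSketch {

    static Map<String, Long> run(List<String> lines, Set<String> stopWords) {
        return lines.stream()
            // Tokenize: split each line on the same delimiters as the WordCount regex
            .flatMap(line -> Arrays.stream(line.split("[ \\[\\]\\(\\),.]")))
            // Scrub token (assumed here to mean trim + lowercase)
            .map(token -> token.trim().toLowerCase(Locale.ROOT))
            .filter(token -> !token.isEmpty())
            // Stop word list: keep only tokens with no match on the right-hand side
            .filter(token -> !stopWords.contains(token))
            // GroupBy token + Count
            .collect(Collectors.groupingBy(Function.identity(), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // hypothetical sample input and stop words
        List<String> docs = List.of("The rain in the shadow", "a Rain shadow");
        System.out.println(run(docs, Set.of("a", "the", "in")));
    }
}
```

The stages are pure functions over a stream of values, which is the functional aspect the plumbing metaphor brings into Java.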

A Pattern Language
Christopher Alexander, et al.
amazon.com/dp/0195019199

Workflow Abstraction – literate programming

Cascading workflows generate their own visual documentation: flow diagrams

in formal terms, flow diagrams leverage a methodology called literate programming

provides intuitive, visual representations for apps – great for cross-team collaboration

Literate Programming
Don Knuth
literateprogramming.com

Workflow Abstraction – business process

following the essence of literate programming, Cascading workflows provide statements of business process

this recalls a sense of business process management for Enterprise apps (think BPM/BPEL for Big Data)

Cascading creates a separation of concerns between business process and implementation details (Hadoop, etc.)

this is especially apparent in large-scale Cascalog apps:

“Specify what you require, not how to achieve it.”

by virtue of the pattern language, the flow planner then determines how to translate business process into efficient, parallel jobs at scale


Cascading / Cascalog / Scalding

Enterprise Data Workflows with Cascading

Cluster Computing with Mesos

Using Cascalog with Palo Alto Open Data


Anatomy of an Enterprise app

Definition: a typical Enterprise workflow that crosses multiple departments, languages, and technologies…

[Diagram: data sources → ETL → data prep → predictive model → end uses]

The same diagram is then annotated step by step:

• ANSI SQL for ETL

• J2EE for business logic

• SAS for predictive models

• SAS for predictive models, ANSI SQL for ETL: most of the licensing costs…

• J2EE for business logic: most of the project costs…

Anatomy of an Enterprise app

[Diagram: data sources → ETL → data prep → predictive model → end uses, annotated with:]

• source taps for Cassandra, JDBC, Splunk, etc.

• Lingual: DW → ANSI SQL

• Pattern: SAS, R, etc. → PMML

• business logic in Java, Clojure, Scala, etc.

• sink taps for Memcached, HBase, MongoDB, etc.

Cascading allows multiple departments to combine their workflow components into an integrated app – one among many, typically – based on 100% open source

a compiler sees it all…

cascading.org


FlowDef flowDef = FlowDef.flowDef()
  .setName( "etl" )
  .addSource( "example.employee", emplTap )
  .addSource( "example.sales", salesTap )
  .addSink( "results", resultsTap );

SQLPlanner sqlPlanner = new SQLPlanner()
  .setSql( sqlStatement );

flowDef.addAssemblyPlanner( sqlPlanner );


FlowDef flowDef = FlowDef.flowDef()
  .setName( "classifier" )
  .addSource( "input", inputTap )
  .addSink( "classify", classifyTap );

PMMLPlanner pmmlPlanner = new PMMLPlanner()
  .setPMMLInput( new File( pmmlModel ) )
  .retainOnlyActiveIncomingFields();

flowDef.addAssemblyPlanner( pmmlPlanner );


visual collaboration for the business logic is a great way to improve how teams work together

[Flow diagram elements: employee, quarterly sales, Join, PMML classifier, bonus allocation, Count leads, Failure Traps]

Lingual – CSV data in local file system

cascading.org/lingual


Lingual – shell prompt, catalog

cascading.org/lingual


Lingual – queries

cascading.org/lingual


# load the JDBC package
library(RJDBC)

# set up the driver
drv <- JDBC("cascading.lingual.jdbc.Driver", "~/src/concur/lingual/lingual-local/build/libs/lingual-local-1.0.0-wip-dev-jdbc.jar")

# set up a database connection to a local repository
connection <- dbConnect(drv, "jdbc:lingual:local;catalog=~/src/concur/lingual/lingual-examples/tables;schema=EMPLOYEES")

# query the repository: in this case the MySQL sample database (CSV files)
df <- dbGetQuery(connection, "SELECT * FROM EMPLOYEES.EMPLOYEES WHERE FIRST_NAME = 'Gina'")
head(df)

# use R functions to summarize and visualize part of the data
df$hire_age <- as.integer(as.Date(df$HIRE_DATE) - as.Date(df$BIRTH_DATE)) / 365.25
summary(df$hire_age)

library(ggplot2)
m <- ggplot(df, aes(x=hire_age))
m <- m + ggtitle("Age at hire, people named Gina")
m + geom_histogram(binwidth=1, aes(y=..density.., fill=..count..)) + geom_density()

Lingual – connecting Hadoop and R

> summary(df$hire_age)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  20.86   27.89   31.70   31.61   35.01   43.92

cascading.org/lingual


Pattern – model scoring

• migrate workloads: SAS, Teradata, etc., exporting predictive models as PMML

• great open source tools – R, Weka, KNIME, Matlab, RapidMiner, etc.

• integrate with other libraries – Matrix API, etc.

• leverage PMML as another kind of DSL

cascading.org/pattern


• established XML standard for predictive model markup

• organized by Data Mining Group (DMG), since 1997 http://dmg.org/

• members: IBM, SAS, Visa, NASA, Equifax, Microstrategy, Microsoft, etc.

• PMML concepts for metadata, ensembles, etc., translate directly into Cascading tuple flows

“PMML is the leading standard for statistical and data mining models and supported by over 20 vendors and organizations. With PMML, it is easy to develop a model on one system using one application and deploy the model on another system using another application.”

PMML – standard

wikipedia.org/wiki/Predictive_Model_Markup_Language


PMML – vendor coverage


• Association Rules: AssociationModel element

• Cluster Models: ClusteringModel element

• Decision Trees: TreeModel element

• Naïve Bayes Classifiers: NaiveBayesModel element

• Neural Networks: NeuralNetwork element

• Regression: RegressionModel and GeneralRegressionModel elements

• Rulesets: RuleSetModel element

• Sequences: SequenceModel element

• Support Vector Machines: SupportVectorMachineModel element

• Text Models: TextModel element

• Time Series: TimeSeriesModel element

PMML – model coverage

ibm.com/developerworks/industry/library/ind-PMML2/


## train a RandomForest model
f <- as.formula("as.factor(label) ~ .")
fit <- randomForest(f, data_train, ntree=50)

## test the model on the holdout test set
print(fit$importance)
print(fit)

predicted <- predict(fit, data)
data$predicted <- predicted
confuse <- table(pred = predicted, true = data[,1])
print(confuse)

## export predicted labels to TSV
write.table(data, file=paste(dat_folder, "sample.tsv", sep="/"), quote=FALSE, sep="\t", row.names=FALSE)

## export RF model to PMML
saveXML(pmml(fit), file=paste(dat_folder, "sample.rf.xml", sep="/"))

Pattern – create a model in R


<?xml version="1.0"?>
<PMML version="4.0" xmlns="http://www.dmg.org/PMML-4_0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://www.dmg.org/PMML-4_0 http://www.dmg.org/v4-0/pmml-4-0.xsd">
 <Header copyright="Copyright (c)2012 Concurrent, Inc." description="Random Forest Tree Model">
  <Extension name="user" value="ceteri" extender="Rattle/PMML"/>
  <Application name="Rattle/PMML" version="1.2.30"/>
  <Timestamp>2012-10-22 19:39:28</Timestamp>
 </Header>
 <DataDictionary numberOfFields="4">
  <DataField name="label" optype="categorical" dataType="string">
   <Value value="0"/>
   <Value value="1"/>
  </DataField>
  <DataField name="var0" optype="continuous" dataType="double"/>
  <DataField name="var1" optype="continuous" dataType="double"/>
  <DataField name="var2" optype="continuous" dataType="double"/>
 </DataDictionary>
 <MiningModel modelName="randomForest_Model" functionName="classification">
  <MiningSchema>
   <MiningField name="label" usageType="predicted"/>
   <MiningField name="var0" usageType="active"/>
   <MiningField name="var1" usageType="active"/>
   <MiningField name="var2" usageType="active"/>
  </MiningSchema>
  <Segmentation multipleModelMethod="majorityVote">
   <Segment id="1">
    <True/>
    <TreeModel modelName="randomForest_Model" functionName="classification" algorithmName="randomForest" splitCharacteristic="binarySplit">
     <MiningSchema>
      <MiningField name="label" usageType="predicted"/>
      <MiningField name="var0" usageType="active"/>
      <MiningField name="var1" usageType="active"/>
      <MiningField name="var2" usageType="active"/>
     </MiningSchema>
...

Pattern – capture model parameters as PMML
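The `Segmentation` element above declares `multipleModelMethod="majorityVote"`: each segment's tree predicts a label, and the ensemble reports the most frequent one. The following plain-JDK sketch illustrates only that voting step. It does not parse PMML and is not how Pattern implements scoring; the class name `MajorityVoteSketch`, the lambda "trees", their thresholds, and the tie-breaking rule are all hypothetical.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Plain-JDK sketch of majority voting across segments; it does not read PMML.
final class MajorityVoteSketch {

    // Each "tree" maps the active fields (var0, var1, var2) to a predicted label.
    static String classify(List<Function<double[], String>> trees, double[] vars) {
        Map<String, Integer> votes = new HashMap<>();
        String winner = null;
        int best = 0;
        for (Function<double[], String> tree : trees) {
            String label = tree.apply(vars);
            int n = votes.merge(label, 1, Integer::sum);
            // on a tie, the label that reached the top count first is kept
            // (an arbitrary choice made for this sketch)
            if (n > best) {
                best = n;
                winner = label;
            }
        }
        return winner;
    }

    public static void main(String[] args) {
        // hypothetical stand-ins for three TreeModel segments; thresholds are invented
        List<Function<double[], String>> trees = List.of(
            v -> v[0] > 0.5 ? "1" : "0",
            v -> v[1] > 0.5 ? "1" : "0",
            v -> v[2] > 0.5 ? "1" : "0");
        System.out.println(classify(trees, new double[] {0.9, 0.8, 0.1}));
    }
}
```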


public static void main( String[] args ) throws RuntimeException {
  String inputPath = args[ 0 ];
  String classifyPath = args[ 1 ];

  // set up the config properties
  Properties properties = new Properties();
  AppProps.setApplicationJarClass( properties, Main.class );
  HadoopFlowConnector flowConnector = new HadoopFlowConnector( properties );

  // create source and sink taps
  Tap inputTap = new Hfs( new TextDelimited( true, "\t" ), inputPath );
  Tap classifyTap = new Hfs( new TextDelimited( true, "\t" ), classifyPath );

  // handle command line options
  OptionParser optParser = new OptionParser();
  optParser.accepts( "pmml" ).withRequiredArg();

  OptionSet options = optParser.parse( args );

  // connect the taps, pipes, etc., into a flow
  FlowDef flowDef = FlowDef.flowDef().setName( "classify" )
    .addSource( "input", inputTap )
    .addSink( "classify", classifyTap );

  if( options.hasArgument( "pmml" ) ) {
    String pmmlPath = (String) options.valuesOf( "pmml" ).get( 0 );
    PMMLPlanner pmmlPlanner = new PMMLPlanner()
      .setPMMLInput( new File( pmmlPath ) )
      .retainOnlyActiveIncomingFields()
      .setDefaultPredictedField( new Fields( "predict", Double.class ) ); // default value if missing from the model
    flowDef.addAssemblyPlanner( pmmlPlanner );
  }

  // write a DOT file and run the flow
  Flow classifyFlow = flowConnector.connect( flowDef );
  classifyFlow.writeDOT( "dot/classify.dot" );
  classifyFlow.complete();
}

Pattern – score a model, within an app

Pattern – score a model, using pre-defined Cascading app

[Diagram: Customer Orders → Classify (with the PMML Model) → Scored Orders; further elements: GroupBy token, Count, Assert, Confusion Matrix, Failure Traps; M and R mark the map and reduce phases]

cascading.org/pattern
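The Confusion Matrix element in the diagram tallies how often each predicted label coincides with each true label, much like the `table(pred = predicted, true = data[,1])` call in the R script. A minimal plain-JDK sketch of that tally follows; the class name `ConfusionMatrixSketch`, the `"predicted/actual"` key format, and the sample labels are assumptions made for illustration, not part of Pattern.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-JDK sketch of a confusion matrix tally; not taken from the Pattern code base.
final class ConfusionMatrixSketch {

    // Counts (predicted, actual) pairs; each cell is keyed as "predicted/actual".
    static Map<String, Integer> confusion(List<String> predicted, List<String> actual) {
        if (predicted.size() != actual.size()) {
            throw new IllegalArgumentException("predicted and actual must have the same length");
        }
        Map<String, Integer> cells = new TreeMap<>();
        for (int i = 0; i < predicted.size(); i++) {
            cells.merge(predicted.get(i) + "/" + actual.get(i), 1, Integer::sum);
        }
        return cells;
    }

    public static void main(String[] args) {
        // hypothetical labels for four scored orders
        List<String> predicted = List.of("1", "0", "1", "1");
        List<String> actual = List.of("1", "0", "0", "1");
        System.out.println(confusion(predicted, actual));
    }
}
```

Cells where the two labels agree (for example `1/1`) are correct classifications; the others are the errors a regression test on the model would watch.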


Roadmap – existing algorithms for scoring

• Random Forest

• Decision Trees

• Linear Regression

• GLM

• Logistic Regression

• K-Means Clustering

• Hierarchical Clustering

• Multinomial

• Support Vector Machines (prepared for release)

also, model chaining and general support for ensembles

cascading.org/pattern


Roadmap – next priorities for scoring

• Time Series (ARIMA forecast)

• Association Rules (basket analysis)

• Naïve Bayes

• Neural Networks

algorithms extended based on customer use cases – contact groups.google.com/forum/?fromgroups#!forum/pattern-user

cascading.org/pattern


Cascading / Cascalog / Scalding

Enterprise Data Workflows with Cascading

Cluster Computing with Mesos

Using Cascalog with Palo Alto Open Data


Q3 1997: inflection point

Four independent teams were working toward horizontal scale-out of workflows based on commodity hardware

This effort prepared the way for huge Internet successes in the 1997 holiday season… AMZN, EBAY, Inktomi (YHOO Search), then GOOG

MapReduce and the Apache Hadoop open source stack emerged from this

Circa 1996: pre-inflection point

[Diagram: Stakeholder, Product, Engineering, BI Analysts, Customers, Web App, and RDBMS, linked by strategy, requirements, optimized code, transactions, SQL Query result sets, Excel pivot tables, and PowerPoint slide decks]


“throw it over the wall”

Circa 2001: post-big ecommerce successes

[Diagram: Stakeholder, Product, Engineering, UX, Algorithmic Modeling, and Customers around Web Apps, Middleware, RDBMS, Logs, and DW; labels: customer transactions, event history, ETL, aggregation, dashboards, SQL Query result sets, recommenders + classifiers, servlets, models]


“data products”

Circa 2013: clusters everywhere

[Diagram: Domain Expert, Data Scientist, App Dev, and Ops roles around a Workflow spanning RDBMS, Log Events, In-Memory Data Grid, Hadoop, etc., and a Cluster Scheduler; labels: Web Apps, Mobile, etc.; Customers; Data Products; History; transactions, content; social interactions; near time; batch; services; Use Cases Across Topologies; Planner; s/w dev; data science; discovery + modeling; dashboard metrics; business process; optimized capacity; taps; introduced capability; existing SDLC; Prod; Eng; DW; Ops]


“optimize topologies”


Operating Systems, redux

meanwhile, GOOG is 3+ generations ahead, with much improved ROI on data centers

John Wilkes, et al.
Borg/Omega: data center “secret sauce”
youtu.be/0ZFMlO98Jkc

[Charts: CPU load over time t, from 0% to 100%, shown separately for Rails, Memcached, and Hadoop, and then as combined CPU load (Rails, Memcached, Hadoop)]

Florian Leibert, Chronos/Mesos @ Airbnb

Mesos, open source cloud OS – like Borg
goo.gl/jPtTP


Mesos

a common substrate for cluster computing

• scale to 10,000s of nodes using fast, event-driven C++ impl

• improve utilization across workloads

• run long-lived services (e.g., Hypertable and HBase) on the same nodes as batch app and share resources

• build new cluster computing frameworks without reinventing low-level facilities, and have them coexist with existing work

• run multiple instances/versions of Hadoop on the same cluster to isolate production and experimental jobs

• reshape cluster resources based on ML from app history

• reduce latency in transferring data products from one cluster to another

• enable new kinds of apps, which combine frameworks with lower latency


Cascading / Cascalog / Scalding

Enterprise Data Workflows with Cascading

Cluster Computing with Mesos

Using Cascalog with Palo Alto Open Data


Palo Alto is quite a pleasant place

• temperate weather

• lots of parks, enormous trees

• great coffeehouses

• walkable downtown

• not particularly crowded

On a nice summer day, who wants to be stuck indoors on a phone call?

Instead, take it outside – go for a walk

An example open source project: github.com/Cascading/CoPA/wiki


1. Open Data about municipal infrastructure (GIS data: trees, roads, parks)

2. Big Data about where people like to walk (smartphone GPS logs)

3. some curated metadata (which surfaces the value)

⇒ 4. personalized recommendations:

“Find a shady spot on a summer day in which to walk near downtown Palo Alto. While on a long conference call. Sipping a latte or enjoying some fro-yo.”


The City of Palo Alto recently began to support Open Data to give the local community greater visibility into how their city government operates

This effort is intended to encourage students, entrepreneurs, local organizations, etc., to build new apps which contribute to the public good

paloalto.opendata.junar.com/dashboards/7576/geographic-information/

discovery


GIS about trees in Palo Alto:

discovery


Geographic_Information,,,

"Tree: 29 site 2 at 203 ADDISON AV, on ADDISON AV 44 from pl"," Private: -1 Tree ID: 29 Street_Name: ADDISON AV Situs Number: 203 Tree Site: 2 Species: Celtis australis Source: davey tree Protected: Designated: Heritage: Appraised Value: Hardscape: None Identifier: 40 Active Numeric: 1 Location Feature ID: 13872 Provisional: Install Date: ","37.4409634615283,-122.15648458861,0.0 ","Point"

"Wilkie Way from West Meadow Drive to Victoria Place"," Sequence: 20 Street_Name: Wilkie Way From Street PMMS: West Meadow Drive To Street PMMS: Victoria Place Street ID: 598 (Wilkie Wy, Palo Alto) From Street ID PMMS: 689 To Street ID PMMS: 567 Year Constructed: 1950 Traffic Count: 596 Traffic Index: residential local Traffic Class: local residential Traffic Date: 08/24/90 Paving Length: 208 Paving Width: 40 Paving Area: 8320 Surface Type: asphalt concrete Surface Thickness: 2.0 Base Type Pvmt: crusher run base Base Thickness: 6.0 Soil Class: 2 Soil Value: 15 Curb Type: Curb Thickness: Gutter Width: 36.0 Book: 22 Page: 1 District Number: 18 Land Use PMMS: 1 Overlay Year: 1990 Overlay Thickness: 1.5 Base Failure Year: 1990 Base Failure Thickness: 6 Surface Treatment Year: Surface Treatment Type: Alligator Severity: none Alligator Extent: 0 Block Severity: none Block Extent: 0 Longitude and Transverse Severity: none Longitude and Transverse Extent: 0 Ravelling Severity: none Ravelling Extent: 0 Ridability Severity: none Trench Severity: none Trench Extent: 0 Rutting Severity: none Rutting Extent: 0 Road Performance: UL (Urban Local) Bike Lane: 0 Bus Route: 0 Truck Route: 0 Remediation: Deduct Value: 100 Priority: Pavement Condition: excellent Street Cut Fee per SqFt: 10.00 Source Date: 6/10/2009 User Modified By: mnicols Identifier System: 21410 ","-122.1249640794,37.4155803115645,0.0 -122.124661859039,37.4154224594993,0.0 -122.124587720719,37.4153758330704,0.0 -122.12451895942,37.4153242300888,0.0 -122.124456098457,37.4152680432944,0.0 -122.124399616238,37.4152077003122,0.0 -122.124374937753,37.4151774433318,0.0 ","Line"

discovery

(unstructured data…)


(defn parse-gis
  "leverages parse-csv for complex CSV format in GIS export"
  [line]
  (first (csv/parse-csv line)))

(defn etl-gis
  "subquery to parse data sets from the GIS source tap"
  [gis trap]
  (<- [?blurb ?misc ?geo ?kind]
      (gis ?line)
      (parse-gis ?line :> ?blurb ?misc ?geo ?kind)
      (:trap (hfs-textline trap))))

discovery

(specify what you require, not how to achieve it… 80/20 rule of data prep cost)
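The parse-and-trap idea above can be sketched outside of Cascalog. The following is a minimal plain-Python illustration, not the CoPA implementation: it assumes each GIS line is a CSV record with four fields (blurb, misc, geo, kind), and it collects lines that do not fit into a list that stands in for the failure trap. The names `parse_gis` and `etl_gis` mirror the slide, but their Python bodies are assumptions.

```python
import csv


def parse_gis(line):
    """Parse one CSV line, honoring quoted fields that contain commas."""
    return next(csv.reader([line]))


def etl_gis(lines):
    """Split input lines into parsed 4-tuples and trapped (rejected) lines.

    The 'trapped' list is a stand-in for a Cascalog :trap tap.
    """
    rows, trapped = [], []
    for line in lines:
        try:
            fields = parse_gis(line)
        except (csv.Error, StopIteration):
            trapped.append(line)
            continue
        if len(fields) == 4:
            rows.append(tuple(fields))  # (blurb, misc, geo, kind)
        else:
            trapped.append(line)
    return rows, trapped
```

Note that the header line `Geographic_Information,,,` also parses into four fields, so a later filter (such as the regex in the tree subquery) is still needed to separate record types.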


discovery

(ad-hoc queries get refined into composable predicates)

Identifier: 474
Tree ID: 412
Tree: 412 site 1 at 115 HAWTHORNE AV
Tree Site: 1
Street_Name: HAWTHORNE AV
Situs Number: 115
Private: -1
Species: Liquidambar styraciflua
Source: davey tree
Hardscape: None
37.446001565119,-122.167713417554,0.0
Point


discovery

(curate valuable metadata)


(defn get-trees
  "subquery to parse/filter the tree data"
  [src trap tree_meta]
  (<- [?blurb ?tree_id ?situs ?tree_site
       ?species ?wikipedia ?calflora ?avg_height
       ?tree_lat ?tree_lng ?tree_alt ?geohash]
      (src ?blurb ?misc ?geo ?kind)
      (re-matches #"^\s+Private.*Tree ID.*" ?misc)
      (parse-tree ?misc :> _ ?priv ?tree_id ?situs ?tree_site ?raw_species)
      ((c/comp s/trim s/lower-case) ?raw_species :> ?species)
      (tree_meta ?species ?wikipedia ?calflora ?min_height ?max_height)
      (avg ?min_height ?max_height :> ?avg_height)
      (geo-tree ?geo :> _ ?tree_lat ?tree_lng ?tree_alt)
      (read-string ?tree_lat :> ?lat)
      (read-string ?tree_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (:trap (hfs-textline trap))))

discovery

?blurb       Tree: 412 site 1 at 115 HAWTHORNE AV, on HAWTHORNE AV 22 from pl
?tree_id     412
?situs       115
?tree_site   1
?species     liquidambar styraciflua
?wikipedia   http://en.wikipedia.org/wiki/Liquidambar_styraciflua
?calflora    http://calflora.org/cgi-bin/species_query.cgi?where-calrecnum=8598
?avg_height  27.5
?tree_lat    37.446001565119
?tree_lng    -122.167713417554
?tree_alt    0.0
?geohash     9q9jh0
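The slides do not show the body of `parse-tree`, so the following is only a hedged plain-Python sketch of what such a step might look like: a regular expression pulls the tree fields out of the `?misc` text, and the species name is trimmed and lower-cased as in the query. The pattern is an assumption based on the sample records shown earlier, not the project's actual regex.

```python
import re

# Hypothetical pattern, inferred from the sample tree record in the GIS export.
TREE_PATTERN = re.compile(
    r"^\s+Private:\s+(\S+)\s+Tree ID:\s+(\d+)\s+.*Situs Number:\s+(\d+)"
    r"\s+Tree Site:\s+(\d+)\s+Species:\s+(.*?)\s+Source:"
)


def parse_tree(misc):
    """Return a dict of tree fields, or None when the text is not a tree record."""
    match = TREE_PATTERN.match(misc)
    if match is None:
        return None
    _priv, tree_id, situs, tree_site, raw_species = match.groups()
    return {
        "tree_id": tree_id,
        "situs": situs,
        "tree_site": tree_site,
        # same normalization as (c/comp s/trim s/lower-case) in the query
        "species": raw_species.strip().lower(),
    }
```

Records that return `None` here correspond to rows that the query would filter out or send to the trap.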


# run analysis and visualization in R
library(ggplot2)

dat_folder <- '~/src/concur/CoPA/out/tree'
data <- read.table(file=paste(dat_folder, "part-00000", sep="/"), sep="\t", quote="", na.strings="NULL", header=FALSE, encoding="UTF8")
summary(data)

# top 20 species by count
t <- head(sort(table(data$V5), decreasing=TRUE), n=20)
trees <- as.data.frame.table(t)
colnames(trees) <- c("species", "count")

# histogram and density of estimated tree height
m <- ggplot(data, aes(x=V8))
m <- m + ggtitle("Estimated Tree Height (meters)")
m + geom_histogram(aes(y = ..density.., fill = ..count..)) + geom_density()

# species counts, with rotated axis labels
par(mar = c(7, 4, 4, 2) + 0.1)
plot(trees, xaxt="n", xlab="")
axis(1, labels=FALSE)
text(1:nrow(trees), par("usr")[3] - 0.25, srt=45, adj=1, labels=trees$species, xpd=TRUE)
grid(nx=nrow(trees))

discovery


discovery

analysis of the tree data:

(plot annotation: sweetgum)


discovery

(flow diagram, gis ⇒ tree; nodes: GIS export, Regex parse-gis, src, Regex parse-tree, Scrub species, Tree Metadata, Join, Estimate height, Geohash, Failure Traps, tree)


9q9jh0

geohash with 6-digit resolution

approximates a 5-block square

centered lat: 37.445, lng: -122.162
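For readers unfamiliar with geohashes, here is a minimal plain-Python sketch of the standard public geohash encoding (alternating longitude and latitude bisection, five bits per base-32 character). It is not the `geohash` function used in the Cascalog queries, whose implementation is not shown in the slides; the function name `geohash_encode` is an assumption.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash_encode(lat, lng, precision=6):
    """Encode a latitude/longitude pair as a geohash string."""
    lat_lo, lat_hi = -90.0, 90.0
    lng_lo, lng_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lng = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lng:
            mid = (lng_lo + lng_hi) / 2
            if lng >= mid:
                bits = (bits << 1) | 1
                lng_lo = mid
            else:
                bits = bits << 1
                lng_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits = bits << 1
                lat_hi = mid
        use_lng = not use_lng
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)
```

With the tree coordinates from the earlier slide, `geohash_encode(37.446001565119, -122.167713417554, 6)` gives `9q9jh0`. Shorter prefixes of the same hash cover larger enclosing cells, which is what makes the geohash usable as a join key.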

modeling


Each road in the GIS export is listed as a block between two cross roads, and each may have multiple road segments to represent turns:

-122.161776959558,37.4518836690781,0.0
-122.161390381489,37.4516410983794,0.0
-122.160786011735,37.4512589903357,0.0
-122.160531178368,37.4510977281699,0.0

modeling

( lat0, lng0, alt0 )

( lat1, lng1, alt1 )

( lat2, lng2, alt2 )

( lat3, lng3, alt3 )

NB: segments in the raw GIS have the order of geo coordinates scrambled: (lng, lat, alt)
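As a small illustration of the coordinate ordering noted above, this plain-Python sketch splits a raw road geometry string into points and reorders each one from (lng, lat, alt) to (lat, lng, alt). The function name `parse_segments` is an assumption, and the slides do not show how CoPA itself performs this step.

```python
def parse_segments(geo):
    """Turn 'lng,lat,alt lng,lat,alt ...' into a list of (lat, lng, alt) tuples."""
    points = []
    for token in geo.split():
        lng, lat, alt = (float(value) for value in token.split(","))
        points.append((lat, lng, alt))  # reorder to latitude first
    return points
```

Consecutive pairs of the returned points then describe the individual road segments within a block.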


Filter out trees which are too far away to provide shade. Calculate a sum of moments from tree height and distance (the get-shade query uses height / distance per tree), as an estimator for shade:

(map of geohash 9q9jh0, with X markers)

modeling


(defn get-shade
  "subquery to join tree and road estimates, maximize for shade"
  [trees roads]
  (<- [?road_name ?geohash ?road_lat ?road_lng
       ?road_alt ?road_metric ?tree_metric]
      (roads ?road_name _ _ _
             ?albedo ?road_lat ?road_lng ?road_alt ?geohash
             ?traffic_count _ ?traffic_class _ _ _ _)
      (road-metric ?traffic_class ?traffic_count ?albedo :> ?road_metric)
      (trees _ _ _ _ _ _ _ ?avg_height ?tree_lat ?tree_lng ?tree_alt ?geohash)
      (read-string ?avg_height :> ?height)
      ;; limit to trees which are higher than people
      (> ?height 2.0)
      (tree-distance ?tree_lat ?tree_lng ?road_lat ?road_lng :> ?distance)
      ;; limit to trees within a one-block radius (not meters)
      (<= ?distance 25.0)
      (/ ?height ?distance :> ?tree_moment)
      (c/sum ?tree_moment :> ?sum_tree_moment)
      ;; magic number 200000.0 used to scale tree moment
      ;; based on median
      (/ ?sum_tree_moment 200000.0 :> ?tree_metric)))
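The arithmetic of the shade estimator can be sketched in plain Python. This is an illustration only: it takes precomputed (height, distance) pairs as input because `tree-distance` is not defined in the slides, and it reuses the thresholds and scale constant from the query. The guard against non-positive distances is an added assumption to avoid division by zero.

```python
def tree_metric(trees, min_height=2.0, max_distance=25.0, scale=200000.0):
    """Sum height / distance over qualifying trees, then scale.

    trees: iterable of (height, distance) pairs for one road segment,
           with distance in the same units the query uses (not meters).
    """
    total = 0.0
    for height, distance in trees:
        # keep trees taller than people and within the one-block radius;
        # skipping distance <= 0 is an assumption, not part of the query
        if height > min_height and 0.0 < distance <= max_distance:
            total += height / distance
    return total / scale
```

For example, with pairs (10.0, 5.0), (1.5, 3.0), (20.0, 30.0) and (8.0, 4.0), only the first and last qualify, so the sum of moments is 2.0 + 2.0 = 4.0 before scaling.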

modeling


(flow diagram, shade; nodes: tree, road, Estimate traffic, Join, Calculate distance, Filter height, Filter distance, Sum moment, Filter sum_moment, shade)

modeling


(defn get-gps
  "subquery to aggregate and rank GPS tracks per user"
  [gps_logs trap]
  (<- [?uuid ?geohash ?gps_count ?recent_visit]
      (gps_logs ?date ?uuid ?gps_lat ?gps_lng ?alt ?speed ?heading ?elapsed ?distance)
      (read-string ?gps_lat :> ?lat)
      (read-string ?gps_lng :> ?lng)
      (geohash ?lat ?lng :> ?geohash)
      (c/count :> ?gps_count)
      (date-num ?date :> ?visit)
      (c/max ?visit :> ?recent_visit)))

modeling

?uuid                             ?geohash  ?gps_count  ?recent_visit
cf660e041e994929b37cc5645209c8ae  9q8yym    7           1972376866448
342ac6fd3f5f44c6b97724d618d587cf  9q9htz    4           1972376690969
32cc09e69bc042f1ad22fc16ee275e21  9q9hv3    3           1972376670935
342ac6fd3f5f44c6b97724d618d587cf  9q9hv3    3           1972376691356
342ac6fd3f5f44c6b97724d618d587cf  9q9hwn    13          1972376690782
342ac6fd3f5f44c6b97724d618d587cf  9q9hwp    58          1972376690965
482dc171ef0342b79134d77de0f31c4f  9q9jh0    15          1972376952532
b1b4d653f5d9468a8dd18a77edcc5143  9q9jh0    18          1972376945348
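The aggregation in this subquery groups GPS points by user and geohash, counting visits and keeping the most recent one. A rough plain-Python equivalent is sketched below, assuming the geohash and a numeric visit timestamp have already been computed for each point; the function name `aggregate_tracks` is hypothetical.

```python
from collections import defaultdict


def aggregate_tracks(points):
    """Group (uuid, geohash, visit) tuples into {(uuid, geohash): (count, recent_visit)}."""
    stats = defaultdict(lambda: [0, None])
    for uuid, geohash, visit in points:
        entry = stats[(uuid, geohash)]
        entry[0] += 1  # like c/count
        entry[1] = visit if entry[1] is None else max(entry[1], visit)  # like c/max
    return {key: tuple(value) for key, value in stats.items()}
```

In Cascalog the grouping keys are implied by the output variables that are not produced by an aggregator; here they are spelled out as the dictionary key.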


Recommenders often combine multiple signals, via weighted averages, to rank personalized results:

• GPS of person ∩ road segment

• frequency and recency of visit

• traffic class and rate

• road albedo (sunlight reflection)

• tree shade estimator

Adjusting the mix allows for further personalization at the end use

modeling

(defn get-reco
  "subquery to recommend road segments based on GPS tracks"
  [tracks shades]
  (<- [?uuid ?road ?geohash ?lat ?lng ?alt
       ?gps_count ?recent_visit ?road_metric ?tree_metric]
      (tracks ?uuid ?geohash ?gps_count ?recent_visit)
      (shades ?road ?geohash ?lat ?lng ?alt ?road_metric ?tree_metric)))
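The query above only joins tracks to shaded road segments on `?geohash`; the weighted ranking is left to the end use. One way that final step could look is sketched here in plain Python. The weights, the assumption that each signal has already been normalized to a comparable range, and the function names are all hypothetical and not taken from the CoPA project.

```python
def join_on_geohash(tracks, shades):
    """Inner join of track rows and shade rows on their shared 'geohash' key."""
    by_hash = {}
    for shade in shades:
        by_hash.setdefault(shade["geohash"], []).append(shade)
    return [{**track, **shade}
            for track in tracks
            for shade in by_hash.get(track["geohash"], [])]


def score(row, weights):
    """Weighted average of the named signals (assumed pre-normalized)."""
    return sum(w * row[name] for name, w in weights.items()) / sum(weights.values())


def recommend(tracks, shades, weights):
    """Joined rows, ranked from highest to lowest weighted score."""
    rows = join_on_geohash(tracks, shades)
    return sorted(rows, key=lambda row: score(row, weights), reverse=True)
```

Changing the `weights` mapping, for instance raising the weight on `tree_metric`, is the kind of adjustment the slide describes as personalization at the end use.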


‣ addr: 115 HAWTHORNE AVE

‣ lat/lng: 37.446, -122.168

‣ geohash: 9q9jh0

‣ tree: 413 site 2

‣ species: Liquidambar styraciflua

‣ est. height: 23 m

‣ shade metric: 4.363

‣ traffic: local residential, light traffic

‣ recent visit: 1972376952532

‣ a short walk from my train stop ✔

apps


Could combine this with a variety of data APIs:

• Trulia neighborhood data, housing prices

• Factual local business (FB Places, etc.)

• CommonCrawl open source full web crawl

• Wunderground local weather data

• WalkScore neighborhood data, walkability

• Data.gov US federal open data

• Data.NASA.gov NASA open data

• DBpedia datasets derived from Wikipedia

• GeoWordNet semantic knowledge base

• Geolytics demographics, GIS, etc.

• Foursquare, Yelp, CityGrid, Localeze, YP

• various photo sharing

apps


Follow-Up…

blog, developer community, code/wiki/gists, maven repo, commercial products, etc.:

cascading.org

zest.to/group11

github.com/Cascading

conjars.org

goo.gl/KQtUL

concurrentinc.com
