Microservices, Containers, and Machine Learning
Paco Nathan, @pacoid
-
Downloads
-
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
follow the license agreement instructions
then click the download for your OS
you need the JDK instead of the JRE (for Maven, etc.)
JDK 6, 7, or 8 is fine
Downloads: Java JDK
-
For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
store.continuum.io/cshop/anaconda/
Downloads: Python
-
Let's get started using Apache Spark, in just a few easy steps. Download code from:
databricks.com/spark-training-resources#itas
or for a fallback: spark.apache.org/downloads.html
Also, the GitHub project:
github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
-
Change into the unpacked Spark directory, then run:
./bin/spark-shell
Downloads: Spark
-
Spark Deconstructed
-
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
-
[Diagram: a Driver coordinating three Worker nodes in a Spark cluster]
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster, submitting code to be evaluated on it:
-
Spark Deconstructed: Log Mining Example
scala> messages.toDebugString
res5: String = 
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
At this point, take a look at the transformed RDD operator graph:
-
[Diagram: Driver and three Workers; each Worker holds an HDFS block (block 1, block 2, block 3)]
Spark Deconstructed: Log Mining Example
-
[Diagram: action 1 runs; each Worker reads its HDFS block]
Spark Deconstructed: Log Mining Example
-
[Diagram: the Workers process the data and cache it (cache 1, cache 2, cache 3)]
Spark Deconstructed: Log Mining Example
-
[Diagram: action 2 runs; each Worker processes from its cache]
Spark Deconstructed: Log Mining Example
-
GraphX
-
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
graph-parallel systems
importance of workflows
optimizations
-
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google Grzegorz Czajkowski, et al. googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark Ankur Dave, Databricks databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX
-
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

// iterate over the matching triplets and print them
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} -> ${triplet.dstAttr.name}: ${triplet.attr}")
}
-
TextRank Demo:
cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
GraphX: demo
-
Workflows
-
Typical Workflows:
[Diagram: a typical ML workflow circa 2010 (representation, evaluation, optimization): ETL into cluster/cloud, data prep, features, learners and parameters, unsupervised learning, explore, train set, test set, models, evaluate, optimize, scoring in production, visualize/reporting, data pipelines, use cases, actionable results, decisions, feedback]
-
Workflows: Scraper pipeline
Typical data rates, e.g., for the Apache Spark email lists:
~2K msgs/month
~6 MB as JSON
~13 MB parsed
Three months of list activity represents a graph of:
1061 senders
753,400 nodes
1,027,806 edges
A big graph! However, it satisfies the definition of a graph-parallel system, with lots of data locality to leverage.
-
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices Abel Avram http://www.infoq.com/news/2014/05/microservices
DockerCon EU Keynote: State of the Art in Microservices Adrian Cockcroft https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/
Microservices Architecture, Martin Fowler, http://martinfowler.com/articles/microservices.html
-
Workflows: An Example
Python-based service in a Docker container?
Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/
Docker Jumpstart, Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/
-
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform Michael Armbrust databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
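As a minimal sketch of what that unified access looks like (assuming the Spark 1.x SQLContext API used elsewhere in this deck, with a hypothetical JSON path and the id/size fields shown later in the parsed output):

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// JSON is one built-in source; Parquet, JDBC, and external packages plug in the same way
val msgs = sqlCtx.jsonFile("parsed_messages.json")   // hypothetical path
msgs.registerTempTable("msgs")

sqlCtx.sql("SELECT id, size FROM msgs WHERE size > 10").collect()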
-
This Workflow: Microservices meet Parallel Processing
[Diagram: microservices (email archives, community, leaderboards) around a parallel pipeline: Scraper/Parser, SparkSQL data prep, features, explore, NLTK data, unique word IDs, TextRank, Word2Vec, etc., yielding community insights; not so big data, relatively big compute]
-
Workflows: Scraper pipeline
[Diagram: crawl the monthly Apache email list archive by date with urllib2, filter quoted content, segment paragraphs, and emit message JSON (Python)]
-
Workflows: Scraper pipeline
{! "date": "2014-10-01T00:16:08+00:00",! "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",! "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg@mail.gmail.com%3e",! "prev_thread": "",! "sender": "Debasish Das ",! "subject": "Re: memory vs data_size",! "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor map-reduce operations, it's better not to cache if !}
-
[Diagram: for each message JSON, TextBlob segments sentences, tags and lemmatizes words, and runs sentiment analysis; Python generates skip-grams, using Treebank and WordNet, emitting parsed JSON]
Workflows: Parser pipeline
-
Workflows: Parser pipeline
{! "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],! "id": CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "polr": 0.2,! "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",! "size": 14,! "subj": 0.7,! "tile": [ [1, 2], [2, 3], [3, 4] ... ]! ]!}
{! "date": "2014-10-01T00:16:08+00:00",! "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",! "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg@mail.gmail.com%3e",! "prev_thread": "",! "sender": "Debasish Das ",! "subject": "Re: memory vs data_size",! "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor map-reduce operations, it's better not to cache if !}
-
Workflows: TextRank pipeline
[Diagram: from parsed JSON, Spark builds a word graph RDD; GraphX runs TextRank; Spark extracts ranked phrases; NetworkX visualizes the graph]
-
Workflows: TextRank pipeline
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
[Diagram: word graph over the stems compat, system, linear, constraint, with the top-ranked keyphrases listed]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
-
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank how it works
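For reference, the underlying recurrence: the PageRank form as described on the Wikipedia page above, plus the weighted TextRank variant from the Mihalcea and Tarau paper cited earlier, with damping factor d (typically 0.85) over N vertices:

PR(V_i) = \frac{1 - d}{N} + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)

TextRank applies this recurrence to a graph of word co-occurrences instead of web links, then keeps the top-ranked words as keyphrase candidates.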
-
TextRank impl
-
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
-
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """!SELECT node_id, root !FROM node !WHERE id='%s' AND keep='1'!""".format(msg_id)!!val n = sqlCtx.sql(sql.stripMargin).distinct()!val nodes: RDD[(Long, String)] = n.map{ p =>! (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])!}!nodes.collect()!!val sql = """!SELECT node0, node1 !FROM edge !WHERE id='%s'!""".format(msg_id)!!val e = sqlCtx.sql(sql.stripMargin).distinct()!val edges: RDD[Edge[Int]] = e.map{ p =>! Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)!}!edges.collect()
-
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

// helper: compute the median of a numeric sequence
def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_ < _).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}
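The next two steps filter on a min_rank threshold; a plausible way to set it, using the median helper above (a reconstruction, not necessarily the deck's exact cutoff), is:

// hypothetical cutoff: use the median vertex rank as the keyphrase threshold
val min_rank = median(r.map(_._2).collect().toSeq)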
-
TextRank impl: join ranked words with parsed text
var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
-
TextRank impl: pull strings for the top-ranked keyphrases
s.foreach { x =>
  //println(x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}
-
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
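To eyeball the result in the shell, one might print the top entries (a usage sketch, not part of the original deck):

summary.take(10).foreach { case (phrase, rank) => println(f"$rank%.4f\t$phrase") }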
-
Reply Graph
-
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
-
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"!val n = sqlCtx.sql(sql).distinct()!val nodes: RDD[(Long, String)] = n.map{ p =>! (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])!}!nodes.collect()!!val sql = "SELECT replier, sender, num FROM edge"!val e = sqlCtx.sql(sql).distinct()!val edges: RDD[Edge[Int]] = e.map{ p =>! Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])!}!edges.collect()
-
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
-
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen ))
(857,(20.832469059298248,Akhil Das ))
(652,(13.281821379806798,Michael Armbrust ))
(101,(9.963167550803664,Tobias Pfeiffer ))
(471,(9.614436778460558,Steve Lewis ))
(931,(8.217073486575732,shahab ))
(48,(7.653814912512137,ll ))
(1011,(7.602002681952157,Ashic Mahtab ))
(1055,(7.572376489758199,Cheng Lian ))
(122,(6.87247388819558,Gerard Maas ))
(904,(6.252657820614504,Xiangrui Meng ))
(827,(6.0941062762076115,Jianshi Huang ))
(887,(5.835053915864531,Davies Liu ))
(303,(5.724235650446037,Ted Yu ))
(206,(5.430238461114108,Deep Pradhan ))
(483,(5.332452537151523,Akshat Aranya ))
(185,(5.259438927615685,SK ))
(636,(5.235941228955769,Matei Zaharia ))

// seaaaaaaaaaan!

maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
-
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
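For a flavor of that, here is a minimal SSSP sketch using GraphX's Pregel operator, adapted from the GraphX programming guide rather than copied from the linked file:

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// a small random graph with Double edge weights (illustrative only)
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)

val sourceId: VertexId = 42L

// every vertex starts at infinity, except the source
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet => {                                     // send messages along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                         // merge messages
)

println(sssp.vertices.collect.mkString("\n"))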
-
Look Ahead: Where is this heading?
Feature learning with Word2Vec, Matt Krzus, www.yseam.com/blog/WV.html
[Diagram: ranked phrases feed GraphX connected components and MLlib Word2Vec, aggregated by topic; MLlib KMeans produces topic vectors (better than LDA?), moving from features to models to insights]
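A minimal sketch of that direction with MLlib, assuming a whitespace-tokenized corpus and a few hypothetical terms (not the workshop's actual code):

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.clustering.KMeans

// hypothetical input: one tokenized message per line
val sentences = sc.textFile("parsed_text.txt").map(_.split(" ").toSeq)

// learn word vectors
val w2vModel = new Word2Vec().fit(sentences)

// look up vectors for a few terms (each must appear in the vocabulary)
val terms = Seq("spark", "graphx", "mllib")
val termVecs = sc.parallelize(terms.map(t => w2vModel.transform(t)))

// cluster the vectors into topics
val kmeansModel = KMeans.train(termVecs, 2, 20)
kmeansModel.clusterCenters.foreach(println)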
-
Resources
-
certification:
Apache Spark developer certificate program
http://oreilly.com/go/sparkcert
defined by Spark experts @Databricks
assessed by O'Reilly Media
establishes the bar for Spark expertise
-
MOOCs:
Anthony Joseph, UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar, UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
-
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
-
http://spark-summit.org/
-
confs:
Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015
Spark Summit East, NYC, Mar 18-19: spark-summit.org/east
Big Data Tech Con, Boston, Apr 26-28: bigdatatechcon.com
Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015
Spark Summit 2015, SF, Jun 15-17: spark-summit.org
-
books:
Fast Data Processing with Spark, Holden Karau, Packt (2013): shop.oreilly.com/product/9781782167068.do
Spark in Action, Chris Fregly, Manning (2015*): sparkinaction.com/
Learning Spark, Holden Karau, Andy Konwinski, Matei Zaharia, O'Reilly (2015*): shop.oreilly.com/product/0636920028512.do
-
presenter:
Just Enough Math, O'Reilly (2014): justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/
Enterprise Data Workflows with Cascading, O'Reilly (2013): shop.oreilly.com/product/0636920028536.do