Microservices, Containers, and Machine Learning
Paco Nathan, @pacoid
-
Downloads
-
oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html
follow the license agreement instructions
then click the download for your OS
you need the JDK instead of the JRE (for Maven, etc.)
JDK 6, 7, or 8 is fine
Downloads: Java JDK
-
For Python 2.7, check out Anaconda by Continuum Analytics for a full-featured platform:
store.continuum.io/cshop/anaconda/
Downloads: Python
-
Let's get started using Apache Spark, in just a few easy steps. Download code from:
databricks.com/spark-training-resources#itas
or for a fallback: spark.apache.org/downloads.html
Also, the GitHub project:
github.com/ceteri/spark-exercises/tree/master/exsto
Downloads: Spark
-
Change into the unpacked Spark directory, then run:
./bin/spark-shell
Downloads: Spark
-
Spark Deconstructed
-
// load error messages from a log into memory
// then interactively search for various patterns
// https://gist.github.com/ceteri/8ae5b9509a08c08a1132

// base RDD
val lines = sc.textFile("hdfs://...")

// transformed RDDs
val errors = lines.filter(_.startsWith("ERROR"))
val messages = errors.map(_.split("\t")).map(r => r(1))
messages.cache()

// action 1
messages.filter(_.contains("mysql")).count()

// action 2
messages.filter(_.contains("php")).count()
Spark Deconstructed: Log Mining Example
-
[Diagram: a Driver coordinating three Worker nodes in a Spark cluster]
Spark Deconstructed: Log Mining Example
We start with Spark running on a cluster, submitting code to be evaluated on it:
-
Spark Deconstructed: Log Mining Example
scala> messages.toDebugString
res5: String = 
MappedRDD[4] at map at <console>:16 (3 partitions)
  MappedRDD[3] at map at <console>:16 (3 partitions)
    FilteredRDD[2] at filter at <console>:14 (3 partitions)
      MappedRDD[1] at textFile at <console>:12 (3 partitions)
        HadoopRDD[0] at textFile at <console>:12 (3 partitions)
At this point, take a look at the transformed RDD operator graph:
-
[Diagram: Driver and three Workers; each Worker holds an HDFS block (block 1, block 2, block 3)]
Spark Deconstructed: Log Mining Example
-
[Diagram: action 1 runs; each Worker reads its HDFS block]
Spark Deconstructed: Log Mining Example
-
[Diagram: the Workers process the data and cache it (cache 1, cache 2, cache 3)]
Spark Deconstructed: Log Mining Example
-
[Diagram: action 2 runs; each Worker processes from its cache]
Spark Deconstructed: Log Mining Example
-
GraphX
-
GraphX
spark.apache.org/docs/latest/graphx-programming-guide.html
Key Points:
graph-parallel systems
importance of workflows
optimizations
-
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs J. Gonzalez, Y. Low, H. Gu, D. Bickson, C. Guestrin graphlab.org/files/osdi2012-gonzalez-low-gu-bickson-guestrin.pdf
Pregel: Large-scale graph computing at Google Grzegorz Czajkowski, et al. googleresearch.blogspot.com/2009/06/large-scale-graph-computing-at-google.html
GraphX: Unified Graph Analytics on Spark Ankur Dave, Databricks databricks-training.s3.amazonaws.com/slides/graphx@sparksummit_2014-07.pdf
Advanced Exercises: GraphX databricks-training.s3.amazonaws.com/graph-analytics-with-graphx.html
GraphX
-
// http://spark.apache.org/docs/latest/graphx-programming-guide.html

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

case class Peep(name: String, age: Int)

val nodeArray = Array(
  (1L, Peep("Kim", 23)), (2L, Peep("Pat", 31)),
  (3L, Peep("Chris", 52)), (4L, Peep("Kelly", 39)),
  (5L, Peep("Leslie", 45))
)
val edgeArray = Array(
  Edge(2L, 1L, 7), Edge(2L, 4L, 2),
  Edge(3L, 2L, 4), Edge(3L, 5L, 3),
  Edge(4L, 1L, 1), Edge(5L, 3L, 9)
)

val nodeRDD: RDD[(Long, Peep)] = sc.parallelize(nodeArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val g: Graph[Peep, Int] = Graph(nodeRDD, edgeRDD)

val results = g.triplets.filter(t => t.attr > 7)

// iterate over the matching triplets and print them
for (triplet <- results.collect) {
  println(s"${triplet.srcAttr.name} -> ${triplet.dstAttr.name}: ${triplet.attr}")
}
-
TextRank Demo:
cdn.liber118.com/spark/ipynb/textrank/PySparkTextRank.ipynb
IPYTHON_OPTS="notebook --pylab inline" ./bin/pyspark
GraphX: demo
-
Workflows
-
Typical Workflows:
[Diagram: a typical ML workflow circa 2010 (representation, evaluation, optimization): ETL into cluster/cloud, data prep, features, learners and parameters, unsupervised learning, explore, train set, test set, models, evaluate, optimize, scoring in production, visualize/reporting, data pipelines, use cases, actionable results, decisions, feedback]
-
Workflows: Scraper pipeline
Typical data rates, e.g., for the Apache Spark email lists:
~2K msgs/month
~6 MB as JSON
~13 MB parsed
Three months of list activity represents a graph of:
1061 senders
753,400 nodes
1,027,806 edges
A big graph! However, it satisfies the definition of a graph-parallel system, with lots of data locality to leverage.
-
Workflows: A Few Notes about Microservices and Containers
The Strengths and Weaknesses of Microservices Abel Avram http://www.infoq.com/news/2014/05/microservices
DockerCon EU Keynote: State of the Art in Microservices Adrian Cockcroft https://blog.docker.com/2014/12/dockercon-europe-keynote-state-of-the-art-in-microservices-by-adrian-cockcroft-battery-ventures/
Microservices Architecture, Martin Fowler, http://martinfowler.com/articles/microservices.html
-
Workflows: An Example
Python-based service in a Docker container?
Just Enough Math, IPython+Docker
Paco Nathan, Andrew Odewahn, Kyle Kelly
https://github.com/ceteri/jem-docker
https://registry.hub.docker.com/u/ceteri/jem/
Docker Jumpstart, Andrew Odewahn
http://odewahn.github.io/docker-jumpstart/
-
Workflows: A Brief Note about ETL in SparkSQL
Spark SQL Data Sources API: Unified Data Access for the Spark Platform Michael Armbrust databricks.com/blog/2015/01/09/spark-sql-data-sources-api-unified-data-access-for-the-spark-platform.html
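As a minimal sketch of what that unified access looks like (assuming the Spark 1.x SQLContext API used elsewhere in this deck, with a hypothetical JSON path and the id/size fields shown later in the parsed output):

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)

// JSON is one built-in source; Parquet, JDBC, and external packages plug in the same way
val msgs = sqlCtx.jsonFile("parsed_messages.json")   // hypothetical path
msgs.registerTempTable("msgs")

sqlCtx.sql("SELECT id, size FROM msgs WHERE size > 10").collect()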
-
This Workflow: Microservices meet Parallel Processing
[Diagram: microservices (email archives, community, leaderboards) around a parallel pipeline: Scraper/Parser, SparkSQL data prep, features, explore, NLTK data, unique word IDs, TextRank, Word2Vec, etc., yielding community insights; not so big data, relatively big compute]
-
Workflows: Scraper pipeline
[Diagram: crawl the monthly Apache email list archive by date with urllib2, filter quoted content, segment paragraphs, and emit message JSON (Python)]
-
Workflows: Scraper pipeline
{! "date": "2014-10-01T00:16:08+00:00",! "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",! "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg@mail.gmail.com%3e",! "prev_thread": "",! "sender": "Debasish Das ",! "subject": "Re: memory vs data_size",! "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor map-reduce operations, it's better not to cache if !}
-
[Diagram: for each message JSON, TextBlob segments sentences, tags and lemmatizes words, and runs sentiment analysis; Python generates skip-grams, using Treebank and WordNet, emitting parsed JSON]
Workflows: Parser pipeline
-
Workflows: Parser pipeline
{! "graf": [ [1, "Only", "only", "RB", 1, 0], [2, "fit", "fit", "VBP", 1, 1 ] ... ],! "id": CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "polr": 0.2,! "sha1": "178b7a57ec6168f20a8a4f705fb8b0b04e59eeb7",! "size": 14,! "subj": 0.7,! "tile": [ [1, 2], [2, 3], [3, 4] ... ]! ]!}
{! "date": "2014-10-01T00:16:08+00:00",! "id": "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw",! "next_thread": "CALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg",! "next_url": "http://mail-archives.apache.org/mod_mbox/spark-user/201410.mbox/%3cCALEj8eP5hpQDM=p2xryL-JT-x_VhkRcD59Q+9Qr9LJ9sYLeLVg@mail.gmail.com%3e",! "prev_thread": "",! "sender": "Debasish Das ",! "subject": "Re: memory vs data_size",! "text": "\nOnly fit the data in memory where you want to run the iterative\nalgorithm....\n\nFor map-reduce operations, it's better not to cache if !}
-
Workflows: TextRank pipeline
[Diagram: from parsed JSON, Spark builds a word graph RDD; GraphX runs TextRank; Spark extracts ranked phrases; NetworkX visualizes the graph]
-
Workflows: TextRank pipeline
"Compatibility of systems of linear constraints"
[{'index': 0, 'stem': 'compat', 'tag': 'NNP', 'word': 'compatibility'},
 {'index': 1, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 2, 'stem': 'system', 'tag': 'NNS', 'word': 'systems'},
 {'index': 3, 'stem': 'of', 'tag': 'IN', 'word': 'of'},
 {'index': 4, 'stem': 'linear', 'tag': 'JJ', 'word': 'linear'},
 {'index': 5, 'stem': 'constraint', 'tag': 'NNS', 'word': 'constraints'}]
[Diagram: word graph over the stems compat, system, linear, constraint, with the top-ranked keyphrases listed]
TextRank: Bringing Order into Texts
Rada Mihalcea, Paul Tarau
http://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
-
https://en.wikipedia.org/wiki/PageRank
Workflows: TextRank how it works
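For reference, the underlying recurrence: the PageRank form as described on the Wikipedia page above, plus the weighted TextRank variant from the Mihalcea and Tarau paper cited earlier, with damping factor d (typically 0.85) over N vertices:

PR(V_i) = \frac{1 - d}{N} + d \sum_{V_j \in In(V_i)} \frac{PR(V_j)}{|Out(V_j)|}

WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)

TextRank applies this recurrence to a graph of word co-occurrences instead of web links, then keeps the top-ranked words as keyphrase candidates.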
-
TextRank impl
-
TextRank impl: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("graf_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("graf_node.parquet")
node.registerTempTable("node")

// pick one message as an example; at scale we'd parallelize
val msg_id = "CA+B-+fyrBU1yGZAYJM_u=gnBVtzB=sXoBHkhmS-6L1n8K5Hhbw"
-
TextRank impl: use SparkSQL to collect node list + edge list
val sql = """!SELECT node_id, root !FROM node !WHERE id='%s' AND keep='1'!""".format(msg_id)!!val n = sqlCtx.sql(sql.stripMargin).distinct()!val nodes: RDD[(Long, String)] = n.map{ p =>! (p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[String])!}!nodes.collect()!!val sql = """!SELECT node0, node1 !FROM edge !WHERE id='%s'!""".format(msg_id)!!val e = sqlCtx.sql(sql.stripMargin).distinct()!val edges: RDD[Edge[Int]] = e.map{ p =>! Edge(p(0).asInstanceOf[Int].toLong, p(1).asInstanceOf[Int].toLong, 0)!}!edges.collect()
-
TextRank impl: use GraphX to run PageRank
// run PageRank
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices

r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// save the ranks
case class Rank(id: Int, rank: Float)
val rank = r.map(p => Rank(p._1.toInt, p._2.toFloat))
rank.registerTempTable("rank")

// helper: compute the median of a numeric sequence
def median[T](s: Seq[T])(implicit n: Fractional[T]) = {
  import n._
  val (lower, upper) = s.sortWith(_ < _).splitAt(s.size / 2)
  if (s.size % 2 == 0) (lower.last + upper.head) / fromInt(2) else upper.head
}
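The next two steps filter on a min_rank threshold; a plausible way to set it, using the median helper above (a reconstruction, not necessarily the deck's exact cutoff), is:

// hypothetical cutoff: use the median vertex rank as the keyphrase threshold
val min_rank = median(r.map(_._2).collect().toSeq)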
-
TextRank impl: join ranked words with parsed text
var span: List[String] = List()
var last_index = -1
var rank_sum = 0.0

var phrases: collection.mutable.Map[String, Double] = collection.mutable.Map()

val sql = """
SELECT n.num, n.raw, r.rank
FROM node n JOIN rank r ON n.node_id = r.id
WHERE n.id='%s' AND n.keep='1'
ORDER BY n.num
""".format(msg_id)

val s = sqlCtx.sql(sql.stripMargin).collect()
-
TextRank impl: pull strings for the top-ranked keyphrases
s.foreach { x =>
  //println(x)
  val index = x.getInt(0)
  val word = x.getString(1)
  val rank = x.getFloat(2)
  var isStop = false

  // test for break from past
  if (span.size > 0 && rank < min_rank) isStop = true
  if (span.size > 0 && (index - last_index > 1)) isStop = true

  // clear accumulation
  if (isStop) {
    val phrase = span.mkString(" ")
    phrases += (phrase -> rank_sum)

    span = List()
    last_index = index
    rank_sum = 0.0
  }

  // start or append
  if (rank >= min_rank) {
    span = span :+ word
    last_index = index
    rank_sum += rank
  }
}
-
TextRank impl: report the top keyphrases
// summarize the text as a list of ranked keyphrases
val summary = sc.parallelize(phrases.toSeq)
  .distinct()
  .sortBy(_._2, ascending=false)
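To eyeball the result in the shell, one might print the top entries (a usage sketch, not part of the original deck):

summary.take(10).foreach { case (phrase, rank) => println(f"$rank%.4f\t$phrase") }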
-
Reply Graph
-
Reply Graph: load parquet files
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

val sqlCtx = new org.apache.spark.sql.SQLContext(sc)
import sqlCtx._

val edge = sqlCtx.parquetFile("reply_edge.parquet")
edge.registerTempTable("edge")

val node = sqlCtx.parquetFile("reply_node.parquet")
node.registerTempTable("node")

edge.schemaString
node.schemaString
-
Reply Graph: use SparkSQL to collect node list + edge list
val sql = "SELECT id, sender FROM node"!val n = sqlCtx.sql(sql).distinct()!val nodes: RDD[(Long, String)] = n.map{ p =>! (p(0).asInstanceOf[Long], p(1).asInstanceOf[String])!}!nodes.collect()!!val sql = "SELECT replier, sender, num FROM edge"!val e = sqlCtx.sql(sql).distinct()!val edges: RDD[Edge[Int]] = e.map{ p =>! Edge(p(0).asInstanceOf[Long], p(1).asInstanceOf[Long], p(2).asInstanceOf[Int])!}!edges.collect()
-
Reply Graph: use GraphX to run graph analytics
// run graph analytics
val g: Graph[String, Int] = Graph(nodes, edges)
val r = g.pageRank(0.0001).vertices
r.join(nodes).sortBy(_._2._1, ascending=false).foreach(println)

// define a reduce operation to compute the highest degree vertex
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

// compute the max degrees
val maxInDegree: (VertexId, Int) = g.inDegrees.reduce(max)
val maxOutDegree: (VertexId, Int) = g.outDegrees.reduce(max)
val maxDegrees: (VertexId, Int) = g.degrees.reduce(max)

// connected components
val scc = g.stronglyConnectedComponents(10).vertices
nodes.join(scc).foreach(println)
-
Reply Graph: PageRank of top dev@spark email, 4Q2014
(389,(22.690229478710016,Sean Owen ))
(857,(20.832469059298248,Akhil Das ))
(652,(13.281821379806798,Michael Armbrust ))
(101,(9.963167550803664,Tobias Pfeiffer ))
(471,(9.614436778460558,Steve Lewis ))
(931,(8.217073486575732,shahab ))
(48,(7.653814912512137,ll ))
(1011,(7.602002681952157,Ashic Mahtab ))
(1055,(7.572376489758199,Cheng Lian ))
(122,(6.87247388819558,Gerard Maas ))
(904,(6.252657820614504,Xiangrui Meng ))
(827,(6.0941062762076115,Jianshi Huang ))
(887,(5.835053915864531,Davies Liu ))
(303,(5.724235650446037,Ted Yu ))
(206,(5.430238461114108,Deep Pradhan ))
(483,(5.332452537151523,Akshat Aranya ))
(185,(5.259438927615685,SK ))
(636,(5.235941228955769,Matei Zaharia ))

// seaaaaaaaaaan!

maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (389,126)
maxOutDegree: (org.apache.spark.graphx.VertexId, Int) = (389,170)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (389,296)
-
Reply Graph: What SSSP looks like in GraphX/Pregel
github.com/ceteri/spark-exercises/blob/master/src/main/scala/com/databricks/apps/graphx/sssp.scala
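For a flavor of that, here is a minimal SSSP sketch using GraphX's Pregel operator, adapted from the GraphX programming guide rather than copied from the linked file:

import org.apache.spark.graphx._
import org.apache.spark.graphx.util.GraphGenerators

// a small random graph with Double edge weights (illustrative only)
val graph: Graph[Long, Double] =
  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)

val sourceId: VertexId = 42L

// every vertex starts at infinity, except the source
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
  triplet => {                                     // send messages along shorter paths
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    } else {
      Iterator.empty
    }
  },
  (a, b) => math.min(a, b)                         // merge messages
)

println(sssp.vertices.collect.mkString("\n"))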
-
Look Ahead: Where is this heading?
Feature learning with Word2Vec, Matt Krzus, www.yseam.com/blog/WV.html
[Diagram: ranked phrases feed GraphX connected components and MLlib Word2Vec, aggregated by topic; MLlib KMeans produces topic vectors (better than LDA?), moving from features to models to insights]
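A minimal sketch of that direction with MLlib, assuming a whitespace-tokenized corpus and a few hypothetical terms (not the workshop's actual code):

import org.apache.spark.mllib.feature.Word2Vec
import org.apache.spark.mllib.clustering.KMeans

// hypothetical input: one tokenized message per line
val sentences = sc.textFile("parsed_text.txt").map(_.split(" ").toSeq)

// learn word vectors
val w2vModel = new Word2Vec().fit(sentences)

// look up vectors for a few terms (each must appear in the vocabulary)
val terms = Seq("spark", "graphx", "mllib")
val termVecs = sc.parallelize(terms.map(t => w2vModel.transform(t)))

// cluster the vectors into topics
val kmeansModel = KMeans.train(termVecs, 2, 20)
kmeansModel.clusterCenters.foreach(println)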
-
Resources
-
certification:
Apache Spark developer certificate program
http://oreilly.com/go/sparkcert
defined by Spark experts @Databricks
assessed by O'Reilly Media
establishes the bar for Spark expertise
-
MOOCs:
Anthony Joseph, UC Berkeley
begins 2015-02-23
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs100-1x-introduction-big-6181
Ameet Talwalkar, UCLA
begins 2015-04-14
edx.org/course/uc-berkeleyx/uc-berkeleyx-cs190-1x-scalable-machine-6066
-
community:
spark.apache.org/community.html
events worldwide: goo.gl/2YqJZK
video+preso archives: spark-summit.org
resources: databricks.com/spark-training-resources
workshops: databricks.com/spark-training
-
http://spark-summit.org/
-
confs:
Strata CA, San Jose, Feb 18-20: strataconf.com/strata2015
Spark Summit East, NYC, Mar 18-19: spark-summit.org/east
Big Data Tech Con, Boston, Apr 26-28: bigdatatechcon.com
Strata EU, London, May 5-7: strataconf.com/big-data-conference-uk-2015
Spark Summit 2015, SF, Jun 15-17: spark-summit.org
-
books:
Fast Data Processing with Spark, Holden Karau, Packt (2013): shop.oreilly.com/product/9781782167068.do
Spark in Action, Chris Fregly, Manning (2015*): sparkinaction.com/
Learning Spark, Holden Karau, Andy Konwinski, Matei Zaharia, O'Reilly (2015*): shop.oreilly.com/product/0636920028512.do
-
presenter:
Just Enough Math, O'Reilly (2014): justenoughmath.com
preview: youtu.be/TQ58cWgdCpA
monthly newsletter for updates, events, conf summaries, etc.: liber118.com/pxn/
Enterprise Data Workflows with Cascading, O'Reilly (2013): shop.oreilly.com/product/0636920028536.do