Building Scalable Data Pipelines - 2016 DataPalooza Seattle
Building Scalable Data Pipelines
Evan Chan
Who am I
Distinguished Engineer, Tuplejump
@evanfchan
http://velvia.github.io
User and contributor to Spark since 0.9
Co-creator and maintainer of Spark Job Server
Tuplejump - Big Data Dev Partners
Instant Gratification
I want insights now
I want to act on news right away
I want stuff personalized for me (?)
Fast Data, not Big Data
How Fast do you Need to Act?
Financial trading - milliseconds
Dashboards - seconds to minutes
BI / Reports - hours to days?
What’s Your App?
Concurrent video viewers
Anomaly detection
Clickstream analysis
Live geospatial maps
Real-time trend detection & learning
Common Components
Events -> Message Queue -> Stream Processing Layer -> State / Database -> Happy Users
Example: Real-time trend detection
Events: time, OS, location, asset/product ID
Analyze 1-5 second batches of new “hot” data in stream processor
Combine with recent and historical top K feature vectors in database
Update database recent feature vectors
Serve to users
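A minimal sketch of the combine-and-update step above, in plain Scala with no Spark dependency. The names `TrendSketch`, `mergeCounts`, and `topK` are hypothetical, and simple per-asset counts stand in for the feature vectors mentioned on the slide.

```scala
object TrendSketch {
  // Combine counts from the newest micro-batch with the recent counts
  // already held in the database (modeled here as an immutable Map).
  def mergeCounts(recent: Map[String, Long], batch: Seq[String]): Map[String, Long] =
    batch.foldLeft(recent) { (acc, assetId) =>
      acc.updated(assetId, acc.getOrElse(assetId, 0L) + 1L)
    }

  // Top K assets by count; ties are broken by asset ID for a stable order.
  def topK(counts: Map[String, Long], k: Int): Seq[(String, Long)] =
    counts.toSeq.sortBy { case (id, n) => (-n, id) }.take(k)
}
```

Each batch of asset IDs is folded into the stored counts, and the top K is read back from the merged result before serving.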
Example 2: Smart Cities
Smart City Streaming Data
City buses - regular telemetry (position, velocity, timestamp)
Street sweepers - regular telemetry
Transactions from rail, subway, buses, smart cards
311 info
911 info - new emergencies
Citizens want to know…
Where and for how long can I park my car?
Are transportation options affected by 311 and 911 events?
How long will it take the next bus to get here?
Where is the closest bus to where I am?
Cities want to know…
How can I maximize parking revenue?
More granular updates to parking spots that don't need sweeping
How does traffic affect waiting times in public transit, and revenue?
Patterns in subway train times - is a breakdown coming?
Population movement - where should new transit routes be placed?
[Diagram: smart city pipeline. Sources: 311, 911, Buses, Metro. Components: Message Queue, Stream Processing Layer, Short term telemetry, Event storage, Models. Consumers: Dashboard, Ad-Hoc]
The HARD Principle
Highly Available, Resilient, Distributed
Flexibility - do as many transformations as possible with as few components as possible
Real-time: “NoETL”
Community: best of breed OSS projects with huge adoption and commercial support
Message Queue
Why a message queue?
Centralized publish-subscribe of events
Need more processing? Add another consumer
Buffer traffic spikes
Replay events in cases of failure
Message Queues help distribute data
[Diagram: Input 1, Input 2, Input 3, Input 4 feed a queue split by key range (A-F, G-M, N-S, T-Z); each range has its own Processing consumer]
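The key ranges on this slide (A-F, G-M, N-S, T-Z) can be read as a simple range partitioner. The sketch below is an illustration only: `RangePartitioner` and `partitionFor` are hypothetical names, and Kafka's default partitioner hashes the key rather than using letter ranges.

```scala
object RangePartitioner {
  // Inclusive upper bound of each key range shown on the slide.
  private val upperBounds = Vector('F', 'M', 'S', 'Z')

  // Returns the partition index (0 to 3) for a key, based on its first letter.
  // Keys that are empty or do not start with A-Z go to the last partition.
  def partitionFor(key: String): Int = {
    val first = key.headOption.map(_.toUpper).getOrElse('Z')
    val idx = upperBounds.indexWhere(first <= _)
    if (idx >= 0 && first >= 'A') idx else upperBounds.size - 1
  }
}
```

Because every event with the same key lands in the same partition, each Processing consumer sees all the events for its share of the keys.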
Intro to Apache Kafka
Kafka is a distributed publish-subscribe system
It uses a commit log to track changes
Kafka was originally created at LinkedIn
Open sourced in 2011
Graduated to a top-level Apache project in 2012
On being HARD
Many Big Data projects are open source implementations of closed source products
Unlike Hadoop, HBase or Cassandra, Kafka actually isn't a clone of an existing closed source product
The same codebase being used for years at LinkedIn answers the questions:
Does it scale?
Is it robust?
Ad Hoc ETL
Decoupled ETL
Avro Schemas And Schema Registry
Keys and values in Kafka can be Strings or byte arrays
Avro is a serialization format used extensively with Kafka and Big Data
The Schema Registry (part of the Confluent Platform, not of Apache Kafka itself) keeps track of Avro schemas
It verifies that the correct schemas are being used
Consumer Groups
Commit Logs
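A rough in-memory model of how a commit log and consumer groups fit together, assuming one log and one committed offset per group. `CommitLog` and its methods are hypothetical names, not the Kafka client API; real Kafka tracks offsets per partition.

```scala
import scala.collection.mutable

// An append-only log: each event gets the next offset, starting at 0.
class CommitLog[A] {
  private val entries = mutable.ArrayBuffer.empty[A]
  // One committed offset per consumer group (the next offset to read).
  private val committed = mutable.Map.empty[String, Int]

  def append(event: A): Int = { entries += event; entries.size - 1 }

  // Read everything the group has not yet committed, without committing.
  def poll(group: String): Seq[A] =
    entries.drop(committed.getOrElse(group, 0)).toList

  // Mark the first `count` polled events as processed.
  def commit(group: String, count: Int): Unit =
    committed(group) = math.min(entries.size, committed.getOrElse(group, 0) + count)

  // Rewind a group so that events from `offset` onward are replayed.
  def seek(group: String, offset: Int): Unit =
    committed(group) = math.max(0, math.min(offset, entries.size))
}
```

Adding another consumer is just a new group name reading the same log, and replay after a failure is a `seek` back to an earlier offset.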
Kafka Resources
Official docs - https://kafka.apache.org/documentation.html
The Design section is a really good read
http://www.confluent.io/product
Includes schema registry
Stream Processing
Types of Stream Processors
Event by Event: Apache Storm, Apache Flink, Intel Gearpump, Akka
Micro-batch: Apache Spark
Hybrid? Google Dataflow
Apache Storm and Flink
Transform one message at a time
Very low latency
State and more complex analytics difficult
Akka and Gearpump
Actor to actor messaging. Local state.
Used for extremely low latency (ad networks, etc.)
Dynamically reconfigurable topology
Configurable fault tolerance and failure recovery
Cluster or local mode - you don’t always need distribution!
Spark Streaming
Data processed as stream of micro batches
Higher latency (seconds), higher throughput, more complex analysis / ML possible
Same programming model as batch
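To make the micro-batch idea concrete, here is a small sketch in plain Scala with no Spark dependency. `MicroBatcher`, `Event`, and `batches` are hypothetical names, and the event timestamp is used as a stand-in for arrival time, which is what Spark Streaming actually batches on.

```scala
object MicroBatcher {
  final case class Event(timestampMs: Long, payload: String)

  // Assign each event to the batch whose interval contains its timestamp,
  // then return the batches ordered by batch start time.
  def batches(events: Seq[Event], intervalMs: Long): Seq[(Long, Seq[Event])] =
    events
      .groupBy(e => e.timestampMs / intervalMs * intervalMs)
      .toSeq
      .sortBy { case (batchStart, _) => batchStart }
}
```

Each batch can then be handed to ordinary batch code, which is why the programming model stays the same; the cost is that latency is at least one batch interval.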
Why Spark?
// Word count in Spark (Scala); `spark` is assumed to be an existing SparkContext
val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
package org.myorg;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCount {

  public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
      String line = value.toString();
      StringTokenizer tokenizer = new StringTokenizer(line);
      while (tokenizer.hasMoreTokens()) {
        word.set(tokenizer.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = new Job(conf, "wordcount");

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    job.setMapperClass(Map.class);
    job.setReducerClass(Reduce.class);

    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
  }

}
Spark Production Deployments
Explosion of Specialized Systems
Spark and Berkeley AMP Lab
Benefits of Unified Libraries
Optimizations can be shared between libraries
Core: Project Tungsten
MLlib: Shared statistics libraries
Spark Streaming: GC and memory management
Mix and match modules
Easily go from DataFrames (SQL) to MLlib / statistics, for example:
scala> import org.apache.spark.mllib.stat.Statistics
scala> val numMentions = df.select("NumMentions").map(row => row.getInt(0).toDouble)
numMentions: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[100] at map at DataFrame.scala:848
scala> val numArticles = df.select("NumArticles").map(row => row.getInt(0).toDouble)
numArticles: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[104] at map at DataFrame.scala:848
scala> val correlation = Statistics.corr(numMentions, numArticles, "pearson")
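For reference, the "pearson" option asks for the Pearson correlation coefficient. The plain-Scala sketch below shows that formula on two small in-memory samples; `PearsonSketch` is a hypothetical name and this is not how MLlib computes it on distributed data.

```scala
object PearsonSketch {
  // Pearson correlation of two equally sized samples:
  // sum of co-deviations divided by the square root of the product
  // of the two sums of squared deviations.
  def pearson(xs: Seq[Double], ys: Seq[Double]): Double = {
    require(xs.size == ys.size && xs.nonEmpty, "samples must be non-empty and equally sized")
    val n = xs.size
    val meanX = xs.sum / n
    val meanY = ys.sum / n
    val cov  = xs.zip(ys).map { case (x, y) => (x - meanX) * (y - meanY) }.sum
    val varX = xs.map(x => math.pow(x - meanX, 2)).sum
    val varY = ys.map(y => math.pow(y - meanY, 2)).sum
    cov / math.sqrt(varX * varY)
  }
}
```

A result near 1 means the two columns rise together, near -1 means one falls as the other rises, and near 0 means no linear relationship.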
Spark Worker Failure
Rebuild RDD Partitions on Worker from Lineage
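Rebuilding from lineage can be pictured as replaying a recorded list of transformations over the original source. The sketch below is a toy model under that assumption; `LineageSketch` and `Lineage` are hypothetical names and this is not Spark's RDD implementation.

```scala
object LineageSketch {
  // A partition is described by its source data plus the ordered list of
  // transformations applied to it; any computed result is only a cache.
  final case class Lineage[A](source: () => Seq[A], steps: List[Seq[A] => Seq[A]]) {
    def map(f: A => A): Lineage[A] =
      copy(steps = steps :+ ((xs: Seq[A]) => xs.map(f)))

    def filter(p: A => Boolean): Lineage[A] =
      copy(steps = steps :+ ((xs: Seq[A]) => xs.filter(p)))

    // Recompute from scratch by replaying every step over the source.
    def compute(): Seq[A] = steps.foldLeft(source())((data, step) => step(data))
  }
}
```

If a worker holding the computed partition is lost, calling `compute()` again on another worker yields the same data, provided the source is still readable and the steps are deterministic.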
Spark SQL & DataFrames
DataFrames & Catalyst Optimizer
Catalyst Optimizations
Column and partition pruning (Column filters)
Predicate pushdowns (Row filters)
Spark SQL Data Sources API
Enables custom data sources to participate in Spark SQL = DataFrames + Catalyst
Production implementations:
spark-csv (Databricks)
spark-avro (Databricks)
spark-cassandra-connector (DataStax)
elasticsearch-hadoop (Elastic.co)
Spark Streaming
Streaming Sources
Basic: Files, Akka actors, queues of RDDs, Socket
Advanced: Kafka, Kinesis, Flume, Twitter firehose
DStreams = micro-batches
Streaming Fault Tolerance
Incoming data is replicated to 1 other node
Write Ahead Log for sources that support ACKs
Checkpointing for recovery if Driver fails
Direct Kafka Streaming: KafkaRDD
No single Receiver
Parallelizable
No Write Ahead Log
Kafka *is* the Write Ahead Log!
KafkaRDD stores Kafka offsets
KafkaRDD partitions recover from offsets
Spark MLlib & GraphX
Spark MLlib Common Algos
Classifiers: DecisionTree, RandomForest
Clustering: K-Means, Streaming K-Means
Collaborative Filtering: Alternating Least Squares (ALS)
Spark Text Processing Algos
TF/IDF
LDA
Word2Vec
*Pro-Tip: Use Stanford CoreNLP!
Spark ML Pipelines
Modeled after scikit-learn
Spark GraphX
PageRank: top influencers
Connected Components: measure of clusters
Triangle Counting: measure of cluster density
Handling State
What Kind of State?
Non-persistent / in-memory: concurrent viewers
Short term: latest trends
Longer term: raw event & aggregate storage
ML Models, predictions, scored data
Spark RDDs
Immutable, cache in memory and/or on disk
Spark Streaming: updateStateByKey
IndexedRDD - can update bits of data
Snapshotting for recovery
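The behavior of `updateStateByKey` can be sketched without Spark as a function from (previous state, new batch) to new state. This is an illustrative model only: `StateSketch` is a hypothetical name, and the real operator works on DStreams and relies on checkpointing.

```scala
object StateSketch {
  // Apply one micro-batch to the running state. For each key, `update`
  // receives the new values in this batch and the previous state (if any);
  // returning None drops the key from the state.
  def updateStateByKey[K, V, S](
      state: Map[K, S],
      batch: Seq[(K, V)]
  )(update: (Seq[V], Option[S]) => Option[S]): Map[K, S] = {
    val grouped: Map[K, Seq[V]] =
      batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    // Keys with existing state are visited even when the batch has no new values.
    val keys = (state.keySet ++ grouped.keySet).toSeq
    keys.flatMap { k =>
      update(grouped.getOrElse(k, Seq.empty), state.get(k)).map(s => k -> s)
    }.toMap
  }
}
```

For the concurrent-viewers case, the state would be a count per video and the update function would add the batch's values to the previous count.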
• Massively Scalable
• High Performance
• Always On
• Masterless
Apache Cassandra: Scale
• Scales linearly to as many nodes as you need
• Scales whenever you need
Apache Cassandra: Performance
• It’s fast
• Built to sustain massive data insertion rates in irregular pattern spikes
Apache Cassandra: Fault Tolerance & Availability
• Automatic replication
• Multi datacenter
• Decentralized - no single point of failure
• Survives regional outages
• New nodes automatically add themselves to the cluster
• DataStax drivers automatically discover new nodes
Apache Cassandra: Architecture
• Distributed, masterless ring architecture
• Network topology aware
• Flexible, schemaless - your data structure can evolve seamlessly over time
To download:
https://cassandra.apache.org/download/
https://github.com/pcmanus/ccm
^ Highly recommended for local testing/cluster setup
Cassandra Data Modeling
Primary key = (partition keys, clustering keys)
Fast queries = fetch single partition
Range scans by clustering key
Must model for query patterns
[Table: rows are Partition 1, Partition 2, Partition 3; columns are Clustering 1, Clustering 2, Clustering 3]
City Bus Data Modeling Example
Primary key = (Bus UUID, timestamp)
Easy queries: location and speed of single bus for a range of time
Can also query most recent location + speed of all buses (slower)
[Table: one row (partition) per bus: Bus A, Bus B, Bus C; columns ordered by timestamp: 1020 s, 1010 s, 1000 s; cells hold speed, GPS]
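The bus model can be mimicked in memory as a map of partitions, each holding rows sorted by the clustering key. This is a sketch under simplifying assumptions: `BusStore`, `Reading`, and the field names are hypothetical, a String stands in for the bus UUID, and a real deployment would use a CQL table instead.

```scala
import scala.collection.immutable.TreeMap

object BusStore {
  final case class Reading(speedKmh: Double, lat: Double, lon: Double)

  // Partition key = bus ID; clustering key = timestamp (seconds), kept sorted.
  type Table = Map[String, TreeMap[Long, Reading]]

  // Writing the same (busId, timestamp) twice overwrites the row,
  // so replayed events do not create duplicates.
  def write(table: Table, busId: String, ts: Long, r: Reading): Table =
    table.updated(busId, table.getOrElse(busId, TreeMap.empty[Long, Reading]).updated(ts, r))

  // Fast path: one partition, range scan over the clustering key [from, to].
  def rangeScan(table: Table, busId: String, from: Long, to: Long): Seq[(Long, Reading)] =
    table.get(busId).map(_.range(from, to + 1).toSeq).getOrElse(Seq.empty)

  // Slower path: touches every partition to find each bus's latest row.
  def latestPerBus(table: Table): Map[String, (Long, Reading)] =
    table.collect { case (busId, rows) if rows.nonEmpty => busId -> rows.last }
}
```

`rangeScan` corresponds to the easy single-bus query, while `latestPerBus` shows why the all-buses query is slower: it has to visit every partition.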
Using Cassandra for Short Term Storage
The idea is to store and read small values
Idempotent writes + huge write capacity = ideal for streaming ingestion
For example, store last few (latest + last N) snapshots of buses, taxi locations, recent traffic info
But Mommy! What about longer term data?
I need to read lots of data, fast!!
- Ad hoc analytics of events
- More specialized / geospatial
- Building ML models from large quantities of data
- Storing scored/classified data from models
- OLAP / Data Warehousing
Can Cassandra Handle Batch?
Cassandra tables are much better at lots of small reads than big data scans
You CAN store data efficiently in C*
Files seem easier for long term storage and analysis
But are files compatible with streaming?
Lambda Architecture
Lambda is Hard and Expensive
Very high TCO - Many moving parts - KV store, real time, batch
Lots of monitoring, operations, headache
Running similar code in two places
Lower performance - lots of shuffling data, network hops, translating domain objects
Reconcile queries against two different places
NoLambda
A unified system
Real-time processing and reprocessing
No ETLs
Fault tolerance
Everything is a stream
Can Cassandra do batch and ad-hoc?
Yes, it can be competitive with Hadoop actually...
If you know how to be creative with storing your data!
Tuplejump/SnackFS - HDFS for Cassandra
github.com/tuplejump/FiloDB - analytics database
Store your data using Protobuf / Avro / etc.
Introduction to FiloDB
Efficient columnar storage - 5-10x better
Scan speeds competitive with Parquet - 100x faster than regular Cassandra tables
Very fine grained filtering for sub-second concurrent queries
Easy BI and ad-hoc analysis via Spark SQL/DataFrames (JDBC etc.)
Uses Cassandra for robust, proven storage
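To illustrate why a columnar layout helps scans, here is a toy comparison in plain Scala. `ColumnarSketch`, `Event`, and `Columns` are hypothetical names, and this is not FiloDB's actual storage format.

```scala
object ColumnarSketch {
  final case class Event(ts: Long, os: String, assetId: String)

  // Columnar layout: one array per field instead of one object per row.
  final case class Columns(ts: Array[Long], os: Array[String], assetId: Array[String])

  def toColumns(rows: Seq[Event]): Columns =
    Columns(rows.map(_.ts).toArray, rows.map(_.os).toArray, rows.map(_.assetId).toArray)

  // A query on `os` reads only the `os` column; the other columns stay untouched.
  def countByOs(cols: Columns): Map[String, Int] =
    cols.os.groupBy(identity).map { case (os, hits) => os -> hits.length }
}
```

An analytical query that needs one or two fields out of many reads far less data this way, and values of the same type stored together tend to compress well.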
Combining FiloDB + Cassandra
Regular Cassandra tables for highly concurrent, aggregate / key-value lookups (dashboards)
FiloDB + C* + Spark for efficient long term event storage
Ad hoc / SQL / BI
Data source for MLlib / building models
Data storage for classified / predicted / scored data
[Diagram: Events -> Message Queue -> Spark Streaming -> Cassandra. Labels: Short term storage, K-V; FiloDB: Events, ad-hoc, batch; Spark: Adhoc, SQL, ML; Dashboards, maps]
[Diagram: Events -> Message Queue -> Spark Streaming -> Cassandra. Labels: Models; FiloDB: Long term event storage; Spark; Learned Data]
FiloDB + Cassandra
Robust, peer to peer, proven storage platform
Use for short term snapshots, dashboards
Use for efficient long term event storage & ad hoc querying
Use as a source to build detailed models