Post on 08-Jan-2017
A Practical Guide to Selecting a Stream Processing Technology
Michael G. Noll, Product Manager, Confluent
Kafka Talk Series
Date Title
Sep 27 Introduction To Streaming Data and Stream Processing with Apache Kafka
Oct 06 Deep Dive into Apache Kafka
Oct 27 Data Integration with Apache Kafka
Nov 17 Demystifying Stream Processing with Apache Kafka
Dec 01 A Practical Guide to Selecting a Stream Processing Technology
Dec 15 Streaming in Practice: Putting Apache Kafka in Production
https://www.confluent.io/apache-kafka-talk-series
Agenda
• Recap: What is Stream Processing?
• The Three Pillars of Stream Processing in Practice
• Key Selection Criteria
  • Organizational/Non-Technical Dimensions
  • Technical Dimensions
• Summary
Powered by Kafka (thousands more)
Spark Streaming API (2.0)
Kafka’s Streams API (0.10)
Example: Streams and Tables in Kafka
Word Count
hello 2
kafka 1
world 1
… …
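The word-count example can be sketched in plain Python to show how a stream of records turns into a continuously updated table. This is a minimal illustration, not Kafka's Streams API: the input lines and the function name `word_count` are assumptions, chosen so that the resulting table matches the counts shown above.

```python
from collections import Counter
from typing import Dict, Iterable, Iterator, Tuple


def word_count(lines: Iterable[str]) -> Iterator[Tuple[str, int]]:
    """Consume a stream of text lines; emit one (word, count) update per word."""
    table: Counter = Counter()           # the "table": latest count per word
    for line in lines:                   # the "stream": one record at a time
        for word in line.lower().split():
            table[word] += 1
            yield word, table[word]      # changelog entry describing the update


# Hypothetical input stream, for illustration only.
stream = ["hello world", "hello kafka"]
changelog = list(word_count(stream))

# Replaying the changelog rebuilds the table: later updates overwrite earlier ones.
latest: Dict[str, int] = dict(changelog)
print(latest)
```

Each update is itself a record, so the table can be read back as a stream, and replaying that stream reconstructs the table.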
Streams & Databases
• A stream processing technology must have first-class support for Streams and Tables
  • With scalability, fault tolerance, …
• Why? Because most use cases require not just one, but both!
• Support – or lack thereof – strongly impacts the resulting technical architecture and development efforts
• No support means:
  • Painful Do-It-Yourself
  • Increased complexity, more moving pieces to juggle
Organizational/Non-Tech Dimensions
• Can your org understand and leverage the technology?
  • Familiarity with languages; intuitive concepts and APIs; trainings
• Are you permitted to use it in your organization?
  • Security features, licensing, open source vs. proprietary
• Can you continue to use it in the future?
  • Longevity of technology, licensing, vendor strength
Organizational/Non-Tech Dimensions
• Do you believe in the long-term vision?
  • Switching technologies in an organization is often expensive/slow: legacy migration, re-training, resistance to change, etc.
• What is the path and time to success?
  • Can you move smoothly and quickly from proof-of-concept to production?
• Areas and range of applicability in your organization
  • General-purpose vs. niche technology
  • Viable for S/M/L/XL use cases vs. for XL use cases only
  • Building core business apps vs. doing backend analytics
Organizational/Non-Tech Dimensions
• Licensing
• Vision/Roadmap
• ROI
• Impact on Organization
• Broad vs. Niche Applicability
• Time to Market
• Professional Services
• Documentation
• Examples
• User Community
• Learning Curve
• Impact on Tools, Infrastructure, …
Technical Dimensions
• Reprocessing
• Scalability & Elasticity
• Fault Tolerance
• API
• Dev/Ops Lifecycle
• Security
• Processing Model
• Out of Order Data
• Abstractions
• Time Model
• Windowing
• State
State
• Stateful processing of any kind requires…state
  • Many (most?) use cases for stream processing are stateful
  • Joins, aggregations, windowing, counting, ...
• Is state performant? Local vs. remote state?
• Is state fault-tolerant? How fast is recovery/failover?
• Is state interactively queryable?
  • Kafka: ready for use (GA)
  • Spark, Flink: under development (alpha)
  • Storm, Samza, and others: not available
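The three questions about state (local, fault-tolerant, queryable) can be made concrete with a toy sketch. It assumes state is held in local memory and every change is also appended to a durable, replayable log. The class name `LocalStateStore` and the in-memory list standing in for the durable log are hypothetical, and this is not how any of the systems named above implement state.

```python
from typing import Dict, List, Optional, Tuple


class LocalStateStore:
    """Toy key-value state kept in local memory and backed by a changelog."""

    def __init__(self, changelog: List[Tuple[str, int]]) -> None:
        self.changelog = changelog       # stands in for a durable, replayable log
        self.data: Dict[str, int] = {}
        for key, value in changelog:     # recovery: replay the log to rebuild state
            self.data[key] = value

    def put(self, key: str, value: int) -> None:
        self.changelog.append((key, value))   # record the change first ...
        self.data[key] = value                # ... then update the local copy

    def get(self, key: str) -> Optional[int]:
        """Interactive query: read the latest value straight from local state."""
        return self.data.get(key)


durable_log: List[Tuple[str, int]] = []
store = LocalStateStore(durable_log)
store.put("alice", 1)
store.put("alice", 2)
store.put("bob", 7)

# Simulate a crash: the in-memory store is lost, only the log survives.
recovered = LocalStateStore(durable_log)
print(recovered.get("alice"), recovered.get("bob"))
```

Recovery time in this sketch grows with the length of the log, which is one reason "How fast is recovery/failover?" is worth asking of any candidate technology.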
Abstractions
• What are the data model and the available abstractions?
• Most common abstraction: stream of records, events
  • Kafka, Spark, Storm, Samza, Flink, Apex, ...
• New, very powerful: table of records
  • Currently unique to Kafka
  • Represents latest state and materialized views
• State must have a first-class abstraction because, as we just saw in the previous section, state is crucial for stream processing!
Time model
• Different use cases require different time semantics
  • Great majority of use cases require event-time semantics
  • Other use cases may require processing-time (e.g. real-time monitoring) or special variants like ingestion-time
• A stream processing technology should, at a minimum, support event-time to cover most use cases in practice
  • Examples: Kafka, Beam, Flink
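A small sketch of why the choice of time semantics matters: the same three events are counted per 5-minute window, once by event-time and once by processing-time. The `Event` type, the helper `count_per_window`, and the sample timestamps are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Event(NamedTuple):
    user: str
    event_time: int        # seconds; when the event actually happened
    processing_time: int   # seconds; when the processor saw the event


def count_per_window(events: List[Event], use_event_time: bool,
                     size: int = 300) -> Dict[int, int]:
    """Count events per window of `size` seconds under the chosen time semantics."""
    counts: Dict[int, int] = defaultdict(int)
    for e in events:
        ts = e.event_time if use_event_time else e.processing_time
        counts[ts - ts % size] += 1      # key = start of the window
    return dict(counts)


events = [
    Event("alice", event_time=10, processing_time=12),
    Event("bob", event_time=20, processing_time=25),
    Event("dave", event_time=290, processing_time=400),   # delayed in transit
]
by_event_time = count_per_window(events, use_event_time=True)
by_processing_time = count_per_window(events, use_event_time=False)
print(by_event_time, by_processing_time)
```

With event-time all three events land in the window in which they happened; with processing-time the delayed event is attributed to a later window.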
Windowing
• Windowing is an operation that groups events
• Most commonly needed: time windows, session windows
• Examples:
  • Real-time monitoring: 5-minute averages
  • Reader behavior on a website: user browsing sessions
[Figure: input data, where colors represent different users' events (alice, bob, dave); rectangles denote different event-time windows; axes: processing-time and event-time]
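The two window types can be sketched as follows: fixed 5-minute windows for averages and gap-based session windows for browsing behavior. The function names, the 30-minute inactivity gap, and the sample data are assumptions for illustration, not taken from any particular framework.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def tumbling_averages(readings: List[Tuple[int, float]],
                      size: int = 300) -> Dict[int, float]:
    """Average of (event_time_seconds, value) readings per fixed window."""
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in readings:
        buckets[ts - ts % size].append(value)
    return {start: sum(vs) / len(vs) for start, vs in sorted(buckets.items())}


def sessions(timestamps: List[int], gap: int = 1800) -> List[Tuple[int, int]]:
    """Group one user's event times into sessions split by inactivity > gap."""
    result: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if result and ts - result[-1][1] <= gap:
            result[-1] = (result[-1][0], ts)    # extend the current session
        else:
            result.append((ts, ts))             # start a new session
    return result


averages = tumbling_averages([(0, 10.0), (120, 20.0), (310, 40.0)])
alice_sessions = sessions([0, 600, 900, 7200, 7500])
print(averages, alice_sessions)
```

Time windows have boundaries fixed by the clock, whereas session windows have boundaries that depend on the data, which is why they are harder to support efficiently.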
Out-of-order and late-arriving data
• Is very common in practice, not a rare corner case
  • Related to time model discussion
[Figure: users with mobile phones enter airplane, lose Internet connectivity; emails are being written during the 10h flight; Internet connectivity is restored, phones will send queued emails now]
• We want control over how out-of-order data is handled
• Example:
  • We process data in 5-minute windows, e.g. compute statistics
  • When event arrives 1 minute late: update the original result!
  • When event arrives 2 hours late: discard it!
• Handling must be efficient because it happens so often
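The update-or-discard policy from the example can be sketched with an allowed-lateness threshold per window. The 10-minute `GRACE` value is an assumption, and using each event's arrival time as the clock is a simplification; real systems typically track stream progress in more elaborate ways.

```python
from typing import Dict, List, Tuple

WINDOW = 300   # 5-minute windows, in seconds
GRACE = 600    # assumed allowed lateness: 10 minutes after the window ends


def process(events: List[Tuple[int, int]]) -> Tuple[Dict[int, int], List[int]]:
    """Count events per window; `events` holds (event_time, arrival_time) pairs."""
    counts: Dict[int, int] = {}
    discarded: List[int] = []
    for event_time, arrival_time in events:
        start = event_time - event_time % WINDOW
        if arrival_time > start + WINDOW + GRACE:
            discarded.append(event_time)                 # too late: drop it
        else:
            counts[start] = counts.get(start, 0) + 1     # on time or late: update
    return counts, discarded


counts, discarded = process([
    (10, 11),      # on time
    (20, 360),     # arrives 1 minute after its window ended: update the result
    (30, 7500),    # arrives 2 hours after its window ended: discard
    (310, 312),    # on time, next window
])
print(counts, discarded)
```

The threshold is the "control" knob: a larger value keeps more window state around for longer, a smaller one discards more late data.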
Reprocessing
• Re-process data by rewinding a stream back in time
• Use cases in practice include
  • Correcting output data after fixing a bug
  • Facilitate iterative and explorative development
  • A/B testing
  • Processing historical data
  • Walking through "What If?" scenarios
• Also: often used behind-the-scenes for fault tolerance
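Rewinding can be sketched with an append-only log that keeps its records, so a reader can start again from any earlier offset. The `ReplayableLog` class, the `run` helper, and the "wrong divisor" bug are hypothetical and only illustrate the first use case in the list (correcting output after a bug fix).

```python
from typing import Callable, List


class ReplayableLog:
    """Toy append-only log; readers choose the offset they start from."""

    def __init__(self) -> None:
        self.records: List[int] = []

    def append(self, record: int) -> None:
        self.records.append(record)

    def read_from(self, offset: int) -> List[int]:
        return self.records[offset:]


def run(log: ReplayableLog, offset: int, logic: Callable[[int], int]) -> List[int]:
    """Apply the processing logic to every record from `offset` onward."""
    return [logic(record) for record in log.read_from(offset)]


log = ReplayableLog()
for amount_in_cents in [100, 250, 40]:
    log.append(amount_in_cents)

buggy = run(log, 0, lambda cents: cents // 10)     # bug: wrong divisor
fixed = run(log, 0, lambda cents: cents // 100)    # rewind to offset 0, re-run
print(buggy, fixed)
```

Reprocessing only works if the input is still available, so retention of the source data is part of the evaluation.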
Scalability, Elasticity, Fault Tolerance
• Can the technology scale according to your needs?
  • Desired latency, throughput?
  • Able to process millions of messages per second?
  • What is the minimum footprint?
• Expand/shrink capacity dynamically during operations?
  • Helps with resource utilization because most stream apps run continuously
• Resilience and fault tolerance
  • Which guarantees for data delivery and for state? "At-least-once", "exactly-once", "effectively-once", etc.
  • Failover behavior and recovery time? Automated or manual?
  • Any negative impact of fault tolerance features on performance?
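The difference between delivery guarantees shows up when a record is redelivered after a failure. The sketch below assumes each delivery carries a unique offset and contrasts a sink that applies every delivery with one that skips offsets it has already applied. Function names and data are hypothetical, and deduplication by offset is just one way to reach an "effectively-once" result.

```python
from typing import Dict, List, Set, Tuple

Delivery = Tuple[int, str, int]   # (offset, key, amount)


def apply_all(deliveries: List[Delivery]) -> Dict[str, int]:
    """At-least-once sink: redelivered records are counted again."""
    totals: Dict[str, int] = {}
    for _offset, key, amount in deliveries:
        totals[key] = totals.get(key, 0) + amount
    return totals


def apply_once_per_offset(deliveries: List[Delivery]) -> Dict[str, int]:
    """Idempotent sink: an offset that was already applied is skipped."""
    totals: Dict[str, int] = {}
    applied: Set[int] = set()
    for offset, key, amount in deliveries:
        if offset in applied:
            continue
        applied.add(offset)
        totals[key] = totals.get(key, 0) + amount
    return totals


# Offset 1 is delivered twice, e.g. after a retry following a failure.
deliveries = [(0, "alice", 5), (1, "bob", 3), (1, "bob", 3), (2, "alice", 2)]
naive = apply_all(deliveries)
deduplicated = apply_once_per_offset(deliveries)
print(naive, deduplicated)
```

The extra bookkeeping in the second function hints at the performance question in the last bullet: stronger guarantees usually cost something.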
Security
• To meet internal security policies, legal compliance, etc.
• Typical base requirements for stream processing applications:
  • Encrypt data-in-transit (e.g. from/to Kafka)
  • Authentication: "only some applications may talk to production"
  • Authorization: "access to sensitive data such as PII is restricted"
• The easier it is to use security features, the more likely they are to actually be used in practice
Processing Model
• True stream processing is record-at-a-time processing
  • Benefits include low latency (millisecs), dealing efficiently with out-of-order data
  • Can provide both low latency and high throughput via internal optimizations
  • Examples: Kafka, Storm, Samza, Flink, Beam
• Some processing technologies opt for (micro)batching
  • Micro-batching has no true benefits: consider it a technical workaround to shoehorn stream-like functionality into a tool
  • Suffers from significant overhead when dealing with e.g. out-of-order/late-arriving data, when performing windowed analyses (e.g. session windows)
  • Typically a strong blocker for use cases such as fraud detection or anything where "a few seconds" of latency is prohibitive
  • Examples: Spark, Storm (Trident), Hadoop*
API
• Choice of API is a subjective matter – skills, preference, …
• Typical options
  • Declarative, expressive API: operations like map(), filter()
  • Imperative, lower-level API: callbacks like process(event)
  • Streaming SQL: STREAM SELECT … FROM … WHERE …
• In the best case you get not just one, but all three
• "Abstractions are great!"
• "Abstractions considered harmful!"
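The first two API styles can be contrasted on the same task: keep only error events and upper-case them. This is a plain-Python sketch; the `Processor` class, its `emit` callback, and the sample events are hypothetical and do not mirror any specific framework's API.

```python
from typing import Callable, Iterable, List


# Declarative style: describe the transformation with map() and filter().
def declarative(events: Iterable[str]) -> List[str]:
    return list(map(str.upper, filter(lambda e: e.startswith("error"), events)))


# Imperative style: a callback invoked once per event decides what to emit.
class Processor:
    def __init__(self, emit: Callable[[str], None]) -> None:
        self.emit = emit

    def process(self, event: str) -> None:
        if event.startswith("error"):
            self.emit(event.upper())


events = ["info: started", "error: disk full", "error: timeout"]

imperative_output: List[str] = []
processor = Processor(imperative_output.append)
for event in events:
    processor.process(event)

declarative_output = declarative(events)
print(declarative_output == imperative_output)
```

Both produce the same result here; the imperative form gives finer control per event, the declarative form states the intent more compactly.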
Developer/Operations Lifecycle
• What should your daily work look and feel like?
  • "I like to do quick, iterative development" (modify/test/repeat)
  • "I want to decouple team roadmaps, project schedules"
• Big difference between App Model <-> Cluster Model
  • Testing, packaging, deployment, monitoring, operations
  • "Do I need to know Java (app) or YARN (cluster) for this?"
  • "I want reactive processing in containers that run on Mesos!"
• Rolling, no-downtime upgrades?
• Integration with existing Ops infra, tools, processes?
Summary
• What we covered is a good starting point
• But, no free lunch!
  • Understand what you need, and weigh criteria appropriately
  • Think end-to-end: idea, development, operations, troubleshooting
  • Think big-picture: future use cases, architecture, security, training, …
  • Do your own internal hackathons, proofs of concept
  • Do your own benchmarks
• If in doubt: simplicity beats complexity
  • Faster to learn, easier to understand, less likely to fail, …
Q&A Session
Coming Up Next
Date    Title    Speaker
Dec 15  Streaming in Practice: Putting Apache Kafka in Production    Roger Hoover
https://www.confluent.io/apache-kafka-talk-series