Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Bryan Bende – Software Engineer @Hortonworks
Future of Data NY – December 5th 2016
2 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Agenda
→ Problem Definition
→ Introduction to Apache NiFi
→ Introduction to Apache MiNiFi
→ Demo!!
→ Q&A
About Me
→ Software Engineer @ Hortonworks
→ Apache NiFi PMC & Committer
→ Working with NiFi since 2011
→ Recent focus on integrations with Hadoop ecosystem
→ [email protected] / Twitter @bbende / bryanbende.com
→ Bethpage Class of 2001!
The Problem
It starts out so simple…
Team 1: Hey! We have some important data to send you!
Team 2: Cool! Your data is really important to us!
This should be easy, right?...
But what about formats & protocols?
Team 1: We can publish Avro records to a Kafka topic, does that work?
Team 2: Oh, well we have a REST service that accepts JSON…
And what about security & authentication?
Team 1: Hmm what about security? We can authenticate via Kerberos
Team 2: Sorry, we only support 2-Way TLS with certificates
And what about all these devices at the edge?
Team 2: We also need to grab data from all these devices, how are we going to do that?
And What About…
→ Organizational Politics (my data)
→ Brittle Connectivity
→ Firewalls/Security Domains
→ Partnerships bring new data / need different formats
→ Data has to be masked for compliance purposes
→ Where is this data even from?
→ Data is in that other system – I need it over here
→ Bandwidth between those sites is limited
→ My Big Data system needs it in this other better/faster/stronger format
→ What schema is that from?
→ It needs to be enriched first!
→ No not that reference set – this one!
→ I didn’t even know that system existed
Ok so let’s fix this
• Enterprise Architecture – Standardize on
  • …format
  • …a schema (one that can evolve)
  • …a protocol
  • …an ontology
But now…
• Standard schema becomes complex
• Hard to agree on common changes
• Some teams stuck on older versions
• Productivity starts slowing…
Something to ponder – the disconnect is healthy
• Having Corporate Standards is a good thing.
• Innovation is a good thing.
Innovation often does not follow the Corporate Standard
What is Dataflow Management?
Dataflow Management
The systematic process by which data is acquired from all producers and delivered to all consumers
Dataflow Management Considerations
• Promote Loosely Coupled Systems
  • Types of coupling: Format, Schema, Protocol, Priority, Size, Interest, …
• Promote Highly Cohesive Systems
  • Producers should focus on production (not the intricacies of consumption)
  • Consumers should focus on storage or processing (not the details of production)
• Provide Provenance
  • The who/what/when/where/why of data
  • Inter and Intra Process Latency
  • Enable enterprise version control for data
• Empower Understanding and Interaction
  • Ability to see the flow, safely and quickly iterate and experiment
  • Breaking production is bad – so too is not being able to evolve fast enough
• Secure
  • Bridge between security domains
  • Data Plane (transport)
  • Control Plane (C&C, Monitoring)
• Self Service
  • Centralized teams – hard to scale – slow turnaround times
  • Centralized systems – multi-tenant management works
The role of messaging systems
• Reduce variables: Fix protocol, Data Size, Provide Buffering
• Historically not very fast or replayable: Apache Kafka solved that
• Strong solution within a controlled domain
• But numerous challenges remain
  • Topics do not separate key concerns between producer and consumer pairs such as
    § Authorization
    § Format
    § Schema
    § Interest
    § Prioritization
  • Flow control
Introduction to Apache NiFi
The NSA Years
• Created in 2006
• Improved over eight years
• Simple initial vision – Visio for real-time dataflow management
• Key Lessons Learned
  • What scale means – down, up, and out
  • The fearsome force known as Compliance Requirements
  • The power of provenance!
  • Operational best-practices and anti-patterns
• NSA donated the codebase to the ASF in late 2014
19 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Key Features
• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering
20 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
NiFi Core Concepts
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
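To make the terms in the table concrete, here is a minimal toy model in Python. It is only a sketch of the roles (FlowFile, processor, connection, scheduler), not NiFi's actual API. All class and function names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """Information-packet role: content plus key/value attributes."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Bounded-buffer role: a queue linking two processors."""

    def __init__(self):
        self.queue = deque()

    def put(self, flowfile):
        self.queue.append(flowfile)

    def poll(self):
        return self.queue.popleft() if self.queue else None


class UppercaseProcessor:
    """Black-box role: takes one FlowFile, transforms it, passes it on."""

    def __init__(self, incoming, outgoing):
        self.incoming = incoming
        self.outgoing = outgoing

    def on_trigger(self):
        flowfile = self.incoming.poll()
        if flowfile is None:
            return False  # nothing to do on this run
        flowfile.content = flowfile.content.upper()
        flowfile.attributes["transformed"] = "true"
        self.outgoing.put(flowfile)
        return True


def run_flow(processors):
    """Scheduler role: trigger every processor until none has work left."""
    while any([p.on_trigger() for p in processors]):
        pass
```

A process group would then be nothing more than a set of such processors and connections exposed through an input and an output connection.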
21 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Visual Command & Control
• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows
22 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Provenance/Lineage
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
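One way to picture lineage tracking across fan-in and fan-out is an event log in which each event names the FlowFiles it consumed and produced. The Python sketch below rests on that assumption. The event labels and class names are illustrative, not NiFi's provenance API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str   # illustrative labels such as "CREATE", "SPLIT", "MERGE"
    parents: tuple    # ids of the FlowFiles consumed by the event
    children: tuple   # ids of the FlowFiles produced by the event


class ProvenanceLog:
    def __init__(self):
        self.events = []
        self.produced_by = {}  # child id -> event that produced it

    def record(self, event_type, parents, children):
        event = ProvenanceEvent(event_type, tuple(parents), tuple(children))
        self.events.append(event)
        for child in event.children:
            self.produced_by[child] = event
        return event

    def ancestors(self, flowfile_id):
        """Walk backwards through split/merge events to every ancestor id."""
        seen, stack = set(), [flowfile_id]
        while stack:
            event = self.produced_by.get(stack.pop())
            if event is None:
                continue
            for parent in event.parents:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen
```

Recording a split of "a" into "a1" and "a2", then a merge of "a1" and "b" into "m", lets `ancestors("m")` return both source files.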
Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
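A per-connection prioritizer can be sketched as a priority queue whose ordering function is supplied by whoever configures the connection. The Python below is a hypothetical illustration using a heap with arrival order as the tie-breaker. The `priority` attribute and the class name are assumptions, not NiFi identifiers.

```python
import heapq
import itertools


class PrioritizedConnection:
    def __init__(self, priority_of):
        self._priority_of = priority_of    # lower value is served first
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def put(self, flowfile):
        entry = (self._priority_of(flowfile), next(self._arrival), flowfile)
        heapq.heappush(self._heap, entry)

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


# Several upstream sources funnel into one connection, which then
# prioritizes across all of them.
conn = PrioritizedConnection(lambda ff: int(ff["priority"]))
conn.put({"name": "bulk-1", "priority": "5"})
conn.put({"name": "alert", "priority": "1"})
conn.put({"name": "bulk-2", "priority": "5"})
```

Polling this connection yields the alert first, then the two bulk items in the order they arrived.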
Back-Pressure
• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
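The three points above can be sketched as a queue with two thresholds, checked by the scheduler before it runs the upstream processor. This Python sketch assumes the thresholds are soft, so a processor that is already scheduled may push the queue slightly past them. The names are hypothetical.

```python
from collections import deque


class BackPressureConnection:
    def __init__(self, max_count, max_bytes):
        self.max_count = max_count  # threshold on number of queued items
        self.max_bytes = max_bytes  # threshold on total queued content size
        self.queue = deque()
        self.queued_bytes = 0

    def put(self, content):
        self.queue.append(content)
        self.queued_bytes += len(content)

    def poll(self):
        content = self.queue.popleft()
        self.queued_bytes -= len(content)
        return content

    def back_pressure_applied(self):
        return (len(self.queue) >= self.max_count
                or self.queued_bytes >= self.max_bytes)


def schedule_upstream(produce, connection):
    """Run the upstream processor once unless its outgoing queue is at threshold."""
    if connection.back_pressure_applied():
        return False  # skipped until the downstream side drains the queue
    connection.put(produce())
    return True
```

Once a downstream consumer calls `poll()` and the queue drops below both thresholds, the upstream processor becomes eligible to run again.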
Latency vs. Throughput
• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation
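The trade-off can be illustrated with a loop that commits once per time window instead of once per item. This Python sketch is not the framework's implementation. The function name, the `run_duration` parameter, and the injected `clock` are assumptions made so the behavior is easy to see.

```python
def run_batched(items, process, commit, run_duration, clock):
    """Process items, committing once per run_duration window.

    run_duration == 0 commits after every item (lowest latency); a larger
    window trades latency for fewer, bigger commits (higher throughput).
    """
    batch, window_start = [], clock()
    for item in items:
        batch.append(process(item))
        if clock() - window_start >= run_duration:
            commit(batch)
            batch, window_start = [], clock()
    if batch:
        commit(batch)  # flush whatever is left at the end of the run
```

With a window of zero every item is committed on its own. With a long window the same items are committed together in a single batch.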
Security
→ Control Plane
  – Pluggable authentication
    • 2-Way TLS/SSL, LDAP, Kerberos
  – Pluggable authorization with multi-tenancy
    • NiFi Policy Based Authorizer
    • Apache Ranger Authorizer
  – Audit trail of all user actions
→ Data Plane
  – Optional 2-Way TLS/SSL between cluster nodes
  – Optional 2-Way TLS/SSL on Site-To-Site connections (NiFi-to-NiFi)
  – Encryption/Decryption of data through processors
  – Provenance for audit trail of data
Extensibility
→ Built from the ground up with extensions in mind
→ Service-loader pattern for…
  • Processors
  • Controller Services
  • Reporting Tasks
→ Extensions packaged as NiFi Archives (NARs)
  • Deploy to NiFi lib directory and restart
  • Provides ClassLoader isolation
  • Same model as standard components
Architecture - Standalone
[Diagram: an OS/Host running a JVM that contains the Web Server, the Flow Controller, and Processor 1 … Extension N, backed by the FlowFile Repository, Content Repository, and Provenance Repository on Local Storage]
→ FlowFile Repository
  – Write Ahead Log
  – State of every FlowFile
  – Pointers to content repository (pass-by-reference)
→ Content Repository
  – FlowFile content
  – Copy-on-write
→ Provenance Repository
  – Write Ahead Log + Lucene Indexes
  – Store & search lineage events
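Pass-by-reference and copy-on-write can be shown with a toy content store in which FlowFile records hold only a claim id. The Python below is a simplified sketch. The class and method names are hypothetical, and the real repositories are disk-backed rather than in-memory.

```python
import itertools


class ContentRepository:
    """Stores content under immutable claim ids."""

    def __init__(self):
        self._claims = {}
        self._ids = itertools.count(1)

    def write(self, data: bytes) -> int:
        claim_id = next(self._ids)
        self._claims[claim_id] = bytes(data)
        return claim_id

    def read(self, claim_id: int) -> bytes:
        return self._claims[claim_id]


class FlowFileRecord:
    """Holds attributes plus a pointer to content, never the content itself."""

    def __init__(self, attributes, claim_id):
        self.attributes = dict(attributes)
        self.claim_id = claim_id

    def clone(self):
        # Pass-by-reference: the clone shares the same content claim.
        return FlowFileRecord(self.attributes, self.claim_id)

    def with_content(self, repo, data):
        # Copy-on-write: new content gets a new claim; the old one is untouched.
        return FlowFileRecord(self.attributes, repo.write(data))
```

Cloning a record is cheap because no content is copied, and modifying the clone leaves the original content readable under its old claim.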
Architecture - Cluster
[Diagram: three OS/Host nodes, each running a JVM with a Web Server, a Flow Controller, Processor 1 … Extension N, and FlowFile, Content, and Provenance Repositories on Local Storage, all connected to ZooKeeper]
→ Same dataflow on each node, data partitioned across cluster
→ Access the UI from any node
→ ZooKeeper for auto-election of Cluster Coordinator & Primary Node
→ Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
→ Primary Node for scheduling processors on a single node
Site-To-Site
→ Direct communication between two NiFi instances
→ Push to Input Port on receiver, or Pull from Output Port on source
→ Communicate between clusters, standalone instances, or both
→ Handles load balancing and reliable delivery
→ Secure connections using certificates (optional)
→ Communicate over TCP or HTTP
Site-To-Site Push Model
→ Source connects Remote Process Group to Input Port on destination
→ Site-To-Site takes care of load balancing across the nodes in the cluster
[Diagram: a Standalone NiFi with a Remote Process Group (RPG) pushes to an Input Port on each of Node 1, Node 2, and Node 3 of a NiFi Cluster]
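As a rough picture of the load balancing in the push model, the Python sketch below spreads FlowFiles across the cluster's input ports in round-robin order. Round-robin is an assumption chosen for simplicity. The slides do not say which distribution strategy Site-To-Site uses, and the function name is hypothetical.

```python
import itertools


def distribute(flowfiles, input_ports):
    """Assign each FlowFile to a destination node in round-robin order."""
    assignments = {port: [] for port in input_ports}
    for flowfile, port in zip(flowfiles, itertools.cycle(input_ports)):
        assignments[port].append(flowfile)
    return assignments


plan = distribute(["f1", "f2", "f3", "f4"], ["node1", "node2", "node3"])
```

Here `plan` sends "f1" and "f4" to node1, "f2" to node2, and "f3" to node3.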
Site-To-Site Pull Model
→ Destination connects Remote Process Group to Output Port on the source
→ If source was a cluster, each node would pull from each node in cluster
[Diagram: Node 1, Node 2, and Node 3 of a NiFi Cluster each use a Remote Process Group (RPG) to pull from an Output Port on a Standalone NiFi]
Introduction to Apache MiNiFi
Apache MiNiFi
→ Sub-project of Apache NiFi
→ Created to more effectively collect data at the edge
→ Smaller footprint, run where the JVM can’t
→ Design & Deploy vs. Command & Control
MiNiFi Distributions
→ Java
  – <40MB binary distribution
  – Requires Java 1.8
  – More feature complete
  – Targeted for any systems that can run a JVM (i.e. Servers, Raspberry Pi)
→ C++
  – 600KB code size and static data ~50KB
  – Dynamic heap of ~1MB based on use-case
  – Targeted for resource constrained environments (i.e. edge IoT devices)
→ Both use same config format and use NiFi terminology
Different focuses depending on requirements
MiNiFi Java
[Diagram: MiNiFi is made up of the NiFi Framework and Components; NiFi is made up of the NiFi Framework, Components, and the User Interface]
MiNiFi Java
→ Uses same NAR structure as NiFi
→ Use any NAR from NiFi with MiNiFi Java
→ NiFi standard processors are bundled by default
  – TailLog
  – UpdateAttribute
  – Route on content and attributes
  – PutEmail
  – ….
MiNiFi C++
→ Initial set of processors
  – TailFile
  – GetFile
  – GenerateFlowFile
  – LogAttribute
  – ListenSyslog
→ Site to Site Client implementation in C++ for talking to NiFi instances
Design & Deploy
Same approach for Java & C++…
1. Design a flow in NiFi UI
2. Export template to XML file
3. Run MiNiFi Toolkit to convert NiFi template to MiNiFi YAML
4. Deploy config.yml to MiNiFi instances
Initially targeting flows like…
1. GetFile/TailFile
2. Routing Decision
3. Site-To-Site back to core NiFi
Simple config.yml
Tail a rolling file -> Site to Site
MiNiFi Command and Control
→ Design Flow at a centralized place, deploy on the edge
→ Version control of flows
  – Align with NiFi SDLC work
→ Agent status monitoring
→ Bi-directional command and control
Currently a feature proposal, initial version being architected
https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control
Demo!
Demo Scenario
[Diagram: two Raspberry Pis, each with a Temp/Humidity Sensor and MiNiFi Java, connect via site-to-site to NiFi, followed by Solr and Banana]
Questions?
Learn more and join us!
Apache NiFi site: http://nifi.apache.org
Subproject MiNiFi site: http://nifi.apache.org/minifi/
Subscribe to and collaborate at: [email protected] [email protected]
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI https://issues.apache.org/jira/browse/MINIFI
Follow us on Twitter: @apachenifi
Thank you!