Hands-on with Apache NiFi and MiNiFi - Berlin Buzzwords · PDF fileHands-on with Apache NiFi...
Transcript of Hands-on with Apache NiFi and MiNiFi - Berlin Buzzwords · PDF fileHands-on with Apache NiFi...
3 ©HortonworksInc.2011– 2016.AllRightsReserved
SimplisticViewofDataflows:Easy,Definitive
AcquireData
StoreData
DataFlow
ProcessAnalyzeData
4 ©HortonworksInc.2011– 2016.AllRightsReserved
RealisticViewofDataflows:Complex,Convoluted
AcquireData
StoreData
AcquireData
StoreData
StoreData
StoreData
StoreData
Processand
AnalyzeData
DataFlow
AcquireData
AcquireData
5 ©HortonworksInc.2011– 2016.AllRightsReserved
Movingdataeffectivelyishard
Standards:http://xkcd.com/927/
7 ©HortonworksInc.2011– 2016.AllRightsReserved
• Web-based User Interface for creating, monitoring, & controlling data flows
• Directed graphs of data routing and transformation
• Highly configurable - modify data flow at runtime, dynamically prioritize data
• Easily extensible through development of custom components
• Data Provenance tracks data through entire system
[1]https://nifi.apache.org/
Apache NiFiDataflow
8 ©HortonworksInc.2011– 2016.AllRightsReserved
NiFi - Terminologyà FlowFile
• Unitofdatamovingthroughthesystem• Content+Attributes(key/valuepairs)
à Processor• Performs thework,canaccessFlowFiles
à Connection• Linksbetweenprocessors• Queuesthatcanbedynamicallyprioritized
à ProcessGroup• Setofprocessorsandtheirconnections• Receivedataviainputports,senddataviaoutputports
9 ©HortonworksInc.2011– 2016.AllRightsReserved
FlowFiles arelikeHTTPdataHTTPData FlowFile
HTTP/1.1200OKDate:Sun,10Oct201023:26:07GMTServer:Apache/2.2.8(CentOS)OpenSSL/0.9.8gLast-Modified:Sun,26Sep201022:04:35GMTETag:"45b6-834-49130cc1182c0"Accept-Ranges:bytesContent-Length:13Connection:closeContent-Type:text/html
Helloworld!
StandardFlowFile AttributesKey:'entryDate’ Value:'FriJun1717:15:04EDT2016'Key:'lineageStartDate’Value:'FriJun1717:15:04EDT2016'Key:'fileSize’ Value:'23609'FlowFile AttributeMapContentKey:'filename’ Value:'15650246997242'Key:'path’ Value:'./’
BinaryContent*
Header
Content
10 ©HortonworksInc.2011– 2016.AllRightsReserved
FlowFiles &DataAgnosticism
à NiFi isdataagnostic!
à But,NiFi wasdesignedunderstandingthatusers
cancareaboutspecificsandprovidestooling
tointeractwithspecificformats,protocols,etc.
ISO8601- http://xkcd.com/1179/
Robustnessprinciple
Beconservativeinwhatyoudo,beliberalinwhatyouacceptfromothers
11 ©HortonworksInc.2011– 2016.AllRightsReserved
NiFi - Terminologyà FlowFile
• Unitofdatamovingthroughthesystem• Content+Attributes(key/valuepairs)
à Processor• Performsthework,canaccessFlowFiles
à Connection• Linksbetweenprocessors• Queuesthatcanbedynamicallyprioritized
à ProcessGroup• Setofprocessorsandtheirconnections• Receivedataviainputports,senddataviaoutputports
12 ©HortonworksInc.2011– 2016.AllRightsReserved
NiFi isbasedonFlowBasedProgramming(FBP)
FBPTerm NiFi Term DescriptionInformationPacket
FlowFile Each objectmovingthroughthesystem.
Black Box FlowFileProcessor
Performsthework, doingsomecombinationofdatarouting,transformation,ormediationbetweensystems.
BoundedBuffer
Connection Thelinkage betweenprocessors, actingasqueuesandallowingvariousprocessestointeractatdifferingrates.
Scheduler FlowController
Maintainstheknowledgeofhowprocessesareconnected, andmanagesthethreadsandallocationsthereofwhichallprocessesuse.
Subnet ProcessGroup
Asetofprocessesandtheirconnections,whichcanreceiveandsenddataviaports.Aprocess groupallowscreationofentirelynewcomponentsimplybycompositionofits components.
14 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFi
KeyFeaturesandPrinciples
• Guaranteeddelivery• Databuffering
- Backpressure- Pressurerelease
• Prioritizedqueuing• FlowspecificQoS
- Latencyvs.throughput- Losstolerance
• Dataprovenance
• Recovery/recordingarollinglogoffine-grainedhistory
• Visualcommandandcontrol
• Flowtemplates• Pluggable/multi-role
security• Designedforextension• Clustering
15 ©HortonworksInc.2011– 2016.AllRightsReserved
Theneedfordataprovenance
For Operators• Traceability, lineage• Recovery and replayFor Compliance• Audit trail• RemediationFor Business / Mission• Value sources • Value IT investment
BEGIN
ENDLINEAGE
16 ©HortonworksInc.2011– 2016.AllRightsReserved
DataProvenance(NotjustLineage)
• Viewattributesandcontentatgivenpointsintime(beforeandaftereachprocessor)!!!
• Records,indexes,andmakeseventsavailablefordisplay
17 ©HortonworksInc.2011– 2016.AllRightsReserved
Provenance
§ Constrained§ High-latency§ Localizedcontext
§ Hybrid– cloud/on-premises§ Low-latency§ Globalcontext
SOURCES REGIONALINFRASTRUCTURE
COREINFRASTRUCTURE
Origin– attributionReplay– recovery
EvolutionoftopologiesLongretention
TypesofLineage• Event(runtime)• Configuration(designtime)
18 ©HortonworksInc.2011– 2016.AllRightsReserved
Theneedforfine-grainedsecurityandcompliance
It’s not enough to say you have encrypted communications• Enterprise authorization
services –entitlements change often
• People and systems with different roles require difference access levels
• Tagged/classified data
19 ©HortonworksInc.2011– 2016.AllRightsReserved
Security
§ Constrained§ High-latency§ Localizedcontext
§ Hybrid– cloud/on-premises§ Low-latency§ Globalcontext
SOURCES REGIONALINFRASTRUCTURE
COREINFRASTRUCTURE
Sign,encrypt,static(dataandcontrol)
TLS,obfuscation,dynamicentitlementsKerberos,PKI,AD/DS,etc.
20 ©HortonworksInc.2011– 2016.AllRightsReserved
BackPressure
• Configureback-pressureperconnection• BasedonnumberofFlowFilesortotalsizeof
FlowFiles• Upstreamprocessornolongerscheduledtorun
untilbelowthreshold
21 ©HortonworksInc.2011– 2016.AllRightsReserved
Prioritization
• Configure a prioritizer per connection• Determine what is important for your
data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
22 ©HortonworksInc.2011– 2016.AllRightsReserved
Latency vs. Throughput
• Choose between lower latency, or higher throughput on each processor• Higher throughput allows framework to batch together all operations for
the selected amount of time for improved performance• Processor developer determines whether to support this by using
@SupportsBatching annotation
23 ©HortonworksInc.2011– 2016.AllRightsReserved
Extension/IntegrationPoints
NiFi Term DescriptionFlow FileProcessor
Push/Pull behavior.CustomUI
ReportingTask
Used topushdatafromNiFi tosomeexternalservice(metrics,provenance,etc..)
ControllerService
Usedtoenablereusablecomponents/ sharedservicesthroughouttheflow
RESTAPI Allowsclientstoconnecttopullinformation,changebehavior,etc..
25 ©HortonworksInc.2011– 2016.AllRightsReserved
NiFiPositioning
ApacheNiFi/MiNiFi
ETL(Informatica,etc.)
EnterpriseServiceBus(Fuse,Mule,etc.)
MessagingBus
(Kafka,MQ,etc.)
ProcessingFramework(Storm,Spark,etc.)
26 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFi/ProcessingFrameworks
NiFiSimpleeventprocessing• Primarilyfeeddataintoprocessing
frameworks,canprocessdata,withafocusonsimpleeventprocessing
• Operateonasinglepieceofdata,orincorrelationwithanenrichmentdataset(enrichment,parsing,splitting,andtransformations)
• Canscaleout,butscaleupbettertotakefulladvantageofhardwareresources,runconcurrentprocessingtasks/threads(processingterabytesofdataperdayonasinglenode)
⚠ Notanotherdistributedprocessingframework,buttofeeddataintothose
ProcessingFrameworks(Storm,Spark,etc.)Complexanddistributedprocessing• Complexprocessingfrommultiplestreams(JOIN
operations)
• Analyzingdataacrosstimewindows(rollingwindowaggregation,standarddeviation,etc.)
• Scaleouttothousandsofnodesifneeded
⚠ Notdesignedtocollectdataormanagedataflow
27 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFi/MessagingBusServices
NiFiProvidedataflowsolution• Centralizedmanagement,fromedgetocore
• Greattraceability,eventleveldataprovenancestartingwhendataisborn
• Interactivecommandandcontrol– realtimeoperationalvisibility
• Dataflowmanagement,includingprioritization,backpressure,andedgeintelligence
• Visualrepresentationofglobaldataflow
⚠ Notamessagingbus,flowmaintenanceneededwhenyouhavefrequentconsumersideupdates
MessagingBus(Kafka,JMS,etc.)Providemessagingbusservice• Lowlatency
• Greatdatadurability
• Decentralizedmanagement(producers&consumers)
• Lowbrokermaintenancefordynamicconsumersideupdates
⚠ Notdesignedtosolvedataflowproblems(prioritization,edgeintelligence,etc.)
⚠ Traceabilitylimitedtoin/outoftopics,nolineage
⚠ Lackofglobalviewofcomponents/connectivities
28 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFi/Integration,oringestion,Frameworks
NiFiEnduserfacingdataflowmanagementtool• Outoftheboxsolutionfordataflow
management
• Interactivecommandandcontrolinthecore,designanddeployontheedge
• Flexiblefailurehandlingateachpointoftheflow
• Visualrepresentationofglobaldataflowandconnectivities
• Nativecrossdatacentercommunication
• Dataprovenancefortraceability
⚠ Notalibrarytobeembeddedinotherapplications
Integrationframework(SpringIntegration,Camel,etc),ingestionframework(Flume,etc)Developerfacingintegrationtoolwithafocusondataingestion• Asetoftoolstoorchestrateworkflow
• Afixeddesignanddeploypattern
• Leveragemessagingbusacrossdisconnectednetworks
⚠ Developerfacing,customcodingneededtooptimize
⚠ Pre-builtfailurehandling,lackofflexibility
⚠ Noholisticviewofglobaldataflow
⚠ Nobuilt-indatatraceability
29 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFi/ETLTools
NiFiNOTschemadependent• Dataflowmanagementforbothstructuredand
unstructureddata,poweredbyseparationofmetadataandpayload
• Schemaisnotrequired,butyoucanhaveschema
• Minimummodelingeffort,justenoughtomanagedataflows
• Dotheplumbingjob,maximizedevelopers’brainpowerforcreativework
⚠ NotdesignedtodoheavyliftingtransformationworkforDBtables(JOINdatasets,etc.).Youcancreatecustomprocessorstodothat,butlongwaytogotocatchupwithexistingETLtoolsfromuserexperienceperspective(GUIfordatawrangling,cleansing,etc.)
ETL(Informatica,etc.)Schemadependent• TailoredforDatabases/WH
• ETLoperationsbasedonschema/datamodeling
• Highlyefficient,optimizedperformance
⚠ Mustpre-prepareyourdata,timeconsumingtobuilddatamodeling,andmaintainschemas
⚠ Notgearedtowardshandlingunstructureddata,PDF,Audio,Video,etc.
⚠ Notdesignedtosolvedataflowproblems
31 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheMiNiFi
à NiFi livesinthedatacenter.Giveitanenterpriseserveroraclusterofthem.
à MiNiFi livesasclosetowheredataisbornandisaguestonthatdeviceorsystem
“LetmegetthekeypartsofNiFi closetowheredatabeginsandprovidebidirectionaldatatransfer"
32 ©HortonworksInc.2011– 2016.AllRightsReserved
Realitiesofcomputingoutsidethecomfortsofthedatacenterà Limited
– computingcapability– power/network
à Restrictedsoftwarelibrary/platformavailability
à NoUI
à Physicallyinaccessible
à Notfrequentlyupdated
à Competingstandards/protocols
à Scalability
à Privacy&Security
33 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheNiFiMiNiFiKeyFeatures
• Guaranteeddelivery• Databuffering
- Backpressure- Pressurerelease
• Prioritizedqueuing• FlowspecificQoS
- Latencyvs.throughput- Losstolerance
• Dataprovenance
• Recovery/recordingarollinglogoffine-grainedhistory
• Designedforextension
• DesignandDeploy• Warmre-deploys
34 ©HortonworksInc.2011– 2016.AllRightsReserved
MiNiFi:PrecedentfromNiFi
à ProvidesthesemanticsbetweentwoNiFi componentsacrossnetworkboundaries– Acustomprotocolforinter-NiFi communication– Secure,Extensible,LoadBalanced&ScalableDeliverytoCluster
à ExtractedouttoaclientlibrarywhichpowersintegrationintopopularframeworkslikeApacheSpark,ApacheStorm,ApacheFlink,andApacheApex
à AttributesandtheFlowFile formatmaintained
AquicklookatNiFi SitetoSite
https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site
35 ©HortonworksInc.2011– 2016.AllRightsReserved
MiNiFi:PrecedentfromNiFi
à Fine-grained,eventlevelaccessofinteractionswithFlowFiles– CREATE,RECEIVE,FETCH,SEND,DOWNLOAD,DROP,EXPIRE,FORK,JOIN…
à Capturestheassociatedattributes/metadataatthetimeoftheevent
à AmapofaFlowFile’s journeyandhowtheyrelatetootherFlowFiles inasystem– MiNiFi enablesustogetmoreandfurtherilluminatethemapofdataprocessing
Adeeperdiveintoprovenance
http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance
37 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheMiNiFi
à Thefeedbackloopislongerandnotguaranteed– RemovalofWebServerandUI
à Declarativeconfiguration– LendsitselfwelltoCMprocesses– Extensibleinterfacetosupportvaryingformats
• CurrentlyprovidedinYAML
à Reducedsetofbundledcomponents
DeparturesfromNiFi ingettingtherightfit
38 ©HortonworksInc.2011– 2016.AllRightsReserved
ApacheMiNiFi:Scoping
à Gosmall:Java–Writeonce,runanywhere*– FeatureparityandreuseofcoreNiFi libraries
à Gosmaller:C++–Writeonce**,runanywhere
à Gosmallest:Writen-manytimes,runanywhereLanguagelibrariestosupporttagging,FlowFile format,SitetoSiteprotocol,andprovenancegenerationwithoutaprocessingframework– Mobile:Android&iOS– LanguageSDKs
ProvideallthekeyprinciplesofNiFi invarying,smallerfootprints
39 ©HortonworksInc.2011– 2016.AllRightsReserved
à ConstrainedÃHigh-latencyà Localizedcontext
ÃHybrid– cloud/on-premisesà Low-latencyÃGlobalcontext
CoreInfrastructure
HarnessingDatainMotion
RegionalInfrastructure
Sources
41 ©HortonworksInc.2011– 2016.AllRightsReserved
Learnmoreandjoinus!
Apache NiFi sitehttps://nifi.apache.org
Subproject MiNiFi sitehttps://nifi.apache.org/minifi/
Subscribe to and collaborate [email protected]@nifi.apache.org
Submit Ideas or Issueshttps://issues.apache.org/jira/browse/NIFIhttps://issues.apache.org/jira/browse/MINIFI
Follow on Twitter@apachenifi