Hands-on with Apache NiFi and MiNiFi

Andrew Psaltis - @itmdata

Berlin Buzzwords 2017

© Hortonworks Inc. 2011–2016. All Rights Reserved

Dataflow and the associated problems

Simplistic View of Dataflows: Easy, Definitive

[Diagram: Acquire Data, Data Flow, Store Data, Process and Analyze Data]

Realistic View of Dataflows: Complex, Convoluted

[Diagram: multiple Acquire Data and Store Data nodes around Data Flow and Process and Analyze Data]

Moving data effectively is hard

Standards: http://xkcd.com/927/


Apache NiFi


• Web-based User Interface for creating, monitoring, & controlling data flows

• Directed graphs of data routing and transformation

• Highly configurable - modify data flow at runtime, dynamically prioritize data

• Easily extensible through development of custom components

• Data Provenance tracks data through entire system

[1] https://nifi.apache.org/

Apache NiFi Dataflow

NiFi - Terminology

→ FlowFile
  • Unit of data moving through the system
  • Content + Attributes (key/value pairs)

→ Processor
  • Performs the work, can access FlowFiles

→ Connection
  • Links between processors
  • Queues that can be dynamically prioritized

→ Process Group
  • Set of processors and their connections
  • Receive data via input ports, send data via output ports

FlowFiles are like HTTP data

HTTP Data

Header:
HTTP/1.1 200 OK
Date: Sun, 10 Oct 2010 23:26:07 GMT
Server: Apache/2.2.8 (CentOS) OpenSSL/0.9.8g
Last-Modified: Sun, 26 Sep 2010 22:04:35 GMT
ETag: "45b6-834-49130cc1182c0"
Accept-Ranges: bytes
Content-Length: 13
Connection: close
Content-Type: text/html

Content:
Hello world!

FlowFile

Header:
Standard FlowFile Attributes
Key: 'entryDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'lineageStartDate' Value: 'Fri Jun 17 17:15:04 EDT 2016'
Key: 'fileSize' Value: '23609'
FlowFile Attribute Map Content
Key: 'filename' Value: '15650246997242'
Key: 'path' Value: './'

Content:
Binary Content *
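The header/content split above can be sketched in a few lines of Python. This is only an illustration of the idea (an attribute map travelling alongside opaque bytes), not NiFi's actual FlowFile API; the class name, the `with_attribute` helper, and the sample values are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass(frozen=True)
class FlowFile:
    """Illustrative stand-in: an attribute map (the "header") plus opaque content."""
    attributes: Dict[str, str] = field(default_factory=dict)
    content: bytes = b""

    def with_attribute(self, key: str, value: str) -> "FlowFile":
        # Return a new FlowFile with one more attribute; the content bytes
        # are carried over untouched, whatever format they happen to be in.
        return FlowFile({**self.attributes, key: value}, self.content)


flow_file = FlowFile(
    attributes={"filename": "15650246997242", "path": "./"},
    content=b"Hello world!",
)
tagged = flow_file.with_attribute("fileSize", str(len(flow_file.content)))
```

Routing and tagging decisions can then be made on `attributes` alone, without ever parsing `content`.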

FlowFiles & Data Agnosticism

→ NiFi is data agnostic!

→ But, NiFi was designed understanding that users can care about specifics and provides tooling to interact with specific formats, protocols, etc.

ISO 8601 - http://xkcd.com/1179/

Robustness principle

Be conservative in what you do, be liberal in what you accept from others


NiFi is based on Flow Based Programming (FBP)

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.
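To make the FBP vocabulary concrete, here is a toy flow in Python: processors as black boxes, connections as queues, and a controller that keeps scheduling processors while any of them has work. The class and method names (`Processor`, `FlowController`, `on_trigger`, `connect`) are chosen for this sketch; it runs single-threaded and ignores everything a real flow controller does around threads, persistence, and ports.

```python
from collections import deque


class Processor:
    """Black box: takes one item from its inbound queue, emits zero or more items."""

    def __init__(self, work):
        self.work = work
        self.inbound = deque()  # connection feeding this processor
        self.outbound = None    # connection to the next processor, if any

    def on_trigger(self):
        if not self.inbound:
            return False
        for result in self.work(self.inbound.popleft()):
            if self.outbound is not None:
                self.outbound.append(result)
        return True


class FlowController:
    """Scheduler: knows which processors exist and decides when they run."""

    def __init__(self):
        self.processors = []

    def connect(self, upstream, downstream):
        # The connection is simply the downstream processor's inbound queue.
        upstream.outbound = downstream.inbound
        for processor in (upstream, downstream):
            if processor not in self.processors:
                self.processors.append(processor)

    def run(self):
        # Trigger every processor once per round until nobody has work left.
        while any([p.on_trigger() for p in self.processors]):
            pass


collected = []
split = Processor(lambda text: text.split(","))
upper = Processor(lambda word: [word.upper()])
sink = Processor(lambda word: collected.append(word) or [])

controller = FlowController()
controller.connect(split, upper)
controller.connect(upper, sink)
split.inbound.append("a,b,c")
controller.run()
```

Because each processor only sees its own queues, the three stages could run at different rates; the queue between them absorbs the difference.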

Apache NiFi

Key Features and Principles

• Guaranteed delivery
• Data buffering
  - Back pressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery / recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable / multi-role security
• Designed for extension
• Clustering

The need for data provenance

For Operators
• Traceability, lineage
• Recovery and replay

For Compliance
• Audit trail
• Remediation

For Business / Mission
• Value sources
• Value IT investment

[Diagram labels: BEGIN, END, LINEAGE]

Data Provenance (Not just Lineage)

• View attributes and content at given points in time (before and after each processor)!!!

• Records, indexes, and makes events available for display

Provenance

[Diagram: SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE]

• Constrained
• High-latency
• Localized context

• Hybrid – cloud/on-premises
• Low-latency
• Global context

Origin – attribution
Replay – recovery

Evolution of topologies
Long retention

Types of Lineage
• Event (runtime)
• Configuration (design time)

The need for fine-grained security and compliance

It’s not enough to say you have encrypted communications

• Enterprise authorization services – entitlements change often

• People and systems with different roles require different access levels

• Tagged/classified data

Security

[Diagram: SOURCES, REGIONAL INFRASTRUCTURE, CORE INFRASTRUCTURE]

• Constrained
• High-latency
• Localized context

• Hybrid – cloud/on-premises
• Low-latency
• Global context

Sign, encrypt, static (data and control)

TLS, obfuscation, dynamic entitlements
Kerberos, PKI, AD/DS, etc.

Back Pressure

• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
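A rough Python sketch of the back-pressure rule: a connection with a count threshold and a size threshold, and an upstream step that is simply not run while either threshold is reached. The names (`Connection`, `is_full`, `run_upstream`) and the per-item check are simplifications assumed for this example, not how NiFi schedules processors internally.

```python
from collections import deque


class Connection:
    """Queue between two processors with count and size thresholds."""

    def __init__(self, max_count, max_bytes):
        self.queue = deque()
        self.max_count = max_count
        self.max_bytes = max_bytes
        self.queued_bytes = 0

    def is_full(self):
        # Back pressure applies as soon as either threshold is reached.
        return len(self.queue) >= self.max_count or self.queued_bytes >= self.max_bytes

    def put(self, content):
        self.queue.append(content)
        self.queued_bytes += len(content)

    def take(self):
        content = self.queue.popleft()
        self.queued_bytes -= len(content)
        return content


def run_upstream(connection, pending):
    """Run the upstream side only while the connection is below its thresholds."""
    produced = 0
    while pending and not connection.is_full():
        connection.put(pending.pop(0))
        produced += 1
    return produced


conn = Connection(max_count=2, max_bytes=1024)
pending = [b"a", b"b", b"c"]
first = run_upstream(conn, pending)   # stops once two items are queued
conn.take()                           # downstream drains one item
second = run_upstream(conn, pending)  # upstream may run again
```

Nothing is dropped while the connection is full; the pending data just waits upstream until the downstream side catches up.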

Prioritization

• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
• Develop your own prioritizer if needed
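A prioritizer is, at heart, an ordering over queued items. The Python sketch below shows one possible shape: a connection backed by a heap, with a pluggable function that turns attributes into a priority and arrival order as the tie-breaker. `PrioritizedConnection`, `by_priority_attribute`, and the default of 9 for a missing `priority` attribute are assumptions of this example.

```python
import heapq
from itertools import count


class PrioritizedConnection:
    """Queue that hands out items by priority, then by arrival order."""

    def __init__(self, priority_of):
        self._heap = []
        self._arrival = count()          # tie-breaker: earlier arrivals first
        self._priority_of = priority_of  # pluggable prioritizer function

    def put(self, attributes):
        entry = (self._priority_of(attributes), next(self._arrival), attributes)
        heapq.heappush(self._heap, entry)

    def take(self):
        return heapq.heappop(self._heap)[2]


def by_priority_attribute(attributes):
    # Lower number means more important; items without the attribute go last.
    return int(attributes.get("priority", "9"))


queue = PrioritizedConnection(by_priority_attribute)
queue.put({"filename": "bulk.csv", "priority": "5"})
queue.put({"filename": "alert.json", "priority": "1"})
queue.put({"filename": "late.csv", "priority": "5"})
order = [queue.take()["filename"] for _ in range(3)]
```

Swapping in a different function (for example one keyed on a timestamp attribute) changes the policy without touching the queue.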

Latency vs. Throughput

• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation
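The trade-off can be illustrated with a small Python sketch: results are committed once per time window rather than once per item, so a window of zero means lowest latency and a longer window means fewer, larger commits. This is not how NiFi implements batching; `run_batched`, `make_fake_clock`, and the fake clock that advances one unit per reading are assumptions made so the example is deterministic.

```python
def make_fake_clock(step):
    """Deterministic clock for the example: advances by `step` on every reading."""
    now = [0.0]

    def clock():
        now[0] += step
        return now[0]

    return clock


def run_batched(items, process, commit, run_duration, clock):
    """Process items, committing once per `run_duration` window instead of per item."""
    commits = 0
    batch = []
    window_start = clock()
    for item in items:
        batch.append(process(item))
        if clock() - window_start >= run_duration:
            commit(batch)
            commits += 1
            batch = []
            window_start = clock()
    if batch:
        # Flush whatever is left when the input runs out.
        commit(batch)
        commits += 1
    return commits


per_item = []
low_latency_commits = run_batched(range(6), str, per_item.append, 0, make_fake_clock(1.0))

batches = []
high_throughput_commits = run_batched(range(6), str, batches.append, 4, make_fake_clock(1.0))
```

With a window of 0 every item is committed on its own; with a window of 4 clock units the same six items are committed in two batches.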

Extension / Integration Points

NiFi Term | Description
Flow File Processor | Push/Pull behavior. Custom UI
Reporting Task | Used to push data from NiFi to some external service (metrics, provenance, etc.)
Controller Service | Used to enable reusable components / shared services throughout the flow
REST API | Allows clients to connect to pull information, change behavior, etc.


NiFi Positioning

NiFi Positioning

[Diagram: Apache NiFi / MiNiFi positioned relative to:]
• ETL (Informatica, etc.)
• Enterprise Service Bus (Fuse, Mule, etc.)
• Messaging Bus (Kafka, MQ, etc.)
• Processing Framework (Storm, Spark, etc.)

Apache NiFi / Processing Frameworks

NiFi
Simple event processing
• Primarily feed data into processing frameworks, can process data, with a focus on simple event processing
• Operate on a single piece of data, or in correlation with an enrichment dataset (enrichment, parsing, splitting, and transformations)
• Can scale out, but scale up better to take full advantage of hardware resources, run concurrent processing tasks/threads (processing terabytes of data per day on a single node)
⚠ Not another distributed processing framework, but to feed data into those

Processing Frameworks (Storm, Spark, etc.)
Complex and distributed processing
• Complex processing from multiple streams (JOIN operations)
• Analyzing data across time windows (rolling window aggregation, standard deviation, etc.)
• Scale out to thousands of nodes if needed
⚠ Not designed to collect data or manage data flow

Apache NiFi / Messaging Bus Services

NiFi
Provide dataflow solution
• Centralized management, from edge to core
• Great traceability, event level data provenance starting when data is born
• Interactive command and control – real time operational visibility
• Dataflow management, including prioritization, back pressure, and edge intelligence
• Visual representation of global dataflow
⚠ Not a messaging bus, flow maintenance needed when you have frequent consumer side updates

Messaging Bus (Kafka, JMS, etc.)
Provide messaging bus service
• Low latency
• Great data durability
• Decentralized management (producers & consumers)
• Low broker maintenance for dynamic consumer side updates
⚠ Not designed to solve dataflow problems (prioritization, edge intelligence, etc.)
⚠ Traceability limited to in/out of topics, no lineage
⚠ Lack of global view of components/connectivities

Apache NiFi / Integration, or ingestion, Frameworks

NiFi
End user facing dataflow management tool
• Out of the box solution for dataflow management
• Interactive command and control in the core, design and deploy on the edge
• Flexible failure handling at each point of the flow
• Visual representation of global dataflow and connectivities
• Native cross data center communication
• Data provenance for traceability
⚠ Not a library to be embedded in other applications

Integration framework (Spring Integration, Camel, etc.), ingestion framework (Flume, etc.)
Developer facing integration tool with a focus on data ingestion
• A set of tools to orchestrate workflow
• A fixed design and deploy pattern
• Leverage messaging bus across disconnected networks
⚠ Developer facing, custom coding needed to optimize
⚠ Pre-built failure handling, lack of flexibility
⚠ No holistic view of global dataflow
⚠ No built-in data traceability

Apache NiFi / ETL Tools

NiFi
NOT schema dependent
• Dataflow management for both structured and unstructured data, powered by separation of metadata and payload
• Schema is not required, but you can have schema
• Minimum modeling effort, just enough to manage dataflows
• Do the plumbing job, maximize developers’ brainpower for creative work
⚠ Not designed to do heavy lifting transformation work for DB tables (JOIN datasets, etc.). You can create custom processors to do that, but long way to go to catch up with existing ETL tools from user experience perspective (GUI for data wrangling, cleansing, etc.)

ETL (Informatica, etc.)
Schema dependent
• Tailored for Databases/WH
• ETL operations based on schema/data modeling
• Highly efficient, optimized performance
⚠ Must pre-prepare your data, time consuming to build data modeling, and maintain schemas
⚠ Not geared towards handling unstructured data, PDF, Audio, Video, etc.
⚠ Not designed to solve dataflow problems


Apache MiNiFi

Apache MiNiFi

→ NiFi lives in the data center. Give it an enterprise server or a cluster of them.

→ MiNiFi lives as close to where data is born and is a guest on that device or system

"Let me get the key parts of NiFi close to where data begins and provide bidirectional data transfer"

Realities of computing outside the comforts of the data center

→ Limited
  – computing capability
  – power/network
→ Restricted software library / platform availability
→ No UI
→ Physically inaccessible
→ Not frequently updated
→ Competing standards/protocols
→ Scalability
→ Privacy & Security

Apache NiFi MiNiFi

Key Features

• Guaranteed delivery
• Data buffering
  - Back pressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery / recording a rolling log of fine-grained history
• Designed for extension
• Design and Deploy
• Warm re-deploys

MiNiFi: Precedent from NiFi

A quick look at NiFi Site to Site

→ Provides the semantics between two NiFi components across network boundaries
  – A custom protocol for inter-NiFi communication
  – Secure, Extensible, Load Balanced & Scalable Delivery to Cluster
→ Extracted out to a client library which powers integration into popular frameworks like Apache Spark, Apache Storm, Apache Flink, and Apache Apex
→ Attributes and the FlowFile format maintained

https://nifi.apache.org/docs/nifi-docs/html/user-guide.html#site-to-site

MiNiFi: Precedent from NiFi

A deeper dive into provenance

→ Fine-grained, event level access of interactions with FlowFiles
  – CREATE, RECEIVE, FETCH, SEND, DOWNLOAD, DROP, EXPIRE, FORK, JOIN…
→ Captures the associated attributes/metadata at the time of the event
→ A map of a FlowFile’s journey and how they relate to other FlowFiles in a system
  – MiNiFi enables us to get more and further illuminate the map of data processing

http://nifi.apache.org/docs/nifi-docs/html/user-guide.html#data-provenance
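As a sketch of what "event level" provenance means, the Python below records one event per interaction, snapshots the attributes at that moment, and follows parent links to rebuild a FlowFile's journey. The event type names come from the list above; everything else (`ProvenanceLog`, `record`, `lineage`, storing parent ids on the child's event) is a simplification assumed for this example rather than NiFi's provenance repository.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str                     # e.g. RECEIVE, FORK, SEND
    flowfile_id: str
    attributes: Dict[str, str]          # snapshot taken at the time of the event
    parent_ids: Tuple[str, ...] = ()    # FlowFiles this one was derived from


class ProvenanceLog:
    def __init__(self):
        self.events = []

    def record(self, event_type, flowfile_id, attributes, parent_ids=()):
        # Copy the attributes so later changes do not rewrite history.
        event = ProvenanceEvent(event_type, flowfile_id, dict(attributes), tuple(parent_ids))
        self.events.append(event)

    def lineage(self, flowfile_id):
        """Return, oldest first, every event for this FlowFile and its ancestors."""
        related = set()
        stack = [flowfile_id]
        while stack:
            current = stack.pop()
            if current in related:
                continue
            related.add(current)
            for event in self.events:
                if event.flowfile_id == current:
                    stack.extend(event.parent_ids)
        return [event for event in self.events if event.flowfile_id in related]


log = ProvenanceLog()
attrs = {"filename": "in.csv"}
log.record("RECEIVE", "a", attrs)
attrs["filename"] = "renamed.csv"
log.record("FORK", "b", attrs, parent_ids=["a"])
log.record("SEND", "b", attrs)
log.record("RECEIVE", "x", {"filename": "other.csv"})

journey = [event.event_type for event in log.lineage("b")]
```

The unrelated FlowFile `x` stays out of the lineage of `b`, and the first event still shows the filename as it was when the data arrived.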

MiNiFi: Precedent from NiFi

[Figure: RECEIVE event]

Apache MiNiFi

Departures from NiFi in getting the right fit

→ The feedback loop is longer and not guaranteed
  – Removal of Web Server and UI
→ Declarative configuration
  – Lends itself well to CM processes
  – Extensible interface to support varying formats
    • Currently provided in YAML
→ Reduced set of bundled components

Apache MiNiFi: Scoping

Provide all the key principles of NiFi in varying, smaller footprints

→ Go small: Java – Write once, run anywhere*
  – Feature parity and reuse of core NiFi libraries
→ Go smaller: C++ – Write once**, run anywhere
→ Go smallest: Write n-many times, run anywhere
  Language libraries to support tagging, FlowFile format, Site to Site protocol, and provenance generation without a processing framework
  – Mobile: Android & iOS
  – Language SDKs

Harnessing Data in Motion

[Diagram: Sources, Regional Infrastructure, Core Infrastructure]

→ Constrained
→ High-latency
→ Localized context

→ Hybrid – cloud/on-premises
→ Low-latency
→ Global context


Demo

Learn more and join us!

Apache NiFi site
https://nifi.apache.org

Subproject MiNiFi site
https://nifi.apache.org/minifi/

Subscribe to and collaborate [email protected]@nifi.apache.org

Submit Ideas or Issues
https://issues.apache.org/jira/browse/NIFI
https://issues.apache.org/jira/browse/MINIFI

Follow on Twitter
@apachenifi

Thank You