Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Bryan Bende – Software Engineer @Hortonworks
Future of Data NY – December 5th 2016

Transcript of Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Page 1: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Bryan Bende – Software Engineer @Hortonworks
Future of Data NY – December 5th 2016

Page 2: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

© Hortonworks Inc. 2011 – 2016. All Rights Reserved

Agenda
• Problem Definition
• Introduction to Apache NiFi
• Introduction to Apache MiNiFi
• Demo!!
• Q&A

Page 3: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

About Me
• Software Engineer @ Hortonworks
• Apache NiFi PMC & Committer
• Working with NiFi since 2011
• Recent focus on integrations with Hadoop ecosystem
• [email protected] / Twitter @bbende / bryanbende.com
• Bethpage Class of 2001!

Page 4: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

The Problem

Page 5: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

It starts out so simple…
Team 1 / Team 2
"Hey! We have some important data to send you!"
"Cool! Your data is really important to us!"
This should be easy right?...

Page 6: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

But what about formats & protocols?
Team 1 / Team 2
"We can publish Avro records to a Kafka topic, does that work?"
"Oh, well we have a REST service that accepts JSON…"

Page 7: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

And what about security & authentication?
Team 1 / Team 2
"Hmm what about security? We can authenticate via Kerberos"
"Sorry, we only support 2-Way TLS with certificates"

Page 8: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

And what about all these devices at the edge?
Team 2
"We also need to grab data from all these devices, how are we going to do that?"

Page 9: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

And What About…
• Organizational Politics (my data)
• Brittle Connectivity
• Firewalls/Security Domains
• Partnerships bring new data / need different formats
• Data has to be masked for compliance purposes
• Where is this data even from?
• Data is in that other system – I need it over here
• Bandwidth between those sites is limited
• My Big Data system needs it in this other better/faster/stronger format
• What schema is that from?
• It needs to be enriched first!
• No not that reference set – this one!
• I didn’t even know that system existed

Page 10: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Ok so let’s fix this
• Enterprise Architecture – Standardize on
  • …format
  • …a schema (one that can evolve)
  • …a protocol
  • …an ontology
But now…
• Standard schema becomes complex
• Hard to agree on common changes
• Some teams stuck on older versions
• Productivity starts slowing…

Page 11: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Something to ponder – the disconnect is healthy
• Having Corporate Standards is a good thing.
• Innovation is a good thing.
Innovation often does not follow the Corporate Standard

Page 12: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

What is Dataflow Management?

Page 13: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Dataflow Management
The systematic process by which data is acquired from all producers and delivered to all consumers

Page 14: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Dataflow Management Considerations
• Promote Loosely Coupled Systems
  • Types of coupling: Format, Schema, Protocol, Priority, Size, Interest, …
• Promote Highly Cohesive Systems
  • Producers should focus on production (not the intricacies of consumption)
  • Consumers should focus on storage or processing (not the details of production)
• Provide Provenance
  • The who/what/when/where/why of data
  • Inter and Intra Process Latency
  • Enable enterprise version control for data

Page 15: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Dataflow Management Considerations
• Empower Understanding and Interaction
  • Ability to see the flow, safely and quickly iterate and experiment
  • Breaking production is bad – so too is not being able to evolve fast enough
• Secure
  • Bridge between security domains
  • Data Plane (transport)
  • Control Plane (C&C, Monitoring)
• Self Service
  • Centralized teams – hard to scale – slow turnaround times
  • Centralized systems – multi-tenant management works

Page 16: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

The role of messaging systems
• Reduce variables: Fix protocol, Data Size, Provide Buffering
• Historically not very fast or replayable: Apache Kafka solved that
• Strong solution within a controlled domain
• But numerous challenges remain
  • Topics do not separate key concerns between producer and consumer pairs such as
    § Authorization
    § Format
    § Schema
    § Interest
    § Prioritization
  • Flow control

Page 17: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Introduction to Apache NiFi

Page 18: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

The NSA Years
• Created in 2006
• Improved over eight years
• Simple Initial vision – Visio for real-time dataflow management
• Key Lessons Learned
  • What scale means – down, up, and out
  • The fearsome force known as Compliance Requirements
  • The power of provenance!
  • Operational best-practices and anti-patterns
• NSA donated the codebase to the ASF in late 2014

Page 19: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

NiFi Key Features
• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering

Page 20: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

NiFi Core Concepts
FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of entirely new component simply by composition of its components.
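
To make the FBP-to-NiFi mapping concrete, here is a minimal Python sketch of a FlowFile, a Connection, and a processor. It is not NiFi code: the class names mirror the terms above, and `uppercase_processor` and the `transformed` attribute are hypothetical names invented for the illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class FlowFile:
    """One object moving through the flow: attributes plus content."""
    content: bytes
    attributes: dict = field(default_factory=dict)


class Connection:
    """Queue linking two processors so they can run at different rates."""

    def __init__(self):
        self.queue = deque()

    def put(self, flowfile):
        self.queue.append(flowfile)

    def poll(self):
        return self.queue.popleft() if self.queue else None


def uppercase_processor(incoming, outgoing):
    """A 'black box' that transforms each queued FlowFile and passes it on."""
    while (ff := incoming.poll()) is not None:
        attributes = dict(ff.attributes, transformed="true")
        outgoing.put(FlowFile(ff.content.upper(), attributes))


source_to_proc, proc_to_sink = Connection(), Connection()
source_to_proc.put(FlowFile(b"hello", {"filename": "a.txt"}))
uppercase_processor(source_to_proc, proc_to_sink)
result = proc_to_sink.poll()
```

The processor only sees the two connections, not the components on either side of them, which is the loose coupling the FBP model is after.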

Page 21: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Visual Command & Control
• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows

Page 22: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Provenance/Lineage
• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
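
A rough sketch of how lineage can be followed across fan-out and fan-in, assuming each recorded event simply lists the ids of its parent FlowFiles. The `ProvenanceLog` class and the event labels are illustrative choices for this sketch, not NiFi's provenance API.

```python
import itertools
from dataclasses import dataclass

_ids = itertools.count(1)


@dataclass(frozen=True)
class ProvenanceEvent:
    event_type: str       # label chosen for this sketch, e.g. "CREATE"
    flowfile_id: int
    parent_ids: tuple = ()


class ProvenanceLog:
    def __init__(self):
        self.events = []

    def record(self, event_type, parent_ids=()):
        """Append an event for a new FlowFile and return that FlowFile's id."""
        event = ProvenanceEvent(event_type, next(_ids), tuple(parent_ids))
        self.events.append(event)
        return event.flowfile_id

    def lineage(self, flowfile_id):
        """Return the ids of every ancestor of the given FlowFile."""
        parents = {e.flowfile_id: e.parent_ids for e in self.events}
        seen, stack = set(), list(parents.get(flowfile_id, ()))
        while stack:
            current = stack.pop()
            if current not in seen:
                seen.add(current)
                stack.extend(parents.get(current, ()))
        return seen


log = ProvenanceLog()
original = log.record("CREATE")
part_a = log.record("FORK", [original])          # splitting: fan-out
part_b = log.record("FORK", [original])
merged = log.record("JOIN", [part_a, part_b])    # merging: fan-in
ancestors = log.lineage(merged)
```

Here `ancestors` contains both split parts and the original, so a merged result can be traced back to its source.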

Page 23: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Prioritization
• Configure a prioritizer per connection
• Determine what is important for your data – time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
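
As an illustration of a per-connection prioritizer, the sketch below orders queued FlowFiles by a numeric `priority` attribute and falls back to arrival order for ties. The class name, the dict-based FlowFiles, and the default priority of 100 are all assumptions made for the example.

```python
import heapq
import itertools


class PrioritizedConnection:
    """Hands out FlowFiles lowest priority value first, FIFO within a value."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def put(self, flowfile):
        # Assumption: a FlowFile is a dict carrying a numeric "priority".
        priority = int(flowfile.get("priority", 100))
        # The arrival counter breaks ties, so the dicts are never compared.
        heapq.heappush(self._heap, (priority, next(self._arrival), flowfile))

    def poll(self):
        return heapq.heappop(self._heap)[2] if self._heap else None


# Funnel two data sets into one connection so they are prioritized together.
funnel = PrioritizedConnection()
for ff in [
    {"name": "bulk-1", "priority": 9},
    {"name": "alert-1", "priority": 1},
    {"name": "bulk-2", "priority": 9},
    {"name": "alert-2", "priority": 1},
]:
    funnel.put(ff)

order = [funnel.poll()["name"] for _ in range(4)]
```

Both alerts come out ahead of the bulk data even though they arrived interleaved with it.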

Page 24: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Back-Pressure
• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
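
The scheduling rule above can be sketched as follows, assuming a connection with a count threshold and a size threshold. The names `max_count`, `max_bytes`, and `schedule_upstream` are hypothetical and stand in for the framework's own scheduler.

```python
from collections import deque


class Connection:
    def __init__(self, max_count=3, max_bytes=1024):
        self.queue = deque()
        self.max_count = max_count    # threshold on number of FlowFiles
        self.max_bytes = max_bytes    # threshold on total size of FlowFiles

    def is_full(self):
        total = sum(len(content) for content in self.queue)
        return len(self.queue) >= self.max_count or total >= self.max_bytes


def schedule_upstream(connection, pending):
    """Run the upstream producer only while the connection is below threshold."""
    produced = 0
    while pending and not connection.is_full():
        connection.queue.append(pending.pop(0))
        produced += 1
    return produced


conn = Connection(max_count=3)
backlog = [b"a", b"b", b"c", b"d", b"e"]
first_run = schedule_upstream(conn, backlog)    # stops once 3 are queued
conn.queue.popleft()                            # downstream consumes one
second_run = schedule_upstream(conn, backlog)   # upstream may run again
```

The upstream side is never told to slow down explicitly; it just stops being scheduled until the queue drains below the threshold.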

Page 25: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Latency vs. Throughput
• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation
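
The trade-off can be illustrated with a toy model in which each commit is the expensive step. This sketch batches by item count rather than by a time window, purely to keep it deterministic; `Session`, `run`, and `batch_size` are hypothetical names and not the NiFi processor API.

```python
class Session:
    """Counts how many commits (the expensive step) take place."""

    def __init__(self):
        self.commits = 0
        self.delivered = []

    def commit(self, batch):
        self.commits += 1
        self.delivered.extend(batch)


def run(flowfiles, session, batch_size):
    """batch_size=1 favors latency; larger values favor throughput."""
    batch = []
    for ff in flowfiles:
        batch.append(ff)
        if len(batch) >= batch_size:
            session.commit(batch)
            batch = []
    if batch:
        session.commit(batch)    # flush whatever is left over


low_latency, high_throughput = Session(), Session()
run(range(10), low_latency, batch_size=1)
run(range(10), high_throughput, batch_size=5)
```

Both sessions deliver the same ten items, but the batched one pays for two commits instead of ten, at the cost of each item waiting for its batch to fill.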

Page 26: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Security
• Control Plane
  – Pluggable authentication
    • 2-Way TLS/SSL, LDAP, Kerberos
  – Pluggable authorization with multi-tenancy
    • NiFi Policy Based Authorizer
    • Apache Ranger Authorizer
  – Audit trail of all user actions
• Data Plane
  – Optional 2-Way TLS/SSL between cluster nodes
  – Optional 2-Way TLS/SSL on Site-To-Site connections (NiFi-to-NiFi)
  – Encryption/Decryption of data through processors
  – Provenance for audit trail of data

Page 27: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Extensibility
• Built from the ground up with extensions in mind
• Service-loader pattern for…
  • Processors
  • Controller Services
  • Reporting Tasks
• Extensions packaged as NiFi Archives (NARs)
  • Deploy to NiFi lib directory and restart
  • Provides ClassLoader isolation
  • Same model as standard components

Page 28: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Architecture - Standalone
[Diagram labels: OS/Host, JVM, Flow Controller, Web Server, Processor 1, Extension N, FlowFile Repository, Content Repository, Provenance Repository, Local Storage]
• FlowFile Repository
  – Write Ahead Log
  – State of every FlowFile
  – Pointers to content repository (pass-by-reference)
• Content Repository
  – FlowFile content
  – Copy-on-write
• Provenance Repository
  – Write Ahead Log + Lucene Indexes
  – Store & search lineage events
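
A small sketch of the pass-by-reference and copy-on-write ideas, assuming an in-memory list stands in for the content repository and a dict for the FlowFile repository. The class and method names are illustrative only; the real repositories are disk-backed and considerably more involved.

```python
class ContentRepository:
    """Append-only store: content is never modified in place."""

    def __init__(self):
        self._claims = []

    def write(self, data: bytes) -> int:
        self._claims.append(data)
        return len(self._claims) - 1    # the claim id acts as the pointer

    def read(self, claim: int) -> bytes:
        return self._claims[claim]


class FlowFileRepository:
    """Tracks each FlowFile's attributes and its pointer to the content."""

    def __init__(self):
        self.state = {}

    def update(self, flowfile_id, attributes, claim):
        self.state[flowfile_id] = {"attributes": dict(attributes), "claim": claim}


content, flowfiles = ContentRepository(), FlowFileRepository()

claim = content.write(b"raw sensor reading")
flowfiles.update("ff-1", {"filename": "reading.txt"}, claim)

# Cloning passes the content by reference: both FlowFiles share one claim.
flowfiles.update("ff-2", {"filename": "copy.txt"}, claim)
shared = flowfiles.state["ff-1"]["claim"] == flowfiles.state["ff-2"]["claim"]

# Modifying ff-2 writes new content (copy-on-write); ff-1 is left untouched.
new_claim = content.write(content.read(claim).upper())
flowfiles.update("ff-2", {"filename": "copy.txt"}, new_claim)
```

Because content is only ever appended, moving or cloning a FlowFile costs a pointer update rather than a copy of the payload.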

Page 29: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Architecture - Cluster
[Diagram: three nodes, each labeled OS/Host, JVM, Flow Controller, Web Server, Processor 1, Extension N, FlowFile Repository, Content Repository, Provenance Repository, Local Storage; plus ZooKeeper]
• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node

Page 30: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Site-To-Site
• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
• Communicate over TCP or HTTP

Page 31: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Site-To-Site Push Model
• Source connects Remote Process Group to Input Port on destination
• Site-To-Site takes care of load balancing across the nodes in the cluster
[Diagram: Standalone NiFi (RPG) pushing to Input Ports on NiFi Cluster Nodes 1, 2, and 3]
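
The push model's load balancing can be pictured with the sketch below. It assumes plain round-robin across the destination nodes, which is a simplification chosen for this example and not a description of the actual Site-To-Site client. The node names and the `send` method are hypothetical.

```python
import itertools


class RemoteProcessGroup:
    """Client side of a push: spreads FlowFiles across the destination nodes."""

    def __init__(self, nodes):
        # Assumption: simple round-robin; a real client could weight by load.
        self._cycle = itertools.cycle(nodes)
        self.sent = {node: [] for node in nodes}

    def send(self, flowfile):
        node = next(self._cycle)
        self.sent[node].append(flowfile)    # stands in for a network transfer
        return node


rpg = RemoteProcessGroup(["node-1", "node-2", "node-3"])
for i in range(6):
    rpg.send(f"flowfile-{i}")
```

The sender addresses the cluster as a whole, and the distribution across Input Ports on individual nodes happens inside the client.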

Page 32: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Site-To-Site Pull Model
• Destination connects Remote Process Group to Output Port on the source
• If source was a cluster, each node would pull from each node in cluster
[Diagram: NiFi Cluster Nodes 1, 2, and 3 (each with an RPG) pulling from the Output Port on a Standalone NiFi]

Page 33: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Introduction to Apache MiNiFi

Page 34: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Apache MiNiFi
• Sub-project of Apache NiFi
• Created to more effectively collect data at the edge
• Smaller footprint, run where the JVM can’t
• Design & Deploy vs. Command & Control

Page 35: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

MiNiFi Distributions
• Java
  – <40MB binary distribution
  – Requires Java 1.8
  – More feature complete
  – Targeted for any systems that can run a JVM (e.g. Servers, Raspberry Pi)
• C++
  – 600KB code size and static data ~50KB
  – Dynamic heap of ~1MB based on use-case
  – Targeted for resource constrained environments (e.g. edge IoT devices)
• Both use same config format and use NiFi terminology
Different focuses depending on requirements

Page 36: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

MiNiFi Java
[Diagram: MiNiFi = NiFi Framework + Components; NiFi = NiFi Framework + User Interface + Components]

Page 37: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

MiNiFi Java
• Uses same NAR structure as NiFi
• Use any NAR from NiFi with MiNiFi Java
• NiFi standard processors are bundled by default
  – TailLog
  – UpdateAttribute
  – Route on content and attributes
  – PutEmail
  – ….

Page 38: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

MiNiFi C++
• Initial set of processors
  – TailFile
  – GetFile
  – GenerateFlowFile
  – LogAttribute
  – ListenSyslog
• Site to Site Client implementation in C++ for talking to NiFi instances

Page 39: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Design & Deploy
Same approach for Java & C++…
1. Design a flow in NiFi UI
2. Export template to XML file
3. Run MiNiFi Toolkit to convert NiFi template to MiNiFi YAML
4. Deploy config.yml to MiNiFi instances
Initially targeting flows like…
1. GetFile/TailFile
2. Routing Decision
3. Site-To-Site Back to core NiFi

Page 40: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Simple config.yml: Tail a rolling file -> Site to Site

Page 41: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

MiNiFi Command and Control
• Design Flow at a centralized place, deploy on the edge
• Version control of flows
  – Align with NiFi SDLC work
• Agent status monitoring
• Bi-directional command and control
Currently a feature proposal, initial version being architected
https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control

Page 42: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Demo!  

Page 43: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Demo Scenario
[Diagram: two Raspberry Pis, each with a Temp/Humidity Sensor and MiNiFi Java; site-to-site; NiFi; Solr; Banana]

Page 44: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Questions?

Page 45: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Learn more and join us!
Apache NiFi site: http://nifi.apache.org
Subproject MiNiFi site: http://nifi.apache.org/minifi/
Subscribe to and collaborate at: [email protected] [email protected]
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI https://issues.apache.org/jira/browse/MINIFI
Follow us on Twitter: @apachenifi

Page 46: Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Thank you!