Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Post on 12-Jan-2017

Transcript of Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi

Taking DataFlow Management to the Edge with Apache NiFi/MiNiFi
Bryan Bende, Software Engineer @Hortonworks
Future of Data NY, December 5th 2016

2   ©  Hortonworks  Inc.  2011  –  2016.  All  Rights  Reserved  

Agenda

• Problem Definition
• Introduction to Apache NiFi
• Introduction to Apache MiNiFi
• Demo!!
• Q&A

About Me

• Software Engineer @ Hortonworks
• Apache NiFi PMC & Committer
• Working with NiFi since 2011
• Recent focus on integrations with Hadoop ecosystem
• bbende@hortonworks.com / Twitter @bbende / bryanbende.com
• Bethpage Class of 2001!

The Problem

It starts out so simple…

Team 2
Hey! We have some important data to send you!
Cool! Your data is really important to us!
Team 1

This should be easy right?...

But what about formats & protocols?

Team 2
We can publish Avro records to a Kafka topic, does that work?
Oh, well we have a REST service that accepts JSON…
Team 1

And what about security & authentication?

Team 2
Hmm what about security? We can authenticate via Kerberos
Sorry, we only support 2-Way TLS with certificates
Team 1

And what about all these devices at the edge?

We also need to grab data from all these devices, how are we going to do that?
Team 2

And What About…

• Organizational Politics (my data)
• Brittle Connectivity
• Firewalls/Security Domains
• Partnerships bring new data / need different formats
• Data has to be masked for compliance purposes
• Where is this data even from?
• Data is in that other system, I need it over here
• Bandwidth between those sites is limited
• My Big Data system needs it in this other better/faster/stronger format
• What schema is that from?
• It needs to be enriched first!
• No not that reference set, this one!
• I didn't even know that system existed

Ok so let's fix this

• Enterprise Architecture: standardize on
  - …format
  - …a schema (one that can evolve)
  - …a protocol
  - …an ontology

But now…
• Standard schema becomes complex
• Hard to agree on common changes
• Some teams stuck on older versions
• Productivity starts slowing…

Something to ponder: the disconnect is healthy

• Having Corporate Standards is a good thing.
• Innovation is a good thing.

Innovation often does not follow the Corporate Standard

What is Dataflow Management?

Dataflow Management

The systematic process by which data is acquired from all producers and delivered to all consumers

Dataflow Management Considerations

• Promote Loosely Coupled Systems
  - Types of coupling: Format, Schema, Protocol, Priority, Size, Interest, …
• Promote Highly Cohesive Systems
  - Producers should focus on production (not the intricacies of consumption)
  - Consumers should focus on storage or processing (not the details of production)
• Provide Provenance
  - The who/what/when/where/why of data
  - Inter and Intra Process Latency
  - Enable enterprise version control for data

Dataflow Management Considerations

• Empower Understanding and Interaction
  - Ability to see the flow, safely and quickly iterate and experiment
  - Breaking production is bad; so too is not being able to evolve fast enough
• Secure
  - Bridge between security domains
  - Data Plane (transport)
  - Control Plane (C&C, Monitoring)
• Self Service
  - Centralized teams: hard to scale, slow turnaround times
  - Centralized systems: multi-tenant management works

The role of messaging systems

• Reduce variables: Fix protocol, Data Size, Provide Buffering
• Historically not very fast or replayable: Apache Kafka solved that
• Strong solution within a controlled domain
• But numerous challenges remain
  - Topics do not separate key concerns between producer and consumer pairs such as
    - Authorization
    - Format
    - Schema
    - Interest
    - Prioritization
  - Flow control

Introduction to Apache NiFi

The NSA Years

• Created in 2006
• Improved over eight years
• Simple initial vision: Visio for real-time dataflow management
• Key Lessons Learned
  - What scale means: down, up, and out
  - The fearsome force known as Compliance Requirements
  - The power of provenance!
  - Operational best-practices and anti-patterns
• NSA donated the codebase to the ASF in late 2014

NiFi Key Features

• Guaranteed delivery
• Data buffering
  - Backpressure
  - Pressure release
• Prioritized queuing
• Flow specific QoS
  - Latency vs. throughput
  - Loss tolerance
• Data provenance
• Recovery/recording a rolling log of fine-grained history
• Visual command and control
• Flow templates
• Pluggable/multi-role security
• Designed for extension
• Clustering

NiFi Core Concepts

FBP Term | NiFi Term | Description
Information Packet | FlowFile | Each object moving through the system.
Black Box | FlowFile Processor | Performs the work, doing some combination of data routing, transformation, or mediation between systems.
Bounded Buffer | Connection | The linkage between processors, acting as queues and allowing various processes to interact at differing rates.
Scheduler | Flow Controller | Maintains the knowledge of how processes are connected, and manages the threads and allocations thereof which all processes use.
Subnet | Process Group | A set of processes and their connections, which can receive and send data via ports. A process group allows creation of an entirely new component simply by composition of its components.

Visual Command & Control

• Drag & drop processors to build a flow
• Start, stop, & configure components in real-time
• View errors & corresponding messages
• View statistics & health of the dataflow
• Create shareable templates of common flows

 

Provenance/Lineage

• Tracks data at each point as it flows through the system
• Records, indexes, and makes events available for display
• Handles fan-in/fan-out, i.e. merging and splitting data
• View attributes and content at given points in time
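
To make the fan-in/fan-out point concrete, here is a minimal Python sketch of lineage tracing over recorded events. It is not NiFi's provenance API: the `Event` class, the `ancestors` function, and the FlowFile ids are hypothetical names chosen for illustration, and the event-type strings are used only as labels for split (fork) and merge (join).

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """One recorded event (hypothetical structure, not NiFi's)."""
    event_type: str                 # label only, e.g. "CREATE", "FORK", "JOIN"
    flowfile: str                   # id of the FlowFile the event produced
    parents: list = field(default_factory=list)  # ids it was derived from


def ancestors(events, flowfile_id):
    """Return the ids of every FlowFile that flowfile_id was derived from."""
    parents_of = {}
    for e in events:
        parents_of.setdefault(e.flowfile, set()).update(e.parents)
    seen, stack = set(), [flowfile_id]
    while stack:
        for parent in parents_of.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


events = [
    Event("CREATE", "a"),
    Event("FORK", "a1", parents=["a"]),       # fan-out: "a" split in two
    Event("FORK", "a2", parents=["a"]),
    Event("CREATE", "b"),
    Event("JOIN", "m", parents=["a1", "b"]),  # fan-in: "a1" and "b" merged
]
```

Asking for the ancestors of the merged FlowFile `m` walks back through both the merge and the earlier split, which is the kind of question a lineage view answers.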

Prioritization

• Configure a prioritizer per connection
• Determine what is important for your data: time based, arrival order, importance of a data set
• Funnel many connections down to a single connection to prioritize across data sets
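
A rough Python sketch of a per-connection prioritizer, under stated assumptions: the `Connection` class and the `by_priority_attribute` function are hypothetical, FlowFiles are modelled as plain dicts of attributes, and a lower number means more important. It shows the idea of a pluggable ordering on a queue, not NiFi's implementation.

```python
import heapq
import itertools


class Connection:
    """A queue whose ordering is decided by a pluggable prioritizer."""

    def __init__(self, prioritizer):
        self._key = prioritizer
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker: arrival order

    def enqueue(self, flowfile):
        entry = (self._key(flowfile), next(self._arrival), flowfile)
        heapq.heappush(self._heap, entry)

    def dequeue(self):
        return heapq.heappop(self._heap)[2]


def by_priority_attribute(flowfile):
    """Order by a 'priority' attribute; FlowFiles without one go last."""
    return int(flowfile.get("priority", 9))


conn = Connection(by_priority_attribute)
conn.enqueue({"name": "bulk", "priority": "5"})
conn.enqueue({"name": "alert", "priority": "1"})
conn.enqueue({"name": "default"})
order = [conn.dequeue()["name"] for _ in range(3)]
```

Swapping `by_priority_attribute` for a function that returns a timestamp would give a time-based ordering on the same connection.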

Back-Pressure

• Configure back-pressure per connection
• Based on number of FlowFiles or total size of FlowFiles
• Upstream processor no longer scheduled to run until below threshold
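
The three bullets above can be sketched in a few lines of Python. This is an illustration only: `Connection`, `run_upstream_once`, and the threshold values are assumptions made for the example, not NiFi classes or defaults.

```python
from collections import deque


class Connection:
    """Queue with two back-pressure thresholds: object count and total bytes."""

    def __init__(self, max_count, max_bytes):
        self.max_count = max_count
        self.max_bytes = max_bytes
        self._queue = deque()
        self._bytes = 0

    def offer(self, content: bytes):
        self._queue.append(content)
        self._bytes += len(content)

    def poll(self) -> bytes:
        content = self._queue.popleft()
        self._bytes -= len(content)
        return content

    def back_pressure_engaged(self) -> bool:
        return len(self._queue) >= self.max_count or self._bytes >= self.max_bytes


def run_upstream_once(conn, produce) -> bool:
    """Scheduler step: run the upstream producer only while below threshold."""
    if conn.back_pressure_engaged():
        return False            # upstream is not scheduled this round
    conn.offer(produce())
    return True


conn = Connection(max_count=2, max_bytes=100)
results = [run_upstream_once(conn, lambda: b"x") for _ in range(3)]
conn.poll()                     # downstream consumes one FlowFile
resumed = run_upstream_once(conn, lambda: b"x")
```

The upstream side never blocks or drops data; it is simply skipped until the downstream side drains the queue below the threshold.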

Latency vs. Throughput

• Choose between lower latency, or higher throughput on each processor
• Higher throughput allows framework to batch together all operations for the selected amount of time for improved performance
• Processor developer determines whether to support this by using @SupportsBatching annotation

Security

• Control Plane
  - Pluggable authentication
    - 2-Way TLS/SSL, LDAP, Kerberos
  - Pluggable authorization with multi-tenancy
    - NiFi Policy Based Authorizer
    - Apache Ranger Authorizer
  - Audit trail of all user actions
• Data Plane
  - Optional 2-Way TLS/SSL between cluster nodes
  - Optional 2-Way TLS/SSL on Site-To-Site connections (NiFi-to-NiFi)
  - Encryption/Decryption of data through processors
  - Provenance for audit trail of data

Extensibility

• Built from the ground up with extensions in mind
• Service-loader pattern for…
  - Processors
  - Controller Services
  - Reporting Tasks
• Extensions packaged as NiFi Archives (NARs)
  - Deploy to NiFi lib directory and restart
  - Provides ClassLoader isolation
  - Same model as standard components

Architecture - Standalone

[Diagram: an OS/Host running a JVM that contains the Web Server, the Flow Controller, and Processor 1 through Extension N, with the FlowFile Repository, Content Repository, and Provenance Repository on Local Storage]

• FlowFile Repository
  - Write Ahead Log
  - State of every FlowFile
  - Pointers to content repository (pass-by-reference)
• Content Repository
  - FlowFile content
  - Copy-on-write
• Provenance Repository
  - Write Ahead Log + Lucene Indexes
  - Store & search lineage events
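
As a small illustration of pass-by-reference and copy-on-write, the Python sketch below keeps content in a separate store and lets FlowFiles carry only a pointer to it. All names here (`ContentRepository`, `FlowFile`, `transform`, the integer claim ids) are hypothetical and in-memory; the real repositories are persistent and considerably more involved.

```python
import itertools
from dataclasses import dataclass


class ContentRepository:
    """Stores content under claim ids; existing content is never modified."""

    def __init__(self):
        self._claims = {}
        self._ids = itertools.count(1)

    def write(self, data: bytes) -> int:
        claim = next(self._ids)
        self._claims[claim] = bytes(data)
        return claim

    def read(self, claim: int) -> bytes:
        return self._claims[claim]


@dataclass(frozen=True)
class FlowFile:
    attributes: dict
    claim: int          # pointer into the content repository


def transform(repo, flowfile, fn):
    """Copy-on-write: write the changed content as a new claim."""
    new_claim = repo.write(fn(repo.read(flowfile.claim)))
    return FlowFile(flowfile.attributes, new_claim)


repo = ContentRepository()
original = FlowFile({"filename": "a.txt"}, repo.write(b"hello"))
clone = FlowFile(dict(original.attributes), original.claim)  # shares content
upper = transform(repo, original, bytes.upper)
```

Cloning a FlowFile copies only the pointer, and transforming it leaves the original content readable, which is what makes replay and lineage inspection possible later.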

Architecture - Cluster

[Diagram: three nodes, each an OS/Host running a JVM with the Web Server, the Flow Controller, Processor 1 through Extension N, and the FlowFile, Content, and Provenance Repositories on Local Storage, alongside ZooKeeper]

• Same dataflow on each node, data partitioned across cluster
• Access the UI from any node
• ZooKeeper for auto-election of Cluster Coordinator & Primary Node
• Cluster Coordinator receives heartbeats from other nodes, manages joining/disconnecting
• Primary Node for scheduling processors on a single node

Site-To-Site

• Direct communication between two NiFi instances
• Push to Input Port on receiver, or Pull from Output Port on source
• Communicate between clusters, standalone instances, or both
• Handles load balancing and reliable delivery
• Secure connections using certificates (optional)
• Communicate over TCP or HTTP

 

Site-To-Site Push Model

• Source connects Remote Process Group to Input Port on destination
• Site-To-Site takes care of load balancing across the nodes in the cluster

[Diagram: the RPG on a Standalone NiFi pushing to the Input Port on NiFi Cluster Nodes 1, 2, and 3]
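
A toy Python sketch of the push model's load balancing. Plain round-robin is used here as a stand-in, purely as an assumption for illustration; the slides do not say how Site-To-Site chooses a node, and the `push` function and node names are hypothetical.

```python
import itertools
from collections import defaultdict


def push(batches, nodes):
    """Spread batches across the destination cluster's nodes in turn."""
    targets = itertools.cycle(nodes)
    delivered = defaultdict(list)
    for batch in batches:
        delivered[next(targets)].append(batch)
    return dict(delivered)


result = push(["b1", "b2", "b3", "b4"], ["node1", "node2", "node3"])
```

The sender only needs to know the cluster; which node receives a given batch is decided by the transport rather than by the flow designer.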

Site-To-Site Pull Model

• Destination connects Remote Process Group to Output Port on the source
• If source was a cluster, each node would pull from each node in cluster

[Diagram: the RPG on NiFi Cluster Nodes 1, 2, and 3 pulling from the Output Port on a Standalone NiFi]

Introduction to Apache MiNiFi

Apache MiNiFi

• Sub-project of Apache NiFi
• Created to more effectively collect data at the edge
• Smaller footprint, run where the JVM can't
• Design & Deploy vs. Command & Control

MiNiFi Distributions

• Java
  - <40MB binary distribution
  - Requires Java 1.8
  - More feature complete
  - Targeted for any systems that can run a JVM (e.g. servers, Raspberry Pi)
• C++
  - 600KB code size and static data ~50KB
  - Dynamic heap of ~1MB based on use-case
  - Targeted for resource constrained environments (e.g. edge IoT devices)
• Both use same config format and use NiFi terminology

Different focuses depending on requirements

MiNiFi Java

[Diagram: MiNiFi consists of the NiFi Framework and Components; NiFi consists of the NiFi Framework, the User Interface, and Components]

MiNiFi Java

• Uses same NAR structure as NiFi
• Use any NAR from NiFi with MiNiFi Java
• NiFi standard processors are bundled by default
  - TailLog
  - UpdateAttribute
  - Route on content and attributes
  - PutEmail
  - ….

MiNiFi C++

• Initial set of processors
  - TailFile
  - GetFile
  - GenerateFlowFile
  - LogAttribute
  - ListenSyslog
• Site to Site Client implementation in C++ for talking to NiFi instances

 

Design & Deploy

Same approach for Java & C++…
1. Design a flow in NiFi UI
2. Export template to XML file
3. Run MiNiFi Toolkit to convert NiFi template to MiNiFi YAML
4. Deploy config.yml to MiNiFi instances

Initially targeting flows like…
1. GetFile/TailFile
2. Routing Decision
3. Site-To-Site Back to core NiFi

Simple config.yml: Tail a rolling file -> Site to Site

MiNiFi Command and Control

• Design Flow at a centralized place, deploy on the edge
• Version control of flows
  - Align with NiFi SDLC work
• Agent status monitoring
• Bi-directional command and control

Currently a feature proposal, initial version being architected

https://cwiki.apache.org/confluence/display/MINIFI/MiNiFi+Command+and+Control

Demo!  

Demo Scenario

[Diagram: two Raspberry Pis, each with a Temp/Humidity Sensor and MiNiFi Java, connected via site-to-site to NiFi, followed by Solr and Banana]

Questions?

Learn more and join us!

Apache NiFi site: http://nifi.apache.org
Subproject MiNiFi site: http://nifi.apache.org/minifi/
Subscribe to and collaborate at: dev@nifi.apache.org, users@nifi.apache.org
Submit Ideas or Issues: https://issues.apache.org/jira/browse/NIFI, https://issues.apache.org/jira/browse/MINIFI
Follow us on Twitter: @apachenifi

Thank you!