“Transform Real Time Data into Real Time Decisions”
Asit Parija (@asitparija) · 2017. 12. 14.

CUSTOMERS PARTNERSHIPS

OPEN SOURCE

FROM: RDBMS → ETL → EDW

• Only structured data
• $50K-100K per TB
• Limited analytics

TO: RDBMS, sensor data, web logs → Hadoop/Spark (ETL + long-term storage) → EDW (query + present)

✓ Both structured and unstructured data
✓ 50x-100x cost savings: $1K per TB
✓ Expanded analytics with MapReduce, NoSQL, etc.

ETL Goals

• Make data processing more powerful
• Make data processing simpler
• Make data processing 100x faster than before
• What are the options?

 

What steered us into Spark

• Powerful in-memory processing
• Simple operators on data
• Debuggable API
• Efficient execution
• Universally distributed

What steered us into Pig

• DSL for ETL
• Rich operator library
• Extendable
• Pluggable
• Powerful ETL

Operator Mapping

Pig → Spark

• Load → HadoopRDD
• Store → saveAsObjectFile
• Filter → MappedRDD + filter func
• Group By (local rearrange, global rearrange & package) → Sort + Group by
• … → …
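The Group By row can be illustrated without a cluster. The following is a minimal plain-Python sketch, not the Spork converter code: it assumes records are (key, value) tuples, and the helper name `group_by_key` is hypothetical. It shows how sorting by key and then grouping adjacent records yields one bag of values per key.

```python
from itertools import groupby
from operator import itemgetter


def group_by_key(records):
    """Group (key, value) tuples by key using sort + group-by.

    Hypothetical helper for illustration only; not part of Pig or Spark.
    """
    # Sorting brings records with equal keys next to each other,
    # which stands in for the rearrange steps.
    ordered = sorted(records, key=itemgetter(0))
    # groupby only merges adjacent equal keys, so the sort above is required.
    return [(key, [value for _, value in group])
            for key, group in groupby(ordered, key=itemgetter(0))]


if __name__ == "__main__":
    page_views = [("en", 3), ("de", 1), ("en", 5), ("fr", 2), ("de", 4)]
    for key, bag in group_by_key(page_views):
        print(key, bag)
```

Because Python's sort is stable, values inside each bag keep their input order. A distributed engine would additionally shuffle records between workers, which this single-process sketch leaves out.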

Current Flow

Issues

• Scaling
• Performance
• Spark-specific operators (cache)
• Pig on Spark unit tests
• Some specific join and rank operations

Filter Code Implementation

• https://bitbucket.org/SigmoidDev/spork/src/80a3e4626e4504c1829568942e0690abc79d239a/src/org/apache/pig/backend/hadoop/executionengine/spark/converter/FilterConverter.java?at=spork-1.0
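For readers who cannot open the link, the general idea of a filter converter can be sketched in a few lines. This is an illustrative plain-Python analogue under stated assumptions, not the contents of FilterConverter.java: rows are modelled as Python tuples, and the name `filter_converter` is hypothetical.

```python
from typing import Callable, Iterable, Iterator, Tuple

Row = Tuple  # assumption: a Pig tuple is modelled as a plain Python tuple


def filter_converter(
    predicate: Callable[[Row], bool],
) -> Callable[[Iterable[Row]], Iterator[Row]]:
    """Turn a row-level predicate into a dataset-level transformation.

    Hypothetical helper: it mirrors the idea of wrapping a Pig filter
    condition in a function that an engine can apply to every record.
    """
    def apply(rows: Iterable[Row]) -> Iterator[Row]:
        # Lazy generator: rows are tested one at a time as they are consumed.
        return (row for row in rows if predicate(row))

    return apply


if __name__ == "__main__":
    # Keep only pages with more than 100 views (made-up sample data).
    keep_popular = filter_converter(lambda row: row[1] > 100)
    pages = [("Main_Page", 50), ("Spark", 150), ("Pig", 101)]
    print(list(keep_popular(pages)))
```

The converter returns a function rather than a filtered list, so the same predicate can be reused on any input that yields rows.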

Contribute

• Pig on Spark umbrella Jira
• https://issues.apache.org/jira/browse/PIG-4059

• https://github.com/sigmoidanalytics/spork
• Issues

Benchmark

A distinct operation on a 25-day wikistats dump (270 GB) took 4.25 minutes with Pig on Spark, compared to 30 minutes with MapReduce.
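To put the two timings side by side, the speedup and throughput can be worked out directly from the figures on the slide. The short sketch below does only that arithmetic; it assumes "270G" means 270 gigabytes, and the function names are illustrative.

```python
def speedup(baseline_minutes: float, new_minutes: float) -> float:
    """How many times faster the new run is than the baseline."""
    return baseline_minutes / new_minutes


def throughput_gb_per_min(size_gb: float, minutes: float) -> float:
    """Average data volume processed per minute."""
    return size_gb / minutes


if __name__ == "__main__":
    size_gb = 270          # assumption: "270G" on the slide means 270 GB
    spark_minutes = 4.25   # Pig on Spark, from the slide
    mr_minutes = 30        # MapReduce, from the slide

    print(f"Speedup: {speedup(mr_minutes, spark_minutes):.2f}x")
    print(f"Pig on Spark: {throughput_gb_per_min(size_gb, spark_minutes):.1f} GB/min")
    print(f"MapReduce: {throughput_gb_per_min(size_gb, mr_minutes):.1f} GB/min")
```

On these numbers the speedup is roughly 7x, or about 63.5 GB/min against 9 GB/min.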

Mixing Streaming & Batch Processing

• Current state: different code for batch and stream
• Lambda Architecture
• One unified language to perform both
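The last bullet can be made concrete with a small single-process sketch. It is not the SigmaStream API: the record format ("language page_title"), the function names, and the micro-batch model are all assumptions chosen for illustration. The point is that the business logic is written once and reused by both the batch and the streaming path.

```python
from collections import Counter
from typing import Iterable, Iterator, List


def count_by_language(records: Iterable[str]) -> Counter:
    """The single piece of business logic, written once.

    Assumes each record looks like "<language> <page_title>".
    """
    return Counter(record.split()[0] for record in records)


def run_batch(records: List[str]) -> Counter:
    """Batch path: apply the logic to the whole data set at once."""
    return count_by_language(records)


def run_stream(micro_batches: Iterable[List[str]]) -> Iterator[Counter]:
    """Streaming path: apply the same logic per micro-batch.

    Yields a snapshot of the running totals after each micro-batch.
    """
    totals: Counter = Counter()
    for batch in micro_batches:
        totals.update(count_by_language(batch))
        yield Counter(totals)  # copy, so earlier snapshots stay unchanged


if __name__ == "__main__":
    records = ["en Main_Page", "de Hauptseite", "en Spark"]
    print(run_batch(records))
    for snapshot in run_stream([records[:2], records[2:]]):
        print(snapshot)
```

Once the stream has delivered every record, its final snapshot equals the batch result, which is the property a unified language is meant to guarantee.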

What else is cool

CloudFlux

• Cloud deployment
• Fault tolerance
• Autoscaling
• Programmatic interface
• Cloud agnostic
• Apache License

SigmaStream

• Pig/SQL-like DSL
• Rich stream operators
• Multiple data sources/sinks
• Add custom operators
• Apache Spark based
• Apache License

 

Thank You

India Office

Gulmohar Enclave Road,
Silver Spring Layout, Munnekollal
Bengaluru, Karnataka 560037

US Office

1343 Kingfisher Way
Sunnyvale, CA 94087

+1 (760) 203 3257

[email protected]