Exploratory Analytics for Shared-service … Hadoop @ Yahoo: 8+ years of innovation

Copyright © 2014 Splunk Inc. Sagi Zelnick, Principal Architect, Yahoo. Exploratory Analytics for Shared-service Hadoop Clusters


Page 1:

Exploratory Analytics for Shared-service Hadoop Clusters

Copyright © 2014 Splunk Inc.

Sagi Zelnick, Principal Architect, Yahoo

Page 2:

Disclaimer

During the course of this presentation, we may make forward-looking statements regarding future events or the expected performance of the company. We caution you that such statements reflect our current expectations and estimates based on factors currently known to us and that actual events or results could differ materially. For important factors that may cause actual results to differ from those contained in our forward-looking statements, please review our filings with the SEC. The forward-looking statements made in this presentation are being made as of the time and date of its live presentation. If reviewed after its live presentation, this presentation may not contain current or accurate information. We do not assume any obligation to update any forward-looking statements we may make. In addition, any information about our roadmap outlines our general product direction and is subject to change at any time without notice. It is for informational purposes only, and shall not be incorporated into any contract or other commitment. Splunk undertakes no obligation either to develop the features or functionality described or to include any such feature or functionality in a future release.

Page 3:

Overview

• Hadoop @ Yahoo: 8+ years of innovation
• Hunk @ Yahoo: organization-wide investment for the next 3+ years
• Yahoo provides Hunk as a self-service offering to explore, analyze & visualize data in HDFS
  – Hunk allows for visually browsing very complex tables (250+ fields)
  – Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
  – Cuts down on development cycles through faster interaction with results
  – Built-in graphs/charts make for a powerful solution in many situations

Page 4:

History of Hadoop Innovation @ Yahoo

Page 5:

Over 600 PB of Hadoop Storage (Over Half an Exabyte)

• Very large clusters used by many groups across the enterprise
• More than 35,000 individual datanodes
• Hadoop is provided as a service
• Multiple cluster types such as research, dev, sandbox and production
• Services such as HBase, Hive, Oozie, etc.
• Users are free to run jobs, but have resource constraints
• Maintained by the Grid Operations Group

Page 6:

Integrated Analytics Platform for Diverse Data Stores

[Diagram labels]
Full-featured, Integrated Product | Fast Insights for Everyone | Works with What You Have Today
Explore | Visualize | Dashboards | Share | Analyze
Hadoop Clusters | NoSQL and Other Data Stores
Hadoop Client Libraries | Streaming Resource Libraries

Page 7:

Improving Operational Visibility with Hunk

• We pointed Hunk at many operational logs and event data we already had on the grid
• This includes system metrics, HDFS ops, JVM stats and YARN metrics
• Created instrumentation to measure usage per user and job
• Analyzed terabytes of NameNode audit logs
• Job history leveraged for visualizing usage/growth and historical views
• Custom events for HBase statistics

Page 8:

Tracking Hadoop Performance & Metrics in Hunk

Use Case | Customer | Benefits
System metrics from 35k nodes | Grid Ops / Grid Customers | Identify slow tasks/nodes when debugging
Historical insights of resources | All Grid Customers | Track organic growth
Job performance | All Grid Customers | Improved job SLAs
HBase metrics | All Grid Customers | Track region/RS/table metrics…
Job logs in near real-time | All Grid Customers / Ops | Search for errors directly from the YARN logs

Page 9:

Measuring NameNode Performance Pre & Post Upgrades

• Historical visualizations of all operations
• Search data in Hunk from billions of NameNode events
• Measure JVM and memory usage
• Insights into operational performance

Page 10:

Visualization Using Hunk

Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=1h avg(number*) as num_*

Time range: Last 7 days
10,086 events (5/15/14 1:00:00.000 AM to 5/22/14 1:36:34.000 AM)

[Chart: the num_* series over _time, Fri May 16 to Tue May 20, 2014; y-axis ticks at 200,000,000, 400,000,000 and 600,000,000]

_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-15 01:00 | 1124437.735902 | 46721126.819672 | 514957.384098 | 12930433.077869 | 0.000000 | 94210832.786885 | 63512425.967213 | 13975.306557
2014-05-15 02:00 | 1115496.290492 | 53597000.262295 | 298717.637049 | 10402176.717213 | 0.000000 | 94109944.655738 | 93916552.393443 | 35459.288689
2014-05-15 03:00 | 1110372.4173 | 56566721.704918 | 428494.9449 | 13296385.590164 | 0.000000 | 94141430.295082 | 97353478.229508 | 20307.549344

Page 11:

Sample Troubleshooting in Hunk of 750 Million Events

Search: index="simon_blue_new_all" this_cluster="dilithiumblue*" (log_subtype="DFS" #hdfs=hdfs) | timechart span=5m avg(number*) as num_*

Time range: Last 2 days
2,753 events (5/20/14 1:14:21.000 AM to 5/22/14 1:14:21.000 AM)

[Chart: the num_* series over _time, 12:00 PM Tue May 20 to 12:00 PM Wed May 21, 2014; y-axis ticks at 250,000,000, 500,000,000, 750,000,000 and 1,000,000,000]

_time | num_BlockReports | num_CopyBlockOperations | num_HeartBeats | num_ReadBlockOperations | num_ReadMetadataOperations | num_ReplaceBlockOperations | num_WriteBlockOperations | num_blockChecksumOp
2014-05-20 01:15:00 | 1056047.024000 | 34677652.000000 | 124121.264000 | 26242490.800000 | 0.000000 | 88112292.800000 | 126478486.400000 | 51405.346000
2014-05-20 01:20:00 | 1055517.924000 | 30920700.800000 | 1065390.086000 | 22756041.800000 | 0.000000 | 87745422.400000 | 92323387.200000 | 32070.482000
2014-05-20 01:25:00 | 1055457.2000 | 33068504.400000 | 27622.56200 | 11396610.700000 | 0.000000 | 88569211.200000 | 94593716.800000 | 28873.618000

Page 12:

Big Picture Plus Granular Details

Page 13:

Analyzing NameNode RPC Calls (Troubleshooting)

• Who is making what RPC call (open, listStatus, create, etc.)
• How often they are making these RPC calls
• Which IP/host the calls are coming from
• Search and visualize historical data from billions of events
• Prevent NameNode abuse/misuse
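The who/what/where breakdown above can be sketched outside Hunk as a simple count over NameNode audit lines. This is a minimal Python illustration, not the search used in the talk: it assumes the common HDFS audit layout of `ugi=`, `ip=` and `cmd=` key/value pairs, and the sample lines and the `count_rpc_calls` helper are made up for the example.

```python
import re
from collections import Counter

# Assumed audit-line layout: key=value pairs containing ugi=, ip= and cmd=.
# Real deployments may format these differently; adjust the pattern to match.
AUDIT_RE = re.compile(r"ugi=(?P<user>\S+).*?ip=/?(?P<ip>\S+).*?cmd=(?P<cmd>\S+)")


def count_rpc_calls(lines):
    """Count audit events per (user, command, client IP)."""
    counts = Counter()
    for line in lines:
        match = AUDIT_RE.search(line)
        if match:  # silently skip lines that are not audit events
            counts[(match["user"], match["cmd"], match["ip"])] += 1
    return counts


# Hypothetical sample lines, for illustration only.
sample = [
    "2014-05-22 01:00:00,001 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/a\tdst=null\tperm=null",
    "2014-05-22 01:00:00,002 INFO FSNamesystem.audit: allowed=true\tugi=alice (auth:SIMPLE)\tip=/10.0.0.1\tcmd=open\tsrc=/data/b\tdst=null\tperm=null",
    "2014-05-22 01:00:00,003 INFO FSNamesystem.audit: allowed=true\tugi=bob (auth:SIMPLE)\tip=/10.0.0.2\tcmd=listStatus\tsrc=/data\tdst=null\tperm=null",
]

counts = count_rpc_calls(sample)
for (user, cmd, ip), n in counts.most_common():
    print(user, cmd, ip, n)
```

Sorting by count puts the heaviest callers first, which is the view that helps spot NameNode abuse or misuse.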

Page 14:

Visualizing 834 Million Discrete Events …

Page 15:

Continued

Page 16:

Queue Insights (Capacity & Provisioning)

• Each Hadoop job runs in a specific queue
• We track every aspect of the YARN framework
• Immediate queue performance and configuration profiling via job history server
• Historical views and trends that enable better capacity management
• Improved queue utilization and allocation management

Page 17:

Visualizing Queues

Search: index="jobsummary_logs_all_red" cluster="dilithium*" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours) | timechart span=6h sum(gb_hours) as gb_hours by queue

Time range: Last 7 days
1,175,726 events (5/20/14 8:00:00.000 PM to 5/27/14 8:26:26.000 PM)

[Chart: gb_hours by queue over _time, Wed May 21 to Mon May 26, 2014; y-axis ticks at 200,000, 400,000 and 600,000]

_time | OTHER | apg_dailyhigh_p3 | apg_dailymedium_p5 | apg_hourlyhigh_p1 | apg_hourlylow_p4 | apg_hourlymedium_p2 | apg_p7 | curveball_large | curveball_med | slingshot | slingstone
2014-05-20 18:00 | 4154 | 45512 | 7071 | 25643 | 12111 | 29664 | 3473 | 26547 | 14192 | 60875 | 45376
2014-05-21 00:00 | 19341 | 92661 | 18005 | 41008 | 22944 | 88115 | 10896 | 38648 | 8693 | 48186 | 87670
2014-05-21 06:00 | 21160 | 108137 | 38398 | 35627 | 14934 | 101925 | 24458 | 29269 | 14066 | 24344 | 47831
2014-05-21 12:00 | 24238 | 74849 | 22695 | 47431 | 17731 | 53673 | 17332 | 37079 | 14479 | 44873 | 96909
2014-05-21 18:00 | 5792 | 95449 | 2737 | 44214 | 20325 | 48339 | 10222 | 34390 | 4605 | 168593 | 24298
2014-05-22 00:00 | 10177 | 68048 | 12853 | 36921 | 23248 | 57740 | 16005 | 44138 | 9142 | 88121 | 34544
2014-05-22 06:00 | 12720 | 85048 | 21977 | 35870 | 15503 | 100364 | 7823 | 35179 | 8086 | 33973 | 19802
2014-05-22 12:00 | 5459 | 76489 | 13154 | 34703 | 11204 | 34877 | 20178 | 22631 | 40567 | 98 | 24250
2014-05-22 18:00 | 8169 | 38394 | 2211 | 49840 | 19977 | 52438 | 4050 | 38066 | 27973 | 49333 | 31312
2014-05-23 00:00 | 12898 | 117518 | 7354 | 36422 | 16426 | 52918 | 8179 | 28202 | 21798 | 79808 | 37078
2014-05-23 06:00 | 6572 | 105431 | 26941 | 48614 | 29159 | 120424 | 14317 | 26011 | 12433 | 16745 | 35928
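To make the arithmetic in the queue search concrete, here is a small Python sketch of the same steps: slot-seconds are summed, multiplied by the 0.5 factor from the query (presumably GB per slot), divided by 3600 to get GB-hours, rounded, and then summed per 6-hour bucket and queue. The `time` field, the sample jobs and the UTC-epoch bucket alignment are assumptions made for illustration; Splunk's own bucket alignment and tie-rounding may differ.

```python
from collections import defaultdict

GB_PER_SLOT = 0.5          # factor taken from the query; its meaning is assumed
SPAN_SECONDS = 6 * 3600    # mirrors "timechart span=6h"


def gb_hours(map_slot_seconds, reduce_slot_seconds):
    """Convert slot-seconds to whole GB-hours, as the eval steps do."""
    total_slot_seconds = map_slot_seconds + reduce_slot_seconds
    # Python rounds exact .5 ties to even, which may differ from SPL's round().
    return round(total_slot_seconds * GB_PER_SLOT / 3600)


def bucket_start(epoch_seconds, span=SPAN_SECONDS):
    """Start of the span-sized bucket containing the timestamp (UTC-aligned)."""
    return epoch_seconds - epoch_seconds % span


def gb_hours_by_queue(jobs):
    """Sum GB-hours per (bucket start, queue)."""
    totals = defaultdict(int)
    for job in jobs:
        key = (bucket_start(job["time"]), job["queue"])
        totals[key] += gb_hours(job["mapSlotSeconds"], job["reduceSlotSeconds"])
    return dict(totals)


# Hypothetical job-summary events; only the queue names come from the slide.
jobs = [
    {"time": 0, "queue": "slingshot", "mapSlotSeconds": 7200, "reduceSlotSeconds": 0},
    {"time": 3600, "queue": "slingshot", "mapSlotSeconds": 14400, "reduceSlotSeconds": 14400},
    {"time": 21600, "queue": "apg_p7", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600},
]

totals = gb_hours_by_queue(jobs)
print(totals)
```

The first two jobs fall into the same 6-hour bucket and queue, so their GB-hours are added together, which is what `sum(gb_hours) ... by queue` does per time bucket.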

Page 18:

Self-Service Job Reports

• Each job is unique and so are the map and reduce elements
• How to start analyzing jobs?
• Historical job performance and profiling enables in-depth performance tuning
• Long-term historical views and trending of growth

Page 19:

Search: index="jobsummary_logs_all_blue" cluster="*" user="gmon" | eval total_slot_seconds=(mapSlotSeconds + reduceSlotSeconds) | eval gb_hours=((total_slot_seconds * 0.5) / 3600) | eval gb_hours=round(gb_hours,2) | eval runtime=(finishTime-submitTime)/1000 | stats sum(gb_hours) as gb-hours avg(runtime) as run_mins by cluster user queue jobName jobId status | eval run_mins=round(run_mins/60,2) | sort -gb-hours

Time range: Yesterday
4,871 events (5/26/14 12:00:00.000 AM to 5/27/14 12:00:00.000 AM)

Statistics (4,871)

cluster | user | queue | jobName | jobId | status | gb-hours | run_mins
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_315271 | SUCCEEDED | 108.00 | 33.07
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_312700 | SUCCEEDED | 104.00 | 37.37
cobalt | gmon | grideng | PigLatin:findRemoteHDFSFromAudits.pig | job_1398982765383_309715 | SUCCEEDED | 88.00 | 29.83
cobalt | gmon | gridops | distcp: | job_1398982765383_309921 | SUCCEEDED | 36.00 | 68.49
cobalt | gmon | gridops | SPLK_spbl103n01.blue.ygrid.yahoo.com_1401125953.2076_0 | job_1398982765383_313570 | SUCCEEDED | 25.00 | 14.26
cobalt | gmon | gridops | nnaudit_DR_2014_05_25 | job_1398982765383_308938 | SUCCEEDED | 25.00 | 15.43
cob | g | grid | nnaudit_DB_2014_05_25 | job_1398982765 | SUCCE | 24.00 | 18.07
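The unit conversions in the job-report search are easy to misread, so here is a hedged Python sketch of them. It assumes `submitTime` and `finishTime` are epoch milliseconds (the query divides by 1000 to get seconds, then by 60 to get minutes) and that there is one summary event per job, so the `sum` and `avg` in the search reduce to the single value. The job IDs and numbers are hypothetical.

```python
def job_report(jobs):
    """Build per-job rows with GB-hours and runtime in minutes, largest first."""
    rows = []
    for job in jobs:
        slot_seconds = job["mapSlotSeconds"] + job["reduceSlotSeconds"]
        # Millisecond timestamps are an assumption based on the "/1000" step.
        runtime_seconds = (job["finishTime"] - job["submitTime"]) / 1000
        rows.append({
            "jobId": job["jobId"],
            "gb_hours": round(slot_seconds * 0.5 / 3600, 2),
            "run_mins": round(runtime_seconds / 60, 2),
        })
    # Equivalent of "sort -gb-hours": descending by GB-hours.
    return sorted(rows, key=lambda row: row["gb_hours"], reverse=True)


# Hypothetical events, for illustration only.
jobs = [
    {"jobId": "job_b", "mapSlotSeconds": 3600, "reduceSlotSeconds": 3600,
     "submitTime": 0, "finishTime": 120_000},
    {"jobId": "job_a", "mapSlotSeconds": 360_000, "reduceSlotSeconds": 417_600,
     "submitTime": 0, "finishTime": 1_984_200},
]

report = job_report(jobs)
for row in report:
    print(row)
```

Keeping GB-hours and runtime side by side, as the slide's table does, helps separate jobs that are expensive because they are long from jobs that are expensive because they are wide.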


Page 23:

...It's Not Just Logs We're Looking At

More data to tap into with the metastore / Hive sources

• Using the metastore we can set up virtual indexes to any table(s) in Hive, without the need to define the schema up-front
• Visualize very complex tables (250+ fields)
• Rapid prototyping for new jobs with almost instant results for searches, without having to wait for the entire job/query to finish
• Built-in aggregates and graphs/charts
• Accelerates development workflow by providing faster interaction with data

Page 25:

THANK YOU