Tom Wheeler: Introduction to Apache Hadoop (St. Louis Java User Group)
files.meetup.com/1566558/2014-02...


Introduction to Apache Hadoop
Tom Wheeler | St. Louis Java User Group


About the Presenter…

Tom Wheeler
Software Engineer | Greater St. Louis Area | Information Technology and Services

Current:
• Senior Curriculum Developer at Cloudera

Past:
• Principal Software Engineer at Object Computing, Inc. (Boeing)
• Software Engineer, Level IV at WebMD
• Senior Software Engineer at Teralogix
• Senior Programmer/Analyst at A.G. Edwards and Sons, Inc.

Distant Past:
• Contract Web development, full-time instructor, worked at an ISP


About Cloudera

• Our mission is to help organizations profit from their data
  • Founded in 2008 to commercialize Hadoop
  • Staff includes committers to Hadoop and related projects
  • Active contributors to the Apache Software Foundation

• What we offer
  • Software (CDH distribution and Cloudera Manager)
  • Consulting and support services
  • Training and certification


About the Presentation…

• What's ahead
  • Fundamental Concepts
  • HDFS: The Hadoop Distributed File System
  • Data Processing with MapReduce
  • The Hadoop Ecosystem
  • Conclusion + Q&A


Fundamental Concepts
Why the World Needs Hadoop


Volume

• Every day…
  • More than 1.5 billion shares are traded on the NYSE

• Every minute…
  • TransUnion makes nearly 70,000 updates to credit files

• Every second…
  • Banks process more than 10,000 credit card transactions


Velocity

• We are generating data faster than ever
  • People are increasingly interacting online
  • Processes are increasingly automated
  • Systems are increasingly interconnected


Variety

• We're producing a variety of data, including
  • Audio
  • Video
  • Images
  • Log files
  • Web pages
  • Product ratings
  • Social network connections

• Not all of this maps cleanly to the relational model


Big Data Can Mean Big Opportunity

• One tweet is an anecdote
  • But a million tweets may signal important trends

• One person's product review is an opinion
  • But a million reviews might reveal a design flaw

• One person's diagnosis is an isolated case
  • But a million medical records could lead to a cure


We Just Need a System that Scales

• Too much data for traditional tools

• Two key problems
  • How to reliably store this data at a reasonable cost
  • How do we process all the data we've stored


What is Apache Hadoop?

• Scalable data storage and processing
  • Distributed and fault-tolerant
  • Runs on standard hardware

• Two main components
  • Storage: Hadoop Distributed File System (HDFS)
  • Processing: MapReduce

• Hadoop clusters are composed of computers called nodes
  • Clusters range in size from one to thousands of nodes


How Did Apache Hadoop Originate?

• Spun off from an open source search engine project
  • Papers published by Google strongly influenced its design

• Other Web companies quickly saw the benefits
  • Early adoption by Yahoo, Facebook and others

Timeline (2002 - 2007):
• 2002: Nutch spun off from Lucene
• 2003: Google publishes GFS paper
• 2004: Google publishes MapReduce paper
• 2005: Nutch rewritten for MapReduce
• 2006: Hadoop becomes Lucene subproject
• 2007: 1,000-node cluster at Yahoo


How Are Organizations Using Hadoop?

• Let's look at a few common uses…


Hadoop Use Case: Customer Analytics

Example Inc. Public Web Site (February 9 - 15)

Category    | Unique Visitors | Page Views | Bounce Rate | Conversion Rate | Average Time on Page
Television  | 1,967,345       | 8,439,206  | 23%         | 51%             | 17 seconds
Smartphone  | 982,384         | 3,185,749  | 47%         | 41%             | 23 seconds
MP3 Player  | 671,820         | 2,174,913  | 61%         | 12%             | 42 seconds
Stereo      | 472,418         | 1,627,843  | 74%         | 19%             | 26 seconds
Monitor     | 327,018         | 1,241,837  | 56%         | 17%             | 37 seconds
Tablet      | 217,328         | 816,545    | 48%         | 28%             | 53 seconds
Printer     | 127,124         | 535,261    | 27%         | 64%             | 34 seconds


Hadoop Use Case: Sentiment Analysis

[Chart: "References to Product in Social Media (Hourly)" - hourly mention counts from 10,000 to 80,000 over hours 00 through 23, broken down into Negative, Neutral, and Positive]


Hadoop Use Case: Product Recommendations

[Mock storefront: "Tom, we suggest the following..."]
• Grand Theft Auto - V: $54.99 (7,268 ratings)
• Death of a Decade (CD): $11.99 (372 ratings)
• St. Louis Cardinals Hat: $15.99 (15,916 ratings)


Hadoop Use Case: Fraud Detection


"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, we didn't try to grow a larger ox."

— Grace Hopper, early advocate of distributed computing


Comparing Hadoop to Other Systems

• Monolithic systems don't scale

• Modern high-performance computing systems are distributed
  • They spread computations across many machines in parallel
  • Widely used for scientific applications
  • Let's examine how a typical HPC system works


Architecture of a Typical HPC System

[Diagram: compute nodes connected to a separate storage system over a fast network]

• Step 1: Copy input data from the storage system to the compute nodes
• Step 2: Process the data on the compute nodes
• Step 3: Copy output data back to the storage system


You Don't Just Need Speed…

• The problem is that we have way more data than code

$ du -ks code/
1,087
$ du -ks data/
854,632,947,314


You Need Speed At Scale

[Diagram: the fast network between the compute nodes and the storage system becomes the bottleneck]


Hadoop Design Fundamental: Data Locality

• This is a hallmark of Hadoop's design
  • Don't bring the data to the computation
  • Bring the computation to the data!

• Hadoop uses the same machines for storage and processing
  • Significantly reduces need to transfer data across network


The Hadoop Distributed Filesystem
HDFS


HDFS: Hadoop Distributed File System

• Inspired by the Google File System
  • Reliable, low-cost storage for massive amounts of data

• Similar to a UNIX filesystem in some ways
  • Hierarchical, with a root directory
  • UNIX-style paths (e.g., /sales/west/accounts.txt)
  • UNIX-style file ownership and permissions


HDFS: Hadoop Distributed File System

• There are also some major deviations from UNIX filesystems
  • Designed for sequential access to large files
  • Cannot modify file content once written

• HDFS is implemented as a user-space Java process
  • Requires the use of special commands and APIs


Accessing HDFS via the Command Line

• Users typically access HDFS via the hadoop fs command
  • Actions specified with subcommands (prefixed with a minus sign)
  • Most are immediately familiar to UNIX users

$ hadoop fs -ls /sales
$ hadoop fs -cat /sales/billing_data.csv
$ hadoop fs -rm -r -f /webdata/logfiles
$ hadoop fs -mkdir /reports/marketing


Copying Data To and From HDFS

• HDFS is distinct from your local filesystem
  • hadoop fs -put copies local files to HDFS
  • hadoop fs -get fetches a local copy of a file from HDFS

[Diagram: a client machine exchanging files with a Hadoop cluster]

$ hadoop fs -put sales.txt /reports
$ hadoop fs -get /reports/sales.txt


HDFS Blocks

• Files added to HDFS are split into fixed-size blocks
  • Block size is configurable, but defaults to 64 megabytes

• Each block is then replicated to three nodes (by default)...

[Example: a 150 MB input file is split into Block #1 (64 MB), Block #2 (64 MB), and Block #3 (the remaining 22 MB)]
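The split above is just arithmetic: as many full blocks as fit, plus one final block holding the remainder. A minimal sketch of that calculation (sizes in megabytes for readability; real HDFS works in bytes, and the class and method names here are made up for illustration):

```java
public class BlockMath {
    // Split a file into fixed-size blocks, HDFS-style.
    // Returns the size of each block in megabytes.
    static int[] blockSizes(int fileMb, int blockMb) {
        int full = fileMb / blockMb;   // number of full-size blocks
        int rem = fileMb % blockMb;    // size of the final partial block, if any
        int[] sizes = new int[full + (rem > 0 ? 1 : 0)];
        for (int i = 0; i < full; i++) sizes[i] = blockMb;
        if (rem > 0) sizes[sizes.length - 1] = rem;
        return sizes;
    }

    public static void main(String[] args) {
        // A 150 MB file with 64 MB blocks -> [64, 64, 22]
        System.out.println(java.util.Arrays.toString(blockSizes(150, 64)));
    }
}
```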


HDFS Replication (cont'd)

[Diagram, built up over three slides: blocks #1, #2, and #3 are each replicated to three of the five cluster nodes (A through E), one block at a time]


HDFS Replication (cont'd)

• Copies of the block remain even if a node fails
  • Data will be re-replicated to other nodes automatically

• Replication improves both reliability and performance

[Diagram: a failed node held blocks #1 and #3, but both blocks are still available on the surviving nodes]


A Scalable Data Processing Framework
Data Processing with MapReduce


What is MapReduce?

• MapReduce is a programming model
  • It's a way of processing data
  • You can implement MapReduce in any language


Understanding Map and Reduce

• You supply two functions to process data: Map and Reduce
  • Each piece is simple, but can be powerful when combined

• Map: typically used to transform, parse, or filter data
  • The map function always runs first

• Reduce: typically used to summarize results from the map function
  • The reduce function is optional
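The same division of labor can be sketched in plain Java, outside Hadoop entirely: a map step that parses lines into words, and a reduce step that summarizes them by counting. (This only illustrates the model; the class and method names are made up, and real Hadoop jobs distribute this work across a cluster.)

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class MapReduceModel {
    // Map + reduce over in-memory data: parse lines into words,
    // then summarize by counting occurrences of each word
    static Map<String, Long> wordCount(List<String> lines) {
        return lines.stream()
            // "map" step: transform each line into lowercase words
            .flatMap(line -> Arrays.stream(line.toLowerCase().split("\\W+")))
            .filter(word -> !word.isEmpty())
            // "reduce" step: group identical words and count them
            .collect(Collectors.groupingBy(w -> w, Collectors.counting()));
    }

    public static void main(String[] args) {
        Map<String, Long> counts = wordCount(
            List.of("Soft kitty, warm kitty", "Happy kitty, Sleepy kitty"));
        System.out.println(counts.get("kitty"));  // 4
    }
}
```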


MapReduce Benefits

• Scalability
  • Hadoop divides the processing job into individual tasks
  • Tasks execute in parallel (independently) across the cluster
  • Hadoop provides job scheduling and other infrastructure

• Simplicity
  • Processes one record at a time
  • No explicit I/O
  • Far simpler for developers than typical distributed computing


MapReduce in Hadoop

• MapReduce processing in Hadoop is batch-oriented
  • High throughput, but also high latency

• A MapReduce job is broken down into smaller tasks
  • A task runs either the map function or the reduce function
  • Each processes a subset of the overall input data

• MapReduce code for Hadoop is usually written in Java
  • Can do basic MapReduce in other languages too


Visual Overview (1 - Map Phase)

Input data for the job is divided into input splits, each handled by its own map task:

Input Split #1 (Map Task #1):
  Soft kitty, warm kitty,
  little ball of fur.

Input Split #2 (Map Task #2):
  Happy kitty, Sleepy kitty,
  Purr Purr Purr.

Map output:
  Task #1: (soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
  Task #2: (happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)


Visual Overview (2 - Shuffle and Sort Phase)

Keys are shuffled, sorted, and grouped:

Map output:
  (soft, 1) (kitty, 1) (warm, 1) (kitty, 1) (little, 1) (ball, 1) (of, 1) (fur, 1)
  (happy, 1) (kitty, 1) (sleepy, 1) (kitty, 1) (purr, 1) (purr, 1) (purr, 1)

Grouped reduce input:
  (ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
  (happy, [1]) (kitty, [1, 1, 1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])
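The grouping step can be mimicked in plain Java: collect the (key, value) pairs from the map output into a sorted map of key to list of values. (An illustrative sketch only; the class and method names are made up, and Hadoop does this across machines rather than in one JVM.)

```java
import java.util.List;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;
import java.util.stream.Collectors;

public class ShuffleSketch {
    // Group map-output pairs by key, with keys in sorted order,
    // mimicking Hadoop's shuffle-and-sort phase
    static SortedMap<String, List<Integer>> shuffle(
            List<Map.Entry<String, Integer>> mapOutput) {
        return mapOutput.stream()
            .collect(Collectors.groupingBy(
                Map.Entry::getKey,
                TreeMap::new,  // sorted keys
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutput = List.of(
            Map.entry("soft", 1), Map.entry("kitty", 1),
            Map.entry("kitty", 1), Map.entry("purr", 1));
        System.out.println(shuffle(mapOutput));
        // {kitty=[1, 1], purr=[1], soft=[1]}
    }
}
```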


Visual Overview (3 - Reduce Phase)

Reduce Task #1 input: (ball, [1]) (little, [1]) (of, [1]) (purr, [1, 1, 1]) (sleepy, [1])
Reduce Task #2 input: (happy, [1]) (kitty, [1, 1, 1, 1, 1, 1]) (fur, [1]) (soft, [1]) (warm, [1])

Final output:
  ball 1, little 1, of 1, purr 3, sleepy 1
  happy 1, kitty 6, fur 1, soft 1, warm 1


Can Counting Words Really Be Useful?

• Log file analysis is a common task in Hadoop

• Relational databases aren't good at this
  • Not good at parsing semi-structured formats
  • Content does not fit into the relational model
  • Cannot handle large volumes of data at low cost

• Application logs are a valuable source of data
  • Let's see how we could summarize log events in MapReduce


Map Phase

• The input to our job is in a typical Log4J format
• The map function parses each line to extract the level
  • It then emits the level and a literal 1

Input:
  2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
  2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
  2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
  2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
  2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
  2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
  2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
  INFO 1
  INFO 1
  WARN 1
  INFO 1
  WARN 1
  INFO 1
  ERROR 1


Java Code for Map Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Mapper type parameters: input key and value types (LongWritable, Text),
// followed by output key and value types (Text, IntWritable)
public class LogEventParseMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {


Java Code for Map Function (2)

  private enum Level { TRACE, DEBUG, INFO, WARN, ERROR, FATAL };

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // Convert the Text value to a String and split out the level field
    String line = value.toString();
    String[] fields = line.split(" ");
    String levelField = fields[3];

    for (Level level : Level.values()) {
      String levelName = level.name();
      if (levelName.equalsIgnoreCase(levelField)) {
        // Emit the level (key) and a literal 1 (value)
        context.write(new Text(levelName), new IntWritable(1));
      }
    }
  }
}


The "Shuffle and Sort"

• Remember, this phase is automatic
  • The result is passed as input to the reduce function

Map output:
  INFO 1
  INFO 1
  WARN 1
  INFO 1
  WARN 1
  INFO 1
  ERROR 1

Reduce input (after the shuffle and sort):
  ERROR [1]
  INFO [1, 1, 1, 1]
  WARN [1, 1]


Reduce Phase

• The reduce function can iterate over all values for a key
  • It simply sums them up to produce the final results

Reduce input:
  ERROR [1]
  INFO [1, 1, 1, 1]
  WARN [1, 1]

Final output:
  ERROR 1
  INFO 4
  WARN 2
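The summing logic can be sketched in plain Java against the grouped input shown above. (Illustration only; the class and method names are made up, and the real reduce function on the next slides works with Hadoop's Writable types instead.)

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class ReduceSketch {
    // For each key, sum its list of values -- exactly what the
    // reduce function does for the log-level counts
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        grouped.forEach((key, values) ->
            totals.put(key, values.stream().mapToInt(Integer::intValue).sum()));
        return totals;
    }

    public static void main(String[] args) {
        // Reduce input after the shuffle and sort: one entry per log level,
        // with all of that key's values gathered into a list
        Map<String, List<Integer>> reduceInput = new TreeMap<>(Map.of(
            "ERROR", List.of(1),
            "INFO",  List.of(1, 1, 1, 1),
            "WARN",  List.of(1, 1)));
        reduce(reduceInput).forEach((level, sum) ->
            System.out.println(level + " " + sum));
        // ERROR 1
        // INFO 4
        // WARN 2
    }
}
```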


Java Code for Reduce Function (1)

package com.cloudera.example;

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Reducer type parameters: input key and value types (Text, IntWritable),
// followed by output key and value types (Text, IntWritable)
public class LogEventSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {


Java Code for Reduce Function (2)

  @Override
  public void reduce(Text key, Iterable<IntWritable> values,
      Context context) throws IOException, InterruptedException {

    int sum = 0;
    for (IntWritable value : values) {
      sum += value.get();
    }

    // Our output is the event type (key) and the sum (value)
    context.write(key, new IntWritable(sum));
  }
}


Complete Data Flow Diagram

Input log data:
  2014-02-12 22:16:49.391 CDT INFO "Here's something I thought you should know"
  2014-02-12 22:16:52.143 CDT INFO "Blah blah, yadda, yadda, yadda"
  2014-02-12 22:16:54.276 CDT WARN "Uh oh. This seems bad"
  2014-02-12 22:16:57.471 CDT INFO "It's cold in the server room. Brrr..."
  2014-02-12 22:17:01.290 CDT WARN "Things are looking down for the program"
  2014-02-12 22:17:03.812 CDT INFO "And now for something fairly unimportant"
  2014-02-12 22:17:05.314 CDT ERROR "Too late. I'm out of memory!"

Map output:
  INFO 1, INFO 1, WARN 1, INFO 1, WARN 1, INFO 1, ERROR 1

Shuffle & Sort:
  ERROR [1]
  INFO [1, 1, 1, 1]
  WARN [1, 1]

Reduce output:
  ERROR 1
  INFO 4
  WARN 2


Java Code for Driver Function (1)

package com.cloudera.example;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.Job;

// The driver is just a regular Java class with a "main" method
public class Driver {

  public static void main(String[] args) throws Exception {


Java Code for Driver Function (2)

    // validate command-line arguments (we require the user
    // to specify the HDFS paths to use for the job; see below)
    if (args.length != 2) {
      System.out.printf("Usage: Driver <input dir> <output dir>\n");
      System.exit(-1);
    }

    // Instantiate a Job object for our job's configuration
    Job job = new Job();

    // configure input and output paths based on supplied arguments
    FileInputFormat.setInputPaths(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));


Java Code for Driver Function (3)

    // tells Hadoop to copy the JAR containing this class
    // to cluster nodes, as required to run this job
    job.setJarByClass(Driver.class);

    // give the job a descriptive name. This is optional, but
    // helps us identify this job on a busy cluster
    job.setJobName("Log Event Counter Driver");

    // Specify which classes to use for the Mapper and Reducer
    job.setMapperClass(LogEventParseMapper.class);
    job.setReducerClass(LogEventSumReducer.class);


Java Code for Driver Function (4)

    // specify the job's output key and value classes
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // start the MapReduce job and wait for it to finish.
    // if it finishes successfully, return 0; otherwise 1.
    boolean success = job.waitForCompletion(true);
    System.exit(success ? 0 : 1);
  }
}


Packaging, Deployment, and Execution

1. The mapper, reducer, and driver are compiled
2. The resulting classes are packaged into a JAR
3. Execute using the hadoop jar command

$ hadoop jar logsummary.jar \
    com.cloudera.example.Driver \
    /data/logs \
    /data/logsummary


Open Source Tools that Complement Hadoop
The Hadoop Ecosystem


The Hadoop Ecosystem

• "Core Hadoop" consists of HDFS and MapReduce
  • These are the kernel of a much broader platform

• Hadoop has many related projects
  • Some help you integrate Hadoop with other systems
  • Others help you analyze your data

• These are not considered "core Hadoop"
  • Rather, they're part of the Hadoop ecosystem
  • Many are also open source Apache projects


Apache Sqoop

• Sqoop exchanges data between an RDBMS and Hadoop

• Can import an entire DB, a single table, or a table subset into HDFS
  • Does this very efficiently via a map-only MapReduce job
  • Can also export data from HDFS back to the database

[Diagram: data flows in both directions between a database and a Hadoop cluster]


Apache Flume

• Flume imports data into HDFS as it is being generated by various sources

[Diagram: program output, UNIX syslog, log files, custom sources, and many more feed into the Hadoop cluster]


Apache Mahout

• Mahout is a scalable machine learning library

• Support for several categories of problems, including
  • Classification
  • Clustering
  • Collaborative filtering

• Many algorithms implemented in MapReduce
  • Can parallelize inexpensively with Hadoop


Apache Pig

• Pig offers high-level data processing on Hadoop
  • An alternative to writing low-level MapReduce code

people = LOAD '/data/customers' AS (cust_id, name);
orders = LOAD '/data/orders' AS (ord_id, cust_id, cost);
groups = GROUP orders BY cust_id;
totals = FOREACH groups GENERATE group, SUM(orders.cost) AS t;
result = JOIN totals BY group, people BY cust_id;
DUMP result;

• Pig turns this into MapReduce jobs that run on Hadoop


Apache Hive

• Hive is another abstraction on top of MapReduce
  • Like Pig, it also reduces development time
  • Hive uses a SQL-like language called HiveQL

SELECT customers.cust_id, SUM(cost) AS total
FROM customers
JOIN orders ON customers.cust_id = orders.cust_id
GROUP BY customers.cust_id
ORDER BY total DESC;


Cloudera Impala

• Massively parallel SQL engine that runs on a Hadoop cluster
  • Inspired by Google's Dremel project
  • Can query data stored in HDFS or HBase tables

• High performance / low latency
  • Typically 10x - 50x faster than Pig or Hive
  • Query syntax is virtually identical to HiveQL / SQL

• Impala is 100% open source (Apache-licensed)


Visual Overview of a Complete Workflow

[Diagram: a Hadoop cluster with Impala at the center, surrounded by workflow steps]
• Import transaction data from RDBMS
• Sessionize Web log data with Pig
• Sentiment analysis on social media with Hive
• Build product recommendations for Web site
• Generate nightly reports using Pig, Hive, or Impala
• Analyst uses Impala for business intelligence


Conclusion  


Key Points

• Hadoop supports large-scale data storage and processing
  • Heavily influenced by Google's architectural papers
  • Already in production at thousands of companies
  • HDFS is Hadoop's storage layer
  • MapReduce is Hadoop's processing framework

• Many ecosystem projects complement Hadoop
  • Some help you to integrate Hadoop with existing systems
  • Others help you analyze the data you've stored


Questions?

• Thank you for attending!

• I'll be happy to answer any additional questions now…

• Want to learn even more?
  • Cloudera training: developers, analysts, sysadmins, and more
  • Offered in more than 50 cities worldwide, and online too!
  • See http://university.cloudera.com/ for more info


Highly Recommended Book

Hadoop: The Definitive Guide
Author: Tom White
ISBN: 1-449-31152-0