Large Scale Computing with MapReduce

Sen Han

Description: MapReduce, large-scale data processing, Hadoop

Transcript of Large Scale Computing with MapReduce

Page 1: Large Scale Computing with MapReduce

Sen Han

Page 2: Data Explosion

• By 2020, there will be 5,200 GB of data for every person on Earth.

• Over the next eight years, the amount of digital data produced will exceed 40 zettabytes, the equivalent of 5,200 GB for every person.

• The data recorded by each of the big experiments at the Large Hadron Collider (LHC) at CERN in Geneva is enough to fill around 100,000 DVDs every year.

• Sources: Facebook, Google, etc.

Page 3: Data Explosion

• Big Data in many fields:

Sport, Finance, Banking, Science, Marketing, Journalism, Medicine, Education

Page 4: A Case Study of Google

• Downloading a large number of web pages

• Creating indexes

• Retrieving the most relevant pages

Page 5: Large Data Sets

• Single-thread performance doesn't matter; throughput is more important than peak performance.

• Stuff breaks: one server can run for many years, but a large cluster of servers may lose around 10 a day.

• "Ultra-reliable" hardware doesn't really help; the software needs to be fault tolerant.

• Commodity machines at a lower price are the better choice.

Page 6: Streaming Data

            Traditional RDBMS            MapReduce
Data size   Gigabytes                    Petabytes
Access      Interactive and batch        Batch
Updates     Read and write many times    Write once, read many times
Structure   Static schema                Dynamic schema
Integrity   High                         Low
Scaling     Nonlinear                    Linear

Page 7: MapReduce in Google

Page 8: Functional MapReduce

• Map: produces a set of intermediate key/value pairs.

• Reduce: merges the intermediate values for a key to deliver the results.

Page 9: Functional MapReduce

map(String key, String value):
  // key: document name
  // value: document contents
  for each word w in value:
    EmitIntermediate(w, "1");

reduce(String key, Iterator values):
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  Emit(AsString(result));
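
For readers who want to run it, here is a minimal single-process Python sketch of the same word count; the function names and the tiny document set are illustrative, not part of the original slides.

from collections import defaultdict

def map_fn(key, value):
    # key: document name; value: document contents
    return [(w, "1") for w in value.split()]

def reduce_fn(key, values):
    # key: a word; values: a list of counts (as strings)
    return str(sum(int(v) for v in values))

documents = {"doc1": "the quick brown fox", "doc2": "the lazy dog"}

# Map phase: collect intermediate (word, "1") pairs, grouped by key.
intermediate = defaultdict(list)
for name, contents in documents.items():
    for word, count in map_fn(name, contents):
        intermediate[word].append(count)

# Reduce phase: one call per distinct word.
for word in sorted(intermediate):
    print(word, reduce_fn(word, intermediate[word]))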

Page 10: Discover Parallelism in MapReduce

• Parallel map over the input.

• Parallel grouping of intermediate data.

• Parallel map over groups.

• Parallel reduction per group.
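
As a toy illustration (not Google's or Hadoop's actual runtime), these phases can be simulated with Python's standard multiprocessing pool; the map and the per-group reduction run in parallel here, while the grouping step is shown serially for brevity.

from multiprocessing import Pool
from collections import defaultdict

def map_words(doc):
    # Parallel map over the input: one call per document.
    return [(w, 1) for w in doc.split()]

def reduce_count(item):
    # Parallel reduction per group: one call per distinct word.
    word, counts = item
    return word, sum(counts)

if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "the end"]
    with Pool() as pool:
        mapped = pool.map(map_words, docs)   # parallel map
        groups = defaultdict(list)           # grouping (serial here)
        for pairs in mapped:
            for word, count in pairs:
                groups[word].append(count)
        print(dict(pool.map(reduce_count, list(groups.items()))))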

Page 11: Distributed MapReduce

Page 12: Distributed MapReduce

Page 13: MapReduce: Job Scheduling

• One master, many workers.

• Input data is split into M map tasks (typically 64 MB each).

• The reduce phase is partitioned into R reduce tasks.

• Tasks are assigned to workers dynamically.

• Often M = 200,000; R = 4,000; workers = 2,000.

• The master assigns each map task to a free worker, considering the locality of the data to the worker when assigning the task. The worker reads the task input (often from local disk) and produces R local files containing intermediate key/value pairs.

• The master assigns each reduce task to a free worker. The worker reads the intermediate key/value pairs from the map workers, sorts them, and applies the user's Reduce op to produce the output.
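
Routing of intermediate keys to the R reduce tasks is typically done by hashing the key modulo R; a small sketch follows (the CRC-based hash is an illustrative stand-in for the real partition function):

import zlib

R = 4  # number of reduce tasks

def partition(key, num_reducers=R):
    # A stable hash (unlike Python's randomized built-in hash), so every
    # map worker routes the same key to the same reduce task.
    return zlib.crc32(key.encode()) % num_reducers

for key in ["apple", "banana", "cherry"]:
    print(key, "-> reduce task", partition(key))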

Page 14: MapReduce: Job Scheduling

Page 15: MapReduce: Fault Tolerance

• On worker failure:

• Detect failure via periodic heartbeats.

• Re-execute completed and in-progress map tasks.

• Re-execute in-progress reduce tasks.

• Task completion is committed through the master.

• On master failure:

• State is checkpointed to GFS; a new master recovers and continues.
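
A toy sketch of heartbeat-based failure detection (the timeout value and data structure are illustrative, not from the paper):

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds of silence before a worker is presumed dead
last_heartbeat = {}        # worker id -> time of last heartbeat

def on_heartbeat(worker_id):
    last_heartbeat[worker_id] = time.monotonic()

def dead_workers():
    now = time.monotonic()
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

# For each dead worker, the master reschedules its completed and
# in-progress map tasks (their output lived on the lost local disk)
# and its in-progress reduce tasks on other workers.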

Page 16: MapReduce: Locality Optimization

• Master scheduling:

• Asks GFS for the locations of the replicas of the input file blocks.

• Map task inputs are typically 64 MB splits (== the GFS block size).

• Map tasks are scheduled so that a GFS replica of their input block is on the same machine or the same rack.
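
A minimal sketch of such locality-aware assignment, assuming hypothetical worker and rack names:

def pick_worker(replica_hosts, free_workers, rack_of):
    # Prefer a free worker that already holds a replica of the input block.
    local = [w for w in free_workers if w in replica_hosts]
    if local:
        return local[0]                      # data-local
    # Otherwise prefer a free worker on the same rack as a replica.
    replica_racks = {rack_of[h] for h in replica_hosts}
    rack_local = [w for w in free_workers if rack_of[w] in replica_racks]
    if rack_local:
        return rack_local[0]                 # rack-local
    return free_workers[0]                   # remote read as a last resort

rack_of = {"w1": "r1", "w2": "r1", "w3": "r2"}
print(pick_worker({"w2", "w3"}, ["w1", "w3"], rack_of))  # -> "w3"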

Page 17: MapReduce: Other Refinements

• Optional secondary keys for ordering.

• Compression of intermediate data.

• Combiner: useful for saving network bandwidth (see the sketch after this list).

• User-defined counters.
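
For word count, a combiner can pre-sum counts on the map worker so that far fewer pairs cross the network. A sketch in our own toy code (not Hadoop's Combiner API):

from collections import Counter

def map_words(doc):
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # Local pre-aggregation on the map worker; valid here because
    # addition is associative and commutative.
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())

pairs = map_words("to be or not to be")
print(len(pairs), "pairs before combining")   # 6
print(len(combine(pairs)), "pairs after")     # 4 distinct words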

Page 18: MapReduce: Examples

• Distributed grep: the map function emits a line if it matches a given pattern; the reduce function is an identity function that just copies the supplied intermediate data to the output.

• Count of URL access frequency: the input is web page logs, and the map output is <URL, 1> pairs; the reduce function adds together all values for the same URL and emits a <URL, total count> pair.

• Reverse web-link graph: the map function outputs (target, source) pairs for each link to a target URL found in a page named source; the reduce function concatenates the list of all source URLs associated with a given target URL and emits the pair (target, list(source)). A sketch follows this list.

• Term vector per host: a term vector summarizes the most important words that occur in a document or a set of documents as a list of (word, frequency) pairs.
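
Here is the reverse web-link graph example on toy data (a real job would parse crawled pages rather than this hand-written dict):

from collections import defaultdict

pages = {  # source page -> target URLs it links to
    "a.html": ["b.html", "c.html"],
    "b.html": ["c.html"],
}

def map_links(source, targets):
    return [(target, source) for target in targets]

# Shuffle: group sources by target.
inlinks = defaultdict(list)
for source, targets in pages.items():
    for target, src in map_links(source, targets):
        inlinks[target].append(src)

# Reduce: emit (target, list(source)).
for target, sources in inlinks.items():
    print(target, sources)   # e.g. c.html ['a.html', 'b.html']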

Page 19: MapReduce: Runtime Library

• The MapReduce runtime library [8] provides:

• Automatic parallelization.

• Load balancing.

• Network and disk transfer optimization.

• Handling of machine failures.

• Robustness.

Page 20: Hadoop: An Open-Source Library

• Economy: runs on a cluster of commodity computers.

• Usability: a simple interface for submitting computing jobs; all the distributed computing is carried out behind the scenes, so users do not need to deal with those issues.

• Reliability: fault tolerant.

Page 21: Existing Limitations

Page 22: What Was Required?

• Built-in backup became a necessity.

• A built-in automated recovery mechanism.

• Running things in parallel (distributed programming).

• Easy to administer.

• Something cost effective.

Page 23: Origin of Hadoop

• Google's MapReduce

• Apache Nutch (open-source web search engine)

• Apache Lucene (text search library)

Page 24: HDFS

The file system component of Hadoop, designed around:

• Streaming data access

• Hardware failure as the norm

• Commodity hardware

• Moving data is expensive

Page 25: Hadoop

• Scalable

• Fault tolerant

• Distributed file system

• Data storage

• Cost-effective processing

Page 26: HDFS Core Architecture

[Diagram: the user's data goes through an HDFS client, which talks to a single NameNode (master); the NameNode coordinates multiple DataNodes (slaves) that store the data.]

Page 27: NameNode

• Only one NameNode.

• Selects the DataNodes that create replicas.

• Image

• Checkpoint

• Journal

• CheckpointNode / BackupNode

Page 28: DataNode

• Variable block size (default is 128 MB).

• Replicas at multiple locations (default 3).

• The namespace of all the blocks is stored in the NameNode.

• Handshake with the NameNode at startup.

• Storage ID: identifies a DataNode.

• Block-replica reports every hour.

• Heartbeat: signals normal operation of the DataNode.

Page 29: Snapshot

• A backup of the state of the file system.

• Protects against data loss during a software upgrade.

• The DataNode copies its storage directories and hard-links the block files into them.

[Diagram: the DataNode sends heartbeats to the NameNode while a snapshot is taken.]

Page 30: Reads and Writes

• Data in a file cannot be modified once saved (append only).

• Only one client can have write access to a file at a time, enforced with a soft limit and a hard limit.

• Bytes are sent in a pipeline to the data blocks, in the form of packets.

• Optimized for batch programming systems.

Page 31: Replica Management

• Two rules:

1. No DataNode contains more than one replica of any block.

2. No rack contains more than two replicas of the same block.

• The placement of replicas plays a vital role.

• The block report gives the number of replicas.

• Under-replicated blocks enter a replication priority queue. A small check of the two rules is sketched after this list.
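
A small check of the two placement rules, with made-up DataNode and rack names (HDFS enforces this inside the NameNode):

from collections import Counter

def placement_ok(replica_hosts, rack_of):
    # Rule 1: no DataNode holds more than one replica of the block.
    if len(set(replica_hosts)) != len(replica_hosts):
        return False
    # Rule 2: no rack holds more than two replicas of the block.
    racks = Counter(rack_of[h] for h in replica_hosts)
    return max(racks.values()) <= 2

rack_of = {"dn1": "r1", "dn2": "r1", "dn3": "r2"}
print(placement_ok(["dn1", "dn2", "dn3"], rack_of))  # True
print(placement_ok(["dn1", "dn1", "dn3"], rack_of))  # False (violates rule 1)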

Page 32: Security

• Follows the POSIX permission model (read, write, and execute).

• The latest version uses Kerberos authentication.

• Data should not travel over untrusted networks.

• Security features are still very weak, but work on them is ongoing.

Page 33: Who Uses Hadoop?

• Yahoo

• Facebook

• Twitter

• eBay

• LinkedIn

• Amazon (A9)

Page 34: Yahoo

• Yahoo played a vital role in the development of Hadoop.

• Initially used it for indexing web crawl results.

• Also used to block spam entering the mail server, for filters, content optimization, etc.

Page 35: Facebook

• When Facebook first started, it used a commercial RDBMS.

• It needed infrastructure that could handle such huge amounts of data.

• With Hadoop, days of processing turned into hours.

• Used for log processing, recommendation systems, data warehousing, and archiving.

Page 36: Twitter

• Uses LZO compression to store data.

• Used for analyzing and collecting information.

• Uses the Scala programming language along with Hadoop.

• Applied to tweets, log information, etc.

Page 37: eBay

• Huge amounts of data.

• Uses Teradata and Hadoop together to store data.

• Uses Hadoop to understand customer needs.

• Applied to search queries, server logs, click-throughs, etc.

Page 38: LinkedIn

• Uses Hadoop to analyze data and build new data products such as:

• People You May Know

• Jobs matching your skills

• Profile visitors, etc.

Page 39: Other Applications

• Amazon A9

• The New York Times

• IBM

• Last.fm

• Veoh

• And the list goes on...

Page 40: Where Does Hadoop Not Work?

• Optimized for high data throughput at the expense of latency.

• The NameNode is a single point of failure with limited memory.

• No modification of data already in a file.

• Hadoop is not a substitute for a database.

• Consumes immense power.

Page 41: Which Is the Best?

YOU CHOOSE YOURSELF

Page 42: The Hadoop Ecosystem

Hadoop is supplemented by an ecosystem of Apache projects such as:

• Pig

• Hive

• ZooKeeper

• HBase

• Sqoop

Page 43: Hadoop Applications

• Pig

• Hive

• HBase

• ZooKeeper

• Sqoop

Page 44: Pig Description

• Pig is a large-scale data analysis platform built on Hadoop.

• Provides an SQL-like language called Pig Latin.

• Converts Pig Latin data requests into a series of optimized MapReduce computations.

• Handles complex, massive-data parallel computing.

• Provides a simple operational and programming interface.

Page 45: The Scope of Pig

Page 46: Who Uses Pig?

• Amazon/A9

• AOL

• Facebook

• Fox Interactive Media

• Google

• IBM

• New York Times

• PowerSet (now Microsoft)

• Quantcast

• Rackspace/Mailtrust

• Veoh

• Yahoo!

Page 47: Pig Characteristics

• Ad-hoc analysis

• Runs on cluster computing architectures

• Operations with SQL-like syntax

• Open-source code

Page 49: Pig Usage

• Connect to the local Hadoop cluster.

• Run Pig via a Pig script, the Grunt shell, or the embedded method.

Page 50: Pig Usage

records = LOAD 'first.txt' AS (itemname:chararray, price:int, quality:int);

filter_records = FILTER records BY price != 999 AND quality == 0;

group_records = GROUP filter_records BY itemname;

max_price = FOREACH group_records GENERATE group, MAX(filter_records.price);

DUMP max_price;

The script loads the records, keeps the valid ones, groups them by item name, and prints the maximum price per item.

Page 51: Pig and SQL Comparison

• SQL is a declarative programming language; Pig Latin is a data-flow programming language.

• A relational database management system (RDBMS) stores data in strictly defined schema tables; Pig accepts a looser schema, which can be defined at run time.

• SQL works with simple data structures; Pig supports complex nested data.

• SQL supports transactions, indexes, and random reads; Pig does not.

Page 52: Pig Latin

• Programs consist of a series of statements.

• Operations and commands are case-insensitive.

• Aliases and function names are case-sensitive.

• A single statement may span multiple lines of the program.

Page 53: Hive

• Hive is a data warehouse tool.

• Maps structured data files onto database tables.

• Provides complete SQL-style queries.

• Converts SQL statements into MapReduce tasks for execution.

Page 54: Hive Framework

• Storage: the Hadoop Distributed File System (HDFS).

• Computing: the MapReduce computing framework.

Page 55: Hive File System

• Data stored in HDFS is divided into blocks.

• Blocks are distributed across multiple machines.

Page 56: About Pig and Hive

• Pig is a programming language that simplifies common Hadoop tasks.

• Hive plays the role of the data warehouse in Hadoop.

• Using Pig instead of Hadoop's Java APIs can significantly reduce the amount of code.

• Pig has attracted a large number of software developers.

Page 57: HBase

• HBase is a distributed, open-source, column-oriented database.

• A distributed storage system for structured data.

• Provides Bigtable-like capabilities.

• A subproject of the Apache Hadoop project.

• Suitable for unstructured data storage.

• Column-based rather than row-based.

• For workloads requiring random access and real-time reads and writes.

Page 59: ZooKeeper

• Hadoop's distributed coordination service.

• Provides simple operations and additional abstractions, such as ordering and notifications.

• Used to implement many coordination data structures and protocols.

• Provides an open-source, shared repository of generic coordination patterns and methods.

• High performance: benchmarked at more than 10,000 ops/sec for write-dominant workloads, with throughput several times higher for read-dominant workloads.

Page 60: Sqoop

• Aims to assist in efficient data exchange between RDBMSs and Hadoop.

• Includes tools to view database tables, plus other useful utilities.

• Supports JDBC-compliant databases, such as DB2 and MySQL.

Page 62: References

• Data analysis. Retrieved from: http://public.web.cern.ch/public/en/research/DataAnalysis-en.html

• James Gallagher. DNA sequencing of MRSA used to stop outbreak. http://www.bbc.co.uk/news/health-20314024

• Shankland. (2009). Google uncloaks once-secret server. Retrieved from: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/mapreduce-osdi04.pdf

• J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, 6th Symposium on Operating Systems Design and Implementation, sponsored by USENIX in cooperation with ACM SIGOPS, pages 137-150, 2004.

• Ralf Lämmel. (2007). Google's MapReduce programming model - Revisited. Science of Computer Programming, Volume 68, Issue 3, October 2007.

• Lucas Mearian. By 2020, there will be 5,200 GB of data for every person on Earth. http://www.computerworld.com/s/article/9234563/By_2020_there_will_be_5_200_GB_of_data_for_every_person_on_Earth

• Tom White. Hadoop: The Definitive Guide. http://books.google.com/books?id=Wu_xeGdU4G8C&pg=PA648&dq=hadoop&hl=en&sa=X&ei=6mfKUPW7Je3U2QWtzIDgCg&ved=0CDcQ6AEwAA

• Ilan Horn. Introduction to MapReduce, an Abstraction for Large-Scale Computation. http://www.slideshare.net/rantav/introduction-to-map-reduce#btnNext

Page 63: References

• A brief view of the platform. Retrieved from: http://hadooper.blogspot.com/

• Hadoop. Retrieved from: http://pig.apache.org/

• Applications and organizations using Hadoop. Retrieved from: http://wiki.apache.org/hadoop/PoweredBy

• Installing and Running Pig. Retrieved from: http://ofps.oreilly.com/titles/9781449302641/running_pig.html

• Gates, Alan. Programming Pig. 1st ed. O'Reilly Media, 2009. 11-50. Print.

• What is Hive? Retrieved from: http://hive.apache.org/docs/r0.8.1/

• Hive vs. Pig. Retrieved from: http://www.larsgeorge.com/2009/10/hive-vs-pig.html

• George, Lars. HBase: The Definitive Guide. 1st ed. O'Reilly Media, 2011. 212-215. Print.

• White, Tom. Hadoop: The Definitive Guide. 1st ed. O'Reilly Media, 2009. 312-368. Print.