Presentation by Carl Erhard & Zahid Mian

Post on 22-Feb-2016


Transcript of Presentation by Carl Erhard & Zahid Mian

1

HaLoop: Efficient Iterative Data Processing On Large Scale Clusters by Yingyi Bu, Bill Howe, Magdalena Balazinska, & Michael D. Ernst

Presentation by Carl Erhard & Zahid Mian

2

Many of the slides in this presentation were taken from the authors’ website, which can be found here: http://www.ics.uci.edu/~yingyib/

Citations

3

• Introduction / Motivation
• Example Algorithms Used
• Architecture of HaLoop
• Task Scheduling
• Caching and Indexing
• Experiments & Results
• Conclusion / Discussion

Agenda

4

Motivation

• MapReduce can’t express recursion/iteration
• Lots of interesting programs need loops
– graph algorithms
– clustering
– machine learning
– recursive queries
• Dominant solution: use a driver program outside of MapReduce
• Hypothesis: making MapReduce loop-aware affords optimization
– …and lays a foundation for scalable implementations of recursive languages
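The driver-program workaround can be sketched as follows (illustrative single-machine Python; `run_job` and `converged` are stand-ins for a real job submission and fixpoint test, not Hadoop APIs):

```python
# Illustrative sketch of the "driver program" workaround: iteration is
# managed by an external loop that resubmits a MapReduce-style job each pass.

def run_job(state):
    # Placeholder for one full MapReduce pass; here it just moves each
    # value halfway toward 0 so the loop has something to converge on.
    return {k: v / 2 for k, v in state.items()}

def converged(old, new, eps=0.1):
    # Fixpoint test, evaluated by the driver between job launches.
    return all(abs(old[k] - new[k]) < eps for k in old)

def driver(state, max_iters=50):
    for i in range(max_iters):
        new_state = run_job(state)       # a full job launch every iteration;
        if converged(state, new_state):  # invariant data is reloaded each time
            return new_state, i + 1
        state = new_state
    return state, max_iters
```

The cost HaLoop targets is visible in the sketch: every trip around the loop pays the full job-startup, load, and shuffle price.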

Bill Howe, UW

5

Thesis – Make a Loop Framework

• Observation: MapReduce has proven successful as a common runtime for non-recursive declarative languages
– HIVE (SQL)
– Pig (RA with nested types)
• Observation: Many people roll their own loops
– Graphs, clustering, mining, recursive queries
– Iteration managed by external script
• Thesis: With minimal extensions, we can provide an efficient common runtime for recursive languages
– Map, Reduce, Fixpoint

Bill Howe, UW

7

PageRank in MapReduce

[Diagram: PageRank dataflow in MapReduce. Map/reduce tasks join L-split0 and L-split1 with Ri ("Join & compute rank"), a second map/reduce pair performs the aggregate fixpoint evaluation, and the client checks "Converged?" before setting i = i + 1 or reporting done.]

Bill Howe, UW

8

PageRank in MapReduce

1. L is loaded on each iteration
2. L is shuffled on each iteration
3. Fixpoint evaluated as a separate MapReduce job per iteration
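The join-and-compute-rank step described here can be sketched as a single map/reduce pass (toy single-machine Python; the function names and data are mine, and the real jobs run distributed on Hadoop):

```python
# Toy sketch of the join step: the linkage table L and the rank table R_i
# are both mapped, grouped ("shuffled") on source URL, and the reducer
# distributes each page's rank along its outlinks.

def map_phase(L, R):
    grouped = {}
    for src, dests in L.items():          # L is re-read every iteration
        grouped.setdefault(src, {})["links"] = dests
    for src, rank in R.items():           # R_i from the previous iteration
        grouped.setdefault(src, {})["rank"] = rank
    return grouped

def reduce_phase(grouped):
    new_rank = {}
    for src, vals in grouped.items():
        dests = vals.get("links", [])
        for d in dests:                   # distribute rank along outlinks
            new_rank[d] = new_rank.get(d, 0.0) + vals["rank"] / len(dests)
    return new_rank

L = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
R = {"a": 1.0, "b": 1.0, "c": 1.0}
R1 = reduce_phase(map_phase(L, R))
```

Note that `map_phase` touches all of L even though L never changes between iterations, which is exactly the waste the slide is pointing at.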

[Diagram: the same PageRank dataflow, with steps 1–3 above marked on the job graph.]

L is loop invariant, but it is loaded and shuffled on every iteration anyway.

What’s the Problem?

Bill Howe, UW

9

Example 2: Descendant Query

Friend: find all friends within two hops of Eric.

R0 = {Eric, Eric}
R1 = {Eric, Elisa}
R2 = {Eric, Tom; Eric, Harry}
R3 = {}
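The query above can be sketched as a repeated join with the Friend table plus duplicate elimination (toy Python; the data follows the slide, the function is my own illustration):

```python
# Sketch of the descendant query: repeatedly join the newest frontier with
# the Friend table and drop people we have already seen.

friend = {"Eric": ["Elisa"], "Elisa": ["Tom", "Harry"]}

def descendants(start, hops):
    seen = {start}
    frontier = {start}                      # R0 = {(Eric, Eric)}
    rounds = []
    for _ in range(hops):
        # join frontier with Friend, then dupe-eliminate against seen
        nxt = {f for p in frontier for f in friend.get(p, [])} - seen
        rounds.append(sorted(nxt))          # R1, R2, ... (new people only)
        seen |= nxt
        frontier = nxt
    return rounds
```

Running `descendants("Eric", 3)` reproduces the slide: R1 finds Elisa, R2 finds Tom and Harry, and R3 is empty, so the fixpoint is reached.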

Bill Howe, UW

10

Descendant Query in MapReduce

[Diagram: descendant-query dataflow in MapReduce. A Join step (compute the next generation of friends) reads Friend0, Friend1, and Si; a Dupe-elim step (remove the ones we’ve already seen) follows; the client asks "Anything new?" before setting i = i + 1 or reporting done.]

Bill Howe, UW

11

Descendant Query in MapReduce

1. Friend is loaded on each iteration
2. Friend is shuffled on each iteration

Friend is loop invariant, but it is loaded and shuffled on every iteration anyway.

[Diagram: the same descendant-query dataflow — Join (compute next generation of friends) and Dupe-elim (remove the ones we’ve already seen) — with steps 1–2 above marked on the job graph.]

What’s the Problem?

Bill Howe, UW

12

HaLoop – The Solution

HaLoop offers the following solutions to these problems:

1. A New Programming Model & Architecture for Iterative Programs

2. Loop-Aware Task Scheduling
3. Caching for Loop Invariant Data
4. Caching for Fixpoint Evaluation

13

HaLoop Architecture

14

HaLoop Architecture

Note: The loop control (i.e. determining when execution has finished) is pushed from the application into the infrastructure.

15

• HaLoop will work given that the following is true:

• In other words, the next result is a join of the previous result and loop-invariant data L.
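In the paper’s notation, this condition can be written as the recurrence below (a sketch of the general form; R0 is the initial result and L the loop-invariant table):

```latex
R_{i+1} = R_0 \cup (R_i \bowtie L)
```

Each iteration joins the previous result with L, and the loop terminates when a fixpoint (or the iteration bound) is reached.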

HaLoop Architecture

16

• AddMap and AddReduce
Used to add a map and reduce step to the loop body
• SetFixedPointThreshold
Set a bound on the distance between iterations
• ResultDistance
A function that returns the distance between iterations
• SetMaxNumOfIterations
Set the maximum number of iterations the loop can take

HaLoop Programming Interface

17

• SetIterationInput
A function which returns the input for a given iteration
• AddStepInput
A function which allows injection of additional data between the map and reduce phases
• AddInvariantTable
Add a table which is loop invariant
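A toy driver wired up with these hooks might look like this (hypothetical Python sketch of my own; HaLoop’s real API is Java and runs on a cluster, so the names only mirror the slide):

```python
# Hypothetical sketch of HaLoop's driver hooks: AddMap/AddReduce,
# SetFixedPointThreshold, ResultDistance, SetMaxNumOfIterations,
# SetIterationInput. Single-machine and illustrative only.

class LoopJob:
    def __init__(self):
        self.steps = []            # (map_fn, reduce_fn) pairs: the loop body
        self.threshold = 0.0       # SetFixedPointThreshold
        self.max_iters = 10        # SetMaxNumOfIterations
        self.distance = None       # ResultDistance
        self.iteration_input = None  # SetIterationInput

    def add_map_reduce(self, map_fn, reduce_fn):   # AddMap / AddReduce
        self.steps.append((map_fn, reduce_fn))

    def set_fixedpoint_threshold(self, t): self.threshold = t
    def set_result_distance(self, fn): self.distance = fn
    def set_max_iterations(self, n): self.max_iters = n
    def set_iteration_input(self, fn): self.iteration_input = fn

    def run(self):
        prev = None
        for i in range(self.max_iters):
            data = self.iteration_input(i, prev)
            for map_fn, reduce_fn in self.steps:
                groups = {}                     # shuffle: group map output by key
                for rec in data:
                    for k, v in map_fn(rec):
                        groups.setdefault(k, []).append(v)
                data = [reduce_fn(k, vs) for k, vs in groups.items()]
            if prev is not None and self.distance(prev, data) <= self.threshold:
                break                           # fixpoint reached
            prev = data
        return data

def l1_distance(a, b):
    da, db = dict(a), dict(b)
    return sum(abs(da[k] - db[k]) for k in da)

job = LoopJob()
job.add_map_reduce(lambda rec: [(rec[0], rec[1] / 2)],
                   lambda k, vs: (k, sum(vs)))
job.set_result_distance(l1_distance)
job.set_fixedpoint_threshold(0.5)
job.set_max_iterations(20)
job.set_iteration_input(lambda i, prev: prev if prev is not None else [("x", 8.0)])
result = job.run()   # halves 8.0 each pass until successive results differ by <= 0.5
```

The point of the interface is that the loop control (distance test, iteration bound, per-iteration input) lives in the framework, not in an external driver script.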

18

APIs in Action – PageRank in HaLoop

19

PageRank In HaLoop

[Diagram: loop body MapRank → ReduceRank → MapAggregate → ReduceAggregate, with inputs R0 and L.]

20

PageRank In HaLoop

[Diagram: the same loop body, highlighting MapRank.]

R0 ∪ L (tagged by source file; #1 = L, #2 = R0):

Source URL   Dest/Rank            Source File
a.com        1.0                  #2
a.com        b.com,c.com,d.com    #1
b.com        1.0                  #2
b.com        (no outlinks)        #1
c.com        1.0                  #2
c.com        a.com, e.com         #1
d.com        1.0                  #2
d.com        b.com                #1
e.com        1.0                  #2
e.com        d.com,c.com          #1

Only these values are given to the reducer.

21

PageRank In HaLoop

[Diagram: the same loop body, highlighting ReduceRank ("Calculate New Rank").]

R0 ∪ L → Calculate New Rank:

Destination   New Rank
b.com         1.5
c.com         1.5
d.com         1.5
a.com         1.5
e.com         1.5
b.com         1.5
d.com         1.5
c.com         1.5

22

PageRank In HaLoop

[Diagram: the same loop body, highlighting MapAggregate, which is the identity.]

(The ReduceRank output table from the previous slide is passed through unchanged.)

23

PageRank In HaLoop

[Diagram: the same loop body, highlighting ReduceAggregate ("Aggregate"), whose output is R1.]

Destination   New Rank
a.com         1.5
b.com         3.0
c.com         3.0
d.com         3.0
e.com         1.5

24

PageRank In HaLoop

[Diagram: the same loop body; R1 has been produced.]

R1 (aggregated ranks):

Destination   New Rank
a.com         1.5
b.com         3.0
c.com         3.0
d.com         3.0
e.com         1.5

R0 (previous ranks):

Destination   New Rank
a.com         1.0
b.com         1.0
c.com         1.0
d.com         1.0
e.com         1.0

Compare R0 and R1. If the distance is not under the threshold, repeat.
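The comparison step can be sketched with a ResultDistance function (illustrative Python; the L1 distance here is my choice of metric, and the values follow the slide):

```python
# Sketch of the fixpoint test: ResultDistance compares successive rank
# tables and the loop stops once the distance falls under the threshold.

def result_distance(prev, cur):
    # L1 distance between rank vectors; illustrative choice of metric
    return sum(abs(prev[k] - cur[k]) for k in prev)

R0 = {"a.com": 1.0, "b.com": 1.0, "c.com": 1.0, "d.com": 1.0, "e.com": 1.0}
R1 = {"a.com": 1.5, "b.com": 3.0, "c.com": 3.0, "d.com": 3.0, "e.com": 1.5}

threshold = 0.5
repeat = result_distance(R0, R1) > threshold   # distance is 7.0, so iterate again
```

Because HaLoop caches reducer output, this evaluation can run distributed, without the extra MapReduce job the plain-Hadoop version needs.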

25

PageRank In HaLoop

[Diagram: the loop body begins the next iteration with R1 ∪ L, again running Calculate New Rank, Identity, and Aggregate.]

R1 ∪ L (tagged by source file; #1 = L, #2 = R1):

Source URL   Dest/Rank            Source File
a.com        1.5                  #2
a.com        b.com,c.com,d.com    #1
b.com        1.5                  #2
b.com        (no outlinks)        #1
c.com        1.5                  #2
c.com        a.com, e.com         #1
d.com        1.5                  #2
d.com        b.com                #1
e.com        1.5                  #2
e.com        d.com,c.com          #1

26

• One goal of HaLoop is to schedule map/reduce tasks on the same machine as the data.
– Scheduling the first iteration is no different than in Hadoop.
– Subsequent iterations put tasks that access the same data on the same physical node.

HaLoop Inter-Iteration Locality

27

• The master node keeps a map of node ID → filesystem partition.

• When a node becomes free, the master tries to assign a task related to data contained on that node.

• If a task is required on a node with a full load, it will utilize a nearby node.
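The assignment policy can be sketched as follows (toy Python of my own; the real scheduler lives inside the Hadoop JobTracker and also accounts for rack locality):

```python
# Toy sketch of loop-aware task assignment: the master remembers which
# partition each node processed last iteration and prefers to reassign
# the same partition to the same node; leftover work spills to free nodes.

def assign_tasks(pending, history, capacity):
    """pending: set of partition ids still to schedule;
    history: node -> partition that node held last iteration;
    capacity: node -> remaining task slots."""
    assignment = {}
    # First pass: honor inter-iteration locality where slots allow.
    for node, part in history.items():
        if part in pending and capacity.get(node, 0) > 0:
            assignment[part] = node
            capacity[node] -= 1
            pending = pending - {part}
    # Second pass: spill remaining partitions onto the least-loaded nodes
    # (the "nearby substitute node" case; its cache must then be rebuilt).
    for part in sorted(pending):
        node = max(capacity, key=capacity.get)
        assignment[part] = node
        capacity[node] -= 1
    return assignment

assignment = assign_tasks({"p0", "p1", "p2"},
                          {"n1": "p0", "n2": "p1"},
                          {"n1": 1, "n2": 0, "n3": 2})
```

In the example, n1 keeps its partition p0, but n2 is full, so p1 (and the unowned p2) fall through to n3.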

HaLoop Scheduling

28

• Mapper Input Cache
• Reducer Input Cache
• Reducer Output Cache
• Why is there no Mapper Output Cache?
• HaLoop indexes cached data
– Keys and values stored in separate local files
– Reduces I/O seek time (forward only)

Caching And Indexing

29

Approach: Inter-iteration caching

Mapper input cache (MI)

Mapper output cache (MO)

Reducer input cache (RI)

Reducer output cache (RO)

[Diagram: the loop body (map tasks feeding reduce tasks), with the four cache points marked.]

Bill Howe, UW

30

RI: Reducer Input Cache

Bill Howe, UW

• Provides:
– Access to loop-invariant data without map/shuffle
• Used by:
– Reducer function
• Assumes:
1. Mapper output for a given table constant across iterations
2. Static partitioning (implies: no new nodes)
• PageRank
– Avoid shuffling the network at every step
• Transitive Closure
– Avoid shuffling the graph at every step
• K-means
– No help

31

RO: Reducer Output Cache

Bill Howe, UW

• Provides:
– Distributed access to output of previous iterations
• Used by:
– Fixpoint evaluation
• Assumes:
1. Partitioning constant across iterations
2. Reducer output key functionally determines reducer input key
• PageRank
– Allows distributed fixpoint evaluation
– Obviates extra MapReduce job
• Transitive Closure
– No help
• K-means
– No help

32

MI: Mapper Input Cache

Bill Howe, UW

• Provides:
– Access to non-local mapper input on later iterations
• Used:
– During scheduling of map tasks
• Assumes:
1. Mapper input does not change
• PageRank
– Subsumed by use of reducer input cache
• Transitive Closure
– Subsumed by use of reducer input cache
• K-means
– Avoids non-local data reads on iterations > 0

33

• The cache must be reconstructed when:
– The hosting node fails
– The hosting node has a full load (the M/R job needs to be scheduled on a different, substitute node)
• The rebuilding process is transparent to the user.

Cache Rebuilding

34

Results: Page Rank

- Run only for 10 iterations
- Join and aggregate in every iteration
- Overhead in first step for caching input
- Catches up soon and outperforms Hadoop
- Low shuffling time: time between RPC invocation by reducer and sorting of keys

35

Results: Descendant Query

- Join and duplicate elimination in every iteration.
- Less striking performance on LiveJournal: a social network with high fan-out; excessive duplicate generation dominates the join cost, so reducer input caching is less useful.

36

Reducer Input Cache Benefit

[Chart: overall run time for transitive closure on the Billion Triples dataset (120 GB), 90 small instances on EC2.]

Bill Howe, UW

37

Reducer Input Cache Benefit

[Charts: overall run time for transitive closure — Billion Triples dataset (120 GB) on 90 EC2 small instances, and LiveJournal (12 GB).]

Bill Howe, UW

38

Reducer Input Cache Benefit

Bill Howe, UW

[Charts: reduce and shuffle time of the join step for transitive closure — Billion Triples dataset (120 GB) on 90 EC2 small instances, and LiveJournal (12 GB).]

39

Reducer Input Cache Benefit

[Diagram: the PageRank-in-MapReduce dataflow again — "Join & compute rank" over L-split0, L-split1, and Ri, followed by the aggregate fixpoint evaluation — showing the steps the reducer input cache avoids.]

40

Reducer Output Cache Benefit

Bill Howe, UW

[Charts: fixpoint evaluation time (s) vs. iteration # — LiveJournal dataset on 50 EC2 small instances, and Freebase dataset on 90 EC2 small instances.]

41

Mapper Input Cache Benefit

Bill Howe, UW

5% non-local data reads; ~5% improvement

42

Conclusions

• Relatively simple changes to MapReduce/Hadoop can support arbitrary recursive programs
– TaskTracker (cache management)
– Scheduler (cache awareness)
– Programming model (multi-step loop bodies, cache control)
• Optimizations
– Caching loop-invariant data realizes the largest gain
– Good to eliminate the extra MapReduce step for termination checks
– Mapper input cache benefit inconclusive; need a busier cluster
• Future Work
– Analyze expressiveness of Map Reduce Fixpoint
– Consider a model of Map (Reduce+) Fixpoint

43

The Good …

• HaLoop extends MapReduce:
– Easier programming of iterative algorithms
– Efficiency improvement due to loop awareness and caching
– Lets users reuse major building blocks from existing application implementations in Hadoop
– Fully backward compatible with Hadoop

44

The Questionable …

• Only useful for algorithms which can be expressed as a join of the previous iteration’s result with loop-invariant data.
• Imposes constraints: a fixed partition function for each iteration.
• Does not improve asymptotic running time: still O(M+R) scheduling decisions and O(M*R) state kept in memory. And more overhead…
• Not completely novel: iMapReduce and Twister.
• People still do iteration using traditional MapReduce: Google, Nutch, Mahout…

45

BACKUP

46

47

APIs in Action – Descendant Query

48

Descendant Query in HaLoop

[Diagram: loop body MapJoin → ReduceJoin → MapDistinct → ReduceDistinct, with inputs ∆S0 and F.]

49

Descendant Query in HaLoop

[Diagram: the same loop body, highlighting MapJoin.]

∆S0 ∪ F (tagged by source file; #1 = F, #2 = ∆S0):

Eric    Eric    #2
Eric    Elisa   #1
Elisa   Tom     #1
Elisa   Harry   #1

50

Descendant Query in HaLoop

[Diagram: the same loop body, highlighting ReduceJoin.]

Join output (tagged with iteration number 1):

Eric    Eric    1
Eric    Elisa   1
Elisa   Tom     1
Elisa   Harry   1
Eric    Tom     1
Eric    Harry   1
Elisa   Eric    1

51

Descendant Query in HaLoop

[Diagram: the same loop body, highlighting MapDistinct/ReduceDistinct, whose new tuples form ∆S1.]

∆S1:

Eric    Eric
Eric    Elisa
Elisa   Tom
Elisa   Harry
Eric    Tom
Eric    Harry
Elisa   Eric

52

Descendant Query in HaLoop

[Diagram: the next iteration begins with inputs ∆S1 and F.]

∆S1 (tagged #2):

Eric    Eric    #2
Elisa   Eric    #2
Tom     Eric    #2
Harry   Eric    #2
Tom     Elisa   #2
Harry   Elisa   #2

53

Descendant Query in HaLoop

[Diagram: the same loop body, highlighting ReduceJoin.]

Join output (tagged with iteration number 2):

Eric    Eric    2
Elisa   Eric    2
Tom     Eric    2
Harry   Eric    2
Tom     Elisa   2
Harry   Elisa   2

54

Descendant Query in HaLoop

[Diagram: the same loop body, highlighting the distinct step with an added Step Input.]

Join output (iteration 2):

Eric    Eric    2
Elisa   Eric    2
Tom     Eric    2
Harry   Eric    2
Tom     Elisa   2
Harry   Elisa   2

Step Input (tuples from earlier iterations):

Eric    Eric    1
Eric    Elisa   1
Elisa   Tom     1
Elisa   Harry   1
Eric    Tom     1
Eric    Harry   1
Elisa   Eric    1
Eric    Eric    0

Distinct output:

Eric    Eric
Tom     Eric
Harry   Eric
Tom     Elisa
Harry   Elisa
Eric    Elisa
Eric    Tom
Eric    Harry
Elisa   Tom
Elisa   Harry

Step Input allows the union of all previous iterations to be supplied as input to the distinct step.

55

Fixpoint Algorithm Example