EC2, MapReduce, and Distributed Processing

EC2, MapReduce, and Distributed Processing Jonathan Dahl (and Rail Spikes, Slantwise, Zencoder, etc.)

description

RailsConf Europe talk on MapReduce and EC2.

Transcript of EC2, MapReduce, and Distributed Processing

Page 1: EC2, MapReduce, and Distributed Processing

EC2, MapReduce, and Distributed Processing

Jonathan Dahl

(and Rail Spikes, Slantwise, Zencoder, etc.)

Page 2: EC2, MapReduce, and Distributed Processing

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Refers to any of a variety of computer systems that use more than one computer, or processor, to run an application. This includes parallel processing, in which a single computer uses more than one CPU to execute programs. More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various

Page 3: EC2, MapReduce, and Distributed Processing

asynchronous processing /a·syn·chro·nous prŏs'ěs'ĭz/ noun Computations that run independently of each other, without requiring constant synchronization. Each operation

Page 4: EC2, MapReduce, and Distributed Processing

parallel processing /par·al·lel prŏs'ěs'ĭz/ noun Simultaneous computation of a single problem or system running across separate CPU cores. Includes

Page 5: EC2, MapReduce, and Distributed Processing

distributed processing /dis'trib'ut'ed prŏs'ěs'ĭz/ noun Just like parallel processing, but utilizing separate full systems, not just separate CPU cores.

Page 6: EC2, MapReduce, and Distributed Processing

You

Page 7: EC2, MapReduce, and Distributed Processing

Me

Page 8: EC2, MapReduce, and Distributed Processing
Page 9: EC2, MapReduce, and Distributed Processing

Map______...

Page 10: EC2, MapReduce, and Distributed Processing

Rails DB

Message Queue

Transcoder 1
Transcoder 2
Transcoder 3

1. Poll Queue
2. Get job
3. Result

Page 11: EC2, MapReduce, and Distributed Processing

Roadmap:
I. Functional Programming
II. MapReduce
III. EC2
IV. Distributed Processing

Page 12: EC2, MapReduce, and Distributed Processing

I. Functional Programming

Page 13: EC2, MapReduce, and Distributed Processing

ƒ(x) vs. i++;

Page 14: EC2, MapReduce, and Distributed Processing

ƒ(x) = 2x + 1

Page 15: EC2, MapReduce, and Distributed Processing

ƒ(person) = first name + last name

Page 16: EC2, MapReduce, and Distributed Processing

lambda {|x| x*2 + 1 }

Page 17: EC2, MapReduce, and Distributed Processing

lambda do |user|
  "#{user.firstname} #{user.lastname}"
end
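A quick sketch of how the two lambdas above could be called. The `User` struct here is a stand-in assumed for illustration; the `user` on the slide is presumably a Rails model.

```ruby
# Hypothetical stand-in for the user object referenced on the slide.
User = Struct.new(:firstname, :lastname)

linear = lambda { |x| x * 2 + 1 }

full_name = lambda do |user|
  "#{user.firstname} #{user.lastname}"
end

# A lambda is a value; nothing is computed until it is called.
puts linear.call(3)                               # prints 7
puts full_name.call(User.new("Ada", "Lovelace"))  # prints Ada Lovelace
```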

Page 18: EC2, MapReduce, and Distributed Processing

ƒ(users) = ∑ of logins for each user

Page 19: EC2, MapReduce, and Distributed Processing

users.sum { |user| user.number_of_logins }

Page 20: EC2, MapReduce, and Distributed Processing

var total_logins = 0;

for (i = 0; i < users.size; i++) {
  total_logins += number_of_logins(users[i]);
}

Page 21: EC2, MapReduce, and Distributed Processing

users.sum(&:number)

Page 23: EC2, MapReduce, and Distributed Processing

users.each {}

Page 24: EC2, MapReduce, and Distributed Processing

result = Array.new

users.each {|user| result << user.email }

result
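The accumulator pattern above can be set next to its functional equivalents. This is a minimal sketch: the `User` struct and its sample data are made up for illustration, and `sum` with a block is assumed to come from core Ruby (2.4 or later) rather than from Rails.

```ruby
# Hypothetical user records; the fields mirror the names used on the slides.
User = Struct.new(:email, :number_of_logins)
users = [User.new("a@example.com", 3), User.new("b@example.com", 5)]

# Imperative style: build the result by mutating an accumulator.
result = Array.new
users.each { |user| result << user.email }

# Functional style: describe the transformation and let map collect the values.
emails = users.map { |user| user.email }

# The same idea for a total: no counter variable to manage.
total_logins = users.sum { |user| user.number_of_logins }

puts emails == result   # prints true
puts total_logins       # prints 8
```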

Page 25: EC2, MapReduce, and Distributed Processing

reduce

Page 26: EC2, MapReduce, and Distributed Processing

reduce == inject == fold

Page 27: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

Page 28: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

Page 29: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

ƒ(x,y) = x + y
ƒ(x,y) = x << y if y > 0
ƒ(x,y) = x << y.upcase

Page 30: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

lambda {|result, i| result + i}

lambda do |result, i|
  result << i if i > 0
  result
end

lambda {|r, i| r << i.upcase }

Page 31: EC2, MapReduce, and Distributed Processing

reduce(list, function, init)

0
[]
Hash.new("")
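To make the three arguments concrete, here is a hand-rolled `reduce(list, function, init)` written only to show the mechanics. It is a sketch, not how Ruby implements it; in practice the built-in `Enumerable#reduce` does this work. The sample lists are made up.

```ruby
# Walk the list, feeding the running result and the next item to the function.
def reduce(list, function, init)
  result = init
  list.each { |item| result = function.call(result, item) }
  result
end

add = lambda { |result, i| result + i }

# The function must return the accumulator on every step, even when it
# decides not to append anything.
keep_positive = lambda do |result, i|
  result << i if i > 0
  result
end

upcase_all = lambda { |r, i| r << i.upcase }

sum       = reduce(1..10, add, 0)
positives = reduce([3, -1, 4], keep_positive, [])
upcased   = reduce(%w[a b c d], upcase_all, [])

puts sum                 # prints 55
puts positives.inspect   # prints [3, 4]
puts upcased.inspect     # prints ["A", "B", "C", "D"]
```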

Page 32: EC2, MapReduce, and Distributed Processing

list.reduce(init) {}

Page 33: EC2, MapReduce, and Distributed Processing

(1..10).reduce(0) do |r, x|
  r + x
end

Page 41: EC2, MapReduce, and Distributed Processing

(1..10).reduce(0) do |r, x|
  r + x
end
# => 55
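One way to see what happens on each iteration is to record the accumulator as the block runs. The `steps` array is an addition for illustration only.

```ruby
steps = []

total = (1..10).reduce(0) do |r, x|
  steps << [r, x, r + x]   # accumulator in, current element, accumulator out
  r + x
end

# Show the first few iterations: the output of one step is the input of the next.
steps.first(3).each { |r, x, out| puts "r=#{r} x=#{x} -> #{out}" }
puts total   # prints 55
```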

Page 42: EC2, MapReduce, and Distributed Processing

reduce, inject, fold

Page 43: EC2, MapReduce, and Distributed Processing

reduce, inject, fold

list -> value

Page 44: EC2, MapReduce, and Distributed Processing

reduce, inject, fold

Page 45: EC2, MapReduce, and Distributed Processing

reduce, inject, fold

|result, x|

Page 57: EC2, MapReduce, and Distributed Processing

map(list, function)

Page 58: EC2, MapReduce, and Distributed Processing

map(list, function)

(1..10)
["a", "b", "c", "d"]
[#<User id: 19>, #<User id: 43>]

Page 59: EC2, MapReduce, and Distributed Processing

map(list, function)

lambda {|x| x + 1 }
lambda {|x| x.upcase }
lambda {|x| x.nil? }
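As with reduce, `map(list, function)` can be written out by hand to show what it does. This is a sketch for illustration, with made-up sample lists; Ruby's built-in `Enumerable#map` is what you would actually use.

```ruby
# Apply the function to every item and collect the return values in order.
def map(list, function)
  result = []
  list.each { |item| result << function.call(item) }
  result
end

incremented = map(1..3, lambda { |x| x + 1 })
upcased     = map(%w[a b c], lambda { |x| x.upcase })
nil_flags   = map([1, nil, "x"], lambda { |x| x.nil? })

puts incremented.inspect   # prints [2, 3, 4]
puts upcased.inspect       # prints ["A", "B", "C"]
puts nil_flags.inspect     # prints [false, true, false]
```

Unlike reduce, no call depends on the result of another call, and the output always has one entry per input.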

Page 60: EC2, MapReduce, and Distributed Processing

list.map {}

Page 61: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| x > 5 }

Page 67: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| x > 5 }
# => [false, false, false, false, false, true, true, true, true, true]

Page 68: EC2, MapReduce, and Distributed Processing

["a","b","c"]

Page 69: EC2, MapReduce, and Distributed Processing

["a","b","c"] => ["A","B","C"]

Page 70: EC2, MapReduce, and Distributed Processing

["a","b","c"] => ["A","B","C"]

User.all =>

Page 71: EC2, MapReduce, and Distributed Processing

["a","b","c"] => ["A","B","C"]

User.all => ["david", "stanley", "anna"]

Page 72: EC2, MapReduce, and Distributed Processing

(1..5).map {|x| x * x}

1 * 1
2 * 2
3 * 3
4 * 4
5 * 5

Page 73: EC2, MapReduce, and Distributed Processing

parallelizable!

Page 74: EC2, MapReduce, and Distributed Processing

(1..5).reduce(1) {|i,x| i * x}

Page 75: EC2, MapReduce, and Distributed Processing

map: parallelizable

reduce: not (?)
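A small sketch of why the question mark is there. Each map call is independent, so the calls can run anywhere in any order. A plain reduce is sequential, but when the function is associative (addition, in this example) the list can be reduced in chunks and the partial results combined. Threads are used here only to show the independence; in standard Ruby (MRI) they do not give CPU parallelism, and in the talk's setting separate servers would play this role.

```ruby
# map: every element is transformed on its own, here one thread per element.
squares = (1..5).map { |x| Thread.new { x * x } }.map(&:value)

# reduce: each step needs the previous result, so this runs strictly in order.
sequential = (1..5).reduce(0) { |r, x| r + x }

# With an associative function, reduce each chunk separately, then combine.
partials = (1..5).each_slice(2).map { |chunk| chunk.reduce(0) { |r, x| r + x } }
combined = partials.reduce(0) { |r, x| r + x }

puts squares.inspect          # prints [1, 4, 9, 16, 25]
puts partials.inspect         # prints [3, 7, 5]
puts sequential == combined   # prints true
```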

Page 76: EC2, MapReduce, and Distributed Processing

II. MapReduce

Page 77: EC2, MapReduce, and Distributed Processing

MapReduce != map + reduce

Page 78: EC2, MapReduce, and Distributed Processing
Page 79: EC2, MapReduce, and Distributed Processing

MAP a problem across several servers

Page 80: EC2, MapReduce, and Distributed Processing

REDUCE the results of each server to a single result set

Page 81: EC2, MapReduce, and Distributed Processing

list.map {|i| i.function }

results.reduce {|final, i| final[i.key] = i.function }

Page 86: EC2, MapReduce, and Distributed Processing

list.map {|i| i.function }

(group)

results.reduce {|final, i| final[i.key] = i.function }

Page 90: EC2, MapReduce, and Distributed Processing

key -> value

Page 91: EC2, MapReduce, and Distributed Processing

(1..10).map {|x| }

1. Initial data

Page 92: EC2, MapReduce, and Distributed Processing

(1..10).each_with_index.map {|x, i| }

1. Initial data

Page 93: EC2, MapReduce, and Distributed Processing

1. Initial data

• GFS chunk identifier
• Book page number
• Web URL
• Arbitrary group ID

Page 94: EC2, MapReduce, and Distributed Processing

2. Intermediate data

Map server 1:
'key1' -> 6.8
'key2' -> 6.9
'key3' -> 8.1

Page 95: EC2, MapReduce, and Distributed Processing

2. Intermediate data

Map server 2:
'key1' -> 6.2
'key4' -> 5.5

Page 96: EC2, MapReduce, and Distributed Processing

3. Final data

Reduce results:
'key1' -> 6.5
'key2' -> 6.9
'key3' -> 8.1
'key4' -> 5.5
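The step from intermediate to final data can be sketched with the numbers from the slides. The reduce function is not stated in the talk; an average is assumed here because it reproduces 'key1' -> 6.5 from 6.8 and 6.2. Variable names are illustrative.

```ruby
# Intermediate data, as emitted by the two map servers.
map_server_1 = { "key1" => 6.8, "key2" => 6.9, "key3" => 8.1 }
map_server_2 = { "key1" => 6.2, "key4" => 5.5 }

# Group: collect every value seen for a key, whichever server produced it.
grouped = Hash.new { |hash, key| hash[key] = [] }
[map_server_1, map_server_2].each do |intermediate|
  intermediate.each { |key, value| grouped[key] << value }
end

# Reduce: collapse each key's values to one value (assumed to be the average).
final = grouped.map do |key, values|
  [key, (values.sum / values.size).round(1)]
end.to_h

final.each { |key, value| puts "#{key} -> #{value}" }
```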

Page 97: EC2, MapReduce, and Distributed Processing

another view

Page 98: EC2, MapReduce, and Distributed Processing
Page 99: EC2, MapReduce, and Distributed Processing
Page 100: EC2, MapReduce, and Distributed Processing
Page 101: EC2, MapReduce, and Distributed Processing
Page 102: EC2, MapReduce, and Distributed Processing
Page 103: EC2, MapReduce, and Distributed Processing
Page 104: EC2, MapReduce, and Distributed Processing
Page 105: EC2, MapReduce, and Distributed Processing
Page 106: EC2, MapReduce, and Distributed Processing
Page 107: EC2, MapReduce, and Distributed Processing
Page 108: EC2, MapReduce, and Distributed Processing
Page 109: EC2, MapReduce, and Distributed Processing
Page 110: EC2, MapReduce, and Distributed Processing
Page 111: EC2, MapReduce, and Distributed Processing

• Stage in between ‘map’ and ‘reduce’

Page 112: EC2, MapReduce, and Distributed Processing

• All mappers must finish before reduce

Page 113: EC2, MapReduce, and Distributed Processing

• Prepare intermediate results

Page 114: EC2, MapReduce, and Distributed Processing

• (Group results by key)

Page 115: EC2, MapReduce, and Distributed Processing
Page 116: EC2, MapReduce, and Distributed Processing

Parallel reduce?

Page 117: EC2, MapReduce, and Distributed Processing
Page 118: EC2, MapReduce, and Distributed Processing

ƒ(key1), ƒ(key3), ƒ(key4)

ƒ(key2), ƒ(key5)
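Once results are grouped by key, each key can be reduced independently, so the keys can be split across several reducers. The sketch below assigns keys to reducers with a checksum; that assignment rule, the sample counts, and the use of threads as stand-ins for reduce servers are all assumptions for illustration, and the split will not necessarily match the one on the slide.

```ruby
require "zlib"

# Hypothetical grouped data: each key with the values collected for it.
grouped = {
  "key1" => [2, 3], "key2" => [5], "key3" => [1, 1, 1],
  "key4" => [4],    "key5" => [2, 2]
}

REDUCER_COUNT = 2

# Partition: a stable checksum decides which reducer owns each key.
partitions = grouped.group_by { |key, _values| Zlib.crc32(key) % REDUCER_COUNT }

# Each reducer handles only its own keys; no reducer needs another's data.
partial_results = partitions.values.map do |pairs|
  Thread.new { pairs.map { |key, values| [key, values.sum] }.to_h }
end.map(&:value)

# The final result set is the union of the reducers' outputs.
final = partial_results.reduce({}) { |all, part| all.merge(part) }

puts final.sort.inspect
```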

Page 119: EC2, MapReduce, and Distributed Processing
Page 120: EC2, MapReduce, and Distributed Processing
Page 121: EC2, MapReduce, and Distributed Processing

Example

Page 122: EC2, MapReduce, and Distributed Processing

chunky: 12
bacon: 15

Page 123: EC2, MapReduce, and Distributed Processing
Page 124: EC2, MapReduce, and Distributed Processing

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

Page 125: EC2, MapReduce, and Distributed Processing

c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end

Page 126: EC2, MapReduce, and Distributed Processing

puts words[1]
puts words[100]
puts words[1000]

puts c[:ruby]
puts c[:rails]

Page 127: EC2, MapReduce, and Distributed Processing
Page 128: EC2, MapReduce, and Distributed Processing
Page 129: EC2, MapReduce, and Distributed Processing

+1 second

Page 130: EC2, MapReduce, and Distributed Processing

book = File.open("wrnpc12.txt", "r").to_a
words = book.join(" ").split(" ")
c = words.inject(Hash.new(0)) do |i, word|
  i[word.downcase.to_sym] += 1
  i
end
words = c.sort{|a,b| b[1]<=>a[1]}

Page 131: EC2, MapReduce, and Distributed Processing

word_chunks = input_words.chunk(200)
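Ruby's built-in `Enumerable#chunk` takes a block and groups consecutive elements, so the `chunk(200)` call above is presumably a custom helper that splits the list into pieces of 200 words. A minimal sketch of that splitting with the standard `each_slice`, using a short made-up word list and a chunk size of 4:

```ruby
input_words = %w[the quick brown fox jumps over the lazy dog]

# Split the list into fixed-size pieces; the last piece may be shorter.
word_chunks = input_words.each_slice(4).to_a

puts word_chunks.inspect
# prints [["the", "quick", "brown", "fox"], ["jumps", "over", "the", "lazy"], ["dog"]]
```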

Page 132: EC2, MapReduce, and Distributed Processing

mapped_words = word_chunks.map do |words|
  distributed_count(words)
end

Page 133: EC2, MapReduce, and Distributed Processing

def distributed_count(words)
  c = words.inject(Hash.new(0)) do |i, word|
    i[word.downcase.to_sym] += 1
    i
  end
  c.sort{|a,b| b[1]<=>a[1]}
end

Page 134: EC2, MapReduce, and Distributed Processing

grouped_words = group(mapped_words)

# :the => [1829, 887, 1523] ...
# :cat => [19, 7, 36, 132] ...
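The slides do not show the body of `group`. One possible sketch, assuming each entry of `mapped_words` is a list of `[word, count]` pairs as returned by `distributed_count`; the sample input is made up:

```ruby
# Collect, for every word, the counts reported by each chunk.
def group(mapped_words)
  grouped = Hash.new { |hash, word| hash[word] = [] }
  mapped_words.each do |pairs|
    pairs.each { |word, count| grouped[word] << count }
  end
  grouped
end

# Hypothetical output of two map calls.
mapped_words = [
  [[:the, 3], [:cat, 1]],
  [[:the, 2], [:dog, 1]]
]

grouped_words = group(mapped_words)
puts grouped_words.inspect
```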

Page 135: EC2, MapReduce, and Distributed Processing

final_results = grouped_words.inject({}) do |result, words|
  result[words.first] = words.last.inject(0) {|r, i| r + i }
  result
end
words = final_results.sort{|a,b| b[1]<=>a[1]}

Page 136: EC2, MapReduce, and Distributed Processing

puts words[1]
puts words[100]
puts words[1000]

puts final_results[:ruby]
puts final_results[:rails]
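A compact end-to-end sketch that checks the chunked version against the single-process count. It uses a short made-up sentence in place of the book file, a chunk size of 4, and threads as stand-ins for separate servers; none of that comes from the talk.

```ruby
text = "the cat saw the dog and the dog saw the cat"
words = text.split(" ")

# Counting routine shared by both versions.
count = lambda do |list|
  list.inject(Hash.new(0)) do |counts, word|
    counts[word.downcase.to_sym] += 1
    counts
  end
end

# Single-process count.
single = count.call(words)

# Map: count each chunk independently.
mapped = words.each_slice(4).map { |chunk| Thread.new { count.call(chunk) } }.map(&:value)

# Group: gather the per-chunk counts for each word.
grouped = Hash.new { |hash, word| hash[word] = [] }
mapped.each { |counts| counts.each { |word, n| grouped[word] << n } }

# Reduce: sum the grouped counts for each word.
final = grouped.inject({}) do |result, (word, counts)|
  result[word] = counts.inject(0) { |r, i| r + i }
  result
end

puts final == single   # prints true
puts final[:the]       # prints 4
```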

Page 137: EC2, MapReduce, and Distributed Processing
Page 138: EC2, MapReduce, and Distributed Processing
Page 139: EC2, MapReduce, and Distributed Processing

requirements

Page 140: EC2, MapReduce, and Distributed Processing

1. Fixed problem

Page 141: EC2, MapReduce, and Distributed Processing

2. Mappable problem

Page 142: EC2, MapReduce, and Distributed Processing

3. Distributed reduce

Page 143: EC2, MapReduce, and Distributed Processing

example uses

Page 144: EC2, MapReduce, and Distributed Processing

III. EC2

Page 145: EC2, MapReduce, and Distributed Processing
Page 146: EC2, MapReduce, and Distributed Processing
Page 147: EC2, MapReduce, and Distributed Processing

Why?

Page 148: EC2, MapReduce, and Distributed Processing
Page 149: EC2, MapReduce, and Distributed Processing
Page 150: EC2, MapReduce, and Distributed Processing
Page 151: EC2, MapReduce, and Distributed Processing
Page 152: EC2, MapReduce, and Distributed Processing
Page 153: EC2, MapReduce, and Distributed Processing
Page 154: EC2, MapReduce, and Distributed Processing
Page 155: EC2, MapReduce, and Distributed Processing
Page 156: EC2, MapReduce, and Distributed Processing

Example

Page 157: EC2, MapReduce, and Distributed Processing
Page 158: EC2, MapReduce, and Distributed Processing
Page 159: EC2, MapReduce, and Distributed Processing

1851-1922

Page 160: EC2, MapReduce, and Distributed Processing

4TB

Page 161: EC2, MapReduce, and Distributed Processing

Hadoop + EC2

Hadoop

Page 162: EC2, MapReduce, and Distributed Processing

100 instances

Page 163: EC2, MapReduce, and Distributed Processing

24 hours

Page 164: EC2, MapReduce, and Distributed Processing

$240

Page 165: EC2, MapReduce, and Distributed Processing

(€164)
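The figures above imply a price per instance-hour. A quick check of the arithmetic; the hourly rate and the exchange rate are derived from the slide numbers, not quoted from a price list.

```ruby
instances      = 100
hours          = 24
total_cost_usd = 240.0
total_cost_eur = 164.0

instance_hours        = instances * hours
usd_per_instance_hour = total_cost_usd / instance_hours
eur_per_usd           = total_cost_eur / total_cost_usd

puts format("%d instance-hours at $%.2f each", instance_hours, usd_per_instance_hour)
puts format("implied exchange rate: %.3f EUR per USD", eur_per_usd)
```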

Page 166: EC2, MapReduce, and Distributed Processing

IV. Three Thoughts

Page 167: EC2, MapReduce, and Distributed Processing
Page 168: EC2, MapReduce, and Distributed Processing
Page 169: EC2, MapReduce, and Distributed Processing

Rails DB

Message Queue

Transcoder 1
Transcoder 2
Transcoder 3

1. Poll Queue
2. Get job
3. Result
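The poll / get job / result cycle in the diagram can be sketched in a single process. Ruby's in-memory `Queue` stands in for the message queue, threads stand in for the transcoder servers, a second queue stands in for writing results back, and the job names are made up. The actual system in the talk is not shown in the slides.

```ruby
jobs    = Queue.new
results = Queue.new

# Hypothetical jobs waiting on the queue.
%w[video1.mov video2.mov video3.mov video4.mov].each { |job| jobs << job }

transcoders = (1..3).map do |id|
  Thread.new do
    loop do
      # 1. Poll the queue, 2. get a job (non-blocking pop raises when empty).
      job = begin
        jobs.pop(true)
      rescue ThreadError
        break
      end
      # 3. Report the result.
      results << "transcoder #{id} finished #{job}"
    end
  end
end
transcoders.each(&:join)

finished = []
finished << results.pop until results.empty?
finished.each { |line| puts line }
```

Workers never talk to each other; adding capacity means starting more of them against the same queue.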

Page 170: EC2, MapReduce, and Distributed Processing
Page 171: EC2, MapReduce, and Distributed Processing

Hadoop

Page 172: EC2, MapReduce, and Distributed Processing
Page 173: EC2, MapReduce, and Distributed Processing

Thanks!

Jonathan Dahl

Slides at Rail Spikes http://railspikes.com

Photo Credits

• Rofi: http://flickr.com/photos/rofi/

• Digital:Slurp http://flickr.com/photos/digitalslurp/

• Others stolen from Google Image search