MapReduce & Cloud PengBo Dec 6, 2010. MapReduce Imperative Programming In computer science,...


MapReduce & Cloud

PengBo, Dec 6, 2010


MapReduce


Imperative Programming

In computer science, imperative programming is a programming paradigm that describes computation in terms of statements that change a program state.


Declarative Programming

In computer science, declarative programming is a programming paradigm that expresses the logic of a computation without describing its control flow.


Functional Language

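map f lst: ('a -> 'b) -> ('a list) -> ('b list)
Applies f to every element of the input list and returns a new list.

fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b
Applies f to each element of the input list together with an accumulator; f returns the next accumulator value.

[Figure: f applied across the list elements; fold threads the accumulator from "initial" to "returned".]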
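As a rough illustration of the two signatures above, here is a minimal Python sketch. The names `my_map` and `my_fold` are hypothetical, and the argument order of `f` (element first, accumulator second) is assumed from the `'a * 'b -> 'b` type on the slide.

```python
def my_map(f, lst):
    """Apply f to every element and return a new list; lst is not modified."""
    return [f(x) for x in lst]


def my_fold(f, x0, lst):
    """Thread an accumulator through lst, starting from x0."""
    acc = x0
    for x in lst:
        acc = f(x, acc)  # f returns the next accumulator value
    return acc


squares = my_map(lambda x: x * x, [1, 2, 3])
total = my_fold(lambda x, acc: x + acc, 0, squares)
```

Both functions leave their input untouched, which is the property the next slide relies on when it argues for parallel execution.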


From Functional Language View


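Functional operations never modify data; they always produce new data.

map and reduce are inherently parallel:

Map can be fully parallelized.
Reduce can run concurrently when f is associative (fully out-of-order execution also requires f to be commutative).

Reduce corresponds to foldl : (a -> [a] -> a)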
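To make the associativity condition concrete, the following Python sketch reduces each chunk of a list independently and then combines the partial results. The function name `chunked_reduce` and the chunking scheme are assumptions for illustration only; a real system would run each chunk on a separate worker.

```python
from functools import reduce
import operator


def chunked_reduce(f, identity, lst, n_chunks):
    """Reduce each chunk independently, then combine the partial results.

    Matches a plain left-to-right reduce only if f is associative and
    identity is a neutral element for f.
    """
    size = max(1, -(-len(lst) // n_chunks))  # ceiling division
    chunks = [lst[i:i + size] for i in range(0, len(lst), size)]
    # Each partial result could be computed on a different worker.
    partials = [reduce(f, chunk, identity) for chunk in chunks]
    return reduce(f, partials, identity)


numbers = list(range(10))
same_sum = chunked_reduce(operator.add, 0, numbers, 3) == sum(numbers)
# Subtraction is not associative, so regrouping changes the answer.
same_diff = chunked_reduce(operator.sub, 0, numbers, 3) == reduce(operator.sub, numbers, 0)
```

Here `same_sum` is true and `same_diff` is false. Chunks are kept in their original order, so an associative but non-commutative operation such as string concatenation still gives the sequential result.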


Example

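(* Each helper is a single foldl over the list. *)
fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst

(* Defined last because it uses the helpers above. *)
fun foo(l: int list) = sum(l) + mul(l) + length(l)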


MapReduce is…

“MapReduce is a programming model and an associated implementation for processing and generating large data sets.”[1]

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.


From Parallel Computing View

MapReduce is a parallel programming model.

The essence is a single function that executes in parallel on independent data sets, with outputs that are eventually combined to form a single or small number of results.

f is a map operator: map f (x:xs) = f x : map f xs
g is a reduce operator: reduce g y (x:xs) = reduce g (g y x) xs

homomorphic skeletons


MapReduce Framework


Typical problem solved by MapReduce

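Read the input: data as records in key/value format

Map: extract something from each record
  map (in_key, in_value) -> list(out_key, intermediate_value)
  Processes one input key/value pair and emits intermediate key/value pairs

Shuffle: redistribute the data
  Gathers the intermediate results that share a key onto the same node

Reduce: aggregate, summarize, filter, etc.
  reduce (out_key, list(intermediate_value)) -> list(out_value)
  Merges all values for one key, runs the computation, and emits the combined result (usually just one)

Write the output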
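The read / map / shuffle / reduce / output steps above can be sketched as a small single-process driver in Python. This is only an in-memory model under the assumption that everything fits on one machine; `run_mapreduce` and its parameters are hypothetical names, not part of any MapReduce library.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Run map, shuffle and reduce over an iterable of (key, value) records."""
    # Map: each record yields zero or more (out_key, intermediate_value) pairs.
    intermediate = []
    for in_key, in_value in records:
        intermediate.extend(mapper(in_key, in_value))

    # Shuffle: gather all intermediate values that share a key.
    groups = defaultdict(list)
    for out_key, value in intermediate:
        groups[out_key].append(value)

    # Reduce: each key's value list yields a list of output values.
    results = []
    for out_key in sorted(groups):
        for out_value in reducer(out_key, groups[out_key]):
            results.append((out_key, out_value))
    return results


records = [("r1", "x"), ("r2", "y"), ("r3", "x")]
counts = run_mapreduce(
    records,
    mapper=lambda key, value: [(value, 1)],
    reducer=lambda key, values: [sum(values)],
)
```

The user supplies only `mapper` and `reducer`; the grouping in the middle is what the framework does on the user's behalf.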


Shuffle Implementation


Partition and Sort Group

Partition function: hash(key) % (number of reducers)
Group function: sort by key
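A possible Python sketch of the partition and group steps follows. It uses `zlib.crc32` as the hash so that results are stable across runs (Python's built-in `hash` for strings is randomized per process); that choice, and the names `partition` and `shuffle`, are assumptions rather than anything the slides prescribe.

```python
import zlib
from itertools import groupby


def partition(key, num_reducers):
    """Pick the reducer for a key: hash(key) % number of reducers."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route (key, value) pairs to reducers, then sort and group by key."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))

    grouped = []
    for bucket in buckets:
        bucket.sort(key=lambda kv: kv[0])  # stable sort by key only
        grouped.append(
            [(k, [v for _, v in g]) for k, g in groupby(bucket, key=lambda kv: kv[0])]
        )
    return grouped


pairs = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
per_reducer = shuffle(pairs, 2)
```

Because the partition depends only on the key, every value for a given key ends up at the same reducer, which is what the reduce step needs.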


Word Frequencies in Web pages

Input: one document per record

The user implements the map function, whose input is
  key = document URL
  value = document contents

map emits (potentially many) key/value pairs.
  For every occurrence of a word in the document, it emits one record <word, "1">


Example continued:

The MapReduce runtime (library) gathers all records with the same key (shuffle/sort)

The user implements the reduce function, which computes over the values for one key
  Here: summation (sum)

Reduce emits <key, sum>
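The word-frequency example can be sketched in Python as below. The shuffle is simulated with a dictionary, whitespace splitting stands in for real tokenization, and the function names and example URLs are hypothetical.

```python
from collections import defaultdict


def map_words(url, contents):
    """Emit <word, "1"> for every occurrence of a word in the document."""
    for word in contents.split():
        yield (word, "1")


def reduce_counts(word, values):
    """Sum the "1" records collected for one word."""
    return (word, sum(int(v) for v in values))


def word_frequencies(documents):
    """documents maps a URL to that document's contents."""
    groups = defaultdict(list)  # stands in for shuffle/sort
    for url, contents in documents.items():
        for word, one in map_words(url, contents):
            groups[word].append(one)
    return dict(reduce_counts(w, vs) for w, vs in sorted(groups.items()))


docs = {
    "http://a.example/": "to be or not to be",
    "http://b.example/": "to do",
}
freq = word_frequencies(docs)
```

Note that the map function ignores its key (the URL); only the document contents matter for this problem.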


Inverted Index


Build Inverted Index

Map: <doc#, word> ➝ [<word, doc-num>]
Reduce: <word, [doc1, doc3, ...]> ➝ <word, "doc1, doc3, …">


Build index

Input: web page data
Mapper: <url, document content> ➝ <term, docid, locid>
Shuffle & Sort: sort by term
Reducer: <term, docid, locid>* ➝ <term, <docid, locid>*>
Result: global index file, can be split by docid range
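A small Python sketch of this positional index build is shown below. It assumes each page already has a numeric docid (the slide does not say how a URL is turned into one) and that a term's position in the whitespace-split content serves as locid; `index_mapper` and `build_index` are hypothetical names.

```python
from collections import defaultdict


def index_mapper(docid, content):
    """Emit <term, docid, locid> for each term occurrence in a document."""
    for locid, term in enumerate(content.split()):
        yield (term, docid, locid)


def build_index(pages):
    """pages is a list of (docid, content); returns term -> [(docid, locid), ...]."""
    triples = [t for docid, content in pages for t in index_mapper(docid, content)]
    triples.sort()  # shuffle & sort: by term, then docid, then locid

    # Reducer: collapse the sorted triples into one posting list per term.
    postings = defaultdict(list)
    for term, docid, locid in triples:
        postings[term].append((docid, locid))
    return dict(postings)


pages = [(1, "cloud map reduce"), (2, "map cloud")]
index = build_index(pages)
```

Since each posting list comes out ordered by docid, splitting the global index by docid range amounts to slicing every list at the same boundaries.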


Quiz

PageRank Algorithm
Clustering Algorithm
Recommendation Algorithm

1. Describe the serial algorithm
   1. The core formula, with a description and explanation of the steps
   2. The input data representation and the core data structures

2. Implementation under MapReduce
   1. How to write map and reduce
   2. The input and output of each


Stories of the Cloud…


A Picture is Worth…


The Information Factories

Googleplex servers number 450,000, according to the lowest estimate
200 petabytes of hard disk storage
Four petabytes of RAM
To handle the current load of 100 million queries a day, input-output bandwidth must be in the neighborhood of 3 petabits per second


The Supercomputer that Connects Everything and Everyone

LARRY PAGE: And, actually, the ultimate search engine, which would understand, you know, exactly what you wanted when you typed in a query, and it would give you the exact right thing back, in computer science we call that artificial intelligence. That means it would be smart, and we're a long ways from having smart computers.


The Prototype (1995)


Early Google System


Spring 2000 Design


Late 2000 Design


Spring 2001 Design


Empty Google Cluster


Three Days Later…


Age of DataCenters

High-end mainframe vs. commodity PC cluster

Commodity PC cluster: good price/performance and scales out, but lower reliability
High-end mainframe: scales in, high reliability


High Capability System

SC5832
5832 Gigaflops
7776 Gigabytes ECC memory
972 6-core 64-bit nodes
2916 2 GByte/s fabric links
about 1 microsecond MPI latency
108 8-lane PCI-Express
18 KW
1 Cabinet


Millicomputers 2007


Millicomputers 2008


Guesses for 2010??


Packaging Comparisons in 1U


Cloud Computing

“The desktop is dead. Welcome to the Internet cloud, where massive facilities across the globe will store all the data you'll ever use.”


What is Cloud Computing?

1. First, write down your own opinion of "cloud computing", whatever comes to mind.

2. Questions: What? Who? Why? How? Pros and cons?

3. The most important question is: What does it have to do with me?


Cloud Computing is…

No software
Access everywhere via the Internet
Power: large-scale data processing
Appeal for startups
Cost efficiency
It is simply very convenient
Software as platform

Cons
Security
Data lock-in

SaaS
PaaS
Utility Computing


Software as a Service (SaaS)

a model of software deployment whereby a provider licenses an application to customers for use as a service on demand.


Platform as a Service (PaaS)

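For developing web applications and services, PaaS provides a complete, Internet-based integrated environment that covers development, testing, deployment, operation, and maintenance. In particular, it is built on a multi-tenant architecture from the start: users do not need to deal with multi-user concurrency themselves, because the platform handles it, including concurrency management, scalability, failure recovery, and security.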


Utility Computing

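"Pay-as-you-go": it is like plugging into a wall socket. The voltage you get is the same as the voltage Microsoft gets; you simply use less and pay less. The goal of utility computing is to give computing resources the same kind of service capability, so that users can draw on the computing resources a Fortune 500 company owns and simply use less, pay less. This is an important aspect of cloud computing.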


Cloud Computing is…


Key Characteristics

illusion of infinite computing resources available on demand;

elimination of an up-front commitment by Cloud users (startup costs);

ability to pay for use of computing resources on a short-term basis as needed: billing in small time slices. The report notes that utility computing has failed in practice on this point.

very large datacenters

large-scale software infrastructure

operational expertise


Why now?

Very large-scale datacenters are now being built and operated in practice, driven by new technology trends and business models

pay-as-you-go computing


Key Players

Amazon Web Services
Google App Engine
Microsoft Windows Azure


Key Applications

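Mobile interactive applications. Tim O'Reilly believes the future belongs to services that can deliver information to users in real time, and mobile is bound to be the key. Running the back end in a datacenter is the natural model, especially for mashup-style services.

Parallel batch processing. Using cloud computing for large-scale data processing is a natural fit, and MapReduce and Hadoop play an important role here. Moving data into and out of the cloud is a major cost, and Amazon has started to host large public datasets for free.

The rise of analytics. Among database applications, transaction-based workloads are still growing, while analytics workloads are growing rapidly, pushed hard by applications such as data mining and user behavior analysis.

Extension of compute-intensive desktop applications. For compute-intensive tasks, MATLAB and Mathematica, for example, now both have cloud computing extensions. Wow!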


Cloud Computing = Silver Bullet?

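On March 7, Google Docs suffered an incident in which a large number of users' files were exposed. A US privacy group responded by asking the government to take action against Google and make it strengthen the security of its cloud computing products.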

Problem of Data Lock-in


Challenges


Some other Voices

It’s stupidity. It’s worse than stupidity: it’s a marketing hype campaign. Somebody is saying this is inevitable, and whenever you hear somebody saying that, it’s very likely to be a set of businesses campaigning to make it true.
Richard Stallman, quoted in The Guardian, September 29, 2008

The interesting thing about Cloud Computing is that we’ve redefined Cloud Computing to include everything that we already do. . . . I don’t understand what we would do differently in the light of Cloud Computing other than change the wording of some of our ads.
Larry Ellison, quoted in the Wall Street Journal, September 26, 2008


What does it matter to ME?!

What would you do with 1,000 PCs, or even 100,000 PCs?


Cloud is coming…


Cloud Computing Initiative

Google and IBM team up on a cloud computing initiative for universities (2007-2008)
Provides several hundred computers, accessed through the Internet, to test parallel programming projects
The idea for the program came from Google senior software engineer Christophe Bisciglia
Google Code University


M45 : Open Academic Clusters

Collaboration with Major Research Universities
  Foster open research
  Focus on large-scale, highly parallel computing

Seed Facility: Datacenter in a Box (DiB)
  500 nodes, 4000 cores, 3TB RAM, 1.5PB disk
  High bandwidth connection to Internet
  Located on Yahoo! corporate campus

Runs Yahoo! / Apache Grid Stack
Carnegie Mellon University is Initial Partner
Public Announcement 11/12/07


Summary

MapReduce
Distributed Programming Model
It’s fun!
Infrastructure
Cloud computing
Imagination!


Readings

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.


Resources

[Ghemawat,2004] J. Dean and S. Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters," in OSDI, 2004, pp. 137-150.

[Gruber,2006] F. Chang, J. Dean, S. Ghemawat, W. C. Hsieh, D. A. Wallach, M. Burrows, T. Chandra, A. Fikes, and R. E. Gruber, "Bigtable: A Distributed Storage System for Structured Data," in OSDI, 2006, pp. 205-218. (Awarded Best Paper.)

[Dean,2006] J. Dean, "Experiences with MapReduce, an abstraction for large-scale computation," in Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques. Seattle, Washington, USA: ACM Press, 2006.

[Ghemawat, et al.,2003] S. Ghemawat, H. Gobioff, and S.-T. Leung, "The Google file system," in Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles. Bolton Landing, NY, USA: ACM Press, 2003.

http://lucene.apache.org/hadoop/, 2008


Thank You!

Q&A


Calculate PageRank

Input: WebGraph <from, <PR, <to>*>>
Iterate until convergence:

Mapper: <from, <PR, <to>*>> ➝
  <to, PR / outDegree(from)> for each <to>
  <from, <0, <to>*>> (carries the link structure forward)

Shuffle & Sort: by <to>

Reducer: <to, value>* and <to, <0, <out>*>> ➝ <to, ∑(value), <out>*>

Result: <to, ∑(value)> are PR[], the PageRank result array
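One iteration of this scheme can be sketched in Python as follows. The sketch follows the slide literally, so there is no damping factor and no special handling of pages without out-links (both are common in practice but are not in the slide). The function names and the dictionary-based shuffle are assumptions.

```python
from collections import defaultdict


def pagerank_mapper(src, record):
    """record is (PR, [targets]); emit rank shares plus the link structure."""
    pr, targets = record
    for dst in targets:
        yield (dst, pr / len(targets))  # <to, PR / outDegree(from)>
    yield (src, (0.0, targets))         # <from, <0, <to>*>>


def pagerank_reducer(node, values):
    """Sum the incoming shares and reattach the node's out-links."""
    total, out_links = 0.0, []
    for v in values:
        if isinstance(v, tuple):  # the structure record
            out_links = v[1]
        else:                     # a rank contribution
            total += v
    return (node, (total, out_links))


def pagerank_iteration(graph):
    """graph maps node -> (PR, [targets]); returns the graph after one pass."""
    groups = defaultdict(list)  # stands in for shuffle & sort by <to>
    for src, record in graph.items():
        for key, value in pagerank_mapper(src, record):
            groups[key].append(value)
    return dict(pagerank_reducer(n, vs) for n, vs in groups.items())


graph = {"a": (1 / 3, ["b", "c"]), "b": (1 / 3, ["c"]), "c": (1 / 3, ["a"])}
after_one = pagerank_iteration(graph)
```

The output has the same shape as the input, so a driver can call `pagerank_iteration` repeatedly and stop once the ranks change by less than a chosen tolerance.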


MapReduce Framework

[Figure: data flow through the framework]

Data store 1 ... Data store n
  Input key*value pairs

map ... map
  Each map emits (key 1, values...), (key 2, values...), (key 3, values...)

== Barrier ==: Aggregates intermediate values by output key
  key 1, intermediate values
  key 2, intermediate values
  key 3, intermediate values

reduce, reduce, reduce
  final key 1 values
  final key 2 values
  final key 3 values

...