[Roblek] Distributed computing in practice

DISTRIBUTED COMPUTING IN PRAXIS: GFS, BIGTABLE, MAPREDUCE, CHUBBY

Dominik Roblek, Software Engineer, Google Inc.

JavaBlend 2008, http://www.javablend.net/

GOOGLE TECHNOLOGY LAYERS

• Services and Applications
  – Google™ search, Gmail™, Ads system, Google Maps™

• Distributed Computing

• Computing Platform
  – Commodity PC hardware, Linux, physical network


IMPLICATIONS OF GOOGLE ENVIRONMENT

• Single-process performance does not matter
  – Total throughput is more important

• Stuff breaks
  – If you have one server, it may stay up three years
  – If you have 10,000 servers, expect to lose ten a day

• “Ultra-reliable” hardware doesn’t really help
  – At large scales, reliable hardware still fails, albeit less often
  – Software still needs to be fault-tolerant
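
The "ten a day" figure follows from simple arithmetic. A minimal sketch, assuming failures are independent and that every server fails on average once in three years (both are simplifying assumptions, and the function name is only illustrative):

```python
def expected_failures_per_day(num_servers: int, mean_uptime_years: float) -> float:
    """Expected number of server failures per day across the whole fleet."""
    mean_uptime_days = mean_uptime_years * 365
    return num_servers / mean_uptime_days


# One server: a failure is a rare event.
print(expected_failures_per_day(1, 3))
# 10,000 servers: failures become a daily routine (roughly ten a day).
print(expected_failures_per_day(10_000, 3))
```

The point of the calculation is that the per-machine failure rate stays the same; only the fleet size turns a rare event into a daily one.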


BUILDING BLOCKS OF google.com?

• Distributed data
  – Google File System (GFS)
  – Bigtable

• Job manager

• Distributed computation
  – MapReduce

• Distributed lock service
  – Chubby


SCALABLE DISTRIBUTED FILE SYSTEM

Google File System (GFS)


GFS: REQUIREMENTS

• High component failure rates
  – Inexpensive commodity components fail all the time

• Modest number of huge files
  – Just a few million, most of them multi-GB

• Files are write-once, mostly appended to
  – Perhaps concurrently
  – Large streaming reads


GFS: DESIGN DECISIONS

• Files stored as chunks
  – Fixed size (64MB)

• Reliability through replication
  – Each chunk replicated 3+ times

• Single master to coordinate access, keep metadata
  – Simple centralized management

• No data caching
  – Little benefit due to large data sets, streaming reads
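
With a fixed chunk size, translating a byte offset in a file into a chunk is plain integer arithmetic. A minimal sketch of that translation (the function names are hypothetical, not part of any GFS API):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size, as on the slide


def locate(offset: int) -> tuple[int, int]:
    """Map a byte offset in a file to (chunk index, offset inside that chunk)."""
    return divmod(offset, CHUNK_SIZE)


def chunks_for_file(file_size: int) -> int:
    """Number of chunks needed to hold a file of the given size."""
    return -(-file_size // CHUNK_SIZE)  # ceiling division


print(locate(200 * 1024 * 1024))        # byte 200 MB lies in the fourth chunk (index 3)
print(chunks_for_file(5 * 1024 ** 3))   # chunks needed for a 5 GB file
```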


GFS: ARCHITECTURE

Where is a potential weakness of this design?


GFS: WEAK POINT - SINGLE MASTER

• From distributed systems we know this is a
  – Single point of failure
  – Scalability bottleneck

• GFS solutions
  – Shadow masters
  – Minimize master involvement
    • never move data through it, use only for metadata
    • large chunk size
    • master delegates authority to primary replicas in data mutations (chunk leases)


GFS: METADATA

• Global metadata is stored on the master
  – File and chunk namespaces
  – Mapping from files to chunks
  – Locations of each chunk’s replicas

• All in memory (64 bytes / chunk)

• Master has an operation log for persistent logging of critical metadata updates
  – Persistent on local disk
  – Replicated
  – Checkpoints for faster recovery
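
The 64 bytes per chunk figure explains why a single master can keep all chunk metadata in memory. A back-of-the-envelope sketch, assuming 64 MB chunks and counting only the per-chunk metadata (namespace overhead is ignored; the names are illustrative):

```python
CHUNK_SIZE = 64 * 2**20        # 64 MB per chunk
METADATA_PER_CHUNK = 64        # bytes of master memory per chunk (from the slide)


def master_memory_bytes(total_data_bytes: int) -> int:
    """Approximate master memory needed for the chunk metadata of a data set."""
    num_chunks = -(-total_data_bytes // CHUNK_SIZE)  # ceiling division
    return num_chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(master_memory_bytes(one_pib) / 2**30)  # 1.0 (GiB of metadata per PiB of data)
```

Under these assumptions the metadata is about a millionth of the data size, which is also why the large chunk size helps keep the master small.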


GFS: MUTATIONS

• Mutations must be done for all replicas

• Master picks one replica as primary; gives it a “lease” for mutations
  – Primary defines a serial order of mutations

• Data flow decoupled from control flow
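
The role of the primary can be illustrated with a toy, single-process model: the primary assigns each mutation a serial number and every replica applies mutations in exactly that order. This is only a sketch of the ordering idea; lease expiry, failures, and the separate data pipeline are left out, and all class and method names are hypothetical.

```python
class Replica:
    """Holds one copy of a chunk as an ordered list of applied mutations."""

    def __init__(self):
        self.data = []

    def apply(self, serial, mutation):
        # A replica only accepts the mutation that is next in the serial order.
        if serial != len(self.data):
            raise ValueError("mutation applied out of order")
        self.data.append(mutation)


class Primary(Replica):
    """The replica holding the lease; it decides the order for all replicas."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)          # apply locally first
        for replica in self.secondaries:      # then forward in the same order
            replica.apply(serial, mutation)
        return serial


secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
for m in ["append A", "append B", "append C"]:
    primary.mutate(m)

# All three copies end up with identical contents.
print(all(r.data == primary.data for r in secondaries))
```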


GFS: OPEN SOURCE ALTERNATIVES

• Hadoop Distributed File System - HDFS (Java)
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html


DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS

Bigtable


BIGTABLE: REQUIREMENTS

• Want to store petabytes of structured data across thousands of commodity servers

• Want a simple data format that supports dynamic control over data layout and format

• Must support very high read/write rates
  – millions of operations per second

• Latency requirements:
  – backend bulk processing
  – real-time data serving


BIGTABLE: STRUCTURE

• Bigtable is a multi-dimensional map:
  – sparse
  – persistent
  – distributed

• Key:
  – Row name
  – Column name
  – Timestamp

• Value:
  – array of bytes

(rowName: string, columnName: string, timestamp: long) → byte[]
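
The mapping above can be mimicked with an ordinary dictionary keyed by (row, column, timestamp). This sketch only illustrates the data model: it is sparse (only cells that were written exist), but it is neither persistent nor distributed, and the helper names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]          # (row name, column name, timestamp)
table: Dict[Key, bytes] = {}        # sparse: only written cells are stored


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value


def get_latest(row: str, column: str) -> Optional[bytes]:
    """Return the value with the highest timestamp for a cell, if any."""
    versions = [(ts, v) for (r, c, ts), v in table.items() if r == row and c == column]
    if not versions:
        return None
    return max(versions, key=lambda pair: pair[0])[1]


put("com.cnn.www", "contents:", 3, b"<html>v3")
put("com.cnn.www", "contents:", 5, b"<html>v5")
print(get_latest("com.cnn.www", "contents:"))   # b'<html>v5'
```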


BIGTABLE: EXAMPLE

• A web crawling system might use a Bigtable that stores web pages
  – Each row key could represent a specific URL
  – Columns represent page contents, the references to that page, and other metadata
  – The row range for a table is dynamically partitioned between servers

• Rows are clustered together on machines by key
  – Using reversed URLs as keys minimizes the number of machines where pages from a single domain are stored
  – Each cell is timestamped so there could be multiple versions of the same data in the table
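
Reversing the host name components is what makes pages of one domain sort next to each other. A minimal sketch using Python's standard `urllib.parse`; the function name `row_key` and the choice to append the path unchanged are assumptions made for this illustration.

```python
from urllib.parse import urlsplit


def row_key(url: str) -> str:
    """Build a row key by reversing the host name components of a URL."""
    parts = urlsplit(url)
    reversed_host = ".".join(reversed(parts.hostname.split(".")))
    return reversed_host + parts.path


urls = [
    "http://www.cnn.com/",
    "http://maps.google.com/index.html",
    "http://money.cnn.com/",
]
for key in sorted(row_key(u) for u in urls):
    print(key)
```

After sorting, both cnn.com keys are adjacent and precede the google.com key, so they are likely to land on the same machine or on neighbouring ones.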


BIGTABLE: EXAMPLE

Row “com.cnn.www”:

• “contents:” → “<html>…” stored in three versions, at t3, t5 and t6

• “anchor:cnnsi.com” → “CNN” at t9

• “anchor:my.look.ca” → “CNN.com” at t8


BIGTABLE: ROWS

• Name is an arbitrary string
  – Access to data in a row is atomic
  – Row creation is implicit upon storing data

• Rows ordered lexicographically
  – Rows close together lexicographically usually reside on one or a small number of machines


BIGTABLE: TABLETS

• Row range for a table is dynamically partitioned into tablets

• Tablet holds contiguous range of rows
  – Reads over short row ranges are efficient
  – Clients can choose row keys to achieve locality
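
Because each tablet holds a contiguous, sorted row range, finding the tablet for a row is a binary search over tablet boundaries. A sketch assuming each tablet is identified by its last row key and the final tablet is open-ended; the boundaries and server names are made up for the example.

```python
import bisect

# Hypothetical layout: tablet i covers all rows up to and including
# tablet_end_keys[i]; the last server takes everything after the final key.
tablet_end_keys = ["com.cnn.www", "com.google.maps", "org.apache"]
tablet_servers = ["server-a", "server-b", "server-c", "server-d"]


def server_for_row(row_key: str) -> str:
    """Find the server responsible for a row via binary search."""
    index = bisect.bisect_left(tablet_end_keys, row_key)
    return tablet_servers[index]


print(server_for_row("com.cnn.money"))    # server-a
print(server_for_row("com.cnn.www"))      # server-a
print(server_for_row("com.google.docs"))  # server-b
print(server_for_row("org.zzz"))          # server-d
```

A scan over a short row range touches only one or two tablets, which is why such reads are efficient.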


BIGTABLE: COLUMNS

• Columns have two-level name structure

<column_family>:[<column_qualifier>]

• Column family:
  – Creation must be explicit
  – Has associated type information and other metadata
  – Unit of access control

• Column qualifier
  – Unbounded number of columns
  – Creation of column within a family is implicit at updates

• Additional dimensions


BIGTABLE: TIMESTAMPS

• Used to store different versions of data in a cell
  – New writes default to current time
  – Can also be set explicitly by clients

• Lookup options
  – Return all values
  – Return most recent K values
  – Return all values in timestamp range

• Column families can be marked with attributes
  – Only retain most recent K values in a cell
  – Keep values until they are older than K seconds
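
The lookup options and retention attributes can be sketched on a single cell modelled as a dictionary from timestamp to value. The function names are hypothetical and timestamps are plain integers for simplicity.

```python
def most_recent(versions: dict, k: int) -> list:
    """Lookup: the K most recent (timestamp, value) pairs, newest first."""
    return sorted(versions.items(), reverse=True)[:k]


def in_range(versions: dict, start: int, end: int) -> list:
    """Lookup: all (timestamp, value) pairs with start <= timestamp <= end."""
    return sorted((ts, v) for ts, v in versions.items() if start <= ts <= end)


def retain_last(versions: dict, k: int) -> dict:
    """Retention: keep only the K most recent values of the cell."""
    return dict(most_recent(versions, k))


def retain_newer_than(versions: dict, now: int, max_age: int) -> dict:
    """Retention: drop values older than max_age relative to now."""
    return {ts: v for ts, v in versions.items() if now - ts <= max_age}


cell = {3: "<html>a", 5: "<html>b", 6: "<html>c"}
print(most_recent(cell, 2))
print(in_range(cell, 4, 6))
print(retain_last(cell, 1))
print(retain_newer_than(cell, now=10, max_age=4))
```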


BIGTABLE: AT GOOGLE

• Good match for most of our applications:
  – Google Earth™
  – Google Maps™
  – Google Talk™
  – Google Finance™
  – Orkut™


BIGTABLE: OPEN SOURCE ALTERNATIVES

• HBase (Java)
  – http://hadoop.apache.org/hbase/

• Hypertable (C++)
  – http://www.hypertable.org/


PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS

MapReduce

MAPREDUCE: REQUIREMENTS

• Want to process lots of data ( > 1 TB)

• Want to run it on thousands of commodity PCs

• Must be robust

• … And simple to use


MAPREDUCE: DESCRIPTION

• A simple programming model that applies to many large-scale computing problems
  – Based on principles of functional languages
  – Scalable, robust

• Hide messy details in MapReduce runtime library:
  – automatic parallelization
  – load balancing
  – network and disk transfer optimization
  – handling of machine failures
  – robustness

• Improvements to core library benefit all users of library!

MAPREDUCE: FUNCTIONAL PROGRAMMING

• Functions don’t change data structures
  – They always create new ones
  – Input data remain unchanged

• Functions don’t have side effects

• Data flows are implicit in program design

• Order of operations does not matter

z := f(g(x), h(x, y), k(y))
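
Because g, h and k have no side effects, they can be evaluated in any order, or concurrently, without changing z. A small sketch with arbitrary example functions (the bodies of f, g, h and k are invented for the illustration):

```python
from concurrent.futures import ThreadPoolExecutor


# Pure functions: the result depends only on the arguments.
def g(x): return x + 1
def h(x, y): return x * y
def k(y): return y - 1
def f(a, b, c): return a + b + c


def z_sequential(x, y):
    return f(g(x), h(x, y), k(y))


def z_parallel(x, y):
    # Submit the sub-expressions in a different order and on separate threads.
    with ThreadPoolExecutor() as pool:
        fk = pool.submit(k, y)
        fh = pool.submit(h, x, y)
        fg = pool.submit(g, x)
        return f(fg.result(), fh.result(), fk.result())


print(z_sequential(2, 3), z_parallel(2, 3))  # 11 11
```

This independence of evaluation order is the property MapReduce relies on to spread map and reduce calls across machines.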


MAPREDUCE: TYPICAL EXECUTION FLOW

• Read a lot of data

• Map: extract something you care about from each record

• Shuffle and Sort

• Reduce: aggregate, summarize, filter, or transform

• Write the results

Outline stays the same, map and reduce change to fit the problem

MAPREDUCE: PROGRAMMING INTERFACE

User must implement two functions:

Map(input_key, input_value) → (output_key, intermediate_value)

Reduce(output_key, intermediate_value_list) → output_value_list

MAPREDUCE: MAP

• Records from the data source …
  – lines out of files
  – rows of a database
  – etc.

• … are fed into the map function as (key, value) pairs
  – filename, line
  – etc.

• map produces zero, one or more intermediate values along with an output key from the input

MAPREDUCE: REDUCE

• After the map phase is over, all the intermediate values for a given output key are combined together into a list

• reduce combines those intermediate values into zero, one or more final values for that same output key


MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5

• Input is files with one document per record

• Specify a map function that takes a key/value pair
  – key = document name
  – value = document contents

• Output of map function is zero, one or more key/value pairs
  – In our case, output (word, “1”) once per word in the document



Map input: key = “document1”, value = “To be or not to be?”

Map output: (“to”, “1”), (“be”, “1”), (“or”, “1”), …


MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5

• MapReduce library gathers together all pairs with the same key
  – shuffle/sort

• reduce function combines the values for a key
  – In our case, compute the sum

• Output of reduce is zero, one or more values paired with key and saved



Reduce input → output:

key = “be”, values = “1”, “1” → “2”
key = “not”, values = “1” → “1”
key = “or”, values = “1” → “1”
key = “to”, values = “1”, “1” → “2”

Saved output: (“be”, “2”), (“not”, “1”), (“or”, “1”), (“to”, “2”)



Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
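
The pseudocode can be tried out with a small single-process stand-in for the MapReduce library. This is a sketch, not the real runtime: `run_mapreduce` is a hypothetical helper that does the shuffle/sort in memory, and stripping punctuation in the map function is an assumption made so that “be?” counts as “be”.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """input_key: document name, input_value: document contents."""
    for word in input_value.lower().split():
        word = word.strip(".,;:!?\"'")
        if word:
            yield word, "1"


def reduce_fn(output_key, intermediate_values):
    """output_key: a word, intermediate_values: a list of counts as strings."""
    yield str(sum(int(v) for v in intermediate_values))


def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase plus shuffle: group intermediate values by output key.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, intermediate in map_fn(key, value):
            groups[out_key].append(intermediate)
    # Reduce phase: one call per key, keys visited in sorted order.
    return {key: list(reduce_fn(key, groups[key])) for key in sorted(groups)}


result = run_mapreduce([("document1", "To be or not to be?")], map_fn, reduce_fn)
print(result)  # {'be': ['2'], 'not': ['1'], 'or': ['1'], 'to': ['2']}
```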


MAPREDUCE: DISTRIBUTED EXECUTION


MAPREDUCE: LOGICAL FLOW


MAPREDUCE: PARALLEL FLOW 1/2

• map functions run in parallel, creating different intermediate values from different input data sets

• reduce functions also run in parallel, each working on a different output key
  – All values are processed independently

• Bottleneck
  – reduce phase can’t start until map phase is completely finished


MAPREDUCE: PARALLEL FLOW 2/2


MAPREDUCE: WIDELY APPLICABLE

• distributed grep
• distributed sort
• document clustering
• machine learning
• web access log stats
• inverted index construction
• statistical machine translation
• etc.


MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS

• Used in our statistical machine translation system

• Need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)

• map:
  – extract 5-word sequences => count from document

• reduce:
  – summarize counts
  – keep those where count >= 4
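
A single-process sketch of this job, with the shuffle done in memory. Tokenizing by whitespace and the helper names are assumptions made for the illustration; a real system would run map and reduce on many machines.

```python
from collections import defaultdict

N = 5          # sequence length from the slide
MIN_COUNT = 4  # keep sequences seen at least this often


def map_ngrams(doc_name, text):
    """Emit (5-word sequence, 1) for every position in the document."""
    words = text.split()
    for i in range(len(words) - N + 1):
        yield " ".join(words[i:i + N]), 1


def reduce_ngrams(ngram, counts):
    """Sum the counts and emit only sufficiently frequent sequences."""
    total = sum(counts)
    if total >= MIN_COUNT:
        yield ngram, total


def count_ngrams(documents):
    grouped = defaultdict(list)            # in-memory stand-in for shuffle/sort
    for name, text in documents.items():
        for ngram, count in map_ngrams(name, text):
            grouped[ngram].append(count)
    frequent = {}
    for ngram, counts in grouped.items():
        for key, total in reduce_ngrams(ngram, counts):
            frequent[key] = total
    return frequent


docs = {"d1": "a b c d e " * 4}
print(count_ngrams(docs))  # {'a b c d e': 4}
```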


MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA

• Generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
  – per-host information might involve RPC to a set of machines containing data for all sites

• map:
  – extract host name from URL, lookup per-host info, combine with per-doc data and emit

• reduce:
  – identity function (just emit input value directly)

MAPREDUCE: FAULT TOLERANCE

• Master detects worker failures
  – Re-executes failed map tasks
  – Re-executes reduce tasks

• Master notices particular input key/values cause crashes in map
  – Skips those values on re-execution

MAPREDUCE: LOCAL OPTIMIZATIONS

• Master program divides up tasks based on location of data
  – tries to have map tasks on same machine as physical file data, or at least same rack

MAPREDUCE: SLOW MAP TASKS

• reduce phase cannot start before the map phase completes
  – One slow disk controller can slow down the whole system

• Master redundantly starts slow-moving map tasks
  – Uses results of first copy to finish

MAPREDUCE: COMBINE

• combine is a mini-reduce phase that runs on the same machine as map phase
  – It aggregates the results of local map phases
  – Saves network bandwidth
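
The bandwidth saving is easy to see on the word-frequency example: the combiner pre-sums counts on the map machine, so fewer pairs have to cross the network. A minimal sketch with hypothetical helper names; it relies on the reduce operation (a sum) being associative and commutative.

```python
from collections import Counter


def map_words(text):
    """Map output for one document: one (word, 1) pair per word."""
    return [(word, 1) for word in text.split()]


def combine(pairs):
    """Local mini-reduce: sum the counts per word before sending them on."""
    local = Counter()
    for word, count in pairs:
        local[word] += count
    return list(local.items())


pairs = map_words("to be or not to be")
combined = combine(pairs)
print(len(pairs), "pairs before combine,", len(combined), "after")
# 6 pairs before combine, 4 after
```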


MAPREDUCE: CONCLUSION

• MapReduce proved to be an extremely useful abstraction
  – It greatly simplifies the processing of huge amounts of data

• MapReduce is easy to use
  – Programmer can focus on the problem
  – MapReduce takes care of the messy details


MAPREDUCE: OPEN SOURCE ALTERNATIVES

• Hadoop (Java)
  – http://hadoop.apache.org/

• Disco (Erlang, Python)
  – http://discoproject.org/

• etc.


LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS

Chubby


CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES

• Key element of distributed architecture at Google:
  – Used by GFS, Bigtable and MapReduce

• Interface similar to distributed file system with advisory locks
  – Access control list
  – No links

• Every Chubby file can hold a small amount of data

• Every Chubby file or directory can be used as read or write lock
  – Locks are advisory, not mandatory
    • Clients must be well-behaved
    • A client that does not hold a lock can still read the content of a Chubby file


CHUBBY: DESIGN

• Design emphasis not on high performance, but on availability and reliability

• Reading and writing is atomic

• Chubby service is composed of 5 active replicas
  – One of them elected as master
  – Requires the majority of replicas to be alive
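
The majority requirement determines how many replica failures the service can survive. A small arithmetic sketch (the function names are illustrative, not part of Chubby):

```python
def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def is_available(num_replicas: int, num_alive: int) -> bool:
    """The service can make progress only while a majority is alive."""
    return num_alive >= majority(num_replicas)


def tolerated_failures(num_replicas: int) -> int:
    return num_replicas - majority(num_replicas)


print(majority(5))            # 3
print(tolerated_failures(5))  # 2
print(is_available(5, 2))     # False
```

With 5 replicas the service keeps working with up to two replicas down; losing a third stops it until a replica comes back.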


CHUBBY: EVENTS

• Client can subscribe for various events:
  – file contents modified
  – child node added, removed, or modified
  – lock acquired
  – conflicting lock request from another client
  – etc.


REFERENCES

• Bibliography:
  – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
  – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
  – Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
  – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international Conference on Parallel Architectures and Compilation Techniques. ACM.
  – Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.

• Partially based on:
  – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
  – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
  – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt
  – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt
