[Roblek] Distributed computing in practice
DISTRIBUTED COMPUTING IN PRACTICE
GFS, BIGTABLE, MAPREDUCE, CHUBBY
Dominik Roblek
Software Engineer
Google Inc.
JavaBlend 2008, http://www.javablend.net/ 2
GOOGLE TECHNOLOGY LAYERS
• Computing Platform: commodity PC hardware, Linux, physical network
• Distributed Computing
• Services and Applications: Google™ search, Gmail™, Ads system, Google Maps™
IMPLICATIONS OF GOOGLE ENVIRONMENT
• Single process performance does not matter
  – Total throughput is more important
• Stuff breaks
  – If you have one server, it may stay up three years
  – If you have 10,000 servers, expect to lose ten a day
• “Ultra-reliable” hardware doesn’t really help
  – At large scales, reliable hardware still fails, albeit less often
  – Software still needs to be fault-tolerant
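A rough back-of-the-envelope check of the "ten a day" figure. It assumes (a simplification not stated on the slide) that "three years" is the mean time between failures of one server and that failures are spread evenly over time. The function name is illustrative.

```python
def expected_failures_per_day(num_servers: int, mtbf_days: float) -> float:
    """Expected daily failures if each server fails on average once per mtbf_days."""
    return num_servers / mtbf_days


# Assumption: one server fails about once every three years (3 * 365 days).
MTBF_DAYS = 3 * 365

# The same per-server failure rate applied to a fleet of 10,000 servers.
fleet = expected_failures_per_day(10_000, MTBF_DAYS)
print(f"{fleet:.1f} failures per day")
```

Under these assumptions the fleet loses roughly nine servers a day, which is the order of magnitude the slide quotes.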
BUILDING BLOCKS OF google.com?
• Distributed data
  – Google File System (GFS)
  – BigTable
• Job manager
• Distributed computation
  – MapReduce
• Distributed lock service
  – Chubby
SCALABLE DISTRIBUTED FILE SYSTEM
Google File System (GFS)
GFS: REQUIREMENTS
• High component failure rates
  – Inexpensive commodity components fail all the time
• Modest number of huge files
  – Just a few million, most of them multi-GB
• Files are write-once, mostly appended to
  – Perhaps concurrently
  – Large streaming reads
GFS: DESIGN DECISIONS
• Files stored as chunks
  – Fixed size (64MB)
• Reliability through replication
  – Each chunk replicated 3+ times
• Single master to coordinate access, keep metadata
  – Simple centralized management
• No data caching
  – Little benefit due to large data sets, streaming reads
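A minimal sketch of what fixed-size chunks imply for a client: a byte offset in a file maps to a chunk index by integer division. The function names are hypothetical and are not taken from GFS itself.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed chunk size from the slide (64 MB)


def chunk_index(offset: int) -> int:
    """Index of the chunk that holds the byte at the given file offset."""
    return offset // CHUNK_SIZE


def chunks_for_range(offset: int, length: int) -> range:
    """Chunk indexes touched by reading `length` bytes starting at `offset`."""
    if length <= 0:
        return range(0)
    first = chunk_index(offset)
    last = chunk_index(offset + length - 1)
    return range(first, last + 1)


# A 2-byte read that straddles the first chunk boundary touches two chunks.
print(list(chunks_for_range(CHUNK_SIZE - 1, 2)))
```

With large chunks, even long streaming reads touch few chunks, so the client rarely needs to ask the master where data lives.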
GFS: ARCHITECTURE
Where is a potential weakness of this design?
GFS: WEAK POINT - SINGLE MASTER
• From distributed systems we know this is a
  – Single point of failure
  – Scalability bottleneck
• GFS solutions
  – Shadow masters
  – Minimize master involvement
    • never move data through it, use only for metadata
    • large chunk size
    • master delegates authority to primary replicas in data mutations (chunk leases)
GFS: METADATA
• Global metadata is stored on the master
  – File and chunk namespaces
  – Mapping from files to chunks
  – Locations of each chunk’s replicas
• All in memory (64 bytes / chunk)
• Master has an operation log for persistent logging of critical metadata updates
  – Persistent on local disk
  – Replicated
  – Checkpoints for faster recovery
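A short worked calculation of why the metadata can stay in memory, using only the two figures from the slides (64 MB chunks, 64 bytes of metadata per chunk). It ignores namespace entries and counts each chunk once rather than once per replica. The function name is illustrative.

```python
CHUNK_SIZE = 64 * 2**20   # 64 MB per chunk
METADATA_PER_CHUNK = 64   # bytes of master metadata per chunk


def master_metadata_bytes(stored_bytes: int) -> int:
    """Approximate chunk metadata the master holds for the given amount of file data."""
    chunks = -(-stored_bytes // CHUNK_SIZE)  # ceiling division
    return chunks * METADATA_PER_CHUNK


one_pib = 2**50
print(master_metadata_bytes(one_pib) / 2**30, "GiB of chunk metadata per PiB of data")
```

Under these assumptions one PiB of file data needs about one GiB of chunk metadata, a ratio of roughly a million to one.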
GFS: MUTATIONS
• Mutations must be done for all replicas
• Master picks one replica as primary; gives it a “lease” for mutations
  – Primary defines a serial order of mutations
• Data flow decoupled from control flow
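A much simplified sketch of the ordering idea: the primary assigns each mutation a serial number and every replica applies mutations in that order. Lease expiry, data transfer, and failure handling are left out, and the class and method names are hypothetical.

```python
class Replica:
    """Holds an ordered log of applied mutations."""

    def __init__(self):
        self.log = []

    def apply(self, serial, mutation):
        self.log.append((serial, mutation))


class Primary(Replica):
    """The replica holding the lease; it decides the order for all replicas."""

    def __init__(self, secondaries):
        super().__init__()
        self.secondaries = secondaries
        self.next_serial = 0

    def mutate(self, mutation):
        # Pick the next serial number, apply locally, then forward in that order.
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, mutation)
        for secondary in self.secondaries:
            secondary.apply(serial, mutation)
        return serial


secondaries = [Replica(), Replica()]
primary = Primary(secondaries)
primary.mutate("append A")
primary.mutate("append B")
print(primary.log == secondaries[0].log == secondaries[1].log)
```

Because only the primary hands out serial numbers, concurrent clients cannot cause replicas to apply mutations in different orders.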
GFS: OPEN SOURCE ALTERNATIVES
• Hadoop Distributed File System - HDFS (Java)
  – http://hadoop.apache.org/core/docs/current/hdfs_design.html
DISTRIBUTED STORAGE FOR LARGE STRUCTURED DATA SETS
Bigtable
BIGTABLE: REQUIREMENTS
• Want to store petabytes of structured data across thousands of commodity servers
• Want a simple data format that supports dynamic control over data layout and format
• Must support very high read/write rates
  – millions of operations per second
• Latency requirements:
  – backend bulk processing
  – real-time data serving
BIGTABLE: STRUCTURE
• Bigtable is a multi-dimensional map:
  – sparse
  – persistent
  – distributed
• Key:
  – Row name
  – Column name
  – Timestamp
• Value:
  – array of bytes
(rowName: string, columnName: string, timestamp: long) → byte[]
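The mapping above can be modelled, for illustration only, as an in-memory dictionary keyed by (row, column, timestamp). This sketch is neither persistent nor distributed. The `put` and `get` helpers are hypothetical names, and small integers stand in for timestamps.

```python
from typing import Dict, Optional, Tuple

Key = Tuple[str, str, int]   # (row name, column name, timestamp)
table: Dict[Key, bytes] = {}  # sparse: only cells that were written exist


def put(row: str, column: str, timestamp: int, value: bytes) -> None:
    table[(row, column, timestamp)] = value


def get(row: str, column: str, timestamp: int) -> Optional[bytes]:
    return table.get((row, column, timestamp))


put("com.cnn.www", "contents:", 6, b"<html>...")
put("com.cnn.www", "anchor:cnnsi.com", 9, b"CNN")

print(get("com.cnn.www", "anchor:cnnsi.com", 9))   # the stored cell
print(get("com.cnn.www", "anchor:my.look.ca", 8))  # never written, so absent
```

Cells that were never written take no space, which is what "sparse" means here.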
BIGTABLE: EXAMPLE
• A web crawling system might use a Bigtable that stores web pages
  – Each row key could represent a specific URL
  – Columns represent page contents, the references to that page, and other metadata
  – The row range for a table is dynamically partitioned between servers
• Rows are clustered together on machines by key
  – Using reversed URLs (host name components reversed, e.g. com.cnn.www) as keys minimizes the number of machines where pages from a single domain are stored
  – Each cell is timestamped so there could be multiple versions of the same data in the table
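A small sketch of why reversed host names help: after reversal, hosts of one domain share a prefix and therefore sort next to each other. The helper name and the sample hosts other than www.cnn.com are made up for illustration.

```python
def reverse_host(host: str) -> str:
    """Turn 'www.cnn.com' into 'com.cnn.www'."""
    return ".".join(reversed(host.split(".")))


hosts = ["www.cnn.com", "www.example.org", "money.cnn.com"]
row_keys = sorted(reverse_host(h) for h in hosts)
print(row_keys)
```

In sorted order the two cnn.com hosts are adjacent, so they tend to land in the same contiguous row range.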
BIGTABLE: EXAMPLE
[Figure: one row of the example table]
Row “com.cnn.www”:
  – “contents:” → “<html>…” (three versions, at timestamps t3, t5, t6)
  – “anchor:cnnsi.com” → “CNN” (at t9)
  – “anchor:my.look.ca” → “CNN.com” (at t8)
BIGTABLE: ROWS
• Name is an arbitrary string
  – Access to data in a row is atomic
  – Row creation is implicit upon storing data
• Rows ordered lexicographically
  – Rows close together lexicographically are usually on one or a small number of machines
BIGTABLE: TABLETS
• Row range for a table is dynamically partitioned into tablets
• Tablet holds contiguous range of rows
  – Reads over short row ranges are efficient
  – Clients can choose row keys to achieve locality
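A sketch of how a sorted row space partitioned into contiguous ranges can be searched with a binary search over the range end keys. The boundaries are invented for illustration, and this is not a description of how Bigtable locates tablets internally.

```python
import bisect

# Hypothetical inclusive end keys of the first three tablets, in sorted order.
# A fourth tablet covers every key after the last end key.
tablet_end_keys = ["com.cnn.www", "com.google.www", "org.apache.hadoop"]


def find_tablet(row_key: str) -> int:
    """Index of the tablet whose contiguous row range contains row_key."""
    return bisect.bisect_left(tablet_end_keys, row_key)


for key in ["com.cnn.money", "com.cnn.www", "com.example", "org.zzz"]:
    print(key, "-> tablet", find_tablet(key))
```

Keys that sort close together, such as the two com.cnn.* keys, resolve to the same tablet, which is why a short row-range scan touches few machines.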
BIGTABLE: COLUMNS
• Columns have two-level name structure
<column_family>:[<column_qualifier>]
• Column family:
  – Creation must be explicit
  – Has associated type information and other metadata
  – Unit of access control
• Column qualifier
  – Unbounded number of columns
  – Creation of column within a family is implicit at updates
• Additional dimensions
BIGTABLE: TIMESTAMPS
• Used to store different versions of data in a cell
  – New writes default to current time
  – Can also be set explicitly by clients
• Lookup options
  – Return all values
  – Return most recent K values
  – Return all values in timestamp range
• Column families can be marked with attributes
  – Only retain most recent K values in a cell
  – Keep values until they are older than K seconds
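The lookup options and retention attributes can be illustrated on a single cell modelled as a list of (timestamp, value) pairs. This is only a sketch: the function names are hypothetical and the timestamps are small integers standing in for seconds.

```python
from typing import List, Tuple

Version = Tuple[int, bytes]  # (timestamp, value)


def most_recent(versions: List[Version], k: int) -> List[Version]:
    """Lookup: the K newest versions, newest first."""
    return sorted(versions, key=lambda v: v[0], reverse=True)[:k]


def in_range(versions: List[Version], start: int, end: int) -> List[Version]:
    """Lookup: all versions with start <= timestamp <= end."""
    return [v for v in versions if start <= v[0] <= end]


def keep_last_k(versions: List[Version], k: int) -> List[Version]:
    """Retention: only retain the most recent K values."""
    return most_recent(versions, k)


def keep_newer_than(versions: List[Version], now: int, max_age: int) -> List[Version]:
    """Retention: drop values once they are older than max_age seconds."""
    return [v for v in versions if now - v[0] <= max_age]


cell = [(3, b"v3"), (5, b"v5"), (6, b"v6")]
print(most_recent(cell, 2))
print(in_range(cell, 4, 6))
print(keep_newer_than(cell, now=10, max_age=5))
```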
BIGTABLE: AT GOOGLE
• Good match for most of our applications:
  – Google Earth™
  – Google Maps™
  – Google Talk™
  – Google Finance™
  – Orkut™
BIGTABLE: OPEN SOURCE ALTERNATIVES
• HBase (Java)
  – http://hadoop.apache.org/hbase/
• Hypertable (C++)
  – http://www.hypertable.org/
PROGRAMMING MODEL FOR PROCESSING LARGE DATA SETS
MapReduce
MAPREDUCE: REQUIREMENTS
• Want to process lots of data (> 1 TB)
• Want to run it on thousands of commodity PCs
• Must be robust
• … And simple to use
MAPREDUCE: DESCRIPTION
• A simple programming model that applies to many large-scale computing problems
  – Based on principles of functional languages
  – Scalable, robust
• Hide messy details in MapReduce runtime library:
  – automatic parallelization
  – load balancing
  – network and disk transfer optimization
  – handling of machine failures
  – robustness
• Improvements to core library benefit all users of library!
MAPREDUCE: FUNCTIONAL PROGRAMMING
• Functions don’t change data structures
  – They always create new ones
  – Input data remain unchanged
• Functions don’t have side effects
• Data flows are implicit in program design
• Order of operations does not matter
z := f(g(x), h(x, y), k(y))
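A tiny illustration of the last point, with arbitrarily chosen pure functions: because g, h and k have no side effects, they can be evaluated in any order (or in parallel) and f still receives the same arguments.

```python
def g(x):
    return x + 1


def h(x, y):
    return x * y


def k(y):
    return y - 1


def f(a, b, c):
    return a + b + c


x, y = 3, 4

# Evaluate the arguments left to right ...
a1 = g(x)
b1 = h(x, y)
c1 = k(y)
z1 = f(a1, b1, c1)

# ... and right to left; the inputs x and y are never modified.
c2 = k(y)
b2 = h(x, y)
a2 = g(x)
z2 = f(a2, b2, c2)

print(z1 == z2)
```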
MAPREDUCE: TYPICAL EXECUTION FLOW
• Read a lot of data
• Map: extract something you care about from each record
• Shuffle and Sort
• Reduce: aggregate, summarize, filter, or transform
• Write the results
Outline stays the same; map and reduce change to fit the problem
MAPREDUCE: PROGRAMMING INTERFACE
User must implement two functions:
Map(input_key, input_value) → (output_key, intermediate_value)
Reduce(output_key, intermediate_value_list) → output_value_list
MAPREDUCE: MAP
• Records from the data source …
  – lines out of files
  – rows of a database
  – etc.
• … are fed into the map function as (key, value) pairs
  – filename, line
  – etc.
• map produces zero, one or more intermediate values along with an output key from the input
MAPREDUCE: REDUCE
• After the map phase is over, all the intermediate values for a given output key are combined together into a list
• reduce combines those intermediate values into zero, one or more final values for that same output key
MAPREDUCE: EXAMPLE - WORD FREQUENCY 1/5
• Input is files with one document per record
• Specify a map function that takes a key/value pair
  – key = document name
  – value = document contents
• Output of map function is zero, one or more key/value pairs
  – In our case, output (word, “1”) once per word in the document
MAPREDUCE: EXAMPLE - WORD FREQUENCY 2/5
Map input: key = “document1”, value = “To be or not to be?”
Map output:
“to”, “1”
“be”, “1”
“or”, “1”
…
MAPREDUCE: EXAMPLE - WORD FREQUENCY 3/5
• MapReduce library gathers together all pairs with the same key
  – shuffle/sort
• reduce function combines the values for a key
  – In our case, compute the sum
• Output of reduce is zero, one or more values paired with key and saved
MAPREDUCE: EXAMPLE - WORD FREQUENCY 4/5
Reduce input → reduce output:
key = “be”, values = “1”, “1” → “2”
key = “not”, values = “1” → “1”
key = “or”, values = “1” → “1”
key = “to”, values = “1”, “1” → “2”
Final output:
“be”, “2”
“not”, “1”
“or”, “1”
“to”, “2”
MAPREDUCE: EXAMPLE - WORD FREQUENCY 5/5
Map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

Reduce(String output_key, Iterator intermediate_values):
  // output_key: a word, same for input and output
  // intermediate_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));
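The pseudocode can be tried out with a single-process sketch in Python. The `run_mapreduce` driver is a stand-in for the real library, assumed here only to show the map, shuffle/sort and reduce steps; it does nothing in parallel. Stripping punctuation and lower-casing words is also an assumption, made so the output matches the "To be or not to be?" example.

```python
from collections import defaultdict


def map_fn(input_key, input_value):
    """input_key: document name, input_value: document contents."""
    for word in input_value.lower().split():
        word = word.strip(".,;:!?\"'")
        if word:
            yield word, "1"


def reduce_fn(output_key, intermediate_values):
    """output_key: a word, intermediate_values: a list of counts as strings."""
    yield str(sum(int(v) for v in intermediate_values))


def run_mapreduce(inputs, map_fn, reduce_fn):
    # Map phase: collect intermediate values per output key (the "shuffle").
    groups = defaultdict(list)
    for key, value in inputs:
        for out_key, intermediate in map_fn(key, value):
            groups[out_key].append(intermediate)
    # Reduce phase: process keys in sorted order (the "sort").
    return {k: list(reduce_fn(k, groups[k])) for k in sorted(groups)}


result = run_mapreduce([("document1", "To be or not to be?")], map_fn, reduce_fn)
print(result)
```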
MAPREDUCE: DISTRIBUTED EXECUTION
MAPREDUCE: LOGICAL FLOW
MAPREDUCE: PARALLEL FLOW 1/2
• map functions run in parallel, creating different intermediate values from different input data sets
• reduce functions also run in parallel, each working on a different output key
  – All values are processed independently
• Bottleneck
  – reduce phase can’t start until map phase is completely finished
MAPREDUCE: PARALLEL FLOW 2/2
MAPREDUCE: WIDELY APPLICABLE
• distributed grep
• distributed sort
• document clustering
• machine learning
• web access log stats
• inverted index construction
• statistical machine translation
• etc.
MAPREDUCE: EXAMPLE - LANGUAGE MODEL STATISTICS
• Used in our statistical machine translation system
• Need to count # of times every 5-word sequence occurs in large corpus of documents (and keep all those where count >= 4)
• map:
  – extract 5-word sequences => count from document
• reduce:
  – summarize counts
  – keep those where count >= 4
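A single-process sketch of this job, using the sequence length (5) and threshold (4) from the slide. Whitespace tokenization, the function names and the toy documents are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def map_fn(doc_name, text, n=5):
    """Emit (n-word sequence, count within this document)."""
    words = text.split()
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    yield from counts.items()


def reduce_fn(ngram, counts, min_count=4):
    """Sum the per-document counts and keep only frequent sequences."""
    total = sum(counts)
    if total >= min_count:
        yield total


def count_ngrams(docs):
    groups = defaultdict(list)
    for name, text in docs:
        for ngram, count in map_fn(name, text):
            groups[ngram].append(count)
    result = {}
    for ngram, counts in groups.items():
        for total in reduce_fn(ngram, counts):
            result[ngram] = total
    return result


docs = [
    ("d1", "a b c d e"),
    ("d2", "a b c d e"),
    ("d3", "a b c d e"),
    ("d4", "a b c d e f"),
]
print(count_ngrams(docs))
```

Only the sequence "a b c d e" occurs four times, so the one-off sequence from d4 is dropped in the reduce step.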
MAPREDUCE: EXAMPLE - JOINING WITH OTHER DATA
• Generate per-doc summary, but include per-host information (e.g. # of pages on host, important terms on host)
  – per-host information might involve RPC to a set of machines containing data for all sites
• map:
  – extract host name from URL, lookup per-host info, combine with per-doc data and emit
• reduce:
  – identity function (just emit input value directly)
MAPREDUCE: FAULT TOLERANCE
• Master detects worker failures
  – Re-executes failed map tasks
  – Re-executes reduce tasks
• Master notices particular input key/values cause crashes in map
  – Skips those values on re-execution
MAPREDUCE: LOCAL OPTIMIZATIONS
• Master program divides up tasks based on location of data
  – tries to have map tasks on same machine as physical file data, or at least same rack
MAPREDUCE: SLOW MAP TASKS
• reduce phase cannot start before the map phase completes
  – One slow disk controller can slow down the whole system
• Master redundantly starts copies of slow-moving map tasks
  – Uses the result of whichever copy finishes first
MAPREDUCE: COMBINE
• combine is a mini-reduce phase that runs on the same machine as map phase
  – It aggregates the results of local map phases
  – Saves network bandwidth
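A sketch of the bandwidth saving for the word-frequency case: pre-summing counts on the map side shrinks the number of pairs that must cross the network. The `combine` helper is a hypothetical name, and the sketch relies on summation being associative and commutative.

```python
from collections import Counter


def map_fn(doc_name, text):
    """Emit (word, 1) once per word occurrence."""
    for word in text.lower().split():
        yield word, 1


def combine(pairs):
    """Local mini-reduce: sum the counts per word before sending them on."""
    totals = Counter()
    for word, count in pairs:
        totals[word] += count
    return list(totals.items())


raw = list(map_fn("document1", "to be or not to be"))
combined = combine(raw)
print(len(raw), "pairs without combine,", len(combined), "pairs with combine")
```

The reduce step still produces the same totals; it simply receives fewer, already partially summed pairs.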
MAPREDUCE: CONCLUSION
• MapReduce proved to be an extremely useful abstraction
  – It greatly simplifies the processing of huge amounts of data
• MapReduce is easy to use
  – Programmer can focus on the problem
  – MapReduce takes care of the messy details
MAPREDUCE: OPEN SOURCE ALTERNATIVES
• Hadoop (Java)
  – http://hadoop.apache.org/
• Disco (Erlang, Python)
  – http://discoproject.org/
• etc.
LOCK SERVICE FOR LOOSELY-COUPLED DISTRIBUTED SYSTEMS
Chubby
CHUBBY: SYNCHRONIZED ACCESS TO SHARED RESOURCES
• Key element of distributed architecture at Google:
  – Used by GFS, Bigtable and MapReduce
• Interface similar to distributed file system with advisory locks
  – Access control list
  – No links
• Every Chubby file can hold a small amount of data
• Every Chubby file or directory can be used as read or write lock
  – Locks are advisory, not mandatory
    • Clients must be well-behaved
    • A client that does not hold a lock can still read the content of a Chubby file
CHUBBY: DESIGN
• Design emphasis not on high performance, but on availability and reliability
• Reading and writing is atomic
• Chubby service is composed of 5 active replicas
  – One of them elected as master
  – Requires the majority of replicas to be alive
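The majority requirement can be made concrete with a little arithmetic: with 5 replicas, 3 must be alive, so the service tolerates 2 failures. The function names below are illustrative only.

```python
def majority(num_replicas: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return num_replicas // 2 + 1


def tolerated_failures(num_replicas: int) -> int:
    """How many replicas may be down while a majority is still alive."""
    return num_replicas - majority(num_replicas)


def has_quorum(alive: int, num_replicas: int) -> bool:
    return alive >= majority(num_replicas)


print(majority(5), tolerated_failures(5), has_quorum(3, 5), has_quorum(2, 5))
```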
CHUBBY: EVENTS
• Client can subscribe for various events:
  – file contents modified
  – child node added, removed, or modified
  – lock acquired
  – conflicting lock request from another client
  – etc.
REFERENCES
• Bibliography:
  – Ghemawat, S., Gobioff, H., and Leung, S.-T. (2003). The Google file system. In SOSP '03: Proceedings of the nineteenth ACM symposium on Operating systems principles, pages 29-43. ACM Press.
  – Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. (2006). Bigtable: A distributed storage system for structured data. In Operating Systems Design and Implementation, pages 205-218.
  – Dean, J. and Ghemawat, S. (2004). MapReduce: Simplified data processing on large clusters. In OSDI '04: Proceedings of the 6th symposium on Operating systems design and implementation, pages 137-150.
  – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. In PACT '06: Proceedings of the 15th international Conference on Parallel Architectures and Compilation Techniques. ACM.
  – Burrows, M. (2006). The Chubby lock service for loosely-coupled distributed systems. In OSDI '06: Proceedings of the 7th symposium on Operating systems design and implementation, pages 335-350.
• Partially based on:
  – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 2: MapReduce Theory and Implementation. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec2-mapred.ppt
  – Dean, J. (2006). Experiences with MapReduce, an abstraction for large-scale computation. Retrieved September 6, 2008, from http://www.cs.virginia.edu/~pact2006/program/mapreduce-pact06-keynote.pdf
  – Bisciglia, C., Kimball, A., & Michels-Slettvet, S. (2007). Distributed Computing Seminar, Lecture 3: Distributed Filesystems. Retrieved September 6, 2008, from http://code.google.com/edu/submissions/mapreduce-minilecture/lec3-dfs.ppt
  – Stokely, M. (2007). Distributed Computing at Google. Retrieved September 6, 2008, from http://www.swinog.ch/meetings/swinog15/SRE-Recruiting-SwiNOG2007.ppt