Cloud Computing GFS and HDFS

Based on “The Google File System”

Keke Chen

Outline
- Assumptions
- Architecture
  - Components
  - Workflow
- Master server
  - Metadata
  - Operations
- Fault tolerance
- Main system interactions
- Discussion

Motivation
- Store big data reliably
- Allow parallel processing of big data

Assumptions
- Inexpensive components that often fail
- Large files
- Large streaming reads and small random reads
- Large sequential writes
- Multiple users append to the same file
- High bandwidth is more important than low latency

Architecture
- Chunks
  - File → chunks → location of chunks (replicas)
- Master server
  - Single master
  - Keeps metadata
  - Accepts requests on metadata
  - Handles most management activities
- Chunk servers
  - Multiple
  - Keep chunks of data
  - Accept requests on chunk data

Design decisions
- Single master
  - Simplifies the design
  - Single point of failure
  - Limited number of files
    - Metadata is kept in memory
- Large chunk size, e.g., 64 MB
  - Advantages
    - Reduces client-master traffic
    - Reduces network overhead (fewer network interactions)
    - Chunk index is smaller
  - Disadvantages
    - Does not favor small files
    - Hot spots
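
The effect of a large, fixed chunk size on client-master traffic can be seen in a small sketch: the client turns a byte range into chunk indexes by itself and only asks the master about those chunks. The function name and the 1 GB example are illustrative assumptions, not part of GFS.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the example chunk size from the slide


def chunks_for_range(offset, length):
    """Return the chunk indexes touched by reading `length` bytes at `offset`."""
    if length <= 0:
        return []
    first = offset // CHUNK_SIZE
    last = (offset + length - 1) // CHUNK_SIZE
    return list(range(first, last + 1))


# A sequential read of a 1 GB file needs metadata for only 16 chunks.
one_gb = 1024 ** 3
print(len(chunks_for_range(0, one_gb)))
```

With a smaller chunk size the same read would touch proportionally more chunks, which means more lookups at the master and a larger chunk index.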

Master: metadata
- Metadata is stored in memory
- Namespaces
  - Directory → physical location
- Files → chunks → chunk locations
- Chunk locations
  - Not stored persistently by the master; reported by the chunkservers
- Operation log
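
A minimal sketch of how these pieces could fit together, assuming simple in-memory dictionaries. The class and method names are hypothetical; the point is only that namespace changes go through the operation log, while chunk locations are rebuilt from chunkserver reports.

```python
from dataclasses import dataclass, field


@dataclass
class MasterMetadata:
    # path -> ordered list of chunk handles (changes are logged)
    file_chunks: dict = field(default_factory=dict)
    # chunk handle -> set of chunkserver names (memory only, never logged)
    chunk_locations: dict = field(default_factory=dict)
    # append-only record of metadata changes
    operation_log: list = field(default_factory=list)

    def create_file(self, path):
        self.operation_log.append(("create", path))
        self.file_chunks[path] = []

    def add_chunk(self, path, handle):
        self.operation_log.append(("add_chunk", path, handle))
        self.file_chunks[path].append(handle)

    def report(self, server, handles):
        """A chunkserver tells the master which chunks it holds."""
        for handle in handles:
            self.chunk_locations.setdefault(handle, set()).add(server)

    def locate(self, path, chunk_index):
        """Answer a client lookup: chunk handle plus replica locations."""
        handle = self.file_chunks[path][chunk_index]
        return handle, sorted(self.chunk_locations.get(handle, ()))
```

After a master restart, `file_chunks` would be recovered from the log, and `chunk_locations` would fill in again as chunkservers report.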

Master Operations
- All namespace operations
  - Name lookup
  - Create/remove directories/files, etc.
- Manage chunk replicas
  - Placement decisions
  - Create new chunks and replicas
  - Balance load across all chunkservers
  - Garbage collection

Master: namespace operations
- Lookup table: full pathname → metadata
  - Namespace tree
  - Locks on nodes in the tree
    - /d1/d2/…/dn/leaf
    - Read locks on the parent directories, a read or write lock on the full path
- Advantage
  - Concurrent mutations in the same directory
  - Traditional inode-based structures do not allow this
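
The locking rule can be sketched as a function that lists the locks one operation needs. The helper names `locks_needed` and `conflicts` are made up for illustration; a real master would also acquire the locks in a consistent order to avoid deadlock.

```python
def locks_needed(path, write_leaf=True):
    """Read locks on every parent directory, read or write lock on the full path."""
    parts = [p for p in path.split("/") if p]
    if not parts:
        raise ValueError("path must name a file or directory below the root")
    prefixes = ["/" + "/".join(parts[:i]) for i in range(1, len(parts) + 1)]
    locks = {prefix: "read" for prefix in prefixes[:-1]}
    locks[prefixes[-1]] = "write" if write_leaf else "read"
    return locks


def conflicts(a, b):
    """Two lock sets conflict if they share a path and at least one side writes it."""
    return any(p in b and "write" in (mode, b[p]) for p, mode in a.items())


create_a = locks_needed("/home/user/a")
create_b = locks_needed("/home/user/b")
print(conflicts(create_a, create_b))  # False: both only read-lock /home/user
```

Two file creations in the same directory share only read locks on the parents, so they can proceed concurrently. An operation that write-locks `/home/user` itself would conflict with both.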

Master: chunk replica placement
- Goals: maximize reliability, availability, and bandwidth utilization
- Physical location matters
  - Lowest cost within the same rack
  - “Distance”: number of network switches
- In practice (Hadoop)
  - If we have 3 replicas
  - Two replicas in the same rack
  - The third one in another rack
- Choice of chunkservers
  - Low average disk utilization
  - Limited number of recent writes → distributes write traffic
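
A rough sketch of the rules on this slide, assuming three replicas: prefer servers with low disk utilization, skip servers with many recent writes, and put two replicas in one rack and the third in another. The `Chunkserver` fields and the `max_recent_writes` threshold are assumptions for illustration; this is not the actual GFS or Hadoop placement code.

```python
from dataclasses import dataclass


@dataclass
class Chunkserver:
    name: str
    rack: str
    disk_utilization: float  # fraction of disk in use, 0.0 to 1.0
    recent_writes: int       # number of recent chunk creations


def place_three_replicas(servers, max_recent_writes=10):
    """Pick two servers in one rack and a third in a different rack."""
    candidates = sorted(
        (s for s in servers if s.recent_writes <= max_recent_writes),
        key=lambda s: s.disk_utilization,
    )
    for first in candidates:
        same_rack = [s for s in candidates if s.rack == first.rack and s is not first]
        other_rack = [s for s in candidates if s.rack != first.rack]
        if same_rack and other_rack:
            return [first, same_rack[0], other_rack[0]]
    raise ValueError("need two eligible servers in one rack and one in another")
```

Because the candidates are sorted by utilization, each pick is the emptiest eligible server that still satisfies the rack constraint.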

Re-replication
- Replicas are lost for many reasons
- Prioritized: low number of replicas, live files, actively used chunks
- Placement follows the same principles as initial placement

Rebalancing
- Redistribute replicas periodically
  - Better disk utilization
  - Load balancing
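
The re-replication priorities can be expressed as a sort key. The slide does not say how the three criteria are weighed against each other, so the ordering below (fewest replicas first, then live files, then chunks that clients are waiting on) is an assumption, as are the field names.

```python
from dataclasses import dataclass


@dataclass
class ChunkState:
    handle: str
    live_replicas: int
    file_is_live: bool      # False if the file was recently deleted
    blocking_clients: int   # clients currently waiting on this chunk


def rereplication_order(chunks, goal=3):
    """Return under-replicated chunks, most urgent first."""
    needy = [c for c in chunks if c.live_replicas < goal]
    return sorted(
        needy,
        key=lambda c: (c.live_replicas, not c.file_is_live, -c.blocking_clients),
    )
```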

Master: garbage collection
- Lazy mechanism
  - Mark deletion at once
  - Reclaim resources later
- Regular namespace scan
  - For deleted files: remove metadata after three days (full deletion)
  - For orphaned chunks: let chunkservers know they are deleted (in heartbeat messages)
- Stale replicas
  - Detected using chunk version numbers
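
The lazy mechanism and the version check can be sketched together. Everything here is a simplified assumption: the class name, the dictionary layout, and passing the current time in as a number so the example stays deterministic.

```python
from dataclasses import dataclass, field

GRACE_PERIOD_SECONDS = 3 * 24 * 3600  # "three days" from the slide


@dataclass
class GarbageCollectingMaster:
    files: dict = field(default_factory=dict)          # path -> list of chunk ids
    deleted_at: dict = field(default_factory=dict)     # path -> deletion timestamp
    chunk_version: dict = field(default_factory=dict)  # chunk id -> latest version

    def delete(self, path, now):
        """Mark the file as deleted at once; nothing is reclaimed yet."""
        self.deleted_at[path] = now

    def namespace_scan(self, now):
        """Drop metadata of files whose grace period has expired."""
        expired = [p for p, t in self.deleted_at.items()
                   if now - t >= GRACE_PERIOD_SECONDS]
        for path in expired:
            for chunk_id in self.files.pop(path, []):
                self.chunk_version.pop(chunk_id, None)
            del self.deleted_at[path]
        return expired

    def heartbeat_reply(self, reported):
        """Given {chunk id: version} from a chunkserver, list chunks it may delete:
        orphans (unknown to the master) and stale replicas (older version)."""
        return sorted(c for c, v in reported.items()
                      if c not in self.chunk_version or v < self.chunk_version[c])
```

Until the scan removes the metadata, a deleted file's chunks are still known to the master, so the deletion could in principle be undone during the grace period.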

System Interactions
- Mutation
  - The master assigns a “lease” to one replica, the primary
  - The primary decides the order of mutations
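
One way to picture the primary's role: it hands out consecutive serial numbers, and every replica applies mutations in serial order. The classes below are a toy model with made-up names; buffering out-of-order arrivals is an assumption used to make the ordering visible, and lease expiry and failure handling are left out.

```python
class Replica:
    """A replica that applies mutations strictly in serial-number order."""

    def __init__(self):
        self.applied = []        # mutations in the order they took effect
        self._waiting = {}       # serial -> mutation that arrived early
        self._next_serial = 0

    def receive(self, serial, mutation):
        self._waiting[serial] = mutation
        while self._next_serial in self._waiting:
            self.applied.append(self._waiting.pop(self._next_serial))
            self._next_serial += 1


class Primary(Replica):
    """The lease holder: the only replica allowed to assign serial numbers."""

    def __init__(self):
        super().__init__()
        self._issued = 0

    def order(self, mutation):
        serial = self._issued
        self._issued += 1
        self.receive(serial, mutation)
        return serial
```

Because only the lease holder issues serial numbers, all replicas end up with the same mutation order even when requests reach them in different orders.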

Consistency
- It is expensive to maintain strict consistency
  - Data is duplicated and distributed
- GFS uses a relaxed consistency model
  - Better support for appending
  - Checkpointing

Fault Tolerance
- High availability
  - Fast recovery
  - Chunk replication
  - Master replication: inactive backup
- Data integrity
  - Checksumming
  - Incremental checksum updates to improve performance
    - A chunk is split into 64 KB blocks
    - Update the checksum after adding a block
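
A small sketch of per-block checksumming with incremental updates on append, using CRC32 from Python's standard `zlib` module. The choice of CRC32 and the class layout are assumptions for illustration; the slide only states that each 64 KB block has a checksum that is updated as data is added.

```python
import zlib

BLOCK_SIZE = 64 * 1024  # 64 KB blocks, as on the slide


class ChecksummedChunk:
    def __init__(self):
        self.data = bytearray()
        self.checksums = []  # one running CRC32 per 64 KB block

    def append(self, payload):
        """Append bytes, extending the last block's checksum instead of recomputing it."""
        position = 0
        while position < len(payload):
            used = len(self.data) % BLOCK_SIZE
            if used == 0:
                self.checksums.append(0)  # CRC32 of an empty block
            take = min(BLOCK_SIZE - used, len(payload) - position)
            piece = payload[position:position + take]
            self.checksums[-1] = zlib.crc32(piece, self.checksums[-1])
            self.data.extend(piece)
            position += take

    def read_block(self, index):
        """Return one block, verifying its checksum first."""
        block = bytes(self.data[index * BLOCK_SIZE:(index + 1) * BLOCK_SIZE])
        if zlib.crc32(block) != self.checksums[index]:
            raise IOError(f"checksum mismatch in block {index}")
        return block
```

An append only touches the checksum of the last, partially filled block and of any new blocks, so earlier blocks never need to be re-read.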

Discussion
- Advantages
  - Works well for large data processing
  - Uses cheap commodity servers
- Tradeoffs
  - Single master design
  - Optimized for workloads that are mostly reads and mostly appends
- Latest upgrades (GFS II)
  - Distributed masters
  - Introduces the “cell”: a number of racks in the same data center
  - Improved performance of random reads and writes

Hadoop DFS (HDFS)
- http://hadoop.apache.org/
- Mimics GFS
  - Same assumptions
  - Highly similar design
  - Different names:
    - Master → namenode
    - Chunkserver → datanode
    - Chunk → block
    - Operation log → EditLog

Working with HDFS
- /usr/local/hadoop/
  - bin/ : scripts for starting/stopping the system
  - conf/ : configuration files
  - log/ : system log files

Installation
- Single node: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
- Cluster: http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/
