Post on 15-Aug-2015
Memory is King
• RAM throughput increasing exponentially
• Disk throughput increasing slowly
Memory-locality key to interactive response time
Memory as cache
• Improves reads
• Cannot help much with writes
  • Replication for fault tolerance: network bandwidth and latency are much worse than memory's
  • Write throughput is limited by disk I/O: at least one copy is required on disk
• Inter-job data sharing cost dominates pipeline end-to-end latency
  • 34% of jobs produce output as large as their input (Cloudera survey)
Different jobs share data
Slow writes to disk
[Diagram: two Spark tasks, each with its own in-memory block manager holding blocks 1 and 3, share data through HDFS / Amazon S3 (blocks 1–4); with the storage engine and execution engine in the same process, sharing goes through slow disk writes]
Different frameworks share data
[Diagram: a Spark task and a Hadoop MapReduce job (under YARN) share data through HDFS / Amazon S3 (blocks 1–4); again, slow disk writes between the in-process storage and execution engines]
Slow writes to disk
Tachyon: reliable data sharing at memory speed within and across frameworks/jobs
[Diagram: Tachyon sits as a layer between computation frameworks (Spark, MapReduce, Spark SQL, H2O, GraphX, Impala, …) and storage systems (HDFS, S3, GlusterFS, OrangeFS, NFS, Ceph, …)]
Target workload properties
• Immutable data
• Deterministic jobs
• Locality-based scheduling
• All data vs. working set
• Program size vs. data size
System architecture
Consists of two layers
• Lineage
  • Delivers high-throughput I/O
  • Captures the sequence of jobs/tasks that create an output
• Persistence
  • Asynchronous checkpoints
Facts
• One data copy in memory
• Recomputation for fault-tolerance
Master Node
• Similar to HDFS and GFS
  • Passive standby model
• BUT also contains a workflow manager
  • Tracks lineage information
  • Computes the checkpoint order
  • Interacts with the cluster resource manager to allocate resources for recomputations
Lineage metadata
• Binary program
• Configuration
• Input Files List
• Output Files List
• Dependency Type
• Narrow (filter, map)
• Wide (shuffle, join)
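The lineage metadata above can be sketched as a simple record. This is a minimal illustration, not Tachyon's actual API; all names are made up:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List

class DependencyType(Enum):
    NARROW = "narrow"  # e.g. filter, map: each output partition depends on one input partition
    WIDE = "wide"      # e.g. shuffle, join: each output partition depends on many input partitions

@dataclass
class Lineage:
    """One lineage entry: enough information to deterministically re-run
    the job that produced its output files."""
    binary_program: str        # the program that ran the job
    configuration: Dict        # configuration needed for a deterministic re-run
    input_files: List[str]
    output_files: List[str]
    dependency: DependencyType
```

Because jobs are assumed deterministic and inputs immutable, this record alone is sufficient to regenerate the outputs on failure.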
Fault-recovery by recomputations
• Challenges
  • Bounding the recomputation cost for a long-running storage system
    • Asynchronous checkpointing
  • Allocating resources for recomputations
    • Make sure recomputation tasks get enough resources
    • Do not impact system performance (task priorities)
• Assumptions
  • Input files are immutable
  • Job executions are deterministic
• Client-side caching to mitigate read hotspots
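Recovery by recomputation amounts to replaying lineage: find the job that produced the lost file, recursively regenerate any of its inputs that are also missing (cascading recomputation), then re-run the job. A minimal sketch, assuming lineage entries are plain dicts with `inputs`/`outputs` lists (illustrative, not Tachyon's implementation):

```python
def recovery_plan(lost_file, producer, available, plan=None):
    """Compute the ordered list of jobs to re-run to regenerate `lost_file`.

    producer:  maps an output file to the lineage entry (job) that created it
    available: set of files currently present (in memory or checkpointed)
    """
    if plan is None:
        plan = []
    if lost_file in available:
        return plan                    # file still exists, nothing to re-run
    job = producer[lost_file]
    for inp in job["inputs"]:          # regenerate missing inputs first (cascading)
        recovery_plan(inp, producer, available, plan)
    if job not in plan:
        plan.append(job)               # re-run the producer once its inputs exist
    return plan
```

Determinism and immutability (the assumptions above) guarantee that replaying this plan reproduces exactly the lost data.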
Asynchronous checkpointing
• Goals
  • Bounded recomputation time
  • Checkpointing hot files
  • Avoiding checkpointing temporary files
• Edge algorithm
  • Models relationships of files with a DAG
    • Vertices are files
    • Edge from A to B if B is generated by a job that read A
Edge algorithm
• Checkpoint leaves
• Checkpoint hot files
  • Most files are accessed fewer than 3 times (Yahoo survey of big-data workloads)
  • Thus, files accessed more than twice get checkpointed
• Dealing with large datasets
  • 96% of active jobs' data fits in cluster memory
  • Synchronously write datasets above a defined threshold to disk
  • Most checkpointed files can be evicted from memory to make room
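The selection step of the Edge algorithm can be sketched in a few lines: checkpoint the leaves of the lineage DAG (files no job has read yet) plus the hot files. This is a simplified illustration of the selection rule only, not Tachyon's code; the threshold of 2 follows the access-count observation above:

```python
def checkpoint_candidates(edges, access_counts, hot_threshold=2):
    """Select files to checkpoint under the Edge algorithm.

    edges: list of (src, dst) pairs; dst was produced by a job that read src.
    access_counts: maps a file to how many times it has been read.
    """
    files = {f for edge in edges for f in edge}
    non_leaves = {src for src, _ in edges}       # files some job has already consumed
    leaves = files - non_leaves                  # current frontier of the DAG
    hot = {f for f, n in access_counts.items() if n > hot_threshold}
    return leaves | hot
```

Checkpointing the frontier bounds recomputation depth, while the hot-file rule keeps frequently re-read files on disk without wasting bandwidth on short-lived temporaries.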
Resource allocation
• Depends on the scheduling policy of the running cluster
• Requirements
  • Priority compatibility
  • Resource sharing
  • Avoiding cascading recomputation
  • Best ordering of recomputations
• Most common policies
  • Priority-based
  • Weighted fair sharing
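Under a priority-based policy, priority compatibility means a recomputation task should inherit the priority of the jobs blocked on its output, so a low-priority file recomputed for a high-priority reader does not cause priority inversion. A minimal sketch of that inheritance rule (illustrative only, not Tachyon's scheduler):

```python
def recomputation_order(requests):
    """requests: (task, priority_of_waiting_job) pairs.

    Each recomputation task inherits the highest priority among the jobs
    blocked on it; tasks then run in descending priority order."""
    inherited = {}
    for task, priority in requests:
        inherited[task] = max(priority, inherited.get(task, priority))
    return sorted(inherited, key=inherited.get, reverse=True)
```

Under weighted fair sharing, the analogous rule would charge recomputation work to the waiting job's share instead of its priority.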