SimMatrix: SIMulator for MAny-Task computing execution fabRIc at eXascale

Ke Wang
Data-Intensive Distributed Systems Laboratory
Computer Science Department, Illinois Institute of Technology
April 8th, 2013, ACM HPC Symposium
Outline
• Introduction & Motivation
• Long-Term Aims and Contributions
• SimMatrix Architecture
• Implementation
• Evaluation
• Related Work
• Conclusion & Future Work
[Figure: growth in the number of cores per chip (2004–2018, roughly 0 to 300 cores) plotted against the shrinking manufacturing process]
Pat Helland, Microsoft, The Irresistible Forces Meet the Movable Objects, November 9th, 2007
Manycore Computing
• Today (2013): Multicore Computing
  – O(10) cores commodity architectures
  – O(100) cores proprietary architectures
  – O(1000) GPU hardware threads
• Near future (~2019): Manycore Computing
  – ~1000 cores/threads commodity architectures
Exascale Computing
Top500 Performance Development,
http://top500.org/static/lists/2011/11/TOP500_201111_Poster.pdf
• Today (2013): 10 Petascale Computing
  – O(100K) nodes
  – O(1M) cores
• Near future (~2019): Exascale Computing
  – ~1M nodes (10X)
  – ~1B processor-cores/threads (1000X)
Major Challenges of Exascale Computing
• Memory and Storage
  – minimizing data movement through the memory hierarchy (e.g. persistent storage, solid state memory, volatile memory, caches, and registers)
• Concurrency and Locality
  – harnessing the many orders of magnitude of increased parallelism fueled by the many-core computing era (Accelerator, GPU, MIC)
• Resiliency
  – making both the infrastructure (hardware) and applications fault tolerant in the face of a decreasing mean time to failure (MTTF)
• Energy and Power
  – 20MW limitation
MTC: Many-Task Computing
[Figure: taxonomy over Number of Tasks (1, 1K, 1M) vs. Input Data Size (Low/Med/Hi) — HPC (heroic MPI tasks), HTC/MTC (many loosely coupled tasks), MapReduce/MTC (data analysis, mining), MTC (big data and many tasks)]
• Bridge the gap between HPC and HTC
• Applied in clusters, grids, and supercomputers
• Loosely coupled apps with HPC orientations
• Many activities coupled by file system ops
• Many resources over short time periods
MTC Middleware
• Falkon – Fast and Lightweight Task Execution Framework
  – http://datasys.cs.iit.edu/projects/Falkon/index.html
• Swift – Parallel Programming System
  – http://www.ci.uchicago.edu/swift/index.php
Long-Term Aims

• Address major exascale computing challenges:
  – Memory and Storage
  – Concurrency and Locality
  – Resiliency
• Explore scheduling architectures and techniques to enable MTC at exascale
• Analyze, design and implement a distributed data-aware execution fabric (MATRIX) supporting HPC/MTC workloads at exascale
• Integrate MATRIX with parallel programming systems (e.g. Swift, Charm++, MapReduce) and with the FusionFS distributed file system
This Work’s Contributions
– Architect, design and implement a job scheduling system simulator, SimMatrix, at the node/core level
– Performance evaluation of SimMatrix against SimGrid and GridSim, done at up to millions of nodes, billions of cores, and tens of billions of tasks
– Support for homogeneous/heterogeneous systems, various programming models (HPC/MTC), and scheduling strategies (centralized/distributed/hierarchical)
Overview: Job Scheduling Systems

• Efficiently manage the distributed computing power of workstations, servers, and supercomputers in order to maximize job throughput and system utilization
  – Load balancing is critical
• Different scheduling strategies
  – Centralized scheduling hinders scalability
  – Hierarchical scheduling has long job turnaround times
  – Distributed scheduling is a promising approach at exascale
• Work Stealing – a distributed scheduling strategy
  – Starved processors steal tasks from overloaded ones
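The work-stealing idea above can be sketched in a few lines of Java. This is an illustrative toy, not SimMatrix's actual implementation; the `Node` class, the steal-half policy, and the most-loaded-victim selection are assumptions chosen to match the slide's description of starved processors stealing from overloaded ones.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

public class WorkStealingSketch {
    // Hypothetical node: just a queue of task IDs.
    static final class Node {
        final Deque<Integer> tasks = new ArrayDeque<>();
    }

    // A starved node steals half the tasks of its most-loaded neighbor.
    // Returns the number of tasks stolen (0 means the steal failed).
    static int steal(Node thief, List<Node> neighbors) {
        Node victim = null;
        for (Node n : neighbors) {
            if (victim == null || n.tasks.size() > victim.tasks.size()) {
                victim = n;
            }
        }
        if (victim == null || victim.tasks.isEmpty()) {
            return 0; // no neighbor has work; a real scheduler would back off and retry
        }
        int count = victim.tasks.size() / 2;
        for (int i = 0; i < count; i++) {
            thief.tasks.add(victim.tasks.pollLast()); // take from the victim's tail
        }
        return count;
    }
}
```

Stealing half of the victim's queue is a common heuristic in work-stealing schedulers: it balances load in one round trip instead of stealing one task at a time.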
SimMatrix Architecture

Figure 1: SimMatrix architectures; the left part is the centralized one with a single dispatcher (head node) talking to all compute nodes, the right part is the distributed topology with a dispatcher sitting in each compute node
Simulations
• Continuous time simulation
  – Abandoned the idea of creating a separate thread per simulated node: we found that on our 48-core system with 256GB of memory, we were limited to 32K threads
• Discrete event simulation
  – A viable approach (today) to explore scheduling techniques at exascale (millions of nodes and billions of cores)
  – Created a unique object per simulated node, and converted any behavior (state change) to an event
At the Heart of SimMatrix: the Global Event Queue

Figure 2: Event State Transition Diagram

• All events are inserted into the queue, sorted by occurrence time in ascending order
• Handle the first event, advance the simulation time, and update the event queue
• Implemented as the red-black tree based "TreeSet" in Java, which ensures Θ(log n) time for insert & remove
[Diagram: event states (TaskDispStart, TaskRec, TaskEnd, Steal, Log/Visual) and their transitions — tasks are dispatched when a node has waiting tasks and available cores; a node with no tasks attempts a steal, which may fail; the first node requests more tasks as needed]
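The bullets above describe the core of the discrete event simulation: a time-ordered queue, pop the earliest event, jump the clock to it. A minimal Java sketch of that loop, using a `TreeSet` as the slides do, might look as follows. The `Event` class and the identity-hash tie-breaker are illustrative assumptions (SimMatrix's real events carry node state, not just a label).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.TreeSet;

public class EventQueueSketch {
    // Hypothetical event record: an occurrence time plus a label.
    static final class Event {
        final double time;
        final String type;
        Event(double time, String type) { this.time = time; this.type = type; }
    }

    // Sorted by occurrence time ascending; the identity tie-breaker keeps
    // distinct events with equal times from being treated as duplicates by the set.
    final TreeSet<Event> queue = new TreeSet<>(
        Comparator.comparingDouble((Event e) -> e.time)
                  .thenComparingInt(System::identityHashCode));

    double simTime = 0.0;

    // Handle the earliest event and advance simulation time: O(log n) per step.
    List<String> drain() {
        List<String> handled = new ArrayList<>();
        while (!queue.isEmpty()) {
            Event e = queue.pollFirst(); // remove the earliest event
            simTime = e.time;            // time jumps straight to that event
            handled.add(e.type);         // a real handler would mutate node state
                                         // and possibly insert follow-up events
        }
        return handled;
    }
}
```

Because the clock jumps from event to event, simulated idle periods cost nothing, which is what makes millions of simulated nodes feasible on one machine.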
Simulator Features

• Node load information
  – Load: number of busy cores
  – A nested hash map groups nodes based on load, providing extremely fast lookup of the next available nodes
• Dynamic task submission
  – Aims to reduce the task waiting time and the memory footprint
• Dynamic poll interval
  – Exponential backoff to reduce the number of messages and increase the speed of simulation
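A load-keyed index like the one described above can be sketched as follows. The slides say "nested hash map"; this variant uses a `TreeMap` keyed by load so the least-loaded group is found in order, which is one plausible realization rather than SimMatrix's exact data structure.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.TreeMap;

public class LoadIndex {
    // Map from load (number of busy cores) to the set of node IDs at that load.
    // Keeping loads ordered lets us grab a least-loaded node immediately.
    private final TreeMap<Integer, Set<Integer>> byLoad = new TreeMap<>();

    // Move a node from its old load bucket to its new one.
    void setLoad(int node, int oldLoad, int newLoad) {
        Set<Integer> old = byLoad.get(oldLoad);
        if (old != null) {
            old.remove(node);
            if (old.isEmpty()) byLoad.remove(oldLoad); // drop empty buckets
        }
        byLoad.computeIfAbsent(newLoad, k -> new HashSet<>()).add(node);
    }

    // Next available node: any member of the least-loaded group, or null if empty.
    Integer nextAvailable() {
        return byLoad.isEmpty() ? null : byLoad.firstEntry().getValue().iterator().next();
    }
}
```

Both operations are O(log L) in the number of distinct load values L, which is bounded by cores per node, so lookups stay fast even with millions of nodes.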
Implementation

• SimMatrix is developed in Java
  – Sun 64-bit JDK version 1.7.0_03
  – Code accessible at: http://datasys.cs.iit.edu/~kewang/software.html
• SimMatrix has no other dependencies
Experiment Environment

• Fusion system: fusion.cs.iit.edu
  – 48 AMD Opteron cores at 800MHz (only one core needed)
  – 256GB RAM
  – 64-bit Linux kernel 2.6.31.5
  – Sun 64-bit JDK version 1.7.0_23
Metrics

• Throughput
  – Number of tasks finished per second, calculated as total number of tasks / simulation time
• Efficiency
  – The ratio between the ideal simulation time of completing a given workload and the real simulation time. The ideal simulation time is calculated as the average task execution time multiplied by the number of tasks per core.
• CPU Time / Time per task
• Memory / Memory per task
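The two formulas above are simple enough to pin down in code. A sketch under the definitions stated on this slide (the method names are ours, not SimMatrix's):

```java
public class Metrics {
    // Throughput: tasks finished per second of simulated time.
    static double throughput(long totalTasks, double simTime) {
        return totalTasks / simTime;
    }

    // Ideal time = average task execution time * tasks per core;
    // efficiency = ideal time / real simulation time.
    static double efficiency(double avgTaskTime, long totalTasks,
                             long totalCores, double realTime) {
        double ideal = avgTaskTime * ((double) totalTasks / totalCores);
        return ideal / realTime;
    }
}
```

For example, 1000 tasks averaging 5000s on 100 cores have an ideal time of 50,000s; a real simulated completion time of 50,000s would mean efficiency 1.0, and anything longer proportionally less.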
Workloads (Sleep Tasks)

• Synthetic workloads:
  – Uniform distribution with an average task execution time of 5000s (AVE_5K); also a homogeneous workload with all tasks having 1 sec execution time (ALL_1)
• Realistic application workloads:
  – Obtained from real traces of MTC applications run on Blue Gene/P over a 17-month period
  – 34.8M tasks with a minimum runtime of 0 seconds, maximum runtime of 1469.62 seconds, average runtime of 95.20 seconds, and standard deviation of 188.08 seconds
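A synthetic workload like AVE_5K can be generated by drawing uniformly from [0, 2·avg), which yields the stated average. This generator is our illustration; the slides do not specify the exact distribution bounds SimMatrix used, so the [0, 2·avg) range is an assumption consistent with "uniform distribution with average task execution time of 5000s".

```java
import java.util.Random;

public class WorkloadGen {
    // Uniform task lengths with a given average: draw from [0, 2 * avg),
    // whose mean is exactly avg. A fixed seed makes runs reproducible.
    static double[] uniformTasks(int n, double avgSeconds, long seed) {
        Random rng = new Random(seed);
        double[] times = new double[n];
        for (int i = 0; i < n; i++) {
            times[i] = rng.nextDouble() * 2.0 * avgSeconds;
        }
        return times;
    }
}
```

The ALL_1 workload is the degenerate case: every entry is simply 1.0 second, which stresses scheduling overhead rather than compute.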
Validation

Validate SimMatrix against state-of-the-art MTC systems (e.g. Falkon, MATRIX):
1. The simulator makes simplifying assumptions, for example about the network.
2. It is also difficult to model communication congestion, resource sharing and its effects on performance, and the variability that comes with real systems.
3. We believe the relatively small differences (2.8% and 5.85%) demonstrate that SimMatrix is accurate enough to produce convincing results (at least at modest scales).
Resource Requirements up to Exascale (1M nodes, 1B tasks and 10B tasks)

• Memory – Centralized: 14.1GB; Distributed: 142.1GB
• CPU Time – Centralized: 17.4 hours; Distributed: 162.8 hours
• Still relatively moderate
Centralized vs. Distributed Scheduling

1. AVE_5K: efficiency drops to 0.05% for centralized scheduling, but remains 90%+ for distributed scheduling at exascale
2. ALL_1: centralized scheduling saturates at 8 nodes with an upper-bound throughput of 1000 tasks/sec; distributed scheduling starts to saturate at 32K nodes, and finally achieves a throughput of 75M tasks/sec
3. Reason for saturation: in the final stage, work stealing requires too many messages as the system scales up, to the point where the messages saturate either the network and/or processing capacity
4. Solution: set an upper bound on the poll interval, and use sufficiently long tasks to amortize the cost of so many messages (AVE_12 tasks can achieve 90% efficiency at exascale with a throughput of 75M tasks/sec)
SimMatrix vs. SimGrid and GridSim

1. Comparison: centralized scheduling
2. Scale reached: GridSim 256 nodes, SimGrid 65K nodes, SimMatrix 1M nodes
3. Time per task: GridSim's increases, SimGrid's stays constant, SimMatrix's decreases and then stays almost constant
4. Memory per task: GridSim's and SimGrid's decrease, then stay constant; SimMatrix's keeps decreasing
5. Conclusion: SimMatrix is more resource efficient at large scales
Application Domains of SimMatrix

• Data Centers: large-scale data centers (e.g. Google, Amazon) are composed of thousands of servers (10 to 100× more in the near future) geographically distributed around the world. Load balancing among all the servers under data-intensive workloads is very important, yet non-trivial. SimMatrix is able to study different network topologies connecting the servers as well as data-aware scheduling, both of which could be applied to scheduling in data centers.
• Grid Environments: not only can SimMatrix be configured as a homogeneous scheduling system, it can also be tuned into a heterogeneous one. Different Grids could configure SimMatrix and schedule individually without interacting with each other.
• Workflow Systems: although SimMatrix currently relies on high-level workflow systems (Swift, Charm++) to manage the data flow and task dependencies, we could develop SimMatrix to simulate a workflow system with dependent tasks. We have already run SimMatrix up to exascale with an MTC workload obtained from the Swift workflow system, and achieved ~87% efficiency.
• Many-core Simulation: instead of configuring SimMatrix as an exascale system, we also configured it as a single many-core chip node with up to thousands of cores and a 2D/3D mesh topology. We applied work stealing at the core level within one many-core node, and found that at the thousand-core level a 2D mesh topology needs at least 13 hops of neighbors, while a 3D mesh needs at least 5, in order to achieve high system efficiency.
Related Work
• Real Job Scheduling Systems:
  – Condor (University of Wisconsin), Bradley et al., 2013
  – PBS (NASA Ames), Corbatto et al., 2013
  – SLURM (LLNL), Danny et al., 2013
  – Falkon (University of Chicago), Raicu et al., SC07
• Job Scheduling System Simulators:
  – SimJava (University of Edinburgh), Wheeler et al., 2004 (thread-based)
  – GridSim (University of Melbourne, Australia), Buyya et al., 2010 (thread-based)
  – SimGrid (INRIA), Lucas et al., 2013 (parallel DES)
Conclusion & Future Work

• Conclusion:
  – Exascale computing will bring several challenges, which need to be solved by new programming models
  – MTC could potentially address the exascale challenges; however, efficient job scheduling systems at extreme scales are needed
  – SimMatrix is lightweight enough to enable the study of different scheduling strategies and architectures at exascale
• Future Work:
  – Explore different network topologies (fat tree, 3D/4D, InfiniBand)
  – Workflow and task dependency simulation
  – Different workloads for both HPC and MTC simulation
More Information

• http://datasys.cs.iit.edu/~kewang/
• http://datasys.cs.iit.edu/projects/SimMatrix/
• Contact: [email protected]
• Questions?