The Stanford Pervasive Parallelism Lab A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M. Rosenblum, S. Thrun Pervasive Parallelism Laboratory Stanford University

Page 1:

The Stanford Pervasive Parallelism Lab

A. Aiken, B. Dally, R. Fedkiw, P. Hanrahan, J. Hennessy, M. Horowitz, V. Koltun, C. Kozyrakis, K. Olukotun, M. Rosenblum, S. Thrun

Pervasive Parallelism Laboratory, Stanford University

Page 2:

The Looming Crisis

Software developers will soon face systems with
  > 1 TFLOP of compute power
  20+ cores, 100+ hardware threads
  Heterogeneous cores (CPUs + GPUs), app-specific accelerators
  Deep memory hierarchies

Challenge: harness these devices productively
  Improve performance, power, reliability, and security

The parallelism gap
  Yawning divide between the capabilities of today’s programming environments, the requirements of emerging applications, and the challenges of future parallel architectures

Page 3:

The Stanford Pervasive Parallelism Laboratory

Goal: the parallel computing platform for 2012
  Make parallel programming practical for the masses
  Algorithms, programming models, runtimes, and architectures for scalable parallelism (10,000s of threads)
  Parallel computing a core component of CS education

PPL is a combination of
  Leading Stanford researchers across multiple domains
    Applications, languages, software systems, architecture
  Leading companies in computer systems and software
    Sun, AMD, Nvidia, IBM, Intel, HP
  An exciting vision for pervasive parallelism

Open laboratory; all results released as open source

Page 4:

The PPL Team

Applications
  Ron Fedkiw, Vladlen Koltun, Sebastian Thrun

Programming & software systems
  Alex Aiken, Pat Hanrahan, Mendel Rosenblum

Architecture
  Bill Dally, John Hennessy, Mark Horowitz, Christos Kozyrakis, Kunle Olukotun (Director)

Page 5:

The PPL Team

Research expertise
  Applications: graphics, physics simulation, visualization, AI, robotics, …
  Software systems: virtual machines, GPGPU, stream programming, transactional programming, speculative parallelization, optimizing compilers, bug detection, security, …
  Architecture: multi-core & multithreading, scalable shared-memory, transactional memory hardware, interconnect networks, low-power processors, stream processors, vector processors, …

Commercial success
  MIPS & SGI, Rambus, VMware, Niagara processors, RenderMan, Stream Processors Inc, Avici, Tableau, …

Page 6:

Guiding Principles

Top down research
  App & developer needs drive system
  High-level info flows to low-level system

Scalability
  In hardware resources (10,000s of threads)
  Developer productivity (ease of use)

HW provides flexible primitives
  Software synthesizes complete solutions

Build real, full system prototypes

Page 7:

The PPL Vision

[Figure: layered system stack]
  Applications: Virtual Worlds, Autonomous Vehicle, Financial Services
  DSLs: Rendering DSL, Physics DSL, Scripting DSL, Probabilistic DSL, Analytics DSL
  Parallel Object Language
  Common Parallel Runtime: Explicit / Static, Implicit / Dynamic
  Hardware Architecture: OOO Cores, SIMD Cores, Threaded Cores, Scalable Interconnects, Partitionable Hierarchies, Scalable Coherence, Isolation & Atomicity, Pervasive Monitoring

Page 8:

The PPL Vision

[Figure: the PPL Vision system stack, repeated from Page 7]

Page 9:

Demanding Applications

Leverage domain expertise at Stanford
  CS research groups & national centers for scientific computing
  From consumer apps to neuroinformatics

[Figure: PPL at the center, surrounded by existing Stanford research centers and existing Stanford CS research groups. Labels: Media-X; NIH NCBC; DOE ASC; Environmental Science; Geophysics; Seismic modeling; Web, Mining, Streaming DB; Graphics, Games; Mobile, HCI; AI/ML, Vision]

Page 10:

Virtual Worlds Application

Next-gen web platform
  Immersive collaboration
  Social gaming
  Millions of players in vast landscape

Challenges
  Client-side game engine
  Server-side world simulation
  AI, physics, large-scale rendering
  Dynamic content, huge datasets

More at http://vw.stanford.edu/

Page 11:

Autonomous Vehicle Application

Cars that drive autonomously in traffic
  Save lives & money
  Improve highway throughput
  Improve productivity

Challenges
  Client-side sensing, perception, planning, & control
  Server-side data merging, pre-processing, & post-processing, traffic control, model generation
  Real-time, huge datasets

More at http://www.stanfordracing.org

Page 12:

The PPL Vision

[Figure: the PPL Vision system stack, repeated from Page 7]

Page 13:

Domain Specific Languages (DSL)

Leverage success of DSLs across application domains
  SQL (data manipulation), Matlab (scientific), Ruby/Rails (web), …

DSLs provide higher productivity for developers
  High-level data types & ops tailored to domain
    E.g., relations, triangles, matrices, …
  Express high-level intent without specific implementation artifacts
    Programmer isolated from details of specific system

DSLs enable scalable parallelism for the system
  Declarative description of parallelism & locality patterns
    E.g., ops on relation elements, sub-array being processed, …
  Portable and scalable specification of parallelism
    Automatically adjust data structures, mapping, and scheduling as systems scale up
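The separation described on this slide, where the developer states intent on a domain type and the system decides how to execute it, can be sketched roughly as follows. This is an illustration only, not PPL code: `Relation`, `select`, and the `workers` knob are hypothetical names, and Python threads stand in for whatever mapping a real system would choose (they show the separation of concerns, not a speedup).

```python
from concurrent.futures import ThreadPoolExecutor


class Relation:
    """Hypothetical domain type: an ordered collection of rows (dicts)."""

    # Chosen by "the system", not by the DSL user (assumption for this sketch).
    workers = 1

    def __init__(self, rows):
        self.rows = list(rows)

    def select(self, predicate):
        """Declarative op on relation elements; no threads appear in user code."""
        if self.workers <= 1:
            flags = [predicate(row) for row in self.rows]
        else:
            # pool.map preserves input order, so results match the serial path.
            with ThreadPoolExecutor(max_workers=self.workers) as pool:
                flags = list(pool.map(predicate, self.rows))
        return Relation(row for row, keep in zip(self.rows, flags) if keep)


trades = Relation({"id": i, "amount": i * 10} for i in range(8))

serial = trades.select(lambda row: row["amount"] >= 40)

Relation.workers = 4  # the system retargets; the user's query is unchanged
parallel = trades.select(lambda row: row["amount"] >= 40)
```

The user-level query is identical in both runs; only the system-side setting changes, which is the portability property the slide argues for.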

Page 14:

DSL Research & Challenges

Goal: create the tools for DSL development

Initial DSL targets
  Rendering, physics simulation, analytics, probabilistic computations

Challenges
  DSL implementation: embed in a base PL
    Start with Scala (OO, type-safe, functional, extensible)
    Use Scala as a scripting DSL that ties multiple DSLs together
  DSL-specific optimizations: telescoping compilers
    Use domain knowledge to optimize & annotate code
  Feedback to programmers? …
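As a rough illustration of embedding a DSL in a base language and then using domain knowledge to optimize it, the sketch below builds a deferred expression tree and applies one rewrite rule (fusing consecutive elementwise maps into a single pass). It is written in Python purely for illustration, whereas the slide proposes Scala; `Vec`, `Map`, `fuse`, `evaluate`, and `depth` are hypothetical names, and the rule is one simple example of a domain-specific optimization, not the telescoping-compiler technique itself.

```python
class Vec:
    """Leaf node: a concrete vector of numbers."""

    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return Map(fn, self)


class Map:
    """Deferred elementwise operation; nothing runs until evaluate()."""

    def __init__(self, fn, source):
        self.fn = fn
        self.source = source

    def map(self, fn):
        return Map(fn, self)


def fuse(node):
    """Domain rule: map(f, map(g, v)) equals map(f after g, v), so do one pass."""
    if isinstance(node, Map):
        inner = fuse(node.source)
        if isinstance(inner, Map):
            f, g = node.fn, inner.fn
            return Map(lambda x, f=f, g=g: f(g(x)), inner.source)
        return Map(node.fn, inner)
    return node


def evaluate(node):
    """Interpret the expression tree, one list traversal per Map node."""
    if isinstance(node, Vec):
        return node.data
    return [node.fn(x) for x in evaluate(node.source)]


def depth(node):
    """Number of Map passes remaining in the expression tree."""
    return 0 if isinstance(node, Vec) else 1 + depth(node.source)


expr = Vec([1, 2, 3]).map(lambda x: x + 1).map(lambda x: x * 2).map(lambda x: x - 3)
fused = fuse(expr)
```

The rewrite is only legal because the DSL knows `map` is elementwise and side-effect free; a general-purpose compiler looking at arbitrary loops would have to prove that first.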

Page 15:

The PPL Vision

[Figure: the PPL Vision system stack, repeated from Page 7]

Page 16:

Common Parallel Runtime (CPR)

Goals
  Provide common, portable, abstract target for all DSLs
    Write once, run everywhere model
  Manages parallelism & locality
    Achieve efficient execution (performance, power, …)
  Handles specifics of HW system

Approach
  Compile DSLs to common IR
    Base language + low-level constructs & pragmas
      Forall, async/join, atomic, barrier, …
    Per-object capabilities
      Read-only or write-only, output data, private, relaxed coherence, …
  Combine static compilation + dynamic management
    Explicit management of regular tasks & predictable patterns
    Implicit management of irregular parallelism
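The low-level constructs named on this slide (forall, async/join, atomic, barrier) can be approximated with standard-library threading tools. The sketch below is not the CPR interface, which the slides do not specify; `forall`, `async_`, `add`, and `worker` are hypothetical helpers, and a lock stands in for whatever atomicity mechanism the runtime or hardware would provide.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)


def forall(items, body):
    """forall: apply body to every item; returns once all iterations finish."""
    return list(pool.map(body, items))


def async_(fn, *args):
    """async: start a task and return a handle; join via handle.result()."""
    return pool.submit(fn, *args)


counter = 0
counter_lock = threading.Lock()


def add(n):
    """atomic: the read-modify-write on counter runs under a lock."""
    global counter
    with counter_lock:
        counter += n


forall(range(1, 101), add)           # adds 1 + 2 + ... + 100 to counter

handle = async_(sum, [1, 2, 3])      # async
joined = handle.result()             # join

barrier = threading.Barrier(3)
phases = []


def worker(name):
    phases.append(("phase1", name))  # list.append is thread-safe in CPython
    barrier.wait()                   # barrier: no worker starts phase 2 early
    phases.append(("phase2", name))


forall(["a", "b", "c"], worker)
pool.shutdown()
```

The barrier needs all three workers running at once, which is why the pool is sized at four threads; with fewer than three the sketch would deadlock.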

Page 17:

CPR Research & Challenges

Integrating & balancing opposing approaches
  Task-level & data-level parallelism
  Static & dynamic concurrency management
  Explicit & implicit memory management

Utilize high-level information from DSLs
  The key to overcoming difficult challenges

Adapt to changes in application behavior, OS decisions, runtime constraints

Manage heterogeneous HW resources

Utilize novel HW primitives
  To reduce overhead of communication, synchronization, …
  To understand runtime behavior on specific HW & adapt to it
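The tension between static and dynamic concurrency management can be made concrete with a small, idealized simulation. The task costs, the four-worker setup, and the function names below are all assumptions made for this sketch; scheduling overhead is ignored, which flatters the dynamic scheme.

```python
import heapq


def static_makespan(costs, workers):
    """Static: split tasks into contiguous blocks decided before execution."""
    block = -(-len(costs) // workers)  # ceiling division
    loads = [sum(costs[i:i + block]) for i in range(0, len(costs), block)]
    return max(loads)


def dynamic_makespan(costs, workers):
    """Dynamic: whichever worker becomes idle first takes the next task."""
    finish_times = [0] * workers  # min-heap of per-worker finish times
    for cost in costs:
        earliest = heapq.heappop(finish_times)
        heapq.heappush(finish_times, earliest + cost)
    return max(finish_times)


regular = [4] * 8                     # predictable, uniform tasks
irregular = [8, 8, 1, 1, 1, 1, 1, 1]  # two heavy tasks followed by light ones

regular_static = static_makespan(regular, 4)       # 8
regular_dynamic = dynamic_makespan(regular, 4)     # 8
irregular_static = static_makespan(irregular, 4)   # 16
irregular_dynamic = dynamic_makespan(irregular, 4) # 8
```

For the regular workload both schemes finish at the same time, so the cheaper static mapping is sufficient; for the irregular one, the static split leaves one worker with both heavy tasks. That is one reading of why the slides argue for combining the two approaches rather than picking one.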

Page 18:

The PPL Vision

[Figure: the PPL Vision system stack, repeated from Page 7]

Page 19:

Hardware Architecture @ 2012

The many-core chip
  100s of cores: OOO, threaded, & SIMD
  Hierarchy of shared memories
  Scalable, on-chip network

The system
  Few many-core chips
  Per-chip DRAM channels
  Global address space

The data-center
  Cluster of systems

[Figure: many-core chip with cores labeled OOO, SIMD, and TC, L1 memories, L2 memories, an L3 memory, a DRAM CTL block, and an I/O block]

Page 20:

Architecture Challenges

Heterogeneity
  Balance of resources, granularity of parallelism

Support for parallelism & locality management
  Synchronization, communication, …
  Explicit vs. implicit locality management
  Runtime monitoring

Scalability
  On-chip/off-chip bandwidth & latency
  Scalability of key abstractions (e.g., coherence)

Beyond performance
  Power, fault-tolerance, QoS, security, virtualization

Page 21:

Architecture Research

Revisit architecture & micro-architecture for parallelism
  Define semantics & implementation of key primitives
    Communication, atomicity, isolation, partitioning, coherence, consistency, checkpoint
    Fine-grain & bulk support

Software synthesizes primitives into execution systems
  Streaming system: partitioning + bulk communication
  Thread-level speculation: isolation + fine-grain communication
  Transactional memory: atomicity + isolation + consistency
  Security: partitioning + isolation
  Fault tolerance: isolation + checkpoint + bulk communication

Challenges: interactions, scalability, cost, virtualization
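The primitive-to-system mapping listed on this slide can be written down as data, which makes it easy to ask which execution systems a given set of hardware primitives would support. The table below transcribes the slide (reading "Thread-level spec" as thread-level speculation, which is an interpretation); the helper names `buildable` and `primitive_usage` are made up for this sketch.

```python
from collections import Counter

# Transcribed from the slide: execution system -> primitives it is built from.
SYSTEMS = {
    "streaming system": {"partitioning", "bulk communication"},
    "thread-level speculation": {"isolation", "fine-grain communication"},
    "transactional memory": {"atomicity", "isolation", "consistency"},
    "security": {"partitioning", "isolation"},
    "fault tolerance": {"isolation", "checkpoint", "bulk communication"},
}


def buildable(available):
    """Systems whose required primitives are all present in `available`."""
    return sorted(name for name, needs in SYSTEMS.items() if needs <= available)


def primitive_usage():
    """How many of the listed systems rely on each primitive."""
    return Counter(p for needs in SYSTEMS.values() for p in needs)


supported = buildable({"isolation", "partitioning", "atomicity", "consistency"})
most_shared = primitive_usage().most_common(1)[0]
```

Counting usage this way shows isolation appearing in four of the five listed systems, which hints at why a small set of flexible primitives might cover many software-level designs.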

Page 22:

Architecture Research

Software-managed HW primitives
  Exploit high-level knowledge from DSLs & CPR
  E.g., scale coherence using coarse-grain techniques
    Coarse-grain in time: force coherence only when needed
    Coarse-grain in space: object-based, selective coherence

Support for programmability & management
  Fine-grain monitoring, HW-assisted invariants
  Build upon primitives for concurrency
  Efficient interface to CPR

Scalable on-chip & off-chip interconnects
  High-radix network
  Adaptive routing
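A software analogy may help with the coarse-grain coherence idea on this slide: writes stay in a private copy and are made visible only at an explicit synchronization point (coarse in time), and only the fields that changed are published (selective, per object). The slides describe hardware support, so everything below, including the `SharedObject` and `LocalView` classes and the acquire/release naming, is an assumption made for illustration.

```python
class SharedObject:
    """Hypothetical shared object holding the globally visible state."""

    def __init__(self, **fields):
        self.fields = dict(fields)
        self.sync_count = 0  # how many times coherence was forced


class LocalView:
    """A worker's private copy; writes stay local until release()."""

    def __init__(self, shared):
        self.shared = shared
        self.acquire()

    def acquire(self):
        """Pull the current global state into the private copy."""
        self.copy = dict(self.shared.fields)
        self.dirty = set()

    def write(self, key, value):
        self.copy[key] = value
        self.dirty.add(key)

    def read(self, key):
        return self.copy[key]

    def release(self):
        """Force coherence once, publishing only the fields that changed."""
        for key in self.dirty:
            self.shared.fields[key] = self.copy[key]
        self.shared.sync_count += 1
        self.dirty.clear()


shared = SharedObject(x=0, y=0)
producer = LocalView(shared)
consumer = LocalView(shared)

for i in range(1000):
    producer.write("x", i)       # 1000 writes, no coherence traffic yet

stale = consumer.read("x")       # consumer still sees its old copy
producer.release()               # one synchronization instead of 1000
consumer.acquire()
fresh = consumer.read("x")
```

The trade-off is visible in `stale`: between synchronization points a reader may see old data, so this only works when higher layers (the DSLs and CPR in the slides' terms) know when coherence is actually needed.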

Page 23:

Research Methodology

Conventional approaches are still useful
  Develop app & SW system on existing platforms
    Multi-core, accelerators, clusters, …
  Simulate novel HW mechanisms

Need some method that bridges HW & SW research
  Makes new HW features available for SW research
  Does not compromise HW speed, SW features, or scale
  Allows for full-system prototypes
    Needed for research, convincing for industry, exciting for students

Approach: commodity chips + FPGAs in memory system
  Commodity chips: fast system with rich SW environment
  FPGAs: prototyping platform for new HW features
  Scale through cluster arrangement

Page 24:

FARM: Flexible Architecture Research Machine

[Figure: four quad-core chips (Core 0 to Core 3) and four memory blocks]

Page 25:

FARM: Flexible Architecture Research Machine

[Figure: three quad-core chips (Core 0 to Core 3), four memory blocks, and an FPGA with SRAM and IO]

Page 26:

FARM: Flexible Architecture Research Machine

[Figure: two quad-core chips (Core 0 to Core 3), four memory blocks, an FPGA with SRAM and IO, and a GPU/Stream unit]

Page 27:

FARM: Flexible Architecture Research Machine

[Figure: three quad-core chips (Core 0 to Core 3), four memory blocks, and an FPGA with SRAM and IO, connected to a (scalable) Infiniband or PCIe interconnect]

Page 28:

Example FARM Uses

Software research
  SW development for heterogeneous systems
    Code generation & resource management
  Scheduling system for large-scale parallelism
    Thread state management & adaptive control

Hardware research
  Scalable streaming & transactional HW
    FPGA extends protocols throughout cluster
  Scalable shared memory
    FPGA provides coarse-grain tracking
  Hybrid memory systems
  Custom processors & accelerators
  HW support for monitoring, scheduling, isolation, virtualization, …

Page 29:

Conclusions

PPL: a full system vision for pervasive parallelism
  Applications, programming models, software systems, and hardware architecture

Key initial ideas
  Domain-specific languages
  Combine implicit & explicit management
  Flexible HW features