3-D Graph Processor
William S. Song, Jeremy Kepner, Huy T. Nguyen, Joshua I. Kramer, Vitaliy Gleyzer, James R. Mann, Albert H. Horst, Larry L. Retherford, Robert A. Bond, Nadya T. Bliss, Eric I. Robinson, Sanjeev Mohindra, Julie Mullen
HPEC 2010
16 September 2010
This work was sponsored by DARPA under Air Force Contract FA8721-05-C-0002. Opinions, interpretations, conclusions, and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
MIT Lincoln Laboratory
Outline
• Introduction
  – Graph algorithm applications
      Commercial
      DoD/intelligence
  – Performance gap using conventional processors
• 3-D graph processor architecture
• Simulation and performance projection
• Summary
Graph Algorithms and Ubiquitous Commercial Applications
[Figure: graph representation of an example graph with nodes 1-7, shown alongside the application list below]
• Finding shortest or fastest paths on maps
• Communication, transportation, water supply, electricity, and traffic network optimization
• Optimizing paths for mail/package delivery, garbage collection, snow plowing, and street cleaning
• Planning for hospital, firehouse, police station, warehouse, shop, market, office, and other building placements
• Routing robots
• Analyzing DNA and studying molecules in chemistry and physics
• Corporate scheduling, transaction processing, and resource allocation
• Social network analysis
DoD Graph Algorithm Applications
Intelligence Information Analysis
• Analysis of email, phone calls, financial transactions, travel, etc.
  – Very large data set analysis
  – Established applications

ISR Sensor Data Analysis
• Post-detection data analysis for knowledge extraction and decision support
  – Large data sets
  – Real-time operations
  – Small processor size, weight, power, and cost
  – New applications
Sparse Matrix Representation of Graph Algorithms
[Figure: example graph with nodes 1-7 and its transposed adjacency matrix A^T; multiplying A^T by a vector x advances x one step through the graph]

• Many graph algorithms may be represented and solved with sparse matrix algorithms
  – Data flow is identical
  – Similar processing speed
  – Useful tool for visualizing parallel operations
• Sparse matrix algebra makes a good graph processing instruction set
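As a concrete illustration of the duality this slide describes, here is a minimal sketch (ours, not from the talk) in which one step of breadth-first search is expressed as the sparse matrix-vector product y = A^T x; the edge set is made up for the example.

```python
# Hypothetical sketch: one BFS step as a sparse matrix-vector product
# y = A^T x, where A is the adjacency matrix of a small directed graph
# and x selects the current frontier. Edge set is illustrative only.

# Adjacency list: edges[u] lists the targets of node u.
edges = {1: [2, 4], 2: [5], 4: [3, 7], 5: [6], 3: [6], 7: [5], 6: []}

def spmv_transpose(edges, x):
    """Compute y = A^T x: y[v] accumulates x[u] over every edge u->v."""
    y = {}
    for u, xu in x.items():
        for v in edges.get(u, []):
            y[v] = y.get(v, 0) + xu
    return y

# A frontier containing only node 1; one multiply reaches its neighbors.
print(spmv_transpose(edges, {1: 1}))  # {2: 1, 4: 1}
```

Iterating the multiply sweeps the whole graph level by level, which is why the data flow of the graph traversal and of the sparse multiply are identical.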
Dense and Sparse Matrix Multiplication on Commercial Processors
[Figure, left: sparse/dense matrix multiply performance (FLOP/s) on one PowerPC 1.5 GHz vs. number of non-zeros per column/row. Right: sparse matrix multiply performance (operations/s) vs. processor count on COTS parallel multiprocessors (PowerPC 1.5 GHz, Intel 3.2 GHz, IBM P5-570, custom HPC2; fixed problem size), with COTS cluster model curves for Beff = 0.1 GB/s and 1 GB/s and an ideal scaling line]

• Commercial microprocessors are roughly 1000x inefficient on graph/sparse matrix operations, in part due to a poorly matched processing flow
• Communication bandwidth limitations and other inefficiencies limit the performance improvements in COTS parallel processors
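To make the "poorly matched processing flow" concrete, here is an illustrative CSR sparse matrix-vector multiply (our example, not code from the briefing). The indirect `col_idx[j]` load is data-dependent and cache-unfriendly, which is the kind of access pattern conventional microprocessors handle poorly.

```python
# Illustrative CSR (compressed sparse row) SpMV. The x[col_idx[j]]
# indirection produces the irregular memory access pattern that makes
# sparse workloads inefficient on cache-based commercial processors.
def csr_spmv(row_ptr, col_idx, vals, x):
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        for j in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += vals[j] * x[col_idx[j]]  # indirect, data-dependent load
    return y

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] stored in CSR form.
row_ptr, col_idx = [0, 2, 3, 5], [0, 2, 1, 0, 2]
vals = [1.0, 2.0, 3.0, 4.0, 5.0]
print(csr_spmv(row_ptr, col_idx, vals, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 9.0]
```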
Outline
• Introduction
• 3-D graph processor architecture
  – Architecture
  – Enabling technologies
• Simulation and performance projection
• Summary
3-D Graph Processor Enabling Technology Developments
Cache-less Memory
• Optimized for sparse matrix processing access patterns

Accelerator Based Architecture
• Dedicated VLSI computation modules
• Systolic sorting technology
• 20x-100x throughput

High Bandwidth 3-D Communication Network
• 3-D interconnect (3x)
• Randomized routing (6x)
• Parallel paths (8x)
• 144x combined bandwidth while maintaining low power

Sparse Matrix Based Instruction Set
• Reduction in programming complexity via parallelizable array data structure

Custom Low Power Circuits
• Full custom design for critical circuitry (>5x power efficiency)

Data/Algorithm Dependent Multi-Processor Mapping
• Efficient load balancing and memory usage

3-D GRAPH PROCESSOR
• 1024 nodes
• 75,000 MSOPS*
• 10 MSOPS/Watt

*Million Sparse Operations Per Second
Sparse Matrix Operations
Operation        Distributions               Comments
C = A +.* B      Works with all supported    Matrix multiply operation is the throughput driver for many
                 matrix distributions        important benchmark graph algorithms. Processor architecture
                                             highly optimized for this operation.
C = A .± B       A, B, and C must have       Dot (element-wise) operations performed within local memory.
C = A .* B       identical distributions
C = A ./ B
B = op(k, A)     Works with all supported    Operation with a matrix and a constant. Can also be used to
                 matrix distributions        redistribute a matrix and sum columns or rows.

• The +, -, *, and / operations can be replaced with any arithmetic or logical operators
  – e.g., max, min, AND, OR, XOR, ...
• Instruction set can efficiently support most graph algorithms
  – Other peripheral operations performed by the node controller
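Swapping +, *, and friends for other operators is the semiring view of graph algorithms. Below is a small sketch of C = A min.+ B (our toy code, not the processor's instruction set), which performs one relaxation step of a shortest-path computation.

```python
# Hedged sketch of a semiring sparse matrix multiply: the slide's
# C = A +.* B with (+, *) replaced by (min, +). Matrices are dicts
# mapping (row, col) -> value; "zero" entries are simply omitted.
INF = float("inf")

def semiring_matmul(A, B, n, add=min, mul=lambda a, b: a + b, zero=INF):
    """C = A add.mul B over an n x n index space."""
    C = {}
    for (i, k), a in A.items():
        for j in range(n):
            b = B.get((k, j), zero)
            if b != zero:
                C[(i, j)] = add(C.get((i, j), zero), mul(a, b))
    return C

# Edge weights of a 3-node graph; one (min, +) multiply finds the
# two-hop distance 0 -> 1 -> 2, which beats the direct edge of weight 5.
A = {(0, 1): 1.0, (1, 2): 2.0, (0, 2): 5.0}
print(semiring_matmul(A, A, 3))  # {(0, 2): 3.0}
```

Passing different `add`/`mul` callables covers the max, min, AND, OR, XOR variants the slide mentions.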
Node Processor Architecture
[Figure: node processor block diagram. A matrix reader, row/column reader, matrix writer, matrix ALU, sorter, and communication module share a memory bus and a control bus under a node controller, with connections to the global communication network, the global control bus, and memory]

• Accelerator based architecture for high throughput
High Performance Systolic k-way Merge Sorter
4-way merge sorting example:
  Input runs:   12 5 1 3 | 6 2 4 9 | 1 2 2 8 | 9 4 7 6
  Sorted runs:   1 3 5 12 | 2 4 6 9 | 1 2 2 8 | 4 6 7 9
  Merged:        1 1 2 2 2 3 4 4 5 6 6 7 8 9 9 12

[Figure: systolic merge sorter built from register and comparator stages (RS/RB) with a multiplexer network and comparator logic]

• Sorting makes up >90% of graph processing using sparse matrix algebra
  – For sorting indexes and identifying elements for accumulation
• Systolic k-way merge sorter can increase sorter throughput >5x over a conventional merge sorter
  – 20x-100x throughput over microprocessor-based sorting
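In software terms, the slide's example is a k-way merge of sorted runs. This sketch uses Python's `heapq.merge` as a stand-in for the systolic hardware (purely illustrative; it reproduces the data movement, not the circuit).

```python
# Software analogue of the slide's 4-way merge: each input run is
# already sorted, and a single k-way merge yields the final order.
import heapq

runs = [[1, 3, 5, 12], [2, 4, 6, 9], [1, 2, 2, 8], [4, 6, 7, 9]]

merged = list(heapq.merge(*runs))
print(merged)  # [1, 1, 2, 2, 2, 3, 4, 4, 5, 6, 6, 7, 8, 9, 9, 12]
```

A k-way merge retires one element per comparison round regardless of k, which is the property the systolic design exploits for throughput.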
Communication Performance Comparison Between 2-D and 3-D Architectures
[Figure: 2-D and 3-D toroidal grid topologies]

Normalized bisection bandwidth:

  P        2    10    100   1,000  10,000  100,000  1,000,000
  2-D      1    3.2   10    32     100     320      1,000
  3-D      1    4.6   22    100    460     2,200    10,000
  3-D/2-D  1    1.4   2.2   3.1    4.6     6.9      10

• Significant bisection bandwidth advantage for the 3-D architecture at large processor counts
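The table's rows appear (our reading of the rounded values; the slide does not state a formula) to follow the standard scaling laws for grid bisection: roughly P^(1/2) links across the cut in 2-D and P^(2/3) in 3-D.

```python
# Hedged sketch of the apparent scaling law behind the table.
def bisection_scaling(P, dims):
    # Cutting a d-dimensional grid of P nodes in half exposes a
    # (d-1)-dimensional face, so link count scales as P ** ((d-1)/d).
    return P ** ((dims - 1) / dims)

for P in (100, 10_000, 1_000_000):
    ratio = bisection_scaling(P, 3) / bisection_scaling(P, 2)
    print(P, round(bisection_scaling(P, 2)), round(bisection_scaling(P, 3)), round(ratio, 1))
```

At P = 100 this gives 10 for 2-D and about 21.5 for 3-D, consistent with the table's rounded 10 and 22; the 3-D/2-D ratio grows as P^(1/6), matching the last row.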
3-D Packaging
[Figure: 3-D packaging. Stacked processor boards joined by coupling connectors form the 3-D processor chassis and 3-D parallel processor; each layer comprises a processor IC, TX/RX routing layer, insulator, coupler TX/RX, heat removal layer, and cold plate]

• Electro-magnetic coupling communication in the vertical dimension
  – Enables 3-D packaging and short routing distances
• 8x8x16 node array planned for initial prototype
2-D and 3-D High Speed Low Power Communication Link Technology
[Figure: 2-D communication link (current-mode TX/RX) and 3-D communication link (inductive couplers), each with simulated signal amplitude vs. time]

• High speed 2-D and 3-D communication achievable with low power
  – >1000 Gbps per processor node, compared to a typical 1-20 Gbps for COTS
  – Only 1-2 Watts per node of communication power consumption
  – 2-D and 3-D links have the same bit rate and similar power consumption
      Power consumption is dominated by frequency multiplication and phase locking, for which 2-D and 3-D links use common circuitry
• Test chip under fabrication
Randomized-Destination 3-D Toroidal Grid Communication Network
[Figure, left: 3-D toroidal grid node router. X, Y, and Z standby FIFO pairs feed 3:1, 5:1, and 7:1 multiplexers into an arbitration engine and a 2:1 output FIFO. Right: simulation results, total throughput (packets/cycle) vs. input queue size for randomized-destination and unique-destination packet sequences]

• Randomizing the destinations of packets from source nodes dramatically increases network efficiency
  – Reduces contention
  – Algorithms developed to break up and randomize any localities in communication patterns
• Randomized destination packet sequence: 87% of full network efficiency achieved
• Unique destination (all packets from one source to a single destination): 15% of full network efficiency achieved
• 6x network efficiency achieved over typical COTS multiprocessor networks with the same link bandwidths
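The slides do not name the randomization algorithm; a standard way to realize destination randomization is Valiant-style two-phase routing, sketched below. The node count and function signature are our invention for illustration.

```python
# Sketch of Valiant-style two-phase randomized routing (an assumed
# realization of the slide's idea, not the chip's actual algorithm).
import random

def route_randomized(src, dst, n_nodes, rng):
    """Phase 1: send the packet to a random intermediate node.
    Phase 2: forward it to its true destination. The random middle
    hop breaks up locality in the traffic pattern, reducing contention."""
    mid = rng.randrange(n_nodes)
    return [src, mid, dst]

rng = random.Random(42)  # fixed seed only to make the sketch repeatable
path = route_randomized(3, 900, n_nodes=1024, rng=rng)
print(path[0], path[-1])  # 3 900
```

Because each packet's middle hop is independent, a hotspot traffic pattern is spread nearly uniformly over the machine, at the cost of roughly doubling the average path length.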
Hardware Supported Optimized Row/Column Segment Based Mapping
[Figure: 4x4 processor array with matrix row/column segments mapped across processors]

• Efficient distribution of matrix elements is necessary
  – To balance processing load and memory usage
• Analytically optimized mapping algorithm developed
  – Provides very well balanced processing loads when matrices or matrix distributions are known in advance
  – Provides relatively robust load balancing when matrix distributions deviate from expected
  – Complex mapping scheme enabled by hardware support
• Can use the 3-D Graph Processor itself to optimize the mapping in real time
  – Fast computation of an optimized mapping is possible
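The slides do not give the analytically optimized algorithm itself. As a stand-in for the underlying idea, here is a simple greedy sketch (names and heuristic are ours) that splits the rows of a sparse matrix into contiguous segments with roughly equal nonzero counts, so each processor receives a balanced share of work and storage.

```python
# Hypothetical greedy row-segment mapping; a stand-in for the slide's
# analytically optimized algorithm, shown only to illustrate the goal.
def balanced_row_segments(nnz_per_row, n_procs):
    """Split rows into n_procs contiguous half-open ranges whose
    nonzero counts are roughly equal."""
    target = sum(nnz_per_row) / n_procs
    segments, start, acc = [], 0, 0
    for i, nnz in enumerate(nnz_per_row):
        acc += nnz
        if acc >= target and len(segments) < n_procs - 1:
            segments.append((start, i + 1))  # close the current segment
            start, acc = i + 1, 0
    segments.append((start, len(nnz_per_row)))
    return segments

# Skewed matrix with a few heavy rows; the split still balances nonzeros.
print(balanced_row_segments([5, 1, 1, 5, 1, 1, 5, 1], 2))  # [(0, 4), (4, 8)]
```

Balancing on nonzero counts rather than row counts matters because graph matrices are typically skewed; a row-count split would overload whichever processor drew the high-degree vertices.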
Hardware System Architecture
Programming, Control, and I/O Hardware and Software

[Figure: system stack. User applications call a host API with middleware libraries for kernel and application optimization; a command interface and co-processor device driver connect the host to the master execution kernel on the node processor controller; node execution kernels drive specialized engines in the node arrays through a node command interface and microcode interface]
Outline
• Introduction
• 3-D graph processor architecture
• Simulation and performance projection
  – Simulation
  – Performance projection
      Computational throughput
      Power efficiency
• Summary
Simulation and Verification
[Figure: simulated system consisting of a host, node processor controller, and node arrays]

• Bit-level accurate simulation of a 1024-node system used for functional verification of sparse matrix processing
• Memory performance verified with a commercial IP simulator
• Computational module performance verified with process migration projections from existing circuitry
• 3-D and 2-D high speed communication performance verified with circuit simulation and test chip design
Performance Estimates
(Scaled Problem Size, 4 GB/processor)

[Figure, left: performance per Watt (sparse matrix multiply or graph operations/s/Watt) vs. processor count; the current estimate and the Phase 0 design goal are plotted against PowerPC 1.5 GHz, Intel 3.2 GHz, IBM P5-570, and custom HPC2 (fixed problem size) systems and COTS cluster models at Beff = 0.1 GB/s and 1 GB/s. Right: computational throughput (sparse matrix multiply or graph operations/s) vs. processor count for the same systems, with an annotated ~20x gap above the COTS cluster model]

Close to 1000x graph algorithm computational throughput and 100,000x power efficiency projected at 1024 processing nodes, compared to the best COTS processor custom designed for graph processing.
Summary
• Graph processing is critical to many commercial, DoD, and intelligence applications
• Conventional processors perform poorly on graph algorithms
  – Architecture poorly matched to the computational flow
• MIT LL has developed a novel 3-D processor architecture well suited to graph processing
  – Numerous innovations enabling efficient graph computing
      Sparse matrix based instruction set
      Cache-less accelerator based architecture
      High speed systolic sorter processor
      Randomized routing
      3-D coupler based interconnect
      High-speed low-power custom circuitry
      Efficient mapping for computational load/memory balancing
  – Orders of magnitude higher performance projected/simulated