
High-Quality, Deterministic Parallel Placement for FPGAs on Commodity Hardware
Adrian Ludwin, Vaughn Betz & Ketan Padalia

FPGA Seminar Presentation

Nov 10, 2009

Overview

Motivation
Review of simulated annealing
Approaches
Summary

Motivation

Simulated Annealing Placement

A probabilistic approach to finding a near-optimal solution.

Behavior: moves through the solution space, both greedily and randomly.

Balance between greediness and randomness is controlled by a temperature

Temperature evolves through time based on a cooling schedule

Simulated Annealing Placement

For a single move:
Compute the change in cost, ΔC.
Accept the move if ΔC < 0, or if ΔC > 0 with probability e^(−ΔC/T).

Repeat while gradually decreasing T and the window size.
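The acceptance rule and cooling loop above can be sketched as follows. This is a generic simulated-annealing skeleton, not the authors' implementation; the `propose` callback and the schedule constants are illustrative assumptions:

```python
import math
import random

def accept_move(delta_c: float, temperature: float, rng: random.Random) -> bool:
    """Metropolis criterion: always accept improving moves (dC < 0);
    accept worsening moves with probability exp(-dC/T)."""
    if delta_c < 0:
        return True
    return rng.random() < math.exp(-delta_c / temperature)

def anneal(initial_cost, propose, temperature=10.0, alpha=0.9,
           iters_per_temp=100, seed=0):
    """Outer loop sketch: repeat moves while gradually decreasing T
    (a real placer also shrinks the move-window size as T drops)."""
    rng = random.Random(seed)
    cost = initial_cost
    while temperature > 0.01:
        for _ in range(iters_per_temp):
            delta_c = propose(rng)       # cost change of a candidate move
            if accept_move(delta_c, temperature, rng):
                cost += delta_c          # commit the move
        temperature *= alpha             # geometric cooling schedule
    return cost
```

At high T the exponential makes bad moves likely to be accepted (randomness dominates); as T falls, only improving moves get through (greediness dominates).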


Constraints

Runs on commodity hardware
Good quality of results
Robust
Deterministic: needed for bug reporting and consistent regression results
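One common way to make parallel annealing deterministic (assumed here for illustration; the slides do not show the authors' mechanism) is to derive every move's random choices from a global seed and the move's logical index, so the outcome is independent of thread scheduling:

```python
import random

def move_rng(global_seed: int, move_index: int) -> random.Random:
    """Private RNG for one move, derived only from the global seed and
    the move's logical index -- never from thread identity or timing."""
    return random.Random(global_seed * 1_000_003 + move_index)

# Two interleavings of the same moves make identical per-move random
# choices, so results are reproducible for bug reports and regressions.
choices_a = {i: move_rng(42, i).random() for i in (0, 1, 2, 3)}
choices_b = {i: move_rng(42, i).random() for i in (3, 1, 0, 2)}
```

Because each move's RNG depends only on stable inputs, any thread can execute any move in any real-time order and the placement still comes out identical.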

Selected Previous Work

Closely related: move acceleration, parallel moves

Other methods: independent sets, partitioned placements, speculative execution

Algorithm #1

Algorithm #2

Objective

Determine efficacy
Analyze runtime and categorize it into: memory, synchronization, infrastructure, evaluation, proposal

Methodology

Parallel-equivalent flow: a serial flow that mimics the parallel flow.
Emulates the behavior of the multithreaded application using only one thread/core.
Useful for comparison: accounts for the infrastructure overhead.
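A minimal sketch of such a parallel-equivalent flow (the structure and names are illustrative assumptions, not the authors' code): moves are partitioned exactly as the parallel scheduler would partition them, then executed round-robin by a single thread, so any slowdown versus the plain serial flow is pure infrastructure overhead rather than a concurrency effect.

```python
from collections import deque

def parallel_equivalent_flow(moves, n_workers=2):
    """Serial flow mimicking an n-worker parallel flow: deal moves to
    per-worker queues as the parallel scheduler would, then have one
    real thread play every worker in turn."""
    queues = [deque() for _ in range(n_workers)]
    for i, move in enumerate(moves):
        queues[i % n_workers].append(move)   # same partition as the parallel run
    order = []
    while any(queues):
        for q in queues:                     # round-robin over the workers
            if q:
                order.append(q.popleft())
    return order
```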

Methodology

Attributing runtime: two types of measurements.

Bottom-up (bu): measure each component of a move.

End-to-end (e2e): measure the runtime of the entire run.
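The two measurement styles can be sketched like this (illustrative names, not the authors' instrumentation): bottom-up timers wrap each move component, an end-to-end wall clock brackets the whole run, and the gap between them is unattributed overhead.

```python
import time
from collections import defaultdict

class MoveTimer:
    """Bottom-up attribution: accumulate time per move component
    (e.g. propose / evaluate), to be compared against an end-to-end
    measurement of the entire run."""
    def __init__(self):
        self.totals = defaultdict(float)

    def timed(self, component, fn, *args):
        start = time.perf_counter()
        result = fn(*args)
        self.totals[component] += time.perf_counter() - start
        return result

timer = MoveTimer()
e2e_start = time.perf_counter()
for _ in range(1000):
    timer.timed("propose", lambda: sum(range(50)))    # stand-in workloads
    timer.timed("evaluate", lambda: sum(range(100)))
e2e = time.perf_counter() - e2e_start
bu_sum = sum(timer.totals.values())
# e2e >= bu_sum: the difference is loop and timing infrastructure overhead
```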


Test sets: 11 Stratix® II FPGA benchmark designs (IP and customer circuits, 10k to 100k logic cells).

Also tested on 40 Stratix II FPGA circuits; obtained similar results.

Results for Algorithm #1

Move attribution

Overhead analysis

Observations

Theoretical speedup: 1.7x. Measured: 1.3x (best).

Increase in evaluation runtime, due to reduced cache locality.

Proposal time is “hidden”

Analysis

Time spent on stalls is negligible.
Evaluation accounts for most of the overhead.
Little to gain by removing determinism: serial equivalency costs less than 3% of runtime.

Summary for Algorithm #1

Speedup: 1–1.3x.
Memory inefficiency is the biggest bottleneck.
Theoretically, the algorithm should scale; however, it is difficult to partition and balance the two stages.
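The difficulty of balancing the two pipeline stages can be made concrete with the standard throughput model (an assumption for illustration, not a formula from the slides): per move, the serial flow costs t1 + t2, while the pipeline's throughput is set by the slower stage.

```python
def pipeline_speedup(t_stage1: float, t_stage2: float) -> float:
    """Ideal two-stage pipeline speedup: serial cost (t1 + t2) divided
    by the slower stage's cost, which limits pipeline throughput."""
    return (t_stage1 + t_stage2) / max(t_stage1, t_stage2)
```

Perfect balance gives 2x, while a 60/40 split already drops to about 1.67x, in the neighborhood of the 1.7x theoretical ceiling reported above (whether the slides' figure comes from this model is an assumption).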

Speedups for Algorithm #2

Attribution on 2 cores

Attribution on 4 cores

Observations

Memory latency due to inter-processor communication; worsens with more cores.

Summary for Algorithm #2

Parallel moves have better scalability than pipelined moves.

The bottleneck is still memory.
Again, serial equivalency costs little.

Take Home Messages

Memory is important. Good algorithms are even more important.