Hierarchical Coarse-grained Stream Compilation for Software Defined Radio

Yuan Lin, Manjunath Kudlur, Scott Mahlke, Trevor Mudge

Advanced Computer Architecture Laboratory

University of Michigan at Ann Arbor

Software Defined Radio

Use software routines instead of ASICs for the physical-layer operations of wireless communication systems

Advantages:

Multi-mode operation

Lower costs: faster time to market, prototyping and bug fixes, chip volumes, longevity of platforms

Enables future wireless communication innovations: complexity favors software-based solutions

[Figure: a single SDR platform covering multiple wireless protocols: UWB, EDGE, 802.16a, Bluetooth, 802.11b, WCDMA, 802.11n]

Case Study: W-CDMA

Key software characteristics:

Multiple kernels connected together as a system

Streaming computation

Vector-based inter-kernel communications

Mostly static computation patterns

[Figure: system diagram of the 2 Mbps W-CDMA protocol, between the analog frontend and the upper layers. Transmitter: scrambler, spreader, interleaver, Turbo encoder, LPF-Tx. Receiver: LPF-Rx, searcher, descrambler/despreader pairs, combiner, deinterleaver, Turbo decoder. Blocks are labeled by function: filtering, modulation, channel estimation, error correction.]

[Figure: SODA system architecture: an ARM control processor, a global memory, and four PEs, each with a local memory and an execution unit]

SODA: A SDR DSP Architecture (ISCA 06)

Control-data decoupled multi-core architecture

1 ARM general-purpose control processor: scalar algorithms and protocol controls

4 data processing elements: SIMD + scalar units

Used for high-throughput DSP algorithms

SODA Execution Model

Software-managed scratchpad memories: each PE can only access its local memory

DMA operations: access global memory, inter-PE communications

Algorithms statically mapped onto PEs: RPCs from the ARM control processor
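As a rough illustration of this execution model, the C sketch below simulates it on a host machine: kernels touch only their own PE's scratchpad, and all movement to or from global memory or another PE goes through explicit DMA copies issued by the control side. Every name and size here (`dma_global_to_pe`, `scale_kernel`, `LOCAL_WORDS`, and so on) is hypothetical and chosen for illustration; none of it is taken from the SODA toolchain.

```c
#include <stddef.h>
#include <string.h>

#define NUM_PES      4    /* four data PEs, as on the slide */
#define LOCAL_WORDS  64   /* assumed scratchpad size (hypothetical) */
#define GLOBAL_WORDS 256  /* assumed global memory size (hypothetical) */

static int global_mem[GLOBAL_WORDS];
static int local_mem[NUM_PES][LOCAL_WORDS];

/* DMA: copy n words from global memory into one PE's scratchpad. */
void dma_global_to_pe(int pe, size_t g_off, size_t l_off, size_t n) {
    memcpy(&local_mem[pe][l_off], &global_mem[g_off], n * sizeof(int));
}

/* DMA: copy n words from one PE's scratchpad back to global memory. */
void dma_pe_to_global(int pe, size_t l_off, size_t g_off, size_t n) {
    memcpy(&global_mem[g_off], &local_mem[pe][l_off], n * sizeof(int));
}

/* DMA: inter-PE transfer; src and dst are assumed to be different PEs. */
void dma_pe_to_pe(int src, size_t s_off, int dst, size_t d_off, size_t n) {
    memcpy(&local_mem[dst][d_off], &local_mem[src][s_off], n * sizeof(int));
}

/* A stand-in kernel: it reads and writes only the scratchpad of its own PE. */
void scale_kernel(int pe, size_t off, size_t n, int factor) {
    for (size_t i = 0; i < n; i++) {
        local_mem[pe][off + i] *= factor;
    }
}

/* Control-processor side: a statically chosen mapping of two kernels onto
 * PE 0 and PE 1, with the DMA transfers that connect them. */
void run_example(void) {
    dma_global_to_pe(0, 0, 0, 8);   /* global[0..7]  -> PE0 */
    scale_kernel(0, 0, 8, 2);       /* kernel on PE0 */
    dma_pe_to_pe(0, 0, 1, 0, 8);    /* PE0 -> PE1 */
    scale_kernel(1, 0, 8, 3);       /* kernel on PE1 */
    dma_pe_to_global(1, 0, 8, 8);   /* PE1 -> global[8..15] */
}
```

In this sketch the control flow in `run_example` plays the role of the ARM control processor; on real hardware those calls would be remote invocations rather than local function calls.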

Compilation Challenges for SDR

Compilation support for SDR is essential:

Flexibility

Lower development cost

More complex protocols

Compilation support for SDR is challenging:

Heterogeneous multiprocessor hardware: ARM + DSPs, two-level scratchpad memories

Multiple software constraints: throughput + code & data size + real-time execution + others

2-Tier Compilation Process

[Figure: the two compilation tiers. The 2 Mbps W-CDMA protocol diagram is mapped onto the SODA system architecture (ARM, global memory, four PEs), and a single kernel is mapped onto one SODA PE. The PE diagram shows: 1. SIMD pipeline (512-bit SIMD register file, 512-bit SIMD ALU + multiplier, SIMD shuffle network (SSN)); 2. scalar pipeline (scalar RF, scalar ALU); 3. local memory (local SIMD memory, local scalar memory); 4. AGU pipeline (AGU RF, AGU ALU); 5. DMA to the system bus; plus STV and VTS (SIMD to scalar) units and predicate registers.]

Multiprocessor system compilation

DSP kernel compilation

This study is focused on system compilation

Kernel compilation is treated as a black box: existing libraries, SIMD compilers

Objective: kernel-to-PE assignments, memory allocations

Subject to: throughput constraints, memory constraints

void Turbo_decoder(int* in, int* out) {
    ...
    for (iter = 0; iter < niter; iter++) {
        descramble(L_a, L_e, alpha);
        component_decoder(L_all, g, L_a, 1);
        for (i = 0; i < FRAME_SIZE; i++) {
            L_e[i] = L_all[i] * 7 / 10;
        }
    }
    ...
}

System Compilation Outline

SPIR – function-level IR: traditional IR is not adequate; complex inter-function interactions

Backend compilation: scheduling functions instead of instructions; function-level modulo scheduling

[Figure: compilation flow. C++ with SPEX and Matlab with Simulink enter through the SPEX frontend and the Matlab frontend and are lowered to SPIR (illustrated with the rake receiver graph: LPF-Rx, searcher, four descrambler/despreader pairs, combiner). The SPIR backend emits C code for the control processor and C code for each PE.]

SPIR Overview

Dataflow programming model: graph consists of nodes and edges

Two types of nodes:

Kernel (yellow) nodes for modeling functions

Memory (blue) nodes for modeling vector buffers: buffer stream description + vector stream description

Dataflow edges: synchronous dataflow (in the scope of this paper)

[Figure: flat SPIR dataflow graph of the W-CDMA receiver: LPF-Rx, delay buffer, rake receiver (searcher, four descrambler/despreader pairs, combiner), interleaver, and Turbo decoder, with a stream rate annotated on every edge (values shown include 1, 4, 32, 320, 640, 2560, 9600 and 3200)]
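The node and edge kinds listed above can be pictured with a small C data structure. This is only a sketch of one possible encoding; the type names (`SpirNode`, `SpirEdge`, `SpirGraph`), the helper functions, and the three-node example graph with a rate of 32 are all assumptions made for illustration, not SPIR's actual definition.

```c
#include <stddef.h>

/* Two node kinds, as on the slide: kernels model functions,
 * memory nodes model vector buffers. */
typedef enum { NODE_KERNEL, NODE_MEMORY } NodeKind;

typedef struct {
    NodeKind    kind;
    const char *name;
} SpirNode;

/* A synchronous-dataflow edge: fixed token counts per firing. */
typedef struct {
    int src;      /* index of the producer node */
    int dst;      /* index of the consumer node */
    int produce;  /* tokens written per firing of src */
    int consume;  /* tokens read per firing of dst */
} SpirEdge;

typedef struct {
    const SpirNode *nodes;
    size_t          num_nodes;
    const SpirEdge *edges;
    size_t          num_edges;
} SpirGraph;

/* Count the nodes of one kind. */
size_t spir_count_kind(const SpirGraph *g, NodeKind kind) {
    size_t count = 0;
    for (size_t i = 0; i < g->num_nodes; i++) {
        if (g->nodes[i].kind == kind) count++;
    }
    return count;
}

/* Return 1 if every edge already has equal producer and consumer rates. */
int spir_is_rate_matched(const SpirGraph *g) {
    for (size_t i = 0; i < g->num_edges; i++) {
        if (g->edges[i].produce != g->edges[i].consume) return 0;
    }
    return 1;
}

/* Hypothetical example: descrambler -> buffer -> despreader at rate 32. */
static const SpirNode example_nodes[] = {
    { NODE_KERNEL, "descrambler" },
    { NODE_MEMORY, "buffer"      },
    { NODE_KERNEL, "despreader"  },
};
static const SpirEdge example_edges[] = {
    { 0, 1, 32, 32 },
    { 1, 2, 32, 32 },
};
```

Keeping the rates on the edges rather than on the nodes makes a mismatch between a producer and its consumer a purely local check, which is what `spir_is_rate_matched` does.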

SPIR Overview

[Figure: flat dataflow graph of the W-CDMA receiver before rate matching; edge rates range from 1 at LPF-Rx to 640 at the interleaver and 9600 in / 3200 out at the Turbo decoder]

Problems with flat dataflow graph representations:

All rates are matched to the highest rate

SDR kernels have very different stream rates

Turbo decoder: input rate = 9600; output rate = 3200

LPF: input rate = 1; output rate = 1

SPIR Overview

[Figure: the flat graph after rate matching: the rake receiver and interleaver edges are scaled to 9600 and the LPF-Rx edges to 38.4K, with the Turbo decoder at 9600 in / 3200 out]

Problems with flat dataflow graph representations:

All rates must match the 9600 of the Turbo decoder

Minimum LPF rate: input = 38.4K, output = 38.4K

Stream rates translate to memory buffers

Unnecessarily large memory buffers
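The 38.4K figure can be reproduced with a short calculation. The sketch below walks a linear chain backwards from the last stage and computes how many tokens each stage must see on its input for the last stage to fire once. The three-stage chain used in the usage note (LPF-Rx at 1:1, a rake stage that consumes 4 tokens per token produced, and the Turbo decoder at 9600:3200) is a simplifying assumption; the real graph has more nodes, and `Stage` and `match_chain` are hypothetical names.

```c
/* One stage of a linear chain: tokens consumed and produced per firing. */
typedef struct {
    const char *name;
    long        consume;
    long        produce;
} Stage;

/* Walk the chain from the last stage back to the first and record, for each
 * stage i, the number of tokens needed on its input so that the last stage
 * can fire exactly once. Returns 0 on success, or -1 if some demand is not a
 * whole multiple of a producer's output rate (this sketch does not handle
 * that case) or if the arguments are invalid. */
int match_chain(const Stage *s, int n, long *in_tokens) {
    if (n < 1) return -1;
    long need = s[n - 1].consume;      /* demand at the last stage's input */
    in_tokens[n - 1] = need;
    for (int i = n - 2; i >= 0; i--) {
        if (s[i].produce <= 0 || need % s[i].produce != 0) return -1;
        long firings = need / s[i].produce;  /* firings of stage i */
        need = firings * s[i].consume;       /* demand at stage i's input */
        in_tokens[i] = need;
    }
    return 0;
}
```

With the assumed chain `{ {"LPF-Rx", 1, 1}, {"rake", 4, 1}, {"Turbo", 9600, 3200} }`, the rake stage has to fire 9600 times and therefore needs 38400 input tokens, so LPF-Rx must also stream 38400 tokens. Since each edge is a buffer, the low-rate filter ends up with a 38.4K-element buffer purely because of the decoder's block size.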

SPIR Overview

Hierarchical dataflow graphs:

Different hierarchy levels have different streaming rates

Streaming vectors are modeled as hierarchical communications

Top level: buffer queue descriptions

Bottom level: vector streaming descriptions

[Figure: hierarchical dataflow graph. Top level: node1 (rates 38400 and 9600) and node2 (rates 9600 and 3200), with the Turbo decoder (rates 300 and 100) as a child. Child graph: LPF-Rx, searcher, four descrambler/despreader pairs, combiner, and interleaver, with edge rates of 2.56K and 640.]

SPIR Overview

W-CDMA is modeled with a 3-level hierarchy in SPIR

Memory nodes are inserted between nodes that have child graphs

4x decrease in memory buffer usage

[Figure: 3-level SPIR hierarchy for W-CDMA. Top level: node1 (rates 38400 and 9600) and node2 (rates 9600 and 3200) with the Turbo decoder (rates 300 and 100). Middle level: LPF-Rx (2560), rake (2560 in, 640 out), and interleaver (640). Bottom level: the rake child graph (searcher, four descrambler/despreader pairs, combiner) with edge rates including 320, 256, 128 and 32.]

Coarse-grained System Compilation

Three major tasks:

Resource allocation (processor, memory and DMA)

Kernel execution ordering

Kernel execution timing

Static or dynamic?

Static (compiler): less flexible, more efficient

Dynamic (run-time scheduler or OS): more flexible, less efficient

For SDR applications:

Resource allocation: static

Kernel execution ordering: static

Kernel execution timing: dynamic

Software Pipelining Streaming Kernels

Problem with coarse-grained compilation:

Requires kernel-level parallelism to utilize the PEs

SDR protocols do not have many data-independent kernels

Compiler optimization: coarse-grained software pipelining

Stream computation: pipeline parallelism

Modulo scheduling

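[Figure: FIR, Rake and Turbo kernels mapped to PE1, PE2 and PE3. Without pipelining, the three kernels process in[0..N] one after another. With software pipelining, the kernels overlap: in the same step Turbo works on in[i], Rake on in[i+1], and FIR on in[i+2].]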
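The overlap shown in the figure follows a simple index rule, sketched below in C. Stage 0 is the first kernel in the chain (FIR in the example), and each stage is assumed to take one time step per input block, which is an idealization; the function names are hypothetical.

```c
/* Which input block does pipeline stage `stage` work on at time step `step`?
 * Stage 0 starts block t at step t; each later stage lags by one step.
 * Returns -1 when the stage is idle (pipeline filling or draining). */
int block_at(int step, int stage, int num_blocks) {
    int block = step - stage;
    if (block < 0 || block >= num_blocks) return -1;
    return block;
}

/* Steps needed when one stage runs per PE and the stages overlap. */
int pipelined_steps(int num_stages, int num_blocks) {
    if (num_stages <= 0 || num_blocks <= 0) return 0;
    return num_blocks + num_stages - 1;
}

/* Steps needed when every block runs through all stages before the
 * next block starts (no overlap between kernels). */
int sequential_steps(int num_stages, int num_blocks) {
    if (num_stages <= 0 || num_blocks <= 0) return 0;
    return num_stages * num_blocks;
}
```

With three stages (FIR = 0, Rake = 1, Turbo = 2) and ten blocks, step 2 has FIR on block 2, Rake on block 1, and Turbo on block 0, matching the in[i+2], in[i+1], in[i] pattern; the whole run takes 12 steps instead of 30 under this idealized model.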

Coarse-grained System Compilation

Input: hierarchical graph

Step 1: Dataflow rate matching

Step 2: Stream size selection

Step 3: Modulo scheduling

Step 4: Hierarchical compilation

[Figure: overview of the four steps on the rake receiver child graph (two descrambler/despreader pairs and a combiner): dataflow rate matching, stream size selection, modulo compilation onto PE1 and PE2 with initiation interval II1, and hierarchical scheduling of the parent graph across PE1 to PE4 with initiation interval II2, including the DMA transfers between GMEM and the PEs.]

Coarse-grained System Compilation

Step 1: Dataflow rate matching

Producer and consumer pairs must have the same rates: edges are memory buffers

Well studied, with many existing algorithms: single appearance schedule

[Figure: dataflow rate matching on the rake child graph (two descrambler/despreader pairs and a combiner). Before: edge rates of 32, 4 and 1. After: edge rates of 32 and 8.]

Coarse-grained System Compilation

Step 2: Stream size selection

Pick the optimal input/output buffer size: a multiple of the base rate

Binary search algorithm: modulo schedule each candidate buffer size

[Figure: stream size selection on the rake child graph: edge rates of 32 and 8 are scaled to 128 and 32. Three loop structures are compared. Case 1: loop N times over { DMA in 1, kernel(1), DMA out 1 }. Case 2: DMA in N, kernel(N), DMA out N. Case 3: loop N/M times over { DMA in M, kernel(M), DMA out M }.]

Rate = 1, streaming N elements:

Case 1: N iterations (too much DMA overhead)

Case 2: 1 iteration (cannot software pipeline)

Case 3: N/M iterations
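A minimal sketch of the search in Step 2 is given below. The slide says each candidate buffer size is modulo scheduled; here that step is replaced by a much simpler stand-in test (do one input and one output buffer fit in the scratchpad?), which is an assumption made so the example stays self-contained. The function names and the byte-size parameters are hypothetical, and the search assumes feasibility is monotonic in the stream size.

```c
/* Stand-in feasibility test (assumption): the real flow would modulo
 * schedule the candidate. Here a candidate passes if one input buffer and
 * one output buffer of that size fit in the PE's scratchpad. */
static int candidate_fits(long stream_size, long bytes_per_elem,
                          long scratchpad_bytes) {
    return 2 * stream_size * bytes_per_elem <= scratchpad_bytes;
}

/* Binary search over multiples of base_rate, up to max_size, for the
 * largest stream size that passes the test. Returns 0 if none passes. */
long select_stream_size(long base_rate, long max_size, long bytes_per_elem,
                        long scratchpad_bytes) {
    if (base_rate <= 0) return 0;
    long lo = 1, hi = max_size / base_rate, best = 0;
    while (lo <= hi) {
        long mid = lo + (hi - lo) / 2;
        if (candidate_fits(mid * base_rate, bytes_per_elem, scratchpad_bytes)) {
            best = mid;       /* feasible: try a larger multiple */
            lo = mid + 1;
        } else {
            hi = mid - 1;     /* infeasible: try a smaller multiple */
        }
    }
    return best * base_rate;
}

/* Case 3 from the slide: streaming n elements in blocks of m takes
 * ceil(n / m) loop iterations, each with one DMA in and one DMA out. */
long loop_iterations(long n, long m) {
    if (n <= 0 || m <= 0) return 0;
    return (n + m - 1) / m;
}
```

For example, with a base rate of 32, 2-byte elements, and an assumed 512-byte budget, the search settles on a stream size of 128, so a 2560-element stream runs in 20 iterations rather than 2560 (Case 1) or 1 (Case 2).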

Coarse-grained System Compilation

Step 3: Function-level modulo scheduling

II (initiation interval) selection:

Interval between the start of successive iterations

MinII = Max(ResMII, RecMII)

ResMII: total latency of all nodes divided by # of PEs

RecMII: maximum latency of feedback paths

Constraint-based modulo scheduling: SMT-based algorithm

[Figure: modulo schedule of the rake child graph on PE1 and PE2 with initiation interval II1: descramblers and despreaders on both PEs, the combiner, and the DMA1 transfers GMEM to PE1, GMEM to PE2, PE2 to PE1, and PE1 to GMEM.]
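The MinII bound from Step 3 is straightforward to compute. The sketch below follows the definitions on the slide; rounding ResMII up to a whole cycle is an added assumption (the slide only says "divided by"), and the function names are hypothetical.

```c
/* ResMII: total latency of all nodes divided by the number of PEs.
 * Rounded up to a whole cycle (assumption). */
long res_mii(const long *latency, int num_nodes, int num_pes) {
    long total = 0;
    if (num_pes <= 0) return 0;
    for (int i = 0; i < num_nodes; i++) total += latency[i];
    return (total + num_pes - 1) / num_pes;
}

/* RecMII: maximum latency over all feedback paths (0 if there are none). */
long rec_mii(const long *feedback_latency, int num_paths) {
    long worst = 0;
    for (int i = 0; i < num_paths; i++) {
        if (feedback_latency[i] > worst) worst = feedback_latency[i];
    }
    return worst;
}

/* MinII = Max(ResMII, RecMII): the smallest II worth trying. */
long min_ii(const long *latency, int num_nodes, int num_pes,
            const long *feedback_latency, int num_paths) {
    long res = res_mii(latency, num_nodes, num_pes);
    long rec = rec_mii(feedback_latency, num_paths);
    return res > rec ? res : rec;
}
```

As a made-up example, four kernels with latencies 100, 200, 300 and 400 on four PEs give ResMII = 250; if the longest feedback path takes 260 cycles, then MinII = 260 and the recurrence, not the PE count, is the limit.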

SMT-based Modulo Scheduling

Uses the Yices Satisfiability Modulo Theories (SMT) solver

Input: a set of constraints expressed as equations

Output: a set of conditions where the constraints evaluate to true

Constraints:

Throughput constraints, i.e. total execution time must be less than or equal to II

Memory constraints, i.e. buffer size must be less than the PE’s scratchpad memory

Communication constraints, i.e. DMA transfers are added for communicating kernels on different PEs

[Equation legend: status of kernel v_i assigned to processor j (1 or 0); number of kernels]
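To make the three constraint families concrete, the sketch below checks them for a given kernel-to-PE assignment and then finds a feasible assignment by exhaustive enumeration. This is not the SMT formulation from the slides and it does not call Yices; brute force is used as a stand-in that is only practical for a handful of kernels. The types, the function names, and the choice to minimize the number of cross-PE edges are all assumptions for illustration.

```c
#define MAX_KERNELS 8
#define MAX_PES     8

typedef struct { long latency; long buffer; } Kernel;  /* cycles, bytes */
typedef struct { int src; int dst; } Edge;             /* kernel indices */

/* Throughput and memory constraints: on every PE, the summed kernel latency
 * must not exceed II and the summed buffer size must not exceed mem_limit. */
int assignment_ok(const Kernel *k, int n, const int *pe, int num_pes,
                  long ii, long mem_limit) {
    long busy[MAX_PES] = {0}, used[MAX_PES] = {0};
    for (int i = 0; i < n; i++) {
        busy[pe[i]] += k[i].latency;
        used[pe[i]] += k[i].buffer;
    }
    for (int p = 0; p < num_pes; p++) {
        if (busy[p] > ii || used[p] > mem_limit) return 0;
    }
    return 1;
}

/* Communication constraint: one DMA transfer per edge that crosses PEs. */
int count_dma(const Edge *e, int num_edges, const int *pe) {
    int dma = 0;
    for (int i = 0; i < num_edges; i++) {
        if (pe[e[i].src] != pe[e[i].dst]) dma++;
    }
    return dma;
}

/* Enumerate every assignment (num_pes^n of them) and keep a feasible one
 * with the fewest DMA transfers. Returns that DMA count and fills best_pe,
 * or returns -1 if no assignment satisfies the constraints. */
int find_assignment(const Kernel *k, int n, const Edge *e, int num_edges,
                    int num_pes, long ii, long mem_limit, int *best_pe) {
    int pe[MAX_KERNELS] = {0};
    int best = -1;
    if (n < 1 || n > MAX_KERNELS || num_pes < 1 || num_pes > MAX_PES) return -1;
    for (;;) {
        if (assignment_ok(k, n, pe, num_pes, ii, mem_limit)) {
            int dma = count_dma(e, num_edges, pe);
            if (best < 0 || dma < best) {
                best = dma;
                for (int i = 0; i < n; i++) best_pe[i] = pe[i];
            }
        }
        /* Advance pe[] like an odometer in base num_pes. */
        int i = 0;
        while (i < n) {
            if (++pe[i] < num_pes) break;
            pe[i] = 0;
            i++;
        }
        if (i == n) break;  /* wrapped around: all assignments visited */
    }
    return best;
}
```

A solver-based formulation would express the same conditions over 0/1 assignment variables and let the solver search, which scales far better than this enumeration.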

Coarse-grained System Compilation

[Figure: hierarchical scheduling. The rake child graph (descramblers, despreaders, combiner) is modulo scheduled on PE1 and PE2 with initiation interval II1. In the parent graph (LPF-Rx at rate 2560, rake at 2560 in / 640 out), the rake appears as scheduled units (2 descramblers, 2 despreaders, combiner) alongside FIR1, FIR2 and the searcher across PE1 to PE4 with initiation interval II2; DMA1 to DMA3 carry the GMEM-to-PE and PE-to-GMEM transfers.]

Step 4: Hierarchical scheduling

Bottom up scheduling

Treat each child graph as a single node

Memory nodes assigned to global memory

Conclusion

Compilation support for SDR is essential

2-tiered compilation process: system compilation and DSP compilation

System compilation is function-level scheduling:

Hierarchical dataflow IR: ~4x saving in memory buffer allocation

SMT-based modulo scheduling: linear speedup up to 8 PEs, resulting in ~23% faster schedules than greedy

Questions