
Placement Strategies for 2.5D FPGA Fabric Architectures

Chirag Ravishankar
Xilinx Inc.
3100 Logic Dr.
Longmont, Colorado
Email: [email protected]

Dinesh Gaitonde
Xilinx Inc.
2100 Logic Dr.
San Jose, California

Trevor Bauer
Xilinx Inc.
3100 Logic Dr.
Longmont, Colorado


Abstract: FPGAs take advantage of 2.5D stacking technology to manufacture large-capacity, high-performance heterogeneous devices at reasonable cost. EDA tools need to be aware of, and exploit, the physical characteristics of such devices: the reduced connection count between SLRs, the infrequency of SLL channels in the fabric, and the aspect ratios of individual SLRs. We implement a partition-driven placer to explore EDA options that take advantage of architectural features in 2.5D FPGAs. We improve the routability of designs by optimizing the placer for discrete SLL channels and reduced connection counts. We propose a cut schedule for the partitioner that orients the placement with awareness of the SLR aspect ratio to reduce track demand within each SLR.

I. Introduction

2.5D stacking [1] enables FPGAs to meet the twin demands of higher logic capacity and heterogeneity. 2.5D stacking also permits lower-latency communication between dies than competing technologies [2]. Devices with logic capacities that are impossible to build on a single die are made feasible by assembling multiple, better-yielding, smaller dies on a passive interposer [3]. Market demands for heterogeneity and specialized functionality can be met by integrating application-specific dies with FPGAs in a single package [4].

In this paper, we address the EDA challenges specific to implementing multi-die FPGA systems. Betz et al. [5] investigate the placement and routing challenges in multi-die FPGAs by enhancing the open-source academic VPR CAD tool to model and optimize for 2.5D FPGAs. We consider current customer designs and synthetic designs to outline key EDA challenges beyond those studied in [5] and propose techniques to address them. We limit our study to manufacturable 2.5D FPGAs as constrained by current technology and economic factors.

A. Terminology

We refer to a single monolithic FPGA die as a Super Logic Region or SLR. A "2.5D" or multi-SLR device is assembled on a passive silicon interposer, and connections are made through micro-bumps, or uBumps. The inter-SLR connections, called Super Long Lines (SLLs), are made on the silicon interposer.

In this paper, we express SLL capacity as a percentage of the track capacity that exists within the SLR. For example, 25% SLL capacity means that the number of SLLs crossing the SLR boundary is 25% of the number of wires that would cross a cut made at an arbitrary location within the SLR.

B. 2.5D Stacking in FPGAs

Stacking technology is especially interesting for FPGAs because of the regularity of their logic cells and interconnect, which allows identical arrays to be connected on an interposer with fine-pitch wiring. The interposer consists of metal layers that carry the wire traces connecting the individual FPGA SLRs.

We illustrate the physical limitations on 2.5D FPGAs by working through an example based on the 4-SLR 28nm test vehicle described in [6]. The number of uBumps available on each SLR limits SLL counts. Assuming a uBump pitch of 45um and an FPGA die size of 7mm × 12mm, we compute the maximum number of uBumps to be about 155 × 267 ≈ 41K uBumps per SLR. Assuming 30% of uBumps are unusable due to power and global signal considerations, and using half the uBump rows to communicate with adjacent SLRs, we can use (155 × 0.7) ÷ 2 ≈ 54 uBump rows. Assuming we meet the latency requirements and provision a sufficient number of metal layers on the interposer, we can achieve 54 × 267 ≈ 14.4K inter-SLR connections. Compared to a virtual monolithic device with identical logic capacity, this is about 25% of the vertical wires that would exist in the same region. Interfaces to the interposer (referred to as "SLL Channels") on the FPGA fabric need to appear at discrete intervals, since the uBump pitch is coarser than that of the fabric routing channels. The primary challenges in supporting placement and routing on multi-SLR devices arise from these two characteristics of SLLs: their relative infrequency and their reduced count compared to traditional interconnect resources.
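The arithmetic in this example can be reproduced with a few lines of Python. This is only a sketch of the budget calculation: the function and constant names are ours, and the row and column counts (155 and 267) are taken from the text as given rather than recomputed from the pitch.

```python
import math

# Values quoted in the text; names are illustrative only.
UBUMP_ROWS = 155        # uBump rows along the 7mm edge at a 45um pitch
UBUMPS_PER_ROW = 267    # uBumps per row along the 12mm edge
USABLE_FRACTION = 0.7   # 30% assumed lost to power and global signals
ROWS_SHARE = 0.5        # half of the usable uBump rows, as in the text


def sll_budget(rows, per_row, usable, share):
    """Return (total uBumps, usable rows, inter-SLR connections)."""
    total = rows * per_row
    usable_rows = math.floor(rows * usable * share)
    return total, usable_rows, usable_rows * per_row


total, usable_rows, slls = sll_budget(
    UBUMP_ROWS, UBUMPS_PER_ROW, USABLE_FRACTION, ROWS_SHARE
)
print(total, usable_rows, slls)
```

The three values correspond to the "about 41K", "54 rows", and "14.4K connections" figures quoted above.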

II. Evaluation Platform

A. Device Model and Designs

We generate device models for 2.5D FPGAs with feature mixes and logic counts similar to those of commercial FPGAs [7] and implement them in the Vivado® Design Suite [8], modified to handle experimental architectures. We develop a global partition-driven placer, described in Section IV, combined with a packer and a simple move-based optimizer. We use the Vivado® router to implement designs.

Implementation tools trade off the number of wires cut between SLRs against how balanced the utilization of each SLR is. If the utilization of each SLR is balanced, the probability of routing failure within each SLR is reduced, while the number of inter-SLR cuts is increased. We use synthetic designs to understand this tradeoff because they allow us to incrementally control the design size, topology, and complexity, as described in [9]. Synthetic designs offer several benefits. They allow us to:

1) Analyze the incremental impact of utilization and design complexity.

2) Create designs with expected logic capacity and complexities that may not exist in current customer designs.

3) Identify the "breaking point" (i.e., the point where designs become impossible to implement) of architectural decisions.

B. Estimated Channel Demand

To analyze routing demand, we compute the estimated channel demand (ECD) based on the placement of designs. For each net, we look at the connectivity between the various blocks and apply a stochastic model [10] to compute the probability of using horizontal or vertical tracks on a two-dimensional grid. This metric lets us assess the placement quality of a design in terms of routing congestion, independent of routing architecture.
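To give a feel for how a placement-based demand estimate can be accumulated, the sketch below uses a deliberately simple uniform bounding-box model. It is not the stochastic model of [10]: it assumes each net needs one horizontal track across the width of its bounding box and one vertical track across its height, with every row (or column) inside the box equally likely to carry it. The function name and data layout are hypothetical.

```python
from collections import defaultdict


def estimate_channel_demand(nets, positions):
    """Accumulate expected horizontal/vertical track demand per grid cell.

    nets: iterable of lists of block names; positions: block -> (x, y).
    Simplifying assumption (ours): demand is spread uniformly over the
    net's bounding box instead of being derived from route probabilities.
    """
    horiz = defaultdict(float)
    vert = defaultdict(float)
    for net in nets:
        xs = [positions[b][0] for b in net]
        ys = [positions[b][1] for b in net]
        x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
        rows = y1 - y0 + 1
        cols = x1 - x0 + 1
        for x in range(x0, x1 + 1):
            for y in range(y0, y1 + 1):
                if cols > 1:                 # net spans columns: horizontal wire
                    horiz[(x, y)] += 1.0 / rows
                if rows > 1:                 # net spans rows: vertical wire
                    vert[(x, y)] += 1.0 / cols
    return horiz, vert


positions = {"a": (0, 0), "b": (2, 1)}
horiz, vert = estimate_channel_demand([["a", "b"]], positions)
max_h_ecd = max(horiz.values())   # peak horizontal demand over the grid
```

Taking the maximum over the grid, as in the last line, gives a single "max ECD" style number per direction that can be compared across placements.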

The ECD computation is enhanced to be multi-SLR aware by identifying SLL channel locations in each SLR and splitting multi-SLR nets into subnets. We recursively partition the net at each SLR crossing and compute separate ECDs for each subnet.

III. EDA Issues for 2.5D FPGAs

A. Feasibility of Multi-SLR Devices

In this section, we attempt to understand the feasibility and effects of implementing current customer designs on multi-SLR devices. These designs utilize 20% to 95% of the various tile types available on modern FPGAs, including CLBs, RAM blocks, and DSPs. The number of nets in these designs ranges from 500K to 3.6 million. We recursively bisect each design into 4 partitions, where each partition must fit in one SLR of a 4-SLR target device, and compute the number of SLLs demanded across the SLR boundaries. The cut demand ranges from 0.5% to 10%, while physical limitations allow SLL capacity of up to 25% (see Section I-B). This is the primary result that motivates the possibility of multi-SLR devices. In our experience, current customer designs can always be partitioned such that the SLL demand is less than the number of SLL tracks that we can physically supply.

Fig. 1: Impact of Reduced SLL Counts

While the SLL demand is well below the supply, we still observe degradation in various design implementation metrics as the SLL supply is reduced. To illustrate this, we create 6 variants of a 4-SLR device, each having a different SLL count, and implement the partitioned customer designs on each variant. In Fig. 1, 100% SLL capacity refers to a device where there is no reduction of tracks across the SLR boundaries, and hence the 4-SLR device can be treated as a large monolithic device. We arbitrarily trim the interconnect resources that cross SLR boundaries to create devices with SLL capacities ranging from 75% down to 5%.

In Fig. 1, we show that all device variants with 12.5% SLL capacity or more can successfully route the designs. As expected, with 5% supply, the majority of designs fail due to oversubscription of SLLs. There is a 3-4% increase in routed wirelength and a 2% impact on critical path delay as the SLL supply is reduced to 12.5%, compared to the monolithic variant.

B. Inter-SLR Connections and SLL Channels

We now illustrate the impact of inter-SLR cuts on placement quality and routability by experimenting with synthetic designs with controlled SLL demand. We implement the benchmarks on a 2-SLR FPGA with 25% SLL capacity. We create a device model with realistic uBump pitches, resulting in relatively infrequent SLL channels on the FPGA fabric.

We generate the synthetic designs in the following manner:

1) Create 180 designs of varying utilization and routing complexity that are placeable in one SLR. These designs are neither too easy nor completely impossible to implement.

2) Create a duplicate instance of each design.

3) Connect the two design instances at the top level.

4) Constrain each design instance to a single SLR.

We control the SLL demand between SLRs by varying the top-level IO ports on each design instance in step (3). Since we constrain the instances to SLRs in step (4), we are guaranteed to have SLL demand that is equal to the number of connections between the two design instances.

Fig. 2: Impact of SLL demand on placement quality. (a) ECD with increasing SLL demand. (b) ECD heat map and SLL subscription of a single design.

In Fig. 2a, we show the average ECD increase across the benchmark suite at various SLL counts. The plot shows that as the SLL demand increases, both horizontal and vertical ECD grow, indicating an increase in routing congestion and resource usage and a degradation of routability within each SLR. This is because more tracks are consumed for routing to and from the SLL channels. In Fig. 2b, we illustrate the ECD heat map of a design with 80% resource utilization and 80% SLL demand, where we see that most of the nets demanding SLLs are concentrated in the middle. The bar chart shows that several SLL channels are oversubscribed by more than 3x the number of SLLs available in the channel. This is because the placer is unaware of the capacity and location of the SLL channels. In Section IV-A, we explore strategies to make the placer SLL channel aware to improve SLL access and routability.
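A per-channel subscription ratio of the kind plotted in the bar chart can be computed as in the sketch below. The nearest-channel assignment, the function name, and all numbers are assumptions made for illustration; they are not taken from the experiments.

```python
from collections import Counter


def sll_subscription(crossing_xs, channel_xs, slls_per_channel):
    """Return the demand/capacity ratio for each SLL channel.

    crossing_xs: x coordinate at which each inter-SLR net wants to cross.
    channel_xs: x coordinates of the discrete SLL channels.
    Assumption (ours): every net uses the channel nearest its crossing point.
    """
    demand = Counter()
    for x in crossing_xs:
        nearest = min(channel_xs, key=lambda c: abs(c - x))
        demand[nearest] += 1
    return {c: demand[c] / slls_per_channel for c in channel_xs}


# Hypothetical data: three channels with 4 SLLs each, demand bunched mid-device.
crossings = [48, 50, 50, 51, 52, 49, 50, 53, 47, 50, 51, 49, 50, 10, 90]
ratios = sll_subscription(crossings, channel_xs=[10, 50, 90], slls_per_channel=4)
oversubscribed = [c for c, r in ratios.items() if r > 1.0]
```

In this made-up example only the middle channel exceeds a ratio of 1, mirroring the situation where total SLL supply is sufficient but demand is concentrated on a few channels.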

C. SLR Aspect Ratios

To minimize development costs, commercial FPGA vendors normally design a single routing architecture for an entire family of devices. Modern FPGA families offer both monolithic and 2.5D FPGAs on the same package technology [7]. To maintain the reliability of packages, they have to be of reasonable size and aspect ratio [2]. Hence, monolithic dies are relatively square, as shown in Fig. 3a. In 2.5D stacking, as we add more SLRs to a device to increase logic capacity or heterogeneity, we increase the size of the package in a single dimension (e.g., height), while the other dimension (e.g., width) remains constant. To enable flexibility in SLR integration, we naturally migrate towards SLR aspect ratios that are biased in one dimension. This can result in SLRs with noticeably different aspect ratios than monolithic dies.

Fig. 3: Aspect Ratio in Monolithic & 2.5D FPGAs. (a) Monolithic die. (b) SLRs in 2.5D FPGAs.

In Fig. 3b, we illustrate the difference in aspect ratio between the SLRs in a 4-SLR device and a monolithic die. As the ratio w_SLR : h_SLR increases, the design placed in each SLR is forced into a horizontal orientation, resulting in an increased demand for horizontal tracks. In Section IV-B, we discuss strategies to optimize for different SLR aspect ratios by orienting the placement with awareness of the total tracks available in each dimension.

IV. Partition-Driven Placer

We implemented a partition-driven placer similar to the one in [11]. We start by bi-partitioning the entire device into bins that have some capacity for placeable instances (e.g., CLBs and DSPs). We take the design and partition it into two such that the number of placeable instances in each partition is less than the bin's capacity and the number of connections between the two partitions is minimized. At the end of the first partition, we assign each instance to one of two bins. We then continue by recursively partitioning each bin and the design placed in that bin, using the partitioner described in [12]. This technique of recursive partitioning is an effective way of reducing wirelength and thus the expected routing resource utilization. The flow for multi-SLR placement is as follows:

1) For an n-SLR device, partition the design into n partitions and constrain each partition to an SLR.

2) Invoke the partition-driven placer on each SLR.

3) Perform local optimization and legalize the placement generated by the partitioner.

Step (1) ensures that the SLL demand between partitions is minimized. At this stage, we can predict the number of SLLs that will be demanded. If this exceeds the capacity of the device, we terminate early.
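The early-termination check after step (1) amounts to counting the nets that cross each SLR boundary and comparing the counts with the available SLLs. The sketch below is one way to express that; the data layout, the function names, and the assumption that a net spanning SLRs i through j needs one SLL at every boundary in between are ours.

```python
def sll_demand(nets, slr_of):
    """Count SLLs demanded at each SLR boundary.

    nets: iterable of lists of cell names; slr_of: cell -> SLR index, with
    SLRs stacked in index order. Boundary k lies between SLR k and SLR k+1.
    """
    per_boundary = {}
    for net in nets:
        slrs = [slr_of[c] for c in net]
        for boundary in range(min(slrs), max(slrs)):
            per_boundary[boundary] = per_boundary.get(boundary, 0) + 1
    return per_boundary


def check_feasible(nets, slr_of, sll_capacity):
    """Return False (terminate early) if any boundary exceeds its capacity."""
    return all(d <= sll_capacity for d in sll_demand(nets, slr_of).values())


# Hypothetical 3-SLR assignment: two nets cross boundary 0, one crosses boundary 1.
slr_of = {"a": 0, "b": 1, "c": 2, "d": 0}
nets = [["a", "b"], ["a", "c"], ["a", "d"]]
demand = sll_demand(nets, slr_of)
```

A real flow would use per-boundary capacities derived from the device model; a single `sll_capacity` value is used here only to keep the example short.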

A. SLL Channel Awareness

The placer as described in the previous section is unaware of the location and capacity of each SLL channel. As a result, even though the total cut demanded is less than the total SLL capacity of the device, it is possible that in a small region more SLLs are demanded than are available. For example, Fig. 2b shows the high demand for SLLs near the middle of the device. Hence, it is important for the placer to be aware of the location and capacity of SLL channels to enable reasonable placements of logic with inter-SLR connections. In our flow, after the first partitioning step, we identify connections that cross the SLR boundary and modify them to go through placeable inter-SLR instances with fixed displacements corresponding to the length of SLL wires. The netlist is modified such that each inter-SLR net goes through an SLL instance, and the SLL channel region in the device is added to the available capacity where these instances can be placed. As with the capacity of other placeable instances, the partitioner now considers the SLL instance demand and ensures that no partition bin overutilizes the available SLLs. Therefore, we guarantee that no SLL channel gets oversubscribed.

It is possible to constrain the router to use the SLL instances assigned by the placer. In our evaluation platform, however, we allow the router to choose the SLL instances based on the larger routing picture. To enable this, once the placer completes, we undo our netlist modifications and restore the inter-SLR nets that were previously split.
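The netlist modification and its later reversal can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in our flow: nets are plain lists of cell names, each inter-SLR net touches exactly two SLRs, and the `SLL_` instance names and `_lo`/`_hi` subnet suffixes are hypothetical.

```python
def insert_sll_instances(nets, slr_of):
    """Split each inter-SLR net through a placeable SLL instance.

    nets: dict net name -> list of cell names; slr_of: cell -> SLR index.
    Returns (modified nets, undo record). Assumes each net touches at most
    two SLRs, as in a 2-SLR device.
    """
    modified, undo = {}, {}
    for name, cells in nets.items():
        slrs = sorted({slr_of[c] for c in cells})
        if len(slrs) < 2:
            modified[name] = list(cells)     # net stays inside one SLR
            continue
        sll = f"SLL_{name}"                  # hypothetical instance name
        lower = [c for c in cells if slr_of[c] == slrs[0]]
        upper = [c for c in cells if slr_of[c] == slrs[1]]
        modified[f"{name}_lo"] = lower + [sll]
        modified[f"{name}_hi"] = [sll] + upper
        undo[name] = list(cells)
    return modified, undo


def restore_nets(modified, undo):
    """Undo the split so the router sees the original inter-SLR nets."""
    split_names = {f"{u}{s}" for u in undo for s in ("_lo", "_hi")}
    restored = {n: c for n, c in modified.items() if n not in split_names}
    restored.update(undo)
    return restored


nets = {"n1": ["a", "b"], "n2": ["a", "c"]}
slr_of = {"a": 0, "b": 0, "c": 1}
modified, undo = insert_sll_instances(nets, slr_of)
restored = restore_nets(modified, undo)
```

The number of entries in `undo` equals the number of SLL instances the partitioner must fit into the SLL channel capacity of its bins.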

Fig. 4a shows the ECD heat map and SLL subscription of the same design shown in Fig. 2b, placed with SLL awareness. The ECD of inter-SLR nets shows that the demand is more uniformly distributed along the width of the SLR boundary. Also, in the bar chart we see that no SLL channel is oversubscribed. The infrequent occurrence of SLL channels causes the placement to be spread along the width of each SLR. Since we guarantee no contention for each SLL resource, routing congestion due to inter-SLR nets is reduced. In Fig. 4b, we plot router congestion for both the SLL unaware and SLL aware placements as the router tries to resolve oversubscription of routing resources. Firstly, it is apparent that the SLL aware placement poses much less of a challenge to the router than the SLL unaware placement. In this example, the router eventually resolves all contention for the SLL aware placement, while the SLL unaware placement fails to route. Secondly, we see that the bulk of congestion in both placements is caused by routing to and from SLL channels. We can further reduce this by decreasing the placer's view of SLL capacity, effectively overconstraining it as it legalizes the partition bins.

Fig. 4: Impact of SLL Awareness in Placer. (a) ECD heatmap and SLL subscription of a single design. (b) Routing congestion for the same design.

Fig. 5: Successfully Routed Designs and Wirelengths for SLL Aware and SLL Unaware flows

In Fig. 5, we report on the full suite of benchmarks at various SLL demands. We show the normalized number of successfully routed designs in each suite. Not surprisingly, we see a decline in routing success as we increase the SLL demand in both the SLL unaware and SLL aware flows. However, we are able to route more designs with SLL aware placements. As SLL demand grows, SLL channel awareness and legalization become more important, and the benefit of SLL aware placement grows with it. In the 96% SLL demand suite, we are able to route almost 2x the number of designs compared to the SLL unaware flow.

We also see an increase in wirelength as SLL demand increases for both flows. The degradation is higher for SLL aware placements because SLL legalization causes the placer to spread the design, resulting in an overall increase in the wirelength needed to route the average net.

B. Cut Schedule

We explore the impact of the variation in SLR aspect ratio in 2.5D FPGAs that arises for the reasons described in Section III-C. We create 7 device models with identical resource capacities and varying aspect ratios. These include devices that are "tall and narrow" (h ≫ w) on one end of the spectrum and "short and wide" (h ≪ w) on the other. We take a 75% utilized synthetic design with reasonable routing complexity and place it on each device with the partition-driven placer. In Fig. 6a, we show the horizontal and vertical max ECD for each design placement. We vary the h/w ratio from 1/8 to 8/1, and the ECD is normalized to the ECD of the placement on a device with a 1/1 aspect ratio. As the device becomes "shorter and wider", horizontal ECD increases, indicating an increase in the horizontal routing tracks required to route the design. As the device becomes "taller and narrower", horizontal demand reduces, while the vertical ECD increases.

Fig. 6: Max ECD of Design Placements

We consider aspect ratio in the partition-driven placer by modifying the orientation of cuts at each partition step. We refer to the sequence of partitioning cut orientations as the "cut schedule" of the partitioner. At each iteration, the orientation of the cut is determined by Algorithm 1.

Algorithm 1: Aspect Ratio Aware Cut Schedule
Input: partition area (P), tracks per channel (T)
Output: Partition Orientation OR Stop Partition

if P < min area then
    return Stop Partition
end if
if (P.width × T.vertical) < (P.height × T.horizontal) then
    return Horizontal Cut
else
    return Vertical Cut
end if

This algorithm chooses the cut orientation such that we minimize vertical (horizontal) track demand if the number of vertical (horizontal) tracks over the width (height) of the partition bin is less than the number of horizontal (vertical) tracks over the height (width) of the partition bin. As a result, the partitioner minimizes demand in the direction that is most resource scarce. For example, for the 1/8 AR device, the first four cuts made are vertical, which creates partition bins that are 1/1. The schedule then alternates between horizontal and vertical cuts. The algorithm is therefore able to respond naturally to device aspect ratio. It also handles architectures where we supply different numbers of vertical and horizontal tracks per routing channel.
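Algorithm 1 translates almost directly into code. The sketch below adds a small driver that follows one branch of the recursion to list the resulting cut schedule. The driver, the exact halving of bins, the interpretation of "P < min area" as width × height, and the default `min_area` value are assumptions made for illustration.

```python
def cut_orientation(width, height, t_vertical, t_horizontal, min_area):
    """Algorithm 1: return 'H', 'V', or None to stop partitioning."""
    if width * height < min_area:
        return None
    if width * t_vertical < height * t_horizontal:
        return "H"   # vertical tracks are scarcer, so cut horizontally
    return "V"       # otherwise (including ties) cut vertically


def cut_schedule(width, height, t_vertical=1, t_horizontal=1, min_area=0.1):
    """List cut orientations along one branch of the recursion.

    Assumption (ours): each cut halves the bin exactly; a vertical cut
    halves the width and a horizontal cut halves the height.
    """
    schedule = []
    while True:
        cut = cut_orientation(width, height, t_vertical, t_horizontal, min_area)
        if cut is None:
            return schedule
        schedule.append(cut)
        if cut == "V":
            width /= 2
        else:
            height /= 2


short_wide = cut_schedule(8, 1)    # h/w = 1/8
tall_narrow = cut_schedule(1, 8)   # h/w = 8/1
```

With equal track counts in both directions, the h/w = 1/8 bin yields four vertical cuts followed by an alternating horizontal/vertical sequence, because a tie at a square bin falls through to the vertical branch of the algorithm.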

Fig. 6b shows the resulting ECD for the same design implemented with the aspect ratio aware cut schedule. As the device becomes "shorter and wider", the partitioner makes more vertical cuts to manage the horizontal track demand. At 1/8 AR, we are able to reduce horizontal ECD by 71% while increasing vertical ECD by 7%. On the other end of the spectrum, we reduce the vertical ECD by 20% while increasing horizontal ECD by 7%.

V. Conclusion

2.5D stacking is a promising technology for FPGAs, and the architectural decisions made to implement such devices give rise to several challenges and opportunities in EDA. The placer must specifically consider SLL capacity, SLL channel locations on the fabric, and the SLR aspect ratios in order to improve wirelength and routability. In this paper, we implemented and enhanced a partition-driven placer for multi-SLR FPGAs and showed that we can improve results by considering the specific architectural features of such devices.

References

[1] T. G. Lenihan, L. Matthew, and E. J. Vardaman, “Developments in 2.5D: The role of silicon interposers,” in 2013 IEEE 15th EPTC, Dec 2013, pp. 53–55.

[2] X. Zhang, T. C. Chai et al., “Development of through silicon via (TSV) interposer technology for large die (21x21mm) fine-pitch Cu/low-k FCBGA package,” in Electronic Components and Techn Conf, May 2009, pp. 305–312.

[3] K. Saban, “Xilinx stacked silicon interconnect technology delivers breakthrough FPGA capacity, bandwidth, and power efficiency,” Xilinx White Paper, 2012.

[4] M. Wissolik and D. Z. et al., “Virtex UltraScale+ HBM FPGA: A revolutionary increase in memory performance,” Xilinx White Papers, 2017.

[5] E. Nasiri, J. Shaikh, A. H. Pereira, and V. Betz, “Multiple dice working as one: CAD flows and routing architectures for silicon interposer FPGAs,” IEEE Transactions on VLSI Systems, vol. 24, no. 5, pp. 1821–1834, May 2016.

[6] R. Chaware, K. Nagarajan, and S. Ramalingam, “Assembly and reliability challenges in 3D integration of 28nm FPGA die on a large high density 65nm passive interposer,” in IEEE Electronic Components & Tech Conf, May 2012, pp. 279–283.

[7] Ultrascale Architecture and Product Data Sheet, Xilinx.

[8] Vivado Design Suite User Guide, Xilinx.

[9] P. Verplaetse, J. V. Campenhout, and D. Stroobandt, “On synthetic benchmark generation methods,” in 2000 IEEE Intl Symp on CAS, vol. 4, 2000, pp. 213–216 vol.4.

[10] J. Lou, S. Thakur et al., “Estimating routing congestion using probabilistic analysis,” IEEE Trans on CAD of ICs & Systems, vol. 21, no. 1, pp. 32–41, Jan 2002.

[11] J. A. Roy, D. A. Papa, and I. L. Markov, “Capo: Congestion-driven placement for standard-cell and RTL netlists with incremental capability,” in Modern Circuit Placement: Best Practices and Results, G.-J. Nam and J. Cong, Eds., 2007, pp. 97–133.

[12] C. Ababei, S. Navaratnasothie, K. Bazargan, and G. Karypis, “Multi-objective circuit partitioning for cutsize and path-based delay minimization,” in ICCAD 2002, Nov 2002, pp. 181–185.
