Pushing 3GHz Performance of 7nm Arm Neoverse N1 CPU using … · 2019. 10. 22. · Neoverse N1 CPU...

Pushing 3GHz Performance of 7nm Arm Neoverse N1 CPU using the Cadence Digital Flow


© 2019 Cadence Design Systems, Inc. All rights reserved.

Agenda

• Arm® Neoverse™ N1 CPU - Introduction and Context

• Arm Neoverse N1 CPU Implementation Overview in SoC Tapeout

• Advanced Implementation/Signoff Techniques

• Neoverse N1 CPU Rapid Adoption Kit Overview

• Stylus Infrastructure

• Summary and Conclusions

• Q & A

© 2018 Arm Limited

1st Generation System

Scalable High-End Compute from Edge, to Cloud Hyperscale, to Supercomputing


The cloud-to-edge foundation for a world of 1 trillion connected devices

[Figure: Neoverse roadmap, 2014-2017 and 2018. Labels: Cosmos; Arm Cortex®-A72, Cortex-A55, Cortex-A75; Neoverse N1 CPU (2.8GHz, 7nm, DDR4), "Cloud to edge performance and efficiency"; Neoverse E1 CPU (SMT), "Data plane optimized".]


Neoverse N1 CPU – Transforming Hyperscale and Edge Compute

Configuration          16nm          7nm
Performance            36.4 SPECint2k6 1T @ 2.8GHz, 64C system
Freq (Vnom-Vmax)       2.5GHz – 3.1GHz
Power (core)           319 mW/GHz    195 mW/GHz
Ptot (2.6GHz, 105C)    1450 mW       650 mW
Area (core, 1MB)       3.64 mm2      1.47 mm2

64C System: CMN-600, 32xMP2, 1MB L2 @ 2.8GHz, 64MB SLC @ 2GHz, 8xDDR4-3200
Performance: bare-metal SimPts emulation measurement in 64C system
Freq: 16nm Vnom=0.8V, Vmax=1V; 7nm Vnom=0.75V, Vmax=1V
Power: AArch64 Dhrystone dynamic power, leakage at Tj=105C
Area: post-scaled in mm2
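The power and area rows can be cross-checked with simple arithmetic. The sketch below assumes that Ptot is the sum of Dhrystone dynamic power (the mW/GHz figure times frequency) and leakage; the footnote suggests this split, but the slide does not state it. All function and variable names are illustrative.

```python
# Figures copied from the table; the dynamic + leakage split is an assumption.
NODES = {
    "16nm": {"mw_per_ghz": 319.0, "ptot_mw": 1450.0, "area_mm2": 3.64},
    "7nm": {"mw_per_ghz": 195.0, "ptot_mw": 650.0, "area_mm2": 1.47},
}
FREQ_GHZ = 2.6  # frequency at which Ptot is quoted


def dynamic_mw(node: str, freq_ghz: float = FREQ_GHZ) -> float:
    """Dynamic power implied by the mW/GHz figure at the given frequency."""
    return NODES[node]["mw_per_ghz"] * freq_ghz


def implied_leakage_mw(node: str) -> float:
    """Remainder of Ptot after subtracting dynamic power (assumed leakage)."""
    return NODES[node]["ptot_mw"] - dynamic_mw(node)


def ratio(key: str) -> float:
    """16nm value divided by 7nm value for one metric."""
    return NODES["16nm"][key] / NODES["7nm"][key]


if __name__ == "__main__":
    for node in NODES:
        print(f"{node}: dynamic {dynamic_mw(node):.1f} mW, "
              f"implied leakage {implied_leakage_mw(node):.1f} mW")
    print(f"area ratio 16nm/7nm: {ratio('area_mm2'):.2f}x")
    print(f"mW/GHz ratio 16nm/7nm: {ratio('mw_per_ghz'):.2f}x")
```

Under that assumption, most of the 7nm Ptot at 2.6GHz is dynamic power, and the core area shrinks by roughly 2.5x between the two nodes.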

[Figure: Arm Neoverse N1 CPU processor block diagram. Core 1: Armv8.2-A 32b/64b CPU; NEON™ SIMD engine 2x128b; Crypto extensions; 64k I-Cache w/parity; 64k D-Cache w/ECC; Private L2 cache (512kB~1MB) w/ECC. DSU: SCU; ACP; Peripheral Port; Async Bridges; Optional Shared L3 (1, 2, 4MB) w; 256b AMBA5 CHI / 2x128b AMBA4 ACE; Arm CoreSight™ Multicore Debug and Trace. Direct Connect to CMN-600 Mesh CHI.]

Arm’s 1st generation purpose-built server class CPU

• Delivering 3X compute density advantage over x86 and +25% efficiency gain over Cortex-A72/A75

• New uArch introducing I-cache coherency, 1MB private L2, and direct connect to CMN-600 mesh

• Full Armv8.2 architectural support with SPE, 48b PA

Linux comparison, similar DRAM bandwidth per thread resourcing, turbo off, using GCC7 compiler


Neoverse N1 CPU Introduces Direct-Connect Connectivity

• Targeted at systems which benefit from runtime allocation of a shared system cache, and a large low latency L2 cache private to each core

• Removes shared L3 cluster cache rather than provisioning cache to small subsets of cores

• Reduces load-to-use DRAM latency (3.5ns for 2GHz CMN-600 system)

• Neoverse N1 CPU adds a larger 1MB private L2 cache option with 11-cycle load-to-use, over the 512kB option with 9-cycle load-to-use

• Supports CHI.C protocol which optimizes system buffer usage in the CMN-600 mesh

• MP2 cluster is deployed in 1st generation of Neoverse N1 SoC
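The cycle and nanosecond figures in these bullets can be related with a one-line conversion. The sketch below assumes the L2 load-to-use cycles are counted in a 2.8GHz core clock (the L2 frequency quoted in the 64C system configuration); the slide does not say which clock the cycle counts refer to, and the helper names are illustrative.

```python
def cycles_to_ns(cycles: float, freq_ghz: float) -> float:
    """Latency in nanoseconds for a cycle count at a clock frequency in GHz."""
    return cycles / freq_ghz


def ns_to_cycles(ns: float, freq_ghz: float) -> float:
    """Number of clock cycles spanned by a latency in nanoseconds."""
    return ns * freq_ghz


if __name__ == "__main__":
    core_ghz = 2.8  # assumed clock for the L2 cycle counts
    mesh_ghz = 2.0  # CMN-600 frequency quoted with the 3.5ns figure
    print(f"1MB L2, 11 cycles: {cycles_to_ns(11, core_ghz):.2f} ns")
    print(f"512kB L2, 9 cycles: {cycles_to_ns(9, core_ghz):.2f} ns")
    print(f"3.5 ns at {mesh_ghz} GHz: {ns_to_cycles(3.5, mesh_ghz):.0f} mesh cycles")
```

On these assumptions, the larger L2 option costs about 0.7 ns of extra load-to-use latency, while the 3.5ns DRAM latency reduction corresponds to 7 mesh cycles at 2GHz.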

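[Figure: MP Clustering vs. Direct Connect. MP Clustering: MP2 cluster with shared L3$ and SF, connected over CHI. Direct Connect: MP1 cores connected over CHI through the CMN-600 CAL to the system cache slice. CAL = Core Aggregation Layer (CMN-600 product component).]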

Copyright © 2019 Arm TechCon, All rights reserved.

Neoverse N1 CPU Implementation Overview in SoC Tapeout


Neoverse N1 CPU Implementation Closure

Area: 1.3mm x 1.24mm, 1MB L2

Performance across corners for Neoverse N1 CPU

Voltage   Process   Temp   Extraction          Target (GHz)   Achieved (GHz, PBA)
1.0V      TT        85C    typical             3.0            MET
0.9V      SSGNP     125C   Cworst_CCworst_T    2.2            MET
                           RCworst_CCworst_T   2.2            MET
0.75V     TT        85C    typical             2.3            2.27
0.675V    SSGNP     125C   Cworst_CCworst_T    1.7            1.68
                           RCworst_CCworst_T   1.7            1.69

Performance at the ssgnp-m40c-nom corners was also timed and comfortably met the 1.5GHz baseline target.
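For the corners that fell just short of target, the frequency gap can be restated as a clock-period gap in picoseconds. The sketch below is plain arithmetic on the table values; reading the period gap as the approximate remaining negative slack on the worst path is an assumption, not something the slide states.

```python
def period_ps(freq_ghz: float) -> float:
    """Clock period in picoseconds for a frequency in GHz."""
    return 1000.0 / freq_ghz


def period_gap_ps(target_ghz: float, achieved_ghz: float) -> float:
    """How much longer the achieved period is than the target period."""
    return period_ps(achieved_ghz) - period_ps(target_ghz)


# (corner label, target GHz, achieved GHz) for the corners that missed target
CORNERS = [
    ("0.75V TT 85C typical", 2.3, 2.27),
    ("0.675V SSGNP 125C Cworst_CCworst_T", 1.7, 1.68),
    ("0.675V SSGNP 125C RCworst_CCworst_T", 1.7, 1.69),
]

if __name__ == "__main__":
    for label, target, achieved in CORNERS:
        gap = period_gap_ps(target, achieved)
        print(f"{label}: {gap:.2f} ps short of the {period_ps(target):.1f} ps target period")
```

Each of the three misses corresponds to single-digit picoseconds of period.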

TSMC 7FF PDK 1.0, H360 library, ULVT-C8 with multi-vt backfill

Metal Stack: 13m_1x1xa1ya5y2yy2z, using up to M11 for signal routing

Cadence Tempus PBA timing

Dynamic Voltage Frequency Scaling

Unswitched power grid

Month        Description         Freq      Notes
T0           Initial runs        2.5GHz    Most modules were critical
T0+1         Floorplan updates   2.6GHz    Finalized floorplan based on CPG squarish floorplan
T0+2         Fences/Regions      2.75GHz   Similar set of fences and regions to CPG
T0+3         Final Build         2.8GHz    Flow tweaks; changes to optimization iterations, etc.
T0+4 (mid)   Out of Innovus      2.91GHz   20 paths to 3GHz
T0+4 (end)   Final TOC db        3.0GHz    Hand ECOs, mainly fixing SI issues that provided the uplift
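The milestone table can be summarized as per-step frequency gains. This is a minimal sketch using only the frequencies listed above; the function name and data layout are illustrative.

```python
# (milestone, frequency in GHz) taken from the table
MILESTONES = [
    ("T0", 2.5),
    ("T0+1", 2.6),
    ("T0+2", 2.75),
    ("T0+3", 2.8),
    ("T0+4 (mid)", 2.91),
    ("T0+4 (end)", 3.0),
]


def step_gains(milestones):
    """Return (label, gain in MHz, gain in percent) for each step after the first."""
    gains = []
    for (_, prev), (label, cur) in zip(milestones, milestones[1:]):
        gains.append((label, round((cur - prev) * 1000), 100.0 * (cur - prev) / prev))
    return gains


if __name__ == "__main__":
    for label, mhz, pct in step_gains(MILESTONES):
        print(f"{label}: +{mhz} MHz ({pct:.1f}%)")
    first, last = MILESTONES[0][1], MILESTONES[-1][1]
    print(f"overall: {100.0 * (last - first) / first:.0f}% over the initial runs")
```

The largest single step is the fences/regions work at T0+2, and the final hand ECOs contribute the last 90 MHz.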


Neoverse N1 CPU Implementation Flow

[Flow diagram, reconstructed as an outline]

Synthesis: RTL Synthesis to Gates (Cadence Genus™)
• Mapping/Multibit Flop Insertion
• Incremental Optimization
• Scan Insertion/Compression
• Multibit Level Shifter Insertion

Place and Route Setup (Cadence Innovus™)
• Floorplan Initialization
• Power Grid Construction
• MMMC Setup
• + Additional Constraints
  – Route Guides
  – NDR
  – CTS Constraints

Placement
• Timing-Aware Placement
• Scan Reordering
• Early Clock Flow
• PRECTS Opt with Useful Skew

Clock Tree Synthesis
• CTS with Useful Skew Enabled
• Clock Tree Routing w/via pillar insertion
• POSTCTS Setup Opt
• POSTCTS Hold Fixing

Signal Routing
• Timing-aware Signal Routing

Post-Route Optimization
• AVMF-enabled Post-route Setup Opt
• (Incremental Post-route Optimization)
• AVMF-enabled Post-route Hold Fixing
• ECO Via-Pillar insertion

Signoff RC Extraction (Cadence Quantus™)
• Multi-Corner RC Extraction

STA (Cadence Tempus)
• MCMM
• STA Signoff Settings
• Graph-Based STA
• Path-Based STA on Failing Paths

Path-Based Signoff Optimization/ECO (Cadence Innovus/Tempus™)
• Setup, Hold, DRV, Power Recovery

Inputs and intermediate data shown on the diagram: Design RTL, Libraries, LEF Files, QRC Tech Files, DEF/Floorplan, UPF, SDC, Bounds, Blockages, Netlist/ScanDEF, Signoff Constraints (SDC), SPEF, DEF, Netlist, ECO DB

Outputs
• Timing/Power Optimized Netlist
• SDF, Timing, and Physical Models
• Ready for downstream signoff analysis and tapeout closure steps (IR/EM, PV, annotated sim, functional ECO, etc.)

All 7nm Arm Cortex-A76 uArch and flow recipe collateral is applicable to Neoverse N1 CPU, but some more advanced techniques were introduced to expedite the aggressive frequency push for tape-out closure within 5 months!


Advanced Implementation and Signoff Techniques


7nm Key Challenge – Mitigating Metal and Via Resistance

• Metal and via resistance is significantly higher than 16FF, especially on the double pattern layers
  – Affects both the signal routing and power grid design
• NDRs on double pattern layers are too expensive for signal routing. Must be used for power grid only.
  – Ex: 40nm M3 blocks 5 tracks
• Critical nets must get to non-double pattern layers as quickly as possible
  – Cadence GigaOpt™ will layer promote critical nets preroute
• Moving up layers does incur a penalty because of the via resistance
  – 7nm introduces newly constructed via pillar


7nm Power Grid Optimization for High-Performance Core

The introduction of opportunistic post-route M1 stitch insertion for HP designs like Neoverse N1 CPU

Innovus™ Implementation System has support for automated stitch insertion, which was used

Merit
• Cells have more placement flexibility as M3 straps are disconnected from stitch location, so that M3 straps can be much closer together to enable the tightest grid
• IR drop is substantially improved for 3+GHz designs
• Disjoint M2 strap takes advantage of 7FF EM rules


7nm Key Challenge – Signal EM Limitation for High-Speed Core

• M0, M1, and M2 have EM issues
• M0 affects the density of the power grid
• M2 EM issues restricted the usage of the large drive strength cells that are in high demand for Neoverse N1 CPU performance uplift
• Redundant M1/2 output pins are specifically created in these cells to limit the max current in M0 when used together along with …...
• Via pillar is introduced to alleviate the noticeable signal EM in HPC by defining a stacked via array, using default shapes, from disjoint output pins to upper metal layers
• Defined in LEF format and Tcl commands to enable its use inside Innovus™
  – Can be assigned to specific lib cells, route types (e.g. clocks), or instances
  – Can be automatically inserted based on timing criticality and routability
• Whilst pre-route does understand via pillars and can optimize against them, deployment is recommended for clock routes or post-route data ECO only
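Why a stacked via array helps with both resistance and EM can be seen from a first-order parallel-resistor model. The sketch below assumes identical vias at each level, ignores current crowding and the metal between levels, and uses made-up resistance and current values; none of the numbers come from the 7FF PDK.

```python
def pillar_resistance(level_via_ohms: list[float], vias_per_level: int) -> float:
    """Series resistance of a via stack with N identical parallel vias per level."""
    if vias_per_level < 1:
        raise ValueError("vias_per_level must be at least 1")
    return sum(r / vias_per_level for r in level_via_ohms)


def current_per_via(total_ma: float, vias_per_level: int) -> float:
    """Current carried by each via when the total splits evenly."""
    if vias_per_level < 1:
        raise ValueError("vias_per_level must be at least 1")
    return total_ma / vias_per_level


if __name__ == "__main__":
    # Hypothetical single-via resistance at each level on the way up the stack.
    levels = [20.0, 20.0, 15.0, 10.0]
    for n in (1, 2, 4):
        print(f"{n} via(s) per level: {pillar_resistance(levels, n):.2f} ohm, "
              f"{current_per_via(1.0, n):.2f} mA per via for 1 mA total")
```

In this simplified model both the stack resistance and the per-via current scale as 1/N, which is the effect the via pillar exploits for high-drive cells.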


Best Practice of Via Pillar Enablement for CPU Clock

• Via pillar applied on top clock net only
  – Expand the usable buffer list to include much larger drive strengths accordingly
  – Cell padding may be necessary
  – NDR and shielding for upper layers are needed as usual

Post-route comparison (without via pillar vs. with via pillar)
• Comparable average clock latency
• Marginally increased max latency
• Substantially reduced local skew and clock cell edge rate
• Improved post-route timing metrics


Sophisticated Electromigration Controlling

Arm-proprietary EMG (EM model generator) provides an evolved methodology that controls signal electromigration more elegantly and justifies the design margin

• Neoverse N1 CPU and SoC is the first 7FF test chip with successful deployment in both implementation and signoff stages
• Fully enabled by both Innovus™ and Tempus™ solutions for DRV reporting and fixing

Arm-supplied methods to control electromigration in a design

(1) Liberty max_capacitance
  Description:
    • Keeps load within timing tables across PVT corners
    • Timing tables indexed by input transition time (the same for all cells at a given PVT) and output load (capacitance)
  Measure: Transition <= max index
  Advantages:
    • Ease of use
  Disadvantages:
    • Timing constraint, not EM constraint
    • Values set indirectly via max transition per architecture, per PVT
    • Not frequency or design specific

(2) AF*DF tables in User Guide
  Description:
    • Keeps below wire EM limit in worst cell, constrained by transition time limit defined as a fraction of the clock period
    • Users select a max transition based on their design-specific voltage domain, Vt and L
  Measure: Javg,peak,rms <= EM limit
  Advantages:
    • More accurate than #1
    • Can save area and power
  Disadvantages:
    • Pessimism can be significant for low-drive cells
    • May also turn out to be optimistic for high-drive cells at maximized frequency if the applied fraction is inadequately eased
    • Large lookup table referenced to determine appropriate max transition across corners

(3) (option label not recovered; presumably the EMG utility)
  Description:
    • Keeps load below wire EM limit in each cell, indexed by frequency
    • Removes design-level constraint on output edge rate due to EM
  Measure: Javg,peak,rms <= EM limit
  Advantages:
    • Most accurate and design-specific
    • Maximum area and power savings
  Disadvantages:
    • User must run utility for all NLM, CCS, ECSM PVTs needed for their design
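A rough illustration of why a frequency-indexed EM limit on load is less pessimistic than a single frequency-agnostic cap: to first order, the average current into a net that charges once per clock cycle is I = C * V * f, so the largest load that stays under a current limit falls as frequency rises. The sketch below is this textbook estimate only; it is not the EMG utility, and the current limit used is a made-up number.

```python
def avg_current_ma(cap_ff: float, vdd_v: float, freq_ghz: float) -> float:
    """First-order average charging current (mA) for a load in fF at f in GHz."""
    # fF * V * GHz = 1e-15 * 1e9 A = 1e-6 A, so divide by 1000 to get mA.
    return cap_ff * vdd_v * freq_ghz / 1000.0


def max_cap_ff(limit_ma: float, vdd_v: float, freq_ghz: float) -> float:
    """Largest load (fF) whose first-order average current stays within the limit."""
    return 1000.0 * limit_ma / (vdd_v * freq_ghz)


if __name__ == "__main__":
    limit_ma = 0.1  # hypothetical EM current limit for one output pin
    for f in (2.0, 3.0):
        print(f"{f} GHz at 1.0 V: max load {max_cap_ff(limit_ma, 1.0, f):.1f} fF")
```

A cap chosen for the highest frequency is needlessly tight for slower clocks, which is consistent with the area and power savings the table attributes to the frequency-indexed approach.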


Extraction Correlation: Fill-Less - AVMF - Real Colored Fill

• AVMF (Advanced Virtualized Metal Fill) is introduced to shrink the noticeable 7nm parasitic RC miscorrelation derived from color-agnostic extraction with fill
• Maturely supported in Innovus™ signoff-extraction-based post-route optimization
• Successfully deployed in the Neoverse N1 CPU implementation for 7FF SoC tapeout, effectively reducing timing ECO iterations


Neoverse N1 CPU Power Integrity Recap

[Figure: Static IR heatmap/histogram]

Power Analysis

Static IR
  Results: Clean under 4% DIV threshold
  Notes:
    • 2.17% effective drop at MP2 cluster level
    • Using 3x3 plocs to emulate SoC-level FC bump distribution

Vector-based Dynamic IR
  Results:
    • Clean under 8% worstavg threshold
    • 17% worst instantaneous drop over Neoverse N1 CPU with max_pwr64 vector on FFG corner
    • 16.45% worst instantaneous drop over Neoverse N1 CPU with Dhrystone64 vector on FFG corner
  Notes:
    • Memory internal violations are deemed bogus (worked around by less layer demotion)
    • A handful of worst instantaneous drops on clock cells are out of bound: 14% for max_pwr64 and 10.5% for dhrystone64

Power EM
  Results: Clean

Signal EM
  Results: Clean
  Notes:
    • No Avg/Peak violation with 3-year continuous operation
    • A handful of worst RMS violations under the 150% limit

• Temperature: IR@125c, EM@105c

• Process Corner: Power Maximum FFG

• Voltage: 0.825v/1.05v/1.8v (VDD/VDD_CPU/VDDA for TPIP PVT sensor)

• RC Corner: typical

• The analysis is performed at DSU MP2 cluster level and further validated in SoC-level with the actual package RLC

• 8% on-die EIV threshold for TWF-based worstavg and 10% for worst instantaneous drop on clock cells
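The percentages above can be converted to millivolts and compared against the stated thresholds. The sketch below assumes every percentage is relative to the 1.05V VDD_CPU rail, which the slide implies but does not state; the dictionary layout and names are illustrative.

```python
VDD_CPU_V = 1.05  # assumed reference rail for all percentages


def drop_mv(percent: float, vdd_v: float = VDD_CPU_V) -> float:
    """Voltage drop in millivolts for a drop given as a percent of the rail."""
    return percent / 100.0 * vdd_v * 1000.0


def within(percent: float, threshold_percent: float) -> bool:
    """True when the reported drop does not exceed its threshold."""
    return percent <= threshold_percent


# (description, reported drop %, threshold %) from the table and footnotes
CHECKS = [
    ("static IR, MP2 cluster", 2.17, 4.0),
    ("clock-cell instantaneous, max_pwr64", 14.0, 10.0),
    ("clock-cell instantaneous, dhrystone64", 10.5, 10.0),
]

if __name__ == "__main__":
    for name, pct, limit in CHECKS:
        status = "within" if within(pct, limit) else "exceeds"
        print(f"{name}: {pct}% = {drop_mv(pct):.1f} mV, {status} {limit}% "
              f"({drop_mv(limit):.1f} mV)")
```

On this assumption the 10% clock-cell threshold is 105 mV, so the two out-of-bound clock cases exceed it by about 42 mV and 5 mV respectively.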


Dynamic Rail Analysis

[Figures: Dhrystone64 heatmap/histogram; Max power 64 heatmap/histogram]


More Signoff Gimmicks

• IR drop impact on timing can cause 7nm silicon failures
• Extra blanket OCV derates account for voltage and temperature variation. They are derived from the SOCV lookup table, based on the worst-average instance-based dynamic on-die voltage threshold at the power-dominant corner, and interpolated for the STA corners in lieu of static IR

The advancement of SOCV statistical wire derating
• Similar in context: just as a voltage outside the characterized PVT bound should not be derated, it is irrational to derate outside the range of the maximum global-local shifted extraction
  – A different, single-sided BEOL derate should be applied to these extraction corners, for hold analysis only
• Double-sided wire derates are still necessary for the remaining global and global-local shift-tightened corners, to account for local on-die parasitic differences between launch/data and capture paths
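As a reminder of how double-sided derates enter a setup check, the sketch below applies a late derate to the launch clock and data path and an early derate to the capture clock. It is a generic textbook OCV formula with made-up delays and derate values; it ignores common-path pessimism removal and is not the SOCV lookup-table method described on the slide.

```python
def setup_slack_ps(period_ps: float, launch_clk_ps: float, data_ps: float,
                   capture_clk_ps: float, setup_ps: float,
                   late_derate: float = 0.0, early_derate: float = 0.0) -> float:
    """Setup slack with the launch side slowed and the capture side sped up."""
    arrival = (launch_clk_ps + data_ps) * (1.0 + late_derate)
    required = period_ps + capture_clk_ps * (1.0 - early_derate) - setup_ps
    return required - arrival


if __name__ == "__main__":
    # Hypothetical path at a 3GHz clock (333.3 ps period).
    path = dict(period_ps=333.3, launch_clk_ps=100.0, data_ps=220.0,
                capture_clk_ps=100.0, setup_ps=10.0)
    print(f"no derate: {setup_slack_ps(**path):.1f} ps")
    print(f"5% late / 5% early: "
          f"{setup_slack_ps(**path, late_derate=0.05, early_derate=0.05):.1f} ps")
```

Even a few percent of derate removes a large share of the slack at a 3GHz period, which is why applying derates only where they are physically justified matters for closure.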


Challenge / Key Takeaways

Challenge: Technology 7nm
Key takeaways:
• Developing libraries/flows on new node
• SI impact on signals and clocks even with NDR
• IR drop closure and more accurate impact analysis on timing
• Static timing and margining revisited

Challenge: Constraint verification/management
Key takeaways:
• Clock domain crossings (CDCs)
  – No effective approach was available to check and fix CDC failing paths for the Neoverse N1 CPU bridge with the MP2 DSU cluster; convoluted scripting was required
• Modal vs modeless constraints
  – Validating modeless CPU constraints carries considerable overhead due to DFT complexity


Neoverse N1 CPU Rapid Adoption Kit Overview


What is a Cadence Rapid Adoption Kit?

• Complete Cadence RTL-to-GDS digital implementation flow

– Example flow scripts

– Example floorplan

– Application note explaining how to setup the RAK

– Application notes showing how the flow works

• Customized to deliver the power and performance goals associated with specific Arm IP

[Figure: RAK contents. RTL-to-GDS flow tools: Genus™ (RTL Synthesis), Innovus™ (Implementation), Tempus™/Quantus™ (Signoff STA), Pegasus™ (LVS/DRC/DFM), Conformal® (Formal Eq), Modus (BIST/ATPG), Joules™ (RTL Power), Voltus™ (Sign-Off IR/EM). Inputs and collateral: IP RTL, IP Cons, Cell Lib, Example Floorplan, Example Flow, App Notes.]


Stylus: Basis for Neoverse N1 CPU RAK

• Enables script reuse

• Consistent behavior for core commands

• Key reports and metrics share similar formats

• Improve data exchange between products through common objects and commands

Streamline synthesis to signoff flow by providing a consistent set of commands, objects, reports, metrics, and flows across tools

[Figure: Stylus Common User Interface spanning Design Creation, Design Implementation, and Signoff. Tools shown: Stratus™ High Level Synthesis, Genus™ RTL Synthesis, Joules™ RTL Power, Innovus™ Implementation System, Conformal® LEC, ECO, LP, Modus™ Test Solution, Quantus™ Signoff Extraction, Tempus™ Signoff STA, Pegasus™ Verification System, Voltus™ Signoff Power.]


Stylus: Components

• Common User Interface: improved usability and productivity
  – Common UI
  – Common Commands Across Tools
  – Common Reports and Logs
  – Common Initialization Sequences
  – Common Database Access
  – Common GUI Across Tools
• Unified Metrics: holistic metrics from synthesis to signoff
• Flow Kits: quickly capture and deploy digital flows
  – Flow Process
  – Tool Directives
  – Design Inputs


Describing Flow Content with YAML

• Base Recipe
  – Flow content required to create a DRC design (base + wire)
  – Physical: floorplanning, PG insertion, place, routing, metal fill, via fill
  – STA
• PPA Recipe
  – Tool options via a flow step are inserted into the base recipe to improve the PPA
  – Synthesis optimizations
  – Early Clock Flow CTS techniques
  – Route optimizations
  – Skew optimizations

[Figure: recipes layered as Base Recipe @ Process Node, PPA Recipe, Extreme Effort PPA Recipe]


The Flow Kit

[Diagram labels: Stylus; flow_kit (flow language); flowtool (submit flow to run)]

    # Each flow step wraps the tool commands for one stage.
    create_flow_step -name init_floorplan {
        read_def [get_flow_config def_files]
    }
    create_flow_step -name place {
        place_opt_design
    }
    create_flow_step -name cts {
        ccopt_design
    }
    create_flow_step -name route {
        route_design
    }
    create_flow_step -name postroute {
        opt_design -postroute
    }
    # The flow lists its steps in execution order.
    create_flow -tool innovus {
        init_floorplan place cts route postroute
    }

• Flow generation from within the tool
• Organizes the flow (create_flow) and flow steps (create_flow_step) into an executable task for flowtool (a make replacement), which can manipulate the flow
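The idea of named steps assembled into an ordered flow can be modeled in a few lines. The sketch below is a hypothetical Python analogy, not the Stylus flow kit API: the class, its methods, and the step bodies are invented for illustration and only borrow the command names from the slide.

```python
from typing import Callable


class FlowKit:
    """Toy registry: named steps, plus flows that reference steps by name."""

    def __init__(self) -> None:
        self.steps: dict[str, Callable[[], None]] = {}

    def create_flow_step(self, name: str, body: Callable[[], None]) -> None:
        """Register the commands for one stage under a step name."""
        self.steps[name] = body

    def create_flow(self, step_names: list[str]) -> list[str]:
        """Validate that every referenced step exists and return the ordered flow."""
        missing = [n for n in step_names if n not in self.steps]
        if missing:
            raise KeyError(f"undefined flow steps: {missing}")
        return list(step_names)

    def run(self, flow: list[str]) -> None:
        """Execute the steps in flow order."""
        for name in flow:
            self.steps[name]()


if __name__ == "__main__":
    kit = FlowKit()
    log: list[str] = []
    for step in ("init_floorplan", "place", "cts", "route", "postroute"):
        # Default argument binds the current step name to each callable.
        kit.create_flow_step(step, lambda s=step: log.append(s))
    flow = kit.create_flow(["init_floorplan", "place", "cts", "route", "postroute"])
    kit.run(flow)
    print(" -> ".join(log))
```

Validating step names when the flow is assembled, rather than at run time, catches a flow that refers to a step under a name that was never defined.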


How to Get a Cadence Arm-Specific RAK

cadence.com RAK web page provides interface to request RAKs

https://www.cadence.com/go/arm-rak



Cadence Arm IP Rapid Adoption Kit

• Complete Cadence® RTL-to-GDS digital implementation flow

• Tuned for the power and performance objectives of different Arm IP

• Enables designers to rapidly deploy an optimized Cadence flow for Arm IP

• Requested using a simple form on cadence.com

• Simplifies implementation of Arm IP products

[Figure: RAK flow. Design Implementation: Genus™ (RTL Synthesis); Innovus™ (Implementation): Place/CTS/Optimization/Routing; Conformal® (Formal Eq). Signoff-driven Implementation: Quantus (Extraction), Tempus (STA), Voltus (Signoff IR/EM). Signoff Analysis: Quantus™ (Extraction), Tempus™ (STA), Voltus™ (Sign-off IR/EM).]

© 2019 Cadence Design Systems, Inc. All rights reserved worldwide. Cadence, the Cadence logo, and the other Cadence marks found at www.cadence.com/go/trademarks are trademarks or registered trademarks of Cadence Design Systems, Inc. All other trademarks are the property of their respective owners.

http://www.cadence.com/go/trademarks
