Five Families of ARM Processor IPtylee/Courses/91_2/SOC...THE ARCHITECTURE FOR THE DIGITAL WORLD TM...

9
Eric Schorn CPU Product Manager ARM Austin Design Center ARM1026EJ-S™ Synthesizable ARM10E™ Family Processor Core THE ARCHITECTURE FOR THE DIGITAL WORLD TM 2 THE ARCHITECTURE FOR THE DIGITAL WORLD TM Five Families of ARM Processor IP Performance Time ARM7™ - Low Power - Area Efficient - Code Density - 3 Stage Pipe - 4 product offerings ARM9™ - Dual Caches - Hard-IP Design - Performance - 5 Stage Pipe - 3 product offerings ARM9E™ - Config $/TCM - Soft-IP Design - DSP Instr. - Java™ in HW - 3 product offerings Future - See “New ARM Microarchitecture Implementing the ARMv6 Architecture” EPF presentation SecurCore™ - Secure Apps - Performance - Power, Area - Code Density - 4 product offerings ARM preserves SW & HW investment through code and process portability ARM10E™ - Multiple $/TCM - Float in HW - Dual 64bit I/O - 6 Stage Pipe - 2 product offerings - New: ARM1026EJ-S This presentation

Transcript of Five Families of ARM Processor IPtylee/Courses/91_2/SOC...THE ARCHITECTURE FOR THE DIGITAL WORLD TM...

1

Eric SchornCPU Product Manager

ARM Austin Design Center

ARM1026EJ-S™Synthesizable ARM10E™ Family Processor Core

THE ARCHITECTURE FOR THE DIGITAL WORLD TM

2THE ARCHITECTURE FOR THE DIGITAL WORLD TM

Five Families of ARM Processor IP

Perf

orm

ance

Time

ARM7™- Low Power - Area Efficient- Code Density - 3 Stage Pipe- 4 product offerings

ARM9™- Dual Caches - Hard-IP Design- Performance - 5 Stage Pipe- 3 product offerings

ARM9E™- Config $/TCM - Soft-IP Design - DSP Instr. - Java™ in HW- 3 product offerings

Future- See “New ARM

Microarchitecture Implementing the ARMv6 Architecture”EPF presentation

SecurCore™- Secure Apps - Performance- Power, Area - Code Density- 4 product offerings

ARM preserves SW & HW investment through code and process portability

ARM10E™- Multiple $/TCM - Float in HW- Dual 64bit I/O - 6 Stage Pipe- 2 product offerings- New: ARM1026EJ-S

This presentation

2

3THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S Objectives

Microarchitecture performance optimizations based on detailed benchmark analysis

EEMBC®, Dhrystone, OS Boot, key apps

Maximize area efficiency through process-specific compiled SRAM memory arrays

Configurable cache and TCM sizes on a case-by-case basis

Soft-IP design delivery, with hard-IP availableRapid process migration and design flow compatibility

Significant new functionality and flexibility enhancements to the ARM10E family of processor cores

Ideal extension of ARM9E based technology

4THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S OverviewJazelle™ technology-enhanced ARM10E family processor for platform and embedded applications

Convergence of proven ARM1020E™and ARM926EJ-S™ technologyARMv5TEJ architecture: ARM, Thumb®, DSP, and Java™ instructionsMMU and MPU supportConfigurable caches and TCMsExtensive 64-bit internal bussingDual 64-bit AHB I/O interfacesIEEE 754 FP support with VFP10™Real-time trace with ETM10RV™

266-325 MHz worst-case on 0.13μ325-400+ Dhrystone 2.1 MIPS** According to strict interpretation

of Dhrystone v2.1 rules

3

5THE ARCHITECTURE FOR THE DIGITAL WORLD TM

New Features in ARM1026EJ-S

Jazelle technologyIndustry leading Java bytecode execution in HW Compatible with ARM7EJ-S™, ARM926EJ-S, and future cores

MMU and MPU support via shared logicEnables both platform OS and RTOS solutions on the same partWindows® CE, Symbian OS™, VxWorks®, Linux, …

Direct-attach vector interrupt controller (VIC) portSupports more interrupt sources and vectorsSignificant reduction in interrupt handling latencySolution is compatible with future cores

6THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S in ContextIndustry-leading feature set

32k/32k CachesConfig Caches/TCMs16k/16k CachesConfig Caches/TCMsCPU Cache/TCM-NoNoYesVector interrupt supportASSPPortable Soft-IPPortable Hard-IPPortable Soft-IPDesign delivery1 x 322 x 32 AHB1 x 32 AHB2 x 64 AHBCPU core I/O width(s)

-Yes (VFP9)NoYes (VFP10)IEEE floating-point in HW-YesNoYesJava bytecodes in HW-Yes (ETM9)Yes (ETM9)Yes (ETM10RV)Real-time traceMMUMMUMMUMMU & MPUMemory managementDynamicNoNoStaticBranch prediction7, 8556Core pipeline depthv5TEv5TEJv4Tv5TEJARM ArchitectureIntel® XScale™ARM926EJ-SARM920T™ARM1026EJ-S

Industry-leading support infrastructure

4

7THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-SPerformance Analysis and Tuning

8THE ARCHITECTURE FOR THE DIGITAL WORLD TM

Dhrystone has a long history with well defined rulesSee http://www.arm.com for links to discussion, references, and repeatable code examples

Yet, confusion appears prevalent even todayARM1020E is 1.23 MIPS/MHz according to the rules of v2.1Auto-inlining improves the score to 1.45 MIPS/MHzFile merging improves the score further to 1.71 MIPS/MHzVersion 1.1 on an unreleased compiler improves the score furtherto 1.92 MIPS/MHz

Better benchmarks are needed for accurate comparisonsIdeally would also be more relevant to the real worldReliable, consistent, repeatable …

Squeezing MIPS from a Dhrystone

5

9THE ARCHITECTURE FOR THE DIGITAL WORLD TM

The ARM1020E processor was thestarting point for ARM1026EJ-Sdefinition and RTL development

Completed EEMBC certification ofARM1020E with Verilog RTL

Base measurements are ECL certifiedand publicly available from EEMBC

Developed a highly-accurate cyclemodel of ARM1020E in C

Verified against RTL using Dhrystone and EEMBCEnables what-if analysis for ARM1026EJ-S

ARM1020E Certified EEMBC Benchmarks

19.3*EEMBCConsumerMark

206.2EEMBCOAMark

4.14EEMBC TeleMark

120.5EEMBC AutoMark

5.42EEMBC NetMark

ARM ADS 1.2Tool Chain

SingleInstruction Issue

NoneL2 Cache Size

ARM v5TEISA

ARM1020EPart #

ARMVendor

*with the exception of ConsumerMark on cycle model due to simulation time

10THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1020E branch predictor inspectsonly single entry of prefetch queue

Predictor can miss a branch since codealignment, CPU stalls, and branchsequences can affect queue fills

ARM1026EJ-S branch predictoroperates on two queue entries

No branches missed by predictor11% improvement on NetMark3% improvement on Dhrystone 2.1

ObservationsDramatic (~40%) benefit on NetMark relative to ARM926EJ-SBranch prediction on entire Telecom suite is over 82% correctStatic predictor is now over 70% accurate on average

ARM1026EJ-S Branch PredictionInstruction

Cache

Entry AEntry BEntry CEntry D

Integer CPU Core

Branch

Predictor

Prefetch “Queue”

64b: 2 instr/cycle

Prediction

Instr+predec

New

6

11THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S Memory SubsystemDhrystone loops don’t cause linefills…but real code does

Optimized BIU to further increase BW of 2x64bit AHB I/OMost improvement seen on larger NetMark packet flows38% improvement on the first iteration of DhrystoneTremendous improvement in OS boot cycle count over ARM9

Optimized write buffer for sequential storesCollapse sequential stores into AHB burst transactionBiggest change in any single component – 68% in Consumer:CMY

ObservationsAll Net/Auto/Tele code fits in the ICache – see code size statsData cache hit rate for Net/Auto/Tele upwards of 95%

12THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S Return Stack

R14 is our link register to hold return addresses

New feature to predict MOV PC, R14 and BX R14 branch addresses – common return from subroutine

Longer ARM10E pipeline allows this logic to fit nicely

ObservationsImprovement varied wildly across algorithmsAutomotive PWM shows over 10% improvementNetworking Open Shortest Path First shows 0% improvementVarious stack depths were investigated … returns diminish quickly as complexity increases

7

13THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S MHz OptimizationMigrating from Hard-IP to Soft-IP is non-trivial

Objective is to maintain both MHz and IPCPrevious IPC enhancements helped to open up headroom

Extensive timing analysis was done to determine what features of the design were penalizing speed

Objective is to analyze total performance tradeoff, on a feature by feature basis, as measured by EEMBC, Dhrystone, and OS boots

Several places to tradeoff against the uncommon caseUnpredicted and mispredicted branches … mostly hidden by improved branch prediction and return stackData to LS address forwarding … 2-5% impactByte/half-word load forwarding … 8% impact on TeleMark only

14THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S Design SummaryARM1026EJ-S builds upon ARM1020E, by adding:

Soft-IP option with configurable caches and TCMsJazelle, VIC port, and both MMU/MPU supportImproves both performance and area efficiency7% improvement on geometric mean of EEMBC suites5% improvement on Dhrystone v2.1

ARM1026EJ-S is the idealextension of ARM9E family technology

Additional pipeline stage toincrease frequencyBranch prediction, branch folding, andreturn stack to improve CPIExtensive 64-bit internal bussing and dual64-bit AHB I/O for increased bandwidth

ARM1020EARM9E

ARM1026EJ-S

8

15THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-SSystem Overview

16THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S System Solution

ARM1026EJ-S Integer CPUARM, Thumb, DSP, and JavaConfigurable caches and TCMsDual 64-bit AHB I/O400+ Dhrystone 2.1 MIPS

ARM1026EJ-S CPU

VFP10 FP Coprocessor

ETM10RV Trace Macrocell

Vector Interrupt Controller

L2 Cache/TCM Controller

L2 SRAM Arrays

AMBA® AHB System Busses

VFP10 FP CoprocessorIEEE 754 SP/DP, vector support

ETM10RV Embedded Trace MacroReal-time, zero perf overhead

9

17THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ARM1026EJ-S System Solution

Optional L2 cache/TCM controllerMultiple AHB ports in multiple clock domains up to CPU speedConfigurable SRAM array sizes and latencies, set associativity, line lengths, write back/through modes, and line allocation policyFlexible cache and TCM combinations are supportedAllows optimization for both performance and cost

Vector Interrupt ControllerApplies prioritized interrupt and associated vector to CPUAddresses significant latency overhead in systems with a large number of frequent interrupts

18THE ARCHITECTURE FOR THE DIGITAL WORLD TM

ConclusionARM1026EJ-S is a large step along the ARM roadmap

Further extending flexibility, functionality, and compatibilityPerformance tuning that reflects real-world applications

Benchmarking remains a difficult and subjective artMore and more data enables intelligent judgmentsNot all important features and effects make it to the bottom lineFinal performance is the sum of a lot of small improvements

ARM1026EJ-S demonstrates ARM’s continued focus on the entire solution

CPU performance AND flexibility, compatibility, functionality, area, power, integration, software, tools, support …Minimizing total development and production cost