Overcoming The Challenges Of Multimedia System Design
rtcgroup.com/arm/2007/presentations/114 -...

Jem Davies, Director of Technology, ARM Media Processing Division


Confidential

Agenda
The challenges
  Memory bandwidth
  Power consumption
  Cost
  Content/applications
A multi-disciplinary approach
It’s all about software, stupid


The Challenges – No Surprises Here
Memory Bandwidth
  Video requires a lot of memory bandwidth
    HDTV resolution at 30 fps is close to 200MB/s for frame-buffers only
  3D Graphics in WVGA and beyond
    Easily consumes hundreds of MB/s – some architectures even up to 6-7GB/s for “low-end” WVGA – won’t work well in mobile!
  The performance bottleneck for user experience expectations
Power Consumption / Energy Capacity
  Mobile applications processors power budget is no more than ~ 250 mW
  This is not a PC - this is not the PC market
  Users expect better battery life – not worse! No fuel cells yet
Cost
  Customers will not pay more than $500 for very high-end mobiles
Content
  Mobile content is currently limited – new technologies require new investments
  Tools are under-developed as yet
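The frame-buffer figure quoted under Memory Bandwidth can be sanity-checked with simple arithmetic. The sketch below is only an illustration: the slide does not say which resolution or pixel format it means, so 1920x1080 with 3 or 4 bytes per pixel is an assumption, as is the idea that each pixel crosses the memory bus a fixed number of times per frame. The function name is hypothetical.

```python
def framebuffer_bandwidth_mb_s(width, height, bytes_per_pixel, fps, passes=1):
    """Return frame-buffer traffic in MB/s (1 MB = 10**6 bytes).

    passes is the number of times each pixel crosses the memory bus per
    frame (1 = written once; 2 = written by the renderer and read once by
    the display controller). All parameters are illustrative assumptions.
    """
    bytes_per_frame = width * height * bytes_per_pixel
    return bytes_per_frame * fps * passes / 1e6


if __name__ == "__main__":
    # Assumed: 1080p HDTV at 30 fps with 24-bit and 32-bit pixels.
    print(framebuffer_bandwidth_mb_s(1920, 1080, 3, 30))
    print(framebuffer_bandwidth_mb_s(1920, 1080, 4, 30))
    # Assumed: WVGA (800x480) at 30 fps, 32-bit pixels, written then scanned out.
    print(framebuffer_bandwidth_mb_s(800, 480, 4, 30, passes=2))
```

Under these assumptions a single 1080p pass comes to roughly 187 MB/s at 24 bits per pixel and roughly 249 MB/s at 32 bits per pixel, which brackets the "close to 200MB/s" figure on the slide.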


Solving the Problems
Solving the problems requires a multi-disciplinary, multi-faceted approach
  So that will be easy then, won’t it?
At the core (not just CPU)
  Dynamic vs. static power
  Local memories vs. cost to save bandwidth
  Gate count vs. performance
At the interconnect and the fabric
  Bus protocols
  System level caches to reduce memory bandwidth
  Memory controllers
  System architecture
Across the system
  Software stack
  Hardware <=> software interaction
At the content and application level
  Little is done here today – what could be done to help power?


Memory Bandwidth - 1
At the core (CPU)
  Architecture (incl. ISA)
  Micro-architecture
  Proven track record
  Cache(s)
  TCMs
We’ve learned a lot
  About power
  About IP
  About standards
  About scale
  About value
[Diagram label: Debug & trace interface]


Memory Bandwidth - 2
At the core (media accelerators): video, graphics and audio
All consume significant memory bandwidth
  Some use more than the CPU
  With different characteristics, too
At first, audio doesn’t seem too bad, then…
  … users want MP3 for 100 hours…
  … and multi-channel audio (games mixing sources, radio etc. etc.)
  … and 3-D audio, Dolby 5.1, other processing…
Video resolutions get bigger and bigger
  Some common industry designs do not scale well (PPA)
Not all graphics architectures are equal
  “Do more with less”
  Mali™ graphics hardware designed for low bandwidth and low power


High System Performance
Keeping data close
  Support up to 50 outstanding transfers
  Caching system for all data streams
  All bus transactions are bursts and do cache line fills for future operations
  On-chip buffers for intermediate results prevent unnecessary read/modify/write cycles to memory
    Colour (blending, multisampling)
    Z / depth
    Stencil
  Pre-fetching of state data
Mali™ hardware is a team player in an SoC environment
[Diagram labels: Compute units; On-chip buffers / caches; AXI outstanding transfers; Frame buffer]
On-chip buffers and caches ensure Mali hardware is active even if system latency is extremely high
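One way to see why the number of outstanding transfers matters for latency tolerance is Little's law: with a fixed number of requests in flight, sustained bandwidth cannot exceed the bytes in flight divided by the round-trip latency. The sketch below is a back-of-the-envelope model only. The 50 outstanding transfers come from the slide; the 64-byte burst size, the 500 ns latency and the function names are assumptions chosen for illustration.

```python
import math


def max_sustained_bandwidth_mb_s(outstanding, burst_bytes, latency_ns):
    """Upper bound on bandwidth (MB/s) from Little's law.

    bandwidth <= bytes in flight / latency.
    bytes per nanosecond * 1e3 = megabytes (10**6 bytes) per second.
    """
    bytes_in_flight = outstanding * burst_bytes
    return bytes_in_flight / latency_ns * 1e3


def outstanding_needed(target_mb_s, burst_bytes, latency_ns):
    """Smallest number of in-flight bursts that can sustain target_mb_s."""
    return math.ceil(target_mb_s * latency_ns / (burst_bytes * 1e3))


if __name__ == "__main__":
    # Assumed: 64-byte bursts and 500 ns memory latency under heavy load.
    print(max_sustained_bandwidth_mb_s(1, 64, 500))   # one transfer at a time
    print(max_sustained_bandwidth_mb_s(50, 64, 500))  # 50 transfers in flight
    print(outstanding_needed(200, 64, 500))
```

With these assumed numbers, a master that issues one burst at a time is capped at about 128 MB/s, while 50 bursts in flight raise the cap by a factor of 50, so the bus latency stops being the limiting factor.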


Power-efficient Graphics
Memory bandwidth is a significant use of power
  Large proportion is off-chip at 10x the power
Mali™ architecture significantly reduces memory bandwidth
  Combines the best of immediate-mode flow and tile-based rendering
Significant savings for both low and high complexity scenes
[Chart: mW per frame (axis 0.0 to 100.0) for Software only, Immediate mode, Traditional tile-based, Mali55 and Mali200; series: Advanced UI (3,000 vertices) and Gaming Hi (30,000 vertices)]
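The "off-chip at 10x the power" point can be turned into a rough relative-energy model. This is a sketch under stated assumptions, not a measurement: only the 10x ratio comes from the slide, the energy unit is arbitrary (one unit per on-chip byte), and the byte counts and function names are hypothetical.

```python
OFF_CHIP_COST_RATIO = 10.0  # from the slide: off-chip traffic costs about 10x


def relative_memory_energy(on_chip_bytes, off_chip_bytes, ratio=OFF_CHIP_COST_RATIO):
    """Energy in arbitrary units: 1 per on-chip byte, `ratio` per off-chip byte."""
    return on_chip_bytes + ratio * off_chip_bytes


def saving_from_on_chip(total_bytes, fraction_on_chip, ratio=OFF_CHIP_COST_RATIO):
    """Fraction of memory energy saved when part of the traffic stays on-chip.

    The baseline assumes every byte goes off-chip.
    """
    baseline = relative_memory_energy(0, total_bytes, ratio)
    moved = total_bytes * fraction_on_chip
    improved = relative_memory_energy(moved, total_bytes - moved, ratio)
    return 1.0 - improved / baseline


if __name__ == "__main__":
    # Hypothetical frame: 1,000,000 bytes of traffic, half kept in on-chip buffers.
    print(saving_from_on_chip(1_000_000, 0.5))
```

In this simple model, keeping half of the traffic on-chip saves 45% of the memory energy, and the saving approaches 90% as all traffic moves on-chip.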


Memory Bandwidth - 3
At the interconnect and fabric level
  Generate burst traffic
  Numbers of outstanding transactions appropriate to data
  (Multi-level) caches required – OS support
  System-level caches reduce bandwidth to off-chip memory
  Cache coherency protocols allow inter-core communication without touching external memory
System-level
  Drivers need to be written with memory usage (power) in mind
    E.g. (software)-cache internal results
  Apps/content software have to be produced through tools that create efficient code (e.g. cache-optimized loops to save memory bandwidth)
    Compilers
    High-level content-generation tools
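To illustrate what "cache-optimized loops" can mean for memory bandwidth, the sketch below counts cache-line fills for two traversal orders of the same row-major 2D array. It uses a toy fully associative LRU cache, which is an assumption made for simplicity and not a model of any particular ARM cache; the line size, cache size, array shape and function names are also illustrative.

```python
from collections import OrderedDict


def count_line_fills(addresses, line_bytes=64, num_lines=64):
    """Count line fills for a byte-address stream in a toy fully associative LRU cache."""
    cache = OrderedDict()  # keys are line numbers, ordered from least to most recent
    fills = 0
    for addr in addresses:
        line = addr // line_bytes
        if line in cache:
            cache.move_to_end(line)        # hit: mark as most recently used
        else:
            fills += 1                     # miss: one line fetched from memory
            cache[line] = True
            if len(cache) > num_lines:
                cache.popitem(last=False)  # evict the least recently used line
    return fills


def row_major(rows, cols, elem_bytes=4):
    """Addresses visited when the inner loop walks along a row (contiguous)."""
    for r in range(rows):
        for c in range(cols):
            yield (r * cols + c) * elem_bytes


def column_major(rows, cols, elem_bytes=4):
    """Addresses visited when the inner loop walks down a column (strided)."""
    for c in range(cols):
        for r in range(rows):
            yield (r * cols + c) * elem_bytes


if __name__ == "__main__":
    print(count_line_fills(row_major(256, 256)))
    print(count_line_fills(column_major(256, 256)))
```

For a 256x256 array of 4-byte elements, the contiguous loop fetches each of the 4,096 lines once, while the strided loop misses on every one of the 65,536 accesses in this model: the same work, with 16 times the memory traffic.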


Example Mali™ System Architecture
Here’s a (simplified) example of the system and data flow
We need to minimise external memory transactions
[Diagram blocks: Cortex-A9 with L1 Caches; Snoop Ctrl Unit; ACP (Accelerator Coherence Port); L2 Cache (PL310); AMBA® AXI™ BUS (PL301); Mali™GP2; Mali200 (shown twice); MaliL2; LCD controller (PL111); DRAM controller (PL341) with dual ports; Memory]


System Approach is Key to Performance
Latency tolerance
  Traffic from other IP creates challenges for real-time graphics
  Developers need system knowledge and tools
Software API stack
  Per frame autonomous rendering
  Minimise HW / SW interaction
System bus bandwidth
  Do more for less
Mali200™ GPU
  Up to 40 GFLOPS
  On-chip buffers and caches
  Burst optimised bus transactions
Memory bandwidth
  Do more for less
[Diagram blocks: AMBA® AXI™ BUS Fabric; ARM Cortex Processor with NEON™; L2 Cache Controller; System Level Cache; DMA Controller; Interrupt Controller; CoreSight™ Debug/Trace; Mali200 3D Graphics Sub-system; Video Codec Sub-system (Pre & Post Processing on Mali200); Image Processing; AudioDE™ Audio Codec Sub-system; Audio IO; DDR Memory Controller; Mobile DDR PHY; NAND Flash Interface; HDD Interface; SATA PHY; SIM Interface; Touchscreen Interface; Camera Interface; IO; Peripheral subsystem (USB, IR, UART, GPIO, Timers, SPI x2, FM and TV Receiver, I2S, I2C)]


High System Performance
Per-frame autonomous rendering
  No overhead in HW/SW interaction
  Vertices and control data pushed to memory by API drivers
  Mali™ hardware automatically manages frame rendering without S/W interference
Results
  CPU is not kept idle or trapped in interrupt handling routines – traditionally a performance killer for graphics
  Clean system architecture that eliminates HW/SW interface bottlenecks
[Diagram labels: ARM Processor; Memory (Vertex Arrays, Per Frame Config., Textures); MMU; Caches / Buffers; Per frame Control logic; “Read data structures and produce frame buffer”]


How Do We Measure Performance?
(Performance equals power)
  More efficiency = more performance = less power
As discussed, performance is affected by multiple factors
  Need accurate measurements
  Need “what-if” capability
  Need realistic systems
    E.g. not perfect memory
On graphics, performance is deeply related to content
  Need to agree content – benchmarks
  But need to avoid the Dhrystone effect
    Some of our recent optimisations didn’t affect SPMark
All IP suppliers simulate their own IP
  As we have all the IP, we simulate/model entire systems
  Challenge your IP provider!


Low Power By Design
Core design for low power
  Clock domains, clock gating
  On-chip buffers and area vs. leakage power
  Is Silicon free – is it just powered-up Silicon that you have to pay for?
Interconnect design for low power
  Efficient, low bandwidth = low power
System design
  Keeping memory bandwidth low
Content and applications
  What can programmers do to save power?


It’s All About Software, Stupid!

[Diagram labels: ARM CPU; Mali GPU; Video H/W; Audio Accel; OS/RTOS; HAL; OpenKode; Audio & Video Codecs; TrustZone® Framework; MIDI; 3D Audio; Native Environment; Native App; SVGt; Flash; Java Execution Environment; Java VM; JSR184; JSR239; JSR226; JSR135; JSR234; JSR287; JSR297; 2D/VG Midlet; 3D Midlets; Media Midlet; Content Ecosystem; Content Creation Tools]
Sand or Life?
See our demos


The Software Challenges
Standards are good but …
  Who controls compliance?
  What about “extensions”?
  Who do you want to do the integration?
Does compliance alone guarantee high performance?
  Particularly when working with other components
  How do you verify/validate at this level of complexity?
We believe you want a pre-verified, integrated solution


Power and Software Design
Some components work together better when they are designed together
  For example: avoiding data copying saves energy
The more parts of the puzzle you are in control of, the easier this is:
  Content, Java VM, M3G2, OpenGL ES 2.0 drivers...
It is possible to optimise the flow and remain standards-compliant
Vendor-specific extensions are a nightmare!
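The "avoiding data copying saves energy" point can be shown in miniature with a buffer that is handed between software layers. This is a generic Python sketch, not code from any ARM driver or Java stack: the 4-byte header, the buffer contents and the function names are hypothetical. It contrasts slicing, which copies the payload, with a memoryview, which refers to the same memory.

```python
def payload_copy(frame, header_len):
    """Return the payload as a new bytes object (every payload byte is copied)."""
    return bytes(frame[header_len:])


def payload_view(frame, header_len):
    """Return the payload as a memoryview over the same buffer (no copy)."""
    return memoryview(frame)[header_len:]


if __name__ == "__main__":
    # Hypothetical frame: a 4-byte header followed by 1024 zero bytes.
    frame = bytearray(b"HDR!" + bytes(1024))
    copied = payload_copy(frame, 4)
    view = payload_view(frame, 4)

    # Change the first payload byte in the original buffer.
    frame[4] = 0xFF

    print(copied[0])  # the copy still holds the old value
    print(view[0])    # the view sees the change because it shares the memory
```

Every copy of a buffer is extra memory traffic, so a pipeline whose stages agree on a shared buffer layout moves fewer bytes per frame than one where each stage takes its own copy.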


Cost
Meeting demand for user experience while keeping the cost of devices suitable for the mass market
  What are the cost drivers?
    IP
    Silicon
    Validation
    Lost market window
    Etc.
Obtaining more pre-verified, pre-integrated IP from one supplier will accelerate development
  Will that reduce costs/increase profits overall?


Cost vs. User Experience
Inevitably, the user experience has to be tailored to fit the market requirements:
  Software-rendered graphics
  CPU-rendered low-resolution video
  Mali55 OpenGL ES/OpenVG-accelerated user interface
  Mali200/Sif OpenGL ES 2.0 hardware
  1080p video, H.264, VC-1 ...
What you want are unified stacks
What you don’t want is to redesign everything between differing platforms


Content
Content owners are excited by the potential number of mobiles
The mobile computing revolution continues:
  What is a smartphone?
  What is a mobile computer?
  How will content differ between these types of platform?
Challenges will include portability and security
We need to make it easy to adjust content and make it more efficient on mobile platforms


Extended Tools Offering for Developers
RealView® System Generator
  Complete model of a platform
  Executes ARM binaries in real time
  Provides full debug visibility
  Enables visualisation of content
  Reduces cross platform compilation issues
  Cost effective and safe distribution model
Performance Analysis Tools
  Profiles graphics content
  Helps identify system bottlenecks
  Enables content to be tuned to the graphics cores


Summary
There are no magic bullets
Sound engineering will still have great value
Building good systems will still have great value
Solving “the problem” requires work in a number of disciplines
Life for suppliers of individual components gets harder
The world still needs great software, great hardware and great tools


Thank You