Designing Memory Systems for Tiled Architectures

Anshuman Gupta, September 18, 2009

Design Decisions for Tiled Architecture Memory Systems

Multi-core Processors are abundant

- Multi-cores increase the compute resources on the chip without increasing hardware complexity.
- This keeps power consumption within budget.

AMD Phenom (4-core)

Sun Niagara 2 (8-core)

Tile64 (64-core)

Intel Polaris (80-core)

Multi-Core Processors are underutilized

Single-threaded example code:

  b = a + 4    (0)
  c = b * 8    (1)
  d = c - 2    (2)
  e = b * b    (3)
  f = e * 3    (4)
  g = f + d    (5)

[Figure: serial execution of the code on one core vs. parallel execution of the same code across cores]

- Software gets the responsibility of utilizing the cores with parallel instruction streams.
- Applications are hard to parallelize.

Tiled Architectures increase Utilization by enabling Parallelization
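The serial-vs-parallel gap on the slide can be sketched in code. The following toy Python model (all names invented; one instruction per core per cycle, unit latencies, no communication cost) schedules the six instructions above while respecting their data dependences:

```python
# Hypothetical sketch: data dependences of the slide's six instructions.
# inst -> list of producer instructions it must wait for
deps = {0: [], 1: [0], 2: [1], 3: [0], 4: [3], 5: [2, 4]}

def schedule(deps, n_cores=2):
    """Greedy list scheduling: issue up to n_cores ready insts per cycle."""
    done = {}                      # inst -> cycle in which it executed
    cycle, remaining = 0, set(deps)
    while remaining:
        # an inst is ready once all its producers finished in an earlier cycle
        ready = [i for i in sorted(remaining)
                 if all(p in done and done[p] < cycle for p in deps[i])]
        for inst in ready[:n_cores]:
            done[inst] = cycle
            remaining.discard(inst)
        cycle += 1
    return max(done.values()) + 1  # total cycles

print(schedule(deps, n_cores=1))  # 6: serial execution
print(schedule(deps, n_cores=2))  # 4: parallel execution across two cores
```

With one core the six instructions take six cycles; with two cores the independent chains (b, c, d and e, f) overlap and the whole graph finishes in four.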

- Tiled architectures are a class of multi-core architectures.
- They provide mechanisms to facilitate automatic parallelization of single-threaded programs.
- Fast On-Chip Networks (OCNs) connect the cores.
- OCN communication latencies are on the order of 2 + (distance between tiles) cycles.*

*Latency for the RAW inter-ALU OCN.

Automatic Parallelization on Tiled Architectures

[Figure: the single-threaded code from before, scheduled on conventional multi-cores vs. a tiled architecture]

- In tiled architectures, dependent instructions can be placed on multiple cores with low penalty, due to cheap inter-ALU communication.

Why aren't tiled architectures used everywhere?

What if we add some memory instructions?

  (*b) = a + 4     (0)
  c = (*b) * 8     (1)
  (*d) = c - 2     (2)
  e = (*h) * 4     (3)
  f = e * 3        (4)
  g = f + (*i)     (5)

[Figure: the same schedules, now serialized by unresolved memory dependences]

- Automatic parallelization is still very difficult due to slow resolution of remote memory dependencies.
- Tiled architecture memory systems have a special requirement: fast memory dependence resolution.

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
- Analysis of Existing Work
- Future Work and Conclusion

Memory Dependence

  foo (int * a, int * b) {
    *a = ...
    ... = *b
  }

  Static Analysis | Actual Dependence | a address | b address
  No              | No                | 0x1000    | 0x2000
  Must            | True              | 0x1000    | 0x1000
  May             | True              | 0x1000    | 0x1000
  May             | False             | 0x1000    | 0x2000

Memory Coherence

- A coherent space provides the abstraction of a single data buffer with a single read/write port.
- Shared memory is implemented hierarchically.
- Coherence protocols are required to provide the same abstraction.
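The dependence-classification table above can be illustrated with a small sketch. The helper below is hypothetical: points-to sets stand in for whatever address information the compiler's static analysis actually derives.

```python
# Toy sketch of the static-analysis outcomes in the dependence table.
# "Must"/"May"/"No" are analysis verdicts; whether a dependence actually
# exists at runtime depends on the concrete addresses.
def classify(a_set, b_set):
    """a_set/b_set: sets of addresses each pointer may reference."""
    if not (a_set & b_set):
        return "No"    # provably disjoint: no dependence possible
    if len(a_set) == 1 == len(b_set) and a_set == b_set:
        return "Must"  # provably identical: a true dependence
    return "May"       # overlap possible: analysis must be conservative

print(classify({0x1000}, {0x2000}))          # No
print(classify({0x1000}, {0x1000}))          # Must
print(classify({0x1000, 0x2000}, {0x1000}))  # May
```

The "May" verdict is the problematic one: at runtime the addresses may turn out disjoint (a false dependence, row four of the table), yet a conservative memory system still serializes the accesses.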

[Figure: two cores with private caches above a shared memory. Core 0 writes A = 1 and signals the dependence; without coherence, Core 1's read of A may return the stale cached value A = 0 instead of A = 1.]

Improving Memory Dependence Resolution

Memory dependence resolution performance depends on:
- True dependence performance
- False dependence performance
- Coherence system performance
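The stale-read hazard in the figure can be made concrete with a toy model. The structure below is invented (write-through caches, dictionary-based storage) and only illustrates why private caches need an invalidation protocol:

```python
# Minimal sketch: two private caches over a shared memory.
shared_mem = {"A": 0}
caches = [dict(), dict()]  # one private cache per core

def read(core, addr):
    if addr not in caches[core]:
        caches[core][addr] = shared_mem[addr]  # fill from shared memory
    return caches[core][addr]

def write(core, addr, value, coherent):
    caches[core][addr] = value
    shared_mem[addr] = value          # write-through for simplicity
    if coherent:
        for c in range(len(caches)):  # invalidate all other copies
            if c != core:
                caches[c].pop(addr, None)

read(1, "A")                      # core 1 caches A = 0
write(0, "A", 1, coherent=False)
print(read(1, "A"))               # 0: core 1 still sees the stale value
write(0, "A", 1, coherent=True)
print(read(1, "A"))               # 1: invalidation forced a re-fetch
```

Without invalidation the dependence signal alone is not enough: the consumer can be told the store happened and still read stale data from its own cache.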


True Dependence Resolution

[Figure: source and destination instructions; a signal travels from the source's signal stage to the destination's stall stage, marking delays 1, 2, and 3.]

- Delay 1 is determined by the signaling stage: earlier is better.
- Delay 2 is determined by the signaling delay inside the ordering mechanism: faster is better.
- Delay 3 is determined by the stalling stage: later is better.
- Delays 1 and 3 are determined by the resolution model.

False Dependence Resolution

False dependencies occur when:
- Static analysis cannot disambiguate, or
- The memory dependence encoding is not partial.

For false dependencies, the dependent instruction should ideally not wait for any signal:
- Runtime disambiguation: the address comparison is done in hardware to declare the dependent instruction free.
- Speculation: the dependent instruction is issued speculatively, assuming the dependence is false.

Fast Data Access

- Local L1 caches can help decrease average latencies: no network delays.
- Cache Coherence (CC): the location of dynamically accessed data is not known statically, and dynamic accesses are expensive in the absence of CC.

What features to look out for?

- L1 Local
- CC
- Ordering Point
- Resolution
- Encoding
- Spec
- Runtime Disambiguation

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
  - RAW
  - WaveScalar
  - EDGE
- Analysis of Existing Work
- Future Work and Conclusion

RAW

- A highly static tiled architecture.
- Array of simple in-order MIPS cores.
- Scalar Operand Network (SON) for fast inter-ALU communication.
- Shared address space, local caches, and shared DRAMs.
- No cache coherence mechanism: software cache management through flush and invalidation.
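The three delays of true dependence resolution above compose into a simple stall-time model: the consumer only stalls for the part of the signal latency that its own pipeline progress cannot hide. A sketch with made-up cycle numbers:

```python
# Hedged sketch of the slide's delay model. All cycle values are
# invented for illustration; only the structure follows the slide:
# delay 1 = signal stage, delay 2 = ordering-mechanism latency,
# delay 3 = how late the consumer actually needs the result.
def consumer_stall(signal_stage, network_delay, stall_stage):
    signal_arrives = signal_stage + network_delay  # delays 1 + 2
    return max(0, signal_arrives - stall_stage)    # delay 3 hides the rest

print(consumer_stall(signal_stage=2, network_delay=5, stall_stage=4))  # 3
print(consumer_stall(signal_stage=1, network_delay=5, stall_stage=8))  # 0: fully hidden
```

This is why the slide wants early signaling and late stalling: moving the signal stage earlier or the stall stage later shrinks the `max(...)` term toward zero.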

*Taylor et al., IEEE Micro 2002

Artifacts of Software Cache Management

- It is difficult to keep track of the most up-to-date version of a memory address.
- All memory accesses can be categorized as:
  - Static access: the location of the cache line is known statically.
  - Dynamic access: a runtime lookup is required to determine the location of the cache line. These are really expensive (36 cycles vs. 7).

Static-Dynamic Access Ordering

- Two static accesses: synchronization over the SON.
- Dependence between a static and a dynamic access: synchronization over the SON between the static access and the static requestor or receiver of the dynamic access.
- Execute-side resolution: no speculative runahead, and false dependencies are as expensive as true dependencies.

Summary

  Arch  | L1 Local | CC | Ordering Point | Resolution | Encoding | Spec | Runtime Disambiguation
  RAWsd | Yes      | No | OCN            | Exec-side  | Partial  | No   | No

Dynamic Access Ordering

- Execute-side resolution is very expensive: resolution is done late in the memory system.
- Static ordering point: a turnstile tile, one per equivalence class.
  - Equivalence class: the set of all memory operations that can access the same memory address.
- Requests are sent on the static SON to the turnstile, which receives them in memory order.
- In-order dynamic network channels.
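The turnstile's job, merging requests from many tiles into a single memory order, can be sketched as follows. The data layout is invented, and the real RAW turnstile receives requests over in-order channels rather than reordering them with a heap; this sketch only shows the in-order release behavior:

```python
import heapq

def turnstile(requests):
    """requests: list of (memory_order_seq, tile, op) in arrival order."""
    heap, next_seq, issued = [], 0, []
    for seq, tile, op in requests:
        heapq.heappush(heap, (seq, tile, op))
        # release everything that is next in the compiler-determined order
        while heap and heap[0][0] == next_seq:
            issued.append(heapq.heappop(heap)[2])
            next_seq += 1
    return issued

print(turnstile([(1, "T1", "st B"), (0, "T0", "ld A"), (2, "T0", "ld C")]))
# ['ld A', 'st B', 'ld C']
```

Even though "st B" from tile T1 arrives first, it is held until "ld A", which precedes it in memory order, has passed through; per-class ordering points serialize only operations that could actually alias.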

Summary

  Arch  | L1 Local | CC | Ordering Point | Resolution         | Encoding | Spec | Runtime Disambiguation
  RAWsd | Yes      | No | OCN            | Exec-side          | Partial  | No   | No
  RAWdd | Yes      | No | Turnstile      | Secondary Mem-side | Partial  | No   | Yes

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
  - RAW
  - WaveScalar
  - EDGE
- Analysis of Existing Work
- Future Work and Conclusion

WaveScalar

A fully dynamic tiled architecture with memory ordering.

- Clusters arranged in a 2D array, connected by a mesh dynamic network.

- Each tile has a store buffer and a banked data cache.

- The secondary memory system is made up of L2 caches around the tiles.

- Cache coherence.

*Swanson et al., MICRO 2003

Memory Ordering

- WaveScalar preserves memory ordering by giving each memory operation in a wave a sequence number, which is unique and indicates age.
- Each memory operation also stores its predecessor's and successor's sequence numbers, using '?' if not known at compile time.
- There cannot be a memory operation whose possible predecessor has its successor marked as '?', and vice versa: MEM-NOPs are inserted to guarantee this.
- A request is allowed to go ahead if its predecessor has issued.
- In hardware this ordering is managed in the store buffers: a single store buffer is responsible for all memory requests of a dynamic wave.

[Figure: a store buffer ordering the requests Load A, Store B, Load C by sequence number.]

Removing False Load Dependencies

- Sequence-number-based ordering is highly restrictive: loads are stalled on previous loads.
- Each memory operation therefore also carries a ripple number: the sequence number of the last preceding store.
- A memory operation can issue once the operation named by its ripple number has issued.
- Loads can issue out of order; stores still have a total ordering.
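A toy model of the ripple-number rule follows. It is deliberately simplified: loads wait only on the last preceding store, and a store here conservatively waits on every earlier operation so that stores remain totally ordered, which is coarser than the real wave-ordered scheme:

```python
# seq -> (kind, ripple = seq of the last preceding store, or None)
ops = {
    0: ("st", None),
    1: ("ld", 0),
    2: ("ld", 0),
    3: ("st", 0),
    4: ("ld", 3),
}

def can_issue(seq, issued):
    kind, ripple = ops[seq]
    if kind == "st":
        # simplification: a store waits for every earlier op,
        # keeping stores totally ordered
        return all(p in issued for p in ops if p < seq)
    return ripple is None or ripple in issued

issued, order = set(), []
while len(issued) < len(ops):
    # prefer the youngest ready op, to make load reordering visible
    ready = [s for s in sorted(ops, reverse=True)
             if s not in issued and can_issue(s, issued)]
    s = ready[0]
    issued.add(s)
    order.append(s)

print(order)  # [0, 2, 1, 3, 4]: loads 1 and 2 swap, stores stay in order
```

The point of the ripple optimization shows up in the output: once store 0 has issued, loads 1 and 2 no longer serialize on each other.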

Summary

  Arch       | L1 Local | CC  | Ordering Point | Resolution         | Encoding    | Spec | Runtime Disambiguation
  RAWsd      | Yes      | No  | OCN            | Exec-side          | Partial     | No   | No
  RAWdd      | Yes      | No  | Turnstile      | Secondary Mem-side | Partial     | No   | Yes
  WaveScalar | No       | Yes | Store Buffer   | Primary Mem-side   | Store Total | No   | No

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
  - RAW
  - WaveScalar
  - EDGE
- Analysis of Existing Work
- Future Work and Conclusion

EDGE

A partially dynamic tiled architecture with block execution.

- Array of tiles connected over fast OCNs.
- The primary memory system is distributed over the tiles.
- Each such tile has an address-interleaved data cache and a load-store queue (LSQ).
- Distributed secondary memory system.
- Cache coherence.

*S. Sethumadhavan et al., ICCD 2006

Memory Ordering

- Each memory operation carries a unique 5-bit tag called an LSID, used both for detecting completion of block execution and for ordering memory operations.
- The data tiles (DTs) get a list of all LSIDs in a block during the fetch stage.
- When a memory operation reaches a DT, its LSID is sent to all the DTs.
- A request is issued only once all requests with earlier LSIDs have completed: memory-side dependence resolution.
- When all memory operations have completed, the block is committed.
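The LSID issue rule can be sketched at a single ordering point. The function below is invented for illustration; real DTs coordinate the check across distributed tiles rather than in one loop:

```python
# Sketch of LSID ordering: a request issues only when every LSID
# smaller than its own has completed, even if requests arrive
# out of order over the network.
def issue_order(block_lsids, arrival):
    """block_lsids: all LSIDs in the block; arrival: LSIDs in arrival order."""
    completed, waiting, issued = set(), [], []
    for lsid in arrival:
        waiting.append(lsid)
        progress = True
        while progress:                      # drain everything now eligible
            progress = False
            for w in sorted(waiting):
                if all(l in completed for l in block_lsids if l < w):
                    waiting.remove(w)
                    completed.add(w)
                    issued.append(w)
                    progress = True
                    break
    return issued

print(issue_order({0, 1, 2, 3}, arrival=[2, 0, 3, 1]))  # [0, 1, 2, 3]
```

Requests arrive in the order 2, 0, 3, 1 but issue strictly as 0, 1, 2, 3: the memory side enforces a total order regardless of network timing, which is exactly what makes this encoding restrictive.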

[Figure: a block with operations Ld A, Ld B, St C, Ld C flowing through the control tile, execution tiles, and interleaved data tiles, with LSID lists accumulating at each step.]

Dependence Speculation

- EDGE memory ordering is very restrictive: a total memory order.
- Loads execute speculatively; an earlier store to the same address causes a squash.
- A predictor is used to reduce squashes.

Summary

  Arch       | L1 Local | CC  | Ordering Point | Resolution         | Encoding    | Spec | Runtime Disambiguation
  RAWsd      | Yes      | No  | OCN            | Exec-side          | Partial     | No   | No
  RAWdd      | Yes      | No  | Turnstile      | Secondary Mem-side | Partial     | No   | Yes
  WaveScalar | No       | Yes | Store Buffer   | Primary Mem-side   | Store Total | No   | No
  EDGE       | No       | Yes | LSQ            | Primary Mem-side   | Total       | Yes  | No

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
- Analysis of Existing Work
- Future Work and Conclusion

True Dependence Optimization

(Summary table as above.)

Memory-Side Resolution allows more Overlap

[Figure: request/response timelines for requestors A and B and the home node under RAWsd (execute-side), EDGE/WaveScalar (tag buffer), and RAWdd (turnstile); memory-side resolution overlaps more of the coherence delays. Bar lengths do not indicate delays.]

Network Stalls should be avoided

- Execute-side resolution (e): RAWsd.
- Memory-side resolution (m): EDGE, WaveScalar.
- RAW dynamic ordering (mt): the network delay to the memory system is overlapped.

[Figure: pipeline and network stage breakdown for the e, m, and mt resolution models, showing which stages each model overlaps.]

False Dependence Optimization

(Summary table as above.)

- Partial ordering reduces false dependencies.
- Speculation on false dependencies reduces stalls.
- Disambiguation should be done early.

Outline

- Motivation
- Preserving Memory Ordering
- Memory Ordering in Existing Work
- Analysis of Existing Work
- Future Work and Conclusion

What's a Good Tiled Architecture Memory System?

- Local caches for fast L1 hits.
- Cache coherence support for ease of programmability and no dynamic-access delays.
- Fast true dependence resolution:
  - Performance comparable to placing the dependent operations on the same core.
  - Late stalls, early signaling.
- Reduction of false dependencies through partial memory operation ordering.
- Fast false dependence resolution:
  - Performance comparable to placing the dependent operations on the same core.
  - Early runtime memory disambiguation; speculative memory requests.

Conclusion

- Auto-parallelization on tiled architectures can benefit from fast memory dependence resolution; multi-core memory systems were not designed with this goal.
- The performance of both true and false dependence resolution should be comparable to that of dependent memory instructions placed on the same core.
- The ISA should support partial memory operation ordering to avoid artificial false dependencies.
- The memory system should have local caches and cache coherence, for performance and programmability.

Thank You! Questions?

Backup: Dynamic Accesses are expensive

- Tile X looks up a global address list and sends a dynamic request to the owner, tile Y.
- Y is interrupted, the data is fetched, and a dynamic request is sent to Z.
- Z is interrupted, and the data is stored in its local cache.
- One table lookup, two interrupt handlers, and two dynamic requests make dynamic loads expensive.

[Figure: occupancy timeline; raised portions represent processor occupancy, while flat portions represent network latency.]
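The cost structure of this backup slide can be written as a back-of-the-envelope model. The per-event cycle counts below are invented, chosen only so the total matches the 36-cycle dynamic-access figure quoted earlier; the structure (one lookup, two interrupts, two messages) is from the slide:

```python
# Hypothetical cost model for a RAW dynamic access: the slide's
# event sequence, with made-up per-event cycle counts.
def dynamic_access_cost(lookup, interrupt, message):
    # 1 table lookup + 2 interrupt handlers + 2 dynamic requests
    return lookup + 2 * interrupt + 2 * message

print(dynamic_access_cost(lookup=4, interrupt=10, message=6))  # 36
```

Whatever the exact per-event numbers, the two interrupt handlers dominate: interrupting a tile's pipeline twice is what makes a dynamic load cost several times a static one.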